Versions in use: tensorflow-gpu 2.0, CUDA v10, cuDNN v7.6.5, Python 3.7.4
System specs: i9-7920X, 4x RTX 2080 Ti, 128 GB 2400 MHz RAM, 2 TB SATA SSD
Problem:
While training any model with TensorFlow 2.0, at some point during an epoch the GPU freezes: its power draw drops to around 70 W, core utilization goes to 0, and memory utilization stays stuck at some arbitrary value. No error or exception is raised when this happens. The only way to recover is to restart the Jupyter kernel and run everything again from scratch. I first assumed there was a problem in my own code, so I tried to reproduce the issue by training a DenseNet on CIFAR-100, and the problem still occurs.
If I train on multiple GPUs the freeze also happens, but only rarely. With a single GPU, however, it is guaranteed to get stuck at some point.
Below is the code used to train on CIFAR-100:
from densenet import DenseNet
from tensorflow.keras.datasets import cifar100
import tensorflow as tf
import numpy as np
from tqdm import tqdm_notebook as tqdm

# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = cifar100.load_data(label_mode='fine')
num_classes = 100
y_test_original = y_test

# Convert class vectors to binary class matrices [one-hot encoding]
y_train = tf.keras.utils.to_categorical(y_train, num_classes)
y_test = tf.keras.utils.to_categorical(y_test, num_classes)

X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

# Normalize each channel with the training-set mean and std
for i in range(3):
    mean = np.mean(X_train[:, :, :, i])
    std = np.std(X_train[:, :, :, i])
    X_train[:, :, :, i] = (X_train[:, :, :, i] - mean) / std
    X_test[:, :, :, i] = (X_test[:, :, :, i] - mean) / std

with tf.device('/gpu:0'):
    model = DenseNet(input_shape=(32, 32, 3), dense_blocks=3, dense_layers=-1,
                     growth_rate=12, nb_classes=100, dropout_rate=0.2,
                     bottleneck=True, compression=0.5, weight_decay=1e-4, depth=100)

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01,
                                    momentum=0.9,
                                    nesterov=True,
                                    name='SGD')
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

# Step-wise learning rate schedule: 0.01 -> 0.001 -> 0.0001
def scheduler(epoch):
    if epoch < 151:
        return 0.01
    elif epoch < 251:
        return 0.001
    elif epoch < 301:
        return 0.0001

callback = tf.keras.callbacks.LearningRateScheduler(scheduler)

# Pass the scheduler callback so the learning-rate schedule is actually applied
model.fit(X_train, y_train, validation_data=(X_test, y_test),
          epochs=300, callbacks=[callback], verbose=2)
PS: I even tried the code on a laptop with an i7-8750H and an RTX 2060 (with 32 GB RAM and a 970 EVO NVMe). Unfortunately, I ran into the same GPU-freeze problem.
Does anyone know what the issue is?
It may be related to the package versions; try downgrading to tensorflow 1.14.
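If you try the downgrade, a minimal sanity check (a sketch, with 1.14 being the version suggested above) to confirm which build is actually active:

# Confirm the active TensorFlow build and GPU visibility after the downgrade
# (e.g. after running `pip install tensorflow-gpu==1.14` in this environment).
import tensorflow as tf
print(tf.__version__)              # should report 1.14.x after the downgrade
print(tf.test.is_gpu_available())  # True if the GPU build picks up CUDA/cuDNN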
It could also be related to how you are placing the model on the GPU. Try building it on the CPU instead, like:
from tensorflow.keras.utils import multi_gpu_model

with tf.device('/cpu:0'):
    model = DenseNet(....)

parallel_model = multi_gpu_model(model, gpus=4)
Then you can use parallel_model for training, and so on.
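For completeness, a minimal sketch of that suggestion, assuming the DenseNet constructor, data arrays, and SGD settings from the question above (multi_gpu_model lives in tensorflow.keras.utils in TF 2.0 and is deprecated in favour of tf.distribute.MirroredStrategy; the batch size below is just an example value, not taken from the question):

# Build the template model on the CPU so its weights live in host memory,
# then replicate it across the 4 GPUs with multi_gpu_model.
import tensorflow as tf
from tensorflow.keras.utils import multi_gpu_model

with tf.device('/cpu:0'):
    model = DenseNet(input_shape=(32, 32, 3), dense_blocks=3, dense_layers=-1,
                     growth_rate=12, nb_classes=100, dropout_rate=0.2,
                     bottleneck=True, compression=0.5, weight_decay=1e-4, depth=100)

parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer=tf.keras.optimizers.SGD(learning_rate=0.01,
                                                         momentum=0.9, nesterov=True),
                       metrics=['accuracy'])

# Train on the parallel wrapper; each GPU processes a slice of every batch.
# Save or checkpoint weights from `model`, not from the wrapper.
parallel_model.fit(X_train, y_train,
                   validation_data=(X_test, y_test),
                   epochs=300, batch_size=256,  # example batch size
                   verbose=2)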