GPU randomly freezes while training a tf.keras model

Problem description

Versions in use: tensorflow-gpu 2.0, CUDA v10, cuDNN v7.6.5, Python 3.7.4

System specs: i9-7920X, 4x RTX 2080 Ti, 128 GB 2400 MHz RAM, 2 TB SATA SSD

Problem:

While training any model with TensorFlow 2.0, at some point during an epoch the GPU freezes: its power draw drops to around 70 W, core utilization goes to 0, and memory utilization stays stuck at some fixed value. No error or exception is raised when this happens. The only way to recover is to restart the Jupyter kernel and run everything from scratch. At first I suspected something was wrong with my own code, so I tried to reproduce the issue by training a DenseNet on CIFAR-100, and the problem still occurs.
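The power, core-utilization, and memory numbers above are the kind of figures nvidia-smi reports. A minimal polling sketch like the one below (illustrative only, not part of my training script; it assumes nvidia-smi is on the PATH) is enough to watch them while a run is going:

# Illustrative only: poll nvidia-smi to log power draw, core utilization and memory use.
import subprocess
import time

def log_gpu_stats(interval_sec=5):
    query = ("nvidia-smi "
             "--query-gpu=index,power.draw,utilization.gpu,memory.used "
             "--format=csv,noheader")
    while True:
        out = subprocess.run(query.split(), capture_output=True, text=True)
        print(out.stdout.strip())
        time.sleep(interval_sec)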

If I train on multiple GPUs, the GPUs also freeze, but it happens much more rarely. With a single GPU, however, the run is guaranteed to get stuck at some point.
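For reference, a common way to run the same model on several GPUs in TF 2.0 is tf.distribute.MirroredStrategy; the sketch below shows that pattern and is illustrative only, not necessarily the exact setup of my multi-GPU runs:

# Illustrative multi-GPU setup with tf.distribute.MirroredStrategy (TF 2.0);
# not necessarily the exact configuration of the multi-GPU runs mentioned above.
import tensorflow as tf
from densenet import DenseNet  # same local module as in the code below

strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs by default
with strategy.scope():
    model = DenseNet(input_shape=(32, 32, 3), dense_blocks=3, dense_layers=-1,
                     growth_rate=12, nb_classes=100, dropout_rate=0.2,
                     bottleneck=True, compression=0.5, weight_decay=1e-4, depth=100)
    model.compile(loss='categorical_crossentropy',
                  optimizer=tf.keras.optimizers.SGD(0.01, momentum=0.9, nesterov=True),
                  metrics=['accuracy'])
# model.fit(...) is then called as usual; Keras handles replica synchronization.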

Below is the code used for training on CIFAR-100:

from densenet import DenseNet
from tensorflow.keras.datasets import cifar100
import tensorflow as tf
import numpy as np
from tqdm import tqdm_notebook as tqdm

# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = cifar100.load_data(label_mode='fine')
num_classes = 100
y_test_original = y_test

# Convert class vectors to binary class matrices. [one hot encoding]
y_train = tf.keras.utils.to_categorical(y_train, num_classes)
y_test = tf.keras.utils.to_categorical(y_test, num_classes)

X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

for i in range(3):
    mean = np.mean(X_train[:,:,:,i])
    std = np.std(X_train[:,:,:,i])
    X_train[:,:,:,i] = (X_train[:,:,:,i] - mean)/std
    X_test[:,:,:,i] = (X_test[:,:,:,i] - mean)/std


with tf.device('/gpu:0'):  
    model = DenseNet(input_shape=(32, 32, 3), dense_blocks=3, dense_layers=-1,
                     growth_rate=12, nb_classes=100, dropout_rate=0.2,
                     bottleneck=True, compression=0.5, weight_decay=1e-4, depth=100)


optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, 
                                    momentum=0.9, 
                                    nesterov=True,
                                    name='SGD')
model.compile(loss = 'categorical_crossentropy', optimizer = optimizer, metrics = ['accuracy'])

def scheduler(epoch):
    if epoch < 151:
        return 0.01
    elif epoch < 251:
        return 0.001
    elif epoch < 301:
        return 0.0001

callback = tf.keras.callbacks.LearningRateScheduler(scheduler)

model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=300, verbose=2,
          callbacks=[callback])  # pass the LR scheduler so it actually takes effect

PS: I even tried the code on a laptop with an i7-8750H and an RTX 2060 (32 GB RAM and a 970 EVO NVMe). Unfortunately, I ran into the same GPU-freezing problem.

Does anyone know what the issue is?

tensorflow keras nvidia freeze tensorflow2.0
1 Answer

It may be related to package versions; try downgrading to TensorFlow 1.14.
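If you do try that, it is worth confirming which build actually ends up loaded; a quick sanity check (assuming a standard pip install of tensorflow-gpu) looks like:

# Quick sanity check of the installed TensorFlow build (illustrative).
import tensorflow as tf

print(tf.__version__)                # e.g. '1.14.0' after downgrading
print(tf.test.is_built_with_cuda())  # True for a CUDA-enabled build
print(tf.test.is_gpu_available())    # True if a GPU is actually visible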

It could also be an issue with how the model is placed on the GPU. Try building it on the CPU instead, like this:

with tf.device('/cpu:0'):
    model = DenseNet(....)

parallel_model = multi_gpu_model(model, gpus=4)

Then you can compile and train with parallel_model, and so on.
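For example, a minimal sketch of that pattern (it assumes the same DenseNet module and the X_train/y_train arrays from the question; multi_gpu_model lives in tf.keras.utils up to TF 2.3):

# Sketch of the suggested pattern: build the model on the CPU, replicate it
# across the GPUs with multi_gpu_model, then compile and train the replica.
import tensorflow as tf
from tensorflow.keras.utils import multi_gpu_model
from densenet import DenseNet  # same local module as in the question

with tf.device('/cpu:0'):
    model = DenseNet(input_shape=(32, 32, 3), dense_blocks=3, dense_layers=-1,
                     growth_rate=12, nb_classes=100, dropout_rate=0.2,
                     bottleneck=True, compression=0.5, weight_decay=1e-4, depth=100)

parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer=tf.keras.optimizers.SGD(0.01, momentum=0.9, nesterov=True),
                       metrics=['accuracy'])
parallel_model.fit(X_train, y_train, validation_data=(X_test, y_test),
                   epochs=300, verbose=2)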
