Why does GPU memory keep increasing when I sweep over model parameters?

Votes: 0 · Answers: 1

I'm trying to evaluate the classification error of models with different dropout rates for a given architecture. As I do this, memory usage keeps growing and I can't stop it from happening (see the code below for details):

N=2048 split 0 memory usage
{'current': 170630912, 'peak': 315827456}
{'current': 345847552, 'peak': 430210560}
{'current': 530811136, 'peak': 610477568}
...
{'current': 1795582208, 'peak': 1873805056}
N=2048 split 1 memory usage
{'current': 1978317568, 'peak': 2056609280}
{'current': 2157136640, 'peak': 2235356160}
...
2024-12-15 18:55:04.141690: W external/local_xla/xla/tsl/framework/bfc_allocator.cc:497] Allocator (GPU_0_bfc) ran out of memory trying to allocate 52.00MiB (rounded to 54531328)requested by op 
...
2024-12-15 18:55:04.144298: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 54531208 bytes.
...

Here is the relevant part of the code I'm running, including some unsuccessful attempts to clear memory after each iteration.

import tensorflow as tf
import tensorflow_datasets as tfds
import gc

batch_size = 128
sizes = [2048 + n * batch_size * 5 for n in range(10)]
dropout_points = 10

vals_ds = tfds.load(
    'mnist',
    split=[f'train[{k}%:{k+10}%]' for k in range(0, 100, 10)],
    as_supervised=True,
)
trains_ds = tfds.load(
    'mnist',
    split=[f'train[:{k}%]+train[{k+10}%:]' for k in range(0, 100, 10)],
    as_supervised=True,
)
_, ds_info = tfds.load('mnist', with_info=True)


def normalize_img(image, label):
    return tf.cast(image, tf.float32) / 255., label


for N in sizes:
    for i, (ds_train, ds_test) in enumerate(zip(trains_ds, vals_ds)):
        ds_train = ds_train.map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
        ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
        ds_train = ds_train.batch(batch_size)

        ds_test = ds_test.map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
        ds_test = ds_test.batch(batch_size)

        print(f"N={N} split {i} memory usage")
        with open(f"out_{N}_{i}.csv", "w") as f:
            f.write(("retention_rate,"
                     "train_loss,"
                     "train_err,"
                     "test_loss,"
                     "test_err,"
                     "epochs\n"))
            for p in range(dropout_points):
                dropout_rate = p / dropout_points

                layers = [tf.keras.layers.Flatten(input_shape=(28, 28))]
                for _ in range(4):  # note: don't reuse `i` here, it would shadow the split index from the outer loop
                    layers.append(tf.keras.layers.Dense(N, activation='relu'))
                    layers.append(tf.keras.layers.Dropout(dropout_rate))
                layers.append(tf.keras.layers.Dense(10))

                with tf.device('/GPU:0'):
                    model = tf.keras.models.Sequential(layers)
                    model.compile(
                        optimizer=tf.keras.optimizers.Adam(0.001),
                        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
                    )

                    callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)
                    history = model.fit(
                        ds_train,
                        epochs=100,
                        validation_data=ds_test,
                        verbose=0,
                        callbacks=[callback]
                    )

                    train_loss, train_acc = model.evaluate(ds_train, verbose=0)
                    test_loss, test_acc = model.evaluate(ds_test, verbose=0)
                    epochs = len(history.history['loss'])
                f.write((
                    f"{1 - dropout_rate},"
                    f"{train_loss},"
                    f"{1 - train_acc},"
                    f"{test_loss},"
                    f"{1 - test_acc},"
                    f"{epochs}\n"))
                del model
                tf.keras.backend.clear_session()
                gc.collect()
                print(tf.config.experimental.get_memory_info('GPU:0'))

How can I run this loop efficiently without the memory usage growing?

python tensorflow machine-learning dropout
1 Answer

0 votes

Yes, as far as I know, this memory usage is completely normal. One of my projects was a comparative study between some custom models and transfer-learning models. Every time I ran one of the models, it would use up all the VRAM available on the P100 in Kaggle, and memory usage spiked to nearly 22 GB as soon as model fitting started.

I'd suggest using a smaller batch size during fitting to reduce resource usage somewhat, though I'm no expert and am practically a beginner at this. I'd also suggest trying different learning rates and optimizers, and reducing the number of epochs; in my experience there's hardly any noticeable difference after about 40-50 epochs.
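Another option, if `clear_session()` and `gc.collect()` aren't releasing memory, is to run each configuration in a separate Python process: when the process exits, all of its GPU memory is returned to the driver. Below is a minimal sketch of that pattern (not tested against your setup); the worker source is a stub standing in for the model building and fitting from your question, and the worker script and its JSON-on-stdout protocol are assumptions of mine, not part of your code:

```python
import json
import subprocess
import sys
import tempfile
import textwrap

# Stub worker: in the real script its body would build the Sequential model,
# call model.fit, evaluate, and print the metrics as JSON. Because every
# configuration runs in a fresh interpreter, GPU memory is freed on exit.
WORKER_SOURCE = textwrap.dedent("""
    import json, sys
    N, dropout_rate = int(sys.argv[1]), float(sys.argv[2])
    # ... build / fit / evaluate the model here ...
    print(json.dumps({"N": N, "dropout_rate": dropout_rate, "test_err": 0.0}))
""")


def run_one(worker_path, N, dropout_rate):
    """Train one configuration in a fresh interpreter and return its metrics."""
    proc = subprocess.run(
        [sys.executable, worker_path, str(N), str(dropout_rate)],
        capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)


if __name__ == "__main__":
    # Write the stub worker to a temp file; in practice it would be a real
    # train_worker.py living next to the sweep script.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(WORKER_SOURCE)
        worker_path = f.name
    for p in range(3):
        print(run_one(worker_path, 2048, p / 10))
```

The sweep loop then only collects small JSON results, so nothing TensorFlow allocates can accumulate across iterations.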
