How can I avoid this memory allocation error when training a Keras model?


I've been following this guide to try to learn how to create a POS tagger with Keras.

I'm using Python 3.9, and I have installed TensorFlow 2.10 together with CUDA Toolkit 11.2 and cuDNN 8.2, as this is the last configuration natively supported on Windows 10.

I'm training on an NVIDIA GeForce RTX 2070 SUPER with 8 GB of VRAM, and my PC has 64 GB of RAM.

The data I'm using for training is the typical tuples of tokens and POS tags, grouped into a list of sentences:

[[("hello", "INTJ"), ("world", "NOUN"), ("!", "PUNCT")], [("oh", "INTJ"), ("hi", "INTJ")], ...]

These are then split into test, validation, and training sets, vectorised using DictVectorizer from sklearn, and one-hot encoded, as in the guide I'm following.
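As a rough illustration of this step, each token can be turned into a feature dict, vectorised, and its tag one-hot encoded. This is only a sketch: the `token_features` helper and its feature names are hypothetical and are not taken from the guide. One point worth noting is that `DictVectorizer` returns a SciPy sparse matrix by default, whereas `sparse=False` materialises every row-by-feature cell as a dense array.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder

sentences = [
    [("hello", "INTJ"), ("world", "NOUN"), ("!", "PUNCT")],
    [("oh", "INTJ"), ("hi", "INTJ")],
]


def token_features(sentence, index):
    """Hypothetical feature function: describe one token as a dict."""
    word = sentence[index][0]
    return {
        "word": word.lower(),
        "is_first": index == 0,
        "is_last": index == len(sentence) - 1,
        "prev_word": "" if index == 0 else sentence[index - 1][0].lower(),
    }


# One feature dict and one tag per token, flattened across all sentences.
X_dicts = [token_features(s, i) for s in sentences for i in range(len(s))]
tags = [tag for s in sentences for _, tag in s]

# sparse=True (the default) keeps the feature matrix in SciPy sparse form.
vectorizer = DictVectorizer(sparse=True, dtype=np.float32)
X = vectorizer.fit_transform(X_dicts)

# Integer-encode the tags, then one-hot encode them by indexing an identity matrix.
label_encoder = LabelEncoder()
y_int = label_encoder.fit_transform(tags)
y = np.eye(len(label_encoder.classes_), dtype=np.float32)[y_int]

print(X.shape, y.shape)
```

With a real corpus the number of feature columns grows with the vocabulary, so the choice between sparse and dense output has a large effect on memory use.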

I wrote the following function to build the model:

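# Imports assumed; the original post did not show them.
from keras.wrappers.scikit_learn import KerasClassifier
from tensorflow.keras.layers import Activation, Dense, Dropout
from tensorflow.keras.models import Sequential


def construct_model(input_dim, hidden_neurons, output_dim):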
    pos_model = Sequential([
        Dense(hidden_neurons, input_dim=input_dim),
        Activation('relu'),
        Dropout(0.2),
        Dense(hidden_neurons),
        Activation('relu'),
        Dropout(0.2),
        Dense(output_dim, activation='softmax')
    ])
    pos_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    return pos_model

I then load the data and fit the model with the following code:

if __name__ == "__main__":

    X_train = processed_data[0]
    y_train = processed_data[1]

    X_val = processed_data[2]
    y_val = processed_data[3]

    X_test = processed_data[4]
    y_test = processed_data[5]

    model_params = {
        'build_fn': construct_model,
        'input_dim': X_train.shape[1],
        'hidden_neurons': 512,
        'output_dim': y_train.shape[1],
        'epochs': 5,
        'batch_size': 256,
        'verbose': 1,
        'validation_data': (X_val, y_val),
        'shuffle': True
    }

    classifier = KerasClassifier(**model_params)
    pos_model = classifier.fit(X_train, y_train)

Whenever I try to fit the model, I get an error with a long traceback:

2023-12-25 09:44:36.255452: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5973 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2070 SUPER, pci bus id: 0000:0a:00.0, compute capability: 7.5
2023-12-25 09:57:57.132922: W tensorflow/core/common_runtime/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 73.90GiB (rounded to 79346949120)requested by op _EagerConst
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
Current allocation summary follows.
2023-12-25 09:57:57.134352: I tensorflow/core/common_runtime/bfc_allocator.cc:1033] BFCAllocator dump for GPU_0_bfc
2023-12-25 09:57:57.134940: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (256):  Total Chunks: 12, Chunks in use: 12. 3.0KiB allocated for chunks. 3.0KiB in use in bin. 120B client-requested in use in bin.
2023-12-25 09:57:57.135083: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (512):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.135206: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (1024):     Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2023-12-25 09:57:57.135492: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (2048):     Total Chunks: 2, Chunks in use: 2. 4.0KiB allocated for chunks. 4.0KiB in use in bin. 4.0KiB client-requested in use in bin.
2023-12-25 09:57:57.135686: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (4096):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.136594: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (8192):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.136830: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (16384):    Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.137034: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (32768):    Total Chunks: 1, Chunks in use: 1. 34.0KiB allocated for chunks. 34.0KiB in use in bin. 34.0KiB client-requested in use in bin.
2023-12-25 09:57:57.137230: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (65536):    Total Chunks: 1, Chunks in use: 0. 67.8KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.137483: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (131072):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.137682: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (262144):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.137801: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (524288):   Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.138025: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (1048576):  Total Chunks: 2, Chunks in use: 1. 2.90MiB allocated for chunks. 1.00MiB in use in bin. 1.00MiB client-requested in use in bin.
2023-12-25 09:57:57.138253: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (2097152):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.138490: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (4194304):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.138743: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (8388608):  Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.138971: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (16777216):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.139212: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (33554432):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.139371: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (67108864):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.139585: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (134217728):    Total Chunks: 1, Chunks in use: 1. 205.92MiB allocated for chunks. 205.92MiB in use in bin. 205.92MiB client-requested in use in bin.
2023-12-25 09:57:57.139717: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (268435456):    Total Chunks: 2, Chunks in use: 0. 5.63GiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.140005: I tensorflow/core/common_runtime/bfc_allocator.cc:1056] Bin for 73.90GiB was 256.00MiB, Chunk State: 
2023-12-25 09:57:57.141115: I tensorflow/core/common_runtime/bfc_allocator.cc:1062]   Size: 408.84MiB | Requested Size: 0B | in_use: 0 | bin_num: 20, prev:   Size: 1.00MiB | Requested Size: 1.00MiB | in_use: 1 | bin_num: -1, next:   Size: 205.92MiB | Requested Size: 205.92MiB | in_use: 1 | bin_num: -1
2023-12-25 09:57:57.141309: I tensorflow/core/common_runtime/bfc_allocator.cc:1062]   Size: 5.23GiB | Requested Size: 0B | in_use: 0 | bin_num: 20, prev:   Size: 205.92MiB | Requested Size: 205.92MiB | in_use: 1 | bin_num: -1
2023-12-25 09:57:57.141420: I tensorflow/core/common_runtime/bfc_allocator.cc:1069] Next region of size 6263144448
2023-12-25 09:57:57.141763: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d000000 of size 256 next 1
2023-12-25 09:57:57.141835: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d000100 of size 1280 next 2
2023-12-25 09:57:57.141920: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d000600 of size 256 next 3
2023-12-25 09:57:57.141987: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d000700 of size 256 next 4
2023-12-25 09:57:57.142074: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d000800 of size 256 next 6
2023-12-25 09:57:57.142135: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d000900 of size 2048 next 7
2023-12-25 09:57:57.142194: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d001100 of size 256 next 5
2023-12-25 09:57:57.142252: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d001200 of size 256 next 8
2023-12-25 09:57:57.142581: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d001300 of size 2048 next 12
2023-12-25 09:57:57.142678: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d001b00 of size 256 next 13
2023-12-25 09:57:57.142789: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d001c00 of size 256 next 11
2023-12-25 09:57:57.142910: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d001d00 of size 256 next 17
2023-12-25 09:57:57.143002: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d001e00 of size 256 next 18
2023-12-25 09:57:57.143142: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d001f00 of size 256 next 14
2023-12-25 09:57:57.143262: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d002000 of size 256 next 19
2023-12-25 09:57:57.143450: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] Free  at 130d002100 of size 69376 next 20
2023-12-25 09:57:57.143583: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d013000 of size 34816 next 21
2023-12-25 09:57:57.143682: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] Free  at 130d01b800 of size 1990144 next 15
2023-12-25 09:57:57.143840: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d201600 of size 1048576 next 16
2023-12-25 09:57:57.143986: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] Free  at 130d301600 of size 428696832 next 9
2023-12-25 09:57:57.144123: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 1326bd7b00 of size 215922688 next 10
2023-12-25 09:57:57.144276: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] Free  at 13339c3300 of size 5615373568 next 18446744073709551615
2023-12-25 09:57:57.144457: I tensorflow/core/common_runtime/bfc_allocator.cc:1094]      Summary of in-use Chunks by size: 
2023-12-25 09:57:57.144711: I tensorflow/core/common_runtime/bfc_allocator.cc:1097] 12 Chunks of size 256 totalling 3.0KiB
2023-12-25 09:57:57.144796: I tensorflow/core/common_runtime/bfc_allocator.cc:1097] 1 Chunks of size 1280 totalling 1.2KiB
2023-12-25 09:57:57.144876: I tensorflow/core/common_runtime/bfc_allocator.cc:1097] 2 Chunks of size 2048 totalling 4.0KiB
2023-12-25 09:57:57.144950: I tensorflow/core/common_runtime/bfc_allocator.cc:1097] 1 Chunks of size 34816 totalling 34.0KiB
2023-12-25 09:57:57.145023: I tensorflow/core/common_runtime/bfc_allocator.cc:1097] 1 Chunks of size 1048576 totalling 1.00MiB
2023-12-25 09:57:57.145094: I tensorflow/core/common_runtime/bfc_allocator.cc:1097] 1 Chunks of size 215922688 totalling 205.92MiB
2023-12-25 09:57:57.145168: I tensorflow/core/common_runtime/bfc_allocator.cc:1101] Sum Total of in-use chunks: 206.96MiB
2023-12-25 09:57:57.145231: I tensorflow/core/common_runtime/bfc_allocator.cc:1103] total_region_allocated_bytes_: 6263144448 memory_limit_: 6263144448 available bytes: 0 curr_region_allocation_bytes_: 12526288896
2023-12-25 09:57:57.145763: I tensorflow/core/common_runtime/bfc_allocator.cc:1109] Stats: 
Limit:                      6263144448
InUse:                       217014528
MaxInUse:                    647770624
NumAllocs:                          33
MaxAllocSize:                215922688
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2023-12-25 09:57:57.146178: W tensorflow/core/common_runtime/bfc_allocator.cc:491] *_____*****_________________________________________________________________________________________
Traceback (most recent call last):
  File "C:\Users\admd9\PycharmProjects\codalab-sigtyp2024\train_pos_tagger.py", line 104, in <module>
    pos_model = classifier.fit(X_train, y_train)
  File "C:\Users\admd9\anaconda3\envs\tf_codalab_sharedtask\lib\site-packages\keras\wrappers\scikit_learn.py", line 248, in fit
    return super().fit(x, y, **kwargs)
  File "C:\Users\admd9\anaconda3\envs\tf_codalab_sharedtask\lib\site-packages\keras\wrappers\scikit_learn.py", line 175, in fit
    history = self.model.fit(x, y, **fit_args)
  File "C:\Users\admd9\anaconda3\envs\tf_codalab_sharedtask\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\admd9\anaconda3\envs\tf_codalab_sharedtask\lib\site-packages\tensorflow\python\framework\constant_op.py", line 102, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.

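Evidently this error is related to how memory is being allocated, and it may be a result of the validation set being passed to the GPU all at once.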
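A quick back-of-the-envelope check, using only the two byte counts printed in the log above, shows the scale of the mismatch. The `dense_size_gib` helper is a hypothetical name introduced here for illustration; which array triggered the request, and its dtype, are not known from the log, so the cell counts below are shown for both common float widths.

```python
def dense_size_gib(n_rows, n_cols, bytes_per_value):
    """Size in GiB of a dense n_rows x n_cols array."""
    return n_rows * n_cols * bytes_per_value / 2**30


requested_bytes = 79_346_949_120  # "rounded to" value in the allocator warning
gpu_limit_bytes = 6_263_144_448   # memory_limit_ reported in the same log

print(f"requested: {requested_bytes / 2**30:.2f} GiB")
print(f"GPU limit: {gpu_limit_bytes / 2**30:.2f} GiB")

# How many array cells the single request corresponds to, per dtype.
print("cells if float64:", requested_bytes // 8)
print("cells if float32:", requested_bytes // 4)

# The same helper can be applied to any array before fitting, e.g.
# dense_size_gib(X_val.shape[0], X_val.shape[1], X_val.dtype.itemsize)
```

The request is a single tensor roughly an order of magnitude larger than the whole GPU allocation, and it comes from an `_EagerConst` op that copies an input from CPU to GPU. That would be consistent with a full dense array being converted in one piece rather than batch by batch, which would also explain why lowering `batch_size` made no difference.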

I've looked for solutions online, and most seem to suggest reducing the batch size. I tried reducing the batch size to 2, but this did not help.

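I also tried inserting the following code block, which I found here, after loading the data, to let TensorFlow allocate GPU memory as it is needed (memory growth):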

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)

This did not solve the problem either.

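Finally, someone in this thread suggested calling gc.collect() inside the loop to clear RAM after each iteration, but I'm not using a loop as the user who asked that question was, so I don't see how to make this work for me.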

How can I resolve this problem and train my model?

tensorflow keras out-of-memory
1 Answer

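It looks like the entire dataset is being loaded into GPU memory. I suggest implementing a generator to load the data, which avoids passing the whole dataset to the GPU at once.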

For some examples, I suggest looking at this thread:

Failed copying input tensor from CPU to GPU in order to run GatherV2: Dst tensor is not initialized. [Op:GatherV2]
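A minimal sketch of the generator approach might look like the following. It is an assumption-laden illustration rather than the linked thread's code: `batch_generator` and `steps_per_epoch` are hypothetical names, and the generator accepts either a NumPy array or a SciPy sparse matrix, turning only one batch at a time into a dense float32 array.

```python
import numpy as np


def batch_generator(X, y, batch_size, shuffle=True, seed=0):
    """Yield (X_batch, y_batch) forever, densifying one batch at a time."""
    n_samples = X.shape[0]
    rng = np.random.default_rng(seed)
    while True:  # loop indefinitely so the generator can serve several epochs
        order = rng.permutation(n_samples) if shuffle else np.arange(n_samples)
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            X_batch = X[idx]
            # SciPy sparse matrices expose .toarray(); dense arrays pass through.
            if hasattr(X_batch, "toarray"):
                X_batch = X_batch.toarray()
            yield (
                np.asarray(X_batch, dtype=np.float32),
                np.asarray(y[idx], dtype=np.float32),
            )


def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to cover the data once (ceiling division)."""
    return -(-n_samples // batch_size)


# Untested usage sketch, calling the Keras model directly rather than
# through KerasClassifier:
# model = construct_model(X_train.shape[1], 512, y_train.shape[1])
# model.fit(
#     batch_generator(X_train, y_train, 256),
#     steps_per_epoch=steps_per_epoch(X_train.shape[0], 256),
#     validation_data=batch_generator(X_val, y_val, 256, shuffle=False),
#     validation_steps=steps_per_epoch(X_val.shape[0], 256),
#     epochs=5,
# )
```

Because the validation data is also fed through a generator here, neither the training set nor the validation set is ever converted to a single GPU tensor; only one batch of `batch_size` rows is dense at any moment.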
