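I have been following this guide to learn how to create a POS tagger with Keras.
I am using Python 3.9, and I have installed TensorFlow 2.10 with CUDA Toolkit 11.2 and cuDNN 8.2, as this is the last configuration natively supported on Windows 10.
I am training on an NVIDIA GeForce RTX 2070 SUPER with 8 GB of VRAM, and my PC has 64 GB of RAM.
The data I am training on is the typical tuples of tokens and POS tags, combined into a list of sentences: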
[[("hello", "INTJ"), ("world", "NOUN"), ("!", "PUNCT")], [("oh", "INTJ"), ("hi", "INTJ")], ...]
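These are split into test, validation and training sets, then vectorised using DictVectorizer from sklearn and one-hot encoded, as in the guide I am following.
I wrote the following function to build the model: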
def construct_model(input_dim, hidden_neurons, output_dim):
    pos_model = Sequential([
        Dense(hidden_neurons, input_dim=input_dim),
        Activation('relu'),
        Dropout(0.2),
        Dense(hidden_neurons),
        Activation('relu'),
        Dropout(0.2),
        Dense(output_dim, activation='softmax')
    ])
    pos_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return pos_model
I then load the data and fit the model with the following code:
if __name__ == "__main__":
    X_train = processed_data[0]
    y_train = processed_data[1]
    X_val = processed_data[2]
    y_val = processed_data[3]
    X_test = processed_data[4]
    y_test = processed_data[5]
    model_params = {
        'build_fn': construct_model,
        'input_dim': X_train.shape[1],
        'hidden_neurons': 512,
        'output_dim': y_train.shape[1],
        'epochs': 5,
        'batch_size': 256,
        'verbose': 1,
        'validation_data': (X_val, y_val),
        'shuffle': True
    }
    classifier = KerasClassifier(**model_params)
    pos_model = classifier.fit(X_train, y_train)
Whenever I try to fit the model, I get an error with a long traceback:
2023-12-25 09:44:36.255452: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5973 MB memory: -> device: 0, name: NVIDIA GeForce RTX 2070 SUPER, pci bus id: 0000:0a:00.0, compute capability: 7.5
2023-12-25 09:57:57.132922: W tensorflow/core/common_runtime/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 73.90GiB (rounded to 79346949120)requested by op _EagerConst
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
Current allocation summary follows.
Current allocation summary follows.
2023-12-25 09:57:57.134352: I tensorflow/core/common_runtime/bfc_allocator.cc:1033] BFCAllocator dump for GPU_0_bfc
2023-12-25 09:57:57.134940: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (256): Total Chunks: 12, Chunks in use: 12. 3.0KiB allocated for chunks. 3.0KiB in use in bin. 120B client-requested in use in bin.
2023-12-25 09:57:57.135083: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (512): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.135206: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (1024): Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2023-12-25 09:57:57.135492: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (2048): Total Chunks: 2, Chunks in use: 2. 4.0KiB allocated for chunks. 4.0KiB in use in bin. 4.0KiB client-requested in use in bin.
2023-12-25 09:57:57.135686: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (4096): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.136594: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (8192): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.136830: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (16384): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.137034: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (32768): Total Chunks: 1, Chunks in use: 1. 34.0KiB allocated for chunks. 34.0KiB in use in bin. 34.0KiB client-requested in use in bin.
2023-12-25 09:57:57.137230: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (65536): Total Chunks: 1, Chunks in use: 0. 67.8KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.137483: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (131072): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.137682: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (262144): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.137801: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (524288): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.138025: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (1048576): Total Chunks: 2, Chunks in use: 1. 2.90MiB allocated for chunks. 1.00MiB in use in bin. 1.00MiB client-requested in use in bin.
2023-12-25 09:57:57.138253: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (2097152): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.138490: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (4194304): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.138743: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (8388608): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.138971: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (16777216): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.139212: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (33554432): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.139371: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (67108864): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.139585: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (134217728): Total Chunks: 1, Chunks in use: 1. 205.92MiB allocated for chunks. 205.92MiB in use in bin. 205.92MiB client-requested in use in bin.
2023-12-25 09:57:57.139717: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Bin (268435456): Total Chunks: 2, Chunks in use: 0. 5.63GiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-12-25 09:57:57.140005: I tensorflow/core/common_runtime/bfc_allocator.cc:1056] Bin for 73.90GiB was 256.00MiB, Chunk State:
2023-12-25 09:57:57.141115: I tensorflow/core/common_runtime/bfc_allocator.cc:1062] Size: 408.84MiB | Requested Size: 0B | in_use: 0 | bin_num: 20, prev: Size: 1.00MiB | Requested Size: 1.00MiB | in_use: 1 | bin_num: -1, next: Size: 205.92MiB | Requested Size: 205.92MiB | in_use: 1 | bin_num: -1
2023-12-25 09:57:57.141309: I tensorflow/core/common_runtime/bfc_allocator.cc:1062] Size: 5.23GiB | Requested Size: 0B | in_use: 0 | bin_num: 20, prev: Size: 205.92MiB | Requested Size: 205.92MiB | in_use: 1 | bin_num: -1
2023-12-25 09:57:57.141420: I tensorflow/core/common_runtime/bfc_allocator.cc:1069] Next region of size 6263144448
2023-12-25 09:57:57.141763: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d000000 of size 256 next 1
2023-12-25 09:57:57.141835: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d000100 of size 1280 next 2
2023-12-25 09:57:57.141920: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d000600 of size 256 next 3
2023-12-25 09:57:57.141987: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d000700 of size 256 next 4
2023-12-25 09:57:57.142074: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d000800 of size 256 next 6
2023-12-25 09:57:57.142135: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d000900 of size 2048 next 7
2023-12-25 09:57:57.142194: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d001100 of size 256 next 5
2023-12-25 09:57:57.142252: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d001200 of size 256 next 8
2023-12-25 09:57:57.142581: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d001300 of size 2048 next 12
2023-12-25 09:57:57.142678: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d001b00 of size 256 next 13
2023-12-25 09:57:57.142789: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d001c00 of size 256 next 11
2023-12-25 09:57:57.142910: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d001d00 of size 256 next 17
2023-12-25 09:57:57.143002: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d001e00 of size 256 next 18
2023-12-25 09:57:57.143142: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d001f00 of size 256 next 14
2023-12-25 09:57:57.143262: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d002000 of size 256 next 19
2023-12-25 09:57:57.143450: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] Free at 130d002100 of size 69376 next 20
2023-12-25 09:57:57.143583: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d013000 of size 34816 next 21
2023-12-25 09:57:57.143682: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] Free at 130d01b800 of size 1990144 next 15
2023-12-25 09:57:57.143840: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 130d201600 of size 1048576 next 16
2023-12-25 09:57:57.143986: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] Free at 130d301600 of size 428696832 next 9
2023-12-25 09:57:57.144123: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] InUse at 1326bd7b00 of size 215922688 next 10
2023-12-25 09:57:57.144276: I tensorflow/core/common_runtime/bfc_allocator.cc:1089] Free at 13339c3300 of size 5615373568 next 18446744073709551615
2023-12-25 09:57:57.144457: I tensorflow/core/common_runtime/bfc_allocator.cc:1094] Summary of in-use Chunks by size:
2023-12-25 09:57:57.144711: I tensorflow/core/common_runtime/bfc_allocator.cc:1097] 12 Chunks of size 256 totalling 3.0KiB
2023-12-25 09:57:57.144796: I tensorflow/core/common_runtime/bfc_allocator.cc:1097] 1 Chunks of size 1280 totalling 1.2KiB
2023-12-25 09:57:57.144876: I tensorflow/core/common_runtime/bfc_allocator.cc:1097] 2 Chunks of size 2048 totalling 4.0KiB
2023-12-25 09:57:57.144950: I tensorflow/core/common_runtime/bfc_allocator.cc:1097] 1 Chunks of size 34816 totalling 34.0KiB
2023-12-25 09:57:57.145023: I tensorflow/core/common_runtime/bfc_allocator.cc:1097] 1 Chunks of size 1048576 totalling 1.00MiB
2023-12-25 09:57:57.145094: I tensorflow/core/common_runtime/bfc_allocator.cc:1097] 1 Chunks of size 215922688 totalling 205.92MiB
2023-12-25 09:57:57.145168: I tensorflow/core/common_runtime/bfc_allocator.cc:1101] Sum Total of in-use chunks: 206.96MiB
2023-12-25 09:57:57.145231: I tensorflow/core/common_runtime/bfc_allocator.cc:1103] total_region_allocated_bytes_: 6263144448 memory_limit_: 6263144448 available bytes: 0 curr_region_allocation_bytes_: 12526288896
2023-12-25 09:57:57.145763: I tensorflow/core/common_runtime/bfc_allocator.cc:1109] Stats:
Limit: 6263144448
InUse: 217014528
MaxInUse: 647770624
NumAllocs: 33
MaxAllocSize: 215922688
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
2023-12-25 09:57:57.146178: W tensorflow/core/common_runtime/bfc_allocator.cc:491] *_____*****_________________________________________________________________________________________
Traceback (most recent call last):
  File "C:\Users\admd9\PycharmProjects\codalab-sigtyp2024\train_pos_tagger.py", line 104, in <module>
    pos_model = classifier.fit(X_train, y_train)
  File "C:\Users\admd9\anaconda3\envs\tf_codalab_sharedtask\lib\site-packages\keras\wrappers\scikit_learn.py", line 248, in fit
    return super().fit(x, y, **kwargs)
  File "C:\Users\admd9\anaconda3\envs\tf_codalab_sharedtask\lib\site-packages\keras\wrappers\scikit_learn.py", line 175, in fit
    history = self.model.fit(x, y, **fit_args)
  File "C:\Users\admd9\anaconda3\envs\tf_codalab_sharedtask\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\admd9\anaconda3\envs\tf_codalab_sharedtask\lib\site-packages\tensorflow\python\framework\constant_op.py", line 102, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
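Apparently this error is related to how memory is allocated, and may be the result of the validation set being passed to the GPU all at once.
I looked online for solutions, and most seem to suggest reducing the batch size. I tried reducing the batch size to 2, but this did not help.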
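For scale, the figures in the allocator log can be checked with a few lines of arithmetic. This is only a sketch: the two byte counts are copied from the log above, while `dense_bytes` and the example shape are hypothetical, since the real shapes of the arrays are not shown here.

```python
# Byte counts copied from the allocator log above.
requested_bytes = 79_346_949_120  # "rounded to 79346949120"
gpu_limit_bytes = 6_263_144_448   # "Limit: 6263144448"

GIB = 1024 ** 3


def dense_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed for a dense n_rows x n_cols array (itemsize=8 is float64)."""
    return n_rows * n_cols * itemsize


print(f"requested: {requested_bytes / GIB:.2f} GiB")
print(f"GPU limit: {gpu_limit_bytes / GIB:.2f} GiB")
print(f"ratio:     {requested_bytes / gpu_limit_bytes:.1f}x")

# Hypothetical shape, for illustration only: 1,000,000 rows x 10,000 features.
example = dense_bytes(1_000_000, 10_000)
print(f"example dense float64 array: {example / GIB:.2f} GiB")
```

The single request is more than twelve times what the GPU makes available. It appears to come from `_EagerConst` converting a whole array into one tensor rather than from any individual batch, which would be consistent with the batch size making no difference.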
I also tried inserting the following block of code, which I found here, after loading my data, to let TensorFlow grow its GPU memory allocation as needed:
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
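This did not solve the problem either.
Finally, someone in this thread suggested calling gc.collect() inside a loop to clear RAM after each iteration, but I am not using a loop like the user who asked that question, so I don't see how I could make this work.
How can I solve this problem and train my model?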
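It looks like the entire dataset is being loaded into GPU memory. I suggest you implement a generator to load the data, which avoids passing the whole dataset to the GPU at once.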
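A minimal sketch of such a generator is shown below. The names `batch_generator`, `steps_per_epoch`, `_n_rows` and `_densify` are hypothetical helpers, not part of Keras or sklearn. The sketch assumes the features can stay in a sliceable container, for example the SciPy sparse matrix that `DictVectorizer` returns by default, so that only one mini-batch is made dense at a time. It also assumes the Keras model returned by `construct_model` is fitted directly, since the `KerasClassifier` wrapper expects whole arrays.

```python
import random


def _n_rows(a):
    # SciPy sparse matrices do not support len(), so prefer .shape when present.
    return a.shape[0] if hasattr(a, "shape") else len(a)


def _densify(a):
    # Convert a sparse slice to a dense array; leave anything else unchanged.
    return a.toarray() if hasattr(a, "toarray") else a


def batch_generator(X, y, batch_size, shuffle=True, seed=None):
    """Yield (X_batch, y_batch) pairs forever, one mini-batch at a time.

    Only the order of the batches is shuffled, not the rows inside them.
    """
    starts = list(range(0, _n_rows(X), batch_size))
    rng = random.Random(seed)
    while True:  # loop forever; the caller decides how many steps make an epoch
        if shuffle:
            rng.shuffle(starts)
        for s in starts:
            yield _densify(X[s:s + batch_size]), _densify(y[s:s + batch_size])


def steps_per_epoch(n_rows, batch_size):
    """Number of batches needed to see every row once (ceiling division)."""
    return -(-n_rows // batch_size)


# Possible usage with the model from the question (not executed here):
# model = construct_model(X_train.shape[1], 512, y_train.shape[1])
# model.fit(
#     batch_generator(X_train, y_train, 256),
#     steps_per_epoch=steps_per_epoch(X_train.shape[0], 256),
#     epochs=5,
#     validation_data=batch_generator(X_val, y_val, 256, shuffle=False),
#     validation_steps=steps_per_epoch(X_val.shape[0], 256),
# )
```

With this approach, only one batch of `batch_size` rows should need to be copied to the GPU at a time, so the batch size becomes the setting that controls GPU memory use.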
I suggest you take a look at this thread for some examples: