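TensorFlow throws an error while my VGG16 transfer-learning model is being fit. Each image patch is 224x224 RGB, which is about 602 KB as float32 (224 x 224 x 3 x 4 bytes). By the formula vRAM / patch size / 4, the maximum batch size should be about 5000. The GPU is an RTX 4070 Super with 12 GB of vRAM, running on Ubuntu 24: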
2024-09-23 05:49:25.162827: I external/local_tsl/tsl/framework/bfc_allocator.cc:1112] Sum Total of in-use chunks: 3.76GiB
2024-09-23 05:49:25.162840: I external/local_tsl/tsl/framework/bfc_allocator.cc:1114] Total bytes in pool: 10671489024 memory_limit_: 10671489024 available bytes: 0 curr_region_allocation_bytes_: 21342978048
2024-09-23 05:49:25.162860: I external/local_tsl/tsl/framework/bfc_allocator.cc:1119] Stats:
Limit: 10671489024
InUse: 4039777024
MaxInUse: 4091143936
NumAllocs: 215
MaxAllocSize: 3507001344
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
2024-09-23 05:49:25.162904: W external/local_tsl/tsl/framework/bfc_allocator.cc:499] ***************************************_____________________________________________________________
Traceback (most recent call last):
  File "/home/aiworker9/code/py/aimodels/common/prepare_data_vgg16.py", line 363, in <module>
    scores, history = model_fit_image_label_array(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aiworker9/code/py/aimodels/common/prepare_data_vgg16.py", line 260, in model_fit_image_label_array
    history = model.fit(
              ^^^^^^^^^^
  File "/home/aiworker9/code/py/myenv/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/aiworker9/code/py/myenv/lib/python3.12/site-packages/tensorflow/python/framework/constant_op.py", line 108, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
Here is what I have done so far to try to solve the problem:
Reduced the batch size to 1 for both model.fit() and data.batch
Added some GPU memory management code:
import os
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'  # reduce memory fragmentation
import tensorflow as tf
# clear GPU memory
tf.keras.backend.clear_session()
# reduce memory footprint (NVIDIA GPU related)
os.environ['XLA_FLAGS'] = '--xla_gpu_strict_conv_algorithm_picker=false'
from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy('mixed_float16')
# reduce GPU memory: allocate on demand instead of reserving everything up front
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)
Here is the code around model.fit():
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

def model_fit_image_label_array(model, train_images, train_label_user_id, train_label_binary, test_images, test_label_user_id, test_label_binary, epochs=10):
    # Setting callbacks
    earlyStopping = EarlyStopping(monitor = 'val_loss', patience = 3, verbose = 0, mode = 'min')  # Monitor the epochs and stop if val_loss is not improving
    mcp_save = ModelCheckpoint('best_weights.keras', save_best_only = True, monitor = 'val_loss', mode = 'min')  # Save the best configuration
    reduce_lr_loss = ReduceLROnPlateau(monitor = 'val_loss', factor = 0.4, patience = 7, verbose = 1, min_delta = 1e-4, mode = 'auto')  # Reduce the learning rate when the monitored metric stops improving
    # Training model. epochs=1 as it takes long time for each epoch. reach about 99.6% accuracy
    history = model.fit(
        [train_images, train_label_user_id],
        [train_label_user_id, train_label_binary],
        batch_size = BATCH_SIZE,  # 1,
        epochs = epochs,
        validation_data = ([test_images, test_label_user_id], [test_label_user_id, test_label_binary]),
        callbacks = [earlyStopping, mcp_save, reduce_lr_loss]
    )
    # Evaluate model
    eval_results = model.evaluate(
        [test_images, test_label_user_id],
        [test_label_user_id, test_label_binary],
        verbose=0
    )
    # Unpack based on the number of outputs and metrics
    print(f'\nmodel.fit history keys : {history.history}')
    print(f'\neval_results : {eval_results}')
    return eval_results, history
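I am running out of ideas. What else can I do to solve the GPU memory problem? Is it possible that the whole dataset is somehow being loaded into GPU memory, rather than the BATCH_SIZE specified in the code (2 here)?
Update: part of the model summary: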
│ (Embedding) │ │ │ │
├─────────────────────┼──────────────────┼───────────┼───────────────────┤
│ dropout (Dropout) │ (None, 256) │ 0 │ dense[0][0] │
├─────────────────────┼──────────────────┼───────────┼───────────────────┤
│ flatten_1 (Flatten) │ (None, 150) │ 0 │ embedding[0][0] │
├─────────────────────┼──────────────────┼───────────┼───────────────────┤
│ concatenate │ (None, 406) │ 0 │ dropout[0][0], │
│ (Concatenate) │ │ │ flatten_1[0][0] │
├─────────────────────┼──────────────────┼───────────┼───────────────────┤
│ user_output (Dense) │ (None, 3) │ 1,221 │ concatenate[0][0] │
├─────────────────────┼──────────────────┼───────────┼───────────────────┤
│ binary_output │ (None, 2) │ 814 │ concatenate[0][0] │
│ (Dense) │ │ │ │
└─────────────────────┴──────────────────┴───────────┴───────────────────┘
Total params: 21,139,657 (80.64 MB)
Trainable params: 6,424,969 (24.51 MB)
Non-trainable params: 14,714,688 (56.13 MB)
The dataset is created as follows:
import pandas as pd
import tensorflow as tf

def return_dataset_image_embedding(data, window_size=(224, 224), step_size=112, shuffle_buffer_size=1000, prefetch_buffer_size=1):
    print("\nBatch size: ", BATCH_SIZE)
    print("\nShuffle Buffer size : ", shuffle_buffer_size)
    print("\nPrefetch Buffer size : ", prefetch_buffer_size)
    # Build a DataFrame from the input records
    df = pd.DataFrame(data)
    image_paths = './data/' + df['image_id'].values
    label_user_ids = df['label_user_id'].values
    label_binary_flags = df['label_binary_flag'].values
    # Cut every image into patches; all patches are returned as in-memory arrays
    image_patches, patch_label_user_ids, patch_label_binary_flags = preprocess_image_patches_image_embedding(
        image_paths, label_user_ids, label_binary_flags, window_size, step_size)
    # Create TensorFlow dataset
    patch_dataset = tf.data.Dataset.from_tensor_slices((image_patches, patch_label_user_ids, patch_label_binary_flags))
    # Shuffle, augment, batch, and prefetch the dataset
    patch_dataset = patch_dataset.shuffle(buffer_size=shuffle_buffer_size)  # Shuffle data
    patch_dataset = patch_dataset.map(augment_data_image_embedding)  # Apply augmentation
    patch_dataset = patch_dataset.batch(batch_size=BATCH_SIZE)  # Create batches
    patch_dataset = patch_dataset.prefetch(buffer_size=prefetch_buffer_size)  # reduce gpu memory usage
    return patch_dataset
Memory problems can be genuinely hard to track down. You can try the approaches below to identify and fix the issue.
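Try reducing the batch size further. I see that you have already tried a batch size of 1, so make sure no other part of the code is increasing the batch size or otherwise adding to the memory load.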
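One way to check whether something other than the batch is driving memory use is to compare the size of a single batch with the largest allocation reported in the allocator log. The sketch below uses only figures from the question (patch shape, float32, BATCH_SIZE of 2, and the MaxAllocSize value); the variable names are illustrative.

```python
# Figures taken from the question and its allocator log.
PATCH_BYTES = 224 * 224 * 3 * 4     # one float32 RGB patch
MAX_ALLOC_BYTES = 3_507_001_344     # "MaxAllocSize" in the log
BATCH_SIZE = 2                      # value mentioned in the question

batch_bytes = PATCH_BYTES * BATCH_SIZE
patches_in_max_alloc = MAX_ALLOC_BYTES / PATCH_BYTES

print(f"one patch: {PATCH_BYTES:,} bytes")
print(f"one batch of {BATCH_SIZE}: {batch_bytes / 2**20:.2f} MiB")
print(f"largest allocation: {MAX_ALLOC_BYTES / 2**30:.2f} GiB")
print(f"float32 patches that fit in it: {patches_in_max_alloc:,.1f}")
```

A single batch is on the order of a megabyte, while the largest allocation is several gigabytes. That gap suggests, without proving it, that something much larger than one batch is being placed on the GPU in a single piece.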
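Use mixed-precision training. You are already using mixed precision with mixed_precision.set_global_policy('mixed_float16'). Make sure your model and optimizer are compatible with mixed precision. If you run into problems, consider disabling mixed precision temporarily to see whether it changes memory usage.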
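As a rough illustration of what "compatible with mixed precision" can mean in Keras, the sketch below builds a hypothetical toy model (not the VGG16 model from the question) and keeps the final softmax layer in float32, which is a common recommendation for numerical stability. The layer sizes are arbitrary assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, mixed_precision

mixed_precision.set_global_policy("mixed_float16")

# Hypothetical toy model, used only to inspect the dtypes each layer ends up with.
hidden_layer = layers.Dense(8, activation="relu")
# Overriding dtype keeps the output layer's computation in float32.
output_layer = layers.Dense(3, activation="softmax", dtype="float32")

inputs = tf.keras.Input(shape=(16,))
outputs = output_layer(hidden_layer(inputs))
model = tf.keras.Model(inputs, outputs)

# Under mixed_float16, layers compute in float16 but store variables in float32.
print("hidden:", hidden_layer.compute_dtype, hidden_layer.variable_dtype)
print("output:", output_layer.compute_dtype, output_layer.variable_dtype)
```

Because the variables stay in float32, mixed precision mainly shrinks activations rather than weights, so its effect on memory depends on the model.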
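Clear GPU memory before training. You are already using tf.keras.backend.clear_session() to clear the session. Call it at the start of your training code, before the model is loaded, so that state left over from earlier models is released.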
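Limit GPU memory growth. You have already enabled memory growth with:
tf.config.experimental.set_memory_growth(gpu, True)
It is also good practice to make sure this code runs before any model is built or any other GPU work is done.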
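The ordering matters because memory growth can only be set before TensorFlow has initialized the GPU. A minimal sketch of a startup sequence follows, assuming it sits at the very top of the training script; the allocator setting is the one from the question.

```python
import os

# Set allocator options before TensorFlow initializes the GPU; doing it
# before the import is the simplest way to guarantee that.
os.environ.setdefault("TF_GPU_ALLOCATOR", "cuda_malloc_async")

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
for gpu in gpus:
    try:
        tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as err:
        # Raised when the GPU runtime has already been initialized.
        print(f"Could not enable memory growth: {err}")

print(f"Visible GPUs: {len(gpus)}")
```

If the RuntimeError branch is reached, some earlier statement has already touched the GPU, and the configuration should be moved further up.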
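Check for memory leaks. Make sure your dataset is not loading all images into memory at once. Check the return_dataset_image_embedding function to confirm that it handles only one batch of images at a time.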
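In the question, model.fit() receives arrays (train_images and the label arrays), and the traceback ends in convert_to_eager_tensor with a CPU-to-GPU copy for _EagerConst. That pattern is consistent with a whole array being converted to a single tensor, although the posted code alone does not confirm it. One alternative is to stream examples through tf.data so that only a batch at a time is materialized. The sketch below is an assumption-laden illustration: patch_generator is a hypothetical stand-in that yields random data instead of real image patches, and the (inputs, targets) layout simply mirrors the two inputs and two outputs passed to fit() in the question.

```python
import numpy as np
import tensorflow as tf

PATCH_SHAPE = (224, 224, 3)

def patch_generator():
    """Yield one (inputs, targets) example at a time; random data stands in for real patches."""
    rng = np.random.default_rng(0)
    for i in range(8):
        image = rng.random(PATCH_SHAPE, dtype=np.float32)
        user_id = np.int32(i % 3)
        binary = np.int32(i % 2)
        # Two inputs and two targets, mirroring the fit() call in the question.
        yield (image, user_id), (user_id, binary)

signature = (
    (tf.TensorSpec(PATCH_SHAPE, tf.float32), tf.TensorSpec((), tf.int32)),
    (tf.TensorSpec((), tf.int32), tf.TensorSpec((), tf.int32)),
)

streamed = (
    tf.data.Dataset.from_generator(patch_generator, output_signature=signature)
    .batch(2)
    .prefetch(1)
)

inputs, targets = next(iter(streamed))
print(inputs[0].shape)  # (2, 224, 224, 3)
```

A dataset like this would be passed directly as model.fit(streamed, epochs=...), without a batch_size argument, since the dataset is already batched.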
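Adjust the prefetch or buffer size. You are currently using:
patch_dataset = patch_dataset.prefetch(buffer_size=prefetch_buffer_size)
Consider increasing prefetch_buffer_size for better data pipeline performance. Keep in mind that a larger buffer holds more batches in memory, so this helps throughput rather than memory use.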
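Instead of hand-picking a buffer size, tf.data can choose one at runtime via tf.data.AUTOTUNE. The following is a small sketch on a toy dataset of integers, not the patch dataset from the question.

```python
import tensorflow as tf

# Toy dataset: the integers 0..9 in batches of 2.
dataset = tf.data.Dataset.range(10).batch(2)

# After batch(), each prefetched element is a whole batch.
dataset = dataset.prefetch(buffer_size=tf.data.AUTOTUNE)

for batch in dataset.take(2):
    print(batch.numpy())
```

Because prefetch is applied after batch(), the buffer size counts batches rather than individual examples.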
Profile memory usage. Use TensorFlow's profiling tools to monitor memory usage. You can enable profiling through TensorBoard.
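Alongside TensorBoard (whose Keras callback accepts a profile_batch argument), a lighter-weight option is to query the allocator directly. The sketch below wraps tf.config.experimental.get_memory_info in a hypothetical helper, report_gpu_memory, and returns None when no GPU is visible so it can also run on a CPU-only machine. The device name "GPU:0" is an assumption.

```python
import tensorflow as tf

def report_gpu_memory(device="GPU:0"):
    """Return a dict with 'current' and 'peak' bytes, or None if no GPU is visible."""
    if not tf.config.list_physical_devices("GPU"):
        return None
    return tf.config.experimental.get_memory_info(device)

# Optional: print memory use at the end of every epoch during model.fit().
memory_logger = tf.keras.callbacks.LambdaCallback(
    on_epoch_end=lambda epoch, logs: print(f"epoch {epoch}: {report_gpu_memory()}")
)

print(report_gpu_memory())
```

Calling the helper once before model.fit() and once after the first batch can show whether memory jumps before training even starts.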
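Use tf.data correctly. For example, if your dataset fits in memory, use .cache() before .shuffle() so that the data is cached once while the shuffle order still changes every epoch. This may improve performance.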
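A common ordering for such a pipeline is cache, then shuffle, then random augmentation, then batch and prefetch, so that only the deterministic part is cached. The sketch below shows that ordering on a toy dataset; augment is a hypothetical stand-in for augment_data_image_embedding, and cache() with no filename keeps data in host RAM, which assumes the dataset fits there.

```python
import tensorflow as tf

def augment(x):
    # Stand-in for the real augmentation: adds a small random offset.
    return tf.cast(x, tf.float32) + tf.random.uniform([], 0.0, 0.1)

dataset = (
    tf.data.Dataset.range(6)
    .cache()                  # cache the deterministic data once
    .shuffle(buffer_size=6)   # reshuffled on each pass over the dataset
    .map(augment)             # random augmentation stays outside the cache
    .batch(2)
    .prefetch(1)
)

for epoch in range(2):
    print(epoch, [b.numpy().round(2).tolist() for b in dataset])
```

Placing cache() after shuffle() or after the augmentation would instead store one fixed order or one fixed set of augmented samples.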
Some popular tips:
Let me know if this helps; otherwise I can help you fix the code.