TensorFlow GPU error: Dst tensor is not initialized during model.fit()

Question

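TensorFlow throws an error while a VGG16 transfer-learning model is being fitted. Each image patch is 224x224 RGB, which is about 602 KB as float32 (by the formula vRAM / patch size / 4, the maximum batch size would be about 5000). The GPU is an RTX 4070 Super with 12 GB of vRAM, running on Ubuntu 24: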

2024-09-23 05:49:25.162827: I external/local_tsl/tsl/framework/bfc_allocator.cc:1112] Sum Total of in-use chunks: 3.76GiB
2024-09-23 05:49:25.162840: I external/local_tsl/tsl/framework/bfc_allocator.cc:1114] Total bytes in pool: 10671489024 memory_limit_: 10671489024 available bytes: 0 curr_region_allocation_bytes_: 21342978048
2024-09-23 05:49:25.162860: I external/local_tsl/tsl/framework/bfc_allocator.cc:1119] Stats: 
Limit:                     10671489024
InUse:                      4039777024
MaxInUse:                   4091143936
NumAllocs:                         215
MaxAllocSize:               3507001344
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2024-09-23 05:49:25.162904: W external/local_tsl/tsl/framework/bfc_allocator.cc:499] ***************************************_____________________________________________________________
Traceback (most recent call last):
  File "/home/aiworker9/code/py/aimodels/common/prepare_data_vgg16.py", line 363, in <module>
    scores, history = model_fit_image_label_array(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aiworker9/code/py/aimodels/common/prepare_data_vgg16.py", line 260, in model_fit_image_label_array
    history = model.fit(
              ^^^^^^^^^^
  File "/home/aiworker9/code/py/myenv/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/aiworker9/code/py/myenv/lib/python3.12/site-packages/tensorflow/python/framework/constant_op.py", line 108, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.

Here is what I have done so far to try to solve the problem:

Reduced the batch size to 1 for both model.fit() and data.batch.

Added some GPU memory management code:

import os
# These environment variables should be set before TensorFlow is imported.
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'  # reduce memory fragmentation
# Reduce memory footprint (NVIDIA GPU related)
os.environ['XLA_FLAGS'] = '--xla_gpu_strict_conv_algorithm_picker=false'

import tensorflow as tf
from tensorflow.keras import mixed_precision

# Let GPU memory grow on demand; this must run before the GPUs are initialized
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)

# Clear any previous Keras state
tf.keras.backend.clear_session()

mixed_precision.set_global_policy('mixed_float16')

Here is the code that calls model.fit():

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

def model_fit_image_label_array(model, train_images, train_label_user_id, train_label_binary, test_images, test_label_user_id, test_label_binary, epochs=10):

    # Setting callbacks
    earlyStopping = EarlyStopping(monitor = 'val_loss', patience = 3, verbose = 0, mode = 'min') # Stop training if val_loss is not improving
    mcp_save = ModelCheckpoint('best_weights.keras', save_best_only = True, monitor = 'val_loss', mode = 'min') # Save the best configuration
    reduce_lr_loss = ReduceLROnPlateau(monitor = 'val_loss', factor = 0.4, patience = 7, verbose = 1, min_delta = 1e-4, mode = 'auto') # Reduce the learning rate when the monitored metric stops improving

    # Training model. epochs=1 as it takes long time for each epoch. reach about 99.6% accuracy
    history = model.fit(
        [train_images, train_label_user_id],
        [train_label_user_id, train_label_binary],
        batch_size = BATCH_SIZE, #1,
        epochs = epochs,
        validation_data = ([test_images, test_label_user_id], [test_label_user_id, test_label_binary]),
        callbacks = [earlyStopping, mcp_save, reduce_lr_loss]
        )

    # Evaluate model
    eval_results = model.evaluate(
        [test_images, test_label_user_id],
        [test_label_user_id, test_label_binary],
        verbose=0
        )
    # Print the training history and the evaluation results
    print(f'\nmodel.fit history keys : {history.history}')
    print(f'\neval_results : {eval_results}')

    return eval_results, history

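I ran nvidia-smi. This is the output a few seconds before the crash:

I am running out of ideas. What else can be done to fix the GPU memory problem? Is it possible that the entire dataset is somehow being loaded into GPU memory, rather than only the BATCH_SIZE specified in the code (2 here)?

Update: part of the model summary: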

│ (Embedding)         │                  │           │                   │
├─────────────────────┼──────────────────┼───────────┼───────────────────┤
│ dropout (Dropout)   │ (None, 256)      │         0 │ dense[0][0]       │
├─────────────────────┼──────────────────┼───────────┼───────────────────┤
│ flatten_1 (Flatten) │ (None, 150)      │         0 │ embedding[0][0]   │
├─────────────────────┼──────────────────┼───────────┼───────────────────┤
│ concatenate         │ (None, 406)      │         0 │ dropout[0][0],    │
│ (Concatenate)       │                  │           │ flatten_1[0][0]   │
├─────────────────────┼──────────────────┼───────────┼───────────────────┤
│ user_output (Dense) │ (None, 3)        │     1,221 │ concatenate[0][0] │
├─────────────────────┼──────────────────┼───────────┼───────────────────┤
│ binary_output       │ (None, 2)        │       814 │ concatenate[0][0] │
│ (Dense)             │                  │           │                   │
└─────────────────────┴──────────────────┴───────────┴───────────────────┘
 Total params: 21,139,657 (80.64 MB)
 Trainable params: 6,424,969 (24.51 MB)
 Non-trainable params: 14,714,688 (56.13 MB)

The dataset is created as follows:

import pandas as pd
import tensorflow as tf

def return_dataset_image_embedding(data, window_size=(224, 224), step_size=112, shuffle_buffer_size=1000, prefetch_buffer_size=1):
    print("\nBatch size: ", BATCH_SIZE)
    print("\nShuffle Buffer size : ", shuffle_buffer_size)
    print("\nPrefetch Buffer size : ", prefetch_buffer_size)
    df = pd.DataFrame(data)

    image_paths = './data/' + df['image_id'].values
    label_user_ids = df['label_user_id'].values
    label_binary_flags = df['label_binary_flag'].values

    # Cut every image into sliding-window patches, with labels repeated per patch
    image_patches, patch_label_user_ids, patch_label_binary_flags = preprocess_image_patches_image_embedding(
        image_paths, label_user_ids, label_binary_flags, window_size, step_size)

    # Create TensorFlow dataset from the in-memory patch arrays
    patch_dataset = tf.data.Dataset.from_tensor_slices((image_patches, patch_label_user_ids, patch_label_binary_flags))

    # Shuffle, augment, batch, and prefetch the dataset
    patch_dataset = patch_dataset.shuffle(buffer_size=shuffle_buffer_size)  # Shuffle data
    patch_dataset = patch_dataset.map(augment_data_image_embedding)  # Apply augmentation
    patch_dataset = patch_dataset.batch(batch_size=BATCH_SIZE)  # Create batches
    patch_dataset = patch_dataset.prefetch(buffer_size=prefetch_buffer_size)  # reduce gpu memory usage

    return patch_dataset

Answer 1

Memory problems can be hard to track down. You can try the suggestions below to identify and resolve the issue.

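Reduce the batch size further: since you have already tried a batch size of 1, make sure no other part of the code increases the batch size or adds to the memory load.
Use mixed-precision training: you are already doing this with mixed_precision.set_global_policy('mixed_float16'). Make sure your model and optimizer are compatible with mixed precision. If you run into problems, consider temporarily disabling mixed precision to see whether it affects memory usage.
Clear GPU memory before training: you are already calling tf.keras.backend.clear_session() to clear the session. Call it at the start of the training code so that memory is freed before the model is loaded.
Limit GPU memory growth: you are already setting memory growth with: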

tf.config.experimental.set_memory_growth(gpu, True)

Make sure this call runs before any model code executes.
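On the mixed-precision point above, one common compatibility detail is keeping the final output layers in float32 while the rest of the model computes in float16. The sketch below illustrates that idea only. The toy model is hypothetical: the input width of 406 and the layer names are borrowed from the question's model summary, while the hidden layer and the softmax activations are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, mixed_precision

# Same global policy as in the question
mixed_precision.set_global_policy("mixed_float16")

# Hypothetical stand-in for the concatenated features in the question's summary
inputs = tf.keras.Input(shape=(406,), name="features")

# Layers without an explicit dtype follow the global policy (float16 compute)
hidden = layers.Dense(16, activation="relu", name="hidden")(inputs)

# Output heads are forced to float32 for numerical stability (activations assumed)
user_output = layers.Dense(3, activation="softmax", dtype="float32", name="user_output")(hidden)
binary_output = layers.Dense(2, activation="softmax", dtype="float32", name="binary_output")(hidden)

model = tf.keras.Model(inputs, [user_output, binary_output])

for name in ("hidden", "user_output", "binary_output"):
    print(name, model.get_layer(name).compute_dtype)
```

If the heads report float16 in your own model, that is one place where mixed precision could cause numerical trouble, although it is unlikely to be the cause of a memory error by itself.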

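Check for memory leaks: make sure your dataset is not loading all images into memory at once. Check the return_dataset_image_embedding function to confirm that it only processes one batch of images at a time.
Adjust the prefetch or buffer size: you are currently using: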

patch_dataset = patch_dataset.prefetch(buffer_size=prefetch_buffer_size)

Consider increasing prefetch_buffer_size for better data-pipeline throughput; note that a larger buffer also holds more batches in memory.
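To make the "one batch at a time" check above concrete, here is a minimal sketch of a pipeline that produces patches lazily instead of building one large array first. It is an illustration under assumptions, not the asker's code: iter_patches is a hypothetical stand-in for the sliding-window extraction and yields random data, and the label dtypes and scalar shapes are guesses.

```python
import numpy as np
import tensorflow as tf

BATCH_SIZE = 2  # value mentioned in the question


def iter_patches(num_patches=8, window_size=(224, 224)):
    """Hypothetical generator: yields one patch and its labels at a time."""
    rng = np.random.default_rng(0)
    for i in range(num_patches):
        patch = rng.random((*window_size, 3), dtype=np.float32)
        user_id = np.int32(i % 3)
        binary_flag = np.int32(i % 2)
        # (inputs, targets) laid out like the two-input, two-output fit() call
        yield (patch, user_id), (user_id, binary_flag)


signature = (
    (tf.TensorSpec(shape=(224, 224, 3), dtype=tf.float32),
     tf.TensorSpec(shape=(), dtype=tf.int32)),
    (tf.TensorSpec(shape=(), dtype=tf.int32),
     tf.TensorSpec(shape=(), dtype=tf.int32)),
)

patch_dataset = (
    tf.data.Dataset.from_generator(iter_patches, output_signature=signature)
    .shuffle(buffer_size=4)   # only the shuffle buffer is held in memory
    .batch(BATCH_SIZE)
    .prefetch(1)
)

# Pull a single batch to confirm the shapes that would reach the model
(images, user_ids), (target_user_ids, target_flags) = next(iter(patch_dataset))
print(images.shape, user_ids.shape)
```

A dataset like this would be passed directly as model.fit(patch_dataset, epochs=...), without a separate batch_size argument, so only the batches currently in flight need to be copied to the GPU.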

Profile memory usage: use the TensorFlow profiling tools to monitor memory usage. You can enable profiling through TensorBoard.
Use tf.data correctly: for example, if your dataset fits in memory, calling .cache() before .shuffle() may improve performance while still reshuffling every epoch.
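Before reaching for a profiler, the allocator statistics already printed in the question's log can be read directly. The short calculation below converts them to GiB and compares the largest single allocation with the size of one 224x224 RGB float32 patch. All numbers are copied from the log; the interpretation is a hedged reading, not a diagnosis.

```python
GIB = 1024 ** 3

# Values copied from the bfc_allocator stats in the question's log
limit_bytes = 10671489024
in_use_bytes = 4039777024
max_alloc_bytes = 3507001344

# One 224x224 RGB patch stored as float32 (4 bytes per value)
patch_bytes = 224 * 224 * 3 * 4

print(f"Limit:        {limit_bytes / GIB:.2f} GiB")
print(f"InUse:        {in_use_bytes / GIB:.2f} GiB")
print(f"MaxAllocSize: {max_alloc_bytes / GIB:.2f} GiB")
print(f"Largest allocation in float32 patches: {max_alloc_bytes / patch_bytes:.1f}")
```

A single allocation of roughly 3.27 GiB corresponds to thousands of patches, far more than a batch of 1 or 2. That is consistent with a whole array being copied to the GPU in one piece, though the log alone does not prove it.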

Some general tips:

Use nvidia-smi to monitor GPU memory usage during training.
Try smaller input images: VGG16 is normally used with 224x224 input, but smaller sizes can still be tried.
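To get a feel for how much a smaller input would save, the question's own patch-size arithmetic can be repeated for other sizes. The sizes 160 and 128 below are illustrative assumptions, and the figures cover only the stored input patches, not the activations inside the network.

```python
def patch_bytes(side, channels=3, bytes_per_value=4):
    """Bytes for one square patch; 4 bytes per value corresponds to float32."""
    return side * side * channels * bytes_per_value


baseline = patch_bytes(224)
for side in (224, 160, 128):
    size = patch_bytes(side)
    print(f"{side}x{side}: {size} bytes ({size / baseline:.0%} of 224x224)")
```

Halving the side length cuts the per-patch storage to about a quarter, since the cost grows with the square of the side.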

Let me know if this helps; otherwise I can help you fix the code.
