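I am training the attention-based encoder-dual-decoder (EDD) architecture on the PubTabNet dataset (repository: https://github.com/ibm-aur-nlp/EDD).
When training the dual decoder on 2 GPUs, I get the following error: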
Encoder LR changed to 0.001000
Tag Decoder LR changed to 0.001000
Using 2 GPUs
Traceback (most recent call last):
  File "/home/mertovla/EDD/train_dual_decoder.py", line 233, in <module>
    decoder.train_epoch(train_loader=train_loader,
  File "/home/mertovla/EDD/models.py", line 2359, in train_epoch
    output = self(imgs, tags_sorted, tag_lens, cells, cell_lens, num_cells)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mertovla/anaconda3/envs/edd/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mertovla/EDD/models.py", line 2267, in forward
    tag_H[j] = tag_H[j][s_ind]
               ~~~~~~~~^^^^^^^
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:1)
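The message says that the index tensor (`s_ind`) and the indexed tensor (`tag_H[j]`, on `cuda:1`) are on different CUDA devices. One possible cause, not confirmed against the EDD code, is that the index is created on a fixed device while the multi-GPU wrapper runs this forward pass on the second GPU. Below is a minimal sketch of the general workaround: move the index to the device of the tensor being indexed. The helper name `index_on_same_device` and the stand-in tensors are hypothetical and are not part of the EDD repository.

```python
import torch


def index_on_same_device(tensor: torch.Tensor, index: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: index `tensor` after moving `index` to its device."""
    return tensor[index.to(tensor.device)]


# Choose devices so the sketch also runs on a machine without GPUs.
if torch.cuda.is_available():
    data_device = torch.device(f"cuda:{torch.cuda.device_count() - 1}")  # last visible GPU
    index_device = torch.device("cuda:0")  # differs from data_device when 2+ GPUs are visible
else:
    data_device = index_device = torch.device("cpu")

# Stand-ins for tag_H[j] and s_ind from the traceback (shapes are assumptions).
hidden = torch.arange(12.0, device=data_device).reshape(4, 3)
s_ind = torch.tensor([2, 0, 3, 1], device=index_device)

# With two GPUs, hidden[s_ind] would hit the device mismatch; the helper avoids it.
reordered = index_on_same_device(hidden, s_ind)
```

Applied to the failing line, the equivalent change would be `tag_H[j] = tag_H[j][s_ind.to(tag_H[j].device)]`, assuming `s_ind` is a tensor at that point. This only addresses the symptom; if other tensors in `forward` are also created on a fixed device, they would need the same treatment.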
Command executed:
python train_dual_decoder.py \
--checkpoint {CHECKPOINT_DIR}/PubTabNet_False_keep_AR_300_max_tag_len_100_max_cell_len_512_max_image_size/checkpoint_12.pth.tar \
--out_dir {CHECKPOINT_DIR}/cell_decoder \
--data_folder {TRAIN_DATA_DIR} \
--data_name PubTabNet_False_keep_AR_300_max_tag_len_100_max_cell_len_512_max_image_size \
--epochs 23 \
--batch_size 8 \
--fine_tune_encoder \
--encoder_lr 0.001 \
--fine_tune_tag_decoder \
--tag_decoder_lr 0.001 \
--tag_loss_weight 0.5 \
--cell_decoder_lr 0.001 \
--cell_loss_weight 0.5 \
--tag_embed_dim 16 \
--cell_embed_dim 80 \
--encoded_image_size 28 \
--decoder_cell LSTM \
--tag_attention_dim 256 \
--cell_attention_dim 256 \
--tag_decoder_dim 256 \
--cell_decoder_dim 512 \
--cell_decoder_type 1 \
--cnn_stride '{"tag":1, "cell":1}' \
--resume \
--predict_content