Today, when using BERT with PyTorch and CUDA, I got the following error:
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [234,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Epoch [1/100]
Iter: 0, Train Loss: 1.1, Train Acc: 39.06%, Val Loss: 1.0, Val Acc: 51.90%, Time: 0:00:04 *
Iter: 10, Train Loss: 0.99, Train Acc: 57.81%, Val Loss: 1.0, Val Acc: 52.01%, Time: 0:00:11 *
Iter: 20, Train Loss: 1.0, Train Acc: 42.19%, Val Loss: 0.99, Val Acc: 52.01%, Time: 0:00:17 *
Iter: 30, Train Loss: 1.0, Train Acc: 40.62%, Val Loss: 0.99, Val Acc: 52.12%, Time: 0:00:23 *
Iter: 40, Train Loss: 1.0, Train Acc: 50.00%, Val Loss: 0.98, Val Acc: 52.12%, Time: 0:00:29 *
Iter: 50, Train Loss: 1.1, Train Acc: 43.75%, Val Loss: 0.98, Val Acc: 52.12%, Time: 0:00:35 *
Traceback (most recent call last):
File "/content/drive/MyDrive/Prediction/run.py", line 38, in <module>
train(config, model, train_iter, dev_iter, test_iter)
File "/content/drive/MyDrive/Prediction/train_eval.py", line 50, in train
outputs = model(trains)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/content/drive/MyDrive/Prediction/models/BERT+Covid.py", line 68, in forward
output = self.bert(context, attention_mask=mask)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 1005, in forward
return_dict=return_dict,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 589, in forward
output_attentions,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 475, in forward
past_key_value=self_attn_past_key_value,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 408, in forward
output_attentions,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 323, in forward
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [234,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
# ... (several similar assert lines omitted due to the character limit)
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:702: indexSelectLargeIndex: block: [235,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed
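On the GPU this indexing assert is reported asynchronously, so the Python stack trace can point at an unrelated line; on the CPU the same out-of-range embedding lookup raises a synchronous IndexError at the exact spot. A minimal sketch of that failure mode (standalone toy example, not the original model):

```python
import torch
import torch.nn as nn

# An nn.Embedding with vocab_size rows only accepts ids 0 .. vocab_size - 1.
vocab_size = 21128
emb = nn.Embedding(vocab_size, 8)

emb(torch.tensor([0, vocab_size - 1]))  # in range: works
msg = ""
try:
    emb(torch.tensor([vocab_size]))     # one past the end
except IndexError as e:
    msg = str(e)
print(msg)
```

On CPU the message is the same "index out of range in self" seen below; on CUDA the identical lookup surfaces as the device-side assert above.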
To find out exactly where the problem was, I ran my code again on the CPU, and got this error: IndexError: index out of range in self.
Traceback (most recent call last):
File "/content/drive/MyDrive/Prediction/run.py", line 37, in <module>
train(config, model, train_iter, dev_iter, test_iter)
File "/content/drive/MyDrive/Prediction/train_eval.py", line 49, in train
outputs = model(trains)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/content/drive/MyDrive/Prediction/models/BERT+Covid.py", line 66, in forward
output = self.bert(context, attention_mask=mask, )
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 993, in forward
past_key_values_length=past_key_values_length,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 215, in forward
inputs_embeds = self.word_embeddings(input_ids)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/sparse.py", line 160, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 2043, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
Based on guidance I found online, I have already ruled out the following:
The input length does not exceed the model's maximum length (the pad size I set is 98, and I printed the input shape right before the error; it is indeed (batch_size, pad_size)).
len(tokenizer) == model.config.vocab_size, so that is not the issue either.
I have no idea what the problem could be now. Can anyone help me?
My model structure is:
class Model(nn.Module):
    def __init__(self, config):
        super(Model, self).__init__()
        self.modelConfig = BertConfig.from_pretrained('./bert_pretrain/config.json')
        self.bert = BertModel.from_pretrained(config.bert_path, config=self.modelConfig)
        for param in self.bert.parameters():
            param.requires_grad = False  # freeze BERT weights
        self.cls_fc_layer = FCLayer(config.hidden_size, config.word_size, config.dropout_rate)
        self.label_classifier = FCLayer(
            config.word_size + config.numerical_size,
            config.num_classes,
            config.dropout_rate,
            use_activation=False,
        )

    def forward(self, x):
        context = x[0]    # input token ids
        mask = x[2]       # attention mask
        numerical = x[3]  # size (batch_size, 18)
        output = self.bert(context, attention_mask=mask)
        pooled_output = output[1]                                  # size (batch_size, 768)
        pooled_output = self.cls_fc_layer(pooled_output)           # size (batch_size, 18)
        concat_h = torch.cat([pooled_output, numerical], dim=-1)   # size (batch_size, 36)
        logits = self.label_classifier(concat_h)
        return logits
Had the same problem. In our case the issue was that we had added a new special token (the padding token [PAD]) to the tokenizer. Changing the padding token to a token the tokenizer already knows (in my case, the eos_token) solved the problem :)
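A toy illustration of why that works (plain nn.Embedding, not the transformers API): adding a brand-new special token grows the id space but not the embedding matrix, so the new id indexes past the last row. The two fixes named in the comments are the usual remedies in transformers:

```python
import torch
import torch.nn as nn

# Toy model of the situation: a vocabulary of 2 tokens and an embedding
# built for exactly those 2 rows.
vocab = {"hello": 0, "world": 1}
emb = nn.Embedding(len(vocab), 4)

# A brand-new [PAD] token gets id 2, a row the embedding does not have.
vocab["[PAD]"] = len(vocab)
pad_ids = torch.tensor([vocab["[PAD]"]])

crashed = False
try:
    emb(pad_ids)                # row 2 does not exist
except IndexError:
    crashed = True

# Fix 1 (what worked above): reuse a known token as the pad token, e.g.
# `tokenizer.pad_token = tokenizer.eos_token` in transformers.
# Fix 2: grow the embedding to match, which is what
# `model.resize_token_embeddings(len(tokenizer))` does in transformers.
emb_fixed = nn.Embedding(len(vocab), 4)
out = emb_fixed(pad_ids)        # id 2 is now in range
```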
I have solved it!!!
By printing out the maximum input_ids of each batch:
for i, (trains, labels) in enumerate(train_iter):
    print("train max input:", torch.max(trains[0]))
    print("train min input:", torch.min(trains[0]))
    print("train max label:", torch.max(labels))
    print("train min label:", torch.min(labels))
I got the following output. The maximum input_id is 21128, while my tokenizer's length is also 21128, which means the maximum valid input_id should be 21127; that is exactly where the index goes out of range!
train max input: tensor(21128, device='cuda:0')
train min input: tensor(0, device='cuda:0')
train max label: tensor(2, device='cuda:0')
train min label: tensor(0, device='cuda:0')
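The print loop above can be turned into a fail-fast check (a hypothetical helper, not part of the original code) so a bad batch raises a clear message before it ever reaches the embedding:

```python
import torch

def check_input_ids(input_ids: torch.Tensor, vocab_size: int) -> None:
    # Valid token ids are 0 .. vocab_size - 1. Anything outside that range
    # crashes inside nn.Embedding (IndexError on CPU, device-side assert on GPU).
    lo, hi = int(input_ids.min()), int(input_ids.max())
    if lo < 0 or hi >= vocab_size:
        raise ValueError(
            f"token ids span [{lo}, {hi}] but must lie in [0, {vocab_size - 1}]"
        )

# The situation from the log: max id 21128 with len(tokenizer) == 21128.
bad_batch = torch.tensor([[0, 101, 21128]])
error = ""
try:
    check_input_ids(bad_batch, 21128)
except ValueError as e:
    error = str(e)
print(error)
```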
The reason this error occurred is probably that I had manually edited the BERT model's vocab.txt file (sorry, I'm new to this...). I fixed it by reloading the original BERT model, vocab, and config.