The texts I feed to BERT are very short. I chose a maximum length of 31 and got the following error:

ValueError: Wrong shape for input_ids (shape torch.Size([31])) or attention_mask (shape torch.Size([31]))

How do I set the BERT input length?
I am using transformers version 2.9.0.
The model call:
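Roughly, it looks like the sketch below (an illustrative reconstruction rather than my exact code; the checkpoint name is a placeholder, but the point is that the tensors I pass in are 1-D, matching the shapes in the error):

import torch
from transformers import BertModel

model = BertModel.from_pretrained('bert-base-uncased')  # illustrative checkpoint

# input_ids and input_mask are the plain Python lists of length 31
# returned by the helper below.
ids = torch.tensor(input_ids)    # torch.Size([31]) -- no batch dimension
mask = torch.tensor(input_mask)  # torch.Size([31])
outputs = model(input_ids=ids, attention_mask=mask)  # raises the ValueError above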
Code related to tokenizing the text and building the transformer inputs:

def _get_transformer_input2(tokens_a, tokens_b, max_seq_length, tokenizer, model_specs):
    tokens = []
    segment_ids = []
    tokens.append(model_specs['CLS_TOKEN'])
    segment_ids.append(0)
    for token in tokens_a:
        tokens.append(token)
        segment_ids.append(0)
    tokens.append(model_specs['SEP_TOKEN'])
    segment_ids.append(0)
    # RoBERTa separates the two segments with a double SEP token.
    if model_specs['MODEL_TYPE'] == 'roberta':
        tokens.append(model_specs['SEP_TOKEN'])
        segment_ids.append(0)
    if model_specs['MODEL_TYPE'] != 'roberta':
        for token in tokens_b:
            tokens.append(token)
            segment_ids.append(1)
        tokens.append(model_specs['SEP_TOKEN'])
        segment_ids.append(1)
    else:
        # RoBERTa does not use segment ids, so every id stays 0.
        for token in tokens_b:
            tokens.append(token)
            segment_ids.append(0)
        tokens.append(model_specs['SEP_TOKEN'])
        segment_ids.append(0)
    # Extra trailing SEP for RoBERTa.
    if model_specs['MODEL_TYPE'] == 'roberta':
        tokens.append(model_specs['SEP_TOKEN'])
        segment_ids.append(0)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    # The mask has 1 for real tokens and 0 for padding tokens. Only real
    # tokens are attended to.
    input_mask = [1] * len(input_ids)
    # Pad up to the sequence length (RoBERTa's pad token id is 1, BERT's is 0).
    while len(input_ids) < max_seq_length:
        if model_specs['MODEL_TYPE'] == 'roberta':
            input_ids.append(1)
        else:
            input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)
    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length
    return tokens, input_ids, input_mask, segment_ids
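From the shapes in the error I suspect the model expects a batch dimension, i.e. tensors of shape (batch_size, seq_len) rather than (seq_len,). Is something like the following sketch the intended way to feed a single example? (Assuming transformers 2.9.0 and a plain BertModel; tokens_a, tokens_b, tokenizer, and model_specs are the same objects as above, and the checkpoint name is a placeholder.)

import torch
from transformers import BertModel

model = BertModel.from_pretrained('bert-base-uncased')  # illustrative checkpoint
tokens, input_ids, input_mask, segment_ids = _get_transformer_input2(
    tokens_a, tokens_b, 31, tokenizer, model_specs)

# unsqueeze(0) adds the batch dimension: torch.Size([31]) -> torch.Size([1, 31])
ids = torch.tensor(input_ids).unsqueeze(0)
mask = torch.tensor(input_mask).unsqueeze(0)
types = torch.tensor(segment_ids).unsqueeze(0)
outputs = model(input_ids=ids, attention_mask=mask, token_type_ids=types)

Or is tokenizer.encode_plus(text_a, text_b, max_length=31, pad_to_max_length=True, return_tensors='pt') the intended way to build correctly shaped inputs?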