How is the input sequence length of a transformer determined?

The texts I feed to BERT are very short. I chose a maximum length of 31 and got the following error:

ValueError: Wrong shape for input_ids (shape torch.Size([31])) or attention_mask (shape torch.Size([31]))
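In transformers 2.x this particular ValueError is raised when the input tensors are one-dimensional: the model expects a batch dimension, i.e. shape (batch_size, sequence_length), not (sequence_length,). A minimal sketch, assuming a standard BertModel (the model name and text here are illustrative):

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')

    ids = tokenizer.encode('a short text', max_length=31, pad_to_max_length=True)
    input_ids = torch.tensor(ids)       # 1-D: torch.Size([31]) -- passing this raises the error
    input_ids = input_ids.unsqueeze(0)  # 2-D: torch.Size([1, 31]) -- a batch of one, as expected
    attention_mask = (input_ids != tokenizer.pad_token_id).long()
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)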

How do I set the input length for BERT?

I am using transformers version 2.9.0.
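For setting the length itself, the 2.x tokenizer API can handle padding, truncation, and special tokens, and return batched tensors directly. A sketch under that assumption (text_a and text_b are placeholder strings, not names from the question):

    encoded = tokenizer.encode_plus(
        text_a,                      # placeholder: first text
        text_b,                      # placeholder: optional second text
        max_length=31,
        pad_to_max_length=True,
        return_tensors='pt',
    )
    # encoded['input_ids'].shape      -> torch.Size([1, 31])
    # encoded['attention_mask'].shape -> torch.Size([1, 31])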

The model call:

Code related to tokenizing the text and building the transformer input:

def _get_transformer_input2(tokens_a, tokens_b, max_seq_length, tokenizer, model_specs):
    # Build the special-token layout: [CLS] tokens_a [SEP] tokens_b [SEP] for BERT;
    # <s> tokens_a </s></s> tokens_b </s> for RoBERTa.
    tokens = []
    segment_ids = []
    tokens.append(model_specs['CLS_TOKEN'])
    segment_ids.append(0)
    for token in tokens_a:
        tokens.append(token)
        segment_ids.append(0)
    tokens.append(model_specs['SEP_TOKEN'])
    segment_ids.append(0)
    if model_specs['MODEL_TYPE'] == 'roberta':
        # RoBERTa separates the two segments with a double SEP (</s></s>)
        # and does not use segment ids, so they stay 0 throughout.
        tokens.append(model_specs['SEP_TOKEN'])
        segment_ids.append(0)
        for token in tokens_b:
            tokens.append(token)
            segment_ids.append(0)
        tokens.append(model_specs['SEP_TOKEN'])
        segment_ids.append(0)
    else:
        # BERT-style models mark the second segment with segment id 1.
        for token in tokens_b:
            tokens.append(token)
            segment_ids.append(1)
        tokens.append(model_specs['SEP_TOKEN'])
        segment_ids.append(1)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    # The mask has 1 for real tokens and 0 for padding tokens. Only real
    # tokens are attended to.
    input_mask = [1] * len(input_ids)
    # Zero-pad up to the sequence length. RoBERTa's <pad> token id is 1,
    # whereas BERT's is 0.
    while len(input_ids) < max_seq_length:
        if model_specs['MODEL_TYPE'] == 'roberta':
            input_ids.append(1)
        else:
            input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)
    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length
    return tokens, input_ids, input_mask, segment_ids
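The model call itself is not shown above. A hypothetical sketch of how the returned lists could be fed to a BERT-style model (model, tokenizer, and model_specs stand in for the question's objects; the important part is the batch dimension added by wrapping each list in an outer list):

    import torch

    # tokens_a / tokens_b are assumed to be word-piece tokenized already,
    # e.g. tokens_a = tokenizer.tokenize('first segment')
    tokens, input_ids, input_mask, segment_ids = _get_transformer_input2(
        tokens_a, tokens_b, 31, tokenizer, model_specs)

    input_ids = torch.tensor([input_ids])        # shape (1, 31), not (31,)
    attention_mask = torch.tensor([input_mask])  # same batch dimension
    token_type_ids = torch.tensor([segment_ids])
    outputs = model(input_ids=input_ids,
                    attention_mask=attention_mask,
                    token_type_ids=token_type_ids)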
python deep-learning bert-language-model transformer-model