Hugging Face | tokenizer.batch_encode_plus produces inconsistent columns across datasets

Question

I am currently using tokenizer.batch_encode_plus, applying the same tokenizer to different datasets/lists of text. df_train_feats and df_test_feats end up with a different number of columns.

df_test_feats.shape
Out[2]: (2, 8)

df_train_feats.shape
Out[3]: (2, 20)

Because of this column mismatch, passing the data to an xgboost model raises an error.

import os, sys
import pandas as pd

import torch
from transformers import AutoModel, AutoTokenizer
str_token = 'distilbert-base-uncased'


if __name__ == '__main__':
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    model_check_point = 'distilbert-base-uncased'
    model = AutoModel.from_pretrained(model_check_point)
    tokenizer = AutoTokenizer.from_pretrained(model_check_point, add_prefix_space=True, use_fast=False)

    # padding=True pads only to the longest sequence within this batch
    df_train_feats_encoded = tokenizer.batch_encode_plus(["today I went to the movies ", "today I went to the movies and had dinner at saints a new resturant in italy"], max_length=20, padding=True)
    df_train_feats = pd.DataFrame(df_train_feats_encoded['input_ids'])  # shape (2, 20)

    # same tokenizer and arguments, but shorter sentences give fewer columns
    df_test_feats_encoded = tokenizer.batch_encode_plus(['we could not play paddal','it rain most of the afternoon'], max_length=20, padding=True)
    df_test_feats = pd.DataFrame(df_test_feats_encoded['input_ids'])  # shape (2, 8)

How can I fix this so that the input data for xgboost has the same dataframe shape, or so that the tokenizer output is consistent regardless of the dataset?

Answer 1

A tokenizer pads to the longest sequence in its current input, so tokenizing two separate batches can produce outputs of different shapes.

For example:

>>> tokenize([["Hello World"], 
              ["Hello There Mister"]])
[[123, 533, <pad>],
 [123, 3535, 6834]]

>>> tokenize([["Hello World"], 
              ["Hello There Mister X Y Z"]])
[[123, 533, <pad>, <pad>, <pad>, <pad>],
 [123, 3535, 6834, 58, 59, 60]]

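So you either have to tokenize everything together and then split it into train and test, or manually pad the smaller output to match the size of the larger set.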
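
The manual-padding option can be sketched as below. This is a minimal illustration rather than part of the original answer: pad_batch is a hypothetical helper, the token ids are made-up stand-ins for encoded['input_ids'], and the pad id of 0 is an assumption (check tokenizer.pad_token_id for your tokenizer).

```python
from typing import List


def pad_batch(batch: List[List[int]], width: int, pad_id: int = 0) -> List[List[int]]:
    """Right-pad (or truncate) every row of token ids to exactly `width` entries."""
    return [row[:width] + [pad_id] * max(0, width - len(row)) for row in batch]


# Made-up token ids standing in for tokenizer output (assumption, not real output).
train_ids = [[101, 2651, 102, 0, 0, 0], [101, 2651, 1045, 2253, 2000, 102]]
test_ids = [[101, 2057, 102, 0], [101, 2009, 4542, 102]]

# Pick one width for both sets so the resulting frames share the same columns.
width = max(len(row) for row in train_ids + test_ids)
train_padded = pad_batch(train_ids, width)
test_padded = pad_batch(test_ids, width)

print(len(train_padded[0]), len(test_padded[0]))
```

Both padded lists can then be passed to pd.DataFrame as in the question. With the Hugging Face tokenizer itself, passing padding='max_length' together with truncation=True and a fixed max_length should give the same fixed width directly, without a separate padding step.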
