I am currently using tokenizer.batch_encode_plus, applying the same tokenizer to different datasets/lists of text. df_train_feats and df_test_feats end up with different numbers of columns.
df_test_feats.shape
Out[2]: (2, 8)
df_train_feats.shape
Out[3]: (2, 20)
Because the columns are inconsistent, passing the data to the xgboost model raises an error.
import os, sys
import pandas as pd
import torch
from transformers import AutoModel, AutoTokenizer
str_token = 'distilbert-base-uncased'
if __name__ == '__main__':
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model_check_point = 'distilbert-base-uncased'
    model = AutoModel.from_pretrained(model_check_point)
    tokenizer = AutoTokenizer.from_pretrained(model_check_point, add_prefix_space=True, use_fast=False)
    df_train_feats_encoded = tokenizer.batch_encode_plus(["today I went to the movies ", "today I went to the movies and had dinner at saints a new resturant in italy"], max_length=20, padding=True)
    df_train_feats = pd.DataFrame(df_train_feats_encoded['input_ids'])
    df_test_feats_encoded = tokenizer.batch_encode_plus(['we could not play paddal','it rain most of the afternoon'], max_length=20, padding=True)
    df_test_feats = pd.DataFrame(df_test_feats_encoded['input_ids'])
How can I fix this so that the input to xgboost has the same DataFrame shape, or so that the tokenizer output is consistent regardless of the dataset?
Tokenizers pad to the longest sequence in their current input, so tokenizing two batches separately can produce different shapes.
For example:
>>> tokenize([["Hello World"],
["Hello There Mister"]])
[[123, 533, <pad>],
[123, 3535, 6834]]
>>> tokenize([["Hello World"],
["Hello There Mister X Y Z"]])
[[123, 533, <pad>, <pad>, <pad>, <pad>],
[123, 3535, 6834, 58, 59, 60]]
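So you either have to tokenize everything together and then split it into train and test, or manually pad the smaller output to the size of the larger set.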
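The manual-padding option could look roughly like the sketch below. It is only an illustration: pad_to_length is a hypothetical helper, the token ids are made up, and pad_id=0 assumes the [PAD] id of distilbert-base-uncased (check tokenizer.pad_token_id for your own model).

```python
from typing import List


def pad_to_length(batch: List[List[int]], length: int, pad_id: int = 0) -> List[List[int]]:
    """Right-pad (or truncate) every row of token ids to exactly `length`.

    Hypothetical helper, not part of transformers. pad_id=0 is an assumption.
    """
    padded = []
    for row in batch:
        row = row[:length]  # naive truncation: cuts off anything past `length`
        padded.append(row + [pad_id] * (length - len(row)))
    return padded


# Made-up token ids standing in for two separately tokenized batches.
train_ids = [[101, 2651, 102, 0, 0], [101, 2651, 1045, 2253, 102]]
test_ids = [[101, 2057, 102, 0], [101, 2009, 4542, 102]]

FIXED_LEN = 20  # same width for every dataset
train_fixed = pad_to_length(train_ids, FIXED_LEN)
test_fixed = pad_to_length(test_ids, FIXED_LEN)

# Both batches now have the same number of columns.
print(len(train_fixed[0]), len(test_fixed[0]))
```

The tokenizer can also do this for you: passing padding='max_length' together with truncation=True and max_length=20 pads every batch to 20 columns instead of to the longest sequence in that batch. Unlike the naive truncation in this sketch, the tokenizer's own truncation keeps the special tokens in place.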