ValueError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]] - Tokenizing BERT / Distilbert Error

Question

import pandas as pd
from sklearn.model_selection import train_test_split

def split_data(path):
  df = pd.read_csv(path)
  return train_test_split(df, test_size=0.1, random_state=100)

train, test = split_data(DATA_DIR)
train_texts, train_labels = train['text'].to_list(), train['sentiment'].to_list() 
test_texts, test_labels = test['text'].to_list(), test['sentiment'].to_list() 

train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.1, random_state=100)

from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
valid_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

I get this error when I try to tokenize the texts split from the dataframe with the BERT tokenizer.

Answer 1

I had the same error. The problem was that I had None in my list, e.g.:

import pandas as pd
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-german-cased')

# create test dataframe
texts = ['Vero Moda Damen Übergangsmantel Kurzmantel Chic Business Coatigan SALE',
         'Neu Herren Damen Sportschuhe Sneaker Turnschuhe Freizeit 1975 Schuhe Gr. 36-46',
         'KOMBI-ANGEBOT Zuckerpaste STRONG / SOFT / ZUBEHÖR -Sugaring Wachs Haarentfernung',
         None]

labels = [1, 2, 3, 1]

d = {'texts': texts, 'labels': labels} 
test_df = pd.DataFrame(d)

So, before converting the DataFrame column to a list, I removed all rows containing None.

test_df = test_df.dropna()
texts = test_df["texts"].tolist()
texts_encodings = tokenizer(texts, truncation=True, padding=True)

This worked for me.
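
To see which entries would trigger the error before calling the tokenizer, a small check like the following may help. This is only a sketch: find_invalid_texts is a hypothetical helper name (not part of transformers), the sample list is made up, and it assumes the texts are already in a plain Python list.

```python
def find_invalid_texts(texts):
    """Return (index, value) pairs for entries that are not plain strings.

    Hypothetical helper for illustration; not part of transformers.
    """
    return [(i, t) for i, t in enumerate(texts) if not isinstance(t, str)]


# Made-up sample: None, NaN and an integer are all non-string entries.
sample_texts = ["good product", None, float("nan"), 42, "bad product"]

for index, value in find_invalid_texts(sample_texts):
    print(f"row {index}: {value!r} ({type(value).__name__})")
```

If the returned list is empty, every entry is a str, and the cause of the error is more likely the nesting of the input than its element types.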

Answer 2

In my case, I had to set

is_split_into_words=True

https://huggingface.co/transformers/main_classes/tokenizer.html

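The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).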
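
The ambiguity mentioned in the documentation can be made concrete with a small shape check. This is a sketch, not the library's own logic: describe_input is a hypothetical helper, it inspects only the first element, and it merely illustrates how the same nesting reads differently depending on is_split_into_words.

```python
def describe_input(text, is_split_into_words=False):
    """Describe how the nesting of `text` would be read (hypothetical helper).

    Only the first element of a list is inspected, so mixed lists are not detected.
    """
    if isinstance(text, str):
        return "single sequence"
    if not isinstance(text, (list, tuple)) or not text:
        return "invalid"
    first = text[0]
    if isinstance(first, str):
        # A flat list of strings is either one pretokenized sequence or a batch.
        if is_split_into_words:
            return "single pretokenized sequence"
        return "batch of sequences"
    if isinstance(first, (list, tuple)):
        # A list of lists is a batch of pretokenized sequences only with the flag.
        if is_split_into_words:
            return "batch of pretokenized sequences"
        return "ambiguous without is_split_into_words=True"
    return "invalid"


print(describe_input(["hello", "world"]))
print(describe_input(["hello", "world"], is_split_into_words=True))
print(describe_input([["hello", "world"], ["hi"]]))
```

When the texts are lists of words, the flag would then be passed to the tokenizer call itself, e.g. tokenizer(batch, is_split_into_words=True, truncation=True, padding=True).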

Answer 3

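Similar to MarkusOdenthal, I had a non-string type in my list. I fixed it by converting the column to string and then to a list, before splitting it into train and test segments. So you would do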

train_texts = train['text'].astype(str).to_list()
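
One caveat with a blanket string conversion: in plain Python, str(None) yields the literal text 'None', so missing rows can end up tokenized as if they were real text. Below is a minimal sketch that drops missing entries first and keeps the labels aligned. clean_pairs is a hypothetical helper and the sample data is made up for illustration.

```python
import math


def clean_pairs(texts, labels):
    """Drop rows whose text is None or NaN, then cast the remaining texts to str.

    Hypothetical helper for illustration; labels stay aligned with their texts.
    """
    cleaned_texts, cleaned_labels = [], []
    for text, label in zip(texts, labels):
        is_nan = isinstance(text, float) and math.isnan(text)
        if text is None or is_nan:
            continue  # skip missing text together with its label
        cleaned_texts.append(str(text))
        cleaned_labels.append(label)
    return cleaned_texts, cleaned_labels


texts, labels = clean_pairs(["great", None, 12345, float("nan")], [1, 0, 1, 0])
print(texts)   # ['great', '12345']
print(labels)  # [1, 1]
```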

Answer 4

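In the tokenizer, the first text argument must be a str, for example: train_encodings = tokenizer(str(train_texts), truncation=True, padding=True). Note that if train_texts is a list, str(train_texts) encodes the list's string representation as one single sequence; to get one encoding per text, convert each element with str() instead.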

Answer 5

import pandas as pd
from sklearn.model_selection import train_test_split

def split_data(path):
  df = pd.read_csv(path)
  return train_test_split(df, test_size=0.2, random_state=100)

train, test = split_data(DATA_DIR)
train_texts, train_labels = train['text'].to_list(), train['sentiment'].to_list() 
test_texts, test_labels = test['text'].to_list(), test['sentiment'].to_list() 

train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.2, random_state=100)

from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
valid_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

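Try changing the size of the split; that worked for me. My understanding is that the split data was not sufficient for the tokenizer to tokenize.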

Answer 6

I ran into the same error. Changing tokenize_function as follows worked for me.

Before:

def tokenize_function(examples):
    return tokenizer(examples["text"],padding='max_length', truncation=True,max_length=512,return_tensors='pt')

After:

def tokenize_function(examples):
    if isinstance(examples["text"], list):
        examples["text"] = [str(text) for text in examples["text"]]
    else:
        examples["text"] = str(examples["text"])
    return tokenizer(examples["text"],padding='max_length', truncation=True,max_length=512,return_tensors='pt')

Here, "text" is the column name in my dataframe. I use a custom BERT for the model.
