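I am trying to load some data into a DataLoader. When I preprocess the DataFrame first, for example with df.dropna(), iterating over the loader intermittently raises a KeyError at a seemingly random index. This does not happen when I do not drop any rows.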
Dataset creation:
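import pandas as pd
from sklearn.preprocessing import LabelEncoder
from transformers import BertTokenizer
from torch.utils.data import DataLoader

# load data and set column names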
column_names = ["id","category","labelStr","text"]
trainingData = pd.read_csv(r"Path.csv",names=column_names)
validationData = pd.read_csv(r"Path.csv",names=column_names)
trainingData = trainingData.drop(columns=['id', 'category'])
validationData = validationData.drop(columns=['id', 'category'])
trainingData = trainingData.dropna()  # adding this line causes the KeyError
trainingData['text'] = trainingData['text'].apply(clean_tweet)
# convert String into int
label_encoder = LabelEncoder()
# LabelEncoder expects a 1-D column, so pass the Series rather than a one-column DataFrame
trainingData['label'] = label_encoder.fit_transform(trainingData["labelStr"])
validationData['label'] = label_encoder.transform(validationData["labelStr"])
# initialize tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
MAX_LEN = 128 # Length should be enough as tweets are limited in length anyway
# create datasets
train_dataset = TweetsDataset(trainingData, tokenizer, MAX_LEN)
val_dataset = TweetsDataset(validationData, tokenizer, MAX_LEN)
DataLoader:
BATCH_SIZE = 16
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
for batch in train_loader:
    print(batch)
    break
Error:
KeyError Traceback (most recent call last)
...\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
3079 try:
-> 3080 return self._engine.get_loc(casted_key)
3081 except KeyError as err:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: 63935
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-187-9bd6dc44b4d8> in <module>
5 val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
6
----> 7 for batch in train_loader:
8 print(batch)
...
-> 3082 raise KeyError(key) from err
3083
3084 if tolerance is not None:
KeyError: 63935
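I think you may need to reset the index after dropna(). dropna() removes rows but keeps the original index labels, so the index ends up with gaps. The traceback shows a label lookup (get_loc) failing, which suggests the dataset fetches rows by label while the DataLoader asks for positions 0 to len - 1; with shuffle=True the missing labels are hit at random.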
trainingData = trainingData.dropna()
trainingData.reset_index(drop=True, inplace=True)
trainingData['text'] = trainingData['text'].apply(clean_tweet)
...
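
The effect can be reproduced with pandas alone. The snippet below is a minimal sketch on a made-up four-row frame; it assumes that TweetsDataset (not shown in the question) looks rows up by label, for example with df['text'][idx], which is what the traceback suggests. It also shows positional lookup with .iloc as an alternative to resetting the index.

```python
import pandas as pd

# Toy frame with one missing value; the column names mirror the question,
# the data itself is invented for illustration.
df = pd.DataFrame({
    "labelStr": ["pos", "neg", None, "pos"],
    "text": ["a", "b", "c", "d"],
})

# dropna() removes row 2 but keeps the old labels: the index is now 0, 1, 3.
dropped = df.dropna()
print(list(dropped.index), len(dropped))

# A sampler draws positions 0..len-1, so 2 is a valid request.
# Label-based lookup treats 2 as an index label, which no longer exists.
try:
    dropped["text"][2]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True
print("label lookup failed:", label_lookup_failed)

# Fix 1: renumber the rows so labels and positions coincide again.
fixed = dropped.reset_index(drop=True)
text_after_reset = fixed["text"][2]

# Fix 2: look rows up by position, which ignores the index labels.
text_by_position = dropped["text"].iloc[2]

print(text_after_reset, text_by_position)
```

If the dataset class is under your control, using .iloc inside __getitem__ makes it robust to any later filtering, while reset_index is the smaller change when the class cannot be edited.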