Dropping rows from a pandas DataFrame causes a KeyError in the DataLoader


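I am trying to load some data into a DataLoader. When I preprocess the DataFrame with, for example, df.dropna(), iterating over the DataLoader randomly raises a KeyError from time to time. This does not happen when I do not drop any rows.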

Dataset creation:

# load data and add columns
column_names = ["id","category","labelStr","text"]
trainingData = pd.read_csv(r"Path.csv",names=column_names)
validationData = pd.read_csv(r"Path.csv",names=column_names)

trainingData = trainingData.drop(columns=['id', 'category'])
validationData = validationData.drop(columns=['id', 'category'])

trainingData = trainingData.dropna()  # adding this line causes the KeyError

trainingData['text'] = trainingData['text'].apply(clean_tweet)

# convert String into int
label_encoder = LabelEncoder()
trainingData['label'] = label_encoder.fit_transform(trainingData["labelStr"])
validationData['label'] = label_encoder.transform(validationData["labelStr"])

# initialize tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
MAX_LEN = 128  # Length should be enough as tweets are limited in length anyway

# create datasets
train_dataset = TweetsDataset(trainingData, tokenizer, MAX_LEN)
val_dataset = TweetsDataset(validationData, tokenizer, MAX_LEN)

DataLoader:

BATCH_SIZE = 16

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)

for batch in train_loader:
    print(batch)
    break

Error:

KeyError                                  Traceback (most recent call last)
...\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3079             try:
-> 3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 63935

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-187-9bd6dc44b4d8> in <module>
      5 val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
      6 
----> 7 for batch in train_loader:
      8     print(batch)
...
-> 3082                 raise KeyError(key) from err
   3083 
   3084         if tolerance is not None:

KeyError: 63935
1 Answer

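I think you may need to reset the index after dropna(). dropna() keeps the original index labels, so the labels of the dropped rows are missing afterwards. The get_loc frames in your traceback point to a label-based lookup, and looking up a row by a label that no longer exists (63935 here) raises a KeyError:
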
trainingData = trainingData.dropna()
trainingData.reset_index(drop=True, inplace=True)

trainingData['text'] = trainingData['text'].apply(clean_tweet)
...
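
To make the cause visible, here is a minimal sketch on a made-up five-row DataFrame (the values are invented for illustration and are not from the question's CSV). It assumes that TweetsDataset, whose code is not shown, looks rows up by label, which is what the traceback suggests. The sketch shows the gaps that dropna() leaves in the index, the failing label lookup, and two ways around it: reset_index(drop=True) as above, or position-based access with .iloc.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; the values are made up for illustration.
df = pd.DataFrame({
    "labelStr": ["pos", "neg", None, "pos", "neg"],
    "text": ["a", "b", "c", np.nan, "e"],
})

dropped = df.dropna()
# dropna() keeps the original labels, so the index now has gaps.
print(dropped.index.tolist())  # [0, 1, 4]
print(len(dropped))            # 3

# A map-style dataset is asked for positions 0 .. len - 1.
# A label-based lookup fails for position 2, because label 2 was dropped.
try:
    dropped["text"][2]
except KeyError as err:
    print("KeyError:", err)

# Option 1: renumber the index so labels and positions coincide again.
fixed = dropped.reset_index(drop=True)
print(fixed.index.tolist())    # [0, 1, 2]
print(fixed["text"][2])        # e

# Option 2: look rows up by position, which ignores the labels entirely.
print(dropped["text"].iloc[2])  # e
```

With shuffle=True the loader draws positions in random order, which would explain why the error appears only now and then: it is raised only when a drawn position happens to match a dropped label or lies beyond the largest remaining one. If you control the dataset class, using .iloc inside __getitem__ makes it independent of whatever index the DataFrame carries.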