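I created a DataFrame as shown in the figure, and now I want to remove the punctuation from it. I am using pandas. The function below should return plain text, with one cleaned string for each row of text. The code is as follows: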
#Regular Expressions
def preprocess(text):
    clean_data = []
    for x in (text[:][0]): #this is Df_pd for Df_np (text[:])
        new_text = re.sub('<.*?>', '', x) # remove HTML tags
        new_text = re.sub(r'[^\w\s]', '', new_text) # remove punc.
        new_text = re.sub(r'\d+','',new_text)# remove numbers
        new_text = new_text.lower() # lower case, .upper() for upper
        if new_text != '':
            clean_data.append(new_text)
    return clean_data
Then I apply it like this:
clean_test = preprocess(berita)
print(clean_test)
Then I get this error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2896 try:
-> 2897 return self._engine.get_loc(key)
2898 except KeyError:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 0
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-31-17480cb1ec78> in <module>
----> 1 clean_test = preprocess(berita)
2 print(clean_test)
<ipython-input-21-61598ecfd3e5> in preprocess(text)
2 def preprocess(text):
3 clean_data = []
----> 4 for x in (text[:][0]): #this is Df_pd for Df_np (text[:])
5 new_text = re.sub('<.*?>', '', x) # remove HTML tags
6 new_text = re.sub(r'[^\w\s]', '', new_text) # remove punc.
~\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
2978 if self.columns.nlevels > 1:
2979 return self._getitem_multilevel(key)
-> 2980 indexer = self.columns.get_loc(key)
2981 if is_integer(indexer):
2982 indexer = [indexer]
~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2897 return self._engine.get_loc(key)
2898 except KeyError:
-> 2899 return self._engine.get_loc(self._maybe_cast_indexer(key))
2900 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2901 if indexer.ndim > 1 or indexer.size > 1:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 0
What am I doing wrong? Thanks in advance.
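This line does not make sense. On a DataFrame, text[:] returns all rows as another DataFrame, and indexing that with [0] looks for a column labeled 0. No such column exists, hence KeyError: 0: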
for x in (text[:][0]):
It should be:
for x in text['news']:
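
For reference, here is a minimal sketch of the whole function with that one-line fix applied. The column name 'news' comes from the line above; the sample rows and the column parameter are hypothetical, added only so the example runs on its own without the original data.

```python
import re

import pandas as pd


def preprocess(df, column="news"):
    """Return a list of cleaned strings from one text column of a DataFrame."""
    clean_data = []
    # Select the column by its label, then iterate over its values.
    for x in df[column]:
        new_text = re.sub(r"<.*?>", "", x)           # remove HTML tags
        new_text = re.sub(r"[^\w\s]", "", new_text)  # remove punctuation
        new_text = re.sub(r"\d+", "", new_text)      # remove digits
        new_text = new_text.lower()                  # lower-case
        if new_text != "":                           # skip rows left empty
            clean_data.append(new_text)
    return clean_data


# Hypothetical sample data standing in for the original DataFrame.
berita = pd.DataFrame({"news": ["<b>Breaking</b> News!", "Rain, again...", "12345"]})
clean_test = preprocess(berita)
print(clean_test)
```

Note that rows which end up empty after cleaning (such as the digits-only row here) are skipped, so the returned list can be shorter than the DataFrame.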