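I'm working with a text column in a df, and I'm trying to count the most frequent words, but certain words such as "for", "and", "the", etc. dominate the results. I tried to write a for loop that removes these words so they don't interfere with my analysis. Here is the code I came up with: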
lst = ["for", "of", "and", "in", "which", "the", "to", "a", "an"]
for i in papers.title_processed:
    if i in lst:
        papers.title_processed = papers.title_processed.replace(i, "")
output:
0 Self-Organization of Associative Database and ...
1 A Mean Field Theory of Layer IV of Visual Cort...
2 Storing Covariance by the Associative Long-Ter...
3 Bayesian Query Construction for Neural Network...
4 Neural Network Ensembles, Cross Validation, an...
Name: title, dtype: object
0 self-organization of associative database and ...
1 a mean field theory of layer iv of visual cort...
2 storing covariance by the associative long-ter...
3 bayesian query construction for neural network...
4 neural network ensembles, cross validation, an...
Name: title_processed, dtype: object
So it did nothing. Any suggestions as to what I'm doing wrong? I also tried .map(lambda x: papers.title_processed.str.replace(x, "") for x in lst)
and got an error.
Use:
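import re

lst = ["for", "of", "and", "in", "which", "the", "to", "a", "an"]
# \b is a word boundary, so each entry is matched only as a whole word
regex = re.compile('|'.join([rf'\b{w}\b' for w in lst]))
# regex=True is needed on pandas 2.0+, where str.replace defaults to literal matching
papers['title_processed'] = papers['title_processed'].str.replace(regex, '', regex=True)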
After removing the words in lst, the title_processed Series should look like this:
# print(papers['title_processed'])
0 self-organization associative database ...
1 mean field theory layer iv visual cort...
2 storing covariance by associative long-ter...
3 bayesian query construction neural network...
4 neural network ensembles, cross validation, ...
Name: title_processed, dtype: object
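
As for why the original loop did nothing: it iterates over whole titles, so `i in lst` is never true, and `Series.replace` without `regex=True` matches entire cell values rather than substrings anyway. The sketch below is one way to put the regex approach together end to end, including the word count the question was after. The sample titles and the names `pattern`, `cleaned`, and `word_counts` are made up for illustration and are not from the original post.

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical sample data standing in for the asker's `papers` DataFrame.
papers = pd.DataFrame({
    "title_processed": [
        "a theory of learning in the brain",
        "learning to rank for search",
        "the brain and the network",
    ]
})
lst = ["for", "of", "and", "in", "which", "the", "to", "a", "an"]

# The original loop compared whole titles against lst, so nothing ever matched.
assert not any(title in lst for title in papers["title_processed"])

# One whole-word pattern; re.escape guards against regex metacharacters in lst.
pattern = r"\b(?:" + "|".join(map(re.escape, lst)) + r")\b"

cleaned = (
    papers["title_processed"]
    .str.replace(pattern, "", regex=True)  # drop the listed words
    .str.replace(r"\s+", " ", regex=True)  # collapse the gaps they leave behind
    .str.strip()
)

# Count the remaining words across all titles.
word_counts = Counter(" ".join(cleaned).split())
print(word_counts.most_common(2))
```

One caveat with this approach: `\b` treats a hyphen as a word boundary, so a hyphenated word such as "in-depth" would lose its "in". If that matters, splitting each title on whitespace and filtering the tokens against `lst` may be a safer alternative.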