Remove an element from a list of strings if the string contains a stopword [duplicate]

Question (votes: 1, answers: 2)

I have a list as follows:

lst = ['for Sam', 'Just in', 'Mark Rich']

I am trying to remove every element from this list of strings (each string contains one or more words) that contains any stopwords.

Because the first and second elements contain for and in, which are stopwords, the result should be:

new_lst = ['Mark Rich'] 

What I tried

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

lst = ['for Sam', 'Just in', 'Mark Rich']
new_lst = [i.split(" ") for i in lst]
new_lst = [" ".join(i) for i in new_lst for j in i if j not in stop_words]

This gives me the output:

['for Sam', 'Just in', 'Mark Rich', 'Mark Rich']
python python-3.x nltk
2 Answers
1
vote

You need an if condition rather than an extra level of nesting:

new_lst = [' '.join(i) for i in new_lst if not any(j in i for j in stop_words)]
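To see why the attempt in the question returns extra entries, it can help to expand the nested comprehension into explicit loops. The sketch below assumes a small stand-in stop_words set instead of the full NLTK list, and the names tokenized, nested_result and filtered_result are introduced here purely for illustration.

```python
# Stand-in stopword set (an assumption; the question uses NLTK's English list).
stop_words = {'for', 'in'}

lst = ['for Sam', 'Just in', 'Mark Rich']
tokenized = [s.split() for s in lst]

# Loop form of the nested comprehension from the question:
# one output entry is appended for EVERY token that is not a stopword.
nested_result = []
for tokens in tokenized:
    for token in tokens:
        if token not in stop_words:
            nested_result.append(' '.join(tokens))

# A single condition per item keeps or drops the whole item exactly once.
filtered_result = [
    ' '.join(tokens)
    for tokens in tokenized
    if not any(token in stop_words for token in tokens)
]

print(nested_result)    # ['for Sam', 'Just in', 'Mark Rich', 'Mark Rich']
print(filtered_result)  # ['Mark Rich']
```

The inner loop is what causes 'Mark Rich' to appear twice (two non-stopword tokens) and lets 'for Sam' and 'Just in' through (one non-stopword token each).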

Since stop_words is a set, you can also use set.isdisjoint:

new_lst = [' '.join(i) for i in new_lst if stop_words.isdisjoint(i)]

Here is a demo:

stop_words = {'for', 'in'}

lst = ['for Sam', 'Just in', 'Mark Rich']
new_lst = [i.split() for i in lst]
new_lst = [' '.join(i) for i in new_lst if stop_words.isdisjoint(i)]

print(new_lst)

# ['Mark Rich']
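One caveat: the membership test is case-sensitive, so a token such as 'For' would not match the stopword 'for'. If that matters, a possible variation is to lowercase the tokens before comparing. This is only a sketch; it assumes every entry in stop_words is already lowercase, and has_stopword is a hypothetical helper name, not part of NLTK.

```python
# Assumed: every stopword in the set is lowercase.
stop_words = {'for', 'in'}

lst = ['For Sam', 'Just IN', 'Mark Rich']

def has_stopword(text, stop_words):
    """Return True if any whitespace-separated token is a stopword (case-insensitive)."""
    tokens = {token.lower() for token in text.split()}
    return not stop_words.isdisjoint(tokens)

# Keep the original strings; lowercasing is used only for the comparison.
new_lst = [s for s in lst if not has_stopword(s, stop_words)]

print(new_lst)  # ['Mark Rich']
```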

1
vote

You can use a list comprehension and use sets to check whether the stopwords and the words of each string intersect:

[i for i in lst if not set(stop_words) & set(i.split(' '))]
['Mark Rich']
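As a small variation on the intersection approach, the stopword set can be built once outside the comprehension instead of on every iteration, which also covers the case where stop_words is a list. The name stop_set is an assumption made for this sketch, and the stopwords shown are a stand-in for the NLTK list.

```python
# Stand-in stopwords; may be a list or any other iterable.
stop_words = ['for', 'in']

# Build the set once so it is not recreated for every element of lst.
stop_set = set(stop_words)

lst = ['for Sam', 'Just in', 'Mark Rich']

# An empty intersection is falsy, so the item is kept.
result = [s for s in lst if not stop_set & set(s.split())]

print(result)  # ['Mark Rich']
```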