How do I import and use the stopword list from NLTK?

Question

I imported stopwords from nltk.corpus, but I get a "STOPWORDS is not defined" error. Here is my code:

import nltk
from nltk.corpus import stopwords
#Create stopword list:
stopwords = set(STOPWORDS)

The code above gives the following error:

NameError: name 'STOPWORDS' is not defined

Answer 1

The first time you use the stopwords from the NLTK package, you need to run the following code to download the stopwords list to your device:

import nltk
nltk.download('stopwords')

After that, whenever you need the stopwords, you can simply load them from the package. For example, to load the English stopwords list, you can use the following:

from nltk.corpus import stopwords
stop_words = list(stopwords.words('english'))
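Applied to the code in the question, the NameError comes from STOPWORDS, a name that is never defined anywhere: the import only provides stopwords (lowercase), and the word list itself comes from stopwords.words('english'). The question's code also rebinds the name stopwords, which hides the imported corpus object afterwards. Below is a minimal sketch of a fix that also downloads the corpus on first use. The helper name load_stopwords is hypothetical, and the imports sit inside the function only so that NLTK is needed when the helper is called rather than when it is defined.

```python
def load_stopwords(language="english"):
    """Return NLTK's stopwords for `language` as a set (hypothetical helper)."""
    import nltk
    from nltk.corpus import stopwords

    try:
        # nltk.data.find raises LookupError when the corpus is not installed.
        nltk.data.find("corpora/stopwords")
    except LookupError:
        # First use on this machine: fetch the corpus.
        nltk.download("stopwords")

    # stopwords.words(...) is the word list the question was looking for.
    # The result is returned under a different name so the imported
    # `stopwords` object is not overwritten.
    return set(stopwords.words(language))
```

Calling stop_words = load_stopwords() should then give a set that can be used for membership checks such as "the" in stop_words.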

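If you like, you can even extend the list as shown below. Note that extend() is a list method: if stopwords.words() returned an object of type set, you would need to convert it to a list (as shown above) before calling extend() on the stop_words object: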

stop_words.extend(["best", "item", "fast"])
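To make the list/set distinction concrete, here is a small sketch. The six-word list is a hand-written stand-in for list(stopwords.words('english')), so it runs without NLTK; the name stop_word_set is only illustrative.

```python
# Hand-written stand-in for list(stopwords.words('english')); the real list is much longer.
stop_words = ["i", "me", "my", "myself", "we", "our"]

# A list grows with extend() ...
stop_words.extend(["best", "item", "fast"])

# ... while a set grows with update() and silently ignores duplicates.
stop_word_set = set(stop_words)
stop_word_set.update(["fast", "cheap"])  # "fast" is already present

print(len(stop_words))     # 9
print(len(stop_word_set))  # 10
```

If you only ever test membership (w not in stop_words), a set is usually the better fit, since lookups do not have to scan the whole collection.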

To remove stopwords from a text, you can use the following (NLTK offers several other tokenizers besides word_tokenize):

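from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data (nltk.download('punkt'); newer NLTK releases use 'punkt_tab').
# `text` is the string to clean; `stop_words` is the list loaded above.
word_tokens = word_tokenize(text)
clean_word_data = [w for w in word_tokens if w.lower() not in stop_words]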
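One thing to watch for: word_tokenize returns punctuation marks as separate tokens, so a stopword filter alone leaves them in the result. The sketch below adds an isalpha() check to drop them. The token list is hand-written in the shape word_tokenize produces, the stopword set is a tiny sample rather than the NLTK list, and remove_stopwords is a hypothetical helper name.

```python
# Hand-written tokens, shaped like word_tokenize output (punctuation is its own token).
word_tokens = ["This", "is", "the", "best", "item", ",", "and", "it", "arrived", "fast", "!"]

# Tiny sample standing in for set(stopwords.words('english')).
stop_words = {"this", "is", "the", "and", "it"}


def remove_stopwords(tokens, stop_words):
    """Keep alphabetic tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.isalpha() and t.lower() not in stop_words]


clean_word_data = remove_stopwords(word_tokens, stop_words)
print(clean_word_data)  # ['best', 'item', 'arrived', 'fast']
```

Note that isalpha() also discards numbers and contraction pieces such as "n't", so leave the check out if those tokens matter for your task.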

Answer 2

You need to download the stopwords you want to use. For example, if you just want to print the stopwords used in English:

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
print(stopwords.words('english'))

This should print the English stopwords, for example:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', ...]