spaCy is_stop doesn't identify stop words?

Question (votes: 1, answers: 2)

When I use spaCy to identify stop words, it doesn't work with the en_core_web_lg model, but it does work with en_core_web_sm. Is this a bug, or am I doing something wrong?

import spacy
nlp = spacy.load('en_core_web_lg')

doc = nlp(u'The cat ran over the hill and to my lap')

for word in doc:
    print(f' {word} | {word.is_stop}')

Result:

 The | False
 cat | False
 ran | False
 over | False
 the | False
 hill | False
 and | False
 to | False
 my | False
 lap | False

However, when I change this line to use the en_core_web_sm model, I get different results:

nlp = spacy.load('en_core_web_sm')

 The | False
 cat | False
 ran | False
 over | True
 the | True
 hill | False
 and | True
 to | True
 my | True
 lap | False
python nlp spacy
2 Answers
2
votes

The problem you're running into is a documented bug. The suggested workaround is as follows:

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load('en_core_web_lg')
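for word in STOP_WORDS:
    # Flag several spellings of each stop word in the vocabulary.
    # Note: word[0].capitalize() yields only the first character (e.g. 'T'),
    # so title-cased forms such as 'The' are not flagged by this loop.
    for w in (word, word[0].capitalize(), word.upper()):
        lex = nlp.vocab[w]
        lex.is_stop = True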

doc = nlp(u'The cat ran over the hill and to my lap')

for word in doc:
    print('{} | {}'.format(word, word.is_stop))

Output:

The | False
cat | False
ran | False
over | True
the | True
hill | False
and | True
to | True
my | True
lap | False
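
The workaround above sets is_stop for three spellings of each stop word, but word[0].capitalize() capitalizes only the first character of the word ('T' for 'the'), which would explain why "The" is still reported as False. Below is a minimal sketch of generating the full case variants, assuming title-cased forms should be covered as well. case_variants is a hypothetical helper for illustration, not part of spaCy.

```python
def case_variants(word):
    """Return the lowercase, title-case, and uppercase spellings of a word.

    Hypothetical helper for illustration; it is not part of spaCy.
    """
    lower = word.lower()
    return (lower, lower.capitalize(), lower.upper())


print(case_variants("the"))   # ('the', 'The', 'THE')
print("the"[0].capitalize())  # 'T' (first character only)
```

Each returned string could then be looked up with nlp.vocab[...] and flagged, in the same way as the loop above.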

0
votes

Try from spacy.lang.en.stop_words import STOP_WORDS; then you can explicitly check whether a word is in the set:

from spacy.lang.en.stop_words import STOP_WORDS
import spacy

nlp = spacy.load('en_core_web_lg')

doc = nlp(u'The cat ran over the hill and to my lap')

for word in doc:
    # Have to convert Token type to String, otherwise types won't match
    print(f' {word} | {str(word) in STOP_WORDS}')

This outputs the following:

 The | False
 cat | False
 ran | False
 over | True
 the | True
 hill | False
 and | True
 to | True
 my | True
 lap | False

It looks like a bug to me. However, this approach also gives you the flexibility to add words to the STOP_WORDS set if needed.
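
Here is a minimal sketch of extending the set and making the membership test case-insensitive. To keep it runnable without spaCy installed, a small stand-in set replaces the real STOP_WORDS. The is_stop_word helper and the extra word "hill" are assumptions made for illustration, not part of spaCy.

```python
# Stand-in for spacy.lang.en.stop_words.STOP_WORDS, which is a set of
# lowercase strings. Only a few entries are listed so the sketch runs
# without spaCy installed.
STOP_WORDS = {"the", "over", "and", "to", "my"}


def is_stop_word(text, stop_words=STOP_WORDS):
    """Case-insensitive membership test (hypothetical helper, not spaCy API)."""
    return text.lower() in stop_words


# Build an extended set with a custom stop word, leaving the original intact.
custom_stop_words = STOP_WORDS | {"hill"}

sentence = "The cat ran over the hill and to my lap"
flags = [(w, is_stop_word(w, custom_stop_words)) for w in sentence.split()]

for w, flag in flags:
    print(f" {w} | {flag}")
```

With spaCy itself, the same check could use the imported STOP_WORDS and str(word) for each token, as in the answer above. Lowercasing first means "The" is treated the same as "the".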
