So, I'm running Python 3.3.2, and I have a string (a sentence/paragraph):
mystring=["walk walked walking talk talking talks talked fly flying"]
I also have another list containing the words I need to search for in that string:
list_of_words=["walk","talk","fly"]
My question is, is there a way to get a result like:
Bottom line: is it possible to count all the possible variations of a word?
I know this is an old question, but I feel this discussion would be incomplete without mentioning the NLTK library, which provides a wealth of natural-language-processing tools, including ones that make this task very easy.
Essentially, you want to compare the uninflected words in your target list against the uninflected forms of the words in mystring. There are two common ways to remove inflections (e.g. -ing, -ed, -s): stemming and lemmatization. In English, lemmatization (reducing a word to its dictionary form) is usually better, but for this task I think stemming is the right choice. In any case, stemming is usually faster.
from nltk.stem.snowball import EnglishStemmer

mystring = "walk walked walking talk talking talks talked fly flying"
list_of_words = ["walk", "talk", "fly"]

stemmer = EnglishStemmer()
word_counts = {}
for target in list_of_words:
    word_counts[target] = 0
    for word in mystring.split(' '):
        # Stem the word and compare it to the stem of the target
        stem = stemmer.stem(word)
        if stem == stemmer.stem(target):
            word_counts[target] += 1
print(word_counts)
Output:
{'fly': 2, 'talk': 4, 'walk': 3}
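For readers without NLTK installed, a toy suffix-stripper in plain Python illustrates what the stemmer is doing. The `toy_stem` helper below is my own simplification, not part of NLTK: it only handles the regular -ing/-ed/-s endings that occur in this particular sentence, and it would still mangle irregular forms such as "flies".

```python
def toy_stem(word):
    # Naive suffix stripping -- for illustration only; a real stemmer
    # handles far more cases (plurals like "flies", doubled consonants, etc.)
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

mystring = "walk walked walking talk talking talks talked fly flying"
list_of_words = ["walk", "talk", "fly"]

word_counts = {target: 0 for target in list_of_words}
for word in mystring.split():
    for target in list_of_words:
        if toy_stem(word) == toy_stem(target):
            word_counts[target] += 1

print(word_counts)  # {'walk': 3, 'talk': 4, 'fly': 2}
```

On this input it reproduces the Snowball stemmer's counts, but that equivalence holds only because every inflection here is regular.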
from difflib import get_close_matches

mystring = "walk walked walking talk talking talks talked fly flying"
list_of_words = ["walk", "talk", "fly"]

sp = mystring.split()
for x in list_of_words:
    li = [y for y in get_close_matches(x, sp, cutoff=0.5) if x in y]
    print('%-7s %d in %s' % (x, len(li), li))
Result:
walk 2 in ['walk', 'walked']
talk 3 in ['talk', 'talks', 'talked']
fly 2 in ['fly', 'flying']
The cutoff is the same ratio as the one computed by SequenceMatcher:
from difflib import SequenceMatcher

sq = SequenceMatcher(None)
for x in list_of_words:
    for w in sp:
        sq.set_seqs(x, w)
        print('%-7s %-10s %f' % (x, w, sq.ratio()))
Result:
walk walk 1.000000
walk walked 0.800000
walk walking 0.727273
walk talk 0.750000
walk talking 0.545455
walk talks 0.666667
walk talked 0.600000
walk fly 0.285714
walk flying 0.200000
talk walk 0.750000
talk walked 0.600000
talk walking 0.545455
talk talk 1.000000
talk talking 0.727273
talk talks 0.888889
talk talked 0.800000
talk fly 0.285714
talk flying 0.200000
fly walk 0.285714
fly walked 0.222222
fly walking 0.200000
fly talk 0.285714
fly talking 0.200000
fly talks 0.250000
fly talked 0.222222
fly fly 1.000000
fly flying 0.666667
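One subtlety worth noting (my own observation from the difflib documentation, not part of the original answer): get_close_matches returns at most n matches, and n defaults to 3. The table above shows walk/walking scoring 0.727, well above the 0.5 cutoff, yet "walking" is missing from walk's list, because the unrelated "talk" (0.75) claims the third slot before the `x in y` filter runs. Raising n removes the undercount:

```python
from difflib import get_close_matches

mystring = "walk walked walking talk talking talks talked fly flying"
sp = mystring.split()

# n=len(sp) lifts the default cap of three candidates, so "walking"
# is no longer crowded out of walk's matches by "talk".
for x in ["walk", "talk", "fly"]:
    li = [y for y in get_close_matches(x, sp, n=len(sp), cutoff=0.5) if x in y]
    print('%-7s %d in %s' % (x, len(li), li))
```

With the cap lifted the counts become walk 3, talk 4, fly 2, matching the stemming-based answer.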
One approach might be to split the string on spaces, then look for every word that contains the particular word whose variations you want to find.
For example:
def num_variations(word, sentence):
    return sum(1 for snippet in sentence.split(' ') if word in snippet)

for word in ["walk", "talk", "fly"]:
    print(word, num_variations(word, "walk walked walking talk talking talks talked fly flying"))
However, this approach is somewhat naive and knows nothing about English morphology. For example, with this method "fly" will not match "flies".
In that case, you may want to use some kind of natural-language library equipped with a decent dictionary to catch these edge cases.
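A quick check makes that limitation concrete, reusing the same substring test on a sentence (my own example) that includes an irregular plural:

```python
def num_variations(word, sentence):
    # Counts words that contain `word` as a substring -- misses irregular forms.
    return sum(1 for snippet in sentence.split(' ') if word in snippet)

# "fly" is a substring of "fly" and "flying" but not of "flies".
print(num_variations("fly", "fly flies flying"))  # prints 2, not 3
```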
You may find this answer useful. It does something similar by using the NLTK library to find word stems (stripping plurals, irregular spellings, etc.) and then summing them up with an approach much like the one above. Depending on what you're trying to achieve, though, this may be overkill for your case.
import spacy
nlp = spacy.load("en_core_web_sm")  # the model must be downloaded first, see https://spacy.io/usage/models
mystring="walk walked walking talk talking talks talked fly flying"
list_of_words=["walk","talk","fly"]
doc = nlp(mystring)
verb_lemmas_in_list_of_words = [token.lemma_ for token in doc if token.pos_ == 'VERB' and token.lemma_ in list_of_words]
verb_lemmas_in_list_of_words
['walk', 'walk', 'talk', 'talk', 'fly']