Finding variations of a word in a string in Python

Problem description (votes: 0, answers: 4)

So, I'm running Python 3.3.2 and I have a string (a sentence or paragraph):

mystring="walk walked walking talk talking talks talked fly flying"

I also have another list containing the words I need to search for in that string:

list_of_words=["walk","talk","fly"]

My question is: is there a way to get results like the following?

  1. The word walk, or a variation of it, occurs 3 times
  2. The word talk, or a variation of it, occurs 4 times
  3. The word fly, or a variation of it, occurs 2 times

Bottom line: is it possible to count all possible variations of a word?

Tags: python, list
4 Answers
3 votes

I know this is an old question, but I feel the discussion would be incomplete without mentioning the NLTK library, which offers a huge set of natural language processing tools, including ones that make this task very easy.

Essentially, you want to compare the uninflected words in your target list with the uninflected forms of the words in mystring. There are two common ways to strip inflections (such as -ing, -ed, -s): stemming and lemmatization. For English, lemmatization (reducing a word to its dictionary form) is usually the better choice, but for this task I think stemming is the right approach. In any case, stemming is usually faster.

mystring="walk walked walking talk talking talks talked fly flying"
list_of_words=["walk","talk","fly"]

word_counts = {}

from nltk.stem.snowball import EnglishStemmer
stemmer = EnglishStemmer()

for target in list_of_words:
    word_counts[target] = 0

    for word in mystring.split(' '):

        # Stem the word and compare it to the stem of the target
        stem = stemmer.stem(word)        
        if stem == stemmer.stem(target):
            word_counts[target] += 1

print(word_counts)

Output:

{'fly': 2, 'talk': 4, 'walk': 3}
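
Since lemmatization is mentioned above as the usual alternative, here is a minimal sketch of the same counting loop built on NLTK's WordNetLemmatizer instead of the stemmer (it assumes the WordNet data has been downloaded, e.g. via nltk.download('wordnet')):

from nltk.stem import WordNetLemmatizer

mystring = "walk walked walking talk talking talks talked fly flying"
list_of_words = ["walk", "talk", "fly"]

lemmatizer = WordNetLemmatizer()

word_counts = {}
for target in list_of_words:
    # Lemmatize every token as a verb so that "walked"/"walking" reduce to "walk"
    word_counts[target] = sum(
        1 for word in mystring.split()
        if lemmatizer.lemmatize(word, pos='v') == lemmatizer.lemmatize(target, pos='v')
    )

print(word_counts)

This should print the same counts as the stemming version above.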

2 votes
from difflib import get_close_matches
mystring="walk walked walking talk talking talks talked fly flying"
list_of_words=["walk","talk","fly"]

sp = mystring.split()
for x in list_of_words:
    li = [y for y in get_close_matches(x,sp,cutoff=0.5) if x in y]
    print('%-7s %d in %-10s' % (x, len(li), li))

Result

walk    2  in ['walk', 'walked']
talk    3  in ['talk', 'talks', 'talked']
fly     2  in ['fly', 'flying']

The cutoff refers to the same ratio that SequenceMatcher computes:

from difflib import SequenceMatcher

sq = SequenceMatcher(None)
for x in list_of_words:
    for w in sp:
        sq.set_seqs(x,w)
        print('%-7s %-10s %f' % (x, w, sq.ratio()))

Result

walk    walk       1.000000
walk    walked     0.800000
walk    walking    0.727273
walk    talk       0.750000
walk    talking    0.545455
walk    talks      0.666667
walk    talked     0.600000
walk    fly        0.285714
walk    flying     0.200000
talk    walk       0.750000
talk    walked     0.600000
talk    walking    0.545455
talk    talk       1.000000
talk    talking    0.727273
talk    talks      0.888889
talk    talked     0.800000
talk    fly        0.285714
talk    flying     0.200000
fly     walk       0.285714
fly     walked     0.222222
fly     walking    0.200000
fly     talk       0.285714
fly     talking    0.200000
fly     talks      0.250000
fly     talked     0.222222
fly     fly        1.000000
fly     flying     0.666667
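
Note that get_close_matches returns at most n=3 matches by default, which is why 'walking' and 'talking' are missing from the lists above even though their 0.727 ratios clear the 0.5 cutoff. A possible tweak (a sketch, not part of the original answer) is to pass n=len(sp) so every word above the cutoff can be returned:

from difflib import get_close_matches

mystring = "walk walked walking talk talking talks talked fly flying"
list_of_words = ["walk", "talk", "fly"]

sp = mystring.split()
for x in list_of_words:
    # n=len(sp) lifts the default cap of three matches per call
    li = [y for y in get_close_matches(x, sp, n=len(sp), cutoff=0.5) if x in y]
    print('%-7s %d in %s' % (x, len(li), li))

With the cap lifted, the lists should also pick up 'walking' and 'talking', giving the expected counts of 3, 4 and 2.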

2 votes

One approach would be to split the string on spaces and then look for every word that contains the specific word whose variations you want to find.

For example:

def num_variations(word, sentence):
    return sum(1 for snippit in sentence.split(' ') if word in snippit)

for word in ["walk", "talk", "fly"]:
    print(word, num_variations(word, "walk walked walking talk talking talks talked fly flying"))

However, this approach is somewhat naive and knows nothing about English morphology. For example, with this method "fly" will not match "flies".

In that case you may want to use some kind of natural language library equipped with a decent dictionary to catch these edge cases.
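
As a quick illustration of the "flies" edge case, here is a minimal sketch (assuming NLTK and its WordNet data are installed) of how a lemmatizer bridges the gap that the plain substring check misses:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# The naive substring check fails on the irregular form...
print("fly" in "flies")                        # False
# ...while lemmatizing it as a verb maps it back to the base form
print(lemmatizer.lemmatize("flies", pos="v"))  # expected: 'fly'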

You may find this answer useful. It does something similar by using the NLTK library to find word stems (stripping plurals, irregular spellings, and so on) and then totalling them up with an approach much like the one above. Depending on what you want to achieve, though, that may be overkill for your case.


0 votes
import spacy
nlp = spacy.load("en_core_web_sm")  # the model needs to be downloaded first, see https://spacy.io/usage/models

mystring="walk walked walking talk talking talks talked fly flying"
list_of_words=["walk","talk","fly"]

doc = nlp(mystring)

verb_lemmas_in_list_of_words = [token.lemma_ for token in doc if token.pos_ == 'VERB' and token.lemma_  in list_of_words]
print(verb_lemmas_in_list_of_words)


['walk', 'walk', 'talk', 'talk', 'fly']
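
To turn that list of lemmas into the per-word counts the question asks for, a collections.Counter over the result of the snippet above is enough (a small follow-up sketch; note that in this bare string spaCy tags only five of the nine tokens as VERB, so the totals here come out lower than the 3/4/2 in the question):

from collections import Counter

# verb_lemmas_in_list_of_words comes from the snippet above
counts = Counter(verb_lemmas_in_list_of_words)
print(counts)  # Counter({'walk': 2, 'talk': 2, 'fly': 1}) for the output shown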