Models for identifying multi-word expressions?


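Is there a model or library that ships with a pre-built repository of multi-word expressions (MWEs)?

I am trying to do a word count over a series of paragraphs, but I want the count to include n-grams. I have hit a roadblock because some of my sentences contain bigrams and trigrams that are more meaningful than their individual unigrams.

I thought of simply generating all the n-grams and counting them, but that leads to overcounting.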

Example using n-grams

import nltk

sentences = ["the dog jumped over the green car and into the market place",
             "the cat was sleeping in the market place",
             "the man in a green car was waiting near the market place"]

word_counts = {}
for sentence in sentences:
    word_list = nltk.word_tokenize(sentence)
    ngrams = nltk.everygrams(word_list, 1, 3)
    for word in ngrams:
        if word not in word_counts.keys():
            word_counts[word] = 0
        word_counts[word] += 1

Result

{('the',): 7, ('the', 'dog'): 1, ('the', 'dog', 'jumped'): 1, ('dog',): 1, ('dog', 'jumped'): 1, ('dog', 'jumped', 'over'): 1 ...
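
To make the overcounting concrete, here is a minimal, dependency-free sketch. It assumes a plain whitespace split in place of nltk.word_tokenize and a hand-written everygrams helper (both are stand-ins for illustration, not the NLTK functions themselves). It compares the number of tokens with the number of n-gram occurrences counted.

```python
from collections import Counter

sentences = ["the dog jumped over the green car and into the market place",
             "the cat was sleeping in the market place",
             "the man in a green car was waiting near the market place"]


def everygrams(tokens, min_len=1, max_len=3):
    """Yield every contiguous n-gram with min_len <= n <= max_len."""
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


counts = Counter()
total_tokens = 0
for sentence in sentences:
    # Assumption: whitespace splitting is good enough for these sentences.
    tokens = sentence.split()
    total_tokens += len(tokens)
    counts.update(everygrams(tokens))

total_ngrams = sum(counts.values())

# Each occurrence of "market place" is counted under several keys at once.
print(total_tokens, total_ngrams)
print(counts[("market",)], counts[("market", "place")],
      counts[("the", "market", "place")])
```

Every occurrence of "market place" adds to the unigram, bigram, and trigram keys that contain it, so the total number of counted items is much larger than the number of words in the text.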

I also considered using MWETokenizer. It does what I want, but I ran into the problem that I have to define the MWEs myself.

Example using MWETokenizer

import nltk
from nltk.tokenize import MWETokenizer

# Reuses the `sentences` list defined in the n-gram example above.

mwe_tokenizer = MWETokenizer()
mwe_tokenizer.add_mwe(('green','car'))
mwe_tokenizer.add_mwe(('market', 'place'))

word_counts = {}
for sentence in sentences:
    word_list = nltk.word_tokenize(sentence)
    new_word_list = mwe_tokenizer.tokenize(word_list)
    for word in new_word_list:
        if word not in word_counts.keys():
            word_counts[word] = 0
        word_counts[word] += 1

Result

{'the': 7, 'dog': 1, 'jumped': 1, 'over': 1, 'green_car': 2, 'and': 1, 'into': 1, 'market_place': 3, ...
python nltk
1 Answer

You could try mwetoolkit. The code is available at https://gitlab.com/mwetoolkit/mwetoolkit3. The paper describing it is at https://aclanthology.org/L10-1553/
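
If no pre-built MWE list fits the corpus, one possible workaround is to derive candidate MWEs from the text itself and then merge them before counting. The sketch below is only an illustration under stated assumptions, not how mwetoolkit works. It treats any bigram that occurs at least twice and contains no stopword as an MWE. The STOPWORDS set, the min_count threshold, and the helper names find_candidate_mwes and merge_mwes are all hypothetical. It also uses a whitespace split instead of nltk.word_tokenize.

```python
from collections import Counter

sentences = ["the dog jumped over the green car and into the market place",
             "the cat was sleeping in the market place",
             "the man in a green car was waiting near the market place"]

# Hypothetical, deliberately tiny stopword list for this example only.
STOPWORDS = {"the", "a", "and", "in", "into", "over", "was", "near"}


def find_candidate_mwes(tokenized, min_count=2):
    """Return bigrams seen at least min_count times that contain no stopword."""
    bigram_counts = Counter()
    for tokens in tokenized:
        bigram_counts.update(zip(tokens, tokens[1:]))
    return {bigram for bigram, count in bigram_counts.items()
            if count >= min_count and not (set(bigram) & STOPWORDS)}


def merge_mwes(tokens, mwes, separator="_"):
    """Greedily join adjacent token pairs that are in mwes, left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in mwes:
            merged.append(separator.join(pair))
            i += 2  # skip both tokens of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokenized = [sentence.split() for sentence in sentences]
mwes = find_candidate_mwes(tokenized)

word_counts = Counter()
for tokens in tokenized:
    word_counts.update(merge_mwes(tokens, mwes))

print(sorted(mwes))
print(word_counts["green_car"], word_counts["market_place"], word_counts["the"])
```

The tuples in mwes could also be passed to MWETokenizer.add_mwe, as in the question, instead of using the hand-written merge step. A raw frequency threshold is crude. On a real corpus an association measure such as PMI would probably separate true MWEs from merely frequent word pairs more reliably.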

Also
