Is there a model or library with a pre-built repository of multiword expressions?
I am trying to run word counts over a set of paragraphs, but I want to include ngrams. I have hit a snag, though, because some of my sentences contain bigrams and trigrams that are more meaningful than the individual unigrams.
I thought about just generating every ngram and counting those, but that leads to over-counting.
Example using ngrams
import nltk

sentences = ["the dog jumped over the green car and into the market place",
             "the cat was sleeping in the market place",
             "the man in a green car was waiting near the market place"]

word_counts = {}
for sentence in sentences:
    word_list = nltk.word_tokenize(sentence)
    ngrams = nltk.everygrams(word_list, 1, 3)
    for word in ngrams:
        if word not in word_counts:
            word_counts[word] = 0
        word_counts[word] += 1
Results
{('the',): 7, ('the', 'dog'): 1, ('the', 'dog', 'jumped'): 1, ('dog',): 1, ('dog', 'jumped'): 1, ('dog', 'jumped', 'over'): 1 ...
I also considered using MWETokenizer. While that does what I want, I run into the problem of having to define the MWEs myself.
Example using MWETokenizer
from nltk.tokenize import MWETokenizer

mwe_tokenizer = MWETokenizer()
mwe_tokenizer.add_mwe(('green', 'car'))
mwe_tokenizer.add_mwe(('market', 'place'))

word_counts = {}
for sentence in sentences:
    word_list = nltk.word_tokenize(sentence)
    new_word_list = mwe_tokenizer.tokenize(word_list)
    for word in new_word_list:
        if word not in word_counts:
            word_counts[word] = 0
        word_counts[word] += 1
Results
{'the': 7, 'dog': 1, 'jumped': 1, 'over': 1, 'green_car': 2, 'and': 1, 'into': 1, 'market_place': 3, ...
You could try mwetoolkit. The code is available at https://gitlab.com/mwetoolkit/mwetoolkit3, and the paper describing it is https://aclanthology.org/L10-1553/.
There is also
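If you would rather stay within NLTK, one way around having to define the MWEs by hand is to discover candidates automatically with NLTK's collocation finders and then feed them to MWETokenizer. A minimal sketch on the toy corpus above (the PMI threshold of 3.0, the frequency filter, and the plain whitespace tokenization are assumptions for illustration, not tuned values):

```python
from collections import Counter
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.tokenize import MWETokenizer

sentences = ["the dog jumped over the green car and into the market place",
             "the cat was sleeping in the market place",
             "the man in a green car was waiting near the market place"]

# Whitespace tokenization keeps the sketch self-contained (no punkt download)
docs = [sentence.split() for sentence in sentences]

# Score bigrams per document so no bigrams span sentence boundaries
finder = BigramCollocationFinder.from_documents(docs)
finder.apply_freq_filter(2)  # ignore bigrams seen only once

# Keep bigrams whose PMI clears an (assumed) threshold as MWE candidates
candidates = [bigram
              for bigram, score in finder.score_ngrams(BigramAssocMeasures.pmi)
              if score > 3.0]

# Merge the discovered MWEs into single tokens, then count as before
mwe_tokenizer = MWETokenizer(candidates)
word_counts = Counter(token
                      for doc in docs
                      for token in mwe_tokenizer.tokenize(doc))
```

On this corpus, high-frequency but uninformative pairs such as ('the', 'market') score low PMI and are filtered out, while ('green', 'car') and ('market', 'place') survive as candidates, so the counts come out merged without listing the MWEs by hand. On real data you would want a larger corpus and a tuned threshold.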