I need to use nltk.pos_tag() together with bigrams. Here is my code:
from nltk.util import ngrams
from collections import Counter
bigrams = list(ngrams(all_file_data, 2))
print(bigrams[:50])
print(Counter(bigrams).most_common(30))
The output is:
[('SUBDELAGATION', 'ON'), ('ON', 'AGENDA'), ('AGENDA', 'ITEM'), ('ITEM', '3'), ...]
How can I get the POS tags together with the bigram frequencies, as in the attached image?
Try this:
from nltk import pos_tag, word_tokenize
from nltk.util import ngrams
from collections import Counter
text = "hello world is a common sentence. A common sentence is foo bar. A foo bar is a common ice cream."
tagged_texts = pos_tag(word_tokenize(text))
counter = Counter(ngrams(tagged_texts, 2))
counter.most_common()  # no argument: list every bigram, most frequent first
[out]:
[((('is', 'VBZ'), ('a', 'DT')), 2),
((('a', 'DT'), ('common', 'JJ')), 2),
((('common', 'JJ'), ('sentence', 'NN')), 2),
((('.', '.'), ('A', 'DT')), 2),
((('foo', 'JJ'), ('bar', 'NN')), 2),
((('hello', 'JJ'), ('world', 'NN')), 1),
((('world', 'NN'), ('is', 'VBZ')), 1),
((('sentence', 'NN'), ('.', '.')), 1),
((('A', 'DT'), ('common', 'JJ')), 1),
((('sentence', 'NN'), ('is', 'VBZ')), 1),
((('is', 'VBZ'), ('foo', 'JJ')), 1),
((('bar', 'NN'), ('.', '.')), 1),
((('A', 'DT'), ('foo', 'JJ')), 1),
((('bar', 'NN'), ('is', 'VBZ')), 1),
((('common', 'JJ'), ('ice', 'NN')), 1),
((('ice', 'NN'), ('cream', 'NN')), 1),
((('cream', 'NN'), ('.', '.')), 1)]
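Since all_file_data in the question already looks like a list of tokens, it can probably be passed to pos_tag directly, without word_tokenize. The nested tuples above can then be flattened into one row per bigram. The following is only a minimal sketch: the tagged list is hand-written to mimic the (word, tag) shape that pos_tag returns, its tags are assumed for illustration, and tagged_bigram_counts is a hypothetical helper name.

```python
from collections import Counter

# Hand-written sample in the (word, tag) shape that pos_tag returns.
# The tags here are assumed for illustration; in real use this list
# would come from pos_tag(all_file_data).
tagged = [
    ("AGENDA", "NN"), ("ITEM", "NN"), ("3", "CD"),
    ("AGENDA", "NN"), ("ITEM", "NN"), ("4", "CD"),
]


def tagged_bigram_counts(tagged_tokens):
    """Count adjacent pairs of (word, tag) tuples (hypothetical helper)."""
    return Counter(zip(tagged_tokens, tagged_tokens[1:]))


counts = tagged_bigram_counts(tagged)

# Flatten each entry into (word pair, tag pair, frequency).
rows = [
    (f"{w1} {w2}", f"{t1} {t2}", freq)
    for ((w1, t1), (w2, t2)), freq in counts.most_common()
]

for words, tags, freq in rows:
    print(f"{words:<15}{tags:<10}{freq}")
```

zip(tokens, tokens[1:]) yields the same adjacent pairs as ngrams(tokens, 2), so either can feed the Counter.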