From the `nltk` package, I see that Kneser-Ney smoothing is only implemented for trigrams, and when I try to use the same function on bigrams it throws an error. Is there any way to apply the smoothing to bigrams?
## Working code for trigrams
import nltk

tokens = "What a piece of work is man! how noble in reason! how infinite in faculty! in \
form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
the beauty of the world, the paragon of animals!".split()

gut_ngrams = nltk.ngrams(tokens, 3)
freq_dist = nltk.FreqDist(gut_ngrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)
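For reference, once the trigram distribution builds successfully it can be queried: `prob()` returns the smoothed probability of a trigram and `samples()` lists the trigrams seen in training. A short sketch (shortened input text for brevity):

```python
import nltk

# Same trigram setup as above, with a shorter text.
tokens = "What a piece of work is man! how noble in reason!".split()
freq_dist = nltk.FreqDist(nltk.ngrams(tokens, 3))
kneser_ney = nltk.KneserNeyProbDist(freq_dist)

# prob() returns the discounted (smoothed) probability of a trigram;
# samples() iterates over the trigrams seen in training.
for trigram in list(kneser_ney.samples())[:3]:
    print(trigram, kneser_ney.prob(trigram))
```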
But when we use bigrams:
import nltk
tokens = "What a piece of work is man! how noble in reason! how infinite in faculty! in \
form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
the beauty of the world, the paragon of animals!".split()
gut_ngrams = nltk.ngrams(tokens,2)
freq_dist = nltk.FreqDist(gut_ngrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)
the code throws an error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-4-1ce73b806bb8> in <module>
4 gut_ngrams = nltk.ngrams(tokens,2)
5 freq_dist = nltk.FreqDist(gut_ngrams)
----> 6 kneser_ney = nltk.KneserNeyProbDist(freq_dist)
~/.pyenv/versions/3.8.0/lib/python3.8/site-packages/nltk/probability.py in __init__(self, freqdist, bins, discount)
1737 self._trigrams_contain = defaultdict(float)
1738 self._wordtypes_before = defaultdict(float)
-> 1739 for w0, w1, w2 in freqdist:
1740 self._bigrams[(w0, w1)] += freqdist[(w0, w1, w2)]
1741 self._wordtypes_after[(w0, w1)] += 1
ValueError: not enough values to unpack (expected 3, got 2)
If we look at the implementation, https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L1700:
class KneserNeyProbDist(ProbDistI):
    def __init__(self, freqdist, bins=None, discount=0.75):
        if not bins:
            self._bins = freqdist.B()
        else:
            self._bins = bins
        self._D = discount
        # cache for probability calculation
        self._cache = {}
        # internal bigram and trigram frequency distributions
        self._bigrams = defaultdict(int)
        self._trigrams = freqdist
        # helper dictionaries used to calculate probabilities
        self._wordtypes_after = defaultdict(float)
        self._trigrams_contain = defaultdict(float)
        self._wordtypes_before = defaultdict(float)
        for w0, w1, w2 in freqdist:
            self._bigrams[(w0, w1)] += freqdist[(w0, w1, w2)]
            self._wordtypes_after[(w0, w1)] += 1
            self._trigrams_contain[w1] += 1
            self._wordtypes_before[(w1, w2)] += 1
we see that the initialization hard-codes the assumption that the input is trigrams: the loop that counts the word types before and after each word unpacks every key of the frequency distribution into exactly three words:

        for w0, w1, w2 in freqdist:
            self._bigrams[(w0, w1)] += freqdist[(w0, w1, w2)]
            self._wordtypes_after[(w0, w1)] += 1
            self._trigrams_contain[w1] += 1
            self._wordtypes_before[(w1, w2)] += 1
So only trigrams can be smoothed with the `KneserNeyProbDist` object! For example, 4-grams fail the same way:
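To directly answer the original question: since `KneserNeyProbDist` is hard-wired to trigrams, one option is to compute bigram Kneser-Ney by hand. The formula is small enough for a self-contained sketch (pure Python, no NLTK; the function name and the 0.75 discount default are my own choices):

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, discount=0.75):
    """Return a function p(w_prev, w) giving the KN-smoothed bigram probability."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])  # counts of each word used as a context
    # continuations[w]: distinct left contexts w appears after (for P_continuation)
    # followers[w0]: distinct words that follow w0 (for the interpolation weight)
    continuations, followers = defaultdict(set), defaultdict(set)
    for w0, w1 in bigram_counts:
        continuations[w1].add(w0)
        followers[w0].add(w1)
    n_bigram_types = len(bigram_counts)

    def prob(w_prev, w):
        c_context = context_counts[w_prev]
        if c_context == 0:
            return 0.0
        # discounted bigram estimate
        p_discounted = max(bigram_counts[(w_prev, w)] - discount, 0) / c_context
        # interpolation weight: mass freed up by discounting
        lam = discount * len(followers[w_prev]) / c_context
        # continuation probability: in how many distinct contexts does w appear?
        p_cont = len(continuations[w]) / n_bigram_types
        return p_discounted + lam * p_cont

    return prob

tokens = "the cat sat on the mat the cat ran".split()
p = kneser_ney_bigram(tokens)
print(p("the", "cat"))  # ≈ 0.488
```

The key idea of Kneser-Ney is visible in `p_cont`: the backoff distribution counts how many distinct contexts a word appears after, not how often it occurs, and the weight `lam` redistributes exactly the probability mass removed by the discount, so the probabilities over the vocabulary still sum to one.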
tokens = "What a piece of work is man! how noble in reason! how infinite in faculty! in \
form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
the beauty of the world, the paragon of animals!".split()
gut_ngrams = nltk.ngrams(tokens,4)
freq_dist = nltk.FreqDist(gut_ngrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)
[out]:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-60a48ed2ffce> in <module>
4 gut_ngrams = nltk.ngrams(tokens,4)
5 freq_dist = nltk.FreqDist(gut_ngrams)
----> 6 kneser_ney = nltk.KneserNeyProbDist(freq_dist)
~/.pyenv/versions/3.8.0/lib/python3.8/site-packages/nltk/probability.py in __init__(self, freqdist, bins, discount)
1737 self._trigrams_contain = defaultdict(float)
1738 self._wordtypes_before = defaultdict(float)
-> 1739 for w0, w1, w2 in freqdist:
1740 self._bigrams[(w0, w1)] += freqdist[(w0, w1, w2)]
1741 self._wordtypes_after[(w0, w1)] += 1
ValueError: too many values to unpack (expected 3)
Q: Does that mean there is no way to do language modeling with KN smoothing in NLTK?
A: Not quite. NLTK has a proper language-modeling module, `nltk.lm`. Here is a tutorial example that uses it: https://www.kaggle.com/alvations/n-gram-language-model-with-nltk/notebook#Training-an-N-gram-Model

Then you just need to define the right language model object =)
from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

tokens = "What a piece of work is man! how noble in reason! how infinite in faculty! in \
form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
the beauty of the world, the paragon of animals!".split()

n = 4  # order of the ngram
# padded_everygram_pipeline expects an iterable of tokenized sentences,
# so wrap the single token list in another list.
train_data, padded_sents = padded_everygram_pipeline(n, [tokens])
model = KneserNeyInterpolated(n)
model.fit(train_data, padded_sents)
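Once fitted, the model can be scored and sampled with the standard `nltk.lm` API. A small self-contained sketch on a toy sentence (again, `padded_everygram_pipeline` takes an iterable of tokenized sentences, hence the extra list):

```python
from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

tokens = "the cat sat on the mat".split()
n = 2
# The pipeline takes a list of tokenized sentences, so wrap the token list.
train_data, padded_sents = padded_everygram_pipeline(n, [tokens])
model = KneserNeyInterpolated(n)
model.fit(train_data, padded_sents)

# score() gives P(word | context); generate() samples words from the model.
print(model.score("cat", ["the"]))
print(model.generate(4, random_seed=4))
```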