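I am building a Unigram Tagger class. I believe I am at the final stage, but I am lost on how to compute the tag probabilities. So far I have counted how often each tag occurs, how often each word occurs, and how often each tag occurs for each word. The training itself is also done.
The tag counts I get: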
{'PRON': 2820, 'VERB': 6201, ...
The word counts I get:
{'i': 531, 'need': 3, "n't": 213, 'be': 145, ...
And for tagperwordcounts I get:
{'i': {'PRON': 531}, 'need': {'VERB': 3}, "n't": {'ADV': 213}, 'be': {'VERB': 145}, 'afraid': {'ADJ': 12}, 'of': {'ADP': 502, 'ADV': 10}, ...
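I am fairly sure the formula I should use to compute the probability is:
\[ P(t \mid w) = \frac{\text{number of occurrences of } w \text{ tagged as } t}{\text{number of occurrences of } w} \]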
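As a worked instance of this formula, take the word 'of' from the tagperwordcounts excerpt. Its total of 512 is an assumption derived by summing the two tag counts shown (502 + 10), since the wordcounts excerpt does not list 'of' explicitly:

```latex
\[
P(\text{ADP} \mid \text{of}) = \frac{502}{512} \approx 0.980,
\qquad
P(\text{ADV} \mid \text{of}) = \frac{10}{512} \approx 0.020
\]
```

Under that assumption, the tag maximising the probability for 'of' would be ADP.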
What I am not sure about is how to turn this into code.
class unigram_tagging():
    def __init__(self, traind=[]):
        self.tagcounts = {}         # tag -> number of occurrences
        self.wordcounts = {}        # word -> number of occurrences
        self.tagperwordcounts = {}  # word -> {tag -> number of occurrences}
        self.train(traind=traind)

    def train(self, traind):
        for sentence in traind:
            for token, tag in sentence:
                self.tagcounts[tag] = self.tagcounts.get(tag, 0) + 1
                self.wordcounts[token] = self.wordcounts.get(token, 0) + 1
                current = self.tagperwordcounts.get(token, {})
                current[tag] = current.get(tag, 0) + 1
                self.tagperwordcounts[token] = current

    def tag(self, traind):  # Here I want to work out probability
        pass
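The idea is that this final method will assign tags so that the tag probability is maximised.
    def tag(self, word, traind):
        for sentence in traind:
            for i, j in self.tagperwordcounts.get(word).items():
                self.tag_dists[i] = j / self.wordcounts.get(word)
        return self.tag_dists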
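One possible way to express the formula in code is sketched below. It is a minimal sketch, not a definitive implementation: the function names tag_probabilities and most_likely_tag, the default argument, and the choice to return the default for unseen words are all assumptions made for illustration. The functions take the two count dictionaries as arguments so they can be tried without the class; the count for 'of' (512) is again assumed to be the sum of its two tag counts.

```python
def tag_probabilities(word, wordcounts, tagperwordcounts):
    """Return {tag: P(tag | word)} computed from raw counts.

    Hypothetical helper: returns an empty dict for a word never seen
    in training instead of raising an error.
    """
    total = wordcounts.get(word, 0)
    if total == 0:
        return {}
    # P(t | w) = count(w tagged as t) / count(w)
    return {tag: count / total
            for tag, count in tagperwordcounts.get(word, {}).items()}


def most_likely_tag(word, wordcounts, tagperwordcounts, default=None):
    """Return the tag with the highest P(tag | word), or `default` if unseen."""
    probs = tag_probabilities(word, wordcounts, tagperwordcounts)
    if not probs:
        return default
    return max(probs, key=probs.get)


# Counts taken from the excerpts in the question; 512 for 'of' is assumed
# to be 502 + 10, because train() increments both dictionaries together.
wordcounts = {'i': 531, 'need': 3, 'of': 512}
tagperwordcounts = {
    'i': {'PRON': 531},
    'need': {'VERB': 3},
    'of': {'ADP': 502, 'ADV': 10},
}

of_probs = tag_probabilities('of', wordcounts, tagperwordcounts)
best_of = most_likely_tag('of', wordcounts, tagperwordcounts)
unseen = most_likely_tag('zebra', wordcounts, tagperwordcounts)
```

Inside the class, the same logic could live in the tag method and read self.wordcounts and self.tagperwordcounts directly. Note that the training data does not need to be passed in or looped over again at tagging time, since the counts already summarise it.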