NLTK RegEx Chunker not capturing defined grammar patterns with wildcards

Question

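I am trying to chunk a sentence using NLTK's POS tags as regular expressions. Two rules are defined to identify phrases, based on the tags of the words in the sentence.

Mainly, I want to capture one or more verbs, followed by an optional determiner, and then one or more nouns at the end. This is the first rule in the definition, but it is not captured as a phrase chunk.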

import nltk

## Defining the POS tagger 
tagger = nltk.data.load(nltk.tag._POS_TAGGER)


## A Single sentence - input text value
textv="This has allowed the device to start, and I then see glitches which is not nice."
tagged_text = tagger.tag(textv.split())

## Defining Grammar rules for  Phrases
actphgrammar = r"""
     Ph: {<VB*>+<DT>?<NN*>+}  # verbal phrase - one or more verbs followed by optional determiner, and one or more nouns at the end
     {<RB*><VB*|JJ*|NN*\$>} # Adverbial phrase - Adverb followed by adjective / Noun or Verb
     """

### Parsing the defined grammar for  phrases
actp = nltk.RegexpParser(actphgrammar)

actphrases = actp.parse(tagged_text)

The input to the chunker, tagged_text, is shown below.

tagged_text Out[7]: [('This', 'DT'), ('has', 'VBZ'), ('allowed', 'VBN'), ('the', 'DT'), ('device', 'NN'), ('to', 'TO'), ('start,', 'NNP'), ('and', 'CC'), ('I', 'PRP'), ('then', 'RB'), ('see', 'VB'), ('glitches', 'NNS'), ('which', 'WDT'), ('is', 'VBZ'), ('not', 'RB'), ('nice.', 'NNP')]

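In the final output, only the adverbial phrase matching the second rule ("then see") is captured. I expected the verbal phrase ("allowed the device") to match the first rule and be captured as well, but it is not.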

actphrases Out[8]: Tree('S', [('This', 'DT'), ('has', 'VBZ'), ('allowed', 'VBN'), ('the', 'DT'), ('device', 'NN'), ('to', 'TO'), ('start,', 'NNP'), ('and', 'CC'), ('I', 'PRP'), Tree('Ph', [('then', 'RB'), ('see', 'VB')]), ('glitches', 'NNS'), ('which', 'WDT'), ('is', 'VBZ'), ('not', 'RB'), ('nice.', 'NNP')])

The NLTK version used is 2.0.5 (Python 2.7). Any help or suggestions would be greatly appreciated.

Answer 1

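A small change to your regex will get you the desired output. When you want a wildcard in a RegexpParser grammar, you should use .* instead of * (on its own, * only repeats the preceding character), e.g. VB.* instead of VB*: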

>>> from nltk import word_tokenize, pos_tag, RegexpParser
>>> text = "This has allowed the device to start, and I then see glitches which is not nice."
>>> tagged_text = pos_tag(word_tokenize(text))    
>>> g = r"""
... VP: {<VB.*><DT><NN.*>}
... """
>>> p = RegexpParser(g); p.parse(tagged_text)
Tree('S', [('This', 'DT'), ('has', 'VBZ'), Tree('VP', [('allowed', 'VBN'), ('the', 'DT'), ('device', 'NN')]), ('to', 'TO'), ('start', 'VB'), (',', ','), ('and', 'CC'), ('I', 'PRP'), ('then', 'RB'), ('see', 'VBP'), ('glitches', 'NNS'), ('which', 'WDT'), ('is', 'VBZ'), ('not', 'RB'), ('nice', 'JJ'), ('.', '.')])
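To check which tags each pattern accepts, the two forms can be tested as ordinary regular expressions. This is only a rough sketch using Python's re module rather than NLTK itself: it assumes the text inside the angle brackets of a tag pattern behaves like a regex applied to the whole tag, and tag_matches is a hypothetical helper introduced for illustration.

```python
import re

def tag_matches(tag_pattern, tag):
    """Hypothetical helper: True if the whole tag matches the pattern as a plain regex."""
    return re.fullmatch(tag_pattern, tag) is not None

# Penn Treebank verb tags
verb_tags = ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]

# "VB*" means "V followed by zero or more B", so it only accepts the bare tag VB.
star_only = [t for t in verb_tags if tag_matches(r"VB*", t)]

# "VB.*" means "VB followed by anything", so it accepts every verb tag.
dot_star = [t for t in verb_tags if tag_matches(r"VB.*", t)]

print(star_only)  # ['VB']
print(dot_star)   # ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']
```

Under that assumption, VBN ("allowed") is rejected by VB*, which would explain why the first rule in the question never fires, while RB and VB match their patterns exactly.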

Note that you are capturing Tree(AdvP, [('then', 'RB'), ('see', 'VB')]) because the tags are exactly RB and VB. So in this case the wildcards in the adverbial-phrase rule of your grammar are effectively ignored.

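Also, if these are two different types of phrases, it is advisable to use 2 labels rather than one. And (I think) the end-of-string anchor after the wildcard is somewhat redundant, so it would be better to use: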

g = r"""
VP:{<VB.*><DT><NN.*>} 
AdvP: {<RB.*><VB.*|JJ.*|NN.*>}
"""
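As a possible variation, the quantifiers from the question's first rule (one or more verbs, an optional determiner, one or more nouns) can be combined with the .* wildcard. The sketch below is an assumption-laden illustration, not part of the original answer: it assumes NLTK 3.x (where Tree.label() exists; 2.0.5 used .node), and it feeds in the tags copied by hand from the question's tagged_text output so that no tagger model is needed.

```python
from nltk import RegexpParser

# Tags copied by hand from the tagged_text output in the question.
tagged_text = [
    ('This', 'DT'), ('has', 'VBZ'), ('allowed', 'VBN'), ('the', 'DT'),
    ('device', 'NN'), ('to', 'TO'), ('start,', 'NNP'), ('and', 'CC'),
    ('I', 'PRP'), ('then', 'RB'), ('see', 'VB'), ('glitches', 'NNS'),
    ('which', 'WDT'), ('is', 'VBZ'), ('not', 'RB'), ('nice.', 'NNP'),
]

# Two labelled rules; the VP rule keeps the + and ? quantifiers from the question.
grammar = r"""
VP: {<VB.*>+<DT>?<NN.*>+}
AdvP: {<RB.*><VB.*|JJ.*|NN.*>}
"""

parser = RegexpParser(grammar)
tree = parser.parse(tagged_text)

# Collect (label, words) for every chunk below the root S node.
chunks = [
    (subtree.label(), [word for word, tag in subtree.leaves()])
    for subtree in tree.subtrees()
    if subtree.label() != 'S'
]
print(chunks)
```

With these hand-entered tags, the VP rule should pick up "has allowed the device" and "see glitches". Because the rules run as successive stages, "see" is already inside a VP chunk by the time the AdvP rule runs, so that rule no longer matches "then see" and instead finds "not nice.". Reordering the rules would change which chunk wins.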
