Python NLTK identifies the last name as an ORGANIZATION when the name comes first in the sentence

Question

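I am using Python's nltk library to extract names from a sentence. I expect the output to be ['Barack Obama', 'Michelle Obama'], but I get ['Barack', 'Michelle Obama']. My sample code is below. When I print ner_tree, I see that it tags Obama's last name as an organization.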

import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

# One-time downloads of the models and corpora used below
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')  # needed by pos_tag
nltk.download('maxent_ne_chunker')
nltk.download('words')

sentence = "Barack Obama and Michelle Obama visited New York last week"

words = word_tokenize(sentence)    # split the sentence into tokens
tagged_words = pos_tag(words)      # attach part-of-speech tags
ner_tree = ne_chunk(tagged_words)  # group tokens into named-entity chunks

# Collect the text of every chunk labeled PERSON
names = []
for subname in ner_tree:
    if isinstance(subname, nltk.Tree) and subname.label() == 'PERSON':
        name = " ".join(word for word, tag in subname)
        names.append(name)

print(names)

# Expected output: ['Barack Obama', 'Michelle Obama']
# Actual output: ['Barack', 'Michelle Obama']

The value of ner_tree is as follows:

(S
  (PERSON Barack/NNP)
  (ORGANIZATION Obama/NNP)
  and/CC
  (PERSON Michelle/NNP Obama/NNP)
  visited/VBD
  (GPE New/NNP York/NNP)
  last/JJ
  week/NN)

If the sentence in the code above is changed as follows, it produces the expected output.

sentence = "As per our sources, Barack Obama and Michelle Obama visited New York last week"

Answer 1

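The NLTK library you are using has a statistical model under the hood, trained on a large amount of data. The model relies on context to recognize names, organizations, geographic locations, and so on.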

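By adding the prefix words "As per our sources", you give the model better context, which raises its confidence that the word "Obama" is a person's name.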

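These models make predictions based on the surrounding context words. In a different context, the same word may get a higher confidence score as a person's name than as an organization.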
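The idea that the same word can look different to the model depending on its neighbors can be sketched with a toy feature extractor. This is only an illustration: context_features and the feature names in it are hypothetical and are not NLTK's actual feature set. The second token list is split by hand, with the comma as its own token.

```python
def context_features(tokens, i):
    """Toy context features for tokens[i].

    Assumption: these features are made up for illustration and do not
    reproduce what NLTK's chunker actually computes.
    """
    return {
        "word": tokens[i],
        "is_capitalized": tokens[i][:1].isupper(),
        "prev_word": tokens[i - 1] if i > 0 else "<START>",
        "prev_prev_word": tokens[i - 2] if i > 1 else "<START>",
        "next_word": tokens[i + 1] if i + 1 < len(tokens) else "<END>",
    }


short = "Barack Obama and Michelle Obama visited New York last week".split()
longer = ("As per our sources , Barack Obama and Michelle Obama "
          "visited New York last week").split()

# Features of the first "Obama" in each sentence
a = context_features(short, short.index("Obama"))
b = context_features(longer, longer.index("Obama"))

# Names of the features whose values differ between the two sentences
differing = sorted(key for key in a if a[key] != b[key])
print(differing)  # ['prev_prev_word']
```

Even in this small sketch, the first "Obama" sits two tokens from the start of the sentence in one case and after a comma in the other, so a context-based model receives different input for the same word.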

To better understand the effect of context, see the example below:

sentence = "Official currency used by Barack Obama and Michelle Obama visited New York last week."
prefix words = "Official currency"

Now the words "Barack Obama" are tagged as a geographic entity by the same code.

This happens because the context words around it make it look like something geographic.
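If changing the sentence is not an option, one possible workaround is to post-process the chunker output and glue an ORGANIZATION chunk onto a PERSON chunk that directly precedes it. This is a heuristic sketch, not something NLTK provides: chunk_spans and merge_split_names are hypothetical helper names, and the rule will wrongly merge a person followed immediately by a real organization name.

```python
def chunk_spans(tree):
    """Flatten an ne_chunk result into (label, text) pairs.

    Entity chunks (objects with a label() method, such as nltk.Tree)
    keep their label; plain (word, tag) tokens get the label None.
    """
    spans = []
    for node in tree:
        if hasattr(node, "label"):
            spans.append((node.label(), " ".join(word for word, _tag in node)))
        else:
            spans.append((None, node[0]))
    return spans


def merge_split_names(spans):
    """Heuristic: append an ORGANIZATION span to the PERSON span right before it."""
    names = []
    prev_was_person = False
    for label, text in spans:
        if label == "PERSON":
            names.append(text)
            prev_was_person = True
        elif label == "ORGANIZATION" and prev_was_person:
            names[-1] += " " + text
            prev_was_person = False
        else:
            prev_was_person = False
    return names


# Spans copied by hand from the ner_tree printed in the question
spans = [
    ("PERSON", "Barack"), ("ORGANIZATION", "Obama"), (None, "and"),
    ("PERSON", "Michelle Obama"), (None, "visited"), ("GPE", "New York"),
    (None, "last"), (None, "week"),
]
print(merge_split_names(spans))  # ['Barack Obama', 'Michelle Obama']
```

With the code from the question, the same result should come from merge_split_names(chunk_spans(ner_tree)).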
