Why doesn't spaCy preserve intra-word hyphens during tokenization the way Stanford CoreNLP does?


spaCy version: 2.0.11

Python version: 3.6.5

OS: Ubuntu 16.04

My sample sentences:

Marketing-Representative- won't die in car accident.

or

Out-of-box implementation

Expected tokens:

["Marketing-Representative", "-", "wo", "n't", "die", "in", "car", "accident", "."]

["Out-of-box", "implementation"]

spaCy tokens (default tokenizer):

["Marketing", "-", "Representative-", "wo", "n't", "die", "in", "car", "accident", "."]

["Out", "-", "of", "-", "box", "implementation"]

I tried creating a custom tokenizer, but it doesn't handle all the edge cases that spaCy covers with tokenizer_exceptions (code below):

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex
import re

nlp = spacy.load('en')

# Reuse spaCy's default prefix/suffix rules, but supply a custom infix
# pattern that does not split on hyphens.
prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
infix_re = re.compile(r'''[.,?:;…‘’`“”"'~]''')

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=None)

nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("Marketing-Representative- won't die in car accident.")
for token in doc:
    print(token.text)

Output:

Marketing-Representative-
won
'
t
die
in
car
accident
.

I need guidance on the proper way to do this.

Changing the regex above might do it, or some other approach might. I also tried spaCy's rule-based Matcher, but I couldn't create a rule that handles hyphens between more than two words, e.g. "out-of-box", so that a matcher could be created for use with span.merge().

Either way, I need words containing intra-word hyphens to become single tokens, the way Stanford CoreNLP handles them.
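For reference, here is a minimal sketch of the Matcher + span.merge() route alluded to above, assuming the spaCy 2.0 API; the explicit per-length patterns and the longest-match filtering are illustrative additions, not part of the original question:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)

# One explicit pattern per hyphen count (illustrative only): two-part and
# three-part hyphenations such as "Marketing-Representative" and "Out-of-box".
word = {'IS_ALPHA': True}
hyph = {'ORTH': '-'}
matcher.add('HYPHENATED', None,
            [word, hyph, word, hyph, word],
            [word, hyph, word])

doc = nlp("Out-of-box implementation")
spans = [doc[start:end] for _, start, end in matcher(doc)]

# Keep only maximal, non-overlapping matches (the two-part pattern also
# fires inside every three-part match).
spans.sort(key=len, reverse=True)
seen, keep = set(), []
for span in spans:
    if not seen.intersection(range(span.start, span.end)):
        keep.append(span)
        seen.update(range(span.start, span.end))

# Merge right-to-left so earlier token indices stay valid after each merge.
for span in sorted(keep, key=lambda s: s.start, reverse=True):
    span.merge()

print([token.text for token in doc])  # ['Out-of-box', 'implementation']

This works, but each additional hyphen count needs its own pattern, which is exactly the limitation described above.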

python-3.x nlp spacy
1 Answer

2 votes

Although it isn't documented on the spaCy usage site, it looks like we just need to add a regex for each *fix we are working with, in this case the infix.

Also, it appears we can extend the regexes with a custom addition to nlp.Defaults.prefixes:

infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")

This will give you the desired result. There is no need to set defaults for prefix and suffix, since we are not working with them.

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

nlp = spacy.load('en')

# Build the infix patterns from the default prefix patterns plus a few
# extras; the default infix rule that splits on hyphens between letters is
# deliberately left out, so intra-word hyphens stay intact.
infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
infix_re = compile_infix_regex(infixes)

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

nlp.tokenizer = custom_tokenizer(nlp)

s1 = "Marketing-Representative- won't die in car accident."
s2 = "Out-of-box implementation"

for s in (s1, s2):
    doc = nlp(s)
    print([token.text for token in doc])

Result:

$ python3 /tmp/nlp.py  
['Marketing-Representative-', 'wo', "n't", 'die', 'in', 'car', 'accident', '.']  
['Out-of-box', 'implementation']  

You may want to adjust the added regexes to make them more robust for other token types that are close to the patterns applied here.
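As one possible refinement (my own sketch, not part of the original answer): instead of appending patterns on top of nlp.Defaults.prefixes, you could start from nlp.Defaults.infixes, drop only the rules that split on hyphens, and swap in the compiled finditer. This assumes the spaCy 2.x Tokenizer attributes are writable, and the '-'-based filter is deliberately coarse:

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load('en')

# Start from the default infix rules and drop every pattern mentioning a
# hyphen (coarse filter: it also removes the numeric "1-2" rule, which may
# or may not be what you want).
infixes = tuple(p for p in nlp.Defaults.infixes if '-' not in p)
infix_re = compile_infix_regex(infixes)

# Swap only the infix finditer; prefix/suffix handling and the built-in
# tokenizer exceptions (e.g. "won't" -> "wo", "n't") are left untouched.
nlp.tokenizer.infix_finditer = infix_re.finditer

# Hyphenated words stay whole; the final period is still split off by the
# untouched suffix rules.
doc = nlp("Out-of-box implementation works out-of-the-box.")
print([token.text for token in doc])

This keeps the rest of the default tokenizer behavior intact, which should make it less fragile for token types the appended patterns above were not written for.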
