Adding a SpaCy Tokenizer Exception: Don't Split '>>'

Question

I'm trying to add an exception so that '>>' is recognized as an indicator of the start of a new sentence. For example,

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'>> We should. >>No.')

for sent in doc.sents:
    print(sent)

This prints:

>> We should.
>
>
No.

However, I want it to print:

>> We should.
>> No. 

Thanks in advance for your time!

nlp tokenize spacy
1 Answer

You need to create a custom component; the spaCy code examples include a custom sentence segmentation example. From the docs, that example does the following:

Example of adding a pipeline component that forbids sentence boundaries before certain tokens.

Code (the example adapted to your needs):

import spacy


def prevent_sentence_boundaries(doc):
    # Runs before the parser: mark tokens that are not allowed
    # to begin a new sentence.
    for token in doc:
        if not can_be_sentence_start(token):
            token.is_sent_start = False
    return doc


def can_be_sentence_start(token):
    # Any token directly preceded by '>' may not start a sentence.
    # This covers both the second '>' of '>>' and the word right after it,
    # so the parser is forced to open the sentence at the first '>'.
    if token.i > 0 and token.nbor(-1).text == '>':
        return False
    return True


nlp = spacy.load('en_core_web_sm')
# spaCy v2 API: pass the function itself and run it before the parser
nlp.add_pipe(prevent_sentence_boundaries, before='parser')

raw_text = u'>> We should. >> No.'
doc = nlp(raw_text)
sentences = [sent.text.strip() for sent in doc.sents]
for sentence in sentences:
    print(sentence)

Output:

>> We should.
>> No.
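
Note that add_pipe(prevent_sentence_boundaries, before='parser') is the spaCy v2 API. In spaCy v3 and later, add_pipe takes the string name of a registered component rather than the function itself, so the function must first be registered with the @Language.component decorator. A minimal sketch of the same component under v3 (assuming spacy >= 3.0 and the same en_core_web_sm model):

import spacy
from spacy.language import Language


@Language.component('prevent_sentence_boundaries')
def prevent_sentence_boundaries(doc):
    # Same logic as above: a token directly preceded by '>'
    # may not begin a new sentence.
    for token in doc:
        if token.i > 0 and token.nbor(-1).text == '>':
            token.is_sent_start = False
    return doc


nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('prevent_sentence_boundaries', before='parser')

doc = nlp('>> We should. >> No.')
for sent in doc.sents:
    print(sent.text.strip())

In both versions the parser still decides where sentences begin; the component only rules out boundaries after '>'. If you want to force the split regardless of the parser, you could additionally set token.is_sent_start = True on each '>' that is not itself preceded by '>'.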