spaCy to CoNLL format without using spaCy's sentence splitter

Question

This post shows how to get the dependencies of a block of text in CoNLL format using spaCy's taggers. This is the solution that was posted:

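import spacy

# 'en' is the shortcut used by older spaCy releases; newer releases need a
# full package name such as 'en_core_web_sm'.
nlp_en = spacy.load('en')
doc = nlp_en(u'Bob bought the pizza to Alice')
for sent in doc.sents:
    for i, word in enumerate(sent):
        if word.head == word:
            head_idx = 0  # the root is its own head
        else:
            # word.head.i is a position in the doc; shift it to a 1-based
            # position within the sentence
            head_idx = word.head.i - sent[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s" % (
            i + 1,  # There's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_,  # Fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_,  # Relation
        ))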

It outputs this block:

1   Bob     bob     NNP   PERSON   2   nsubj
2   bought  buy     VBD            0   ROOT
3   the     the     DT             4   det
4   pizza   pizza   NN             2   dobj
5   to      to      IN             2   dative
6   Alice   alice   NNP   PERSON   5   pobj

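I would like to get the same output without using doc.sents.

I have my own sentence splitter. I would like to use it and then give spaCy one sentence at a time to get the POS tags, NER labels, and dependencies.

How can I get the POS, NER, and dependencies of a single sentence in CoNLL format with spaCy, without having to use spaCy's sentence splitter?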

Answer 1

A Document (Doc) in spaCy is iterable, and its documentation states that it iterates over Tokens:

 |  __iter__(...)
 |      Iterate over `Token`  objects, from which the annotations can be
 |      easily accessed. This is the main way of accessing `Token` objects,
 |      which are the main way annotations are accessed from Python. If faster-
 |      than-Python speeds are required, you can instead access the annotations
 |      as a numpy array, or access the underlying C data directly from Cython.
 |      
 |      EXAMPLE:
 |          >>> for token in doc

So I believe you just have to make a Document out of each of your pre-split sentences, then do something like the following:

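def printConll(split_sentence_text):
    # `nlp` is assumed to be an already loaded spaCy pipeline
    doc = nlp(split_sentence_text)
    for i, word in enumerate(doc):
        if word.head == word:
            head_idx = 0  # the root is its own head
        else:
            # The doc holds a single sentence, so word.head.i needs no
            # sentence offset, only the shift to 1-based numbering
            head_idx = word.head.i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s" % (
            i + 1,  # There's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_,  # Fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_,  # Relation
        ))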

Of course, to follow the CoNLL format you will have to print a blank line after each sentence.
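Putting the two points above together (one Document per sentence, a blank line after each sentence), a minimal sketch could look like the following. The helper names `conll_rows` and `sentences_to_conll` are made up for illustration, and `nlp` is assumed to be any callable that returns spaCy-like tokens, for example a loaded spaCy pipeline. The root is detected by comparing token positions (`word.head.i == word.i`), which is meant to be equivalent to the `word.head == word` check used above.

```python
def conll_rows(doc):
    """Return one tab-separated row per token of a single-sentence doc."""
    rows = []
    for i, word in enumerate(doc):
        # The root is its own head; it gets head index 0. Every other head
        # index is the head's position in the doc, shifted to 1-based.
        head_idx = 0 if word.head.i == word.i else word.head.i + 1
        rows.append("\t".join([
            str(i + 1),
            word.text,
            word.lemma_,
            word.tag_,       # fine-grained tag
            word.ent_type_,  # empty string when the token is not an entity
            str(head_idx),
            word.dep_,       # relation
        ]))
    return rows


def sentences_to_conll(nlp, sentences):
    """Run `nlp` on each pre-split sentence and join the results.

    Every row ends with a newline, and every sentence is followed by
    one extra newline, i.e. a blank line.
    """
    chunks = []
    for sentence in sentences:
        for row in conll_rows(nlp(sentence)):
            chunks.append(row + "\n")
        chunks.append("\n")
    return "".join(chunks)
```

With a loaded pipeline, usage would be along the lines of `print(sentences_to_conll(nlp, my_sentences), end="")`, where `my_sentences` is the output of your own sentence splitter.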

Answer 2

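This post is about a user facing unexpected sentence breaks when using spaCy's sentence boundary detection. One of the solutions proposed by the spaCy developers (as described in the post) is to add the flexibility to supply your own sentence boundary detection rules. In spaCy this problem is solved together with dependency parsing, not before it. Therefore, I don't think what you are looking for is currently supported by spaCy, though it might be in the near future.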

Answer 3

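@ashu's answer is partially right: dependency parsing and sentence boundary detection are tightly coupled by design in spaCy. There is, however, a simple sentencizer: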

https://spacy.io/api/sentencizer

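It seems that the sentencizer only uses punctuation (not a perfect approach). But since such a sentencizer exists, you can create a custom one with your own rules, and it will certainly affect the sentence boundaries.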
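As an illustration of such a custom rule, here is a minimal sketch of a component that declares each Document to be exactly one sentence, which fits the case where the text has already been split elsewhere. It assumes spaCy v3-style registration (`Language.component` and `nlp.add_pipe` with a string name) and the model name `en_core_web_sm`; older releases register components differently. The names `mark_single_sentence`, `single_sentence`, and `build_pipeline` are hypothetical. It also relies on the parser respecting sentence boundaries that are already set when it runs, so it is worth checking on your own data.

```python
def mark_single_sentence(doc):
    """Pipeline component: mark the whole doc as one sentence."""
    for i, token in enumerate(doc):
        # Only the first token may start a sentence.
        token.is_sent_start = (i == 0)
    return doc


def build_pipeline(model_name="en_core_web_sm"):
    """Load a model and run the component before the parser.

    Assumes spaCy v3; intended to be called once per process.
    """
    import spacy
    from spacy.language import Language

    # Register the function under a (hypothetical) component name.
    Language.component("single_sentence", func=mark_single_sentence)
    nlp = spacy.load(model_name)
    # The boundaries must be set before the parser sees the doc.
    nlp.add_pipe("single_sentence", before="parser")
    return nlp
```

With a pipeline built this way, `list(doc.sents)` should contain a single span for each sentence you pass in, so the CoNLL rows can be printed straight from the Document.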
