How do I convert INCEpTION-annotated NER text to spaCy? (CoNLL-U to JSON)

Votes: 0 · Answers: 2

I'm using INCEpTION to annotate named entities, and I want to use the annotations to train a model with spaCy. INCEpTION offers several options for exporting the annotated text (e.g. CoNLL 2000, CoNLL CoreNLP, CoNLL-U). I exported the file as CoNLL-U and want to convert it to JSON, since spaCy's NER module requires that file format for training. A similar question has been asked before, but the answer didn't help me (here).

This is the annotated test text I'm using.

spaCy's conversion command is:

python -m spacy convert [input_file] [output_dir] [--file-type] [--converter]
[--n-sents] [--morphology] [--lang]
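
For reference, a filled-in example of that signature might look like this (a sketch only; the ./output directory is assumed to exist, and German is assumed as the language):

python -m spacy convert Teest.conllu ./output --converter conllu --file-type json --lang de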

My first problem is that I can't convert the file to .json. When I use the command below, I only get output without any named entities (see the output at the end):

!python -m spacy convert Teest.conllu

I also tried adding an output path and json:

!python -m spacy convert Teest.conllu C:\Users json

But then I get the following error:

usage: spacy convert [-h] [-t json] [-n 1] [-s] [-b None] [-m] [-c auto]
                     [-l None]
                     input_file [output_dir]
spacy convert: error: unrecognized arguments: Users json
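
(As the usage string above shows, spacy convert accepts at most one positional output directory; the file type has to be passed through the --file-type flag, so the call would presumably need to be something like:

python -m spacy convert Teest.conllu C:\Users --file-type json
)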

My second problem is that the output contains neither named entities nor start and end indices:

[
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "id":0,
                "orth":"Hallo",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":1,
                "orth":",",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":2,
                "orth":"dies",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":3,
                "orth":"ist",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":4,
                "orth":"ein",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":5,
                "orth":"Test",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":6,
                "orth":"um",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":7,
                "orth":"zu",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":8,
                "orth":"schauen",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":9,
                "orth":",",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":10,
                "orth":"wie",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":11,
                "orth":"in",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":12,
                "orth":"Inception",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":13,
                "orth":"annotiert",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":14,
                "orth":"wird",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":15,
                "orth":".",
                "tag":"_",
                "head":0,
                "dep":"_"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":1,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "id":0,
                "orth":"Funktioniert",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":1,
                "orth":"es",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":2,
                "orth":"?",
                "tag":"_",
                "head":0,
                "dep":"_"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":2,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "id":0,
                "orth":"Simon",
                "tag":"_",
                "head":0,
                "dep":"_"
              }
            ]
          }
        ]
      }
    ]
  }
]

I'm using spaCy version 2.3.0 and Python version 3.8.3.

Update: I used a new file because I wanted to find out whether the language was causing any problems. When I export the file as CoNLL Core NLP, it does contain the named entities:

1   I'm         _   _   O       _   _
2   trying      _   _   Verb    _   _
3   to          _   _   O       _   _
4   fix         _   _   Verb    _   _
5   some        _   _   O       _   _
6   problems    _   _   O       _   _
7   .           _   _   O       _   _
    
1   But         _   _   O       _   _
2   why         _   _   O       _   _
3   it          _   _   O       _   _
4   this        _   _   O       _   _
5   not         _   _   O       _   _
6   working     _   _   Verb    _   _
7   ?           _   _   O       _   _

1   Simon       _   _   Name    _   _

However, when I try to convert the CoNLL Core NLP file with
!python -m spacy convert Teest.conll

the following error appears:

line 68, in read_conllx
    id_, word, lemma, pos, tag, morph, head, dep, _1, iob = parts
ValueError: not enough values to unpack (expected 10, got 7)

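(spaCy's v2 read_conllx unpacks ten tab-separated columns per token line, while the CoNLL Core NLP export above has only seven. As the update below describes, padding each token line to ten columns works around this; a minimal Python sketch of that step, with the file names as assumptions:

with open('Teest.conll', encoding='utf-8') as src, \
        open('Teest_padded.conll', 'w', encoding='utf-8') as dst:
    for line in src:
        parts = line.rstrip('\n').split('\t')
        if len(parts) == 7:  # token line: insert three filler columns before the NER tag
            parts = parts[:4] + ['_', '_', '_'] + parts[4:]
        dst.write('\t'.join(parts) + '\n')
)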

Update: after adding 3 tab-separated "_" columns before the NER tag, the conversion works. The output is:

[
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "id":0,
                "orth":"I'm",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"O"
              },
              {
                "id":1,
                "orth":"trying",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"U-rb"
              },
              {
                "id":2,
                "orth":"to",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"O"
              },
              {
                "id":3,
                "orth":"fix",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"U-rb"
              },
              {
                "id":4,
                "orth":"some",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"O"
              },
              {
                "id":5,
                "orth":"problems",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"O"
              },
              {
                "id":6,
                "orth":".",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":1,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "id":0,
                "orth":"But",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"O"
              },
              {
                "id":1,
                "orth":"why",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"O"
              },
              {
                "id":2,
                "orth":"it",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"O"
              },
              {
                "id":3,
                "orth":"this",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"O"
              },
              {
                "id":4,
                "orth":"not",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"O"
              },
              {
                "id":5,
                "orth":"working",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"U-rb"
              },
              {
                "id":6,
                "orth":"?",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":2,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "id":0,
                "orth":"Simon",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"U-me"
              }
            ]
          }
        ]
      }
    ]
  }
]

However, I can't convert this directly to a .json file, and as far as I know, training spaCy's NER module requires tuples. For example:

[('Berlin is a city.', {'entities': [(0, 6, 'LOC'), (7, 9, 'VERB'), (12, 16, 'NOUN')]})]
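
For reference, spaCy v2 consumes tuples of that shape in a training loop roughly like this (a minimal sketch with a blank English pipeline and toy data; note that the character offsets are end-exclusive, so 'Berlin' spans (0, 6)):

import random
import spacy

# toy training data in the (text, {'entities': [(start, end, label)]}) format
TRAIN_DATA = [('Berlin is a city.', {'entities': [(0, 6, 'LOC')]})]

nlp = spacy.blank('en')
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
for _, annotations in TRAIN_DATA:
    for _start, _end, label in annotations['entities']:
        ner.add_label(label)

optimizer = nlp.begin_training()
for _epoch in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        # v2 API: update on one example at a time
        nlp.update([text], [annotations], drop=0.2, sgd=optimizer, losses=losses)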
Tags: spacy · named-entity-recognition · inception
2 Answers
0 votes

I feel your pain.
You'll need to write a script yourself to convert that last output into spaCy's format; a sketch of one is below.
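
A minimal sketch of such a script (the file name is an assumption; it walks the v2 JSON training format shown above, decodes the BILUO ner tags, and rebuilds character offsets by joining tokens with single spaces):

import json

with open('output.json', encoding='utf-8') as f:
    docs = json.load(f)

train_data = []
for doc in docs:
    for paragraph in doc['paragraphs']:
        for sentence in paragraph['sentences']:
            words, entities = [], []
            offset, ent_start, label = 0, 0, None
            for token in sentence['tokens']:
                start, end = offset, offset + len(token['orth'])
                ner = token.get('ner', 'O')
                if ner.startswith(('B-', 'U-')):   # entity begins at this token
                    ent_start, label = start, ner[2:]
                if ner.startswith(('L-', 'U-')):   # entity ends at this token
                    entities.append((ent_start, end, label))
                words.append(token['orth'])
                offset = end + 1                   # assume one space between tokens
            train_data.append((' '.join(words), {'entities': entities}))

print(train_data)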

A better solution is to use spacy-annotator, which lets you annotate entities and get the output in the format spaCy expects. Here is what it looks like:

[screenshot of the spacy-annotator interface]


0 votes

I found a solution for training spaCy's NER module with INCEpTION as the annotation tool. I tried various file formats, but as far as I can tell it only works with CoNLL 2002 and spaCy's command-line interface.

  1. Annotate the text in INCEpTION.
  2. Export the annotated text as a CoNLL 2002 file.
  3. Set up the command-line interface (Windows here):
python -m venv .venv
.venv\Scripts\activate.bat
  4. If necessary, install spaCy and download the required language model (I used the large English model):
pip install spacy
python -m spacy download en_core_web_lg
  5. Convert the CoNLL 2002 file to the input format spaCy requires:
python -m spacy convert --converter ner file_name.conll [output file direction]

This step shouldn't actually work, because CoNLL 2002 uses IOB2 while spaCy's converter expects IOB. However, I didn't run into any problems, and the .json output file was correct.
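
For reference, the two schemes differ only in when B- is used: IOB2 opens every entity with B-, while classic IOB opens entities with I- and reserves B- for separating adjacent entities of the same type. For example:

IOB2:  Angela B-PER | Merkel I-PER | visits O | Berlin B-LOC
IOB:   Angela I-PER | Merkel I-PER | visits O | Berlin I-LOC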

  6. Debug the data, train, and evaluate.

Here is a good example of how to work with the converted file.
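
Under spaCy v2, these last steps map roughly onto the following commands (the paths are assumptions; train.json and dev.json stand for the converted files):

python -m spacy debug-data en train.json dev.json --pipeline ner
python -m spacy train en ./model train.json dev.json --pipeline ner
python -m spacy evaluate ./model/model-best dev.json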
