I am using INCEpTION to annotate named entities, and I want to use the annotations to train a model with spaCy. INCEpTION offers several options (e.g. CoNLL 2000, CoNLL CoreNLP, CoNLL-U) for exporting the annotated text. I exported the file as CoNLL-U and would like to convert it to JSON, since that is the file format required for training spaCy's NER module. A similar question has been asked before, but the answer did not help me (here).
spaCy's conversion command is:
python -m spacy convert [input_file] [output_dir] [--file-type] [--converter]
[--n-sents] [--morphology] [--lang]
My first problem is that I cannot convert the file to .json. When I run the command below, I only get output without any named entities (see the output at the end):
!python -m spacy convert Teest.conllu
I also tried adding an output path and json:
!python -m spacy convert Teest.conllu C:\Users json
But I receive the following error:
usage: spacy convert [-h] [-t json] [-n 1] [-s] [-b None] [-m] [-c auto]
[-l None]
input_file [output_dir]
spacy convert: error: unrecognized arguments: Users json
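Judging from the usage message, the file type apparently has to be passed with the -t/--file-type flag rather than as a bare argument (and the Windows path may need quoting), so presumably something like the following, although I have not verified it:

!python -m spacy convert Teest.conllu C:\Users -t json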
My second problem is that the output contains neither named entities nor start and end indices:
[
{
"id":0,
"paragraphs":[
{
"sentences":[
{
"tokens":[
{
"id":0,
"orth":"Hallo",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":1,
"orth":",",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":2,
"orth":"dies",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":3,
"orth":"ist",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":4,
"orth":"ein",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":5,
"orth":"Test",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":6,
"orth":"um",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":7,
"orth":"zu",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":8,
"orth":"schauen",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":9,
"orth":",",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":10,
"orth":"wie",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":11,
"orth":"in",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":12,
"orth":"Inception",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":13,
"orth":"annotiert",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":14,
"orth":"wird",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":15,
"orth":".",
"tag":"_",
"head":0,
"dep":"_"
}
]
}
]
}
]
},
{
"id":1,
"paragraphs":[
{
"sentences":[
{
"tokens":[
{
"id":0,
"orth":"Funktioniert",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":1,
"orth":"es",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":2,
"orth":"?",
"tag":"_",
"head":0,
"dep":"_"
}
]
}
]
}
]
},
{
"id":2,
"paragraphs":[
{
"sentences":[
{
"tokens":[
{
"id":0,
"orth":"Simon",
"tag":"_",
"head":0,
"dep":"_"
}
]
}
]
}
]
}
]
I am using spaCy version 2.3.0 and Python version 3.8.3.
Update: I used a new file because I wanted to find out whether the language was causing any problems. When I export the file as CoNLL CoreNLP, the file does contain the named entities:
1 I'm _ _ O _ _
2 trying _ _ Verb _ _
3 to _ _ O _ _
4 fix _ _ Verb _ _
5 some _ _ O _ _
6 problems _ _ O _ _
7 . _ _ O _ _
1 But _ _ O _ _
2 why _ _ O _ _
3 it _ _ O _ _
4 this _ _ O _ _
5 not _ _ O _ _
6 working _ _ Verb _ _
7 ? _ _ O _ _
1 Simon _ _ Name _ _
However, when I try to convert the CoNLL CoreNLP file with

!python -m spacy convert Teest.conll

I get the following error:
line 68, in read_conllx
id_, word, lemma, pos, tag, morph, head, dep, _1, iob = parts
ValueError: not enough values to unpack (expected 10, got 7)
Update: After adding three additional tab-separated "_" columns before the NER column, the conversion works (a small padding sketch follows the output below). The output is:
[
{
"id":0,
"paragraphs":[
{
"sentences":[
{
"tokens":[
{
"id":0,
"orth":"I'm",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":1,
"orth":"trying",
"tag":"_",
"head":0,
"dep":"_",
"ner":"U-rb"
},
{
"id":2,
"orth":"to",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":3,
"orth":"fix",
"tag":"_",
"head":0,
"dep":"_",
"ner":"U-rb"
},
{
"id":4,
"orth":"some",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":5,
"orth":"problems",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":6,
"orth":".",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
}
]
}
]
}
]
},
{
"id":1,
"paragraphs":[
{
"sentences":[
{
"tokens":[
{
"id":0,
"orth":"But",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":1,
"orth":"why",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":2,
"orth":"it",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":3,
"orth":"this",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":4,
"orth":"not",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":5,
"orth":"working",
"tag":"_",
"head":0,
"dep":"_",
"ner":"U-rb"
},
{
"id":6,
"orth":"?",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
}
]
}
]
}
]
},
{
"id":2,
"paragraphs":[
{
"sentences":[
{
"tokens":[
{
"id":0,
"orth":"Simon",
"tag":"_",
"head":0,
"dep":"_",
"ner":"U-me"
}
]
}
]
}
]
}
]
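For reference, the padding described in the update above could be scripted roughly as follows. This is only a sketch: the file names are placeholders, and it assumes the 7-column CoNLL CoreNLP export shown earlier with the NER label in the 5th column, inserting three "_" columns in front of it to reach the 10 columns that spaCy's read_conllx expects.

# Pad the 7-column CoNLL CoreNLP export to 10 tab-separated columns by
# inserting three "_" fields before the NER label (hypothetical file names).
with open("Teest.conll", encoding="utf-8") as src, \
        open("Teest_padded.conll", "w", encoding="utf-8") as dst:
    for line in src:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 7:  # token line from the CoreNLP export
            parts = parts[:4] + ["_", "_", "_"] + parts[4:]
        dst.write("\t".join(parts) + "\n")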
However, I cannot convert this output directly into a .json file, and as far as I know, training spaCy's NER module requires tuples, for example:
[('Berlin is a city.', {'entities': [(0, 6, 'LOC'), (7, 9, 'VERB'), (12, 16, 'NOUN')]})]
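For context, and as far as I understand it, tuples of this shape are what spaCy v2's training loop consumes; a minimal sketch with made-up example data (not my actual annotations) would look roughly like this:

import random
import spacy

# Hypothetical training data in the tuple format shown above.
TRAIN_DATA = [
    ("Berlin is a city.", {"entities": [(0, 6, "LOC")]}),
]

nlp = spacy.blank("en")          # blank pipeline; use "de" for German data
ner = nlp.create_pipe("ner")     # spaCy v2 API
nlp.add_pipe(ner)
for _, annotations in TRAIN_DATA:
    for _start, _end, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.begin_training()
for _ in range(20):              # a few epochs over the toy data
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
    print(losses)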
I feel your pain.
You will need to write a script yourself to convert that last output into spaCy's training format.
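For illustration, a rough sketch of such a script, assuming the token-level JSON shown above (the file name is a placeholder, tokens are re-joined with spaces so the offsets are only approximate for real text, and every labelled token is treated as its own single-token entity, which matches the U- tags in the output; multi-token B-/I- spans would need to be merged):

import json

def json_to_spacy_tuples(path):
    # Read the output of `spacy convert` and build (text, {"entities": ...}) tuples.
    with open(path, encoding="utf-8") as f:
        docs = json.load(f)
    train_data = []
    for doc in docs:
        for para in doc["paragraphs"]:
            for sent in para["sentences"]:
                words = [tok["orth"] for tok in sent["tokens"]]
                text = " ".join(words)
                entities, offset = [], 0
                for word, tok in zip(words, sent["tokens"]):
                    ner = tok.get("ner", "O")
                    if ner not in ("O", "-"):
                        label = ner.split("-", 1)[-1]  # drop the B-/I-/U-/L- prefix
                        entities.append((offset, offset + len(word), label))
                    offset += len(word) + 1  # +1 for the joining space
                train_data.append((text, {"entities": entities}))
    return train_data

print(json_to_spacy_tuples("Teest.json"))  # hypothetical file name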
A better solution is to use spacy-annotator, which lets you annotate entities and get the output in the format spaCy expects.
I found a solution for training spaCy's NER module with INCEpTION as the annotation tool. I tried various file formats, but as far as I can tell it only works when exporting as CoNLL 2002 and using spaCy on the command line:
python -m venv .venv
.venv\Scripts\activate.bat
pip install spacy
python -m spacy download en_core_web_lg
python -m spacy convert --converter ner file_name.conll [output_dir]
This step should not actually work, because CoNLL 2002 uses IOB2 while spaCy's converter expects IOB. However, I did not run into any problems, and the .json output file is correct.
Here is a good example of how to work with the converted file.
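For completeness, the converted .json can then be fed to spaCy v2's train command, roughly along these lines (the language code and paths are placeholders, and a separate dev file is also required):

python -m spacy train de output_model train.json dev.json --pipeline ner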