I want to create a custom label for my project. To get help with this, I went through the tutorial "Building a custom NER model using spaCy 3.0". The contents of my JSON file:
[{"text": "The F15 aircraft uses a lot of fuel", "entities": [4, 7, "aircraft"]},
{"text":"did you see the F16 landing?", "entities": [16, 19, "aircraft"]},
{"text":"how many missiles can a F35 carry", "entities": [24, 27, "aircraft"]},
{"text":"is the F15 outdated", "entities": [7, 10, "aircraft"]},
{"text":"does the US still train pilots to dog fight?","entities": [0, 0, "aircraft"]},
{"text":"how long does it take to train a F16 pilot","entities": [33, 36, "aircraft"]}]
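However, this is an old file format, and we need to build the custom NER model in spaCy v3.1. The tutorial therefore includes a code block that converts the .json format to the .spacy format.
Here is that code block: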
import pandas as pd
from tqdm import tqdm
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # load a new spacy model
db = DocBin()  # create a DocBin object

for text, annot in tqdm(TRAIN_DATA):  # data in previous format
    doc = nlp.make_doc(text)  # create doc object from text
    ents = []
    for start, end, label in annot["entities"]:  # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents  # label the text with the ents
    db.add(doc)
When I run this code, I get an error.
How can I overcome this error?
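In the code block you provided, TRAIN_DATA should be a list of tuples, where each tuple pairs a text with its annotations: a dict whose "entities" value is a list of (start, end, label) spans. The JSON data you provided has a different shape: each entry is a dict, and its "entities" value is a single flat [start, end, label] list. It has to be converted into the tuple format before the loop runs.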
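The question does not quote the error, so the following is only a plausible reconstruction of what goes wrong when the raw JSON list is passed in as TRAIN_DATA. It is a minimal sketch in plain Python with no spaCy involved; the helper name `unpack_like_the_loop` is made up for illustration. Unpacking a dict in a `for text, annot in ...` loop yields its keys, so `annot` ends up being the string "entities" and indexing it with a string fails.

```python
# Two entries copied from the question's JSON file.
json_data = [
    {"text": "The F15 aircraft uses a lot of fuel", "entities": [4, 7, "aircraft"]},
    {"text": "did you see the F16 landing?", "entities": [16, 19, "aircraft"]},
]


def unpack_like_the_loop(data):
    """Mimic `for text, annot in data` followed by `annot["entities"]`.

    Hypothetical helper: it only reproduces the unpacking step of the
    tutorial loop so the failure can be observed without spaCy.
    """
    for text, annot in data:  # unpacking a two-key dict yields its two keys
        try:
            annot["entities"]  # annot is a str here, not a dict
        except TypeError as exc:
            return text, annot, type(exc).__name__
    return None


result = unpack_like_the_loop(json_data)
print(result)
```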
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

# Your JSON data
json_data = [
    {"text": "The F15 aircraft uses a lot of fuel", "entities": [4, 7, "aircraft"]},
    {"text": "did you see the F16 landing?", "entities": [16, 19, "aircraft"]},
    # ... (other data)
]

# Convert JSON data to spaCy format
TRAIN_DATA = []
for entry in json_data:
    text = entry["text"]
    entities = []
    for start, end, label in [entry["entities"]]:  # Note: Assuming each entry has only one entity
        entities.append((start, end, label))
    TRAIN_DATA.append((text, {"entities": entities}))

# spaCy setup
nlp = spacy.blank("en")  # load a new spacy model
db = DocBin()  # create a DocBin object

# Process each text
for text, annot in tqdm(TRAIN_DATA):
    doc = nlp.make_doc(text)  # create doc object from text
    ents = []
    for start, end, label in annot["entities"]:  # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents  # label the text with the ents
    db.add(doc)
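The conversion step can also be written as a small standalone function, which makes it easy to check the offsets before handing the data to spaCy. This is a sketch under stated assumptions, not the tutorial's method: the function name `to_train_data` is hypothetical, and treating an empty span such as [0, 0, "aircraft"] as "no entity in this sentence" is an assumption about what that entry was meant to say.

```python
# The six entries from the question's JSON file.
json_data = [
    {"text": "The F15 aircraft uses a lot of fuel", "entities": [4, 7, "aircraft"]},
    {"text": "did you see the F16 landing?", "entities": [16, 19, "aircraft"]},
    {"text": "how many missiles can a F35 carry", "entities": [24, 27, "aircraft"]},
    {"text": "is the F15 outdated", "entities": [7, 10, "aircraft"]},
    {"text": "does the US still train pilots to dog fight?", "entities": [0, 0, "aircraft"]},
    {"text": "how long does it take to train a F16 pilot", "entities": [33, 36, "aircraft"]},
]


def to_train_data(entries):
    """Convert dict entries into (text, {"entities": [(start, end, label)]}) tuples.

    Hypothetical helper. Accepts either one flat [start, end, label] list
    or a list of such triples under the "entities" key.
    """
    train_data = []
    for entry in entries:
        text = entry["text"]
        raw = entry["entities"]
        # Wrap a single flat triple so both shapes can be iterated the same way.
        if raw and not isinstance(raw[0], (list, tuple)):
            triples = [raw]
        else:
            triples = raw
        entities = []
        for start, end, label in triples:
            # Assumption: empty or out-of-range spans mean "no entity here".
            if start >= end or end > len(text):
                continue
            entities.append((start, end, label))
        train_data.append((text, {"entities": entities}))
    return train_data


TRAIN_DATA = to_train_data(json_data)

# Collect the text covered by each span to verify the character offsets.
surfaces = [text[s:e] for text, ann in TRAIN_DATA for s, e, _ in ann["entities"]]
print(surfaces)
```

The resulting TRAIN_DATA can be fed to the DocBin loop unchanged. Once the loop has finished, the DocBin still has to be written out before training, for example with `db.to_disk("./train.spacy")`, where the file name is only a placeholder.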