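I am trying to train a multiclass (mutually exclusive labels) text classification model with spaCy in a Google Colab notebook. The classes are positive, negative, and neutral.
I put my training data into the annotation format specified here.
Below is an example of the annotations I made: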
[. . ["Happy #MothersDay to all ... ", {'cats': {'NEUTRAL': 1.0}}], ["Happy mothers day ..", {"cats": {"POSITIVE": 1.0}}], . .]
Problem
When I try to debug the data with the debug-data option of the spaCy CLI, using the following command (run in a Jupyter notebook):
%%bash
(python -m spacy debug-data en \
/content/drive/My\ Drive/Spacy/Pretrained/train_clas.json \
/content/drive/My\ Drive/Spacy/Pretrained/eval_clas.json \
-p 'textcat' \
)
I get the following output:
=========================== Data format validation ===========================
✔ Corpus is loadable

=============================== Training stats ===============================
Training pipeline: textcat
Starting with blank model 'en'
0 training docs
0 evaluation docs
✘ No evaluation docs
✔ No overlap between training and evaluation data
✘ Low number of examples to train from a blank model (0)

============================== Vocab & Vectors ==============================
ℹ 0 total words in the data (0 unique)
ℹ No word vectors present in the model

============================ Text Classification ============================
ℹ Text Classification: 0 new label(s), 0 existing label(s)
ℹ The train data contains only instances with mutually-exclusive classes.

================================== Summary ==================================
✔ 2 checks passed
✘ 2 errors
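It does not read the data correctly, even though I have checked the files and each contains more than 1000 samples.
I cannot find anything wrong with the data. Can someone point out the error? Thanks!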
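The spacy debug-data command expects data in spaCy's internal JSON training format, which is described here: https://spacy.io/api/annotation#json-input
There are some examples here: https://github.com/explosion/spaCy/tree/master/examples/training/textcat_example_data. The conversion script in the same directory shows how to convert from a JSONL format, which is very similar to the TRAIN_DATA format used in the example scripts.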
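As a first step toward that conversion, here is a minimal sketch of turning TRAIN_DATA-style pairs like the ones in the question into JSONL lines. It assumes each JSONL line is an object with a `text` key and a `cats` key, as the files in the linked example directory appear to be. Check those files before relying on it. The `LABELS` tuple and the `to_jsonl_lines` helper are hypothetical names introduced for illustration, and filling unlisted labels with 0.0 is an assumption based on the classes being mutually exclusive.

```python
import json

# Assumed label set, taken from the classes named in the question.
LABELS = ("POSITIVE", "NEGATIVE", "NEUTRAL")


def to_jsonl_lines(train_data, labels=LABELS):
    """Turn [text, {"cats": {...}}] pairs into JSONL strings.

    Labels missing from an example are written as 0.0 (an assumption
    that fits mutually exclusive classes).
    """
    lines = []
    for text, annotations in train_data:
        given = annotations.get("cats", {})
        cats = {label: float(given.get(label, 0.0)) for label in labels}
        lines.append(json.dumps({"text": text, "cats": cats}, ensure_ascii=False))
    return lines


# The two examples shown in the question.
train_data = [
    ["Happy #MothersDay to all ... ", {"cats": {"NEUTRAL": 1.0}}],
    ["Happy mothers day ..", {"cats": {"POSITIVE": 1.0}}],
]

jsonl_lines = to_jsonl_lines(train_data)
for line in jsonl_lines:
    print(line)
```

Writing these lines to a file, one per line, should give an input the conversion script can turn into the JSON format that debug-data reads. After converting, running debug-data again should report a nonzero number of training docs if the format is right.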