I am trying to train a translation model from scratch using HuggingFace's BartModel architecture, and I am using a ByteLevelBPETokenizer to tokenize the text.
The problem I am facing is that after training, the saved tokenizer does not load back correctly, even though it clearly writes out the expected vocab.json and merges.txt files.
from tokenizers import ByteLevelBPETokenizer
# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()
# Customize training
tokenizer.train(files=['/work/vaibhav/iwslt14/de-en/train.en', '/work/vaibhav/iwslt14/de-en/train.de'], vocab_size=8000, min_frequency=2, special_tokens=[
"<s>",
"<pad>",
"</s>",
"<unk>",
"<mask>",
])
print(tokenizer)
# Save the vocab.json / merges.txt files to disk
tokenizer.save_model("tokens")
This is how I train and save the tokenizer. The print statement outputs the following:
Tokenizer(vocabulary_size=8000, model=ByteLevelBPE, add_prefix_space=False, lowercase=False, dropout=None, unicode_normalizer=None, continuing_subword_prefix=None, end_of_word_suffix=None, trim_offsets=False)
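For what it's worth, the freshly trained tokenizer behaves as expected before saving; a quick check like the following (the German sample sentence is made up just for illustration) produces sensible subword tokens:
# quick sanity check on the freshly trained tokenizer
encoded = tokenizer.encode("das ist ein kleiner Test")
print(encoded.tokens)  # subword pieces (with the Ġ prefix marking leading spaces)
print(encoded.ids)     # the corresponding vocabulary ids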
However, when I try to load the tokenizer while training my model, using the following lines of code:
#get the tokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.from_file('tokens/vocab.json', 'tokens/merges.txt')
print(tokenizer)
return tokenizer
then the print statement outputs the following:
Tokenizer(vocabulary_size=0, model=ByteLevelBPE, add_prefix_space=False, lowercase=False, dropout=None, unicode_normalizer=None, continuing_subword_prefix=None, end_of_word_suffix=None, trim_offsets=False)
This is strange to me, because vocabulary_size should be 8000, not zero, and with an empty vocabulary the tokenizer is effectively useless. If I retrain the tokenizer and use it directly without saving and reloading, it works, but that is not practical.
Here is a view of vocab.json (truncated):
{"<s>":0,"<pad>":1,"</s>":2,"<unk>":3,"<mask>":4,"!":5,"\"":6,"#":7,"$":8,"%":9,"&":10,"'":11,"(":12,")":13,"*":14,"+":15,",":16,"-":17,".":18,"/":19,"0":20,"1":21,"2":22,"3":23,"4":24,"5":25,"6":26,"7":27,"8":28,"9":29,":":30,";":31,"<":32,"=":33,">":34,"?":35,"@":36,"A":37,"B":38,"C":39,"D":40,"E":41,"F":42,"G":43,"H":44,"I":45,"J":46,"K":47,"L":48,"M":49,"N":50,"O":51,"P":52,"Q":53,"R":54,"S":55,"T":56,"U":57,"V":58,"W":59,"X":60,"Y":61,"Z":62,"[":63,"\\":64,"]":65,"^":66,"_":67,"`":68,"a":69,"b":70,"c":71,"d":72,"e":73,"f":74,"g":75,"h":76,"i":77,"j":78,"k":79,"l":80,"m":81,"n":82,"o":83,"p":84,"q":85,"r":86,"s":87,"t":88,"u":89,"v":90,"w":91,"x":92,"y":93,"z":94,"{":95,"|":96,"}":97,"~":98,"¡":99,"¢":100,"£":101,"¤":102,"¥":103,"¦":104,"§":105,"¨":106,"©":107,"ª":108,"«":109,"¬":110,"®":111,"¯":112,"°":113,"±":114,"²":115,"³":116,"´":117,"µ":118,"¶":119,"·":120,"¸":121,"¹":122,"º":123,"»":124,"¼":125,"½":126,"¾":127,"¿":128,"À":129,"Á":130,"Â":131,"Ã":132,"Ä":133,"Å":134,"Æ":135,"Ç":136,"È":137,"É":138,"Ê":139,"Ë":140,"Ì":141,"Í":142,"Î":143,"Ï":144,"Ð":145,"Ñ":146,"Ò":147,"Ó":148,"Ô":149,"Õ":150,"Ö":151,"×":152,"Ø":153,"Ù":154,"Ú":155,"Û":156,"Ü":157,"Ý":158,"Þ":159,"ß":160,"à":161,"á":162,"â":163,"ã":164,"ä":165,"å":166,"æ":167,"ç":168,"è":169,"é":170,"ê":171,"ë":172,"ì":173,"í":174,"î":175,"ï":176,"ð":177,"ñ":178,"ò":179,"ó":180,"ô":181,"õ":182,"ö":183,"÷":184,"ø":185,"ù":186,"ú":187,"û":188,"ü":189,"ý":190,"þ":191,"ÿ":192,"Ā":193,"ā":194,"Ă":195,"ă":196,"Ą":197,"ą":198,"Ć":199,"ć":200,"Ĉ":201,"ĉ":202,"Ċ":203,"ċ":204,"Č":205,"č":206,"Ď":207,"ď":208,"Đ":209,"đ":210,"Ē":211,"ē":212,"Ĕ":213,"ĕ":214,"Ė":215,"ė":216,"Ę":217,"ę":218,"Ě":219,"ě":220,"Ĝ":221,"ĝ":222,"Ğ":223,"ğ":224,"Ġ":225,"ġ":226,"Ģ":227,"ģ":228,"Ĥ":229,"ĥ":230,"Ħ":231,"ħ":232,"Ĩ":233,"ĩ":234,"Ī":235,"ī":236,"Ĭ":237,"ĭ":238,"Į":239,"į":240,"İ":241,"ı":242,"IJ":243,"ij":244,"Ĵ":245,"ĵ":246,"Ķ":247,"ķ":248,"ĸ":249,"Ĺ":250,"
Here is a view of merges.txt (truncated):
#version: 0.2 - Trained by `huggingface/tokenizers`
e n
e r
i n
Ġ t
c h
Ġ a
Ġ d
Ġ w
Ġ s
Ġt h
n d
i e
e s
i s
a t
o n
i t
Ġ m
a s
a n
r e
Ġth e
Ġ h
As you can see, the files look fine. Any help with this issue would be greatly appreciated.
So I finally figured it out myself! The problem was in how the tokenizer was being loaded; loading it like this instead made it work:
tokenizer = ByteLevelBPETokenizer('tokens/vocab.json', 'tokens/merges.txt')
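As far as I can tell, the original code printed an empty tokenizer because from_file is a static method that builds and returns a new ByteLevelBPETokenizer instead of filling in the instance it is called on, so the populated object was silently discarded. A minimal sketch of the corrected loading helper (the get_tokenizer name is just for illustration; 'tokens' is the directory created by save_model above):
from tokenizers import ByteLevelBPETokenizer

def get_tokenizer():
    # Passing the saved files to the constructor populates the vocabulary directly.
    tokenizer = ByteLevelBPETokenizer('tokens/vocab.json', 'tokens/merges.txt')
    print(tokenizer)  # vocabulary_size should now report 8000
    return tokenizer

# Equivalent alternative: keep the object returned by the static method.
# tokenizer = ByteLevelBPETokenizer.from_file('tokens/vocab.json', 'tokens/merges.txt')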
What version of the tokenizers lib are you using? I ran into the same problem but couldn't get it to work :/