Why do I get unexpected tokenization when downloading the code-bert model?

Problem description — votes: 0, answers: 2

I get the following error when loading BertEmbedding:

Code:

name = "microsoft/codebert-base"

from transformers import BertModel
from transformers import BertTokenizer

print("[ Using pretrained BERT embeddings ]")
self.bert_tokenizer = BertTokenizer.from_pretrained(name, do_lower_case=lower_case)
self.bert_model = BertModel.from_pretrained(name)
if fix_emb:
    print("[ Fix BERT layers ]")
    self.bert_model.eval()
    for param in self.bert_model.parameters():
        param.requires_grad = False
else:
    print("[ Finetune BERT layers ]")
    self.bert_model.train()
Output:

[ Using pretrained BERT embeddings ]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'RobertaTokenizer'.
The class this function is called from is 'BertTokenizer'.
deep-learning huggingface-transformers transformer-model huggingface-tokenizers
2 Answers
1 vote

The name codebert-base is a bit misleading, because the model is actually a RoBERTa. The BERT and RoBERTa architectures are similar and differ only in minor details, but their tokenizers are completely different (as are the associated methods, though that is not relevant here).
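To see the difference concretely, you can compare a WordPiece (BERT-family) tokenizer with CodeBERT's byte-level BPE tokenizer on the same string. A minimal sketch; the bert-base-uncased checkpoint and the sample string are illustrative choices of mine, not from the question:

from transformers import BertTokenizer, RobertaTokenizer

# WordPiece (BERT family) vs. byte-level BPE (RoBERTa family).
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
code_tok = RobertaTokenizer.from_pretrained("microsoft/codebert-base")

text = "def add_numbers(a, b):"
print(bert_tok.tokenize(text))  # WordPiece pieces, e.g. ['def', 'add', '_', 'numbers', ...]
print(code_tok.tokenize(text))  # BPE pieces; 'Ġ' marks a leading space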

You should load microsoft/codebert-base like this:

from transformers import RobertaModel
from transformers import RobertaTokenizer

name = "microsoft/codebert-base"
tokenizer = RobertaTokenizer.from_pretrained(name)
model = RobertaModel.from_pretrained(name)
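
As a short usage sketch (my addition, not part of the original answer), you can then encode a snippet and take the hidden state of the leading <s> token as a sequence-level embedding:

import torch

# Hypothetical input; any code string works.
inputs = tokenizer("def max(a, b): return a if a > b else b", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, seq_len, 768).
embedding = outputs.last_hidden_state[:, 0, :]
print(embedding.shape)  # torch.Size([1, 768])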

Alternatively, you can use the Auto classes, which will pick the right class for you:

from transformers import AutoTokenizer, AutoModel

name = "microsoft/codebert-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
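
A quick check (my addition) confirms that the Auto classes resolve to the RoBERTa implementations recorded in the checkpoint's configuration:

# AutoTokenizer returns the fast RoBERTa tokenizer by default.
print(type(tokenizer).__name__)  # e.g. 'RobertaTokenizerFast'
print(type(model).__name__)      # 'RobertaModel'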

0 votes

Thank you for the RoBERTa answer. I am looking for a code example of a RAG implementation that can take some context documents/strings; for RoBERTa I have not found any retriever. I would be grateful if you could provide a working example.
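
As a rough starting point, here is a minimal retrieval sketch over context strings with CodeBERT; mean pooling and cosine similarity are illustrative assumptions on my part, not an established RAG recipe:

import torch
from transformers import AutoTokenizer, AutoModel

name = "microsoft/codebert-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
model.eval()

def embed(texts):
    # Mean-pool token embeddings over non-padding positions (assumption).
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq, 768)
    mask = batch["attention_mask"].unsqueeze(-1)   # (batch, seq, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

docs = [
    "def add(a, b): return a + b",
    "def read_file(path): return open(path).read()",
]
doc_vecs = embed(docs)                 # one vector per context document
query_vec = embed(["sum two numbers"])

scores = torch.nn.functional.cosine_similarity(query_vec, doc_vecs)
print(docs[scores.argmax()])           # document closest to the query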
