Context: I'm using Hugging Face's transformers package with Llama 3.1 8B (Instruct).
Problem: I'm generating a response to a prompt one word at a time as follows (note that I pick one of the texts candidates, append it to input_string, and then repeat the process):
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained(model_path, use_safetensors=True)
model = AutoModelForCausalLM.from_pretrained(model_path, use_safetensors=True)
input_ids = tokenizer.encode(input_string, return_tensors="pt") # tokenize to ids
logits = model(input_ids).logits # call model() to get logits
logits = logits[-1, -1] # only care about the last position in the last batch
probs = torch.nn.functional.softmax(logits, dim=-1) # softmax() to get probabilities
probs, ids = torch.topk(probs, 5) # keep only the top 5
texts = tokenizer.convert_ids_to_tokens(ids) # convert ids to tokens
But I've noticed a lot of strange or special characters showing up in the output. For example, here is what comes back in texts for
input_string = "How often should I wear a seatbelt?"
ĠAlways.ĊĊĊÄÃĦAlways,ĠunlessĠyouĠareÄłinÃĦaÃĦcarÃĥthatÃĦisÃĦnotÃĦmoving.
Is there an easy way to strip out the strange special characters?
I've tried the decoder options (in every possible True/False combination), like this:
myStr = 'ĠAlways.ĊĊĊÄÃĦAlways,ĠunlessĠyouĠareÄłinÃĦaÃĦcarÃĥthatÃĦisÃĦnotÃĦmoving.'
tokenizer.decode(tokenizer.encode(myStr), skip_special_tokens=True, clean_up_tokenization_spaces=True)
But it doesn't remove any of the special characters from the string.
Use this instead of rolling your own detokenizer:
tokenizer.batch_decode(input_ids)
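Applied to the question's snippet, that means decoding the top-5 candidate ids instead of converting them to raw tokens. A sketch, reusing the ids variable from the question:

texts = [tokenizer.decode(i) for i in ids]  # e.g. [' Always', ...] rather than ['ĠAlways', ...]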
The official Llama 3.1 has an approval process that can take some time, so this answer uses a proxy model that shares the same tokenizer as Llama 3.1.
Without using the model or going through the forward function, we can see those "strange symbols" appear just by converting the text to input ids and then converting the ids back to tokens. You'll see that this Ġ symbol keeps getting added:
from transformers import AutoTokenizer
import torch
model_path = "neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_safetensors=True)
input_string = "Always. Always, unless you are in a car that is not moving"
input_ids = tokenizer.encode(input_string, return_tensors="pt") # tokenize to ids
texts = tokenizer.convert_ids_to_tokens(input_ids.squeeze()) # convert ids to tokens
print(texts)
[out]:
['<|begin_of_text|>',
'Always',
'.',
'ĠAlways',
',',
'Ġunless',
'Ġyou',
'Ġare',
'Ġin',
'Ġa',
'Ġcar',
'Ġthat',
'Ġis',
'Ġnot',
'Ġmoving']
It looks like Ġ denotes a space, much like sentencepiece uses the '▁' (U+2581) symbol.
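A quick sanity check with the tokenizer's own detokenizer (reusing the tokenizer loaded above) confirms the mapping:

print(tokenizer.convert_tokens_to_string(['ĠAlways']))  # -> ' Always'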
So where does Ġ come from? Let's first try printing out the vocabulary; you'll see these unnatural-looking text characters all over the place:
print(tokenizer.vocab)
[out]:
{'icc': 48738,
'ĠCarly': 79191,
'ĠBOT': 83430,
'ĠÑĦоÑĤо': 118849,
'depends': 59047,
'ĠÑĢиз': 120010,
'ĠDolphin': 96096,
'ĠdataType': 23082,
'ĠÙģÙĤد': 116811,
'Ġme': 757,
'ÙĦÙī': 84659,
'.secondary': 70156,
'ĠAxes': 90804,
'PN': 18378,
'Ġflav': 18779,
'Ġhp': 21280,
'(Module': 76395,
'ãģ¾ãģ§': 103296,
 ...}

Where do characters like ĠÑĢ and ÙģÙĤ come from? See https://github.com/openai/gpt-2/issues/80 and https://augustasmacijauskas.github.io/personal-website/posts/tokenizers-deep-dive/tokenizers-deep-dive.html
The root of this Ġ evil is https://github.com/openai/gpt-2/blob/master/src/encoder.py#L9
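For reference, here is a condensed reproduction of that bytes_to_unicode function (following the GPT-2 source linked above). Printable bytes keep their own characters, and every other byte is shifted up by 256 so that it gets a visible stand-in character:

def bytes_to_unicode():
    # printable byte ranges map to themselves
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # every remaining byte is mapped to the visible character chr(256 + n)
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_encoder = bytes_to_unicode()
print(byte_encoder[ord(" ")])   # 'Ġ' (U+0120): space, shifted by 256
print(byte_encoder[ord("\n")])  # 'Ċ' (U+010A): newline, shifted by 256

That is why every word-initial token in the vocabulary starts with Ġ: the leading space of the word was mapped to chr(32 + 256).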
Try this:
from transformers import AutoTokenizer
import torch

model_path = "neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_safetensors=True)
input_string = "Always. Always, unless you are in a car that is not moving"
input_ids = tokenizer.encode(input_string, return_tensors="pt") # tokenize to ids
texts = tokenizer.convert_ids_to_tokens(input_ids.squeeze()) # convert ids to tokens
tokenizer.batch_decode(input_ids) # convert ids back to natural text
[out]:
['<|begin_of_text|>Always. Always, unless you are in a car that is not moving']
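(Equivalently, for a single sequence you can call tokenizer.decode(input_ids[0]), which returns the string itself rather than a one-element list.)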
And to drop the special BOS token:
tokenizer.batch_decode(input_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
[出]:
['Always. Always, unless you are in a car that is not moving']