I have been using the HuggingFace tokenizers library, and it seems that when I process a string containing newline characters, it ignores them and treats them like spaces. I want to build my own language model, and I believe being able to generate structured paragraphs would be useful.
from tokenizers import Tokenizer
from tokenizers.models import WordPiece

bert_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Unicode NFD normalization plus accent stripping.
from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents
bert_tokenizer.normalizer = normalizers.Sequence([NFD(), StripAccents()])

# BERT-style pre-tokenization: splits on whitespace and punctuation.
from tokenizers.pre_tokenizers import BertPreTokenizer
bert_tokenizer.pre_tokenizer = BertPreTokenizer()

# Wrap single sequences and pairs with [CLS]/[SEP].
from tokenizers.processors import TemplateProcessing
bert_tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)

# Train the WordPiece vocabulary on a plain-text file.
from tokenizers.trainers import WordPieceTrainer
trainer = WordPieceTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
file = 'input.txt'
bert_tokenizer.train([file], trainer)

from tokenizers import decoders
bert_tokenizer.decoder = decoders.WordPiece()
output = bert_tokenizer.encode("This is the first line.\nThis is the second line.")
print(bert_tokenizer.decode(output.ids))
# Should read as:
# This is the first line.
# This is the second line.
#
# Instead I get:
# This is the first line. This is the second line.
I have tried other tokenizer models, such as BPE, but they have problems with merging tokens together. I found that the WordPiece tokenizer produces cleaner output, but I want to be able to generate paragraphs, or lines of a script, as possible outputs.
A simpler approach is to replace the '\n' character with a special token. Your tokenizer will then recognize that token, and your model will generate it like everything else. In the last step of your text-generation pipeline you replace it back with an actual '\n', so the structure of the training dataset is preserved and can be reproduced.
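
For example, here is a minimal sketch of that idea, reusing the bert_tokenizer from the question. The placeholder name "[NL]" and the two helper functions are arbitrary choices for illustration, not part of the tokenizers API, and the same newline replacement would also need to be applied to input.txt before training so the placeholder actually appears in the training data.

NL = "[NL]"  # arbitrary placeholder; any string that never occurs in the corpus works

# Register the placeholder as an added token so the pre-tokenizer
# never splits it into "[", "NL", "]".
bert_tokenizer.add_tokens([NL])

def encode_with_newlines(text):
    # Swap real newlines for the placeholder before encoding.
    return bert_tokenizer.encode(text.replace("\n", f" {NL} "))

def decode_with_newlines(ids):
    # Decode, then turn the placeholder back into real newlines.
    text = bert_tokenizer.decode(ids)
    return text.replace(f" {NL} ", "\n").replace(NL, "\n")

output = encode_with_newlines("This is the first line.\nThis is the second line.")
print(decode_with_newlines(output.ids))
# This is the first line.
# This is the second line.

Because "[NL]" is added as a regular (non-special) token, it survives decode() even with the default skip_special_tokens behavior, while [CLS] and [SEP] are still stripped.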