How do I make a HuggingFace tokenizer recognize newline characters?


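I have been using the HuggingFace tokenizers library, and it seems that when I process a string containing a newline character, the tokenizer ignores the newline and treats it as a space. I want to build my own language model, and I believe being able to generate structured paragraphs would be useful.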

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
bert_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents
bert_tokenizer.normalizer = normalizers.Sequence([NFD(), StripAccents()])

from tokenizers.pre_tokenizers import BertPreTokenizer
bert_tokenizer.pre_tokenizer = BertPreTokenizer()

from tokenizers.processors import TemplateProcessing
bert_tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)

from tokenizers.trainers import WordPieceTrainer
trainer = WordPieceTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
file = 'input.txt'
bert_tokenizer.train([file], trainer)

from tokenizers import decoders
bert_tokenizer.decoder = decoders.WordPiece()

output = bert_tokenizer.encode("This is the first line.\nThis is the second line.")
print(bert_tokenizer.decode(output.ids))
# Should read as:
# This is the first line.
# This is the second line.
#
# Instead I get:
# This is the first line. This is the second line.

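I have tried other tokenizer models, such as BPE, but they have problems with merging tokens together. I found that the WordPiece tokenizer produces cleaner output, but I would like the model to be able to generate paragraphs or lines of a script as output.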

1 Answer

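A simpler approach is to replace the '\n' characters with a special token before tokenizing. Your tokenizer will then recognize that token, and your model will generate it like any other token. In the last step of your text generation pipeline, you replace the token with an actual '\n', which preserves the structure of the training dataset and lets the model reproduce it.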
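
The replacement step described in this answer could look roughly like the sketch below. It uses only the Python standard library. The placeholder `[NL]` and the helper names `mark_newlines` and `restore_newlines` are assumptions made for illustration and do not come from the question or from the tokenizers library. The sketch also assumes the placeholder never occurs naturally in the corpus.

```python
import re

# Hypothetical placeholder; any string that never appears in the corpus would do.
NEWLINE_TOKEN = "[NL]"


def mark_newlines(text: str, token: str = NEWLINE_TOKEN) -> str:
    """Replace each newline with the placeholder before training or encoding.

    The placeholder is padded with spaces so it stays a separate word.
    """
    return text.replace("\r\n", "\n").replace("\n", f" {token} ")


def restore_newlines(text: str, token: str = NEWLINE_TOKEN) -> str:
    """Turn each placeholder (and any spaces around it) back into a newline.

    Intended as the last step, applied to decoded or generated text.
    """
    return re.sub(r" *" + re.escape(token) + r" *", "\n", text)


original = "This is the first line.\nThis is the second line."
marked = mark_newlines(original)       # text to feed to the tokenizer
restored = restore_newlines(marked)    # what the final pipeline step recovers

print(marked)
print(restored)
```

For this to work with the tokenizer from the question, the placeholder would probably have to be registered as a single token (for instance by listing it in the trainer's `special_tokens` or through `add_tokens`), since `BertPreTokenizer` would otherwise split it at the brackets. The training file would also need the same `mark_newlines` treatment. If the placeholder is registered as a special token, `decode` would likely need `skip_special_tokens=False` to keep it in the output.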
