I'm using the Hugging Face Transformers package to load a pretrained GPT-2 model. I want to use GPT-2 for text generation, but the pretrained version isn't enough, so I want to fine-tune it on a bunch of personal text data.
I'm not sure how to prepare the data and train the model. I have already tokenized the text data I want to train GPT-2 on, but I'm not sure what the "labels" should be for text generation, since this isn't a classification problem.
How do I train GPT-2 on this data using the Keras API?
My model:
modelName = "gpt2"
generator = pipeline('text-generation', model=modelName)
My tokenizer:
tokenizer = AutoTokenizer.from_pretrained(modelName)
My tokenized dataset:
from datasets import Dataset
def tokenize_function(examples):
    return tokenizer(examples['dataset'])  # the 'dataset' column contains a string of text; each row is one string (in sequence)
dataset = Dataset.from_pandas(conversation)
tokenized_dataset = dataset.map(tokenize_function, batched=False)
print(tokenized_dataset)
How should I use this tokenized dataset to fine-tune my GPT-2 model?
Here is my attempt:
"""
Datafile is a text file with one sentence per line _DATASETS/data.txt
tf_gpt2_keras_lora is the name of the fine-tuned model
"""
import tensorflow as tf
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel
from transformers.modeling_tf_utils import get_initializer
import os
# use 2 cores
tf.config.threading.set_intra_op_parallelism_threads(2)
tf.config.threading.set_inter_op_parallelism_threads(2)
# Use pretrained model if it exists
# otherwise download it
if os.path.exists("tf_gpt2_keras_lora"):
    print("Model exists")
    # use pretrained model
    model = TFGPT2LMHeadModel.from_pretrained("tf_gpt2_keras_lora")
else:
    print("Downloading model")
    model = TFGPT2LMHeadModel.from_pretrained("gpt2")
# Load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Load and preprocess the data
with open("_DATASETS/data.txt", "r") as f:
    lines = f.read().split("\n")
# Encode the data using the tokenizer and truncate the sequences to a maximum length of 1024 tokens
input_ids = []
for line in lines:
    encoding = tokenizer.encode(line, add_special_tokens=True, max_length=1024, truncation=True)
    input_ids.append(encoding)
# Define some params
batch_size = 2
num_epochs = 3
learning_rate = 5e-5
# Define the optimizer and loss function
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
# Fine-tune the model using low-rank adaptation and attention pruning
for layer in model.transformer.h:
    layer.attention_output_dense = tf.keras.layers.Dense(units=256, kernel_initializer=get_initializer(0.02), name="attention_output_dense")
model.summary()
# Train the model
for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}/{num_epochs}")
    # Shuffle the input data
    #input_ids = tf.random.shuffle(input_ids)
    for i in range(0, len(input_ids), batch_size):
        batch = input_ids[i:i+batch_size]
        # Pad the batch to the same length
        batch = tf.keras.preprocessing.sequence.pad_sequences(batch, padding="post")
        # Define the inputs and targets
        inputs = batch[:, :-1]
        targets = batch[:, 1:]
        # Compute the predictions and loss
        with tf.GradientTape() as tape:
            logits = model(inputs)[0]
            loss = loss_fn(targets, logits)
        # Compute the gradients and update the parameters
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        # Print the loss every 10 batches
        if i % (10 * batch_size) == 0:
            print(f"Batch {i}/{len(input_ids)} - loss: {loss:.4f}")
# Save the fine-tuned model
model.save_pretrained("tf_gpt2_keras_lora")
# Generate text using the fine-tuned model
input_ids = tokenizer.encode("How much wood", return_tensors="tf")
output = model.generate(input_ids, max_length=100, do_sample=True, top_k=50, top_p=0.95, temperature=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))
I'd suggest looking at this example provided by HuggingFace, which shows how to fine-tune a TensorFlow model for causal language modeling (i.e. text generation): https://github.com/huggingface/transformers/blob/main/examples/tensorflow/language-modeling/run_clm.py
Regarding your specific question about what the "labels" should be: HuggingFace transformer models let you pass a labels argument when you call the model. Its value should simply be the same as the tokenized input_ids, as described in the transformers documentation (https://huggingface.co/docs/transformers/v4.30.0/en/model_doc/gpt2#transformers.GPT2LMHeadModel.forward.labels):
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional): Labels for language modeling. Note that the labels are shifted inside the model, i.e. you can set labels = input_ids. Indices are selected in [-100, 0, ..., config.vocab_size]. All labels set to -100 are ignored (masked); the loss is only computed for labels in [0, ..., config.vocab_size].
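In other words, for causal language modeling the "labels" are just the input_ids themselves, and the model shifts them one position internally so each token predicts the next one. A minimal sketch of what that looks like with the TF model (the prompt is only an example):
from transformers import AutoTokenizer, TFGPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained("gpt2")

enc = tokenizer("How much wood would a woodchuck chuck", return_tensors="tf")

# Passing labels=input_ids makes the model compute the causal LM loss itself
outputs = model(enc["input_ids"], attention_mask=enc["attention_mask"], labels=enc["input_ids"])
print(outputs.loss)          # language-modeling loss computed from the shifted labels
print(outputs.logits.shape)  # (batch_size, sequence_length, vocab_size)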
Another important thing to note in the run_clm.py script shared above: once you have a tokenized dataset containing input_ids and labels columns, you need to convert it into a TensorFlow dataset object so it can be used with model.fit(). This is done with the prepare_tf_dataset() function, as shown here: https://github.com/huggingface/transformers/blob/main/examples/tensorflow/language-modeling/run_clm.py#L505-L516
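Putting this together with the tokenized_dataset from your question, a minimal Keras-style sketch could look like the following. It assumes your dataset only has the raw 'dataset' text column plus the tokenizer output, and the batch size, learning rate and epoch count are placeholders; DataCollatorForLanguageModeling with mlm=False pads each batch and builds the labels from input_ids for you:
import tensorflow as tf
from transformers import AutoTokenizer, TFGPT2LMHeadModel, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = TFGPT2LMHeadModel.from_pretrained("gpt2")

# Drop the raw text column so only model inputs are left
lm_dataset = tokenized_dataset.remove_columns(["dataset"])

# mlm=False means causal LM: the collator copies input_ids into labels and masks padding with -100
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="np")

tf_train_dataset = model.prepare_tf_dataset(
    lm_dataset,
    shuffle=True,
    batch_size=8,
    collate_fn=data_collator,
)

# With no explicit loss, the model falls back to its internal language-modeling loss
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5))
model.fit(tf_train_dataset, epochs=3)

model.save_pretrained("tf_gpt2_finetuned")
model.fit() then runs standard Keras training over the batches, which replaces the manual GradientTape loop in your attempt. You may also want truncation=True, max_length=1024 in your tokenize_function so every sequence fits GPT-2's context window.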
First, write your data out to a single plain text file; that is what TextDataset below reads.
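For example, assuming your data is still in the conversation dataframe from your question, with the text in its 'dataset' column, you could write it out like this (the file name is just a placeholder):
# Write each row of the 'dataset' column as one line of plain text
with open("data.txt", "w", encoding="utf-8") as f:
    for text in conversation["dataset"]:
        f.write(str(text).strip() + "\n")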
Import the libraries:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config, DataCollatorForLanguageModeling, Trainer, TrainingArguments, TextDataset
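The steps below also assume the base model and tokenizer are already loaded (using the classes imported above); a minimal sketch:
# Load the base GPT-2 checkpoint that will be fine-tuned
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")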
Tokenize the dataset:
dataset = TextDataset(tokenizer=tokenizer, file_path="Your File Path", block_size=128)  # block_size is required
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
Define the training arguments:
training_args = TrainingArguments(
    output_dir="./finetunedmodel",
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=1,  # start with a small batch size
    gradient_accumulation_steps=1,
    save_steps=1_000,
    save_total_limit=2,
    prediction_loss_only=True,
    logging_dir="./logs",
    fp16=True,  # mixed precision requires a GPU
    per_device_eval_batch_size=4,
    eval_accumulation_steps=2,
)
Instantiate the Trainer:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
Save the model:
model.save_pretrained("./medical_chatbot_finetuned")
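To sanity-check the fine-tuned model, you can reload it and generate from a prompt, mirroring the generation snippet in your question (the prompt and sampling settings are just examples):
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("./medical_chatbot_finetuned")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

input_ids = tokenizer.encode("How much wood", return_tensors="pt")
output = model.generate(input_ids, max_length=100, do_sample=True, top_k=50, top_p=0.95, temperature=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))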