Struggling to fine-tune LLaMA 3.2 models: why does the base model outperform Instruct in my use case?

Question

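I have been trying to fine-tune the LLaMA 3.2-Instruct model on my custom dataset, which is in a JSON-style chat format. The dataset is small (about 400 entries), and I cannot share it because it is confidential. After training and testing the model, I noticed that it did not adapt well to my dataset: the results seemed almost unchanged from the base model.

I tried modifying various parameters such as the learning rate, but the problem persisted. Interestingly, when I switched to fine-tuning the LLaMA-3.2-1B-bnb-4bit model with the Alpaca prompt, I observed noticeably better results for my specific use case. Many people say that Instruct models perform better on structured data, but that was not the case for me.

Although the LLaMA-3.2-1B-bnb-4bit model works better, the results are far from perfect, and I am not sure why these differences appear despite using the same dataset. Below is how I format the dataset, along with the parameters I use to fine-tune Llama-3.2-1B-Instruct-bnb-4bit (the Instruct model):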

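Dataset formatting code

Description: The code below shows how I format the dataset into a JSON-style chat format suitable for fine-tuning. I use the unsloth.chat_templates module to standardize the ShareGPT format and map my dataset into the required format. The dataset originally lives in a CSV file in which the conversations are stored as JSON strings.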


import json
import pandas as pd
from datasets import Dataset
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

# `tokenizer` is assumed to have been loaded earlier, together with the model.
# Attach the Llama 3.1 chat template to the tokenizer
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(examples):
    # Render each conversation into a single training string using the chat template
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

# Load the CSV
df = pd.read_csv('/content/conversation_style_dataset.csv')

# Parse the JSON strings in the "conversations" column
df['conversations'] = df['conversations'].apply(json.loads)

# Convert the pandas DataFrame to a Hugging Face Dataset
dataset = Dataset.from_pandas(df)

# Standardize the ShareGPT-style conversations, then apply the chat template
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched=True)
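
For readers unfamiliar with the ShareGPT layout, here is a minimal pure-Python sketch of the kind of conversion the standardization step is expected to perform on one cell of the "conversations" column. The helper name `sharegpt_to_messages`, the role mapping, and the example row are all assumptions made for illustration; this is not the unsloth implementation, and the real dataset is not shown in the question.

```python
import json

# Assumed ShareGPT-to-chat role mapping (illustrative, not taken from unsloth).
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_messages(raw_json):
    """Parse one JSON-encoded ShareGPT conversation into role/content messages."""
    turns = json.loads(raw_json)
    messages = []
    for turn in turns:
        role = ROLE_MAP.get(turn["from"])
        if role is None:
            # Fail loudly on unknown speakers instead of silently dropping turns.
            raise ValueError(f"unexpected speaker: {turn['from']!r}")
        messages.append({"role": role, "content": turn["value"]})
    return messages

# Made-up row standing in for one cell of the "conversations" column.
example_cell = json.dumps([
    {"from": "human", "value": "What is the capital of France?"},
    {"from": "gpt", "value": "Paris."},
])

messages = sharegpt_to_messages(example_cell)
print(messages)
```

Running a check like this over every row before training is a cheap way to catch malformed conversations, for example a missing assistant turn, which would otherwise only show up as poor fine-tuning results.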

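Fine-tuning parameters

Description: Below are the fine-tuning parameters I use with the Hugging Face SFTTrainer class. Because of limited hardware resources, I keep the training batch size small and adjust other parameters such as the learning rate and the optimizer. Training runs for a single epoch to save time while experimenting.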

from transformers import TrainingArguments, DataCollatorForSeq2Seq
from trl import SFTTrainer
from unsloth import is_bfloat16_supported

# Define the trainer for fine-tuning the model
trainer = SFTTrainer(
   model = model,
   tokenizer = tokenizer,
   train_dataset = dataset,
   dataset_text_field = "text",
   max_seq_length = max_seq_length,
   data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
   dataset_num_proc = 2,
   packing = False, # Can make training 5x faster for short sequences.
   args = TrainingArguments(
       per_device_train_batch_size = 2,
       gradient_accumulation_steps = 4,
       warmup_steps = 5,
       num_train_epochs = 1, # Set this for 1 full training run.
       learning_rate = 2e-4,
       fp16 = not is_bfloat16_supported(),
       bf16 = is_bfloat16_supported(),
       logging_steps = 1,
       optim = "adamw_8bit",
       weight_decay = 0.01,
       lr_scheduler_type = "linear",
       seed = 3407,
       output_dir = "outputs",
       report_to = "none", # Disables logging integrations; set to e.g. "wandb" to enable one
   ),
)

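My questions:

Why does fine-tuning LLaMA-3.2-1B-bnb-4bit with the Alpaca prompt give better results than LLaMA-3.2-Instruct-bnb-4bit, even when training on the same data?
What is the best way to format the dataset to get optimal results with the Instruct model?
Are there any resources or strategies for fine-tuning on such a small dataset?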

Answer 1

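Why does fine-tuning LLaMA-3.2-1B-bnb-4bit with the Alpaca prompt give better results than LLaMA-3.2-Instruct-bnb-4bit, even when training on the same data?

They are different models, so this is not a 1:1 comparison of training on the same data. The Instruct model has additionally been fine-tuned on instruction data.

The reason a base model can outperform an Instruct model in some cases is simple: fine-tuning pushes a model to learn certain patterns better, so it "forgets" some of the generalization from its earlier training in order to get better at the newly learned cases. This is a known behavior, often called catastrophic forgetting.

What is the best way to format the dataset to get optimal results with the Instruct model?

You need to read the provider's material carefully. You used the Alpaca format; I do not think that is the correct format for Llama-3.2. The Llama 3.2 format is different. See Meta's documentation: https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2/:

"The lightweight models share many characteristics with the Llama 3.1 text-only models. For information that is applicable across both sets of models, see the following sections on the Llama 3.1 page."

So apply the Llama 3.1 format: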

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 23 July 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

See: https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/#-instruct-model-prompt-
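
To make the layout above concrete, here is a small hand-written sketch that assembles a prompt from role/content messages using the same special tokens. The function name `build_llama3_prompt` is hypothetical, and the code only mirrors the example shown above; in practice the tokenizer's own chat template (as applied in the question via `apply_chat_template`) should be treated as the source of truth.

```python
def build_llama3_prompt(messages, add_generation_prompt=True):
    """Render role/content messages with the Llama 3.1 special tokens shown above."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn: a role header, a blank line, the content, then an end-of-turn token.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header asks the model to produce the next turn.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

system_text = (
    "Cutting Knowledge Date: December 2023\n"
    "Today Date: 23 July 2024\n\n"
    "You are a helpful assistant"
)
prompt = build_llama3_prompt([
    {"role": "system", "content": system_text},
    {"role": "user", "content": "What is the capital of France?"},
])
print(prompt)
```

For training data, the assistant's reply would be appended as a further message and `add_generation_prompt` set to False, which matches the `add_generation_prompt = False` setting used in the question.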

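Are there any resources or strategies for fine-tuning on such a small dataset?

Do your best to check and engineer the prompt first. Then, if that does not work, curate about 1000 samples that follow the distribution of your final dataset and try fine-tuning the model lightly, using only the last layer. See this excellent material on fine-tuning, also from Meta (the LIMA paper): https://arxiv.org/abs/2305.11206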
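
To put the dataset size in perspective, the rough arithmetic below estimates how many optimizer updates the settings from the question produce in one epoch. It assumes exactly 400 examples and a single GPU; both are assumptions, since the question only says "about 400 entries" and does not describe the hardware.

```python
import math

num_examples = 400      # "about 400 entries" in the question (assumed exact here)
per_device_batch = 2    # per_device_train_batch_size
grad_accum = 4          # gradient_accumulation_steps
num_epochs = 1          # num_train_epochs
warmup_steps = 5        # warmup_steps
num_devices = 1         # assumption: a single GPU

# Number of examples that contribute to each optimizer update.
effective_batch = per_device_batch * grad_accum * num_devices
# Optimizer updates per pass over the data, and over the whole run.
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {total_steps} (of which {warmup_steps} are warmup)")
```

Under these assumptions the run makes only 50 optimizer updates, 5 of which are warmup. That may be part of why the fine-tuned Instruct model looks almost unchanged, so training for a few more epochs while watching the loss is an inexpensive experiment to try alongside the suggestions above.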
