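I am working on a project that uses a pretrained SBERT model (specifically MiniLM) for text classification with 995 classes. I have mostly followed the steps listed here, and everything appears to run.
My problem shows up when I actually train the model. No matter what values I set in the training arguments, training always ends early and never gets through all the batches. For example, if I set
num_train_epochs=1
it only reaches 0.49 epochs at most. With num_train_epochs=4, it always ends at 3.49 epochs.
Here is my code: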
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    SentenceTransformerModelCardData,
)
from sentence_transformers.losses import BatchAllTripletLoss
from sentence_transformers.training_args import BatchSamplers
from sentence_transformers.evaluation import TripletEvaluator
model = SentenceTransformer(
    "nreimers/MiniLM-L6-H384-uncased",
    model_card_data=SentenceTransformerModelCardData(
        language="en",
        license="apache-2.0",
        model_name="all-MiniLM-L6-v2",
    ),
)
loss = BatchAllTripletLoss(model)
# Loss overview: https://www.sbert.net/docs/sentence_transformer/loss_overview.html
# This particular loss method: https://www.sbert.net/docs/package_reference/sentence_transformer/losses.html#batchalltripletloss
# training args: https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments
args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="finetune/model20240924",
    # Optional training parameters:
    num_train_epochs=1,
    max_steps=-1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=1e-5,
    warmup_ratio=0.1,
    fp16=True,  # Set to False if you get an error that your GPU can't run on FP16
    bf16=False,  # Set to True if you have a GPU that supports BF16
    batch_sampler=BatchSamplers.GROUP_BY_LABEL,
    # Optional tracking/debugging parameters:
    eval_strategy="no",
    eval_steps=100,
    save_strategy="epoch",
    # save_steps=100,
    save_total_limit=2,
    logging_steps=100,
    run_name="miniLm-triplet",  # Will be used in W&B if `wandb` is installed
)
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=trainDataset,
    eval_dataset=devDataset,
    loss=loss,
    # evaluator=dev_evaluator,
)
trainer.train()
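Note that I am not using an evaluator, because we build the model first and test it afterwards against a dedicated set of test values. My dataset is structured as follows: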
Dataset({
    features: ['Title', 'Body', 'label'],
    num_rows: 23961
})
The dev dataset has the same structure, just with fewer rows. Training gives the following output:
[1473/2996 57:06 < 59:07, 0.43 it/s, Epoch 0/1]
Step Training Loss
100 1.265600
200 0.702700
300 0.633900
400 0.505200
500 0.481900
600 0.306800
700 0.535600
800 0.369800
900 0.265400
1000 0.345300
1100 0.516700
1200 0.372600
1300 0.392300
1400 0.421900
TrainOutput(global_step=1473, training_loss=0.5003972503496366, metrics={'train_runtime': 3427.9198, 'train_samples_per_second': 6.99, 'train_steps_per_second': 0.874, 'total_flos': 0.0, 'train_loss': 0.5003972503496366, 'epoch': 0.4916555407209613})
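The numbers in this output can be cross-checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation (neither is stated in the post) and takes its inputs from the Dataset printout, the training arguments, and the TrainOutput line. It suggests the progress-bar total of 2996 is derived from the dataset size, while the dataloader ran out of batches after 1473 steps.

```python
import math

num_rows = 23961        # num_rows from the Dataset printout
batch_size = 8          # per_device_train_batch_size
steps_completed = 1473  # global_step from TrainOutput

# Assumption: one device, no gradient accumulation, last partial batch kept.
expected_steps = math.ceil(num_rows / batch_size)

# Fraction of the expected epoch that was actually covered.
epoch_fraction = steps_completed / expected_steps

print(expected_steps)
print(round(epoch_fraction, 4))
```

The computed fraction lines up with the 'epoch' value reported in the metrics, so the early stop looks like the batch source being exhausted rather than a training argument cutting the run short.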
No matter how I adjust the values, I cannot get it to run through all the batches. How can I fix this?
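I changed the batch_sampler training argument from
batch_sampler=BatchSamplers.GROUP_BY_LABEL
to
batch_sampler=BatchSamplers.NO_DUPLICATES
and the problem was solved. I originally chose GROUP_BY_LABEL because the documentation for this loss recommends it, but switching samplers seems to have resolved the issue.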
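Because GROUP_BY_LABEL is documented as putting multiple samples with the same label in each batch, the number of rows per label is a reasonable thing to inspect when that sampler yields fewer batches than expected. The following is only a diagnostic sketch, not a confirmed explanation of the behavior: `label_count_summary` is a hypothetical helper and the labels are toy data. With the real dataset, the input would presumably be trainDataset["label"].

```python
from collections import Counter


def label_count_summary(labels):
    """Summarize how many rows each label has."""
    counts = Counter(labels)
    sizes = sorted(counts.values())
    return {
        "num_labels": len(counts),
        "min_per_label": sizes[0],
        "max_per_label": sizes[-1],
        # Labels with a single row cannot form a same-label pair.
        "singletons": sum(1 for n in sizes if n == 1),
    }


# Toy labels for illustration only; replace with the real label column.
toy_labels = [0, 0, 0, 1, 1, 2, 3, 3, 3, 3]
summary = label_count_summary(toy_labels)
print(summary)
```

With 23961 rows spread over 995 classes, the average is only about 24 rows per label, so a skewed distribution could leave many labels with very few rows.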