I am trying to fine-tune a Transformer-based computer vision model, specifically the microsoft/swinv2-large-patch4-window12to16-192to256-22kto1k-ft model available through HuggingFace.
My training procedure is somewhat involved:

- The model I am training has a main output head for image classification, plus 8 supplementary output heads that predict attributes annotated on a synthetic dataset, including features such as object pose and lighting direction.
- Training uses both real and synthetic images. In each batch, the model is first trained on a batch of synthetic images, using an optimiser that updates all of the model's trainable parameters. It is then trained on a batch of ImageNet images using a "different" optimiser that only affects the parameters of the main classification head (the other attributes are not annotated in ImageNet). This two-stage approach is implemented as a form of replay, to avoid catastrophic forgetting.

Unfortunately, I have run into a problem I cannot resolve. Looking at the training and validation accuracies and losses that I log to TensorBoard, I see that the training accuracy of every task decreases over time while the training loss increases. Meanwhile, the validation accuracy and loss of both the real and the synthetic tasks remain completely constant.
Some sample plots covering 14 epochs of training are shown in the figure. All tasks are discretised, and the loss function used for every task is categorical cross-entropy. For the multi-task loss, the per-task losses are combined using a weighted sum; for the ImageNet-only loss, the raw categorical cross-entropy is used.
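For concreteness, multi_loss_fn is conceptually a weighted sum of per-task cross-entropies, along these lines (a minimal sketch; the task names and weights here are placeholders, and my real implementation is among the omitted steps below):

from torch import nn

class MultiTaskLoss(nn.Module):
    """Weighted sum of per-task categorical cross-entropy losses."""

    def __init__(self, weights):
        super().__init__()
        # weights: e.g. {"classification": 1.0, "pose": 0.5, "lighting": 0.5, ...}
        self.weights = weights
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, outputs, batch):
        # outputs[task]: logits for that task; batch[task]: integer class labels
        return sum(
            w * self.loss_fn(outputs[task], batch[task])
            for task, w in self.weights.items()
        )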
A simplified version of the code is provided below (with comments describing the omitted steps):
import torch
from itertools import chain
from peft import LoraConfig, get_peft_model
from torch.optim.lr_scheduler import StepLR

# Custom datasets are defined and initialised
# Custom model is defined and initialised
# Custom loss functions are defined and initialised
# Base model is loaded from HuggingFace and passed into the custom model

# LoRA is applied to the model
config = LoraConfig(
    r=20,
    lora_alpha=20,
    target_modules=["query", "value"],
    lora_dropout=0.2,
    bias="none",
    modules_to_save=[
        "dense_layer",
        "additional_dense_layer",
        "classifier",
        # Names of the other output layers
    ],
)
model = get_peft_model(model, config)
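# (Diagnostic suggestion, not part of my original run:) PEFT can report what
# remains trainable after wrapping, to confirm that the LoRA matrices and the
# modules_to_save heads are unfrozen while the rest of the backbone is frozen.
model.print_trainable_parameters()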
# Model is distributed over 3 GPUs using DistributedDataParallel
# The two optimisers are created
# One optimiser for updating all parameters on synthetic images
all_parameters = model.parameters()
optimizer_all = torch.optim.Adam(all_parameters, lr=0.01)
# And another for updating only the classification head on ImageNet images
lora_parameters = (
    param
    for name, param in model.module.named_parameters()
    if ("lora_A" in name or "lora_B" in name)
)
imagenet_parameters = chain(
    lora_parameters,
    model.module.classifier.parameters(),
    model.module.dense_layer.parameters(),
    model.module.additional_dense_layer.parameters(),
)
optimizer_imagenet = torch.optim.Adam(imagenet_parameters, lr=0.01)
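# Note: optimizer_imagenet covers the LoRA matrices plus the saved output
# layers, which are also covered by optimizer_all above; each Adam instance
# keeps its own independent momentum/variance state for these shared tensors.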
# Initialize the StepLR schedulers
scheduler_all = StepLR(optimizer_all, step_size=3, gamma=0.5)
scheduler_imagenet = StepLR(optimizer_imagenet, step_size=3, gamma=0.5)
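# With step_size=3 and gamma=0.5, both learning rates start at 0.01 and are
# halved every third epoch: 0.01 for epochs 0-2, 0.005 for epochs 3-5, etc.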
# Training loop (many logging steps omitted)
for epoch in range(start_epoch, args.stop_epoch):
    # Set the epoch for DistributedSampler
    synth_train_dl.sampler.set_epoch(epoch)
    imagenet_train_dl.sampler.set_epoch(epoch)
    for i, (synth_batch, imagenet_batch) in enumerate(
        zip(synth_train_dl, imagenet_train_dl)
    ):
        ###################
        # Synthetic stage
        # Move the batch tensors to the same device as the model
        synth_batch = {k: v.to(rank) for k, v in synth_batch.items()}
        # Zero the gradients
        optimizer_all.zero_grad()
        # Perform the forward pass
        synth_outputs = model(synth_batch["image"])
        # Compute the loss
        synth_loss = multi_loss_fn(synth_outputs, synth_batch)
        # Perform the backward pass
        synth_loss.backward()
        # Update the weights
        optimizer_all.step()
        ###################
        # ImageNet stage
        # Move the batch tensors to the same device as the model
        imagenet_batch = {k: v.to(rank) for k, v in imagenet_batch.items()}
        # Zero the gradients
        optimizer_imagenet.zero_grad()
        # Perform the forward pass
        imagenet_outputs = model(imagenet_batch["image"])
        # Compute the loss
        imagenet_loss = single_loss_fn(imagenet_outputs, imagenet_batch)
        # Perform the backward pass
        imagenet_loss.backward()
        # Update the weights
        optimizer_imagenet.step()
    # Step the schedulers once per epoch
    scheduler_all.step()
    scheduler_imagenet.step()
    # After each epoch, evaluate the model on the validation set
    model.eval()
    val_task_losses = {task: 0 for task in multi_loss_fn.weights.keys()}
    val_task_accuracies = {task: 0 for task in multi_loss_fn.weights.keys()}
    with torch.no_grad():
        val_loss = 0
        imagenet_loss = 0
        for i, (synth_batch, imagenet_batch) in enumerate(
            zip(synth_val_dl, imagenet_val_dl)
        ):
            ###################
            # Synthetic validation
            # Move the batch tensors to the same device as the model
            synth_batch = {k: v.to(rank) for k, v in synth_batch.items()}
            synth_outputs = model(synth_batch["image"])
            loss = multi_loss_fn(synth_outputs, synth_batch)
            val_loss += loss.item()
            # Accumulate loss and accuracy for each task on the validation set
            for task in multi_loss_fn.weights.keys():
                task_loss = multi_loss_fn.loss_fn(
                    synth_outputs[task], synth_batch[task]
                )
                task_acc = compute_accuracy(synth_outputs[task], synth_batch[task])
                val_task_losses[task] += task_loss.item()
                val_task_accuracies[task] += task_acc
            ###################
            # ImageNet validation
            # Move the batch tensors to the same device as the model
            imagenet_batch = {k: v.to(rank) for k, v in imagenet_batch.items()}
            imagenet_outputs = model(imagenet_batch["image"])
            loss = single_loss_fn(imagenet_outputs, imagenet_batch)
            imagenet_loss += loss.item()
    # Average the per-task metrics over the validation set and log them
    for task in multi_loss_fn.weights.keys():
        avg_task_loss = val_task_losses[task] / len(synth_val_dl)
        avg_task_acc = val_task_accuracies[task] / len(synth_val_dl)
        writer.add_scalar(f"Loss/val/{task}", avg_task_loss, epoch)
        writer.add_scalar(f"Accuracy/val/{task}", avg_task_acc, epoch)
    # Write the overall validation losses to TensorBoard
    val_loss /= len(synth_val_dl)
    writer.add_scalar("Synthetic Loss/val", val_loss, epoch)
    imagenet_loss /= len(imagenet_val_dl)
    writer.add_scalar("ImageNet Loss/val", imagenet_loss, epoch)
    # Print the validation loss
    print(f"Epoch {epoch+1}/{args.stop_epoch}, Validation Loss: {val_loss}")
    # Save the model and optimiser states
    save_checkpoint = {
        "epoch": epoch + 1,
        "model_state_dict": model.state_dict(),
        "all_optimizer_state_dict": optimizer_all.state_dict(),
        "single_optimizer_state_dict": optimizer_imagenet.state_dict(),
    }
    torch.save(save_checkpoint, f"checkpoints/checkpoint_epoch_{epoch+1}.pth")
    model.train()
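(For completeness: compute_accuracy, defined with the other omitted helpers, is just a plain top-1 accuracy function, essentially the following.)

def compute_accuracy(logits, labels):
    # Fraction of samples whose highest-scoring logit matches the label
    preds = logits.argmax(dim=-1)
    return (preds == labels).float().mean().item()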
Unfortunately, deploying and training this model is a very slow process, and I have little capacity for trial and error. I would be very grateful if anyone could point out what in my code might cause this behaviour, and I am happy to provide further information.
Initially I suspected that my learning rate was too low, so that the parameters were not changing sufficiently, but I now start from a learning rate of 0.01, which I consider fairly large, and still see the same problem.
The fact that the validation loss is "exactly the same" across epochs leads me to believe that the model is not training "at all". Given the results, this seems plausible: the datasets are shuffled during training, which could account for the high variance of the training loss. But I do not understand why this is happening to my model. I am also certain that the parameter sets passed to the optimisers are non-empty; see the check below.
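For what it's worth, this is roughly how I convinced myself of that (a quick sketch; count_params is a throwaway helper, not part of my training code):

def count_params(params):
    # Materialise the iterable and count the tensors that will actually
    # receive gradient updates
    params = list(params)
    trainable = [p for p in params if p.requires_grad]
    return len(params), len(trainable)

n_all, n_all_trainable = count_params(model.parameters())
print(f"optimizer_all: {n_all} tensors, {n_all_trainable} trainable")

n_lora, n_lora_trainable = count_params(
    param
    for name, param in model.module.named_parameters()
    if ("lora_A" in name or "lora_B" in name)
)
print(f"LoRA subset: {n_lora} tensors, {n_lora_trainable} trainable")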
Given my inexperience, I worry that this may be a simple oversight. Once again, any help is greatly appreciated.