High-variance training loss and constant validation loss when training a Transformer-based CV model in PyTorch

I am trying to fine-tune a Transformer-based computer vision model, specifically

model, which is available through HuggingFace.


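  1. Training on a synthetic image dataset and a subset of ImageNet
  2. Using low-rank adaptation (LoRA) to reduce the number of trainable parameters (paper, HuggingFace documentation)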
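To make the parameter saving from low-rank adaptation concrete, here is a rough back-of-the-envelope sketch. The rank `r=20` and the targeted `query`/`value` modules come from the `LoraConfig` in the code further down; the hidden size of 768 and the 12 layers are assumptions (typical of a ViT-Base-like encoder), not values stated in the question. The count also leaves out the output heads listed in `modules_to_save`, which are trained in full.

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    The original matrix stays frozen; LoRA learns B (d_out x r) and A (r x d_in).
    """
    return r * (d_out + d_in)


# Assumed values, for illustration only: hidden size and depth of a
# ViT-Base-like encoder. RANK matches r=20 in the question's LoraConfig.
HIDDEN = 768
LAYERS = 12
TARGETS_PER_LAYER = 2  # the "query" and "value" projections
RANK = 20

# Parameters in the targeted matrices if they were fine-tuned directly
full = HIDDEN * HIDDEN * TARGETS_PER_LAYER * LAYERS
# Parameters in the LoRA adapters attached to those same matrices
lora = lora_param_count(HIDDEN, HIDDEN, RANK) * TARGETS_PER_LAYER * LAYERS

print(f"full fine-tuning of targeted matrices: {full:,} parameters")
print(f"LoRA adapters at r={RANK}: {lora:,} parameters")
print(f"ratio: {lora / full:.3%}")
```

Under these assumptions the adapters hold only a few percent of the weights they stand in for, which is the point of item 2.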

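The model I am training has a main output head for image classification, plus 8 auxiliary output heads that predict attributes annotated in the synthetic dataset, such as the object's pose and the lighting direction.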

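Training uses both real and synthetic images. At each step, the model is first trained on a batch of synthetic images, using an optimizer that updates all of the model's trainable parameters. Next, the model is trained on a batch of ImageNet images using a "different" optimizer that only affects the parameters of the main classification head. (The other attributes are not annotated in ImageNet.) This scheme is implemented as a form of replay, to avoid catastrophic forgetting.

Unfortunately, I have run into a problem I cannot solve. Looking at the training and validation accuracy and loss that I log to TensorBoard, I see that the accuracy on each task's training data over time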



Some sample plots covering 14 training epochs are shown in this figure.

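All tasks are discretised, and the loss function used for every task is categorical cross-entropy. For the multi-task loss, the per-task losses are combined as a weighted sum. For the ImageNet-only loss, the plain categorical cross-entropy loss (CCEL) is used.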
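As a minimal sketch of the loss combination described above, written in plain Python rather than PyTorch so that it runs without any dependencies: the task names, weights, and logits below are hypothetical, and this is not the actual `multi_loss_fn` from the training script.

```python
import math


def cross_entropy(logits, target):
    """Categorical cross-entropy for a single sample, computed from raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def multi_task_loss(outputs, targets, weights):
    """Weighted sum of the per-task cross-entropy losses."""
    return sum(
        weights[task] * cross_entropy(outputs[task], targets[task])
        for task in weights
    )


# Hypothetical task names, weights and logits, for illustration only.
weights = {"class": 1.0, "pose": 0.5, "lighting": 0.5}
outputs = {
    "class": [2.0, 0.5, -1.0],
    "pose": [0.0, 0.0, 0.0, 0.0],
    "lighting": [1.0, 1.0],
}
targets = {"class": 0, "pose": 2, "lighting": 1}

loss = multi_task_loss(outputs, targets, weights)
print(f"multi-task loss: {loss:.4f}")
```

One property worth keeping in mind when reading the loss curves: a head that outputs identical logits for all K classes has a cross-entropy of exactly ln K, as the `pose` and `lighting` entries here do.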


    from itertools import chain

    import torch
    from peft import LoraConfig, get_peft_model
    from torch.optim.lr_scheduler import StepLR

    # Custom data sets are defined and initialised
    # Custom model is defined and initialised
    # Custom loss functions are defined and initialised
    # Base model is loaded from HuggingFace and passed into the custom model

    # LoRA is applied to the model
    config = LoraConfig(
        r=20,
        lora_alpha=20,
        target_modules=["query", "value"],
        lora_dropout=0.2,
        bias="none",
        modules_to_save=[
            "dense_layer",
            "additional_dense_layer",
            "classifier",
            # Names of the other output layers
        ],
    )
    model = get_peft_model(model, config)

    # Model is distributed over 3 GPUs using DistributedDataParallel

    # The two optimisers are created
    # One optimiser for updating all parameters on synthetic images
    all_parameters = model.parameters()
    optimizer_all = torch.optim.Adam(all_parameters, lr=0.01)

    # And one for when only updating the classification head on ImageNet images
    lora_parameters = (
        param
        for name, param in model.module.named_parameters()
        if ("lora_A" in name or "lora_B" in name)
    )
    imagenet_parameters = chain(
        lora_parameters,
        model.module.classifier.parameters(),
        model.module.dense_layer.parameters(),
        model.module.additional_dense_layer.parameters(),
    )
    optimizer_imagenet = torch.optim.Adam(imagenet_parameters, lr=0.01)

    # Initialize the StepLR schedulers
    scheduler_all = StepLR(optimizer_all, step_size=3, gamma=0.5)
    scheduler_imagenet = StepLR(optimizer_imagenet, step_size=3, gamma=0.5)

    # Training loop (many logging steps omitted)
    for epoch in range(start_epoch, args.stop_epoch):
        # Set the epoch for DistributedSampler
        synth_train_dl.sampler.set_epoch(epoch)
        imagenet_train_dl.sampler.set_epoch(epoch)

        for i, (synth_batch, imagenet_batch) in enumerate(
            zip(synth_train_dl, imagenet_train_dl)
        ):
            ###################
            # Synthetic stage
            # Move the batch tensors to the same device as the model
            synth_batch = {k: v.to(rank) for k, v in synth_batch.items()}
            # Zero the gradients
            optimizer_all.zero_grad()
            # Perform the forward pass
            synth_outputs = model(synth_batch["image"])
            # Compute the loss
            synth_loss = multi_loss_fn(synth_outputs, synth_batch)
            # Perform the backward pass
            synth_loss.backward()
            # Update the weights
            optimizer_all.step()

            ###################
            # ImageNet stage
            # Move the batch tensors to the same device as the model
            imagenet_batch = {k: v.to(rank) for k, v in imagenet_batch.items()}
            # Zero the gradients
            optimizer_imagenet.zero_grad()
            # Perform the forward pass
            imagenet_outputs = model(imagenet_batch["image"])
            # Compute the loss
            imagenet_loss = single_loss_fn(imagenet_outputs, imagenet_batch)
            # Perform the backward pass
            imagenet_loss.backward()
            # Update the weights
            optimizer_imagenet.step()

        # Step the schedulers
        scheduler_all.step()
        scheduler_imagenet.step()

        # After each epoch, evaluate the model on the validation set
        model.eval()
        val_task_losses = {task: 0 for task in multi_loss_fn.weights.keys()}
        val_task_accuracies = {task: 0 for task in multi_loss_fn.weights.keys()}
        with torch.no_grad():
            val_loss = 0
            imagenet_loss = 0
            for i, (synth_batch, imagenet_batch) in enumerate(
                zip(synth_val_dl, imagenet_val_dl)
            ):
                ###################
                # Synthetic validation
                # Move the batch tensors to the same device as the model
                synth_batch = {k: v.to(rank) for k, v in synth_batch.items()}
                synth_outputs = model(synth_batch["image"])
                loss = multi_loss_fn(synth_outputs, synth_batch)
                val_loss += loss.item()

                # Log loss and accuracy for each task on the validation set
                for task in multi_loss_fn.weights.keys():
                    task_loss = multi_loss_fn.loss_fn(
                        synth_outputs[task], synth_batch[task]
                    )
                    task_acc = compute_accuracy(synth_outputs[task], synth_batch[task])
                    val_task_losses[task] += task_loss.item()
                    val_task_accuracies[task] += task_acc

                ###################
                # ImageNet validation
                # Move the batch tensors to the same device as the model
                imagenet_batch = {k: v.to(rank) for k, v in imagenet_batch.items()}
                imagenet_outputs = model(imagenet_batch["image"])
                loss = single_loss_fn(imagenet_outputs, imagenet_batch)
                imagenet_loss += loss.item()

        for task in multi_loss_fn.weights.keys():
            avg_task_loss = val_task_losses[task] / len(synth_val_dl)
            avg_task_acc = val_task_accuracies[task] / len(synth_val_dl)
            writer.add_scalar(f"Loss/val/{task}", avg_task_loss, epoch)
            writer.add_scalar(f"Accuracy/val/{task}", avg_task_acc, epoch)

        # Write the overall validation loss to TensorBoard
        val_loss /= len(synth_val_dl)
        writer.add_scalar("Synthetic Loss/val", val_loss, epoch)
        imagenet_loss /= len(imagenet_val_dl)
        writer.add_scalar("ImageNet Loss/val", imagenet_loss, epoch)

        # Print the validation loss
        print(f"Epoch {epoch+1}/{args.stop_epoch}, Validation Loss: {val_loss}")

        # Save the model parameters
        save_checkpoint = {
            "epoch": epoch + 1,
            "model_state_dict": model.state_dict(),
            "all_optimizer_state_dict": optimizer_all.state_dict(),
            "single_optimizer_state_dict": optimizer_imagenet.state_dict(),
        }
        torch.save(save_checkpoint, f"checkpoints/checkpoint_epoch_{epoch+1}.pth")

        model.train()


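Initially I suspected that my learning rate was too low, so that the parameters were barely changing. However, I am now starting from a learning rate of 0.01, which I believe is fairly large, and I still see the same problem.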
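For reference, here is a small sketch of how the learning rate would evolve under the `StepLR` settings from the code (`step_size=3`, `gamma=0.5`, starting from 0.01). It reimplements the closed-form decay in plain Python rather than calling PyTorch, and it assumes `scheduler.step()` is called once per epoch; if it were called once per batch, the decay would count batches instead of epochs.

```python
def step_lr(base_lr: float, steps: int, step_size: int = 3, gamma: float = 0.5) -> float:
    """Learning rate after `steps` scheduler steps under a StepLR-style decay."""
    return base_lr * gamma ** (steps // step_size)


# Learning rate seen in each of the 14 epochs, assuming one scheduler step per epoch
for epoch in range(14):
    print(f"epoch {epoch:2d}: lr = {step_lr(0.01, epoch):.6f}")
```

Under that assumption the rate is halved four times over the 14 epochs shown in the plots.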


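Given my inexperience, I worry that this may be a simple oversight. Once again, any help is greatly appreciated.

Were you able to find the cause? I am facing a similar problem.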

