Detectron2: logging training and validation loss

Problem description

I want to train a Detectron2 model in AzureML. In AzureML it is possible to log metrics. By default, Detectron2 logs losses (total loss, classifier loss, bounding-box loss, etc.). However, I don't fully understand which loss this is (training or validation) and how it prevents overfitting (does it use the weights that achieve the lowest validation loss?). To better understand the training process, I want to log both the training and the validation loss in AzureML, but I'm not sure I'm doing this the right way. I read that this can be done with a hook (from https://github.com/facebookresearch/detectron2/issues/810), although I'm not sure what that means exactly (I'm new to this). Currently I have something like this:

# After setting up the cfg

import torch

from detectron2.engine import DefaultTrainer, HookBase
from detectron2.data import build_detection_train_loader
import detectron2.utils.comm as comm

class TrainingLoss(HookBase):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg.clone()
        # Training loss is computed on DATASETS.TRAIN, so the config is used as-is
        self._loader = iter(build_detection_train_loader(self.cfg))

    def after_step(self):
        data = next(self._loader)
        with torch.no_grad():
            loss_dict = self.trainer.model(data)

            losses = sum(loss_dict.values())
            assert torch.isfinite(losses).all(), loss_dict

            loss_dict_reduced = {"val_" + k: v.item() for k, v in
                                 comm.reduce_dict(loss_dict).items()}
            losses_reduced = sum(loss for loss in loss_dict_reduced.values())
            if comm.is_main_process():
                self.trainer.storage.put_scalars(total_train_loss=losses_reduced,
                                                 **loss_dict_reduced)

            print(f"Training Loss (Iteration {self.trainer.iter}): {losses_reduced}")

class ValidationLoss(HookBase):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg.clone()
        # Compute the loss on the validation split by pointing the train-style loader at DATASETS.TEST
        self.cfg.DATASETS.TRAIN = cfg.DATASETS.TEST
        self._loader = iter(build_detection_train_loader(self.cfg))

    def after_step(self):
        data = next(self._loader)
        with torch.no_grad():
            loss_dict = self.trainer.model(data)

            losses = sum(loss_dict.values())
            assert torch.isfinite(losses).all(), loss_dict

            loss_dict_reduced = {"val_" + k: v.item() for k, v in
                                 comm.reduce_dict(loss_dict).items()}
            losses_reduced = sum(loss for loss in loss_dict_reduced.values())
            if comm.is_main_process():
                self.trainer.storage.put_scalars(total_val_loss=losses_reduced,
                                                 **loss_dict_reduced)

            print(f"Vali Loss (Iteration {self.trainer.iter}): {losses_reduced}")



trainer = DefaultTrainer(cfg)
val_loss = ValidationLoss(cfg)
train_loss = TrainingLoss(cfg)
trainer.register_hooks([val_loss])
trainer.register_hooks([train_loss])
trainer.resume_or_load(resume=False)
trainer.train()

During training, it prints output like the following (the screenshot of the console output is not reproduced here).

The numbers in the screenshot may be hard to read, but whatever I print (training or validation loss), it does not match what Detectron2 logs by default. For example, the total_loss I compute for iteration 99 is 1.936, while Detectron2 logs 2.097. I understand that what I print is the training loss I compute myself, but the validation loss I compute also seems a bit off.

Does anyone know how these metrics should be logged correctly? How does Detectron2 actually compute the loss it logs? Does it save the weights that achieve the lowest validation loss, or does it only save them once the full number of iterations has finished?
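As an aside on the AzureML part of the question: once the hooks above have written their scalars into the trainer's EventStorage, one way (not from the original post) to surface them as AzureML metrics is a small extra hook that periodically forwards the latest values to the tracking service. The sketch below assumes the job is tracked with MLflow, which AzureML supports; AzureMLMetricsHook and its period argument are illustrative names, not detectron2 or AzureML API.

# Minimal sketch, assuming the AzureML run is tracked with MLflow.
import mlflow
from detectron2.engine import HookBase

class AzureMLMetricsHook(HookBase):  # hypothetical helper, not part of detectron2
    """Forward the latest EventStorage scalars to the AzureML/MLflow run."""

    def __init__(self, period=20):
        self._period = period  # log every `period` iterations to keep the run light

    def after_step(self):
        if (self.trainer.iter + 1) % self._period != 0:
            return
        # EventStorage.latest() returns {name: (value, iteration)}, including the
        # train_/val_ scalars written by the loss hooks above.
        for name, (value, it) in self.trainer.storage.latest().items():
            mlflow.log_metric(name, float(value), step=it)

# Register it alongside the loss hooks, e.g.:
# trainer.register_hooks([AzureMLMetricsHook(period=20)])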

logging loss detectron

1 Answer

To log the training and validation loss properly, you probably don't need separate hooks for training and validation. Instead, you can modify your existing hook, or create a single new hook that handles both:

import torch

from detectron2.engine import DefaultTrainer, HookBase
from detectron2.data import build_detection_train_loader
import detectron2.utils.comm as comm

class LossHook(HookBase):
    def __init__(self, cfg, is_validation=False):
        super().__init__()
        self.cfg = cfg.clone()
        # Use the validation split (DATASETS.TEST) when is_validation=True, otherwise the training split
        self.cfg.DATASETS.TRAIN = self.cfg.DATASETS.TEST if is_validation else self.cfg.DATASETS.TRAIN
        self._loader = iter(build_detection_train_loader(self.cfg))
        self.loss_prefix = "val_" if is_validation else "train_"

    def after_step(self):
        data = next(self._loader)
        with torch.no_grad():
            loss_dict = self.trainer.model(data)

            losses = sum(loss_dict.values())
            assert torch.isfinite(losses).all(), loss_dict

            loss_dict_reduced = {self.loss_prefix + k: v.item() for k, v in
                                 comm.reduce_dict(loss_dict).items()}
            losses_reduced = sum(loss for loss in loss_dict_reduced.values())
            if comm.is_main_process():
                # Prefix the total so the train/val hooks don't overwrite each other
                # (or detectron2's own "total_loss" scalar)
                self.trainer.storage.put_scalars(**{self.loss_prefix + "total_loss": losses_reduced},
                                                 **loss_dict_reduced)

            print(f"{self.loss_prefix.capitalize()}Loss (Iteration {self.trainer.iter}): {losses_reduced}")

# Training code
trainer = DefaultTrainer(cfg)
train_loss_hook = LossHook(cfg, is_validation=False)
val_loss_hook = LossHook(cfg, is_validation=True)
trainer.register_hooks([train_loss_hook, val_loss_hook])
trainer.resume_or_load(resume=False)
trainer.train()
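Two follow-up notes on the original question. First, the mismatch with Detectron2's own numbers is expected: the console writer (CommonMetricPrinter) reports smoothed values (by default a median over a window of the last 20 iterations) rather than the raw per-iteration loss, so a per-step printout will not line up exactly with the logged total_loss. Second, DefaultTrainer does not select weights by validation loss at all: it simply writes a checkpoint every cfg.SOLVER.CHECKPOINT_PERIOD iterations plus model_final.pth at the end. If you want to keep the weights with the lowest validation loss, newer detectron2 releases include a BestCheckpointer hook that watches a scalar in the event storage. The sketch below (an assumption-laden example building on the LossHook above, not part of the original answer) wires it to the val_total_loss scalar written by the validation hook:

# Sketch, assuming a detectron2 version that ships BestCheckpointer (detectron2.engine.hooks)
from detectron2.engine import DefaultTrainer
from detectron2.engine.hooks import BestCheckpointer

trainer = DefaultTrainer(cfg)
trainer.register_hooks([
    LossHook(cfg, is_validation=False),
    LossHook(cfg, is_validation=True),
    BestCheckpointer(
        eval_period=20,                      # how often to check the metric; the hook writes it every step
        checkpointer=trainer.checkpointer,   # DefaultTrainer's DetectionCheckpointer
        val_metric="val_total_loss",         # scalar name written by the validation LossHook above
        mode="min",                          # lower validation loss is better
    ),
])
trainer.resume_or_load(resume=False)
trainer.train()

With this in place, the best-so-far weights should be saved as model_best.pth in cfg.OUTPUT_DIR alongside the periodic checkpoints.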