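I just finished training a model with PyTorch Lightning (2000 epochs). I thought PL logged to TensorBoard automatically, but I'm not sure. This is what my training step returns: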
log = {
    "total_reward": torch.tensor(self.total_reward).to(device),
    "reward": torch.tensor(reward).to(device),
    "train_loss": loss,
}
status = {
    "steps": torch.tensor(self.global_step).to(device),
    "total_reward": torch.tensor(self.total_reward).to(device),
}
return OrderedDict({"loss": loss, "log": log, "progress_bar": status})
This is the structure of my lightning_logs folder:
.
├── version_0
│ ├── checkpoints
│ │ └── epoch=2-step=191.ckpt
│ └── hparams.yaml
├── version_1
│ ├── checkpoints
│ │ └── epoch=2-step=191.ckpt
│ └── hparams.yaml
└── version_2
  ├── checkpoints
  │ └── epoch=2-step=191.ckpt
  └── hparams.yaml
6 directories, 6 files
Running TensorBoard:
tensorboard --logdir=lightning_logs
2022-02-21 19:41:13.915945: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-02-21 19:41:13.915968: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-02-21 19:41:15.602607: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-02-21 19:41:15.602639: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-02-21 19:41:15.602653: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (scrungus-pc): /proc/driver/nvidia/version does not exist
But when I open TensorBoard, I get:
No dashboards are active for the current data set.
What am I doing wrong?
In PyTorch Lightning, you can use the self.log method to log metrics such as loss to TensorBoard (or any other logger). For example:
def training_step(self, batch, batch_idx):
    # Your training logic here
    loss = ...
    self.log('loss', loss)  # Logs the loss to TensorBoard
    return loss
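Every value you log with self.log automatically gets its own plot in the TensorBoard UI. By default, PyTorch Lightning uses TensorBoard as its logger, but you can swap or customize the logger by passing the logger argument to the Trainer. For example: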
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger
# Example of using WandbLogger instead of TensorBoard
wandb_logger = WandbLogger(project="my-project")
trainer = Trainer(logger=wandb_logger)
With the default TensorBoard logger you don't need any extra setup. The logged values (loss, accuracy, and so on) show up as separate charts in the TensorBoard UI.
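To confirm whether anything was actually written, you can look for TensorBoard event files, whose names start with events.out.tfevents. The helper below is a small sketch using only the standard library; find_event_files is a name I made up, and "lightning_logs" is the directory from the question.

```python
from pathlib import Path


def find_event_files(logdir):
    """Return all TensorBoard event files found anywhere under logdir."""
    root = Path(logdir)
    if not root.is_dir():
        return []
    return sorted(root.rglob("events.out.tfevents.*"))


if __name__ == "__main__":
    files = find_event_files("lightning_logs")
    if files:
        for f in files:
            print(f)
    else:
        print("No event files found, so TensorBoard has nothing to display.")
```

If this finds nothing, as the folder listing in the question suggests, the "No dashboards are active" message is expected: the training run saved checkpoints and hparams.yaml but never logged a metric.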