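I am training a model with YOLOv8. Everything was working fine, but today I noticed the warning below, which appears to be related to PyTorch and cuDNN. Despite the warning, training still seems to be making progress. I don't know whether it has any negative effect on training.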
site-packages/torch/autograd/graph.py:744: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
What is the problem, and how can I fix it?
Here is the output of collect_env:
Collecting environment information...
PyTorch version: 2.3.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.31
Python version: 3.9.7 | packaged by conda-forge | (default, Sep 2 2021, 17:58:34) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-69-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100 80GB PCIe
Nvidia driver version: 515.105.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.8.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] onnx==1.16.0
[pip3] onnxruntime==1.17.3
[pip3] onnxruntime-gpu==1.17.1
[pip3] onnxsim==0.4.36
[pip3] optree==0.11.0
[pip3] torch==2.3.0+cu118
[pip3] torchaudio==2.3.0+cu118
[pip3] torchvision==0.18.0+cu118
[pip3] triton==2.3.0
[conda] numpy 1.24.4 pypi_0 pypi
[conda] pytorch-quantization 2.2.1 pypi_0 pypi
[conda] torch 2.1.1+cu118 pypi_0 pypi
[conda] torchaudio 2.1.1+cu118 pypi_0 pypi
[conda] torchmetrics 0.8.0 pypi_0 pypi
[conda] torchvision 0.16.1+cu118 pypi_0 pypi
[conda] triton 2.1.0 pypi_0 pypi
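In PyTorch 2.3.0, this unwanted warning is printed even when no exception is thrown; see https://github.com/pytorch/pytorch/pull/125790
As you mentioned, training is proceeding correctly. If you want to get rid of the warning, you should revert to torch 2.2.2 (you then also have to revert torchvision to 0.17.2):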
pip3 install torchvision==0.17.2
pip3 install torch==2.2.2
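If you would rather not change versions, one possible alternative is to filter the message with Python's standard warnings module. This is only a sketch, not something taken from the linked pull request: it assumes the warning reaches Python as a UserWarning (as the traceback in the question suggests) and that its text starts with "Plan failed with a cudnnException". The helper name silence_cudnn_plan_warning is made up for illustration.

```python
import warnings

# Prefix of the warning text quoted in the question. filterwarnings treats
# `message` as a regex that must match the start of the warning message.
CUDNN_PLAN_WARNING = r"Plan failed with a cudnnException"


def silence_cudnn_plan_warning():
    """Ignore only the cuDNN 'Plan failed' UserWarning; other warnings still show."""
    # Hypothetical helper: it hides the message, it does not change cuDNN behaviour.
    warnings.filterwarnings(
        "ignore",
        message=CUDNN_PLAN_WARNING,
        category=UserWarning,
    )


silence_cudnn_plan_warning()  # call once, before training starts
```

The filter is deliberately narrow, so any other UserWarning raised during training is still printed.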
June 2024 solution: upgrading torch to 2.3.1 fixes this:
pip3 install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
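To check whether a given environment is on the affected release, a small version check along these lines may help. It is a sketch that relies only on what this thread reports (2.3.0 prints the warning, 2.2.2 and 2.3.1 do not); the function names parse_release and prints_cudnn_plan_warning are hypothetical.

```python
import re


def parse_release(version):
    """Return (major, minor, patch) from a string such as '2.3.0+cu118'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def prints_cudnn_plan_warning(version):
    """True if this torch version is the release reported to print the warning."""
    # Assumption taken from this thread: only the 2.3.0 release is affected.
    return parse_release(version) == (2, 3, 0)


# Versions mentioned in the question and answers.
for candidate in ("2.2.2", "2.3.0+cu118", "2.3.1"):
    print(candidate, prints_cudnn_plan_warning(candidate))
```

In a real session you could pass torch.__version__ instead of the hard-coded strings.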