UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR

Question  Votes: 0  Answers: 2

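I am trying to train a model with YOLOv8. Everything was working fine, but today I suddenly noticed this warning, which is apparently related to PyTorch and cuDNN. Despite the warning, training still seems to be progressing, and I don't know whether it has any negative effect on the training run.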

site-packages/torch/autograd/graph.py:744: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass

What is the problem, and how can I fix it?

Here is the output of collect_env:

Collecting environment information...
PyTorch version: 2.3.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.31
Python version: 3.9.7 | packaged by conda-forge | (default, Sep  2 2021, 17:58:34)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-69-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100 80GB PCIe
Nvidia driver version: 515.105.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.8.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture:                    x86_64

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] onnx==1.16.0
[pip3] onnxruntime==1.17.3
[pip3] onnxruntime-gpu==1.17.1
[pip3] onnxsim==0.4.36
[pip3] optree==0.11.0
[pip3] torch==2.3.0+cu118
[pip3] torchaudio==2.3.0+cu118
[pip3] torchvision==0.18.0+cu118
[pip3] triton==2.3.0
[conda] numpy                     1.24.4                   pypi_0    pypi
[conda] pytorch-quantization      2.2.1                    pypi_0    pypi
[conda] torch                     2.1.1+cu118              pypi_0    pypi
[conda] torchaudio                2.1.1+cu118              pypi_0    pypi
[conda] torchmetrics              0.8.0                    pypi_0    pypi
[conda] torchvision               0.16.1+cu118             pypi_0    pypi
[conda] triton                    2.1.0                    pypi_0    pypi

python pytorch nvidia torchvision cudnn
2 Answers
6 votes

PyTorch 2.3.0 prints this spurious warning even though no exception is raised: see https://github.com/pytorch/pytorch/pull/125790

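As you mentioned, training is proceeding correctly. If you want to get rid of the warning, you can revert to torch 2.2.2 (in which case you also have to revert torchvision to 0.17.2):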

pip3 install torchvision==0.17.2
pip3 install torch==2.2.2
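
If changing the installed version is not convenient, another option is to hide only this message with Python's standard warnings filter. This is a minimal sketch, not a fix for the underlying cuDNN plan failure: it assumes the warning text begins with "Plan failed with a cudnnException", as in the log quoted in the question, and that hiding it is acceptable because training is unaffected. The helper name silence_cudnn_plan_warning is hypothetical.

```python
import warnings

# Assumption: this prefix matches the UserWarning text quoted in the question.
CUDNN_PLAN_WARNING = r"Plan failed with a cudnnException"


def silence_cudnn_plan_warning():
    """Ignore only the cuDNN plan UserWarning; all other warnings still show."""
    warnings.filterwarnings(
        "ignore",
        message=CUDNN_PLAN_WARNING,  # regex matched against the start of the message
        category=UserWarning,
    )


# Call once, before training starts.
silence_cudnn_plan_warning()
```

Because the filter matches on the message prefix, unrelated UserWarnings from PyTorch or other libraries are still reported.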

0 votes

Solution as of June 2024: upgrading torch to 2.3.1 fixes it:

pip3 install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
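
To check whether a given environment is on the release that shows the warning, the version string can be compared against the numbers mentioned in this thread. This is a rough sketch that assumes only 2.3.0 is affected, as the answers here suggest. is_affected_torch_version is a hypothetical helper, and in a real session you would pass it torch.__version__.

```python
def is_affected_torch_version(version: str) -> bool:
    """Return True if the release part of the version string is exactly 2.3.0."""
    # Drop a local build tag such as "+cu118" before comparing.
    release = version.split("+", 1)[0]
    return release == "2.3.0"


# Example with the version strings that appear in this thread.
print(is_affected_torch_version("2.3.0+cu118"))  # True
print(is_affected_torch_version("2.3.1+cu118"))  # False
```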
