我有一个带有 4 个 Nvidia K80 GPU 和 Ubuntu 20.04 (python 3.8) 的 Google 云计算引擎。当我尝试训练 yolo5 模型时,出现以下错误:
RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[W CUDAGuardImpl.h:113] Warning: CUDA warning: the launch timed out and was terminated (function destroyEvent)
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1230 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f62be2c17d2 in /home/cheyuxuanll/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x239de (0x7f62f6ea69de in /home/cheyuxuanll/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x22d (0x7f62f6ea857d in /home/cheyuxuanll/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x300568 (0x7f63736d9568 in /home/cheyuxuanll/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f62be2aa005 in /home/cheyuxuanll/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #5: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x2e9 (0x7f62fa8ca5e9 in /home/cheyuxuanll/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::Reducer::~Reducer() + 0x205 (0x7f62fa8bcd25 in /home/cheyuxuanll/.local/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f6373bb7212 in /home/cheyuxuanll/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f63735c7506 in /home/cheyuxuanll/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x7e182f (0x7f6373bba82f in /home/cheyuxuanll/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x1f5b20 (0x7f63735ceb20 in /home/cheyuxuanll/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x1f6cce (0x7f63735cfcce in /home/cheyuxuanll/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #12: /usr/bin/python3() [0x5d1ec4]
frame #13: /usr/bin/python3() [0x5a958d]
frame #14: /usr/bin/python3() [0x5ed1a0]
frame #15: /usr/bin/python3() [0x544188]
frame #16: /usr/bin/python3() [0x5441da]
frame #17: /usr/bin/python3() [0x5441da]
frame #18: PyDict_SetItemString + 0x538 (0x5ce7c8 in /usr/bin/python3)
frame #19: PyImport_Cleanup + 0x79 (0x685179 in /usr/bin/python3)
frame #20: Py_FinalizeEx + 0x7f (0x68040f in /usr/bin/python3)
frame #21: Py_RunMain + 0x32d (0x6b7a1d in /usr/bin/python3)
frame #22: Py_BytesMain + 0x2d (0x6b7c8d in /usr/bin/python3)
frame #23: __libc_start_main + 0xf3 (0x7f6378be40b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #24: _start + 0x2e (0x5fb12e in /usr/bin/python3)
我正在用这个命令训练这个模型:
python3 -m torch.distributed.run --nproc_per_node 4 train.py --batch 16 --data coco128.yaml --weights yolov5s.pt --device 0,1,2,3
我在这里遗漏了什么吗?
谢谢
我们还在 Google Cloud 中运行 CUDA,当您发布问题时,我们的服务器大致重新启动。虽然我们无法检测到任何更改,但由于“运行时错误:没有可用的 CUDA GPU”,我们的服务无法启动。 所以有一些相似之处,但也有一些差异。
无论如何,我们选择了好的卸载并重新安装并修复了它:
卸载:
sudo apt-get --purge remove "*cublas*" "cuda*" "nsight*"
sudo apt-get --purge remove "*nvidia*"
加上删除 /usr/local/*cuda* 中的所有内容
安装:
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/ /"
sudo add-apt-repository contrib
sudo apt-get update
sudo apt-get -y install cuda-11-3
我们还重新安装了 CUDNN,但这可能是也可能不是您堆栈的一部分。