I'm getting this stack trace:
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from getDevice at ../c10/cuda/impl/CUDAGuardImpl.h:39 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x79b727243612 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1a14b (0x79b72761a14b in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x3637f3a (0x79b75b237f3a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x2a (0x79b75b238eba in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x5c (0x79b77112328c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xdc253 (0x79b7de6c6253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #6: <unknown function> + 0x94ac3 (0x79b7dfd78ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #7: clone + 0x44 (0x79b7dfe09814 in /lib/x86_64-linux-gnu/libc.so.6)
This happens when I try to run on GKE Autopilot. What is going on?
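I had missed the part of their documentation that explains that I need to set
LD_LIBRARY_PATH
correctly in the container definition. I did not realize this was required on GKE Autopilot, but it turns out it is.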
Adding this environment variable fixed things for me:
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/cuda-12.3/lib64
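
If it helps to confirm the variable actually took effect inside the container, here is a minimal sketch that lists which LD_LIBRARY_PATH entries contain an NVIDIA driver library. The helper name find_cuda_driver_libs is hypothetical, and it assumes the driver library is exposed under the usual libcuda.so* name; it only inspects the filesystem and does not prove that CUDA will initialize.

```python
import glob
import os


def find_cuda_driver_libs(ld_library_path=None):
    """Map each LD_LIBRARY_PATH entry to the libcuda.so* files found in it.

    Hypothetical helper for illustration; assumes the NVIDIA driver
    library follows the usual libcuda.so* naming.
    """
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    report = {}
    for directory in ld_library_path.split(":"):
        if not directory:
            continue  # skip empty entries such as a trailing colon
        pattern = os.path.join(glob.escape(directory), "libcuda.so*")
        report[directory] = sorted(glob.glob(pattern))
    return report


if __name__ == "__main__":
    # Print one line per search directory so a missing driver path is obvious.
    for directory, libs in find_cuda_driver_libs().items():
        print(directory, "->", libs or "no libcuda.so* found")
```

Running this inside the pod before importing torch should show at least one directory with a libcuda.so* match. If every entry comes back empty, or nothing is printed at all, the variable is probably not set the way the container definition intended.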