For some reason, none of my CUDA Docker containers can see my GPU.
When I run this:
docker run --gpus=all --rm nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
I get this output:
...
Error: only 0 Devices available, 1 requested. Exiting.
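The CUDA container cannot find my GPU.
I found many similar questions on forums, but no satisfactory answers. Has anyone found the cause of this on WSL2 / Docker Desktop / Win10 / Ubuntu 20.04? I have the latest versions of CUDA and the NVIDIA driver, as well as the latest versions of WSL2 and Docker Desktop.
However, both nvidia-smi and nvcc --version work: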
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0
$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.03 Driver Version: 555.85 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti On | 00000000:01:00.0 On | N/A |
| 0% 53C P8 16W / 165W | 1045MiB / 16380MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 21 G /Xwayland N/A |
| 0 N/A N/A 23 G /Xwayland N/A |
+-----------------------------------------------------------------------------------------+
This also works, so the problem appears to be purely CUDA-related:
/mnt/c/Users/pavel$ docker run --rm --gpus=all ubuntu nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 4060 Ti (UUID: GPU-28820d91-b332-b4ba-f1c8-5508048ce1f7)
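One possible reading of the two results above, offered as an assumption rather than a diagnosis: nvidia-smi -L only enumerates GPUs through the driver's management interface, while the nbody sample has to initialise a CUDA device, so the first can succeed while the second fails. Below is a small shell sketch for telling the two outcomes apart in a script. The name classify_cuda_output is hypothetical, and the matched string is taken from the error output quoted earlier in this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Docker or the NVIDIA tooling).
# Reads a container's output on stdin and prints:
#   "no-cuda-device"   if it contains the "only 0 Devices available" error
#   "error-not-found"  otherwise (this alone does not prove CUDA works)
classify_cuda_output() {
  if grep -q 'only 0 Devices available'; then
    echo "no-cuda-device"
  else
    echo "error-not-found"
  fi
}

# Usage sketch (needs Docker with GPU support, so it is left commented out):
# docker run --gpus=all --rm nvcr.io/nvidia/k8s/cuda-sample:nbody \
#   nbody -gpu -benchmark 2>&1 | classify_cuda_output
```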
My environment:
wsl --version
WSL version: 2.1.5.0
Kernel version: 5.15.146.1-2
WSLg version: 1.0.60
MSRDC version: 1.2.5105
Direct3D version: 1.611.1-81528511
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.19045.4412
docker info
Client:
Version: 26.1.1
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.14.0-desktop.1
Path: C:\Program Files\Docker\cli-plugins\docker-buildx.exe
compose: Docker Compose (Docker Inc.)
Version: v2.27.0-desktop.2
Path: C:\Program Files\Docker\cli-plugins\docker-compose.exe
debug: Get a shell into any image or container (Docker Inc.)
Version: 0.0.29
Path: C:\Program Files\Docker\cli-plugins\docker-debug.exe
dev: Docker Dev Environments (Docker Inc.)
Version: v0.1.2
Path: C:\Program Files\Docker\cli-plugins\docker-dev.exe
extension: Manages Docker extensions (Docker Inc.)
Version: v0.2.23
Path: C:\Program Files\Docker\cli-plugins\docker-extension.exe
feedback: Provide feedback, right in your terminal! (Docker Inc.)
Version: v1.0.4
Path: C:\Program Files\Docker\cli-plugins\docker-feedback.exe
init: Creates Docker-related starter files for your project (Docker Inc.)
Version: v1.1.0
Path: C:\Program Files\Docker\cli-plugins\docker-init.exe
sbom: View the packaged-based Software Bill Of Materials (SBOM) for an image (Anchore Inc.)
Version: 0.6.0
Path: C:\Program Files\Docker\cli-plugins\docker-sbom.exe
scout: Docker Scout (Docker Inc.)
Version: v1.8.0
Path: C:\Program Files\Docker\cli-plugins\docker-scout.exe
Server:
Containers: 11
Running: 6
Paused: 0
Stopped: 5
Images: 47
Server Version: 26.1.1
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 nvidia runc
Default Runtime: runc
Init Binary: docker-init
containerd version: e377cd56a71523140ca6ae87e30244719194a521
runc version: v1.1.12-0-g51d5e94
init version: de40ad0
Security Options:
seccomp
Profile: unconfined
Kernel Version: 5.15.146.1-microsoft-standard-WSL2
Operating System: Docker Desktop
OSType: linux
Architecture: x86_64
CPUs: 5
Total Memory: 39.18GiB
Name: docker-desktop
ID: 88425de8-c396-4a90-9fea-afb64822deaa
Docker Root Dir: /var/lib/docker
Debug Mode: false
HTTP Proxy: http.docker.internal:3128
HTTPS Proxy: http.docker.internal:3128
No Proxy: hubproxy.docker.internal
Labels:
com.docker.desktop.address=npipe://\\.\pipe\docker_cli
Experimental: false
Insecure Registries:
hubproxy.docker.internal:5555
127.0.0.0/8
Live Restore Enabled: false
WARNING: No blkio throttle.read_bps_device support
WARNING: No blkio throttle.write_bps_device support
WARNING: No blkio throttle.read_iops_device support
WARNING: No blkio throttle.write_iops_device support
WARNING: daemon is not using the default seccomp profile
nvidia-smi
Fri May 31 08:47:59 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.85 Driver Version: 555.85 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti WDDM | 00000000:01:00.0 On | N/A |
| 0% 54C P8 16W / 165W | 1083MiB / 16380MiB | 2% Default |
| | | N/A |
Ubuntu 20.04
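I have already tried several reinstalls and clean installs of the NVIDIA driver, the CUDA Toolkit, the NVIDIA Container Toolkit, and so on.
From what I have found, there is a lot of variation in what people had to install in a Win10/WSL2 environment to get CUDA working. Some installed only the latest NVIDIA driver. Some installed both the CUDA Toolkit for Windows 10 and the CUDA Toolkit for WSL-Ubuntu. Some also had to install the NVIDIA Container Toolkit, while others did not.
I am stuck in an endless loop of trying every possible combination of installs, but it seems I am missing something.
Has anyone run into the same situation and found a solution? Thanks!!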
If anyone runs into this problem with driver version 555.85 (CUDA 12.5), the solution is to downgrade to 552.22 (CUDA 12.4) (https://www.nvidia.com/download/driverResults.aspx/224154/en-us/)
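To see whether a machine is on the driver version mentioned here, something like the following sketch may help. check_driver_version is a hypothetical helper, and the only version it flags is the one reported in this thread (555.85); other versions are simply not listed here, which says nothing about whether they work.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints "affected" for the driver version reported as
# problematic in this thread (555.85), and "not-listed" for anything else.
check_driver_version() {
  case "$1" in
    555.85) echo "affected" ;;
    *)      echo "not-listed" ;;
  esac
}

# Usage sketch (needs a working nvidia-smi, so it is left commented out):
# check_driver_version "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
```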
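If you only need to run CUDA containers, the steps are:
- Remove all nvidia and cuda packages from WSL.
- Uninstall the CUDA Toolkit in Windows.
- Download the 552.22 driver, which includes CUDA 12.4.
- Run a clean install.
- Reboot and run the sample again: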
docker run --gpus all --rm nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
It should work now :)
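For the first step (removing the nvidia and cuda packages from WSL), it may be safer to list the candidates before purging anything. The sketch below assumes a Debian/Ubuntu distribution inside WSL. filter_gpu_packages is a hypothetical helper, and the name pattern (packages starting with nvidia, cuda, libnvidia or libcuda) is a guess that may match more or less than what is actually installed, so review the list by hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads package names (one per line) on stdin and keeps
# those whose name starts with nvidia, cuda, libnvidia or libcuda.
# "|| true" keeps the exit status at 0 when nothing matches.
filter_gpu_packages() {
  grep -E '^(lib)?(nvidia|cuda)' || true
}

# Usage sketch inside the WSL distribution (left commented out):
# dpkg-query -W -f='${Package}\n' | filter_gpu_packages
# After reviewing the list, the packages can be removed with apt-get purge.
```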