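I'm having trouble running Docker containers that need GPU access inside an LXC container. Standard Docker containers run fine, but when I try to use the NVIDIA GPU by adding `--gpus=all` or `--runtime=nvidia`, the container fails to start.
The error message I receive is: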
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted: unknown.
`nvidia-smi` itself runs successfully in my setup. My LXC configuration:
# Allow cgroup access
lxc.cgroup2.devices.allow: c 195:* rwm
lxc.cgroup2.devices.allow: c 235:* rwm
lxc.cgroup2.devices.allow: c 511:* rwm
lxc.cgroup2.devices.allow: c 226:* rwm
lxc.cgroup2.devices.allow: c 239:* rwm
lxc.cgroup2.devices.allow: c 243:* rwm
# Pass through device files
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
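The major numbers in the `lxc.cgroup2.devices.allow` lines have to match what the host actually assigns to the device nodes, and some of them (for example the one for `nvidia-uvm`) can differ between machines. Here is a minimal sketch for generating those lines from the host's device nodes. The helper names `dev_major` and `lxc_allow_line` are made up for illustration, and it assumes GNU `stat`:

```shell
#!/bin/sh
# Print the major number (in decimal) of a character device node.
# GNU stat's %t format prints the major number in hexadecimal.
dev_major() {
    [ -c "$1" ] || return 1            # only character devices qualify
    hex=$(stat -c '%t' "$1") || return 1
    printf '%d\n' "0x$hex"
}

# Print a matching allow rule for one device node (hypothetical helper).
lxc_allow_line() {
    major=$(dev_major "$1") || return 1
    printf 'lxc.cgroup2.devices.allow: c %s:* rwm\n' "$major"
}

# Usage on the Proxmox host (non-device entries are skipped):
#   for dev in /dev/nvidia* /dev/dri/*; do lxc_allow_line "$dev"; done | sort -u
```

Comparing that output with the container config should show quickly whether any major number is missing or stale.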
I'm looking for any guidance on how to debug this and successfully run GPU-enabled Docker containers inside an LXC container.
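As a first debugging step, it may help to confirm the two conditions the error hints at: `BPF_CGROUP_DEVICE` refers to the cgroup v2 device filter, and "operation not permitted" is what one would expect inside an unprivileged (user-namespaced) container. The sketch below checks both from inside the LXC container. The function names are hypothetical, and the optional path arguments exist only so the checks can be tried against sample files:

```shell
#!/bin/sh
# Return 0 if the uid_map describes a user namespace with remapped IDs,
# i.e. anything other than the identity mapping "0 0 4294967295".
in_user_namespace() {
    map_file=${1:-/proc/self/uid_map}
    # Fields: first ID inside the namespace, first ID outside, range length.
    read -r inside outside length < "$map_file" || return 2
    [ "$inside" != "0" ] || [ "$outside" != "0" ] || [ "$length" != "4294967295" ]
}

# Return 0 if the directory looks like a cgroup v2 (unified) hierarchy.
uses_cgroup_v2() {
    [ -f "${1:-/sys/fs/cgroup}/cgroup.controllers" ]
}

# Usage inside the LXC container:
#   in_user_namespace && echo "unprivileged container (user namespace)"
#   uses_cgroup_v2    && echo "cgroup v2 in use"
```

If both report true, the container is unprivileged and on cgroup v2, which is the combination the error message points to.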
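This is the process I used to get `nvidia-smi` working in Docker inside an LXC container on Proxmox:
Does `nvidia-smi` work on the host and in the LXC container? If not, I wrote up the process a few years ago (it is an older post, but it should still get you there); you can read it here. The steps below are for an Ubuntu 22.04 LXC container and are untested on other distributions. I'll copy the commands here, but consider going to the source for troubleshooting.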
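For the `nvidia-smi` check, a small helper like the following can be run once on the host and once in the LXC container. This is only a sketch: the function name `gpu_driver_version` is made up, and it assumes that both sides should report the same driver version because the container relies on the host's kernel module.

```shell
#!/bin/sh
# Print the driver version reported by nvidia-smi,
# or explain on stderr why it is unavailable.
gpu_driver_version() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found in PATH" >&2
        return 1
    fi
    nvidia-smi --query-gpu=driver_version --format=csv,noheader
}

# Usage (run on the host, then inside the LXC container, and compare):
#   gpu_driver_version
```

If the two outputs differ, or either call fails, that is worth sorting out before installing Docker.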
$ for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt-get remove $pkg; done
$ # Add Docker's official GPG key:
$ sudo apt-get update
$ sudo apt-get install ca-certificates curl
$ sudo install -m 0755 -d /etc/apt/keyrings
$ sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
$ sudo chmod a+r /etc/apt/keyrings/docker.asc
$ # Add the repository to Apt sources:
$ echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
$ sudo apt-get update
$ sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
$ sudo docker run hello-world
$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
$ sudo apt-get update
$ sudo apt-get install -y nvidia-container-toolkit
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker
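After `nvidia-ctk runtime configure`, it can be worth confirming that the `nvidia` runtime really ended up in Docker's daemon configuration. A rough sketch, assuming the default path `/etc/docker/daemon.json`; the function name `has_nvidia_runtime` is made up for illustration:

```shell
#!/bin/sh
# Return 0 if the given daemon.json declares a runtime entry named "nvidia".
# The pattern matches the key `"nvidia":`, not a value such as
# `"default-runtime": "nvidia"`.
has_nvidia_runtime() {
    grep -q '"nvidia"[[:space:]]*:' "${1:-/etc/docker/daemon.json}"
}

# Usage:
#   has_nvidia_runtime && echo "nvidia runtime is registered"
# The running daemon can also be asked directly:
#   sudo docker info --format '{{json .Runtimes}}'
```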
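Not sure if this helps, but I kept getting stuck trying to get NVIDIA Docker running in an unprivileged LXC container. What fixed it for me was changing a setting in the NVIDIA Docker configuration file, `/etc/nvidia-container-runtime/config.toml`:
no-cgroups = true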
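If you would rather script that edit than open the file by hand, something like this should do it. It is a sketch that assumes GNU `sed` and that the file already contains a `no-cgroups` line, possibly commented out (as in `#no-cgroups = false`); the function name `set_no_cgroups` is made up:

```shell
#!/bin/sh
# Rewrite the existing no-cgroups line (commented out or not) to
# "no-cgroups = true". Other lines are left untouched. If the file has
# no such line, nothing changes.
set_no_cgroups() {
    cfg=${1:-/etc/nvidia-container-runtime/config.toml}
    sed -i -E 's/^[[:space:]]*#?[[:space:]]*no-cgroups[[:space:]]*=.*/no-cgroups = true/' "$cfg"
}

# Usage (as root, inside the LXC container):
#   set_no_cgroups
#   grep no-cgroups /etc/nvidia-container-runtime/config.toml
```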
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Hope this works for you. The setup can be brittle, though it has gotten better.