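I am using the latest version of Docker Desktop on Windows 11 with WSL 2, and Kubernetes is enabled. My laptop has an NVIDIA RTX 3080 Ti GPU.
The GPU works out of the box in Docker (see the listings below).
I want to make the GPU available to containers running on the Kubernetes node.
I have already looked here:
but I could not find an answer or a how-to guide.
What puzzles me is that running a Docker container with --gpus works out of the box, while the Kubernetes containers (visible in Docker Desktop) do not seem to have GPU support. Perhaps Docker Desktop starts them without --gpus all. Is it really that simple? Is there a way to inspect, change, or configure Kubernetes in Docker Desktop? A hidden .kubernetes file or something similar?
Here are some useful listings:
Output of nvidia-smi in an ubuntu container on Docker Desktop for Windows 11 with WSL 2 (GPU support works out of the box):
$ docker run --rm --gpus all -it ubuntu nvidia-smi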
Wed Jan 24 17:58:46 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.37.02 Driver Version: 546.65 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3080 ... On | 00000000:01:00.0 Off | N/A |
| N/A 61C P8 12W / 150W | 635MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1 C /python3.7 N/A |
+---------------------------------------------------------------------------------------+
kubectl describe node gives the following output (note that no GPU is listed under Capacity or Allocatable):
$ kubectl describe node
Name: docker-desktop
Roles: control-plane
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
feature.node.kubernetes.io/cpu-cpuid.ADX=true
feature.node.kubernetes.io/cpu-cpuid.AESNI=true
feature.node.kubernetes.io/cpu-cpuid.AVX=true
feature.node.kubernetes.io/cpu-cpuid.AVX2=true
feature.node.kubernetes.io/cpu-cpuid.AVXVNNI=true
feature.node.kubernetes.io/cpu-cpuid.CETIBT=true
feature.node.kubernetes.io/cpu-cpuid.CETSS=true
feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8=true
feature.node.kubernetes.io/cpu-cpuid.FLUSH_L1D=true
feature.node.kubernetes.io/cpu-cpuid.FMA3=true
feature.node.kubernetes.io/cpu-cpuid.FSRM=true
feature.node.kubernetes.io/cpu-cpuid.FXSR=true
feature.node.kubernetes.io/cpu-cpuid.FXSROPT=true
feature.node.kubernetes.io/cpu-cpuid.GFNI=true
feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true
feature.node.kubernetes.io/cpu-cpuid.IA32_ARCH_CAP=true
feature.node.kubernetes.io/cpu-cpuid.IBPB=true
feature.node.kubernetes.io/cpu-cpuid.LAHF=true
feature.node.kubernetes.io/cpu-cpuid.MOVBE=true
feature.node.kubernetes.io/cpu-cpuid.MOVDIR64B=true
feature.node.kubernetes.io/cpu-cpuid.MOVDIRI=true
feature.node.kubernetes.io/cpu-cpuid.OSXSAVE=true
feature.node.kubernetes.io/cpu-cpuid.SERIALIZE=true
feature.node.kubernetes.io/cpu-cpuid.SHA=true
feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD=true
feature.node.kubernetes.io/cpu-cpuid.STIBP=true
feature.node.kubernetes.io/cpu-cpuid.STOSB_SHORT=true
feature.node.kubernetes.io/cpu-cpuid.SYSCALL=true
feature.node.kubernetes.io/cpu-cpuid.SYSEE=true
feature.node.kubernetes.io/cpu-cpuid.VAES=true
feature.node.kubernetes.io/cpu-cpuid.VMX=true
feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ=true
feature.node.kubernetes.io/cpu-cpuid.WAITPKG=true
feature.node.kubernetes.io/cpu-cpuid.X87=true
feature.node.kubernetes.io/cpu-cpuid.XGETBV1=true
feature.node.kubernetes.io/cpu-cpuid.XSAVE=true
feature.node.kubernetes.io/cpu-cpuid.XSAVEC=true
feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT=true
feature.node.kubernetes.io/cpu-cpuid.XSAVES=true
feature.node.kubernetes.io/cpu-hardware_multithreading=true
feature.node.kubernetes.io/cpu-model.family=6
feature.node.kubernetes.io/cpu-model.id=151
feature.node.kubernetes.io/cpu-model.vendor_id=Intel
feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
feature.node.kubernetes.io/kernel-version.full=5.15.133.1-microsoft-standard-WSL2
feature.node.kubernetes.io/pci-0302_1414.present=true
feature.node.kubernetes.io/storage-nonrotationaldisk=true
kubernetes.io/arch=amd64
kubernetes.io/hostname=docker-desktop
kubernetes.io/os=linux
node-role.kubernetes.io/control-plane=
node.kubernetes.io/exclude-from-external-load-balancers=
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/cri-dockerd.sock
nfd.node.kubernetes.io/feature-labels:
cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVXVNNI,cpu-cpuid.CETIBT,cpu-cpuid.CETSS,cpu-cpuid.CMPXCHG8,cpu-cpuid...
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Fri, 19 Jan 2024 15:59:59 +0100
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: docker-desktop
AcquireTime: <unset>
RenewTime: Wed, 24 Jan 2024 19:01:25 +0100
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Wed, 24 Jan 2024 19:01:27 +0100 Fri, 19 Jan 2024 15:59:58 +0100 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 24 Jan 2024 19:01:27 +0100 Fri, 19 Jan 2024 15:59:58 +0100 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 24 Jan 2024 19:01:27 +0100 Fri, 19 Jan 2024 15:59:58 +0100 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Wed, 24 Jan 2024 19:01:27 +0100 Fri, 19 Jan 2024 15:59:59 +0100 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 192.168.65.3
Hostname: docker-desktop
Capacity:
cpu: 24
ephemeral-storage: 1055762868Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16184860Ki
pods: 110
Allocatable:
cpu: 24
ephemeral-storage: 972991057538
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16082460Ki
pods: 110
System Info:
Machine ID: 8e4a9a84-8d9a-4100-8a35-25329f9bcd04
System UUID: 8e4a9a84-8d9a-4100-8a35-25329f9bcd04
Boot ID: cd21f4ff-4b7f-4ab0-9791-650f44921aab
Kernel Version: 5.15.133.1-microsoft-standard-WSL2
OS Image: Docker Desktop
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://24.0.7
Kubelet Version: v1.28.2
Kube-Proxy Version: v1.28.2
Non-terminated Pods: (21 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
default ingress-nginx-controller-76df688779-vmgcm 100m (0%) 0 (0%) 90Mi (0%) 0 (0%) 5d2h
kube-system coredns-5dd5756b68-267mg 100m (0%) 0 (0%) 70Mi (0%) 170Mi (1%) 5d3h
kube-system coredns-5dd5756b68-h8vlt 100m (0%) 0 (0%) 70Mi (0%) 170Mi (1%) 5d3h
kube-system etcd-docker-desktop 100m (0%) 0 (0%) 100Mi (0%) 0 (0%) 5d3h
kube-system kube-apiserver-docker-desktop 250m (1%) 0 (0%) 0 (0%) 0 (0%) 5d3h
kube-system kube-controller-manager-docker-desktop 200m (0%) 0 (0%) 0 (0%) 0 (0%) 5d3h
kube-system kube-proxy-9v4pl 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d3h
kube-system kube-scheduler-docker-desktop 100m (0%) 0 (0%) 0 (0%) 0 (0%) 5d3h
kube-system nvidia-device-plugin-daemonset-wv575 0 (0%) 0 (0%) 0 (0%) 0 (0%) 17h
kube-system storage-provisioner 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d3h
kube-system vpnkit-controller 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d3h
node-feature-discovery nfd-gc-5b987cb58f-qxn64 0 (0%) 0 (0%) 0 (0%) 0 (0%) 17h
node-feature-discovery nfd-master-7bff75887-vs825 0 (0%) 0 (0%) 0 (0%) 0 (0%) 17h
node-feature-discovery nfd-worker-f8f9s 0 (0%) 0 (0%) 0 (0%) 0 (0%) 17h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 950m (3%) 0 (0%)
memory 330Mi (2%) 340Mi (2%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
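As a quicker check than reading the whole describe output, here is a minimal Python sketch that looks for the `nvidia.com/gpu` resource in each node's allocatable section. It assumes `kubectl` is on the PATH and that the device plugin advertises the GPU under the resource name `nvidia.com/gpu`; the function names `gpu_counts` and `fetch_nodes` are made up for illustration.

```python
import json
import subprocess

# Assumed resource name advertised by the NVIDIA device plugin.
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_counts(nodes_json: dict) -> dict:
    """Map each node name to its allocatable GPU count (0 if none is advertised)."""
    counts = {}
    for node in nodes_json.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        counts[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return counts


def fetch_nodes() -> dict:
    """Return the parsed output of `kubectl get nodes -o json`."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Calling `gpu_counts(fetch_nodes())` on a node like the one above, where neither Capacity nor Allocatable lists a GPU, would report 0 for docker-desktop.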
Output of kubectl describe daemonset nvidia-device-plugin-daemonset -n kube-system:
$ kubectl describe daemonset nvidia-device-plugin-daemonset -n kube-system
Name: nvidia-device-plugin-daemonset
Selector: name=nvidia-device-plugin-ds
Node-Selector: <none>
Labels: <none>
Annotations: deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 1
Current Number of Nodes Scheduled: 1
Number of Nodes Scheduled with Up-to-date Pods: 1
Number of Nodes Scheduled with Available Pods: 1
Number of Nodes Misscheduled: 0
Pods Status: 1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: name=nvidia-device-plugin-ds
Containers:
nvidia-device-plugin-ctr:
Image: nvcr.io/nvidia/k8s-device-plugin:v0.14.3
Port: <none>
Host Port: <none>
Environment:
FAIL_ON_INIT_ERROR: false
Mounts:
/var/lib/kubelet/device-plugins from device-plugin (rw)
Volumes:
device-plugin:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/device-plugins
HostPathType:
Priority Class Name: system-node-critical
Events: <none>
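This is for my development environment (WSL2, Docker Desktop, Minikube).
It was installed to test other (similar) tools, so I am not sure whether it actually needs to be installed.
Sub-packages
Configure the Docker daemon.json. On Windows, configure it in Docker Desktop (not inside WSL2):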
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}
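Docker Desktop usually has other keys in its daemon configuration already, so the snippet above should be merged in rather than pasted over them. Below is a minimal Python sketch of that merge; the helper name `with_nvidia_default_runtime` is hypothetical, and it only produces the new JSON text, which you would still paste into the Docker Engine settings by hand.

```python
import json

# Runtime entry taken from the snippet above.
NVIDIA_RUNTIME = {"args": [], "path": "nvidia-container-runtime"}


def with_nvidia_default_runtime(daemon_json_text: str) -> str:
    """Return daemon.json text with nvidia registered and set as the default runtime.

    Existing keys are preserved; an empty input is treated as an empty config.
    """
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    config["default-runtime"] = "nvidia"
    config.setdefault("runtimes", {})["nvidia"] = dict(NVIDIA_RUNTIME)
    return json.dumps(config, indent=2)
```

For example, `with_nvidia_default_runtime('{"experimental": false}')` keeps the `experimental` key and adds the two nvidia entries.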
Reference:
https://minikube.sigs.k8s.io/docs/tutorials/nvidia/
minikube start --driver docker --container-runtime docker --gpus all
Check the pod status:
kubectl get po --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-6f6b679f8f-f2sd8 1/1 Running 2 (24h ago) 27h
kube-system etcd-minikube 1/1 Running 2 (24h ago) 27h
kube-system kube-apiserver-minikube 1/1 Running 2 (24h ago) 27h
kube-system kube-controller-manager-minikube 1/1 Running 2 (24h ago) 27h
kube-system kube-proxy-jx6p4 1/1 Running 2 (24h ago) 27h
kube-system kube-scheduler-minikube 1/1 Running 2 (24h ago) 27h
kube-system nvidia-device-plugin-daemonset-j4vf2 1/1 Running 2 (24h ago) 27h
kube-system storage-provisioner 1/1 Running 2 (24h ago) 27h
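If you would rather not eyeball the table, the same check can be scripted. This is a rough sketch that takes the parsed output of `kubectl get po --all-namespaces -o json` and lists pods that are neither finished nor fully ready; the function name `not_ready_pods` is an assumption of mine, not part of any tool.

```python
def not_ready_pods(pods_json: dict) -> list:
    """Return (namespace, name) for pods that are not Running with all containers ready.

    Pods in the Succeeded phase are treated as finished and skipped.
    """
    problems = []
    for pod in pods_json.get("items", []):
        meta = pod["metadata"]
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Succeeded":
            continue
        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if phase != "Running" or not all_ready:
            problems.append((meta["namespace"], meta["name"]))
    return problems
```

An empty result corresponds to the listing above, where every pod shows 1/1 Running.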
Check the GPU status:
kubectl logs -f nvidia-device-plugin-daemonset-j4vf2 -n kube-system
I1123 13:04:19.276326 1 main.go:199] Starting FS watcher.
I1123 13:04:19.276699 1 main.go:206] Starting OS watcher.
I1123 13:04:19.277605 1 main.go:221] Starting Plugins.
I1123 13:04:19.277621 1 main.go:278] Loading configuration.
I1123 13:04:19.280194 1 main.go:303] Updating config with default resource matching patterns.
I1123 13:04:19.280304 1 main.go:314]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"mpsRoot": "",
"nvidiaDriverRoot": "/",
"nvidiaDevRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"useNodeFeatureAPI": null,
"deviceDiscoveryStrategy": "auto",
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I1123 13:04:19.280316 1 main.go:317] Retrieving plugins.
I1123 13:04:19.316373 1 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
I1123 13:04:19.317787 1 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I1123 13:04:19.320872 1 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet
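The line that matters in this log is the last one: the plugin has registered `nvidia.com/gpu` with the kubelet. A small Python sketch to test for it in saved log text follows; the marker string is copied from the log above and may well differ in other plugin versions, and `plugin_registered` is a hypothetical helper name.

```python
# Marker copied from the log output above; other plugin versions may word it differently.
REGISTERED_MARKER = "Registered device plugin for 'nvidia.com/gpu' with Kubelet"


def plugin_registered(log_text: str) -> bool:
    """Return True if any log line reports a successful registration with the kubelet."""
    return any(REGISTERED_MARKER in line for line in log_text.splitlines())
```

If this returns False, the node will most likely not list `nvidia.com/gpu` under Capacity either.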
Also run
kubectl apply -f test/nvidia-smi.yml
to check the GPU status, then read the pod's logs with kubectl logs -f nvidia-smi -n test (the manifest places the pod in the test namespace):
Sat Nov 23 13:12:11 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.72 Driver Version: 566.14 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4080 On | 00000000:01:00.0 On | N/A |
| 0% 49C P8 15W / 360W | 2380MiB / 16376MiB | 2% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 31 G /Xwayland N/A |
| 0 N/A N/A 31 G /Xwayland N/A |
+-----------------------------------------------------------------------------------------+
nvidia-smi.yml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
  namespace: test
spec:
  restartPolicy: OnFailure
  containers:
    - name: nvidia-smi
      image: "nvidia/cuda:11.8.0-base-ubuntu20.04"
      args: ["nvidia-smi"]
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
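The manifest above does not request a GPU explicitly, so it depends on nvidia being the default runtime. A variant that asks the scheduler for one GPU through `resources.limits` could look like the sketch below. It builds the same pod as a Python dict and prints it as JSON, which kubectl also accepts; the function name `gpu_pod_manifest` and its defaults are my own choices, and the resource name `nvidia.com/gpu` is taken from the device plugin log.

```python
import json


def gpu_pod_manifest(name: str = "nvidia-smi", namespace: str = "test", gpus: int = 1) -> dict:
    """Build a Pod manifest like the YAML above, plus an explicit GPU limit."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [{
                "name": name,
                "image": "nvidia/cuda:11.8.0-base-ubuntu20.04",
                "args": ["nvidia-smi"],
                # Explicit request: the pod stays Pending if no node advertises a GPU.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
            "tolerations": [{
                "key": "nvidia.com/gpu",
                "operator": "Exists",
                "effect": "NoSchedule",
            }],
        },
    }


print(json.dumps(gpu_pod_manifest(), indent=2))
```

Saved as, say, gpu_pod.py (a hypothetical file name), the output can be piped straight in with `python gpu_pod.py | kubectl apply -f -`. A pod that stays Pending is then a hint that the node does not advertise `nvidia.com/gpu`.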
minikube start -n 2 -p dev --cni=cilium --cpus=4 --disk-size=20000mb --memory=4g --driver docker --container-runtime docker --gpus all
https://github.com/kubernetes-sigs/node-feature-discovery
kubectl apply -k "https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.16.5"
kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=v24.9.0
Every 2.0s: kubectl get po --all-namespaces Lei: Sat Dec 14 04:16:08 2024
NAMESPACE NAME READY STATUS RESTARTS AGE
gpu-operator gpu-operator-1734117920-node-feature-discovery-gc-767794b8fd5wv 1/1 Running 1 (27m ago) 50m
gpu-operator gpu-operator-1734117920-node-feature-discovery-master-5767g5skk 1/1 Running 1 (27m ago) 50m
gpu-operator gpu-operator-1734117920-node-feature-discovery-worker-5hqnq 1/1 Running 1 (27m ago) 50m
gpu-operator gpu-operator-1734117920-node-feature-discovery-worker-6dcz8 1/1 Running 1 (27m ago) 50m
gpu-operator gpu-operator-7f474c6cf8-4sj9j 1/1 Running 1 (27m ago) 50m
kube-system cilium-envoy-2h8h7 1/1 Running 2 (25m ago) 55m
kube-system cilium-envoy-m84kr 1/1 Running 1 (27m ago) 55m
kube-system cilium-h8cwb 1/1 Running 1 (27m ago) 55m
kube-system cilium-nhqzk 1/1 Running 1 (27m ago) 55m
kube-system cilium-operator-5c7867ccd5-j7w7r 1/1 Running 2 (25m ago) 55m
kube-system coredns-6f6b679f8f-dnt28 1/1 Running 4 (27m ago) 55m
kube-system etcd-dev 1/1 Running 2 (25m ago) 55m
kube-system kube-apiserver-dev 1/1 Running 3 (16m ago) 55m
kube-system kube-controller-manager-dev 1/1 Running 3 (16m ago) 55m
kube-system kube-proxy-cp88s 1/1 Running 2 (25m ago) 55m
kube-system kube-proxy-m5djw 1/1 Running 1 (27m ago) 55m
kube-system kube-scheduler-dev 1/1 Running 2 (25m ago) 55m
kube-system nvidia-device-plugin-daemonset-2z7pk 1/1 Running 2 (27m ago) 55m
kube-system nvidia-device-plugin-daemonset-wgtg4 1/1 Running 2 (27m ago) 55m
kube-system storage-provisioner 1/1 Running 4 (15m ago) 55m
node-feature-discovery nfd-gc-7b46f54bf8-fvjjt 1/1 Running 1 (27m ago) 53m
node-feature-discovery nfd-master-6c95f4b5fb-dgw7x 1/1 Running 1 (27m ago) 53m
node-feature-discovery nfd-worker-n7vfh 1/1 Running 1 (27m ago) 53m
node-feature-discovery nfd-worker-x4tmh 1/1 Running 1 (27m ago) 53m
Hope it works for you :)