GPU support for Kubernetes in Docker Desktop on Windows 11 with WSL 2

Question (votes: 0, answers: 1)

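I am using the latest version of Docker Desktop on Windows 11 with WSL 2, with Kubernetes enabled. My laptop has an NVIDIA GeForce RTX 3080 Ti GPU.

The GPU is available to Docker out of the box (see the listings below).

I would like to make the GPU available to containers running on the Kubernetes node.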

  1. Is this possible?
  2. What are the steps to achieve it?

I have already looked here:

but I could not find an answer or a how-to guide.

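Puzzlingly, running Docker containers with --gpus works out of the box, but the Kubernetes containers (visible in Docker Desktop) do not seem to have GPU support. Perhaps Docker Desktop starts them without --gpus all. Is it really that simple? Is there a way to inspect, change, or configure Kubernetes in Docker Desktop? A hidden .kubernetes file or something similar?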

Here are some useful listings:

Output of nvidia-smi run in an ubuntu container on Docker Desktop for Windows 11 with WSL 2, where GPU support works out of the box:

$ docker run --rm --gpus all -it ubuntu nvidia-smi
Wed Jan 24 17:58:46 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.37.02              Driver Version: 546.65       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080 ...    On  | 00000000:01:00.0 Off |                  N/A |
| N/A   61C    P8              12W / 150W |    635MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A         1      C   /python3.7                                N/A      |
+---------------------------------------------------------------------------------------+

kubectl describe node gives the following output (note that no GPU is listed):

$ kubectl describe node
Name:               docker-desktop
Roles:              control-plane
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                    feature.node.kubernetes.io/cpu-cpuid.AVXVNNI=true
                    feature.node.kubernetes.io/cpu-cpuid.CETIBT=true
                    feature.node.kubernetes.io/cpu-cpuid.CETSS=true
                    feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8=true
                    feature.node.kubernetes.io/cpu-cpuid.FLUSH_L1D=true
                    feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                    feature.node.kubernetes.io/cpu-cpuid.FSRM=true
                    feature.node.kubernetes.io/cpu-cpuid.FXSR=true
                    feature.node.kubernetes.io/cpu-cpuid.FXSROPT=true
                    feature.node.kubernetes.io/cpu-cpuid.GFNI=true
                    feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true
                    feature.node.kubernetes.io/cpu-cpuid.IA32_ARCH_CAP=true
                    feature.node.kubernetes.io/cpu-cpuid.IBPB=true
                    feature.node.kubernetes.io/cpu-cpuid.LAHF=true
                    feature.node.kubernetes.io/cpu-cpuid.MOVBE=true
                    feature.node.kubernetes.io/cpu-cpuid.MOVDIR64B=true
                    feature.node.kubernetes.io/cpu-cpuid.MOVDIRI=true
                    feature.node.kubernetes.io/cpu-cpuid.OSXSAVE=true
                    feature.node.kubernetes.io/cpu-cpuid.SERIALIZE=true
                    feature.node.kubernetes.io/cpu-cpuid.SHA=true
                    feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD=true
                    feature.node.kubernetes.io/cpu-cpuid.STIBP=true
                    feature.node.kubernetes.io/cpu-cpuid.STOSB_SHORT=true
                    feature.node.kubernetes.io/cpu-cpuid.SYSCALL=true
                    feature.node.kubernetes.io/cpu-cpuid.SYSEE=true
                    feature.node.kubernetes.io/cpu-cpuid.VAES=true
                    feature.node.kubernetes.io/cpu-cpuid.VMX=true
                    feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ=true
                    feature.node.kubernetes.io/cpu-cpuid.WAITPKG=true
                    feature.node.kubernetes.io/cpu-cpuid.X87=true
                    feature.node.kubernetes.io/cpu-cpuid.XGETBV1=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVE=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVEC=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVES=true
                    feature.node.kubernetes.io/cpu-hardware_multithreading=true
                    feature.node.kubernetes.io/cpu-model.family=6
                    feature.node.kubernetes.io/cpu-model.id=151
                    feature.node.kubernetes.io/cpu-model.vendor_id=Intel
                    feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
                    feature.node.kubernetes.io/kernel-version.full=5.15.133.1-microsoft-standard-WSL2
                    feature.node.kubernetes.io/pci-0302_1414.present=true
                    feature.node.kubernetes.io/storage-nonrotationaldisk=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=docker-desktop
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    node.kubernetes.io/exclude-from-external-load-balancers=
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/cri-dockerd.sock
                    nfd.node.kubernetes.io/feature-labels:
                      cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVXVNNI,cpu-cpuid.CETIBT,cpu-cpuid.CETSS,cpu-cpuid.CMPXCHG8,cpu-cpuid...
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Fri, 19 Jan 2024 15:59:59 +0100
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  docker-desktop
  AcquireTime:     <unset>
  RenewTime:       Wed, 24 Jan 2024 19:01:25 +0100
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 24 Jan 2024 19:01:27 +0100   Fri, 19 Jan 2024 15:59:58 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 24 Jan 2024 19:01:27 +0100   Fri, 19 Jan 2024 15:59:58 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 24 Jan 2024 19:01:27 +0100   Fri, 19 Jan 2024 15:59:58 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 24 Jan 2024 19:01:27 +0100   Fri, 19 Jan 2024 15:59:59 +0100   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  192.168.65.3
  Hostname:    docker-desktop
Capacity:
  cpu:                24
  ephemeral-storage:  1055762868Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16184860Ki
  pods:               110
Allocatable:
  cpu:                24
  ephemeral-storage:  972991057538
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16082460Ki
  pods:               110
System Info:
  Machine ID:                 8e4a9a84-8d9a-4100-8a35-25329f9bcd04
  System UUID:                8e4a9a84-8d9a-4100-8a35-25329f9bcd04
  Boot ID:                    cd21f4ff-4b7f-4ab0-9791-650f44921aab
  Kernel Version:             5.15.133.1-microsoft-standard-WSL2
  OS Image:                   Docker Desktop
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://24.0.7
  Kubelet Version:            v1.28.2
  Kube-Proxy Version:         v1.28.2
Non-terminated Pods:          (21 in total)
  Namespace                   Name                                          CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                          ------------  ----------  ---------------  -------------  ---
  default                     ingress-nginx-controller-76df688779-vmgcm     100m (0%)     0 (0%)      90Mi (0%)        0 (0%)         5d2h
  kube-system                 coredns-5dd5756b68-267mg                      100m (0%)     0 (0%)      70Mi (0%)        170Mi (1%)     5d3h
  kube-system                 coredns-5dd5756b68-h8vlt                      100m (0%)     0 (0%)      70Mi (0%)        170Mi (1%)     5d3h
  kube-system                 etcd-docker-desktop                           100m (0%)     0 (0%)      100Mi (0%)       0 (0%)         5d3h
  kube-system                 kube-apiserver-docker-desktop                 250m (1%)     0 (0%)      0 (0%)           0 (0%)         5d3h
  kube-system                 kube-controller-manager-docker-desktop        200m (0%)     0 (0%)      0 (0%)           0 (0%)         5d3h
  kube-system                 kube-proxy-9v4pl                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d3h
  kube-system                 kube-scheduler-docker-desktop                 100m (0%)     0 (0%)      0 (0%)           0 (0%)         5d3h
  kube-system                 nvidia-device-plugin-daemonset-wv575          0 (0%)        0 (0%)      0 (0%)           0 (0%)         17h
  kube-system                 storage-provisioner                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d3h
  kube-system                 vpnkit-controller                             0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d3h
  node-feature-discovery      nfd-gc-5b987cb58f-qxn64                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         17h
  node-feature-discovery      nfd-master-7bff75887-vs825                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         17h
  node-feature-discovery      nfd-worker-f8f9s                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         17h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                950m (3%)   0 (0%)
  memory             330Mi (2%)  340Mi (2%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
Events:              <none>
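
A quick way to confirm what the listing above shows is to look for the nvidia.com/gpu resource, which a working device plugin would add under Capacity and Allocatable. The following is only a minimal sketch: gpu_capacity is a hypothetical helper name, and it assumes the usual "name: value" layout of kubectl describe node.

```shell
# Hypothetical helper: read `kubectl describe node` output on stdin and print
# the nvidia.com/gpu count from the first Capacity/Allocatable entry.
# Prints 0 when the resource is not listed at all.
gpu_capacity() {
  awk '$1 == "nvidia.com/gpu:" { print $2; found = 1; exit }
       END { if (!found) print 0 }'
}
```

It could be used as `kubectl describe node docker-desktop | gpu_capacity`. For the node shown above it should print 0, since no nvidia.com/gpu line appears in the output.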

Output of kubectl describe daemonset nvidia-device-plugin-daemonset -n kube-system:

$ kubectl describe daemonset nvidia-device-plugin-daemonset -n kube-system
Name:           nvidia-device-plugin-daemonset
Selector:       name=nvidia-device-plugin-ds
Node-Selector:  <none>
Labels:         <none>
Annotations:    deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 1
Current Number of Nodes Scheduled: 1
Number of Nodes Scheduled with Up-to-date Pods: 1
Number of Nodes Scheduled with Available Pods: 1
Number of Nodes Misscheduled: 0
Pods Status:  1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  name=nvidia-device-plugin-ds
  Containers:
   nvidia-device-plugin-ctr:
    Image:      nvcr.io/nvidia/k8s-device-plugin:v0.14.3
    Port:       <none>
    Host Port:  <none>
    Environment:
      FAIL_ON_INIT_ERROR:  false
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
  Volumes:
   device-plugin:
    Type:               HostPath (bare host directory volume)
    Path:               /var/lib/kubelet/device-plugins
    HostPathType:
  Priority Class Name:  system-node-critical
Events:                 <none>

Tags: windows, kubernetes, gpu, windows-subsystem-for-linux, docker-desktop

1 answer (votes: 0)

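This is what worked for my development environment (WSL2, Docker Desktop, Minikube).

Install the NVIDIA container runtime

I had already installed it to test other (similar) tools, so I am not sure whether it is strictly required.

Subpackages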

  • libnvidia-container1
  • libnvidia-container-tools
  • nvidia-container-toolkit-base
  • nvidia-container-toolkit

Configure Docker's daemon.json. On Windows, configure it in Docker Desktop, not inside WSL2.

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
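
After restarting Docker Desktop, it may be worth checking that the daemon actually picked up the setting. A small sketch, assuming the docker CLI is on the PATH; default_runtime_is_nvidia is a hypothetical helper name, and it relies on the DefaultRuntime field that docker info exposes to --format templates.

```shell
# Hypothetical helper: succeeds (exit status 0) when Docker reports "nvidia"
# as its default runtime, and fails otherwise.
default_runtime_is_nvidia() {
  [ "$(docker info --format '{{.DefaultRuntime}}')" = "nvidia" ]
}
```

For example: `default_runtime_is_nvidia && echo "nvidia is the default runtime"`.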

References:

Start Minikube

https://minikube.sigs.k8s.io/docs/tutorials/nvidia/

minikube start --driver docker --container-runtime docker --gpus all

Check the pod status

kubectl get po --all-namespaces

NAMESPACE     NAME                                   READY   STATUS    RESTARTS      AGE
kube-system   coredns-6f6b679f8f-f2sd8               1/1     Running   2 (24h ago)   27h
kube-system   etcd-minikube                          1/1     Running   2 (24h ago)   27h
kube-system   kube-apiserver-minikube                1/1     Running   2 (24h ago)   27h
kube-system   kube-controller-manager-minikube       1/1     Running   2 (24h ago)   27h
kube-system   kube-proxy-jx6p4                       1/1     Running   2 (24h ago)   27h
kube-system   kube-scheduler-minikube                1/1     Running   2 (24h ago)   27h
kube-system   nvidia-device-plugin-daemonset-j4vf2   1/1     Running   2 (24h ago)   27h
kube-system   storage-provisioner                    1/1     Running   2 (24h ago)   27h

Check the GPU status

kubectl logs -f nvidia-device-plugin-daemonset-j4vf2 -n kube-system

I1123 13:04:19.276326       1 main.go:199] Starting FS watcher.
I1123 13:04:19.276699       1 main.go:206] Starting OS watcher.
I1123 13:04:19.277605       1 main.go:221] Starting Plugins.
I1123 13:04:19.277621       1 main.go:278] Loading configuration.
I1123 13:04:19.280194       1 main.go:303] Updating config with default resource matching patterns.
I1123 13:04:19.280304       1 main.go:314] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I1123 13:04:19.280316       1 main.go:317] Retrieving plugins.
I1123 13:04:19.316373       1 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
I1123 13:04:19.317787       1 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I1123 13:04:19.320872       1 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet
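
The line that matters in this log is the last one, where the plugin registers nvidia.com/gpu with the kubelet. As a rough sketch, the check can be scripted; plugin_registered is a hypothetical helper name, and the pattern assumes the message text stays the same as in the log above, which may not hold for other plugin versions.

```shell
# Hypothetical helper: read device plugin logs on stdin and succeed only if
# the plugin reported registering nvidia.com/gpu with the kubelet.
plugin_registered() {
  grep -q "Registered device plugin for 'nvidia.com/gpu' with Kubelet"
}
```

For example: `kubectl logs nvidia-device-plugin-daemonset-j4vf2 -n kube-system | plugin_registered && echo registered`. Note that -f is left out here so that the command terminates.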

Then run a test pod:

kubectl apply -f test/nvidia-smi.yml

Check the GPU status in the pod's logs:

kubectl logs -f nvidia-smi -n test

Sat Nov 23 13:12:11 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.72                 Driver Version: 566.14         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4080        On  |   00000000:01:00.0  On |                  N/A |
|  0%   49C    P8             15W /  360W |    2380MiB /  16376MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A        31      G   /Xwayland                                   N/A      |
|    0   N/A  N/A        31      G   /Xwayland                                   N/A      |
+-----------------------------------------------------------------------------------------+

nvidia-smi.yml

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
  namespace: test
spec:
  restartPolicy: OnFailure
  containers:
    - name: nvidia-smi
      image: "nvidia/cuda:11.8.0-base-ubuntu20.04"
      args: ["nvidia-smi"]
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
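
The manifest above does not request a GPU resource, so it appears to rely on the nvidia default runtime configured in daemon.json. Since the device plugin registers nvidia.com/gpu (see its log earlier), a pod can also ask for the GPU explicitly through resources.limits. This is only a sketch under that assumption: gpu_pod_manifest and the default pod name nvidia-smi-limit are hypothetical names, and the image is the one used above.

```shell
# Hypothetical helper: print a pod manifest that requests one GPU through the
# nvidia.com/gpu resource. The optional first argument is the pod name.
gpu_pod_manifest() {
  pod_name="${1:-nvidia-smi-limit}"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${pod_name}
spec:
  restartPolicy: OnFailure
  containers:
    - name: nvidia-smi
      image: "nvidia/cuda:11.8.0-base-ubuntu20.04"
      args: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}
```

It could be applied with `gpu_pod_manifest | kubectl apply -f -`. If the node does not advertise nvidia.com/gpu, such a pod would be expected to stay Pending, which makes a missing device plugin easier to notice.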

Minikube (with resource limits and the Cilium CNI)

minikube start -n 2 -p dev --cni=cilium --cpus=4 --disk-size=20000mb --memory=4g --driver docker --container-runtime docker --gpus all

Install Node Feature Discovery

https://github.com/kubernetes-sigs/node-feature-discovery

kubectl apply -k "https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.16.5"

Install the NVIDIA GPU Operator

kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --version=v24.9.0

Every 2.0s: kubectl get po --all-namespaces    Lei: Sat Dec 14 04:16:08 2024

NAMESPACE                NAME                                                              READY   STATUS    RESTARTS      AGE
gpu-operator             gpu-operator-1734117920-node-feature-discovery-gc-767794b8fd5wv   1/1     Running   1 (27m ago)   50m
gpu-operator             gpu-operator-1734117920-node-feature-discovery-master-5767g5skk   1/1     Running   1 (27m ago)   50m
gpu-operator             gpu-operator-1734117920-node-feature-discovery-worker-5hqnq       1/1     Running   1 (27m ago)   50m
gpu-operator             gpu-operator-1734117920-node-feature-discovery-worker-6dcz8       1/1     Running   1 (27m ago)   50m
gpu-operator             gpu-operator-7f474c6cf8-4sj9j                                     1/1     Running   1 (27m ago)   50m
kube-system              cilium-envoy-2h8h7                                                1/1     Running   2 (25m ago)   55m
kube-system              cilium-envoy-m84kr                                                1/1     Running   1 (27m ago)   55m
kube-system              cilium-h8cwb                                                      1/1     Running   1 (27m ago)   55m
kube-system              cilium-nhqzk                                                      1/1     Running   1 (27m ago)   55m
kube-system              cilium-operator-5c7867ccd5-j7w7r                                  1/1     Running   2 (25m ago)   55m
kube-system              coredns-6f6b679f8f-dnt28                                          1/1     Running   4 (27m ago)   55m
kube-system              etcd-dev                                                          1/1     Running   2 (25m ago)   55m
kube-system              kube-apiserver-dev                                                1/1     Running   3 (16m ago)   55m
kube-system              kube-controller-manager-dev                                       1/1     Running   3 (16m ago)   55m
kube-system              kube-proxy-cp88s                                                  1/1     Running   2 (25m ago)   55m
kube-system              kube-proxy-m5djw                                                  1/1     Running   1 (27m ago)   55m
kube-system              kube-scheduler-dev                                                1/1     Running   2 (25m ago)   55m
kube-system              nvidia-device-plugin-daemonset-2z7pk                              1/1     Running   2 (27m ago)   55m
kube-system              nvidia-device-plugin-daemonset-wgtg4                              1/1     Running   2 (27m ago)   55m
kube-system              storage-provisioner                                               1/1     Running   4 (15m ago)   55m
node-feature-discovery   nfd-gc-7b46f54bf8-fvjjt                                           1/1     Running   1 (27m ago)   53m
node-feature-discovery   nfd-master-6c95f4b5fb-dgw7x                                       1/1     Running   1 (27m ago)   53m
node-feature-discovery   nfd-worker-n7vfh                                                  1/1     Running   1 (27m ago)   53m
node-feature-discovery   nfd-worker-x4tmh                                                  1/1     Running   1 (27m ago)   53m
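
With this many pods it is easy to miss one that is stuck. A small sketch that filters such a listing down to the pods that are not healthy; not_running_pods is a hypothetical helper name, and it assumes the default column order of kubectl get po --all-namespaces (NAMESPACE, NAME, READY, STATUS).

```shell
# Hypothetical helper: read `kubectl get po --all-namespaces` output on stdin
# and print "namespace/name status" for every pod whose STATUS is neither
# Running nor Completed. Lines before the NAMESPACE header are skipped.
not_running_pods() {
  awk '$1 == "NAMESPACE" { header = 1; next }
       header && NF >= 4 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}
```

For example: `kubectl get po --all-namespaces | not_running_pods`. For the listing above it should print nothing, since every pod is Running.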

Hope it works for you :)
