How to run GPU resources only on the GPU node pool in AKS

Question

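AKS cluster setup: The cluster was created with Azure CNI Overlay and has two node pools, one for applications and one for networking. The networking node pool has only two CPU VMs. The application node pool uses GPU VMs.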

When enabling the gpu-resources DaemonSet on the cluster using the steps outlined here for Azure AKS: https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool#manually-install-the-nvidia-device-plugin

is there a way to restrict the pods to run only on the GPU node pool?

The GPU node pool has the taint sku=gpu:NoSchedule, and the gpu-resources deployment specifies these tolerations:

    spec:
      tolerations:
      # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
      # This, along with the annotation above marks this pod as a critical add-on.
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"

But pods are still being scheduled on the networking pool VMs, and this is what I see:

    kubectl get pods -n gpu-resources
    NAME                                   READY   STATUS             RESTARTS      AGE
    nvidia-device-plugin-daemonset-7dr7n   1/1     Running            0             11m
    nvidia-device-plugin-daemonset-9nzbv   0/1     CrashLoopBackOff   7 (48s ago)   11m
    nvidia-device-plugin-daemonset-hbck6   1/1     Running            0             11m
    nvidia-device-plugin-daemonset-hksv9   0/1     CrashLoopBackOff   7 (48s ago)   11m
    nvidia-device-plugin-daemonset-kp74v   1/1     Running            0             11m

The CrashLoopBackOff occurs on the two CPU VMs in the networking node pool.

Is there a way to run these pods only on the GPU node pool?

Answer 1

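Taking a hint from @Arko's comment under the question, one way to solve this is to use node affinity, as described here: https://learn.microsoft.com/en-us/azure/aks/operator-best-practices-advanced-scheduler#node-affinity

A toleration only allows a pod to be scheduled onto a tainted node; it does not keep the pod off untainted nodes. That is why the DaemonSet still places pods on the CPU nodes in the networking pool.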

I applied the label hardware:gpu to the application node pool and used it in the deployment YAML as follows:

    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: hardware
              operator: In
              values:
              - gpu
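
For context, here is a rough sketch of where the affinity block sits relative to the tolerations from the question. It is a fragment, not a manifest that can be applied as-is: the containers section is omitted and should stay as it is in the linked AKS guide, and the name label used in the selector is an assumption. The DaemonSet name and namespace are inferred from the kubectl output in the question.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset   # inferred from the pod names in the question
  namespace: gpu-resources               # namespace used in the kubectl command
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds      # assumed label; keep whatever your manifest already uses
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds    # assumed; must match the selector above
    spec:
      # Node affinity restricts scheduling to nodes labeled hardware=gpu.
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: hardware
                operator: In
                values:
                - gpu
      # Tolerations are still needed so the pods are allowed onto the
      # GPU nodes, which carry the sku=gpu:NoSchedule taint.
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      # containers: omitted here; unchanged from the original manifest
```

The two settings do different jobs: the affinity rule keeps the pods off the CPU nodes, and the tolerations let them onto the tainted GPU nodes. After the change is applied, running kubectl get pods -n gpu-resources -o wide should list pods only on nodes from the GPU pool.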
