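AKS cluster setup: the cluster was created with Azure CNI Overlay and has two node pools, one for applications and one for networking. The networking node pool has only two CPU VMs. The application node pool uses GPU VMs.
I enabled the NVIDIA device plugin DaemonSet (in the gpu-resources namespace) on the cluster by following the steps outlined here for Azure AKS: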
https://learn.microsoft.com/en-us/azure/aks/gpu-cluster?tabs=add-ubuntu-gpu-node-pool#manually-install-the-nvidia-device-plugin
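Is there a way to restrict the pods so that they run only on the GPU node pool?
The GPU node pool has the taint sku=gpu:NoSchedule, and the DaemonSet manifest in gpu-resources specifies these tolerations: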
spec:
  tolerations:
  # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
  # This, along with the annotation above marks this pod as a critical add-on.
  - key: CriticalAddonsOnly
    operator: Exists
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
But the pods are still scheduled on the networking pool VMs as well, and I see this:
kubectl get pods -n gpu-resources
NAME                                   READY   STATUS             RESTARTS      AGE
nvidia-device-plugin-daemonset-7dr7n   1/1     Running            0             11m
nvidia-device-plugin-daemonset-9nzbv   0/1     CrashLoopBackOff   7 (48s ago)   11m
nvidia-device-plugin-daemonset-hbck6   1/1     Running            0             11m
nvidia-device-plugin-daemonset-hksv9   0/1     CrashLoopBackOff   7 (48s ago)   11m
nvidia-device-plugin-daemonset-kp74v   1/1     Running            0             11m
The CrashLoopBackOff occurs on the two CPU VMs in the networking node pool.
Is there a way to run these pods only on the GPU node pool?
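A toleration only allows a pod to be scheduled onto a tainted node; it does not keep the pod off untainted nodes, so the DaemonSet still placed pods on the networking pool. Taking a hint from @Arko's comment on the question, one way to solve this is to use node affinity, as described here: https://learn.microsoft.com/en-us/azure/aks/operator-best-practices-advanced-scheduler#node-affinity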
I applied the label hardware:gpu to the application node pool and used it in the DaemonSet YAML as follows:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: hardware
          operator: In
          values:
          - gpu
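For context, here is a rough sketch of where the affinity block might sit relative to the tolerations from the question. It is a fragment rather than a complete manifest: the DaemonSet name is inferred from the pod names in the kubectl output, and the containers, volumes, selector, and other fields from the AKS guide are assumed to stay unchanged and are left out.

```yaml
# Sketch only: scheduling-related fields of the DaemonSet.
# Containers, volumes, selector, and other fields are omitted.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset   # inferred from the pod names
  namespace: gpu-resources
spec:
  template:
    spec:
      # Restricts scheduling to nodes carrying the hardware=gpu label
      # (the label applied to the application node pool).
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: hardware
                operator: In
                values:
                - gpu
      # Tolerations only permit scheduling onto the tainted GPU nodes;
      # they do not exclude the untainted networking nodes.
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
```

Both blocks belong to the pod template spec (spec.template.spec). For a single exact label match like this one, a nodeSelector with hardware: gpu in the same place should have the same effect; node affinity is mainly useful when more expressive matching is needed.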