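I have installed Nvidia's GPU Operator, and it automatically labeled the GPU-capable nodes (this is the label I consider important; there is also a long list of others):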
nvidia.com/gpu.count=1
The node appears to be schedulable:
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Tue, 10 Sep 2024 15:05:17 +0000   Tue, 10 Sep 2024 15:05:17 +0000   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Tue, 10 Sep 2024 16:26:50 +0000   Tue, 10 Sep 2024 15:05:04 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 10 Sep 2024 16:26:50 +0000   Tue, 10 Sep 2024 15:05:04 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 10 Sep 2024 16:26:50 +0000   Tue, 10 Sep 2024 15:05:04 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 10 Sep 2024 16:26:50 +0000   Tue, 10 Sep 2024 15:05:04 +0000   KubeletReady                 kubelet is posting ready status
The node is also reported as Ready by `kubectl get nodes`. However, when I look at the demo workload, I see:
`Warning FailedScheduling 11s (x17 over 79m) default-scheduler 0/6 nodes are available: 3 Insufficient nvidia.com/gpu, 3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/6 nodes are available: 3 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling.`
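The `Insufficient nvidia.com/gpu` part of this message refers to the extended resource a node advertises under `status.allocatable`, not to node labels, so a label such as `nvidia.com/gpu.count=1` does not satisfy a GPU request by itself. The following is a minimal sketch for comparing the two, assuming the output of `kubectl get nodes -o json` has been loaded into a dict; the helper names `gpu_report` and `nodes_missing_gpu_resource` and the sample data are hypothetical.

```python
from __future__ import annotations

import json

GPU_RESOURCE = "nvidia.com/gpu"          # extended resource checked by the scheduler
GPU_COUNT_LABEL = "nvidia.com/gpu.count"  # label from the question


def gpu_report(node_list: dict) -> list[dict]:
    """Summarise the GPU label versus the allocatable GPU resource per node."""
    report = []
    for node in node_list.get("items", []):
        meta = node.get("metadata", {})
        labels = meta.get("labels", {})
        allocatable = node.get("status", {}).get("allocatable", {})
        report.append({
            "name": meta.get("name", "<unknown>"),
            # The label is informational; it is not counted as a resource.
            "label_count": labels.get(GPU_COUNT_LABEL),
            # Quantity compared against the pod's nvidia.com/gpu request.
            "allocatable": allocatable.get(GPU_RESOURCE, "0"),
        })
    return report


def nodes_missing_gpu_resource(node_list: dict) -> list[str]:
    """Names of nodes that carry the GPU label but advertise no GPU resource."""
    return [
        entry["name"]
        for entry in gpu_report(node_list)
        if entry["label_count"] is not None and int(entry["allocatable"]) == 0
    ]


if __name__ == "__main__":
    # Hypothetical sample; in practice load the JSON saved from kubectl.
    sample = {
        "items": [
            {
                "metadata": {"name": "worker-1", "labels": {GPU_COUNT_LABEL: "1"}},
                "status": {"allocatable": {"cpu": "8"}},
            }
        ]
    }
    print(json.dumps(gpu_report(sample), indent=2))
    print(nodes_missing_gpu_resource(sample))
```

A node listed by `nodes_missing_gpu_resource` is labeled as a GPU node but is not advertising any `nvidia.com/gpu` capacity, which would be consistent with the scheduler message above.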
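I even tried labeling the node manually, but so far without success. I followed the Nvidia guide at https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html. My one deviation from the automated deployment is that I installed the driver (v550) manually, because Nvidia has not yet produced an image for Ubuntu 24. I see output from nvidia-smi, so the driver should essentially be working, and the node does get labeled by the operator. Kubernetes is v1.31.0. Is there anything else I am missing? What I tried: labeling the node manually and recreating the pod. What I expected: to see the pod scheduled.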
nvidia.com/gpu=1
So our nodes need this label:
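# Assumed import: the snippet uses `k8s` as an alias for the Kubernetes Python client models.
from kubernetes.client import models as k8s

# Prefer (but do not require) nodes labeled nvidia.com/gpu.present=true.
affinity = k8s.V1Affinity(
    node_affinity=k8s.V1NodeAffinity(
        preferred_during_scheduling_ignored_during_execution=[
            k8s.V1PreferredSchedulingTerm(
                weight=1,
                preference=k8s.V1NodeSelectorTerm(
                    match_expressions=[
                        k8s.V1NodeSelectorRequirement(
                            key="nvidia.com/gpu.present",
                            operator="In",
                            values=["true"],
                        )
                    ]
                ),
            )
        ]
    ),
)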
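
Note that a `preferred_during_scheduling_ignored_during_execution` term is a soft preference: nodes without the label are not excluded, they are only ranked lower. The sketch below expresses the same affinity as a plain dict using the manifest field names and shows which label sets match it. `preference_score` is a hypothetical helper that handles only the `In` operator; it is a simplified illustration, not how the scheduler actually computes scores.

```python
from __future__ import annotations

# Plain-dict form of the preferred node affinity (manifest field names).
AFFINITY = {
    "nodeAffinity": {
        "preferredDuringSchedulingIgnoredDuringExecution": [
            {
                "weight": 1,
                "preference": {
                    "matchExpressions": [
                        {
                            "key": "nvidia.com/gpu.present",
                            "operator": "In",
                            "values": ["true"],
                        }
                    ]
                },
            }
        ]
    }
}


def preference_score(labels: dict[str, str], affinity: dict = AFFINITY) -> int:
    """Sum the weights of preferred terms whose expressions all match.

    Only the "In" operator is handled, since it is the only one used here.
    """
    terms = affinity["nodeAffinity"]["preferredDuringSchedulingIgnoredDuringExecution"]
    score = 0
    for term in terms:
        expressions = term["preference"]["matchExpressions"]
        if all(
            expr["operator"] == "In" and labels.get(expr["key"]) in expr["values"]
            for expr in expressions
        ):
            score += term["weight"]
    return score


if __name__ == "__main__":
    print(preference_score({"nvidia.com/gpu.present": "true"}))
    print(preference_score({"kubernetes.io/os": "linux"}))
```

A node with `nvidia.com/gpu.present=true` picks up the term's weight, while any other node simply scores zero for this term and remains a candidate.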