GPU-enabled Kubernetes container is not being scheduled


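I installed NVIDIA's GPU Operator, and it automatically labeled the GPU-capable node. This is the label I consider important (there is also a long list of others):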

nvidia.com/gpu.count=1

The node appears to be schedulable:

Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Tue, 10 Sep 2024 15:05:17 +0000   Tue, 10 Sep 2024 15:05:17 +0000   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Tue, 10 Sep 2024 16:26:50 +0000   Tue, 10 Sep 2024 15:05:04 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 10 Sep 2024 16:26:50 +0000   Tue, 10 Sep 2024 15:05:04 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 10 Sep 2024 16:26:50 +0000   Tue, 10 Sep 2024 15:05:04 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 10 Sep 2024 16:26:50 +0000   Tue, 10 Sep 2024 15:05:04 +0000   KubeletReady                 kubelet is posting ready status

The node is also reported as Ready by `kubectl get nodes`. However, when I look at the demo workload, I see:

`Warning  FailedScheduling  11s (x17 over 79m)  default-scheduler  0/6 nodes are available: 3 Insufficient nvidia.com/gpu, 3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/6 nodes are available: 3 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling.`

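I even tried labeling the node manually, but so far without success. I followed NVIDIA's guide:

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html

The one deviation from the automated deployment is that I installed the driver manually (v550), because NVIDIA has not yet published an image for Ubuntu 24. nvidia-smi produces output, and the setup should essentially be correct, since the operator labeled the node. Kubernetes version: v1.31.0. Is there anything else I'm missing?

What I tried: labeling the node manually and recreating the pod.
What I expected: the pod to be scheduled.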
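
A note on the scheduler message: `Insufficient nvidia.com/gpu` refers to the node's allocatable extended resources, which the NVIDIA device plugin advertises through the kubelet. It does not refer to node labels, so adding labels by hand would not be expected to change it. Below is a minimal Python sketch of that comparison. The helper names `gpu_report` and `fetch_nodes` and the sample node are hypothetical, and `fetch_nodes` assumes `kubectl` is on the PATH with access to the cluster.

```python
import json
import subprocess

GPU_RESOURCE = "nvidia.com/gpu"           # extended resource the scheduler counts
GPU_COUNT_LABEL = "nvidia.com/gpu.count"  # label quoted in the question


def gpu_report(node: dict) -> dict:
    """Compare the GPU count label with the allocatable GPU resource of one node."""
    meta = node.get("metadata", {})
    labels = meta.get("labels", {})
    allocatable = node.get("status", {}).get("allocatable", {})
    return {
        "name": meta.get("name", "<unknown>"),
        "label_count": labels.get(GPU_COUNT_LABEL),
        # Quantities are strings in the API; a missing key means nothing is advertised.
        "allocatable": allocatable.get(GPU_RESOURCE, "0"),
    }


def fetch_nodes() -> list:
    """Return all node objects via `kubectl get nodes -o json` (needs cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)["items"]


# Hypothetical node: labeled as having a GPU, but advertising no GPU resource.
sample_node = {
    "metadata": {"name": "gpu-node-1", "labels": {GPU_COUNT_LABEL: "1"}},
    "status": {"allocatable": {"cpu": "8", "memory": "32Gi"}},
}
print(gpu_report(sample_node))
```

On a real cluster, `for n in fetch_nodes(): print(gpu_report(n))` would list every node. A node whose label says 1 but whose allocatable value is "0" matches the situation in the event above, and the device plugin pod on that node would be the next place to look.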

kubernetes gpu nvidia
1 Answer

nvidia.com/gpu=1

So our nodes need this label:

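# Assumed import (not shown in the original answer): the official Kubernetes Python client models.
from kubernetes.client import models as k8s

# Soft (preferred) node affinity toward nodes labeled nvidia.com/gpu.present=true.
affinity = k8s.V1Affinity(
    node_affinity=k8s.V1NodeAffinity(
        preferred_during_scheduling_ignored_during_execution=[
            k8s.V1PreferredSchedulingTerm(
                weight=1,
                preference=k8s.V1NodeSelectorTerm(
                    match_expressions=[
                        k8s.V1NodeSelectorRequirement(
                            key="nvidia.com/gpu.present",
                            operator="In",
                            values=["true"],
                        )
                    ]
                ),
            )
        ]
    ),
)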
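
To make explicit which label the expression above selects, here is a small stdlib-only sketch of how a single requirement with operator `In` is evaluated against node labels. The function name `matches_in` and the sample label sets are hypothetical; this only illustrates the matching rule and is not the scheduler's actual code.

```python
def matches_in(labels: dict, key: str, values: list) -> bool:
    """Evaluate one node-selector requirement with operator "In".

    The requirement holds only if the node carries `key` and its value
    is one of `values`; a missing key never matches.
    """
    return key in labels and labels[key] in values


# Requirement taken from the affinity in the answer.
KEY, VALUES = "nvidia.com/gpu.present", ["true"]

print(matches_in({"nvidia.com/gpu.present": "true"}, KEY, VALUES))
print(matches_in({"nvidia.com/gpu.count": "1"}, KEY, VALUES))
```

Because the term is listed under `preferred_during_scheduling_ignored_during_execution`, it only influences how feasible nodes are ranked. It cannot make a node feasible when the scheduler reports `Insufficient nvidia.com/gpu`.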

	