GPU-enabled Kubernetes pod not being scheduled

Question

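I have installed Nvidia's GPU Operator, and the GPU-capable node was labeled automatically (this is the label I consider important; there is also a long list of others):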

nvidia.com/gpu.count=1

The node appears to be schedulable:

Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Tue, 10 Sep 2024 15:05:17 +0000   Tue, 10 Sep 2024 15:05:17 +0000   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Tue, 10 Sep 2024 16:26:50 +0000   Tue, 10 Sep 2024 15:05:04 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Tue, 10 Sep 2024 16:26:50 +0000   Tue, 10 Sep 2024 15:05:04 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Tue, 10 Sep 2024 16:26:50 +0000   Tue, 10 Sep 2024 15:05:04 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Tue, 10 Sep 2024 16:26:50 +0000   Tue, 10 Sep 2024 15:05:04 +0000   KubeletReady                 kubelet is posting ready status

The node is also reported as Ready by `kubectl get nodes`. However, when I look at the demo workload, I see:

`Warning  FailedScheduling  11s (x17 over 79m)  default-scheduler  0/6 nodes are available: 3 Insufficient nvidia.com/gpu, 3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }. preemption: 0/6 nodes are available: 3 No preemption victims found for incoming pod, 3 Preemption is not helpful for scheduling.`
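For readers decoding the event above: the scheduler lists one reason per group of nodes. The reason `Insufficient nvidia.com/gpu` refers to the `nvidia.com/gpu` resource that the pod requests, which is a separate thing from node labels such as `nvidia.com/gpu.count`. The following is a minimal sketch that splits the message into per-reason counts; the function name `parse_scheduling_reasons` is made up for illustration and the preemption part of the message is left out.

```python
import re

# The scheduler message quoted in the question (preemption part omitted).
MESSAGE = (
    "0/6 nodes are available: 3 Insufficient nvidia.com/gpu, "
    "3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }."
)


def parse_scheduling_reasons(message: str) -> dict:
    """Split a FailedScheduling message into {reason: node_count}."""
    # Keep only the text after "nodes are available:" and drop the final period.
    _, _, reasons = message.partition("nodes are available:")
    result = {}
    for part in reasons.strip().rstrip(".").split(", "):
        match = re.match(r"(\d+)\s+(.*)", part.strip())
        if match:
            result[match.group(2)] = int(match.group(1))
    return result


reasons = parse_scheduling_reasons(MESSAGE)
for reason, count in reasons.items():
    print(f"{count} node(s): {reason}")
```

Read this way, the three control-plane nodes are excluded by their taint, and the three worker nodes are excluded because none of them offers a free `nvidia.com/gpu` resource.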

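I even tried labeling the node manually, but so far without success. I followed Nvidia's guide:

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html

The only deviation from the automated deployment is that I installed the driver (v550) manually, because Nvidia has not yet published an image for Ubuntu 24. I see output from nvidia-smi, so the driver should essentially be working, since the node was labeled by the operator. The cluster runs Kubernetes v1.31.0. Is there anything else I am missing? I tried labeling the node manually and recreating the pod, and expected to see the pod scheduled.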
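One check that may help here: labels live under `metadata.labels` of the Node object, while the resources the scheduler counts live under `status.allocatable`. The sketch below reads both from the JSON form of a node, as printed by `kubectl get node <name> -o json`. The sample document, the node name `gpu-node-1`, and the helper `allocatable_gpus` are hypothetical and are not taken from the cluster in the question.

```python
import json

# Hypothetical excerpt of `kubectl get node <name> -o json`; the values are
# invented for illustration and do not come from the question's cluster.
NODE_JSON = """
{
  "metadata": {"name": "gpu-node-1", "labels": {"nvidia.com/gpu.count": "1"}},
  "status": {
    "capacity": {"cpu": "8", "memory": "32Gi"},
    "allocatable": {"cpu": "8", "memory": "32Gi"}
  }
}
"""


def allocatable_gpus(node: dict, resource: str = "nvidia.com/gpu") -> int:
    """Return how many units of `resource` the node advertises as allocatable."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return int(allocatable.get(resource, "0"))


node = json.loads(NODE_JSON)
label = node["metadata"]["labels"].get("nvidia.com/gpu.count")
gpus = allocatable_gpus(node)
print(f"label nvidia.com/gpu.count={label}, allocatable nvidia.com/gpu={gpus}")
if gpus == 0:
    print("The node carries the label but advertises no GPU resource.")
```

If a real node looks like this sample, adding more labels by hand would not be expected to change the `Insufficient nvidia.com/gpu` result, because the resource itself is missing from `status.allocatable`.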

Answer 1

nvidia.com/gpu=1

So our nodes need this label:

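# Assumed import: the snippet uses `k8s` without defining it.
from kubernetes.client import models as k8s

# Soft preference for nodes labeled nvidia.com/gpu.present=true.
affinity = k8s.V1Affinity(
    node_affinity=k8s.V1NodeAffinity(
        preferred_during_scheduling_ignored_during_execution=[
            k8s.V1PreferredSchedulingTerm(
                weight=1,
                preference=k8s.V1NodeSelectorTerm(
                    match_expressions=[
                        k8s.V1NodeSelectorRequirement(
                            key="nvidia.com/gpu.present",
                            operator="In",
                            values=["true"],
                        )
                    ]
                ),
            )
        ]
    ),
)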
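To make explicit which label the affinity above looks for, here is a minimal sketch of how an `In` requirement is compared against node labels. It is an illustration only, not the scheduler's implementation; the helper `matches_in_requirement` and both label sets are hypothetical. Because the term is listed under `preferred_during_scheduling_ignored_during_execution`, a match only influences node ranking and does not by itself exclude other nodes.

```python
def matches_in_requirement(labels: dict, key: str, values: list) -> bool:
    """Return True when the node has label `key` with a value listed in `values`."""
    return labels.get(key) in values


# Hypothetical label sets for two nodes.
node_with_present_label = {"nvidia.com/gpu.present": "true"}
node_with_other_label = {"nvidia.com/gpu": "1"}

key, values = "nvidia.com/gpu.present", ["true"]
first = matches_in_requirement(node_with_present_label, key, values)
second = matches_in_requirement(node_with_other_label, key, values)
print(f"nvidia.com/gpu.present=true matches: {first}")
print(f"nvidia.com/gpu=1 matches: {second}")
```

As far as the snippet shows, the label that satisfies this preference is `nvidia.com/gpu.present=true`.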
