GKE Kubernetes kube-system resources nodeAffinity

Question

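I have a multi-regional test setup on GKE k8s 1.9.4. Every cluster has:

an ingress, configured with kubemci
3 node pools with different node labels: default-pool system (1 vCPU / 2 GB RAM), frontend-pool frontend (2 vCPU / 2 GB RAM), backend-pool backend (1 vCPU / 600 MB RAM)
HPA scaling by custom metrics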

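So things like prometheus-operator, prometheus-server, custom-metrics-api-server and kube-state-metrics are attached to nodes with the system label.

Frontend and backend pods are attached to nodes with the frontend and backend labels respectively (a single pod per node), see podantiaffinity.

After autoscaling scales the backend or frontend pods down, their nodes remain, apparently because pods from the kube-system namespace (e.g. heapster) are running on them. The result is that nodes with the frontend / backend label stay alive after downscaling even though no backend or frontend pod is left on them.

The question is: how can I avoid kube-system pods being created on the nodes that serve my application (if this is really sane and possible)?

My guess is that I should use taints and tolerations for the backend and frontend nodes, but how can that be combined with HPA and the in-cluster node autoscaler?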

Answer 1

It seems taints and tolerations did the trick.

Create a cluster with a default node pool (for monitoring and kube-system):

gcloud container --project "my-project-id" clusters create "app-europe" \
  --zone "europe-west1-b" --username="admin" --cluster-version "1.9.4-gke.1" --machine-type "custom-2-4096" \
  --image-type "COS" --disk-size "10" --num-nodes "1" --network "default" --enable-cloud-logging --enable-cloud-monitoring \
  --maintenance-window "01:00" --node-labels=region=europe-west1,role=system

Create a node pool for your application:

gcloud container --project "my-project-id" node-pools create "frontend" \
      --cluster "app-europe" --zone "europe-west1-b" --machine-type "custom-2-2048" --image-type "COS" \
      --disk-size "10" --node-labels=region=europe-west1,role=frontend \
      --node-taints app=frontend:NoSchedule \
      --enable-autoscaling --num-nodes "1" --min-nodes="1" --max-nodes="3"

Then add nodeAffinity and tolerations sections to the pod template spec in your deployment manifest:

  tolerations:
  - key: "app"
    operator: "Equal"
    value: "frontend"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: beta.kubernetes.io/instance-type
            operator: In
            values:
            - custom-2-2048
        - matchExpressions:
          - key: role
            operator: In
            values:
            - frontend
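
One caveat with the snippet above: entries under nodeSelectorTerms are ORed, while the expressions inside a single matchExpressions list are ANDed. As written, a node that matches either the instance type or the role label satisfies the affinity. If the intent is to require both, a sketch along these lines (same keys and values as above, merged into one term) may be closer to what is wanted:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      # A single term: every expression listed in it must match (logical AND).
      - matchExpressions:
        - key: beta.kubernetes.io/instance-type
          operator: In
          values:
          - custom-2-2048
        - key: role
          operator: In
          values:
          - frontend
```

Since the taint already keeps other pods off the frontend pool, matching on the role label alone would probably be enough in this setup.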

Answer 2

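The first thing I would recommend checking is that the amount of resources requested in your PodSpec is enough to carry the load, and that the system nodes have enough resources to schedule all system pods.

You can try to keep system pods off the autoscaled frontend or backend nodes by using either the simpler nodeSelector or the more flexible Node Affinity.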
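
For workloads you deploy yourself (for example the Prometheus components mentioned in the question), a minimal sketch of the nodeSelector approach could look like the fragment below. It assumes the system pool carries the role=system label set via --node-labels in Answer 1; adjust the key and value to whatever labels your pools actually have.

```yaml
# Pod template spec fragment for a system/monitoring workload.
spec:
  nodeSelector:
    role: system   # assumed label, taken from --node-labels in Answer 1
```

Note that nodeSelector only constrains the pods that carry it. It does not stop other pods from landing on frontend / backend nodes; repelling pods from a node is what taints are for.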

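You can find good explanations and examples in the document "Assigning Pods to Nodes".

The Taints and Tolerations feature is similar to Node Affinity, but works more from the node's perspective: it allows a node to repel a set of pods. If you choose this approach, check the document "Taints and Tolerations".

When you create a node pool for autoscaling, you can add labels and taints so that they are applied to new nodes when Cluster Autoscaler (CA) scales the pool up.

In addition to restricting system pods from being scheduled on frontend / backend nodes, it would be a good idea to configure a PodDisruptionBudget and the autoscaler safe-to-evict option for pods that may prevent CA from removing a node during scale-down.

According to the Cluster Autoscaler FAQ, several types of pods can prevent CA from scaling down your cluster:

Pods with a restrictive PodDisruptionBudget (PDB).
Kube-system pods that are not run on the node by default and that either have no PDB or have a PDB that is too restrictive (since CA 0.6).
Pods that are not backed by a controller object (so not created by a deployment, replica set, job, stateful set, etc.).
Pods with local storage. *
Pods that cannot be moved elsewhere due to various constraints (lack of resources, non-matching node selectors or affinity, matching anti-affinity, etc.).

* Unless the pod has the following annotation (supported in CA 1.0.3 or later):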

"cluster-autoscaler.kubernetes.io/safe-to-evict": "true"
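
The annotation is read from the pod, so for controller-managed pods it belongs in the pod template metadata rather than on the Deployment object itself. A sketch of where it would go:

```yaml
# Pod template fragment inside a Deployment (or a similar controller).
template:
  metadata:
    annotations:
      # The value must be the string "true", hence the quotes.
      cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
```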

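Before version 0.6, Cluster Autoscaler did not touch nodes that were running important kube-system pods such as DNS, Heapster, Dashboard, etc. If these pods landed on different nodes, CA could not scale the cluster down, and the user could end up with a completely empty 3-node cluster. In 0.6, an option was added to tell CA that some system pods can be moved around. If the user configures a PodDisruptionBudget for a kube-system pod, the default strategy of not touching the node running this pod is overridden by the PDB settings. So, to enable migration of kube-system pods, set minAvailable to 0 (or to <= N if there are N+1 pod replicas). See also "I have a couple of nodes with low utilization, but they are not scaled down. Why?"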
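
As an illustration, a PodDisruptionBudget that lets CA move heapster might look roughly like the manifest below. The object name is hypothetical and the k8s-app: heapster selector is an assumption; check the real labels with kubectl get pods -n kube-system --show-labels before applying anything like this.

```yaml
apiVersion: policy/v1beta1    # matches a 1.9 cluster; use policy/v1 on Kubernetes 1.21+
kind: PodDisruptionBudget
metadata:
  name: heapster-pdb          # hypothetical name
  namespace: kube-system
spec:
  minAvailable: 0             # allows the single replica to be evicted
  selector:
    matchLabels:
      k8s-app: heapster       # assumed label, verify against your cluster
```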

The Cluster Autoscaler FAQ can help you choose the correct version for your cluster.

To better understand what happens under the hood of Cluster Autoscaler, check the official documentation.
