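I wrote a Python script that turns off autoscaling for a GKE cluster and then stops the underlying nodes in each MIG (managed instance group), zone by zone. The problem is that the instances are recreated when the script stops them: right after an instance stops, the node status changes from "Unknown" back to "Ready", even though I disabled autoscaling before stopping the instances. This does not happen when I stop the instances manually from the console in each MIG in each zone. Why does this happen? Can anyone suggest what I should check, or what else I need to add to the code?
I have verified that autoscaling was turned off without any problem. For your information, autoscaling at the MIG level is off, and the MIG-level health check is also disabled. Autoscaling exists only at the node pool level.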
import subprocess
import json
import argparse
import time

def get_node_pool_name(cluster_name, project_name, region_name):
    try:
        cmd = [
            'gcloud', 'container', 'clusters', 'describe', cluster_name,
            '--project', project_name,
            '--region', region_name,
            '--format', 'json'
        ]
        ...
        ...
        return node_pool_name
    except Exception as e:
        print(f"Error occurred while getting node pool name: {str(e)}")
        raise

def disable_autoscaler(cluster_name, project_name, region_name, node_pool_name):
    try:
        cmd = [
            'gcloud', 'container', 'node-pools', 'update', node_pool_name,
            '--cluster', cluster_name,
            '--project', project_name,
            '--region', region_name,
            '--no-enable-autoscaling'
        ]
        ...
        ...
    except Exception as e:
        print(f"Error occurred while disabling node pool autoscaler: {str(e)}")
        raise

def get_instance_groups(cluster_name, project_name, region_name):
    try:
        cmd = [
            'gcloud', 'container', 'clusters', 'describe', cluster_name,
            '--project', project_name,
            '--region', region_name,
            '--format', 'json'
        ]
        ...
        ...
        return instance_groups
    except Exception as e:
        print(f"Error occurred while getting instance-groups: {str(e)}")
        raise

def get_instances(instance_group_name, project_name, zone):
    try:
        cmd = [
            'gcloud', 'compute', 'instance-groups', 'list-instances', instance_group_name,
            '--project', project_name,
            '--zone', zone,
            '--format', 'json'
        ]
        ...
        ...
        return instance_names
    except Exception as e:
        print(f"Error occurred while getting instances: {str(e)}")
        raise

def stop_instances(instance_names, project_name, zone):
    try:
        for instance in instance_names:
            cmd = [
                'gcloud', 'compute', 'instances', 'stop', instance,
                '--project', project_name,
                '--zone', zone
            ]
            subprocess.run(cmd, check=True)
            print(f"Instance stopped: {instance} in zone {zone}")
    except Exception as e:
        print(f"Error occurred while stopping instances: {str(e)}")
        raise

def main(cluster_name, project_name, region_name):
    node_pool_name = get_node_pool_name(cluster_name, project_name, region_name)
    disable_autoscaler(cluster_name, project_name, region_name, node_pool_name)
    time.sleep(60)
    instance_groups = get_instance_groups(cluster_name, project_name, region_name)
    for instance_group, zone in instance_groups:
        instance_names = get_instances(instance_group, project_name, zone)
        stop_instances(instance_names, project_name, zone)
        time.sleep(30)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Disable GKE autoscaler and stop instances.")
    parser.add_argument('cluster_name', type=str, help='Name of the GKE cluster')
    parser.add_argument('project_name', type=str, help='Google Cloud project name')
    parser.add_argument('region_name', type=str, help='Region name of the GKE cluster')
    args = parser.parse_args()
    main(args.cluster_name, args.project_name, args.region_name)
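Have you disabled node auto-repair at the node pool level? Auto-repair will try to recreate instances that are stopped or that it considers unhealthy. See the official node auto-repair documentation for more information.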
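If auto-repair turns out to be enabled, a helper along these lines could be added to the script next to disable_autoscaler. This is only a sketch in the same style as the question's code: the function names build_disable_autorepair_cmd and disable_autorepair are made up for illustration, and it assumes the node pool accepts the --no-enable-autorepair flag of gcloud container node-pools update.

```python
import subprocess


def build_disable_autorepair_cmd(cluster_name, project_name, region_name, node_pool_name):
    # Hypothetical helper: builds the gcloud command that turns off
    # node auto-repair for a single node pool.
    return [
        'gcloud', 'container', 'node-pools', 'update', node_pool_name,
        '--cluster', cluster_name,
        '--project', project_name,
        '--region', region_name,
        '--no-enable-autorepair',
    ]


def disable_autorepair(cluster_name, project_name, region_name, node_pool_name):
    # Runs the command; check=True raises CalledProcessError if gcloud fails.
    cmd = build_disable_autorepair_cmd(
        cluster_name, project_name, region_name, node_pool_name
    )
    subprocess.run(cmd, check=True)
    print(f"Auto-repair disabled for node pool: {node_pool_name}")
```

In main, this would be called right after disable_autoscaler and before any instance is stopped.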
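In GCP, disabling autoscaling sometimes takes effect with a delay. So try increasing the sleep time between disabling autoscaling and stopping the instances. See the official documentation on disabling node pool autoscaling for more details.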
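Instead of guessing a longer fixed sleep, the script could poll the node pool until it no longer reports autoscaling as enabled. The following is a rough sketch, not a tested fix: describe_node_pool, autoscaling_enabled and wait_for_autoscaling_off are hypothetical names, and it assumes the JSON from gcloud container node-pools describe carries an autoscaling.enabled field that is absent or false once autoscaling is off. Even then, the reported state may not guarantee that the autoscaler has fully stopped acting.

```python
import json
import subprocess
import time


def autoscaling_enabled(node_pool):
    # Assumption: the describe output has an "autoscaling" object with an
    # "enabled" boolean; a missing or empty object is treated as disabled.
    return bool((node_pool.get('autoscaling') or {}).get('enabled', False))


def describe_node_pool(cluster_name, project_name, region_name, node_pool_name):
    # Fetches the current node pool description as a dict.
    cmd = [
        'gcloud', 'container', 'node-pools', 'describe', node_pool_name,
        '--cluster', cluster_name,
        '--project', project_name,
        '--region', region_name,
        '--format', 'json',
    ]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return json.loads(result.stdout)


def wait_for_autoscaling_off(fetch, timeout=600, interval=15, sleep=time.sleep):
    # Calls fetch() until it reports autoscaling as disabled.
    # Returns True on success, False once `timeout` seconds have been waited.
    waited = 0
    while True:
        if not autoscaling_enabled(fetch()):
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
```

In main, the time.sleep(60) after disable_autoscaler could then be replaced with something like wait_for_autoscaling_off(lambda: describe_node_pool(cluster_name, project_name, region_name, node_pool_name)), stopping instances only if it returns True.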