I am trying to create a Kubernetes cluster on my Ubuntu VMs with the following system configuration: Ubuntu 22.04, RAM: 4096 MB, CPUs: 4, Disk: 300 GB. When I run
kubectl get nodes
I can see that both the master and the worker are up and running, but after a while both of my machines suddenly freeze, as if they were in a deadlock, and everything stops working. I can't even exit from the machines. Below I have attached the syslog from the worker machine:
Jun 19 04:21:40 host02 containerd[5075]: time="2024-06-19T04:21:40.338401327Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:kube-proxy-677wz,Uid:704a6fcf-bfd0-4185-a62d-b5d3cd920823,Namespace:kube-system,Attempt:53,} returns sandbox id \"39bd5ab393f71392a753cb455aca1bf1ed129d0033a91f447957f791b695b136\""
Jun 19 04:21:40 host02 kubelet[1060]: E0619 04:21:40.340171 1060 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-proxy\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=kube-proxy pod=kube-proxy-677wz_kube-system(704a6fcf-bfd0-4185-a62d-b5d3cd920823)\"" pod="kube-system/kube-proxy-677wz" podUID="704a6fcf-bfd0-4185-a62d-b5d3cd920823"
Jun 19 04:21:41 host02 kubelet[1060]: I0619 04:21:41.275045 1060 scope.go:117] "RemoveContainer" containerID="a2c6c5fa164aca6d2698ad9b949db036d5ae9ca7ad700b1d123fd032954c7fab"
Jun 19 04:21:41 host02 kubelet[1060]: E0619 04:21:41.275807 1060 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-proxy\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=kube-proxy pod=kube-proxy-677wz_kube-system(704a6fcf-bfd0-4185-a62d-b5d3cd920823)\"" pod="kube-system/kube-proxy-677wz" podUID="704a6fcf-bfd0-4185-a62d-b5d3cd920823"
Jun 19 04:21:50 host02 kubelet[1060]: I0619 04:21:50.371821 1060 scope.go:117] "RemoveContainer" containerID="740244c2475add426868c6d41b94fe896f56443d38a539b21cb6b238cffa35d6"
Jun 19 04:21:50 host02 kubelet[1060]: E0619 04:21:50.373576 1060 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"calico-node\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=calico-node pod=calico-node-g94qs_kube-system(d7b0ad5d-d096-4e0b-bd3e-71680576d58e)\"" pod="kube-system/calico-node-g94qs" podUID="d7b0ad5d-d096-4e0b-bd3e-71680576d58e"
Jun 19 04:21:54 host02 kubelet[1060]: I0619 04:21:54.370961 1060 scope.go:117] "RemoveContainer" containerID="a2c6c5fa164aca6d2698ad9b949db036d5ae9ca7ad700b1d123fd032954c7fab"
Jun 19 04:21:54 host02 kubelet[1060]: E0619 04:21:54.371672 1060 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-proxy\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=kube-proxy pod=kube-proxy-677wz_kube-system(704a6fcf-bfd0-4185-a62d-b5d3cd920823)\"" pod="kube-system/kube-proxy-677wz" podUID="704a6fcf-bfd0-4185-a62d-b5d3cd920823"
Jun 19 04:22:05 host02 kubelet[1060]: I0619 04:22:05.370872 1060 scope.go:117] "RemoveContainer" containerID="740244c2475add426868c6d41b94fe896f56443d38a539b21cb6b238cffa35d6"
Jun 19 04:22:05 host02 kubelet[1060]: E0619 04:22:05.373823 1060 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"calico-node\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=calico-node pod=calico-node-g94qs_kube-system(d7b0ad5d-d096-4e0b-bd3e-71680576d58e)\"" pod="kube-system/calico-node-g94qs" podUID="d7b0ad5d-d096-4e0b-bd3e-71680576d58e"
Jun 19 04:22:08 host02 kubelet[1060]: I0619 04:22:08.373709 1060 scope.go:117] "RemoveContainer" containerID="a2c6c5fa164aca6d2698ad9b949db036d5ae9ca7ad700b1d123fd032954c7fab"
Jun 19 04:22:08 host02 kubelet[1060]: E0619 04:22:08.375837 1060 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-proxy\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=kube-proxy pod=kube-proxy-677wz_kube-system(704a6fcf-bfd0-4185-a62d-b5d3cd920823)\"" pod="kube-system/kube-proxy-677wz" podUID="704a6fcf-bfd0-4185-a62d-b5d3cd920823"
The CrashLoopBackOff status is not a specific error; it is a signal that an underlying problem keeps the main process inside the container from running continuously. When a container crashes or exits shortly after starting (the CrashLoop), Kubernetes' kubelet automatically restarts it. After each failed restart, the delay before the next attempt (the BackOff) grows exponentially (10s, 20s, 40s, and so on), up to a cap of 5 minutes.
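To observe this back-off behaviour on your cluster, you can watch the affected pods and read the logs of the previous (crashed) attempt. A minimal sketch, using the pod names taken from your syslog (replace them with whatever kubectl get pods shows on your cluster):

kubectl -n kube-system get pods -o wide -w                                 # watch restart counters and back-off
kubectl -n kube-system logs kube-proxy-677wz --previous                   # logs of the last crashed kube-proxy attempt
kubectl -n kube-system logs calico-node-g94qs -c calico-node --previous   # same for the calico-node container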
Several underlying problems can trigger the CrashLoopBackOff status:
Resource exhaustion (OOMKilled): the container exceeds its allocated memory limit (resources.limits.memory) and is terminated by the out-of-memory (OOM) killer; you can check for this with the commands shown after this list.
Liveness probe failure: the container fails its liveness health check, so Kubernetes marks it unhealthy and restarts it.
Many of these failures ultimately come down to SSL certificate errors or networking problems, so make sure certificates and cluster networking are working correctly.
Application misconfiguration (exit code 0): the main application inside the container is misconfigured and exits cleanly with exit code 0.
Application error (non-zero exit code): the main application exits with a non-zero exit code, indicating an error or crash in the application itself.
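To rule out OOM kills and resource pressure on the worker, here are a few quick checks; the pod and node names are taken from your logs and are assumptions on my side, so adjust them as needed:

kubectl -n kube-system get events --sort-by=.lastTimestamp                 # recent events: probe failures, OOM, image pull problems
kubectl describe node host02                                               # look at Conditions and Allocated resources
kubectl -n kube-system get pod calico-node-g94qs -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'   # prints OOMKilled / Error if the container was killed

On the VM itself, dmesg | grep -i 'out of memory' will also show whether the kernel OOM killer fired, which would fit the "machine freezes" symptom you describe.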
Could you check with kubectl describe whether you get any exit codes, so we can troubleshoot further?
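Something along these lines should surface the exit code; kube-proxy-677wz is the pod name from your logs, so replace it with whichever pod is crashing:

kubectl -n kube-system describe pod kube-proxy-677wz        # check "Last State: Terminated" for Exit Code and Reason
kubectl -n kube-system get pod kube-proxy-677wz -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'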
Also see the article Troubleshoot and Fix Kubernetes CrashLoopBackoff Status by Mickael Alliel, which may help you troubleshoot the CrashLoopBackOff error further.