这种情况是随机发生的,有时是在启动新实例时(如下例),有时是在实例已经运行一段时间时发生。
“Boot worker - SIGKILL - TIMEOUT”的错误周期可能持续 10 秒到一个多小时。
这也可能发生在我的应用程序中的任何端点上,它发生在不同端点上的 POST 请求、GET 请求。
最初我认为这是由于我的代码中的某些错误或 POST 请求中的某些格式错误的数据造成的,但经过调查,我意识到,当实例没有陷入此错误循环时发送完全相同的 POST/GET 请求效果非常好。
下面是一个示例的日志,其中导致启动新实例的 POST 请求在此循环中停留了一个多小时,然后突然自行解决并正常处理状态为 200 的 POST 请求。
DEFAULT 2024-04-07T07:56:06.463491Z [2024-04-07 07:56:06 +0000] [11] [INFO] Starting gunicorn 21.2.0
DEFAULT 2024-04-07T07:56:06.464623Z [2024-04-07 07:56:06 +0000] [11] [DEBUG] Arbiter booted
DEFAULT 2024-04-07T07:56:06.464637Z [2024-04-07 07:56:06 +0000] [11] [INFO] Listening at: http://0.0.0.0:8081 (11)
DEFAULT 2024-04-07T07:56:06.464712Z [2024-04-07 07:56:06 +0000] [11] [INFO] Using worker: gthread
INFO 2024-04-07T07:56:06.466397Z [pid1-nginx] Successfully connected to 127.0.0.1:8081 after 3.361200658s [session:RB4VHXB]
INFO 2024-04-07T07:56:06.466735Z [pid1-nginx] Successfully connected to localhost:8081 after 3.361617817s [session:RB4VHXB]
INFO 2024-04-07T07:56:06.469263Z [pid1-nginx] Creating config at /tmp/nginxconf-644813283/nginx.conf [session:RB4VHXB]
INFO 2024-04-07T07:56:06.472325Z [pid1-nginx] Starting nginx (pid 17): /usr/sbin/nginx -c /tmp/nginxconf-644813283/nginx.conf [session:RB4VHXB]
DEFAULT 2024-04-07T07:56:06.544837Z [2024-04-07 07:56:06 +0000] [16] [INFO] Booting worker with pid: 16
DEFAULT 2024-04-07T07:56:06.576102Z [2024-04-07 07:56:06 +0000] [16] [DEBUG] Closing connection.
DEFAULT 2024-04-07T07:56:06.577868Z [2024-04-07 07:56:06 +0000] [16] [DEBUG] Closing connection.
DEFAULT 2024-04-07T07:56:06.579116Z [2024-04-07 07:56:06 +0000] [16] [DEBUG] GET /_ah/start
DEFAULT 2024-04-07T07:56:06.618933Z [2024-04-07 07:56:06 +0000] [19] [INFO] Booting worker with pid: 19
DEFAULT 2024-04-07T07:56:06.666217Z [2024-04-07 07:56:06 +0000] [11] [DEBUG] 2 workers
DEFAULT 2024-04-07T07:56:06.739491Z [2024-04-07 07:56:06 +0000] [16] [DEBUG] POST /processNewOrder
DEFAULT 2024-04-07T07:57:07.717148Z [2024-04-07 07:57:07 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:16)
DEFAULT 2024-04-07T07:57:08.720797Z [2024-04-07 07:57:08 +0000] [11] [ERROR] Worker (pid:16) was sent SIGKILL! Perhaps out of memory?
DEFAULT 2024-04-07T07:57:08.720910Z 2024/04/07 07:57:08 [error] 18#18: *6 upstream prematurely closed connection while reading response header from upstream, client: 169.254.1.1, server: _, request: "POST /processNewOrder HTTP/1.1", upstream: "http://127.0.0.1:8081/processNewOrder", host: "redacted.et.r.appspot.com"
DEFAULT 2024-04-07T07:57:08.824712Z [protoPayload.method: POST] [protoPayload.status: 502] [protoPayload.responseSize: 272 B] [protoPayload.latency: 61.15 s] [protoPayload.userAgent: AppEngine-Google; (+http://code.google.com/appengine)] /processNewOrder
DEFAULT 2024-04-07T07:57:08.898539Z [2024-04-07 07:57:08 +0000] [19] [DEBUG] POST /processNewOrder
DEFAULT 2024-04-07T07:57:08.990455Z [2024-04-07 07:57:08 +0000] [27] [INFO] Booting worker with pid: 27
DEFAULT 2024-04-07T07:58:08.968963Z [2024-04-07 07:58:08 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:19)
DEFAULT 2024-04-07T07:58:09.973588Z 2024/04/07 07:58:09 [error] 18#18: *7 upstream prematurely closed connection while reading response header from upstream, client: 169.254.1.1, server: _, request: "POST /processNewOrder HTTP/1.1", upstream: "http://127.0.0.1:8081/processNewOrder", host: "redacted.et.r.appspot.com"
DEFAULT 2024-04-07T07:58:09.973611Z [2024-04-07 07:58:09 +0000] [11] [ERROR] Worker (pid:19) was sent SIGKILL! Perhaps out of memory?
DEFAULT 2024-04-07T07:58:10.106688Z [2024-04-07 07:58:10 +0000] [33] [INFO] Booting worker with pid: 33
DEFAULT 2024-04-07T07:58:10.177760Z [protoPayload.method: POST] [protoPayload.status: 502] [protoPayload.responseSize: 272 B] [protoPayload.latency: 61.976 s] [protoPayload.userAgent: AppEngine-Google; (+http://code.google.com/appengine)] /processNewOrder
DEFAULT 2024-04-07T07:58:10.196059Z [2024-04-07 07:58:10 +0000] [33] [DEBUG] POST /processNewOrder
DEFAULT 2024-04-07T07:59:11.149239Z [2024-04-07 07:59:11 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:33)
DEFAULT 2024-04-07T07:59:12.153215Z 2024/04/07 07:59:12 [error] 18#18: *9 upstream prematurely closed connection while reading response header from upstream, client: 169.254.1.1, server: _, request: "POST /processNewOrder HTTP/1.1", upstream: "http://127.0.0.1:8081/processNewOrder", host: "redacted.et.r.appspot.com"
DEFAULT 2024-04-07T07:59:12.153281Z [2024-04-07 07:59:12 +0000] [11] [ERROR] Worker (pid:33) was sent SIGKILL! Perhaps out of memory?
DEFAULT 2024-04-07T07:59:12.307443Z [2024-04-07 07:59:12 +0000] [39] [INFO] Booting worker with pid: 39
DEFAULT 2024-04-07T08:00:45.632391Z [protoPayload.method: POST] [protoPayload.status: 502] [protoPayload.responseSize: 272 B] [protoPayload.latency: 61.725 s] [protoPayload.userAgent: AppEngine-Google; (+http://code.google.com/appengine)] /processNewOrder
...
Repeat of same, POST request recieved, worker boot, worker timeout then worker sent SIGKILL for the next 1 hour.
...
DEFAULT 2024-04-07T09:01:47.742589Z [protoPayload.method: POST] [protoPayload.status: 502] [protoPayload.responseSize: 272 B] [protoPayload.latency: 61.369 s] [protoPayload.userAgent: AppEngine-Google; (+http://code.google.com/appengine)] /processNewOrder
DEFAULT 2024-04-07T09:01:47.916781Z [2024-04-07 09:01:47 +0000] [387] [INFO] Booting worker with pid: 387
DEFAULT 2024-04-07T09:01:48.003333Z [2024-04-07 09:01:48 +0000] [387] [DEBUG] POST /processNewOrder
DEFAULT 2024-04-07T09:02:13.317927Z [protoPayload.method: POST] [protoPayload.status: 502] [protoPayload.responseSize: 272 B] [protoPayload.latency: 86.175 s] [protoPayload.userAgent: AppEngine-Google; (+http://code.google.com/appengine)] /processNewOrder
DEFAULT 2024-04-07T09:02:36.933886Z [2024-04-07 09:02:36 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:381)
DEFAULT 2024-04-07T09:02:37.938484Z [2024-04-07 09:02:37 +0000] [11] [ERROR] Worker (pid:381) was sent SIGKILL! Perhaps out of memory?
DEFAULT 2024-04-07T09:02:37.938619Z 2024/04/07 09:02:37 [error] 18#18: *140 upstream prematurely closed connection while reading response header from upstream, client: 169.254.1.1, server: _, request: "POST /processNewOrder HTTP/1.1", upstream: "http://127.0.0.1:8081/processNewOrder", host: "redacted.et.r.appspot.com"
DEFAULT 2024-04-07T09:02:38.097720Z [2024-04-07 09:02:38 +0000] [393] [INFO] Booting worker with pid: 393
DEFAULT 2024-04-07T09:02:38.142051Z [protoPayload.method: POST] [protoPayload.status: 502] [protoPayload.responseSize: 272 B] [protoPayload.latency: 61.351 s] [protoPayload.userAgent: AppEngine-Google; (+http://code.google.com/appengine)] /processNewOrder
DEFAULT 2024-04-07T09:02:38.189106Z [2024-04-07 09:02:38 +0000] [393] [DEBUG] POST /processNewOrder
DEFAULT 2024-04-07T09:02:38.196806Z [2024-04-07 09:02:38 +0000] [393] [DEBUG] POST /processNewOrder
DEFAULT 2024-04-07T09:02:48.105780Z [2024-04-07 09:02:48 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:387)
DEFAULT 2024-04-07T09:02:49.112205Z 2024/04/07 09:02:49 [error] 18#18: *142 upstream prematurely closed connection while reading response header from upstream, client: 169.254.1.1, server: _, request: "POST /processNewOrder HTTP/1.1", upstream: "http://127.0.0.1:8081/processNewOrder", host: "redacted.et.r.appspot.com"
DEFAULT 2024-04-07T09:02:49.112209Z [2024-04-07 09:02:49 +0000] [11] [ERROR] Worker (pid:387) was sent SIGKILL! Perhaps out of memory?
Finally it processes the POST request correctly with status 200.
DEFAULT 2024-04-07T09:02:49.114051Z [protoPayload.method: POST] [protoPayload.status: 200] [protoPayload.responseSize: 135 B] [protoPayload.latency: 81.691 s] [protoPayload.userAgent: AppEngine-Google; (+http://code.google.com/appengine)] /processNewOrder
DEFAULT 2024-04-07T09:02:49.367448Z [2024-04-07 09:02:49 +0000] [401] [INFO] Booting worker with pid: 401
DEFAULT 2024-04-07T09:02:49.464783Z [2024-04-07 09:02:49 +0000] [401] [DEBUG] POST /processNewOrder
我也检查了内存使用情况,但似乎没有问题。我使用的是 B2 实例类,它应该具有 768 MB RAM,并且在事件发生期间 RAM 使用量较低,大约约为 210 MB。
当我通过本地App Engine运行 Gunicorn 服务器时,也永远不会出现此问题。
我尝试将实例大小增加到 B4(1536 MB RAM),认为内存使用情况可能未正确报告,但出现了同样的问题。
我尝试将 Gunicorn 配置文件中的超时参数增加到 20 分钟,但仍然出现同样的问题,只是现在每个工作人员在超时之前需要 20 分钟。
我还尝试在我的 Gunicorn 配置文件中设置
preload_app=True
,因为我认为问题可能是由工作线程启动但应用程序代码尚未准备好引起的,但同样的问题仍然发生。
考虑到它发生的随机性,我无法可靠地重现这些问题,因此无法找到该问题的解决方案。一种可能性是 Google Cloud 端出现某种中断,但在我们去那里之前,我想看看是否有其他人以前遇到过这个问题,并且可以提供有关潜在解决方案的任何建议。
对于将来遇到此问题的任何其他人,我最终转而在 Google Cloud Run 上的 docker 容器中运行我的应用程序,并在我的 Gunicorn 配置文件中进行
preload_app=True
设置,并完全消除了该问题。
由于某种原因,任何其他配置都会导致超时问题。例如,在没有 Dockerfile 的 Cloud Run 上运行(意味着 google 将决定您的容器中需要什么),无论有或没有
preload_app=True
设置,都会导致问题。
我最好的猜测是,Google Cloud 在其用于容器化的设置中存在一些配置(也可能用于 App Engine),这些配置在某种程度上与我的代码不兼容。迁移到 Cloud Run 还有助于大幅降低我的成本,因此这也很有帮助。