Why won't my TorchServe Docker image run on Google Cloud Run?


I have this Dockerfile:

# syntax = docker/dockerfile:1.2

FROM continuumio/miniconda3

# install os dependencies
RUN mkdir -p /usr/share/man/man1
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y \
    ca-certificates \
    curl \
    python3-pip \
    vim \
    sudo \
    default-jre \
    git \
    gcc \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# install python dependencies
RUN pip install openmim
RUN pip install torch
RUN mim install mmcv-full==1.7.0
RUN pip install mmpose==0.29.0
RUN pip install mmdet==2.27.0
RUN pip install torchserve

# prep torchserve
RUN mkdir -p /home/torchserve/model-store
RUN wget https://github.com/facebookresearch/AnimatedDrawings/releases/download/v0.0.1/drawn_humanoid_detector.mar -P /home/torchserve/model-store/
RUN wget https://github.com/facebookresearch/AnimatedDrawings/releases/download/v0.0.1/drawn_humanoid_pose_estimator.mar -P /home/torchserve/model-store/
COPY config.properties /home/torchserve/config.properties

# print the contents of /model-store
RUN ls /home/torchserve/model-store

# starting command
CMD /opt/conda/bin/torchserve --start --ts-config /home/torchserve/config.properties && sleep infinity
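Note: `torchserve --start` detaches into the background, which is why the `sleep infinity` is needed to keep the container's main process alive. An alternative, assuming a torchserve version recent enough to support the `--foreground` flag, would be:

```dockerfile
# alternative CMD: run TorchServe in the foreground so the server itself is
# the container's main process (requires a torchserve version with --foreground)
CMD ["/opt/conda/bin/torchserve", "--start", "--foreground", "--ts-config", "/home/torchserve/config.properties"]
```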

In the same folder, I have the following config.properties:

# Copyright (c) Meta Platforms, Inc. and affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
model_store=/home/torchserve/model-store
load_models=all
default_response_timeout=5000

It runs fine locally, but when I push it to Cloud Run the error below occurs and the models don't respond, even though /ping still returns healthy. Here is the error:

org.pytorch.serve.wlm.WorkerInitializationException: Backend worker startup time out.

    at org.pytorch.serve.wlm.WorkerLifeCycle.startWorker(WorkerLifeCycle.java:177)
    at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:339)
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:183)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.lang.Thread.run(Thread.java:829)

What is going wrong?

Below are the startup logs. Not sure what to look for; I hope they're not too sparse to be useful.

docker pytorch artificial-intelligence google-cloud-run torchserve
1 Answer

I ran into the same problem on GCP Cloud Run: the container worked locally, but on Cloud Run I hit the same backend worker startup timeout error.

I increased the Cloud Run instance's memory, reduced the number of workers per model, and raised the startup timeout:

default_startup_timeout=600
default_workers_per_model=2

These parameters can be set in config.properties: https://pytorch.org/serve/configuration.html#other-properties
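The memory increase itself happens on the Cloud Run side rather than in config.properties. A deploy command along these lines should work (the service name, image path, and resource values here are placeholders, not the ones from my setup):

```shell
# redeploy the service with more memory and CPU, and a longer request timeout;
# Cloud Run routes traffic to the port given by --port (TorchServe's inference port)
gcloud run deploy torchserve-service \
  --image gcr.io/my-project/torchserve-image \
  --memory 4Gi \
  --cpu 2 \
  --port 8080 \
  --timeout 600
```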

These changes resolved the problem, so I believe it was a memory-related issue: with constrained memory, the backend workers took longer than the default timeout to load the models.
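Putting it together, the question's config.properties with these two settings merged in would look like the following (the timeout and worker values are the ones that worked for me, not universal defaults):

```properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
model_store=/home/torchserve/model-store
load_models=all
default_response_timeout=5000
# give slow-loading models more time to start on memory-constrained instances
default_startup_timeout=600
# fewer workers per model lowers peak memory at startup
default_workers_per_model=2
```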
