Why won't my TorchServe Docker image run on Google Cloud Run?


I have this Dockerfile:

# syntax = docker/dockerfile:1.2

FROM continuumio/miniconda3

# install os dependencies
RUN mkdir -p /usr/share/man/man1
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y \
    ca-certificates \
    curl \
    python3-pip \
    vim \
    sudo \
    default-jre \
    git \
    gcc \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# install python dependencies
RUN pip install openmim
RUN pip install torch
RUN mim install mmcv-full==1.7.0
RUN pip install mmpose==0.29.0
RUN pip install mmdet==2.27.0
RUN pip install torchserve

# prep torchserve
RUN mkdir -p /home/torchserve/model-store
RUN wget https://github.com/facebookresearch/AnimatedDrawings/releases/download/v0.0.1/drawn_humanoid_detector.mar -P /home/torchserve/model-store/
RUN wget https://github.com/facebookresearch/AnimatedDrawings/releases/download/v0.0.1/drawn_humanoid_pose_estimator.mar -P /home/torchserve/model-store/
COPY config.properties /home/torchserve/config.properties

# print the contents of /model-store
RUN ls /home/torchserve/model-store

# starting command
CMD /opt/conda/bin/torchserve --start --ts-config /home/torchserve/config.properties && sleep infinity
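Note: `torchserve --start` detaches into the background, which is why the `sleep infinity` is needed to keep the container's main process alive. An alternative, assuming a torchserve version recent enough to support the `--foreground` flag, would be:

```dockerfile
# alternative CMD: run TorchServe in the foreground so the server itself is
# the container's main process (requires a torchserve version with --foreground)
CMD ["/opt/conda/bin/torchserve", "--start", "--foreground", "--ts-config", "/home/torchserve/config.properties"]
```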

In the same folder, I have the following config.properties:

# Copyright (c) Meta Platforms, Inc. and affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
model_store=/home/torchserve/model-store
load_models=all
default_response_timeout=5000

It runs fine locally, but when I push it to Cloud Run the error below occurs and the models don't respond, even though /ping still returns healthy. Here is the error:

org.pytorch.serve.wlm.WorkerInitializationException: Backend worker startup time out.

    at org.pytorch.serve.wlm.WorkerLifeCycle.startWorker(WorkerLifeCycle.java:177)
    at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:339)
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:183)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.lang.Thread.run(Thread.java:829)

What is going wrong?

Below are the startup logs. Not sure what to look for; I hope they're not too sparse to be useful.

docker pytorch artificial-intelligence google-cloud-run torchserve
1 Answer

I ran into the same problem on GCP Cloud Run: the container worked locally, but on Cloud Run I hit the same backend worker startup timeout error.

I increased the Cloud Run instance's memory, reduced the number of workers per model, and raised the startup timeout:

default_startup_timeout=600
default_workers_per_model=2

These parameters can be set in config.properties: https://pytorch.org/serve/configuration.html#other-properties
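The memory increase itself happens on the Cloud Run side rather than in config.properties. A deploy command along these lines should work (the service name, image path, and resource values here are placeholders, not the ones from my setup):

```shell
# redeploy the service with more memory and CPU, and a longer request timeout;
# Cloud Run routes traffic to the port given by --port (TorchServe's inference port)
gcloud run deploy torchserve-service \
  --image gcr.io/my-project/torchserve-image \
  --memory 4Gi \
  --cpu 2 \
  --port 8080 \
  --timeout 600
```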

These changes resolved the problem, so I believe it was a memory-related issue: with constrained memory, the backend workers took longer than the default timeout to load the models.
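Putting it together, the question's config.properties with these two settings merged in would look like the following (the timeout and worker values are the ones that worked for me, not universal defaults):

```properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
model_store=/home/torchserve/model-store
load_models=all
default_response_timeout=5000
# give slow-loading models more time to start on memory-constrained instances
default_startup_timeout=600
# fewer workers per model lowers peak memory at startup
default_workers_per_model=2
```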
