I have a custom Triton Docker container that uses the Python backend. The container runs perfectly when I run it locally.
Here is the container's Dockerfile (I have omitted the irrelevant parts):
ARG TRITON_RELEASE_VERSION=22.12
FROM nvcr.io/nvidia/tritonserver:${TRITON_RELEASE_VERSION}-pyt-python-py3
LABEL owner='toing'
LABEL maintainer='[email protected]'
LABEL com.amazonaws.sagemaker.capabilities.multi-models=true
LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true
ARG TRITON_RELEASE_VERSION
ENV DEBIAN_FRONTEND=noninteractive
ENV LANG=C.UTF-8
ENV LC_ALL=C.UTF-8
ENV GIT_TRITON_RELEASE_VERSION="r$TRITON_RELEASE_VERSION"
ENV TRITON_MODEL_DIRECTORY="/opt/ml/model"
SHELL ["/bin/bash", "-c"]
# nvidia updated their repository keys recently
RUN apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub
RUN apt-get update && \
apt-get install -y --no-install-recommends \
# generic requirements
gcc \
libgl1-mesa-glx
RUN pip install --upgrade pip && \
pip install --no-cache-dir setuptools \
scikit-build \
opencv-python-headless \
cryptography
# create the model directory
RUN mkdir -p $TRITON_MODEL_DIRECTORY
# for mmcv installation
ENV FORCE_CUDA="1"
# set TORCH_CUDA_ARCH_LIST
ENV TORCH_CUDA_ARCH_LIST="7.5"
RUN pip install --no-cache-dir what-i-need --index-url <redacted-index-url>
# install pytorch requirements from aws
RUN mkdir -p /app/snapshots && \
mkdir -p /keys
# Copy the requirements files
ADD requirements/build.txt /install/build.txt
# install specific packages
RUN pip install --no-cache-dir -r /install/build.txt
# number of workers per model
ENV SAGEMAKER_MODEL_SERVER_WORKERS=1
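# default inference port and safe port range (SageMaker supplies these at runtime when the accept-bind-to-port label is set)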
ENV SAGEMAKER_BIND_TO_PORT=8000
ENV SAGEMAKER_SAFE_PORT_RANGE=8000-8002
# HTTP Inference Service
EXPOSE 8000
# GRPC Inference Service
EXPOSE 8001
# Metrics Service
EXPOSE 8002
RUN echo -e "#!/bin/bash\n\
tritonserver --model-repository ${TRITON_MODEL_DIRECTORY}"\
>> /start.sh
RUN chmod +x /start.sh
# Set the working directory to /
WORKDIR /
ENTRYPOINT ["/start.sh"]
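
For reference, this is roughly how I exercise the container locally (the image tag my-triton-image:latest and the host model path are placeholders for my actual values):

docker run --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v /path/to/models:/opt/ml/model \
    my-triton-image:latest

# Triton's HTTP readiness endpoint answers on port 8000
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8000/v2/health/ready
# -> 200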
The problem is that when I launch it from a SageMaker MME (multi-model) endpoint, the Triton server starts up and runs, but SageMaker apparently cannot detect the running server, so the health check fails and endpoint creation fails.
Am I using the wrong ports, or what should I do to avoid this error?
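
For what it's worth, my understanding of the SageMaker container contract is that the platform health-checks the container with GET /ping, on port 8080 by default or on the port it passes in via SAGEMAKER_BIND_TO_PORT when the accept-bind-to-port label is set. So the check presumably amounts to something like:

# what I believe the SageMaker health check boils down to (port 8080 assumed)
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/ping

As far as I know, a stock tritonserver does not serve /ping on its HTTP port, which is why I suspect a port/endpoint mismatch.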
PS: I did notice that the base NGC container used in this Dockerfile has its entrypoint at
/opt/nvidia/nvidia_entrypoint.sh
but that script appears to be just a wrapper around the original entrypoint.
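
In case it matters, this is how I checked the base image's entrypoint (plain docker inspect, nothing special):

docker inspect --format '{{json .Config.Entrypoint}}' \
    nvcr.io/nvidia/tritonserver:22.12-pyt-python-py3
# -> ["/opt/nvidia/nvidia_entrypoint.sh"]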