I need guidance on passing command-line arguments to a SageMaker training job through the Boto3 API. Here is my Dockerfile:
FROM public.ecr.aws/ubuntu/ubuntu:22.04
LABEL version="2.0"
RUN apt-get -y update && apt-get install -y --no-install-recommends \
    wget \
    build-essential \
    python3-dev \
    python3-pip \
    python3-setuptools \
    nginx \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*
RUN python3.10 -m pip install pip --upgrade && pip install --upgrade cython
RUN ln -s /usr/bin/python3 /usr/bin/python
COPY requirements.txt .
RUN pip --no-cache-dir install -r requirements.txt
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/ml/code/:${PATH}"
ENV PYTHONPATH="/opt/ml/code/:${PYTHONPATH}"
COPY src/ /opt/ml/code/
WORKDIR /opt/ml/code/
ENTRYPOINT [ "python", "/opt/ml/code/entry_point.py" ]
The entry_point.py script is as follows:
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--mode", type=str, required=True)
# Named --region_id so that it matches args.region_id used below
parser.add_argument("--region_id", type=int)
args = parser.parse_args()
if args.mode == "inference":
    run_inference(args.region_id)
elif args.mode == "training":
    run_training(args.region_id)
else:
    raise ValueError(f"Unknown mode: {args.mode}")
The image has been published to AWS ECR. I now start the job with the following boto3 API call:
import boto3

session = boto3.Session(profile_name='algoprod')
client = session.client('sagemaker', region_name='us-east-1')
training_job_name = 'sagemaker-training-demo'
resp = client.create_training_job(
    TrainingJobName=training_job_name,
    RoleArn="xxxx",
    AlgorithmSpecification={
        'TrainingImage': "image:latest",
        'TrainingInputMode': "File",
        'ContainerArguments': [
            '--mode training',
            '--region_id 1',
        ],
    },
)
print(resp)
The boto3 call above successfully starts the SageMaker training job in AWS, but the job then fails with the following error message:
entry_point.py: error: the following arguments are required: --mode
The mode was passed through ContainerArguments, following the Boto3 documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_training_job.html
Please advise.
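For what it is worth, the same message can be reproduced locally with argparse alone, which may help narrow things down. The sketch below assumes that each element of ContainerArguments reaches the entrypoint as a single argv entry (an assumption about how the arguments are forwarded, not something confirmed here). It uses `--region_id` as the option name because that is what the job passes, and `build_parser` and `parses_ok` are illustrative helper names.

```python
import argparse


def build_parser():
    # Same two options that entry_point.py reads (assumed names)
    parser = argparse.ArgumentParser(prog="entry_point.py")
    parser.add_argument("--mode", type=str, required=True)
    parser.add_argument("--region_id", type=int)
    return parser


def parses_ok(argv):
    """Return True if argparse accepts argv, False if it exits with an error."""
    try:
        build_parser().parse_args(argv)
        return True
    except SystemExit:
        # argparse prints its error to stderr and exits on invalid input
        return False


# "--mode training" is one argv entry here, so it is not recognised as
# the --mode option followed by a value.
joined_ok = parses_ok(["--mode training", "--region_id 1"])
print(joined_ok)
```

If the assumption holds, argparse never sees a bare `--mode` token, which would be consistent with it reporting the option as missing.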
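Perhaps the solution is as simple as putting training in quotes, i.e. "training":
'ContainerArguments': ['--mode "training"',
                       '--region_id 1',]
The thinking is that 1 is understood as an integer, whereas the unquoted training is interpreted as a variable.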
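Another thing that may be worth trying is to give every token its own list element. argparse receives each value as a plain string and only converts it through `type=`, so quoting might not be the deciding factor. The sketch below is a local check only, assuming the list elements are handed to the entrypoint one argv entry each; `to_container_arguments` is a hypothetical helper, and the dictionary simply reuses the placeholder values from the question.

```python
import argparse
import shlex


def to_container_arguments(command_line):
    """Split a shell-style string into one list element per token."""
    return shlex.split(command_line)


# Quotes are consumed by shlex, so "training" becomes the bare token training
container_arguments = to_container_arguments('--mode "training" --region_id 1')

# Parser with the two options that entry_point.py reads (assumed names)
parser = argparse.ArgumentParser(prog="entry_point.py")
parser.add_argument("--mode", type=str, required=True)
parser.add_argument("--region_id", type=int)
args = parser.parse_args(container_arguments)
print(args.mode, args.region_id)

# The same list can then be placed in the request (placeholder values)
algorithm_specification = {
    "TrainingImage": "image:latest",
    "TrainingInputMode": "File",
    "ContainerArguments": container_arguments,
}
```

Writing the list by hand as `['--mode', 'training', '--region_id', '1']` gives the same result; `shlex.split` is only a convenience when the arguments start out as a single string.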