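I am developing an API and deploying it on Google Cloud Run.
A pre-start Python script imports pandas and numpy. When I time the imports on Cloud Run, numpy takes about 2 seconds and pandas about 4 seconds, whereas on my local machine it takes less than 0.5 seconds.
I use python:3.8-alpine as the base image for my Docker container. (I have also tried some non-Alpine images...)
Here is the Dockerfile: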
FROM python:3.8-alpine
COPY requirements.txt ./
RUN apk add --no-cache --virtual build-deps g++ gcc gfortran make libffi-dev openssl-dev file build-base \
&& apk add --no-cache libstdc++ openblas-dev lapack-dev \
&& pip install --no-cache-dir uvicorn gunicorn fastapi \
&& CFLAGS="-g0 -Wl,--strip-all -I/usr/include:/usr/local/include -L/usr/lib:/usr/local/lib" \
pip install --no-cache-dir --compile --global-option=build_ext --global-option="-j 16" -r requirements.txt \
&& rm -r /root/.cache \
&& find /usr/local/lib/python3.*/ -name 'tests' -exec rm -r '{}' + \
&& find /usr/local/lib/python3.*/site-packages/ \( -type d -a -name test -o -name tests \) -o \( -type f -a -name '*.pyc' -o -name '*.pyo' \) -exec rm -r '{}' + \
&& find /usr/local/lib/python3.*/site-packages/ -name '*.so' -print -exec /bin/sh -c 'file "{}" | grep -q "not stripped" && strip -s "{}"' \; \
&& find /usr/lib/ -name '*.so' -print -exec /bin/sh -c 'file "{}" | grep -q "not stripped" && strip -s "{}"' \; \
&& find /usr/local/lib/ -name '*.so' -print -exec /bin/sh -c 'file "{}" | grep -q "not stripped" && strip -s "{}"' \; \
&& rm -rf /usr/local/lib/python*/ensurepip \
&& rm -rf /usr/local/lib/python*/idlelib \
&& rm -rf /usr/local/lib/python*/distutils/command \
&& rm -rf /usr/local/lib/python*/lib2to3 \
&& rm -rf /usr/local/lib/python*/__pycache__/* \
&& rm -r /requirements.txt /databases.zip \
&& rm -rf /tmp/* \
&& rm -rf /var/cache/apk/* \
&& apk del build-deps g++ gcc make libffi-dev openssl-dev file build-base
CMD ["python","script.py"]
requirements.txt:
numpy==1.2.0
pandas==1.2.1
And the Python file that gets executed, script.py:
import time
ts = time.time()
import pandas
te = time.time()
print(te-ts)
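Is such slow importing expected? Or is there perhaps some Python import trick?
I have been searching Stack Overflow and GitHub issues but found nothing similar to this "problem"/"behavior".
Thanks in advance.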
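This is a known issue in the Python ecosystem.
All modules are imported at runtime, and some modules are 300-500 MB in size.
There are plenty of complaints about slow import times. The best thread is this one: "improving speed of Python module import".
On Cloud Run I tried several approaches, but none of them reduced the import time significantly.
If you want to use these libraries in a serverless environment, or in any other ecosystem with cold starts,
be aware that, because of imports alone, a cold start can take on the order of 10 seconds.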
importing pandas took 1.42 seconds
importing numpy took 1.90 seconds
importing torch took 2.84 seconds
importing torchvision took 0.78 seconds
importing IPython took 1.22 seconds
importing sklearn took 1.51 seconds
importing dask took 0.74 seconds
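Timings like the ones above can be collected with a small loop. The following is a minimal sketch, not the code used for the numbers above: the helper name `time_imports` is hypothetical, and it assumes each module is imported for the first time in a fresh interpreter, because a module that is already in `sys.modules` returns almost instantly.

```python
import importlib
import time


def time_imports(module_names):
    """Import each module by name and return {name: seconds taken}."""
    timings = {}
    for name in module_names:
        start = time.perf_counter()
        try:
            importlib.import_module(name)
        except ImportError:
            # Module is not installed in this environment: skip it.
            continue
        timings[name] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    # Placeholder names; swap in the heavy modules you care about,
    # e.g. "pandas" or "numpy".
    for name, seconds in time_imports(["json", "decimal", "sqlite3"]).items():
        print(f"importing {name} took {seconds:.2f} seconds")
```

For a per-module breakdown without any code changes, CPython 3.7+ also accepts `python -X importtime script.py`.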
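Attempt 1: raising the CPU allocation to the maximum possible did not solve it.
Attempt 2: rewriting the import did not speed things up: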
pd = imp.load_module("pandas",None,"/usr/local/lib/python3.10/site-packages/pandas",('','',5))
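This way the interpreter skips the "find" phase, but the total time stays the same, so there is no speedup.
Attempt 3: installing the requirements with compilation enabled brought no benefit: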
RUN python -m pip install --no-cache-dir --compile -r requirements-prod.txt
RUN python -m compileall .
I even inspected the container: __pycache__ had been built for all modules and for the application code as well, but the cold start time did not improve.
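A note on Attempt 2: the `imp` module is deprecated and was removed in Python 3.12, so on newer interpreters the same "skip the find phase" experiment has to go through `importlib`. A minimal sketch, assuming a hypothetical helper `load_package_from_dir` and a package directory that contains an `__init__.py`:

```python
import importlib.util
import os
import sys


def load_package_from_dir(name, package_dir):
    """Load the package at package_dir without searching sys.path."""
    init_file = os.path.join(package_dir, "__init__.py")
    spec = importlib.util.spec_from_file_location(
        name, init_file, submodule_search_locations=[str(package_dir)]
    )
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot build an import spec for {name!r}")
    module = importlib.util.module_from_spec(spec)
    # Register before executing so imports inside the package can resolve it.
    sys.modules[name] = module
    spec.loader.exec_module(module)
    return module
```

This only bypasses the path search. The module body and its compiled extensions still have to be executed and loaded, which is consistent with Attempt 2 showing no speedup.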
Summary: the lazy import proposal is a good read on this topic.
import importlib
import time
import builtins
original_import = builtins.__import__
def timed_import(name, *args, **kwargs):
    start_time = time.time()
    result = original_import(name, *args, **kwargs)
    duration = time.time() - start_time
    if duration > 0.01:  # Only log imports that take more than 10 ms
        print(f"Importing {name} took {duration:.4f} seconds")
    return result
# Override the built-in import function with our timed import function
builtins.__import__ = timed_import
Then, somewhere later in the code, put
builtins.__import__ = original_import #end of timed import
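The install and restore steps can also be bundled so that the original import function is put back even if an import raises. This is a sketch of one way to do that, not part of the original snippet: the context manager name `timed_imports` and its `threshold` and `report` parameters are assumptions made for illustration.

```python
import builtins
import time
from contextlib import contextmanager


@contextmanager
def timed_imports(threshold=0.01, report=print):
    """Temporarily log every import slower than `threshold` seconds."""
    original_import = builtins.__import__

    def timed_import(name, *args, **kwargs):
        start = time.perf_counter()
        try:
            return original_import(name, *args, **kwargs)
        finally:
            duration = time.perf_counter() - start
            if duration > threshold:
                report(f"Importing {name} took {duration:.4f} seconds")

    builtins.__import__ = timed_import
    try:
        yield
    finally:
        # Always restore the built-in import, even after an exception.
        builtins.__import__ = original_import
```

Usage would look like `with timed_imports(): import pandas`. Nested imports are reported too, so a parent package's time includes the time of the submodules it pulls in.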
In my case the vast majority of imports take only a few microseconds, but some stand out.
Importing pandas took 1.2163 seconds
Importing pandas.core.api took 0.8030 seconds
Importing flask took 0.4840 seconds
Importing pandas._libs took 0.3502 seconds
Importing pandas._libs.interval took 0.3396 seconds
Importing google.cloud took 0.3394 seconds
Importing serving took 0.2843 seconds
Importing google.cloud.datastore took 0.2836 seconds
Importing google.cloud.datastore.batch took 0.2836 seconds
Importing pandas._libs.hashtable took 0.2679 seconds
Importing pandas.core.groupby took 0.2362 seconds
Importing pandas.compat took 0.2269 seconds
Importing pandas._libs.missing took 0.2164 seconds
Importing pyarrow took 0.2067 seconds
Importing pandas.compat.pyarrow took 0.2067 seconds
Importing pandas._libs.tslibs.nattype took 0.2058 seconds
Importing numpy took 0.1947 seconds
Importing pandas.core.arrays took 0.1756 seconds
Importing pandas._libs.tslibs.conversion took 0.1647 seconds
Importing pandas.core.frame took 0.1645 seconds
Importing numpy._core._multiarray_umath took 0.1541 seconds
Importing numpy.__config__ took 0.1541 seconds
Importing numpy._core._multiarray_umath took 0.1237 seconds
Importing pandas._libs.tslibs.offsets took 0.1235 seconds
Importing pandas.core.arrays.arrow took 0.1232 seconds
Importing _ssl took 0.1288 seconds
Importing ssl took 0.1288 seconds
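Many of these are needed only by specific parts of the user-facing product, so they are good candidates for loading in a background thread or lazily.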
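One way to try lazy loading is the standard library's `importlib.util.LazyLoader`, which defers executing a module until one of its attributes is first accessed. The sketch below follows the recipe in the `importlib` documentation; the helper name `lazy_import` is hypothetical, and whether a large package such as pandas tolerates being loaded this way is an assumption that needs testing in your own application.

```python
import importlib.util
import sys


def lazy_import(name):
    """Return a module whose code runs only on first attribute access."""
    if name in sys.modules:
        # Already imported (eagerly or lazily): reuse it.
        return sys.modules[name]
    spec = importlib.util.find_spec(name)
    if spec is None or spec.loader is None:
        raise ImportError(f"No module named {name!r}")
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    # With LazyLoader this only sets up the deferred load; the real
    # module body executes on the first attribute access.
    loader.exec_module(module)
    return module
```

With `pd = lazy_import("pandas")` at module level, the import cost moves from startup to the first request that touches `pd`. A simpler alternative with the same effect is to place `import pandas` inside the function that needs it. Either way the cost is moved, not removed.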