Slow numpy and pandas imports on Google Cloud Run

Question (0 votes, 2 answers)

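I am developing an API and deploying it on Google Cloud Run.

There is a pre-start Python script that imports pandas and numpy. When I time the imports on Cloud Run, numpy takes about 2 seconds and pandas about 4 seconds, whereas on my local machine each takes less than 0.5 seconds.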

I use python:3.8-alpine as the base image for my Docker container (although I have also tried some non-Alpine images).

Here is the Dockerfile:

FROM python:3.8-alpine

COPY requirements.txt ./

RUN apk add --no-cache --virtual build-deps g++ gcc gfortran make libffi-dev openssl-dev file build-base \
    && apk add --no-cache libstdc++ openblas-dev lapack-dev \
    && pip install --no-cache-dir uvicorn gunicorn fastapi \
    && export CFLAGS="-g0 -Wl,--strip-all -I/usr/include:/usr/local/include -L/usr/lib:/usr/local/lib" \
    && pip install --no-cache-dir --compile --global-option=build_ext --global-option="-j 16" -r requirements.txt \
    && rm -r /root/.cache \
    && find /usr/local/lib/python3.*/ -name 'tests' -exec rm -r '{}' + \
    && find /usr/local/lib/python3.*/site-packages/ \( -type d -a -name test -o -name tests \) -o \( -type f -a -name '*.pyc' -o -name '*.pyo' \) -exec rm -r '{}' + \
    && find /usr/local/lib/python3.*/site-packages/ -name '*.so' -print -exec /bin/sh -c 'file "{}" | grep -q "not stripped" && strip -s "{}"' \; \
    && find /usr/lib/ -name '*.so' -print -exec /bin/sh -c 'file "{}" | grep -q "not stripped" && strip -s "{}"' \; \
    && find /usr/local/lib/ -name '*.so' -print -exec /bin/sh -c 'file "{}" | grep -q "not stripped" && strip -s "{}"' \; \
    && rm -rf /usr/local/lib/python*/ensurepip \
    && rm -rf /usr/local/lib/python*/idlelib \
    && rm -rf /usr/local/lib/python*/distutils/command \
    && rm -rf /usr/local/lib/python*/lib2to3 \
    && rm -rf /usr/local/lib/python*/__pycache__/* \
    && rm -rf /requirements.txt /databases.zip \
    && rm -rf /tmp/* \
    && rm -rf /var/cache/apk/* \
    && apk del build-deps g++ gcc make libffi-dev openssl-dev file build-base 

CMD ["python","script.py"]

requirements.txt:

numpy==1.2.0
pandas==1.2.1

And the Python file that gets executed, script.py:

import time

ts = time.time()
import pandas
te = time.time()
print(te-ts)

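Is this slow import expected? Or are there some Python import tricks I am missing?

I have searched Stack Overflow and GitHub issues but found nothing similar to this "problem"/"behavior".

Thanks in advance.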

python-3.x pandas docker google-cloud-platform alpine-linux
2 Answers

4 votes

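This is a known issue in the Python ecosystem.

All modules are imported at runtime, and some modules are 300-500 MB in size.

There are plenty of complaints about slow import times. The best thread is this one: improving speed of Python module import

For Cloud Run, I tried several approaches, but none of them reduced the import time significantly.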

If you want to use these libraries in a serverless environment, or in any other ecosystem with cold starts, be aware that a cold start can take on the order of 10 seconds because of imports alone:

importing pandas took 1.42 seconds
importing numpy took 1.90 seconds
importing torch took 2.84 seconds
importing torchvision took 0.78 seconds
importing IPython took 1.22 seconds
importing sklearn took 1.51 seconds
importing dask took 0.74 seconds
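To get numbers like these for your own container, one option is CPython's built-in `-X importtime` flag (Python 3.7+), run in a fresh interpreter so that nothing is already cached in `sys.modules`. The helper below is only a sketch: `import_breakdown` is a name made up for this example, and it assumes the stderr format `import time: self [us] | cumulative | imported package` that CPython prints.

```python
import subprocess
import sys


def import_breakdown(module_name):
    """Return (module, cumulative_seconds) pairs for a cold import, slowest first."""
    proc = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", f"import {module_name}"],
        capture_output=True,
        text=True,
        check=True,
    )
    rows = []
    for line in proc.stderr.splitlines():
        if not line.startswith("import time:"):
            continue
        parts = line[len("import time:"):].split("|")
        if len(parts) != 3:
            continue
        self_us, cumulative_us, name = (part.strip() for part in parts)
        if not self_us.isdigit():
            continue  # header row: "self [us] | cumulative | imported package"
        rows.append((name, int(cumulative_us) / 1_000_000))
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # "json" is just a stand-in; substitute "pandas" inside the container.
    for name, seconds in import_breakdown("json")[:10]:
        print(f"{seconds:8.4f}s  {name}")
```

Running this both locally and inside the Cloud Run container should show whether one submodule dominates or whether everything is uniformly slower.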

Attempt 1:

Raising the CPU allocation to the maximum possible did not solve it.

Attempt 2:

Rewriting the import gave no speedup:

import imp  # deprecated module, removed in Python 3.12

pd = imp.load_module("pandas", None, "/usr/local/lib/python3.10/site-packages/pandas", ('', '', 5))  # 5 == imp.PKG_DIRECTORY

This way the interpreter skips the "find" phase, but the load itself takes just as long, so there is no speedup.
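Since `imp` is gone in Python 3.12, the same experiment on a newer interpreter would go through `importlib`. The following is a rough equivalent, not the code I measured: `load_from_path` is a hypothetical helper name, and for a package such as pandas you would point it at the package's `__init__.py`.

```python
import importlib.util
import sys


def load_from_path(name, file_path):
    """Load a module from a known file, skipping the sys.path search."""
    spec = importlib.util.spec_from_file_location(name, file_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot build an import spec for {file_path!r}")
    module = importlib.util.module_from_spec(spec)
    # Register first so that imports inside the module can find it.
    sys.modules[name] = module
    try:
        spec.loader.exec_module(module)  # this is where the time is spent
    except BaseException:
        sys.modules.pop(name, None)
        raise
    return module
```

As with the `imp` version, only the lookup is skipped. Executing the module body (and everything it imports in turn) still happens, so I would not expect this to change the timings much.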

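Attempt 3:

Installing the requirements with bytecode compilation brought no benefit:

RUN python -m pip install --no-cache-dir --compile -r requirements-prod.txt
RUN python -m compileall .

I even inspected the container: __pycache__ had been built for all modules and for the application code, but cold start time did not improve.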
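If you want to check the bytecode cache from inside the container rather than by eye, something like the following works. It is a sketch; `missing_bytecode` is a name invented for this example. It lists every source file under a directory that has no cached `.pyc` for the running interpreter.

```python
import importlib.util
from pathlib import Path


def missing_bytecode(root):
    """Return .py files under root that have no cached .pyc for this interpreter."""
    missing = []
    for source in sorted(Path(root).rglob("*.py")):
        # cache_from_source gives the __pycache__/<name>.<tag>.pyc path Python would use.
        cached = Path(importlib.util.cache_from_source(str(source)))
        if not cached.exists():
            missing.append(source)
    return missing
```

An empty result matches what I saw: the cache was complete. A `.pyc` only saves the parse-and-compile step, and the module's top-level code still runs on every cold start, which is consistent with compilation making no measurable difference here.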

Summary:

There is a good read here about the lazy loading proposal.
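Independent of that proposal, the standard library already ships `importlib.util.LazyLoader`, which defers executing a module until its first attribute access. The sketch below follows the recipe from the `importlib` documentation; `lazy_import` is just a helper name, and I have not measured it on Cloud Run.

```python
import importlib.util
import sys


def lazy_import(name):
    """Return a module object whose real import runs on first attribute access."""
    if name in sys.modules:
        return sys.modules[name]  # already imported, nothing to defer
    spec = importlib.util.find_spec(name)
    if spec is None or spec.loader is None:
        raise ModuleNotFoundError(f"No module named {name!r}")
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # only arms the lazy hook; nothing heavy runs yet
    return module
```

Usage would be `pd = lazy_import("pandas")` at the top of the file. This does not make the import cheaper; it moves the cost from container start to the first request that touches `pd`.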


0 votes

import importlib
import time
import builtins

original_import = builtins.__import__

def timed_import(name, *args, **kwargs):
    start_time = time.time()
    result = original_import(name, *args, **kwargs)
    duration = time.time() - start_time
    if duration > 0.01:  # Only log imports that take more than 10 ms
        print(f"Importing {name} took {duration:.4f} seconds")
    return result

# Override the built-in import function with our timed import function
builtins.__import__ = timed_import

Then, somewhere later in the code, put:

builtins.__import__ = original_import #end of timed import

In my case, the vast majority of imports took only a few microseconds, but a few stood out.

Importing pandas took 1.2163 seconds
Importing pandas.core.api took 0.8030 seconds
Importing flask took 0.4840 seconds
Importing pandas._libs took 0.3502 seconds
Importing pandas._libs.interval took 0.3396 seconds
Importing google.cloud took 0.3394 seconds
Importing serving took 0.2843 seconds
Importing google.cloud.datastore took 0.2836 seconds
Importing google.cloud.datastore.batch took 0.2836 seconds
Importing pandas._libs.hashtable took 0.2679 seconds
Importing pandas.core.groupby took 0.2362 seconds
Importing pandas.compat took 0.2269 seconds
Importing pandas._libs.missing took 0.2164 seconds
Importing pyarrow took 0.2067 seconds
Importing pandas.compat.pyarrow took 0.2067 seconds
Importing pandas._libs.tslibs.nattype took 0.2058 seconds
Importing numpy took 0.1947 seconds
Importing pandas.core.arrays took 0.1756 seconds
Importing pandas._libs.tslibs.conversion took 0.1647 seconds
Importing pandas.core.frame took 0.1645 seconds
Importing numpy._core._multiarray_umath took 0.1541 seconds
Importing numpy.__config__ took 0.1541 seconds
Importing numpy._core._multiarray_umath took 0.1237 seconds
Importing pandas._libs.tslibs.offsets took 0.1235 seconds
Importing pandas.core.arrays.arrow took 0.1232 seconds
Importing _ssl took 0.1288 seconds
Importing ssl took 0.1288 seconds

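Many of these are needed only by specific parts of the user-facing product, so they are good candidates for loading in a background thread or for lazy loading.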
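For the threaded variant, a minimal sketch could look like this. `BackgroundImport` is a hypothetical helper, not something from a library: it starts the import in a daemon thread at startup and blocks only when the module is first needed.

```python
import importlib
import threading


class BackgroundImport:
    """Import a module in a background thread; get() waits for it to finish."""

    def __init__(self, name):
        self._name = name
        self._module = None
        self._error = None
        self._thread = threading.Thread(target=self._load, daemon=True)
        self._thread.start()

    def _load(self):
        try:
            self._module = importlib.import_module(self._name)
        except Exception as exc:  # keep the error and re-raise it in get()
            self._error = exc

    def get(self):
        self._thread.join()  # returns immediately once the import is done
        if self._error is not None:
            raise self._error
        return self._module
```

At startup you would write `pandas_loader = BackgroundImport("pandas")` and, inside the handler that needs it, `pd = pandas_loader.get()`. How much this helps depends on whether the import time is dominated by disk I/O or by CPU work, since CPU-bound import code still competes for the GIL with the rest of the startup.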
