How can I automatically detect the PySpark home path from a PyInstaller executable?

Question

In my local development environment I can run the PySpark application without configuring anything. On the server, however, we deploy an executable built with PyInstaller. PyInstaller does not include the PySpark library in the executable's _internal folder, so I have to set the path manually.

Here is a snippet of my PyInstaller spec file for manage.py:

# -*- mode: python ; coding: utf-8 -*-

# Analysis for manage.py
a_manage = Analysis(
    ['manage.py'],
    pathex=['/app/app_name/app_name-backend-dev'],
    # I tried adding .venv/lib/python3.11/site-packages to the pathex, but it didn't work
    binaries=[
        ('/usr/lib/x86_64-linux-gnu/libpython3.11.so.1.0', './_internal/libpython3.11.so.1.0')
    ],
    datas=[],
    hiddenimports=[
        # I tried adding pyspark imports, but it didn't work
        'pyspark', 'pyspark.sql', 'pyspark.sql.session', 'pyspark.sql.functions', 'pyspark.sql.types', 'pyspark.sql.column',
        'app_name2.apps', 'Crypto.Cipher', 'Crypto.Util.Padding', 'snakecase', 'cryptography.fernet',
        'cryptography.hazmat.primitives', 'cryptography.hazmat.primitives.kdf.pbkdf2', 'apscheduler.triggers.cron',
        'apscheduler.schedulers.background', 'apscheduler.events', 'oauth2_provider.contrib.rest_framework',
        'app_name.apps', 'app_name.role_permissions', 'django_filters.rest_framework', 'app_name.urls',
        'app_name.others.constants', 'app_name.models', 'app_name', 'sslserver'
    ],
    hookspath=[],
    hooksconfig={},
    runtime_hooks=[],
    excludes=[],
    noarchive=False,
)

pyz_manage = PYZ(a_manage.pure)

exe_manage = EXE(
    pyz_manage,
    a_manage.scripts,
    [],
    exclude_binaries=True,
    name='manage',
    debug=False,
    bootloader_ignore_signals=False,
    strip=False,
    upx=True,
    console=True,
    disable_windowed_traceback=False,
    argv_emulation=False,
    target_arch=None,
    codesign_identity=None,
    entitlements_file=None,
)

coll_manage = COLLECT(
    exe_manage,
    a_manage.binaries,
    a_manage.datas,
    strip=False,
    upx=True,
    upx_exclude=[],
    name='manage',
)
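
For context, here is what bundling PySpark's runtime files into the build could look like. This is a minimal sketch, assuming a standard pip-installed pyspark layout and PyInstaller's collect_data_files helper; it is not verified against this project:

import os
import pyspark
from PyInstaller.utils.hooks import collect_data_files

# Location of the pip-installed pyspark package in the build environment
pyspark_dir = os.path.dirname(pyspark.__file__)

# Non-Python files shipped inside the pyspark package
pyspark_datas = collect_data_files('pyspark')

# Directory entries are copied recursively by PyInstaller; these provide the
# launcher scripts and the JVM jars that spark-submit needs at run time.
pyspark_datas += [
    (os.path.join(pyspark_dir, 'bin'), 'pyspark/bin'),
    (os.path.join(pyspark_dir, 'jars'), 'pyspark/jars'),
]

These entries would then be merged into the datas=[] argument of the Analysis call above, alongside the existing hiddenimports.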

When I try to run the executable, I get the following error:

Traceback (most recent call last):
  File "portal/operations/load_data/load_data.py", line 57, in start
  File "portal/pyspark/operations.py", line 498, in get_session
  File "pyspark/sql/session.py", line 497, in getOrCreate
  File "pyspark/context.py", line 515, in getOrCreate
  File "pyspark/context.py", line 201, in __init__
  File "pyspark/context.py", line 436, in _ensure_initialized
  File "pyspark/java_gateway.py", line 97, in launch_gateway
  File "subprocess.py", line 1026, in __init__
  File "subprocess.py", line 1955, in _execute_child
FileNotFoundError: [Errno 2] No such file or directory: '/home/rhythmflow/Desktop/Reconciliation/reconciliation-backend-v3/dist/manage/_internal/./bin/spark-submit'

The executable fails because PySpark resolves bin/spark-submit relative to SPARK_HOME (falling back to the pyspark package directory), and neither is present inside the bundle. To work around this, I created a global .venv in my Linux home directory and installed PySpark into it with pip install pyspark.

Then I set the SPARK_HOME environment variable manually:

SPARK_HOME = /home/user_name/.venv/lib/python3.11/site-packages/pyspark

and used it in my code like this:

SPARK_HOME = env_var("SPARK_HOME")
SparkSession.builder.appName(app_name).config("spark.home", SPARK_HOME).getOrCreate()
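
(env_var here is presumably a project helper; a plain standard-library equivalent, with an assumed placeholder app name, would be:)

import os
from pyspark.sql import SparkSession

# Assumes SPARK_HOME was exported before the executable starts, e.g.
#   export SPARK_HOME=/home/user_name/.venv/lib/python3.11/site-packages/pyspark
spark_home = os.environ["SPARK_HOME"]

spark = (
    SparkSession.builder
    .appName("reconciliation")            # placeholder app name
    .config("spark.home", spark_home)
    .getOrCreate()
)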

This approach works in the development environment, but I would like to simplify the process and avoid specifying the Spark home path manually.

Question:

Is there a way to automatically detect the PySpark home path inside the PyInstaller executable, so that I do not have to set the SPARK_HOME environment variable manually?

python django ubuntu pyspark pyinstaller
1 Answer

If you need the directory where pyspark is installed, you should be able to do something like this:

import os

import pyspark


SPARK_HOME = os.path.dirname(pyspark.__file__)
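
Building on that, one way to wire it up (a sketch, not part of the answer itself) is to export SPARK_HOME from the package location before the first SparkSession is created, since PySpark resolves bin/spark-submit relative to SPARK_HOME:

import os
import pyspark
from pyspark.sql import SparkSession

# Point SPARK_HOME at the installed (or bundled) pyspark package so that
# launch_gateway can find bin/spark-submit under it.
os.environ.setdefault("SPARK_HOME", os.path.dirname(pyspark.__file__))

spark = SparkSession.builder.appName("my_app").getOrCreate()  # "my_app" is a placeholder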