PySpark virtual environment issue with spark-submit in local mode

Problem description

I am trying to run a Python program with spark-submit in local mode using a Python virtual environment. Even though pyspark is not installed in the virtual environment, the program still runs without failing.

Here are the details of the tests I tried:

  1. I checked whether spark-submit actually runs with the Python executable passed via the conf by uninstalling the pytest package; it appears to work as expected:
    PS C:\Users\demouser\Desktop\pytest_demo> spark-submit --master local[*] --conf spark.pyspark.python="C:\Users\demouser\Desktop\pytest_demo\pyspark-env\Scripts\python.exe" .\src\main.py 2>error.txt
    Traceback (most recent call last):
      File "C:/Users/demouser/Desktop/pytest_demo/./src/main.py", line 1, in <module>
        import pytest
    ModuleNotFoundError: No module named 'pytest'
    PS C:\Users\demouser\Desktop\pytest_demo> .\pyspark-env\Scripts\Activate.ps1
    
    (pyspark-env) PS C:\Users\demouser\Desktop\pytest_demo> pip install pytest
    Collecting pytest
      Using cached pytest-7.1.2-py3-none-any.whl (297 kB)
    Requirement already satisfied: tomli>=1.0.0 in c:\users\demouser\desktop\pytest_demo\pyspark-env\lib\site-packages (from pytest) (2.0.1)
    Requirement already satisfied: pluggy<2.0,>=0.12 in c:\users\demouser\desktop\pytest_demo\pyspark-env\lib\site-packages (from pytest) (1.0.0)
    Requirement already satisfied: colorama in c:\users\demouser\desktop\pytest_demo\pyspark-env\lib\site-packages (from pytest) (0.4.5)
    Requirement already satisfied: py>=1.8.2 in c:\users\demouser\desktop\pytest_demo\pyspark-env\lib\site-packages (from pytest) (1.11.0)
    Requirement already satisfied: atomicwrites>=1.0 in c:\users\demouser\desktop\pytest_demo\pyspark-env\lib\site-packages (from pytest) (1.4.1)
    Requirement already satisfied: attrs>=19.2.0 in c:\users\demouser\desktop\pytest_demo\pyspark-env\lib\site-packages (from pytest) (22.1.0)
    Requirement already satisfied: packaging in c:\users\demouser\desktop\pytest_demo\pyspark-env\lib\site-packages (from pytest) (21.3)
    Requirement already satisfied: iniconfig in c:\users\demouser\desktop\pytest_demo\pyspark-env\lib\site-packages (from pytest) (1.1.1)
    Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in c:\users\demouser\desktop\pytest_demo\pyspark-env\lib\site-packages (from packaging->pytest) (3.0.9)
    Installing collected packages: pytest
    Successfully installed pytest-7.1.2
    (pyspark-env) PS C:\Users\demouser\Desktop\pytest_demo> deactivate
  2. After installing the package above, the import error disappeared. pyspark was never installed with pip, yet up to this point everything still ran fine:
    PS C:\Users\demouser\Desktop\pytest_demo> spark-submit --master local[*] --conf spark.pyspark.python="C:\Users\demouser\Desktop\pytest_demo\pyspark-env\Scripts\python.exe" .\src\main.py 2>error.txt
    ============================= test session starts =============================
    platform win32 -- Python 3.8.12, pytest-7.1.2, pluggy-1.0.0
    rootdir: C:\Users\demouser\Desktop\pytest_demo\test, configfile: pytest.ini
    collected 6 items
    
    test\unit\test_factory.py::TestSparkSession::test_sparksession PASSED    [ 16%]
    test\unit\test_factory.py::TestFactory::test_five_dfs_count[Test_df_count0] PASSED [ 33%]
    test\unit\test_factory.py::TestFactory::test_five_dfs_count[Test_df_count1] PASSED [ 50%]
    test\unit\test_factory.py::TestFactory::test_five_dfs_count[Test_df_count2] PASSED [ 66%]
    test\unit\test_factory.py::TestFactory::test_five_dfs_count[Test_df_count3] PASSED [ 83%]
    test\unit\test_factory.py::TestFactory::test_five_dfs_count[Test_df_count4] PASSED [100%]
    
    ============================== warnings summary ===============================
    ..\..\..\..\Spark\spark-3.0.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\context.py:75
    unit/test_factory.py::TestSparkSession::test_sparksession
      C:\Spark\spark-3.0.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\context.py:75: DeprecationWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
    
    -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
    ======================= 6 passed, 2 warnings in 26.90s ========================
    INFO     __main__:main.py: pytest session finished
  3. The following version of pyspark is installed on my system:
    PS C:\Users\demouser\Desktop\pytest_demo> pyspark --version
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
          /_/
    
    Using Scala version 2.12.10, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_181
    Branch HEAD
    Compiled by user ubuntu on 2020-08-28T07:36:48Z
    Revision 2b147c4cd50da32fe2b4167f97c8142102a0510d
    Url https://gitbox.apache.org/repos/asf/spark.git
    Type --help for more information.
  4. Below are the packages present in the pyspark-env virtual environment:
    PS C:\Users\demouser\Desktop\pytest_demo> .\pyspark-env\Scripts\Activate.ps1
    (pyspark-env) PS C:\Users\demouser\Desktop\pytest_demo> pip freeze
    atomicwrites==1.4.1
    attrs==22.1.0
    colorama==0.4.5
    iniconfig==1.1.1
    packaging==21.3
    pluggy==1.0.0
    py==1.11.0
    pyparsing==3.0.9
    pytest==7.1.2
    tomli==2.0.1
    (pyspark-env) PS C:\Users\demouser\Desktop\pytest_demo> python --version
    Python 3.8.12

Please advise why this does not fail even though the pyspark package is not present in the virtual environment.
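A quick way to investigate this kind of question (not part of the original post, just a diagnostic sketch) is to print, from inside the submitted script, which interpreter is actually running and which `sys.path` entries could provide pyspark:

```python
import sys

# Diagnostic sketch: drop these lines into the submitted script (e.g. src/main.py)
# to see which interpreter spark-submit actually launched and which sys.path
# entries could supply the pyspark package.
print("interpreter:", sys.executable)
print("pyspark candidates:", [p for p in sys.path if "pyspark" in p.lower()])
```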

python-3.x apache-spark pyspark virtualenv spark-submit
1 Answer

As mentioned in the official documentation (https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html#using-virtualenv), you have to export the environment variables PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON, pointing them at the Python binary you are using (i.e. the binary in the bin folder of the virtualenv directory, or Scripts on Windows), or set the corresponding paths in the Spark conf via spark.pyspark.python and spark.pyspark.driver.python.
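A minimal sketch of the two approaches described above; the paths are assumed to match the question's project layout:

```shell
# Option 1: environment variables (POSIX shell shown; in PowerShell use
# $env:PYSPARK_PYTHON = "C:\...\pyspark-env\Scripts\python.exe" instead).
export PYSPARK_PYTHON="$PWD/pyspark-env/bin/python"
export PYSPARK_DRIVER_PYTHON="$PYSPARK_PYTHON"
spark-submit --master "local[*]" ./src/main.py

# Option 2: pass the same settings as Spark conf on the command line.
spark-submit --master "local[*]" \
  --conf spark.pyspark.python="$PWD/pyspark-env/bin/python" \
  --conf spark.pyspark.driver.python="$PWD/pyspark-env/bin/python" \
  ./src/main.py
```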

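As for why the import succeeds at all: spark-submit puts $SPARK_HOME/python and $SPARK_HOME/python/lib/pyspark.zip on PYTHONPATH for the interpreter it launches, so pyspark is importable even when it was never pip-installed into the virtualenv. The deprecation warning in the question already shows pyspark being loaded from pyspark.zip under C:\Spark rather than from the venv. The snippet below imitates that mechanism with a made-up package name:

```python
import os
import sys
import tempfile
import zipfile

# Build a zip containing a stand-in package, imitating how spark-submit puts
# $SPARK_HOME/python/lib/pyspark.zip on PYTHONPATH. "fakepyspark" is a
# hypothetical name used for illustration only.
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "fakepyspark.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("fakepyspark/__init__.py", "VERSION = '3.0.1'\n")

# Any interpreter with this zip on sys.path can import the package, even
# though nothing was pip-installed into its environment.
sys.path.insert(0, zip_path)
import fakepyspark

print(fakepyspark.VERSION)  # -> 3.0.1
```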