I am trying to run a Python program with spark-submit in local mode using a Python virtual environment. Even though pyspark is not installed in the virtual environment, the program still runs without failing.
Here are the details of my test:
PS C:\Users\demouser\Desktop\pytest_demo> spark-submit --master local[*] --conf spark.pyspark.python="C:\Users\demouser\Desktop\pytest_demo\pyspark-env\Scripts\python.exe" .\src\main.py 2>error.txt
Traceback (most recent call last):
File "C:/Users/demouser/Desktop/pytest_demo/./src/main.py", line 1, in <module>
import pytest
ModuleNotFoundError: No module named 'pytest'
PS C:\Users\demouser\Desktop\pytest_demo> .\pyspark-env\Scripts\Activate.ps1
(pyspark-env) PS C:\Users\demouser\Desktop\pytest_demo> pip install pytest
Collecting pytest
Using cached pytest-7.1.2-py3-none-any.whl (297 kB)
Requirement already satisfied: tomli>=1.0.0 in c:\users\demouser\desktop\pytest_demo\pyspark-env\lib\site-packages (from pytest) (2.0.1)
Requirement already satisfied: pluggy<2.0,>=0.12 in c:\users\demouser\desktop\pytest_demo\pyspark-env\lib\site-packages (from pytest) (1.0.0)
Requirement already satisfied: colorama in c:\users\demouser\desktop\pytest_demo\pyspark-env\lib\site-packages (from pytest) (0.4.5)
Requirement already satisfied: py>=1.8.2 in c:\users\demouser\desktop\pytest_demo\pyspark-env\lib\site-packages (from pytest) (1.11.0)
Requirement already satisfied: atomicwrites>=1.0 in c:\users\demouser\desktop\pytest_demo\pyspark-env\lib\site-packages (from pytest) (1.4.1)
Requirement already satisfied: attrs>=19.2.0 in c:\users\demouser\desktop\pytest_demo\pyspark-env\lib\site-packages (from pytest) (22.1.0)
Requirement already satisfied: packaging in c:\users\demouser\desktop\pytest_demo\pyspark-env\lib\site-packages (from pytest) (21.3)
Requirement already satisfied: iniconfig in c:\users\demouser\desktop\pytest_demo\pyspark-env\lib\site-packages (from pytest) (1.1.1)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in c:\users\demouser\desktop\pytest_demo\pyspark-env\lib\site-packages (from packaging->pytest) (3.0.9)
Installing collected packages: pytest
Successfully installed pytest-7.1.2
(pyspark-env) PS C:\Users\demouser\Desktop\pytest_demo> deactivate
PS C:\Users\demouser\Desktop\pytest_demo> spark-submit --master local[*] --conf spark.pyspark.python="C:\Users\demouser\Desktop\pytest_demo\pyspark-env\Scripts\python.exe" .\src\main.py 2>error.txt
============================= test session starts =============================
platform win32 -- Python 3.8.12, pytest-7.1.2, pluggy-1.0.0
rootdir: C:\Users\demouser\Desktop\pytest_demo\test, configfile: pytest.ini
collected 6 items
test\unit\test_factory.py::TestSparkSession::test_sparksession PASSED [ 16%]
test\unit\test_factory.py::TestFactory::test_five_dfs_count[Test_df_count0] PASSED [ 33%]
test\unit\test_factory.py::TestFactory::test_five_dfs_count[Test_df_count1] PASSED [ 50%]
test\unit\test_factory.py::TestFactory::test_five_dfs_count[Test_df_count2] PASSED [ 66%]
test\unit\test_factory.py::TestFactory::test_five_dfs_count[Test_df_count3] PASSED [ 83%]
test\unit\test_factory.py::TestFactory::test_five_dfs_count[Test_df_count4] PASSED [100%]
============================== warnings summary ===============================
..\..\..\..\Spark\spark-3.0.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\context.py:75
unit/test_factory.py::TestSparkSession::test_sparksession
C:\Spark\spark-3.0.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\context.py:75: DeprecationWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================= 6 passed, 2 warnings in 26.90s ========================
INFO __main__:main.py: pytest session finished
PS C:\Users\demouser\Desktop\pytest_demo> pyspark --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/
Using Scala version 2.12.10, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_181
Branch HEAD
Compiled by user ubuntu on 2020-08-28T07:36:48Z
Revision 2b147c4cd50da32fe2b4167f97c8142102a0510d
Url https://gitbox.apache.org/repos/asf/spark.git
Type --help for more information.
PS C:\Users\demouser\Desktop\pytest_demo> .\pyspark-env\Scripts\Activate.ps1
(pyspark-env) PS C:\Users\demouser\Desktop\pytest_demo> pip freeze
atomicwrites==1.4.1
attrs==22.1.0
colorama==0.4.5
iniconfig==1.1.1
packaging==21.3
pluggy==1.0.0
py==1.11.0
pyparsing==3.0.9
pytest==7.1.2
tomli==2.0.1
(pyspark-env) PS C:\Users\demouser\Desktop\pytest_demo> python --version
Python 3.8.12
Please explain why this does not fail even though the pyspark package is not present in the virtual environment.
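One plausible explanation (a sketch, not a confirmed answer): the paths in the deprecation warning above (...\Spark\spark-3.0.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\context.py) suggest that pyspark is being loaded from the pyspark.zip archive shipped with the Spark installation, not from the virtualenv's site-packages; spark-submit puts that archive (and the bundled py4j zip) on PYTHONPATH before launching the worker Python. Python can import modules directly from a zip archive on sys.path, which is the mechanism at work. A minimal self-contained demonstration (the fakelib module name is made up purely for illustration):

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny zip archive containing a module named "fakelib",
# standing in for Spark's bundled pyspark.zip.
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "fakelib.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("fakelib/__init__.py", "VERSION = '0.0.1'\n")

# spark-submit does the equivalent of this for pyspark.zip: it
# prepends the archive to PYTHONPATH before starting Python, so the
# interpreter resolves `import pyspark` from the zip, no matter which
# packages the active virtualenv has installed.
sys.path.insert(0, zip_path)

import fakelib  # resolved from the zip, not from site-packages
print(fakelib.VERSION)
```

This would explain why your main.py imports pyspark successfully while `pip freeze` inside the virtualenv lists no pyspark at all.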
As mentioned in the official documentation (https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html#using-virtualenv), you must export the environment variables PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON to point at the Python binary you are using (i.e. the binary in the bin folder of the virtualenv directory), or set the path via the SparkConf properties spark.pyspark.python and spark.pyspark.driver.python.
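For completeness, the environment-variable route described in the docs would look like this in PowerShell (a sketch; the paths assume the same virtualenv location used in the transcript above, and the variables must be set before calling spark-submit):

```shell
# Point both the driver and the worker Python at the virtualenv's
# interpreter. This has the same effect as passing
# --conf spark.pyspark.python / spark.pyspark.driver.python on the
# spark-submit command line.
$env:PYSPARK_DRIVER_PYTHON = "C:\Users\demouser\Desktop\pytest_demo\pyspark-env\Scripts\python.exe"
$env:PYSPARK_PYTHON = "C:\Users\demouser\Desktop\pytest_demo\pyspark-env\Scripts\python.exe"

spark-submit --master local[*] .\src\main.py
```

Note that either route only controls which Python interpreter Spark launches; neither requires pyspark to be pip-installed into that interpreter's environment.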