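The PySpark @functions.pandas_udf function (decorator) causes an unresolvable dependency error in the Foundry Build Service environment. It relies on pyarrow, which in turn requires an OpenSSL version that the build system environment does not have. Installing it in the user/project environment by adding it to meta.yaml does not fix the problem either.
Standard output: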
ImportError: PyArrow >= 4.0.0 must be installed; however, it was not found.
Traceback (most recent call last):
File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/site-packages/pyspark/sql/pandas/utils.py", line 53, in require_minimum_pyarrow_version
import pyarrow
File "/app/work-dir/__environment__/__SYMLINKS__/site-packages/pyarrow/__init__.py", line 65, in <module>
import pyarrow.lib as _lib
ImportError: /app/work-dir/__python_runtime_environment__/__SYMLINKS__/lib-dynload/../../libcrypto.so.3: version `OPENSSL_3.4.0' not found (required by /app/work-dir/__environment__/__SYMLINKS__/site-packages/pyarrow/../../../././libssl.so.3)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/site-packages/transforms_spark_module/delegate.py", line 100, in _execute_job
result = job.run(
File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/site-packages/transforms/_build.py", line 329, in run
self._transform.compute(**kwargs, **parameters)
File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/site-packages/transforms/api/_transform.py", line 334, in compute
output_df: Union[DataFrame, Any] = self(**kwargs)
File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/site-packages/transforms/api/_transform.py", line 183, in __call__
return self._compute_func(*args, **kwargs)
File "/app/work-dir/__user_code_environment__/__SYMLINKS__/site-packages/myproject/datasets/data_anon.py", line 15, in compute
"case_weight": redist(df, "case_weight"),
File "/app/work-dir/__user_code_environment__/__SYMLINKS__/site-packages/myproject/datasets/utils.py", line 32, in redist
df = df.withColumn(column_name, add_noise(column_name, dist))
File "/app/work-dir/__user_code_environment__/__SYMLINKS__/site-packages/myproject/datasets/utils.py", line 14, in add_noise
@F.pandas_udf(DoubleType())
File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/site-packages/pyspark/sql/pandas/functions.py", line 338, in pandas_udf
require_minimum_pyarrow_version()
File "/app/work-dir/__python_runtime_environment__/__SYMLINKS__/site-packages/pyspark/sql/pandas/utils.py", line 60, in require_minimum_pyarrow_version
raise ImportError(
ImportError: PyArrow >= 4.0.0 must be installed; however, it was not found.
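The PySpark error message is misleading. As you correctly identified, the real cause is that the Conda environment and the Foundry Build environment contain different OpenSSL versions.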
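One way to surface the underlying failure instead of PySpark's generic message is to walk the exception chain, since the traceback shows the PyArrow error was raised with an explicit cause. The following is only a minimal sketch: root_cause and demo are hypothetical helper names, not part of PySpark or Foundry, and demo merely simulates the shape of the traceback above.

```python
def root_cause(exc: BaseException) -> BaseException:
    """Follow explicit `raise ... from ...` links back to the first exception."""
    seen = set()
    # Guard against (unlikely) cycles in the cause chain.
    while exc.__cause__ is not None and id(exc) not in seen:
        seen.add(id(exc))
        exc = exc.__cause__
    return exc


def demo() -> BaseException:
    # Simulated chain: a low-level linker ImportError re-raised as the
    # generic "PyArrow must be installed" ImportError.
    try:
        try:
            raise ImportError("libcrypto.so.3: version `OPENSSL_3.4.0' not found")
        except ImportError as low_level:
            raise ImportError(
                "PyArrow >= 4.0.0 must be installed; however, it was not found."
            ) from low_level
    except ImportError as wrapped:
        return root_cause(wrapped)


if __name__ == "__main__":
    print(repr(demo()))
```

In a real transform, the same helper could be applied to the exception caught around the code that defines the pandas UDF, so that the linker message about libcrypto appears in the logs directly.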
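Newer versions of Python Transforms ship with OpenSSL 3.4.0, so you can resolve this by making sure you do not pin an openssl version in your meta.yaml file and by upgrading the repository to the latest template version.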
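After upgrading, a quick sanity check is to log which OpenSSL the Python interpreter is linked against. This is a sketch under an assumption: it reports the OpenSSL seen by Python's standard ssl module in whichever environment runs the code, which may or may not be the same library pyarrow ends up loading. The function name openssl_at_least is hypothetical.

```python
import ssl


def openssl_at_least(minimum=(3, 4, 0), version_info=None) -> bool:
    """Return True if the (major, minor, fix) OpenSSL version meets `minimum`."""
    if version_info is None:
        # Tuple of five integers; the first three are major, minor, fix.
        version_info = ssl.OPENSSL_VERSION_INFO
    return tuple(version_info[:3]) >= tuple(minimum)


if __name__ == "__main__":
    print(ssl.OPENSSL_VERSION)
    print("OpenSSL >= 3.4.0:", openssl_at_least())
```

If this prints a version older than 3.4.0 inside the build, the repository is probably still on an older template or something is still pinning openssl.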