"TypeError: 'JavaPackage' object is not callable" in pyspark

Problem description (votes: 0, answers: 1)

Here is my version information:

    python = 3.11.5
    pyspark = 3.4.1
    
    java -version
    java version "21.0.1" 2023-10-17 LTS
    Java(TM) SE Runtime Environment (build 21.0.1+12-LTS-29)
    Java HotSpot(TM) 64-Bit Server VM (build 21.0.1+12-LTS-29, mixed mode, sharing)

I am trying to run the following code in a Jupyter notebook:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("practice").getOrCreate()
sc = spark.sparkContext  # the RDD API is accessed through the SparkContext

data_path = r"pathToFile\TelecomData.csv"
rdd = sc.textFile(data_path)

# keep only rows whose 4th and 10th comma-separated fields are 'Y'
filteredRdd = rdd.filter(lambda pair: pair.split(",")[3] == 'Y' and pair.split(",")[9] == 'Y')

for rows in filteredRdd.collect():
    print(rows)

Here is the error I am facing:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 10
      7 filteredRdd = rdd.filter(lambda pair: pair.split(",")[3] =='Y' and pair.split(",")[9] =='Y')
      9 # print(sc._jvm.functions)
---> 10 for rows in filteredRdd.collect():
     11     print(rows)

File ~\anaconda3\envs\spark_latest\Lib\site-packages\pyspark\rdd.py:1814, in RDD.collect(self)
   1812 with SCCallSiteSync(self.context):
   1813     assert self.ctx._jvm is not None
-> 1814     sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
   1815 return list(_load_from_socket(sock_info, self._jrdd_deserializer))

File ~\anaconda3\envs\spark_latest\Lib\site-packages\pyspark\rdd.py:5441, in PipelinedRDD._jrdd(self)
   5438 else:
   5439     profiler = None
-> 5441 wrapped_func = _wrap_function(
   5442     self.ctx, self.func, self._prev_jrdd_deserializer, self._jrdd_deserializer, profiler
   5443 )
   5445 assert self.ctx._jvm is not None
   5446 python_rdd = self.ctx._jvm.PythonRDD(
   5447     self._prev_jrdd.rdd(), wrapped_func, self.preservesPartitioning, self.is_barrier
   5448 )

File ~\anaconda3\envs\spark_latest\Lib\site-packages\pyspark\rdd.py:5243, in _wrap_function(sc, func, deserializer, serializer, profiler)
   5241 pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
   5242 assert sc._jvm is not None
-> 5243 return sc._jvm.SimplePythonFunction(
   5244     bytearray(pickled_command),
   5245     env,
   5246     includes,
   5247     sc.pythonExec,
   5248     sc.pythonVer,
   5249     broadcast_vars,
   5250     sc._javaAccumulator,
   5251 )

TypeError: 'JavaPackage' object is not callable

So far I have checked the Spark installation on Windows. Some answers suggest that adding jar paths may fix this error, but in my case I am not sure which jars to add.
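For reference, a quick way to compare the Python-side pyspark version with the Spark installation the JVM actually picked up is something like the following (a rough sketch, assuming the `spark` session created above; the environment variables may simply be unset):

    import os
    import pyspark

    # Version of the pip-installed Python package
    print("pyspark (Python side):", pyspark.__version__)

    # Version reported by the JVM that the session actually started.
    # If this differs from the line above, the Python code and the
    # Spark jars are out of sync, which is a common cause of
    # "'JavaPackage' object is not callable".
    print("Spark (JVM side):", spark.version)

    # Where the jars and the JVM are coming from, if these are set at all
    print("SPARK_HOME:", os.environ.get("SPARK_HOME"))
    print("JAVA_HOME:", os.environ.get("JAVA_HOME"))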

Any help would be appreciated.

python apache-spark pyspark jupyter-notebook typeerror
1 Answer

I had the same problem. Changing sc._jvm.SimplePythonFunction to sc._jvm.PythonFunction in rdd.py worked for me. Hope this helps.
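For concreteness, the change described above is an edit to _wrap_function in the installed pyspark/rdd.py (the file shown in the traceback), roughly as follows. This patches the installed package in place; my assumption is that it only helps when the Spark jars on the JVM side are older ones that still expose PythonFunction rather than SimplePythonFunction:

    # pyspark/rdd.py, inside _wrap_function (around line 5243 in the traceback above)
    return sc._jvm.PythonFunction(   # was: sc._jvm.SimplePythonFunction(
        bytearray(pickled_command),
        env,
        includes,
        sc.pythonExec,
        sc.pythonVer,
        broadcast_vars,
        sc._javaAccumulator,
    )

A less invasive alternative is to make the versions agree instead of patching site-packages: install the pyspark package that matches the Spark distribution SPARK_HOME points to, or unset SPARK_HOME so that the jars bundled with the pip-installed pyspark are used.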
