Here is my version information:
python = 3.11.5
pyspark = 3.4.1
java -version
java version "21.0.1" 2023-10-17 LTS
Java(TM) SE Runtime Environment (build 21.0.1+12-LTS-29)
Java HotSpot(TM) 64-Bit Server VM (build 21.0.1+12-LTS-29, mixed mode, sharing)
I am trying to run the following code in a Jupyter notebook:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("practice").getOrCreate()
sc = spark.sparkContext  # get the SparkContext from the session; sc is used below

data_path = r"pathToFile\TelecomData.csv"  # raw string so the backslash is not treated as an escape

rdd = sc.textFile(data_path)
# keep lines where fields 3 and 9 (0-based) are both 'Y'
filteredRdd = rdd.filter(lambda pair: pair.split(",")[3] == 'Y' and pair.split(",")[9] == 'Y')

for rows in filteredRdd.collect():
    print(rows)
Here is the error I am facing:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[3], line 10
7 filteredRdd = rdd.filter(lambda pair: pair.split(",")[3] =='Y' and pair.split(",")[9] =='Y')
9 # print(sc._jvm.functions)
---> 10 for rows in filteredRdd.collect():
11 print(rows)
File ~\anaconda3\envs\spark_latest\Lib\site-packages\pyspark\rdd.py:1814, in RDD.collect(self)
1812 with SCCallSiteSync(self.context):
1813 assert self.ctx._jvm is not None
-> 1814 sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
1815 return list(_load_from_socket(sock_info, self._jrdd_deserializer))
File ~\anaconda3\envs\spark_latest\Lib\site-packages\pyspark\rdd.py:5441, in PipelinedRDD._jrdd(self)
5438 else:
5439 profiler = None
-> 5441 wrapped_func = _wrap_function(
5442 self.ctx, self.func, self._prev_jrdd_deserializer, self._jrdd_deserializer, profiler
5443 )
5445 assert self.ctx._jvm is not None
5446 python_rdd = self.ctx._jvm.PythonRDD(
5447 self._prev_jrdd.rdd(), wrapped_func, self.preservesPartitioning, self.is_barrier
5448 )
File ~\anaconda3\envs\spark_latest\Lib\site-packages\pyspark\rdd.py:5243, in _wrap_function(sc, func, deserializer, serializer, profiler)
5241 pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
5242 assert sc._jvm is not None
-> 5243 return sc._jvm.SimplePythonFunction(
5244 bytearray(pickled_command),
5245 env,
5246 includes,
5247 sc.pythonExec,
5248 sc.pythonVer,
5249 broadcast_vars,
5250 sc._javaAccumulator,
5251 )
TypeError: 'JavaPackage' object is not callable
So far I have checked my Spark installation on Windows. Some answers suggest that adding jar paths may fix this error, but in my case I am not sure which jars would need to be added.
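In case it helps, here is a quick cross-check of the Python-side pyspark package against the JVM-side Spark jars (a minimal sketch, run in the same notebook with the spark session created above; I would expect the two printed versions to match):

import pyspark

print(pyspark.__version__)  # version of the Python pyspark package
print(spark.version)        # version of the Spark jars loaded on the JVM side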
Any help would be greatly appreciated.
I had the same problem. Changing sc._jvm.SimplePythonFunction to sc._jvm.PythonFunction in the rdd.py file worked for me. Hope this helps.
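For context, my reading of the traceback (an assumption, not verified against every release): the JVM class SimplePythonFunction is not present in the Spark 3.4.1 jars, so py4j resolves sc._jvm.SimplePythonFunction to a JavaPackage placeholder, and calling it raises the TypeError shown above. That usually points to a mixed installation where the Python-side pyspark files are newer than the Spark jars on the classpath; the rename makes the Python side call the older PythonFunction class instead. If I read the fix right, the change in _wrap_function in pyspark/rdd.py (line 5243 in the traceback above) is roughly:

# pyspark/rdd.py, inside _wrap_function (line 5243 in the traceback above)
return sc._jvm.PythonFunction(  # was: sc._jvm.SimplePythonFunction(
    bytearray(pickled_command),
    env,
    includes,
    sc.pythonExec,
    sc.pythonVer,
    broadcast_vars,
    sc._javaAccumulator,
)

That said, the cleaner fix is probably to reinstall so that the pyspark package version matches the Spark jars the notebook actually loads, rather than patching the library file.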