I am using PySpark 3.4.1, Java 8, Hadoop 3.4.0, Scala 2.12.17, and Python 3.11.4. This is my code in VS Code:
def calculating_click(df):
    # Keep only click events, then replace missing values with 0
    click_data = df.filter(df.custom_track == "click")
    click_data = click_data.na.fill({'bid': 0})
    click_data = click_data.na.fill({'job_id': 0})
    click_data = click_data.na.fill({'publisher_id': 0})
    click_data = click_data.na.fill({'group_id': 0})
    click_data = click_data.na.fill({'campaign_id': 0})
    # Register a temporary view named 'clicks' (registerTempTable is deprecated)
    click_data.createOrReplaceTempView('clicks')
    click_output = spark.sql("""
        select job_id, date(ts) as date, hour(ts) as hour,
               publisher_id, campaign_id, group_id,
               avg(bid) as bid_set, count(*) as clicks, sum(bid) as spend_hour
        from clicks
        group by job_id, date(ts), hour(ts), publisher_id, campaign_id, group_id
    """)
I get this error:
Py4JError: An error occurred while calling o28.sql. Trace:
py4j.Py4JException: Method sql([class java.lang.String, class [Ljava.lang.Object;]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:321)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:329)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)
Can anyone help me solve this? I am trying to use PySpark but I get an error every time. Which versions of Spark, Hadoop, and Java should I use?
I ran into the same problem today, although I am using a containerized build. To see whether the spark:latest image was the cause, I pinned the image version to 3.5.0 to check: (https://hub.docker.com/layers/apache/spark/3.5.0/images/sha256-a4a48089219912a8a87d7928541d576df00fc8d95f18a1509624e32b0e5c97d7?context=explore)
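
A "Method sql(...) does not exist" error from Py4J usually means the Python side called the JVM with a signature that the running Spark build does not have, which tends to happen when the pip-installed pyspark package and the Spark JVM are different releases. Below is a minimal sketch for checking that. The helper names (major_minor, versions_match, check_session) are hypothetical, and comparing only major.minor is my assumption about what counts as "matching", not an official rule.

```python
import re


def major_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) parsed from a version string such as '3.5.0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def versions_match(python_side: str, jvm_side: str) -> bool:
    """True when both sides report the same major.minor release (an assumption)."""
    return major_minor(python_side) == major_minor(jvm_side)


def check_session(spark) -> None:
    """Compare the pyspark package version with the version the JVM reports."""
    import pyspark  # imported here so the helpers above work without pyspark

    python_side = pyspark.__version__  # version of the Python package
    jvm_side = spark.version           # version reported by the Spark JVM
    if not versions_match(python_side, jvm_side):
        raise RuntimeError(
            f"pyspark {python_side} is talking to Spark {jvm_side}; "
            "install matching versions or point SPARK_HOME at a matching build"
        )
```

Calling check_session(spark) right after creating the session should raise if, for example, the Python package is 3.5.x while SPARK_HOME or the container image provides Spark 3.4.x. If it does raise, pinning both sides to the same release is the first thing I would try.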