PySpark: sum multiple columns row-wise

Question

I'm working in PySpark. I have three datasets like the ones below (sometimes more than three):

from pyspark.sql import functions as f

aa1 = dfActivity.filter(f.col("UserID") == 514).select('UserID', 'DoctorID', 'Department', 'Company', 'scoreHrs', 'MonthNumber', 'YearNumber', 'PeriodID')
aa2 = dfCalories.filter(f.col("UserID") == 514).select('UserID', 'DoctorID', 'Department', 'Company', 'score', 'MonthNumber', 'YearNumber', 'PeriodID')
aa3 = dfWHO5.filter(f.col("UserID") == 514).select('UserID', 'DoctorID', 'Department', 'Company', 'score', 'MonthNumber', 'YearNumber', 'PeriodID')

I need to sum the "scoreHrs" column from aa1 with the "score" columns from aa2 and aa3, row by row, and assign the result to a new dataframe.

How can this be done?
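To make the requirement concrete, here is a minimal illustration with made-up values (the column names follow the snippet above; the numbers are hypothetical):

# hypothetical toy values, for illustration only
# aa1.scoreHrs: [1, 2, 3]
# aa2.score:    [10, 20, 30]
# aa3.score:    [100, 200, 300]
# desired new dataframe, summed row by row:
# sum:          [111, 222, 333]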

pyspark sum
1 Answer

Assuming you only need the newly created sum column in the new dataframe, here is my answer. The idea is to give each dataframe a synthetic row number to join on, rename the duplicate "score" columns so they can coexist after the join, join everything on the row number, and then add the columns together:

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number, expr

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([["x", 1], ["x", 2], ["x", 3], ["x", 4], ["x", 5],
                             ["x", 6]], ["col1", "scorehrs"])
df2 = spark.createDataFrame([["x", 22], ["x", 22], ["x", 23], ["x", 24], ["x", 25],
                             ["x", 26]], ["col1", "score"])
df3 = spark.createDataFrame([["x", 31], ["x", 32], ["x", 33], ["x", 34], ["x", 35],
                             ["x", 36]], ["col1", "score"])

# Window used to assign a sequential row number within each dataframe.
w1 = Window.orderBy(monotonically_increasing_id())
list_dfs = [df1, df2, df3]
new_dfs = []

for idx, df in enumerate(list_dfs):
    # Add the "row" join key, and rename the duplicate "score"
    # columns so they stay distinguishable after the join.
    df = df.withColumn("row", row_number().over(w1))
    if "score" in df.columns:
        df = df.withColumnRenamed("score", "score_" + str(idx))
    new_dfs.append(df)

# Join all dataframes on the synthetic row number.
joined_df = new_dfs[0]
for df in new_dfs[1:]:
    joined_df = joined_df.join(df, on="row", how="inner")

# Sum the three score columns row by row and keep only the result.
joined_df = joined_df.withColumn("sum", expr("score_1 + score_2 + scorehrs")).select("sum")
joined_df.show()
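One caveat: monotonically_increasing_id() does not guarantee that separate dataframes receive IDs in the same logical order, and an unpartitioned window pulls all rows into a single partition, so aligning rows this way is only safe for small, deterministically ordered data. Since the original dataframes already share identifier columns, a more robust sketch, under the assumption that UserID, MonthNumber, YearNumber, and PeriodID uniquely identify a row in each dataframe (the aliases score_cal and score_who5 are made up for illustration), would join on those natural keys instead:

# a sketch assuming the key columns uniquely identify each row
from pyspark.sql import functions as f

keys = ["UserID", "MonthNumber", "YearNumber", "PeriodID"]

joined = (aa1.select(*keys, f.col("scoreHrs"))
             .join(aa2.select(*keys, f.col("score").alias("score_cal")), on=keys)
             .join(aa3.select(*keys, f.col("score").alias("score_who5")), on=keys))

# Sum the three score columns row by row.
result = joined.withColumn("sum", f.col("scoreHrs") + f.col("score_cal") + f.col("score_who5"))
result.select(*keys, "sum").show()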