Maybe try something like this:
import pyspark.sql.functions as F
# Group by date, depot, bin, and status
grouped_df = df.groupBy('date', 'depot', 'bin', 'status')
# Calculate the desired statistics in a single aggregation
stats = grouped_df.agg(
    F.count('id').alias('count_scans'),
    F.avg('count').alias('avg_count_scans'),
    F.stddev('count').alias('stddev_count_scans'),
    F.expr("percentile_approx(count, 0.25)").alias("percentile_25"),
    F.expr("percentile_approx(count, 0.75)").alias("percentile_75"),
)
# Display results
stats.show()
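
As a side note, if you happen to be on Spark 3.1 or newer (an assumption about your environment), pyspark.sql.functions exposes percentile_approx directly, so you can drop the expr() strings. A minimal sketch of the same aggregation:

# Sketch assuming Spark >= 3.1, where F.percentile_approx is available
stats = grouped_df.agg(
    F.count('id').alias('count_scans'),
    F.avg('count').alias('avg_count_scans'),
    F.stddev('count').alias('stddev_count_scans'),
    F.percentile_approx('count', 0.25).alias('percentile_25'),
    F.percentile_approx('count', 0.75).alias('percentile_75'),
)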
Hope this helps. I haven't tested this code and I'm still fairly new to PySpark, so I can't be sure it's correct.