Improving the performance of Spark groupBy operations

Question

I have a DataFrame like the one below, with roughly 1 billion records and about 100 columns.

+---------+---------+------+------+------+
|  Col1   |  Col2   | Col3 | col4 | col5 |
+---------+---------+------+------+------+
| Value 1 | Value 2 |  123 | 10.0 | sae  |
| Value 1 | Value 2 |  123 | 10.0 | ser  |
| Value 1 | Value 2 |  123 | 10.0 | wer  |
+---------+---------+------+------+------+

I have to apply a groupBy operation to 5 of the 100 columns, as shown below.

df.groupBy("col1").count()
df.groupBy("col2").count()
df.groupBy("col3").count()
df.groupBy("col4").count()
df.groupBy("col5").count()

groupBy is an expensive operation that requires a lot of shuffling, so it takes a long time. Running it five times takes even longer.

In my case, though, I have to apply these 5 groupBy operations for every filter. For example, with 5 filters I have to run groupBy 5 * 5 = 25 times.

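I am looking for a way to solve this by caching the whole dataset partitioned on these 5 columns. My thinking is that if we partition, the data will be local to each executor, so not much shuffling will occur.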
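One way to test that idea before committing to it is to look at the physical plan. The sketch below is only an illustration under stated assumptions: the local SparkSession, the made-up rows, and the helper name `partitionedBy` are not from the question. It hash-partitions by a single column, caches the result, and prints the plans so that any `Exchange` (shuffle) operators are visible. As far as I understand, partitioning on one column can only help a groupBy on that same column, and partitioning on all five columns together would not line up with a groupBy on just one of them, but the printed plan is the thing to trust.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object PlanCheck {
  // Hypothetical helper: hash-partition by one column and mark the result for caching.
  def partitionedBy(df: DataFrame, column: String): DataFrame =
    df.repartition(col(column)).cache()

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would use the cluster's session.
    val spark = SparkSession.builder().master("local[*]").appName("plan-check").getOrCreate()
    import spark.implicits._

    // Made-up rows standing in for the real 1-billion-row DataFrame.
    val df = Seq(
      ("Value 1", "Value 2", 123, 10.0, "sae"),
      ("Value 1", "Value 2", 123, 10.0, "ser"),
      ("Value 1", "Value 2", 123, 10.0, "wer")
    ).toDF("col1", "col2", "col3", "col4", "col5")

    val byCol1 = partitionedBy(df, "col1")

    // Print the physical plans and look for Exchange operators in each one.
    byCol1.groupBy("col1").count().explain()
    byCol1.groupBy("col2").count().explain()

    spark.stop()
  }
}
```

Comparing the two printed plans shows whether the chosen partitioning removes the shuffle for the column you care about.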

Code-wise, the flow would be:

import spark.implicits._ // enables the $"colName" column syntax

val df = spark.read.format("parquet").load("file://location")

// Filter 1, followed by the five groupBy counts on the filtered data
val firstDf = df.filter($"col1" === "a")
firstDf.groupBy("col1").count()
firstDf.groupBy("col2").count()
firstDf.groupBy("col3").count()
firstDf.groupBy("col4").count()
firstDf.groupBy("col5").count()

// Filter 2
val secondDf = df.filter($"col2" === "abc")
secondDf.groupBy("col1").count()
secondDf.groupBy("col2").count()
secondDf.groupBy("col3").count()
secondDf.groupBy("col4").count()
secondDf.groupBy("col5").count()

// Filter 3
val thirdDf = df.filter($"col3" === "abdc")
thirdDf.groupBy("col1").count()
thirdDf.groupBy("col2").count()
thirdDf.groupBy("col3").count()
thirdDf.groupBy("col4").count()
thirdDf.groupBy("col5").count()
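The repeated blocks above can be folded into a loop over filters and grouping columns. The following is a minimal sketch, not a tuned solution: the object name, the `groupCounts` helper, the filter labels, and the sample rows are all assumptions made for illustration. It also projects the DataFrame down to the five grouping columns and caches that projection, on the assumption that every filter refers only to those columns, so the source only has to be read once.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object FilteredGroupCounts {
  // Grouping columns taken from the question.
  val groupCols: Seq[String] = Seq("col1", "col2", "col3", "col4", "col5")

  // For each named filter, build one groupBy().count() per grouping column.
  // Nothing is computed until an action (show, collect, ...) runs on a result.
  def groupCounts(df: DataFrame, filters: Map[String, Column]): Map[String, Map[String, DataFrame]] = {
    // Keep only the grouping columns and mark them for caching.
    // Assumption: the filters reference only these columns; otherwise keep those columns too.
    val slim = df.select(groupCols.map(col): _*).cache()
    filters.map { case (name, predicate) =>
      val filtered = slim.filter(predicate)
      name -> groupCols.map(c => c -> filtered.groupBy(c).count()).toMap
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("filtered-group-counts").getOrCreate()
    import spark.implicits._

    // Made-up rows standing in for the Parquet data.
    val df = Seq(
      ("a", "abc", 123, 10.0, "sae"),
      ("a", "xyz", 123, 10.0, "ser"),
      ("b", "abc", 456, 20.0, "wer")
    ).toDF("col1", "col2", "col3", "col4", "col5")

    val filters = Map(
      "col1 is a"   -> (col("col1") === "a"),
      "col2 is abc" -> (col("col2") === "abc")
    )

    for ((name, perColumn) <- groupCounts(df, filters); (c, counts) <- perColumn) {
      println(s"$name / $c")
      counts.show()
    }
    spark.stop()
  }
}
```

This still runs one job per filter and column, so it tidies the code and avoids re-reading the source rather than reducing the number of aggregations.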



scala apache-spark apache-spark-sql rdd
1 Answer

IIUC, you can use map() over the columns to compute the counts.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.count
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark, as in spark-shell

// Example data only; col6 stands for a column that does not need a count
val df = Seq(
  ("Value 1", "Value 2", 123, 10.0, "sae", "test1"),
  ("Value 1", "Value 2", 123, 10.0, "ser", "test2"),
  ("Value 1", "Value 2", 123, 10.0, "wer", "test3")
).toDF("Col1", "Col2", "Col3", "col4", "col5", "col6")

// The column list can be built in other ways depending on the requirement,
// for example by filtering df.columns
val interestedColumns = Array("Col1", "Col2", "Col3", "Col4", "Col5")

// count(col(c)) counts the non-null values of column c
df.select(interestedColumns.map(c => count(col(c)).alias(c)): _*).show()
+----+----+----+----+----+
|Col1|Col2|Col3|Col4|Col5|
+----+----+----+----+----+
|   3|   3|   3|   3|   3|
+----+----+----+----+----+
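The output above gives one non-null count per column. If per-value counts are needed instead (what `groupBy(c).count()` returns), one possible variation is to unpivot the selected columns into (column name, value) pairs and aggregate once. This is a different technique from the one in the answer, sketched here under assumptions: the object name, the `valueCounts` helper, and the output column names `column_name` and `value` are made up for illustration, and values are cast to string so that all columns share one type.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

object SinglePassCounts {
  // Returns one row per (column_name, value) pair with its count, using a single aggregation.
  def valueCounts(df: DataFrame, columns: Seq[String]): DataFrame = {
    // One struct per requested column: its name plus its value rendered as a string.
    val pairs = columns.map { c =>
      struct(lit(c).as("column_name"), col(c).cast("string").as("value"))
    }
    df.select(explode(array(pairs: _*)).as("kv"))
      .select(col("kv.column_name").as("column_name"), col("kv.value").as("value"))
      .groupBy("column_name", "value")
      .count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("single-pass-counts").getOrCreate()
    import spark.implicits._

    // Same example rows as in the answer.
    val df = Seq(
      ("Value 1", "Value 2", 123, 10.0, "sae", "test1"),
      ("Value 1", "Value 2", 123, 10.0, "ser", "test2"),
      ("Value 1", "Value 2", 123, 10.0, "wer", "test3")
    ).toDF("Col1", "Col2", "Col3", "col4", "col5", "col6")

    valueCounts(df, Seq("Col1", "Col2", "Col3", "col4", "col5")).show()
    spark.stop()
  }
}
```

The trade-off is that explode produces one intermediate row per input row and selected column, so whether this beats five separate groupBy jobs on a billion rows would need to be measured.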