鉴于以下 rdd 按人的年龄分区,我创建一个数据集,我想为其输出 parquet 也按年龄分区:
val rdd = spark.sparkContext.parallelize(
Seq(Person(11, "John Doe"), Person(22, "Jane Doe"), Person(33, "Foo Bar"))
)
val rddPartitionedByAge = rdd.keyBy(_.age).partitionBy(new ByAgePartitioner)
//create dataset from an already partitioned RDD
import spark.implicits._
val partitionedDataset = spark.createDataset(rddPartitionedByAge.values)
partitionedDataset
.write
.mode("overwrite")
.partitionBy("age") //does spark re-shuffle the data here ?
.parquet("./output/datasetFromRdd")
由于我正在度假,凭记忆回答: