Non-deterministic behaviour of UNION of RDDs in Spark


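I am performing a union operation on 3 RDDs. I know that union does not preserve ordering, but in my case the behaviour is quite strange. Can somebody explain what is wrong with my code?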

I have a DataFrame of rows (myDF), which I convert to an RDD:

val myRdd = myDF.rdd.map(row => row.toSeq.toList.mkString(":")).map(rec => (2, rec))

myRdd.collect
/*
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
*/

val rowCount = myRdd.count() // Count of Records in myRdd

val header = "name:country:date:nextdate:1" // random header

// Generating Header Rdd
val headerRdd = sparkContext.parallelize(Array(header), 1).map(rec => (1, rec))

//Generating Trailer Rdd
val trailerRdd = sparkContext.parallelize(Array("T" + ":" + rowCount),1).map(rec => (3, rec))

//Performing Union
val unionRdd = headerRdd.union(myRdd).union(trailerRdd).map(rec => rec._2)
unionRdd.saveAsTextFile("pathLocation")

Since union does not preserve ordering, it should not give the result below.

Output

name:country:date:nextdate:1
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
T:3

This is without using any sorting such as

sortByKey(true, 1)

How am I getting the above output?

scala sorting apache-spark union rdd
1 Answer

In Spark, the elements within a particular partition are unordered, but the partitions themselves are ordered. Check this.
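
To make the point about ordered partitions concrete, here is a minimal sketch, not the asker's actual job. It assumes a local Spark master, and the object name, app name, and shortened record strings are placeholders. A plain union (no shared partitioner) concatenates the partitions of its inputs in the order of the union calls, and saveAsTextFile writes one part file per partition index, which would explain why the header, body and trailer come out in sequence without any sort.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UnionOrderSketch {
  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run.
    val conf = new SparkConf().setMaster("local[2]").setAppName("union-order-sketch")
    val sc = new SparkContext(conf)
    try {
      // Stand-ins for headerRdd, myRdd and trailerRdd from the question.
      val headerRdd  = sc.parallelize(Seq("name:country:date:nextdate:1"), 1)
      val bodyRdd    = sc.parallelize(Seq("Deepak", "Veeru", "Rajeev"), 2)
      val trailerRdd = sc.parallelize(Seq("T:3"), 1)

      val unionRdd = headerRdd.union(bodyRdd).union(trailerRdd)

      // The union holds 1 + 2 + 1 partitions: header first, trailer last.
      println(s"partitions: ${unionRdd.getNumPartitions}")

      // Tag each record with the index of the partition that holds it.
      unionRdd
        .mapPartitionsWithIndex((idx, it) => it.map(rec => s"$idx -> $rec"))
        .collect()
        .foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The header should show up under partition index 0 and the trailer under the last index. This only illustrates partition placement. The order of rows inside myRdd still depends on how myDF was produced, so an explicit sort is the safer choice if the file layout must be guaranteed.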
