I have the following dataframe:
+---------------+--------------------+
|IndexedArtistID| recommendations|
+---------------+--------------------+
| 1580|[[919, 0.00249262...|
| 4900|[[41749, 7.143963...|
| 5300|[[0, 2.0147272E-4...|
| 6620|[[208780, 9.81092...|
+---------------+--------------------+
I want to split the recommendations column so that I get a dataframe like this:
+---------------+--------------------+
|IndexedArtistID| recommendations|
+---------------+--------------------+
| 1580|919 |
| 1580|0.00249262 |
| 4900|41749 |
| 4900|7.143963 |
| 5300|0 |
| 5300|2.0147272E-4 |
| 6620|208780 |
| 6620|9.81092 |
+---------------+--------------------+
So basically, I want to split the feature vector into columns, and then merge those columns into a single column. The merge part is described in How to split single row into multiple rows in Spark DataFrame using Java. Now, how do I do the split part in Java? For Scala it is explained here: Spark Scala: How to convert Dataframe[vector] to DataFrame[f1:Double, ..., fn: Double)], but I could not find a way to do the same thing in Java as shown in that link.
Here is Scala code that you can port to Java. It splits the array column into one row per element.
import spark.implicits._

val df1 = spark.createDataFrame(Seq((1, Seq(1.0, 2.2)), (2, Seq(2.0, 3.8)), (3, Seq(4.0, 5.3))))
  .toDF("IndexedArtistID", "recommendations")
df1.show()

// Build one (id, value) pair per array element, then flatten the
// per-row sequences into individual rows.
df1.map { r =>
  val recs = r.getAs[Seq[Double]]("recommendations")
  for {
    c <- recs
  } yield (r.getAs[Int]("IndexedArtistID"), c)
}.flatMap(identity).toDF("IndexedArtistID", "recommendations").show(false)
Input
+---------------+---------------+
|IndexedArtistID|recommendations|
+---------------+---------------+
| 1| [1.0, 2.2]|
| 2| [2.0, 3.8]|
| 3| [4.0, 5.3]|
+---------------+---------------+
Output
+---------------+---------------+
|IndexedArtistID|recommendations|
+---------------+---------------+
|1 |1.0 |
|1 |2.2 |
|2 |2.0 |
|2 |3.8 |
|3 |4.0 |
|3 |5.3 |
+---------------+---------------+
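Since the question asks for Java specifically: a minimal sketch (not part of the answer above), assuming `recommendations` is an `array<double>` column, using Spark's built-in `explode` function, which emits one output row per array element and avoids a typed `flatMap` entirely. The class name and sample data are illustrative.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

public class ExplodeRecommendations {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[*]").appName("explode-example").getOrCreate();

    // Schema mirroring the example: an int ID and an array of doubles.
    StructType schema = new StructType()
        .add("IndexedArtistID", DataTypes.IntegerType)
        .add("recommendations", DataTypes.createArrayType(DataTypes.DoubleType));

    List<Row> rows = Arrays.asList(
        RowFactory.create(1, Arrays.asList(1.0, 2.2)),
        RowFactory.create(2, Arrays.asList(2.0, 3.8)),
        RowFactory.create(3, Arrays.asList(4.0, 5.3)));

    Dataset<Row> df1 = spark.createDataFrame(rows, schema);

    // explode() generates one row per element of the array column,
    // keeping the ID alongside each element.
    Dataset<Row> result = df1.select(
        col("IndexedArtistID"),
        explode(col("recommendations")).as("recommendations"));
    result.show(false);

    spark.stop();
  }
}
```

If the column is actually an array of `(id, rating)` structs, as produced by ALS's `recommendForAllItems`, explode the array first and then select the struct fields from the exploded column.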