我有如下表:
+-------+---------+---------+
|movieId|movieName| genre|
+-------+---------+---------+
| 1| example1| action|
| 1| example1| thriller|
| 1| example1| romance|
| 2| example2|fantastic|
| 2| example2| action|
+-------+---------+---------+
我试图做到的,是那里的ID和名称是相同的类型值附加在一起。像这样:
+-------+---------+---------------------------+
|movieId|movieName| genre |
+-------+---------+---------------------------+
| 1| example1| action|thriller|romance |
| 2| example2| action|fantastic |
+-------+---------+---------------------------+
使用groupBy
和collect_list
获得具有相同电影名称的项目清单。然后使用concat_ws
(如果顺序很重要,第一次使用sort_array
)组合这些为一个字符串。小例子与给定的样本数据帧:
val df2 = df.groupBy("movieId", "movieName")
.agg(collect_list($"genre").as("genre"))
.withColumn("genre", concat_ws("|", sort_array($"genre")))
给出结果:
+-------+---------+-----------------------+
|movieId|movieName|genre |
+-------+---------+-----------------------+
|1 |example1 |action|thriller|romance|
|2 |example2 |action|fantastic |
+-------+---------+-----------------------+