I have a Scala DataFrame that looks like this:
+---+-----+----+----+----+
|ID |info |col1|col2|col3|
+---+-----+----+----+----+
|id1|info1|a1  |a2  |a3  |
|id2|info2|a1  |a3  |a4  |
+---+-----+----+----+----+
I want to generate all permutations of (col1, col2, col3) while keeping the data in the "ID" and "info" columns unchanged.
The output would look like this:
+---+-----+----+----+----+
|ID |info |col1|col2|col3|
+---+-----+----+----+----+
|id1|info1|a1  |a2  |a3  |
|id1|info1|a1  |a3  |a2  |
|id1|info1|a2  |a1  |a3  |
|id1|info1|a2  |a3  |a1  |
|id1|info1|a3  |a1  |a2  |
|id1|info1|a3  |a2  |a1  |
|id2|info2|a1  |a3  |a4  |
|id2|info2|a1  |a4  |a3  |
|id2|info2|a3  |a1  |a4  |
|id2|info2|a3  |a4  |a1  |
|id2|info2|a4  |a1  |a3  |
|id2|info2|a4  |a3  |a1  |
+---+-----+----+----+----+
My current thinking on a solution is as follows, but I'm curious whether there is a better way:
You can use the Seq.permutations function wrapped in a UDF, as follows:
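import org.apache.spark.sql.functions.{array, explode, udf}
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

// generating data
val df = Seq(
  ("id1", "info1", "a1", "a2", "a3"),
  ("id2", "info2", "a1", "a3", "a4")
).toDF("ID", "info", "col1", "col2", "col3")

// UDF wrapping Seq.permutations (assumed definition): array in, array of all orderings out
val permutations = udf((cols: Seq[String]) => cols.permutations.toList)

df
  // explode turns each permutation into its own row
  .select('ID, 'info, explode(permutations(array('col1, 'col2, 'col3))) as "cols")
  .select('ID, 'info, 'cols(0) as "col1", 'cols(1) as "col2", 'cols(2) as "col3")
  .show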
+---+-----+----+----+----+
| ID| info|col1|col2|col3|
+---+-----+----+----+----+
|id1|info1| a1| a2| a3|
|id1|info1| a1| a3| a2|
|id1|info1| a2| a1| a3|
|id1|info1| a2| a3| a1|
|id1|info1| a3| a1| a2|
|id1|info1| a3| a2| a1|
|id2|info2| a1| a3| a4|
|id2|info2| a1| a4| a3|
|id2|info2| a3| a1| a4|
|id2|info2| a3| a4| a1|
|id2|info2| a4| a1| a3|
|id2|info2| a4| a3| a1|
+---+-----+----+----+----+
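If you want to check what the UDF body does without starting Spark, here is a minimal plain-Scala sketch of Seq.permutations on a single row. The object name PermutationsDemo and the second sample row with a repeated value are made up for illustration and are not part of the question's data.

```scala
// Plain Scala, no Spark required. PermutationsDemo is a hypothetical name.
object PermutationsDemo {
  // Same values as the id1 row in the question
  val perms: List[Seq[String]] = Seq("a1", "a2", "a3").permutations.toList

  // Illustrative row with a repeated value (not from the question)
  val withDuplicate: List[Seq[String]] = Seq("a1", "a1", "a2").permutations.toList

  def main(args: Array[String]): Unit = {
    // One line per ordering of the three values
    perms.foreach(p => println(p.mkString(" ")))
    println(s"${perms.length} orderings for three distinct values")
    println(s"${withDuplicate.length} orderings when one value repeats")
  }
}
```

One thing worth knowing: Seq.permutations only yields distinct orderings, so a row whose three columns contain a repeated value will explode into fewer than six rows. If you need all six rows regardless of duplicates, you would have to permute column indices instead of values.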