What is the best way to generate permutations of columns?


I have a Scala DataFrame that looks like this:

+---+-----+----+----+----+
|ID |info |col1|col2|col3|
+---+-----+----+----+----+
|id1|info1|a1  |a2  |a3  |
|id2|info2|a1  |a3  |a4  |
+---+-----+----+----+----+

I want to generate all permutations of (col1, col2, col3) while keeping the data in the "ID" and "info" columns unchanged.

The output would look like this:

+---+-----+----+----+----+
|ID |info |col1|col2|col3|
+---+-----+----+----+----+
|id1|info1|a1  |a2  |a3  |
|id1|info1|a1  |a3  |a2  |
|id1|info1|a2  |a1  |a3  |
|id1|info1|a2  |a3  |a1  |
|id1|info1|a3  |a1  |a2  |
|id1|info1|a3  |a2  |a1  |
|id2|info2|a1  |a3  |a4  |
|id2|info2|a1  |a4  |a3  |
|id2|info2|a3  |a1  |a4  |
|id2|info2|a3  |a4  |a1  |
|id2|info2|a4  |a1  |a3  |
|id2|info2|a4  |a3  |a1  |
+---+-----+----+----+----+

The thought process behind my current solution is as follows, but I'm curious whether there is a better way:

  1. Create a new column that combines (col1, col2, col3) into an array
  2. Generate the permutations of that array with Scala's permutations method (see the snippet right after this list)
  3. Explode the new column
  4. Map each index of the array in the exploded column to new col1, col2, and col3 columns
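
For reference, step 2 leans on the standard library's permutations iterator, which behaves like this in plain Scala:

// permutations lazily yields every ordering of the sequence
Seq("a1", "a2", "a3").permutations.toList
// List(List(a1, a2, a3), List(a1, a3, a2), List(a2, a1, a3),
//      List(a2, a3, a1), List(a3, a1, a2), List(a3, a2, a1))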
scala apache-spark
1 Answer

You can use Scala's Seq.permutations function wrapped in a UDF, as follows:

import org.apache.spark.sql.functions._
import spark.implicits._

// UDF wrapping Seq.permutations: returns every ordering of the input array
val permutations = udf((xs: Seq[String]) => xs.permutations.toSeq)

// generating data
val df = Seq(
    ("id1", "info1", "a1", "a2", "a3"),
    ("id2", "info2", "a1", "a3", "a4")
).toDF("ID", "info", "col1", "col2", "col3")

df
    .select('ID, 'info, explode(permutations(array('col1, 'col2, 'col3))) as "cols")
    .select('ID, 'info, 'cols(0) as "col1", 'cols(1) as "col2", 'cols(2) as "col3")
    .show
+---+-----+----+----+----+
| ID| info|col1|col2|col3|
+---+-----+----+----+----+
|id1|info1|  a1|  a2|  a3|
|id1|info1|  a1|  a3|  a2|
|id1|info1|  a2|  a1|  a3|
|id1|info1|  a2|  a3|  a1|
|id1|info1|  a3|  a1|  a2|
|id1|info1|  a3|  a2|  a1|
|id2|info2|  a1|  a3|  a4|
|id2|info2|  a1|  a4|  a3|
|id2|info2|  a3|  a1|  a4|
|id2|info2|  a3|  a4|  a1|
|id2|info2|  a4|  a1|  a3|
|id2|info2|  a4|  a3|  a1|
+---+-----+----+----+----+
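
If you would rather skip the UDF, a roughly equivalent sketch is to flatMap over a typed Dataset and run Seq.permutations per row (this assumes spark.implicits._ is in scope for the encoders; result is just a name chosen here for illustration):

// Alternative sketch: no UDF, permutations computed in a typed flatMap
val result = df
    .as[(String, String, String, String, String)]
    .flatMap { case (id, info, c1, c2, c3) =>
        Seq(c1, c2, c3).permutations.map(p => (id, info, p(0), p(1), p(2)))
    }
    .toDF("ID", "info", "col1", "col2", "col3")
result.show

Either way, each input row expands to n! output rows (6 here for three columns), so the result grows factorially with the number of permuted columns.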