I have a DataFrame df as follows: one column (Column_1) with the corresponding 5 records shown below:
0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1
0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1
0000099598|000000|-1|1580.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1
0000099598|000000|-1|150.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1
0000099598|000000|-1|1113.75|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1
I want to transform the DataFrame above like this: split each record on '|break|' and append the part after the marker as the next new record in the same column of the same DataFrame, giving 10 records, with everything kept in the original order. Please help.
Column_1
0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW
006||0000099598|000000|0000099598-1
0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW
006||0000099598|000000|0000099598-1
0000099598|000000|-1|1580.00|Need to map to EDW|Need to map to EDW
006||0000099598|000000|0000099598-1
0000099598|000000|-1|150.00|Need to map to EDW|Need to map to EDW
006||0000099598|000000|0000099598-1
0000099598|000000|-1|1113.75|Need to map to EDW|Need to map to EDW
006||0000099598|000000|0000099598-1
If you want to get this result with PySpark:
+------------------------------------------------------------------+
|Column_1 |
+------------------------------------------------------------------+
|0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW |
|006||0000099598|000000|0000099598-1 |
|0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW |
|006||0000099598|000000|0000099598-1 |
|0000099598|000000|-1|1580.00|Need to map to EDW|Need to map to EDW|
|006||0000099598|000000|0000099598-1 |
|0000099598|000000|-1|150.00|Need to map to EDW|Need to map to EDW |
|006||0000099598|000000|0000099598-1 |
|0000099598|000000|-1|1113.75|Need to map to EDW|Need to map to EDW|
|006||0000099598|000000|0000099598-1 |
+------------------------------------------------------------------+
You can try this:
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode

spark = SparkSession.builder.getOrCreate()

# input data: each record holds two logical rows joined by the '|break|' marker
data = [
("0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",),
("0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",),
("0000099598|000000|-1|1580.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",),
("0000099598|000000|-1|150.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",),
("0000099598|000000|-1|1113.75|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",)
]
df = spark.createDataFrame(data, ["Column_1"])

# split() takes a regex pattern, so the pipes around 'break' must be escaped
df_split = df.withColumn("Split_Column", split(df["Column_1"], r"\|break\|"))

# explode the resulting array so each part becomes its own row
df_exploded = df_split.withColumn("Column_1", explode("Split_Column")).select("Column_1")

df_exploded.show(10, False)
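One caveat on the "in order" requirement: explode keeps the split parts of each row in sequence, but Spark does not guarantee overall row order across transformations. If the 10 output records must strictly follow the input sequence, a minimal sketch of one way to pin it down (reusing the df built above; note that monotonically_increasing_id() only follows insertion order within a partition, which holds for a small local input like this) is to tag each row with an id, explode with posexplode to capture each part's position, and sort on both:

from pyspark.sql.functions import monotonically_increasing_id, posexplode

# tag each original row, explode with the position of each split part, then sort
df_ordered = (
    df.withColumn("row_id", monotonically_increasing_id())
      .withColumn("Split_Column", split("Column_1", r"\|break\|"))
      .select("row_id", posexplode("Split_Column").alias("pos", "Column_1"))
      .orderBy("row_id", "pos")
      .select("Column_1")
)
df_ordered.show(10, False)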