Split dataframe values and add the remainder as the next new record


I have a dataframe df as follows: one column (Column_1) with the 5 corresponding records below

Column_1

0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1

0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1

0000099598|000000|-1|1580.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1

0000099598|000000|-1|150.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1

0000099598|000000|-1|1113.75|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1

I want to transform the dataframe above like this: split each value on '|break|' and add the part after the marker as the next new record in the same column of the same dataframe, preserving order, so that the 5 records become 10. Please help.

Column_1

0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW

006||0000099598|000000|0000099598-1

0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW

006||0000099598|000000|0000099598-1

0000099598|000000|-1|1580.00|Need to map to EDW|Need to map to EDW

006||0000099598|000000|0000099598-1

0000099598|000000|-1|150.00|Need to map to EDW|Need to map to EDW

006||0000099598|000000|0000099598-1

0000099598|000000|-1|1113.75|Need to map to EDW|Need to map to EDW

006||0000099598|000000|0000099598-1

python pandas dataframe file pyspark
1 Answer

If you want to get this result with PySpark:

+------------------------------------------------------------------+
|Column_1                                                          |
+------------------------------------------------------------------+
|0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW   |
|006||0000099598|000000|0000099598-1                               |
|0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW   |
|006||0000099598|000000|0000099598-1                               |
|0000099598|000000|-1|1580.00|Need to map to EDW|Need to map to EDW|
|006||0000099598|000000|0000099598-1                               |
|0000099598|000000|-1|150.00|Need to map to EDW|Need to map to EDW |
|006||0000099598|000000|0000099598-1                               |
|0000099598|000000|-1|1113.75|Need to map to EDW|Need to map to EDW|
|006||0000099598|000000|0000099598-1                               |
+------------------------------------------------------------------+

You can try this (note that PySpark's split takes a regular expression, so the literal pipes in '|break|' must be escaped):

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode

spark = SparkSession.builder.getOrCreate()

# Input data: each record holds two logical rows joined by the '|break|' marker.
data = [
    ("0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",),
    ("0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",),
    ("0000099598|000000|-1|1580.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",),
    ("0000099598|000000|-1|150.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",),
    ("0000099598|000000|-1|1113.75|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",)
]

df = spark.createDataFrame(data, ["Column_1"])

# Split each value on the literal '|break|' marker; the pipes are escaped
# because split() interprets its pattern as a regular expression.
df_split = df.withColumn("Split_Column", split(df["Column_1"], r"\|break\|"))

# explode() turns each array element into its own row, preserving order,
# so the 5 input records become 10 output records.
df_exploded = df_split.withColumn("Column_1", explode("Split_Column")).select("Column_1")

df_exploded.show(10, False)
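
Since the question is also tagged pandas, here is a minimal equivalent sketch under the same assumptions (a single-column dataframe built from the sample records); str.split with regex=False treats the delimiter literally, and explode flattens the resulting lists in order:

import pandas as pd

records = [
    "0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",
    "0000099598|000000|-1|0.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",
    "0000099598|000000|-1|1580.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",
    "0000099598|000000|-1|150.00|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",
    "0000099598|000000|-1|1113.75|Need to map to EDW|Need to map to EDW|break|006||0000099598|000000|0000099598-1",
]
df = pd.DataFrame({"Column_1": records})

# Split each value on the literal '|break|' marker into a list of parts,
# then explode so each part becomes its own row, preserving order.
out = (
    df["Column_1"]
    .str.split("|break|", regex=False)
    .explode()
    .reset_index(drop=True)
    .to_frame()
)
print(out)

This produces the same 10 rows in the same order as the PySpark version.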