Is there a way to replace nulls in a PySpark DataFrame with the next following value? I need to backfill the prices in this table:
+----------+----------+-----+
|product_id| ts|price|
+----------+----------+-----+
| 1|2024-05-01| null|
| 1|2024-05-02| 109|
| 1|2024-05-03| 120|
| 2|2024-05-01| null|
| 2|2024-05-02| null|
| 2|2024-05-03| 115|
+----------+----------+-----+
Expected result:
+----------+----------+-----+
|product_id| ts|price|
+----------+----------+-----+
| 1|2024-05-01| 109|
| 1|2024-05-02| 109|
| 1|2024-05-03| 120|
| 2|2024-05-01| 115|
| 2|2024-05-02| 115|
| 2|2024-05-03| 115|
+----------+----------+-----+
df = [{'product_id': 1, 'ts': "2024-05-01"},
      {'product_id': 1, 'ts': "2024-05-02", "price": 109},
      {'product_id': 1, 'ts': "2024-05-03", "price": 120},
      {'product_id': 2, 'ts': "2024-05-01"},
      {'product_id': 2, 'ts': "2024-05-02"},
      {'product_id': 2, 'ts': "2024-05-03", "price": 115}]
data = spark.createDataFrame(df)
I tried this function:
def fill_na_prices(data: DataFrame) -> DataFrame:
    return data.withColumn(
        "price",
        F.first("price", ignorenulls=True).over(
            Window.partitionBy("product_id").orderBy("ts")
        ),
    )
but it doesn't work: with an `orderBy` and no explicit frame, the window defaults to `unboundedPreceding` through `currentRow`, so `first` only ever sees the current and earlier rows, never the following prices.
How about this?
from pyspark.sql import Window
import pyspark.sql.functions as F
windowSpec = (
Window.partitionBy("product_id")
.orderBy("ts")
.rowsBetween(Window.currentRow + 1, Window.unboundedFollowing)
)
data = data.withColumn(
"price",
F.when(
data["price"].isNull(), F.first("price", ignorenulls=True).over(windowSpec)
).otherwise(data["price"]),
)
data.show()
+----------+----------+-----+
|product_id| ts|price|
+----------+----------+-----+
| 1|2024-05-01| 109|
| 1|2024-05-02| 109|
| 1|2024-05-03| 120|
| 2|2024-05-01| 115|
| 2|2024-05-02| 115|
| 2|2024-05-03| 115|
+----------+----------+-----+
Note that this code can still leave nulls if the last value in a partition happens to be NULL. Is that what you want?