I have this DataFrame:
+----------------+-----------------------------+--------------------+--------------+----------------+
| customerid | event | A | B | C |
+----------------+-----------------------------+--------------------+--------------+----------------+
| 1222222 | 2019-02-07 06:50:40.0 |aaaaaa | 25 | 5025 |
| 1222222 | 2019-02-07 06:50:42.0 |aaaaaa | 35 | 5000 |
| 1222222 | 2019-02-07 06:51:56.0 |aaaaaa | 100 | 4965 |
+----------------+-----------------------------+--------------------+--------------+----------------+
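I want to update the value of column C by event (the timestamp) and keep only the row with the latest update in a new DataFrame, as shown below: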
+----------------+-----------------------------+--------------------+--------------+----------------+
| customerid | event | A | B | C |
+----------------+-----------------------------+--------------------+--------------+----------------+
| 1222222 | 2019-02-07 06:51:56.0 |aaaaaa | 100 | 4965 |
+----------------+-----------------------------+--------------------+--------------+----------------+
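The data arrives in streaming mode, with Spark Streaming.
You can try creating a row number partitioned by customerid and ordered by event descending, then keep only the rows where rownum is 1. I hope this helps.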
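import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}
// Rank each customer's rows by event time, newest first, and keep only the top one.
df.withColumn("rownum", row_number().over(Window.partitionBy("customerid").orderBy(col("event").desc)))
  .filter(col("rownum") === 1)
  .drop("rownum")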
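To try the row-number approach above on the sample rows from the question, here is a minimal self-contained sketch. The local SparkSession, the object name LatestRowPerCustomer, and the column types (Long, Timestamp, String, Int, Int) are assumptions made for illustration; the question does not state them. It runs as a plain batch job. If df is a Structured Streaming DataFrame, a window like this is likely to be rejected, since non-time-based windows are not supported there, so the same logic would probably have to run per micro-batch (for example inside foreachBatch).

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object LatestRowPerCustomer {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not from the question).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("latest-row-per-customer")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question's table; column types are assumed.
    val df = Seq(
      (1222222L, Timestamp.valueOf("2019-02-07 06:50:40"), "aaaaaa", 25, 5025),
      (1222222L, Timestamp.valueOf("2019-02-07 06:50:42"), "aaaaaa", 35, 5000),
      (1222222L, Timestamp.valueOf("2019-02-07 06:51:56"), "aaaaaa", 100, 4965)
    ).toDF("customerid", "event", "A", "B", "C")

    // One partition per customer, newest event first.
    val newestFirst = Window.partitionBy("customerid").orderBy(col("event").desc)

    // Number the rows within each customer and keep only the newest one.
    val latest = df
      .withColumn("rownum", row_number().over(newestFirst))
      .filter(col("rownum") === 1)
      .drop("rownum")

    latest.show(false)
    spark.stop()
  }
}
```

Note that row_number assigns exactly one row the value 1 per customer, so if two rows share the same event timestamp, which of them is kept is arbitrary unless a tie-breaking column is added to the orderBy.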