Update DataFrame values by timestamp in Scala

Question  Votes: -3  Answers: 1

I have this DataFrame:

+----------------+-----------------------------+--------------------+--------------+----------------+
|   customerid   | event                       | A                  | B            | C              |
+----------------+-----------------------------+--------------------+--------------+----------------+
|     1222222    | 2019-02-07 06:50:40.0       |aaaaaa              | 25           | 5025           |
|     1222222    | 2019-02-07 06:50:42.0       |aaaaaa              | 35           | 5000           |
|     1222222    | 2019-02-07 06:51:56.0       |aaaaaa              | 100          | 4965           |
+----------------+-----------------------------+--------------------+--------------+----------------+

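I want to update the value of column C according to event (a timestamp) and keep, in a new DataFrame, only the row with the most recent update, like this: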

+----------------+-----------------------------+--------------------+--------------+----------------+
|   customerid   | event                       | A                  | B            | C              |
+----------------+-----------------------------+--------------------+--------------+----------------+
|     1222222    | 2019-02-07 06:51:56.0       |aaaaaa              | 100          | 4965           |
+----------------+-----------------------------+--------------------+--------------+----------------+

The data arrives in streaming mode, through Spark Streaming.

scala apache-spark dataframe bigdata spark-streaming
1 Answer
0 votes

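You can try adding a row number that is partitioned by customerid and ordered by event descending, then keeping only the rows where rownum is 1. Within each customer the newest event gets rownum 1, so the filter leaves exactly one row per customerid. I hope this helps.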

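import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}
// Rank each customer's rows newest-first by event, then keep only rank 1.
df.withColumn("rownum", row_number().over(Window.partitionBy("customerid").orderBy(col("event").desc)))
    .filter(col("rownum") === 1)
    .drop("rownum")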
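
For anyone who wants to try this end to end, here is a minimal sketch that rebuilds the three sample rows from the question and applies the same row-number approach. The object name `LatestPerCustomer`, the helper `latestPerCustomer`, the local master setting, and the column types (Long, Timestamp, String, Int, Int) are assumptions made for illustration; they are not from the original post.

```scala
import java.sql.Timestamp

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Hypothetical wrapper object, named here only for illustration.
object LatestPerCustomer {

  // Keep only the most recent row (by `event`) for each customerid.
  def latestPerCustomer(df: DataFrame): DataFrame = {
    val newestFirst =
      Window.partitionBy("customerid").orderBy(col("event").desc)
    df.withColumn("rownum", row_number().over(newestFirst))
      .filter(col("rownum") === 1)
      .drop("rownum")
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimenting; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("latest-per-customer")
      .getOrCreate()
    import spark.implicits._

    // The three sample rows from the question; column types are assumed.
    val df = Seq(
      (1222222L, Timestamp.valueOf("2019-02-07 06:50:40"), "aaaaaa", 25, 5025),
      (1222222L, Timestamp.valueOf("2019-02-07 06:50:42"), "aaaaaa", 35, 5000),
      (1222222L, Timestamp.valueOf("2019-02-07 06:51:56"), "aaaaaa", 100, 4965)
    ).toDF("customerid", "event", "A", "B", "C")

    // Should leave only the 06:51:56 row for customer 1222222.
    latestPerCustomer(df).show(false)

    spark.stop()
  }
}
```

This runs on a static (batch) DataFrame. For the streaming case in the question, one option is to call `latestPerCustomer` on each micro-batch (for example inside `foreachRDD` with DStreams, or `foreachBatch` in Structured Streaming on Spark 2.4+), since, as far as I know, Structured Streaming rejects non-time-based window functions such as `row_number` on a streaming DataFrame. Note that this only yields the latest row within each batch; keeping the latest row across batches would need extra state or a merge into the sink.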