How to filter date columns and store them as numbers in a DataFrame using Scala


I have a DataFrame (dateds1) that looks like this:

+-----------+-----------+-------------------+-------------------+
|DateofBirth|JoiningDate|      Contract Date|        ReleaseDate|
+-----------+-----------+-------------------+-------------------+
| 1995/09/16| 2008/09/09|2009-02-09 00:00:00|2017-09-09 00:00:00|
| 1994/09/20| 2008/09/10|1999-05-05 00:00:00|2016-09-30 00:00:00|
| 1993/09/24| 2016/06/29|2003-12-07 00:00:00|2028-02-13 00:00:00|
| 1992/09/28| 2007/06/24|2004-06-05 00:00:00|2019-09-24 00:00:00|
| 1991/10/03| 2011/07/07|2011-07-07 00:00:00|2020-03-30 00:00:00|
| 1990/10/07| 2009/02/09|2009-02-09 00:00:00|2011-03-13 00:00:00|
| 1989/10/11| 1999/05/05|1999-05-05 00:00:00|2021-03-13 00:00:00|

I need help converting it. My output should look like this:

+-----------+-----------+-------------------+-------------------+
|DateofBirth|JoiningDate|      Contract Date|        ReleaseDate|
+-----------+-----------+-------------------+-------------------+
| 19950916  | 20080909  |20090209           |20170909           |
| 19940920  | 20080910  |19990505           |20160930           |
| 19930924  | 20160629  |20031207           |20280213           |
| 19920928  | 20070624  |20040605           |20190924           |
| 19911003  | 20110707  |20110707           |20200330           |
| 19901007  | 20090209  |20090209           |20110313           |
| 19891011  | 19990505  |19990505           |20210313           |

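I tried using a filter, but I can only handle one of the two cases at a time, either the YYYY/MM/DD format or the YYYY-MM-DD 00:00:00 format, and only when the number of columns is fixed. Can someone help me handle both formats, and also the case where the number of columns is dynamic (it may increase or decrease)? The values should be converted from the Date data type to Integer or Long, in YYYYMMDD format.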

Note: the records in this DataFrame are in either YYYY/MM/DD or YYYY-MM-DD 00:00:00 format. Any help is appreciated. Thanks.

scala dataframe apache-spark-sql rdd
1 Answer

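To do the conversion dynamically, you have to iterate over all columns and apply a different operation depending on each column's type.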

Here is an example:

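import java.sql.Date
import java.sql.Timestamp
import org.apache.spark.sql.functions._   // col, date_format, year, month, dayofmonth
import org.apache.spark.sql.types._
// toDF needs spark.implicits._, which spark-shell imports automatically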

val originalDf = Seq(
    (Timestamp.valueOf("2016-09-30 03:04:00"),Date.valueOf("2016-09-30")),
    (Timestamp.valueOf("2016-07-30 00:00:00"),Date.valueOf("2016-10-30"))
).toDF("ts_value","date_value")

Original table details:

> originalDf.show
+-------------------+----------+
|           ts_value|date_value|
+-------------------+----------+
|2016-09-30 03:04:00|2016-09-30|
|2016-07-30 00:00:00|2016-10-30|
+-------------------+----------+

> originalDf.printSchema
root
 |-- ts_value: timestamp (nullable = true)
 |-- date_value: date (nullable = true)

Example conversion:

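// Fold over every column name, replacing date-like columns with a yyyyMMdd integer.
val newDf = originalDf.columns.foldLeft(originalDf)((df, name) => {
    val dataType = df.schema(name).dataType
    if (dataType == DateType)
        // Format the date as a "yyyyMMdd" string, then cast it to an integer.
        df.withColumn(name, date_format(col(name), "yyyyMMdd").cast(IntegerType))
    else if (dataType == TimestampType)
        // Build the same integer arithmetically; the time-of-day part is dropped.
        df.withColumn(name, year(col(name))*10000 + month(col(name))*100 + dayofmonth(col(name)))
    else
        df  // leave columns of any other type unchanged
})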

New table details:

> newDf.show
+--------+----------+
|ts_value|date_value|
+--------+----------+
|20160930|  20160930|
|20160730|  20161030|
+--------+----------+

> newDf.printSchema
root
 |-- ts_value: integer (nullable = true)
 |-- date_value: integer (nullable = true)
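
The example above matches on DateType and TimestampType. The question does not show its schema, so if the columns are actually plain strings in the two layouts it mentions, neither branch would fire. A minimal sketch for that case, assuming every string column holds either YYYY/MM/DD or YYYY-MM-DD 00:00:00 text (the sample data, the `stringDf` and `converted` names, and the local SparkSession are illustrative, not from the original post). It keeps the first 10 characters, strips the separators, and casts the result to an integer; it does not check that the value is a real calendar date.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace, substring}
import org.apache.spark.sql.types.{IntegerType, StringType}

// Local session for the sketch; in spark-shell the existing `spark` can be used instead.
val spark = SparkSession.builder().master("local[*]").appName("string-dates").getOrCreate()
import spark.implicits._

// Hypothetical sample: string columns, one per format mentioned in the question.
val stringDf = Seq(
  ("1995/09/16", "2009-02-09 00:00:00"),
  ("1994/09/20", "1999-05-05 00:00:00")
).toDF("DateofBirth", "Contract Date")

// Works for any number of columns: only string columns are rewritten.
val converted = stringDf.columns.foldLeft(stringDf) { (df, name) =>
  if (df.schema(name).dataType == StringType) {
    // "1995/09/16" or "2009-02-09 00:00:00" -> first 10 chars -> drop '/' and '-' -> 19950916
    val digits = regexp_replace(substring(col(name), 1, 10), "[/-]", "")
    df.withColumn(name, digits.cast(IntegerType))
  } else df
}

converted.show()
```

Casting to LongType instead of IntegerType works the same way if a Long is preferred.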

If you do not want to apply this to every column, you can specify the columns manually by changing

val newDf = originalDf.columns.foldLeft ...

to

val newDf = Seq("col1_name","col2_name", ... ).foldLeft ...
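
Spelled out in full, the manual variant might look like the sketch below. The helper name `toYyyyMmDd`, the sample data, and the local SparkSession are assumptions made for illustration. Unlike the example earlier in this answer, it uses `date_format` for both DateType and TimestampType columns, and it silently skips names that are missing from the DataFrame, so the same list can be reused when the set of columns changes.

```scala
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, date_format}
import org.apache.spark.sql.types.{DateType, IntegerType, TimestampType}

// Local session for the sketch; in spark-shell the existing `spark` can be used instead.
val spark = SparkSession.builder().master("local[*]").appName("selected-columns").getOrCreate()
import spark.implicits._

// Hypothetical helper: convert only the named columns, ignoring absent or non-date ones.
def toYyyyMmDd(df: DataFrame, names: Seq[String]): DataFrame =
  names.filter(n => df.columns.contains(n)).foldLeft(df) { (acc, name) =>
    acc.schema(name).dataType match {
      case DateType | TimestampType =>
        acc.withColumn(name, date_format(col(name), "yyyyMMdd").cast(IntegerType))
      case _ => acc // not a date-like column: leave it unchanged
    }
  }

val sampleDf = Seq(
  (Timestamp.valueOf("2016-09-30 03:04:00"), Date.valueOf("2016-09-30")),
  (Timestamp.valueOf("2016-07-30 00:00:00"), Date.valueOf("2016-10-30"))
).toDF("ts_value", "date_value")

// Only date_value is converted; ts_value keeps its timestamp type,
// and the unknown name "no_such_column" is ignored.
val partlyConverted = toYyyyMmDd(sampleDf, Seq("date_value", "no_such_column"))
partlyConverted.printSchema()
```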

Hope this helps!
