I have a DataFrame (dateds1) that looks like this:
+-----------+-----------+-------------------+-------------------+
|DateofBirth|JoiningDate| Contract Date| ReleaseDate|
+-----------+-----------+-------------------+-------------------+
| 1995/09/16| 2008/09/09|2009-02-09 00:00:00|2017-09-09 00:00:00|
| 1994/09/20| 2008/09/10|1999-05-05 00:00:00|2016-09-30 00:00:00|
| 1993/09/24| 2016/06/29|2003-12-07 00:00:00|2028-02-13 00:00:00|
| 1992/09/28| 2007/06/24|2004-06-05 00:00:00|2019-09-24 00:00:00|
| 1991/10/03| 2011/07/07|2011-07-07 00:00:00|2020-03-30 00:00:00|
| 1990/10/07| 2009/02/09|2009-02-09 00:00:00|2011-03-13 00:00:00|
| 1989/10/11| 1999/05/05|1999-05-05 00:00:00|2021-03-13 00:00:00|
I need help converting it. My output should look like this:
+-----------+-----------+-------------------+-------------------+
|DateofBirth|JoiningDate| Contract Date| ReleaseDate|
+-----------+-----------+-------------------+-------------------+
| 19950916 | 20080909 |20090209 |20170909 |
| 19940920 | 20080910 |19990505 |20160930 |
| 19930924 | 20160629 |20031207 |20280213 |
| 19920928 | 20070624 |20040605 |20190924 |
| 19911003 | 20110707 |20110707 |20200330 |
| 19901007 | 20090209 |20090209 |20110313 |
| 19891011 | 19990505 |19990505 |20210313 |
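I tried using filter, but I can only handle one of the two cases (dates in YYYY/MM/DD format or in YYYY-MM-DD 00:00:00 format), and only when the number of columns is fixed. Can someone help me handle both formats, and also the case where the number of columns is dynamic (it may increase or decrease)? The values should be converted from the Date data type to Integer or Long, in YYYYMMDD format.
Note: every record in this DataFrame is in either YYYY/MM/DD or YYYY-MM-DD 00:00:00 format. Any help is appreciated. Thanks.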
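To do the conversion dynamically, you have to iterate over all columns and apply a different operation depending on each column's type.
Here is an example: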
import java.sql.Date
import java.sql.Timestamp
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
// toDF needs spark.implicits._, which spark-shell imports automatically.

val originalDf = Seq(
  (Timestamp.valueOf("2016-09-30 03:04:00"), Date.valueOf("2016-09-30")),
  (Timestamp.valueOf("2016-07-30 00:00:00"), Date.valueOf("2016-10-30"))
).toDF("ts_value", "date_value")
Original table details:
> originalDf.show
+-------------------+----------+
| ts_value|date_value|
+-------------------+----------+
|2016-09-30 03:04:00|2016-09-30|
|2016-07-30 00:00:00|2016-10-30|
+-------------------+----------+
> originalDf.printSchema
root
|-- ts_value: timestamp (nullable = true)
|-- date_value: date (nullable = true)
Example conversion:
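val newDf = originalDf.columns.foldLeft(originalDf)((df, name) => {
  val dataType = df.schema(name).dataType
  if (dataType == DateType)
    // Format the date as "yyyyMMdd" text, then cast it to an integer.
    df.withColumn(name, date_format(col(name), "yyyyMMdd").cast(IntegerType))
  else if (dataType == TimestampType)
    // Build the same integer arithmetically, which drops the time of day.
    df.withColumn(name, year(col(name)) * 10000 + month(col(name)) * 100 + dayofmonth(col(name)))
  else
    df // leave every other column untouched
})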
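The timestamp branch packs year, month and day into one integer as year * 10000 + month * 100 + day. The same arithmetic can be checked without Spark. This is only a minimal sketch in plain Scala; packDate and unpackDate are hypothetical helper names that do not appear in the answer.

```scala
import java.time.LocalDate

// Pack a calendar date into a single yyyyMMdd integer.
def packDate(d: LocalDate): Int =
  d.getYear * 10000 + d.getMonthValue * 100 + d.getDayOfMonth

// Inverse: split the integer back into year, month and day.
def unpackDate(n: Int): LocalDate =
  LocalDate.of(n / 10000, (n / 100) % 100, n % 100)

val packed = packDate(LocalDate.of(2016, 9, 30))
println(packed)             // 20160930
println(unpackDate(packed)) // 2016-09-30
```

Because month and day never exceed two digits, the three parts cannot overlap, so the integer sorts in the same order as the dates.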
New table details:
> newDf.show
+--------+----------+
|ts_value|date_value|
+--------+----------+
|20160930| 20160930|
|20160730| 20161030|
+--------+----------+
> newDf.printSchema
root
|-- ts_value: integer (nullable = true)
|-- date_value: integer (nullable = true)
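The example above covers only date and timestamp columns, while the question's DateofBirth and JoiningDate columns appear to be strings in YYYY/MM/DD form. One possible extension, as a sketch only: it assumes that every string column holds dates in that pattern and that to_date with a format argument is available (Spark 2.2 or later). The name convertDateColumns and its stringPattern parameter are hypothetical, and date_format is used for both date and timestamp columns instead of the arithmetic shown above.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, date_format, to_date}
import org.apache.spark.sql.types.{DateType, IntegerType, StringType, TimestampType}

// Convert every date-like column to an integer in yyyyMMdd form.
// ASSUMPTION: every StringType column holds dates written as stringPattern.
def convertDateColumns(df: DataFrame, stringPattern: String = "yyyy/MM/dd"): DataFrame =
  df.columns.foldLeft(df) { (acc, name) =>
    acc.schema(name).dataType match {
      case DateType | TimestampType =>
        acc.withColumn(name, date_format(col(name), "yyyyMMdd").cast(IntegerType))
      case StringType =>
        // Parse the text into a date first, then format and cast as above.
        acc.withColumn(name,
          date_format(to_date(col(name), stringPattern), "yyyyMMdd").cast(IntegerType))
      case _ => acc // other column types are left untouched
    }
  }

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("yyyymmdd-sketch")
  .getOrCreate()

// Two rows taken from the question's table, with one column of each kind.
val sample = spark.createDataFrame(Seq(
  ("1995/09/16", Timestamp.valueOf("2009-02-09 00:00:00")),
  ("1994/09/20", Timestamp.valueOf("1999-05-05 00:00:00"))
)).toDF("DateofBirth", "Contract Date")

val converted = convertDateColumns(sample)
converted.printSchema()
converted.show()
```

If the DataFrame also has string columns that are not dates, they should be excluded before calling this, since strings that do not match the pattern would not convert to a meaningful value.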
If you don't want to apply this to every column, you can specify the columns manually by changing
val newDf = originalDf.columns.foldLeft ...
to
val newDf = Seq("col1_name","col2_name", ... ).foldLeft ...
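A fuller sketch of the manual variant, under a few assumptions: convertSelected is a hypothetical helper name, requested names that are missing from the DataFrame are silently skipped, and date_format is used for both date and timestamp columns.

```scala
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, date_format}
import org.apache.spark.sql.types.{DateType, IntegerType, TimestampType}

// Convert only the named columns; every other column is left as it is.
def convertSelected(df: DataFrame, names: Seq[String]): DataFrame =
  names.filter(n => df.columns.contains(n)).foldLeft(df) { (acc, name) =>
    acc.schema(name).dataType match {
      case DateType | TimestampType =>
        acc.withColumn(name, date_format(col(name), "yyyyMMdd").cast(IntegerType))
      case _ => acc // named column is not a date or timestamp
    }
  }

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("selected-columns-sketch")
  .getOrCreate()

val df = spark.createDataFrame(Seq(
  (Timestamp.valueOf("2016-09-30 03:04:00"), Date.valueOf("2016-09-30"))
)).toDF("ts_value", "date_value")

// Only date_value is converted; ts_value keeps its timestamp type.
val onlyDate = convertSelected(df, Seq("date_value"))
onlyDate.printSchema()
```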
Hope this helps!