I'm working with a DataFrame in PySpark that has a column named
datdoc
containing dates in several formats, as shown below:
datdoc
07-SEP-24
07-SEP-2024
07-SEP-2024
07-SEP-2024
07-SEP-24
07-SEP-24
07-SEP-2024
07-SEP-2024
07-SEP-2024
07-SEP-2024
07-SEP-2024
I need to parse these dates into the default format. I tried the approach below, but ran into a problem. With both formats supplied, the column spec and the generated expression are:
columns = {'field_name': 'datdoc', 'current_format': ['dd-MMM-yy', 'dd-MMM-yyyy'], 'data_type': 'Date'}
dateexpression = Column<'CASE WHEN (to_date(datdoc, dd-MMM-yy) IS NOT NULL) THEN to_date(datdoc, dd-MMM-yy) WHEN (to_date(datdoc, dd-MMM-yyyy) IS NOT NULL) THEN to_date(datdoc, dd-MMM-yyyy) ELSE NULL END AS datdoc'>
and with a single format:
columns = {'field_name': 'datdoc', 'current_format': ['dd-MMM-yy'], 'data_type': 'Date'}
date_expression = Column<'to_date(datdoc, dd-MMM-yy) AS datdoc'>
The method that builds these expressions:
from pyspark.sql import functions as F

def change_date_format(self, columns) -> None:
    def _convert_date_format(field_name: str, current_format: list, is_timestamp: bool) -> F.Column:
        base_function = F.to_timestamp if is_timestamp else F.to_date
        expression = None
        if len(current_format) == 1:
            # Single format: parse directly.
            return base_function(F.col(field_name), current_format[0]).alias(field_name)
        else:
            # Multiple formats: build a CASE WHEN chain that takes the first format that parses.
            for fmt in current_format:
                current_expr = base_function(F.col(field_name), fmt)
                if expression is None:
                    expression = F.when(current_expr.isNotNull(), current_expr)
                else:
                    expression = expression.when(current_expr.isNotNull(), current_expr)
            return expression.otherwise(F.lit(None)).alias(field_name)

    cols = {col["field_name"] for col in columns}
    date_expressions = []
    for col in columns:
        if col["data_type"] in ["DateTime", "Time"]:
            date_expressions.append(_convert_date_format(col["field_name"], col["current_format"], True))
        elif col["data_type"] == "Date":
            date_expressions.append(_convert_date_format(col["field_name"], col["current_format"], False))

    # Keep all non-date columns unchanged and replace the date columns with the parsed expressions.
    expression = [F.col(i) for i in self.df.columns if i not in cols]
    self.df = self.df.select(*date_expressions, *expression)
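For context, the method is driven with a list of column specs like the dicts shown above. A minimal harness (the Converter class here is purely illustrative; my real code attaches the method to a larger class) looks roughly like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative holder only: all the method needs is a DataFrame on self.df.
class Converter:
    def __init__(self, df):
        self.df = df

Converter.change_date_format = change_date_format  # reuse the method defined above

conv = Converter(spark.createDataFrame([("07-SEP-24",), ("07-SEP-2024",)], ["datdoc"]))
conv.change_date_format([
    {"field_name": "datdoc", "current_format": ["dd-MMM-yy", "dd-MMM-yyyy"], "data_type": "Date"}
])
conv.df.show()  # the parse runs here, which is where the error below surfaces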
In both cases, I get the following error when
07-SEP-2024
is parsed with dd-MMM-yy:
24/09/25 21:10:18 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 7) (rhy-4 executor driver): org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
Fail to parse '07-SEP-2024' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.
Is there a way to make sure invalid date strings come back as
NULL
instead of raising an error? One approach I've considered is a CASE WHEN with RegEx patterns in PySpark, but I'd like to explore fixing my current approach first. Any guidance on how to achieve this would be appreciated!
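For reference, the CASE WHEN / RegEx fallback I have in mind would look roughly like this (just a sketch; the regexes assume only the two shapes shown above ever occur, and anything else falls through to NULL):

from pyspark.sql import functions as F

# Sketch: gate each to_date() behind a regex so a value is only handed to the
# parser whose pattern matches its shape; anything else becomes NULL.
datdoc_parsed = (
    F.when(F.col("datdoc").rlike(r"^\d{2}-[A-Za-z]{3}-\d{4}$"),
           F.to_date(F.col("datdoc"), "dd-MMM-yyyy"))
     .when(F.col("datdoc").rlike(r"^\d{2}-[A-Za-z]{3}-\d{2}$"),
           F.to_date(F.col("datdoc"), "dd-MMM-yy"))
     .otherwise(F.lit(None))
     .alias("datdoc")
)

df = df.select(datdoc_parsed, *[c for c in df.columns if c != "datdoc"])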
If you're sure you'll only ever get these two patterns, and you can assume the 21st century, you can convert both of them with a single pattern.
from pyspark.sql.functions import to_date
spark.conf.set("spark.sql.ansi.enabled","false") # this is needed if you get invalid dates
df = spark.createDataFrame(
[
(1, "07-SEP-2024"),
(2, "07-SEP-24"),
(3, "foo")
],
["id", "NotADate"]
)
df.withColumn("maybedate", to_date("NotADate","dd-MMM-yy")).show()
which returns
+---+-----------+----------+
| id| NotADate| maybedate|
+---+-----------+----------+
| 1|07-SEP-2024|2024-09-07|
| 2| 07-SEP-24|2024-09-07|
| 3| foo| null|
+---+-----------+----------+
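If more formats might show up later and you'd rather list each one explicitly, another option is to set the parser policy the error message mentions and try the patterns in order with coalesce. A sketch, assuming Spark 3.x (the exact behaviour for unparseable strings depends on your Spark version and these settings):

from pyspark.sql.functions import coalesce, to_date

# With the new parser set to CORRECTED (the setting named in the error) and
# ANSI mode off, strings that don't match a pattern come back as NULL instead
# of raising, so coalesce() can fall through to the next pattern.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")
spark.conf.set("spark.sql.ansi.enabled", "false")

df.withColumn(
    "maybedate",
    coalesce(
        to_date("NotADate", "dd-MMM-yyyy"),
        to_date("NotADate", "dd-MMM-yy"),
    ),
).show()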