Can't seem to parse dates correctly in Spark 3

I'm trying to write a utility that "assesses" whether dates are well formed. I can't get it to work, because I keep getting errors like this:

Exception has occurred: Py4JJavaError       (note: full exception trace is shown but execution is paused at: <module>)
An error occurred while calling o184.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 (TID 2) (172.21.66.190 executor driver): org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
Fail to parse '2023-10-15 13:45:30' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.
...
Caused by: org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
Fail to parse '2023-10-15 13:45:30' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.
...
Caused by: java.time.format.DateTimeParseException: Text '2023-10-15T13:45:30' could not be parsed, unparsed text found at index 10
        at java.base/java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:2049)
        at java.base/java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1874)
        at org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter.parse(TimestampFormatter.scala:193)
        ... 21 more

Here is a minimal script that reproduces the error:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp, col, coalesce

# Initialize SparkSession
spark = SparkSession.builder.appName("DateParsingTest").master("local[*]").getOrCreate()

# Sample data for testing
data = [
    ("2023-10-15T13:45:30",),
    ("2023-10-15 13:45:30",),
    ("2023-10-15",),
    ("20231015",),
    ("15-Oct-2023",),
    ("10/15/2023",),
    ("15/10/2023",),
    ("2023.10.15",),
    ("Oct 15, 2023",),
    ("15 Oct 2023",),
    ("2023/10/15",),
    ("15-10-2023",),
    ("10-15-2023",),
    ("15.10.2023",),
    ("10.15.2023",),
    ("InvalidDate",),
    (None,),
]

# Create DataFrame
df = spark.createDataFrame(data, ["date_string"])

# Define date formats
date_formats = [
    "yyyy-MM-dd",
    "yyyyMMdd",
    "MM/dd/yyyy",
    "dd-MMM-yyyy",
    "dd/MM/yyyy",
    "yyyy.MM.dd",
    "MMM dd, yyyy",
    "dd MMM yyyy",
    "yyyy/MM/dd",
    "dd-MM-yyyy",
    "MM-dd-yyyy",
    "dd.MM.yyyy",
    "MM.dd.yyyy",
]

# Define time formats to append
time_formats = [
    "",  # No time
    " HH:mm:ss",
    " HH:mm:ss.SSS",
    "'T'HH:mm:ss",
    "'T'HH:mm:ss.SSS",
]


# Generate combined date-time formats
date_time_formats = []
for date_fmt in date_formats:
    for time_fmt in time_formats:
        date_time_formats.append(date_fmt + time_fmt)


# Parse the date strings
parsing_expressions = [to_timestamp(col("date_string"), fmt) for fmt in date_time_formats]

# Use coalesce to get the first successfully parsed timestamp
parsed_date_expr = coalesce(*parsing_expressions)

# Add the parsed date column to the DataFrame
df = df.withColumn("parsed_date", parsed_date_expr)

# Show the results
df.select("date_string", "parsed_date").show(truncate=False)

# Stop the SparkSession
spark.stop()

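The goal of this module is to assess the datetime strings in our data so that we can fix them where necessary. These columns often contain heterogeneous datetime formats.

I found this question with a similar premise, but the answers there did not solve my problem.

Is what I'm trying to achieve possible?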

python apache-spark date pyspark
1 Answer

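I think I found the problem:

to_timestamp needs a datetime format, and the problem is the empty time format ("",  # No time) in this list. It leaves date-only patterns such as yyyy-MM-dd, which stop at index 10 of a string like '2023-10-15T13:45:30' and lead to the "unparsed text found at index 10" error in the trace: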

time_formats = [
    "",  # No time
    " HH:mm:ss",
    " HH:mm:ss.SSS",
    "'T'HH:mm:ss",
    "'T'HH:mm:ss.SSS",
]

To get a datetime format that to_timestamp accepts, wrap the time part of each format in square brackets [..], which marks it as optional. For example:

# Define date formats
date_formats = [
    "yyyy-MM-dd",
    "yyyyMMdd",
    "MM/dd/yyyy",
    "dd-MMM-yyyy",
    "dd/MM/yyyy",
    "yyyy.MM.dd",
    "MMM dd, yyyy",
    "dd MMM yyyy",
    "yyyy/MM/dd",
    "dd-MM-yyyy",
    "MM-dd-yyyy",
    "dd.MM.yyyy",
    "MM.dd.yyyy",
]

# Define time formats to append
time_formats = [
    "[ HH:mm:ss]",
    "['T'HH:mm:ss]",
    "[ HH:mm:ss.SSS]",
    "['T'HH:mm:ss.SSS]",
]
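
The lists above still have to be combined into full patterns. A minimal sketch, assuming the same date-by-time pairing as the nested loop in the question; build_formats is a hypothetical helper name, and only two of the date formats are repeated here to keep it short:

```python
from itertools import product


def build_formats(date_formats, time_formats):
    """Pair every date format with every optional time format."""
    return [d + t for d, t in product(date_formats, time_formats)]


# Subset of the date formats above, for brevity
date_formats = ["yyyy-MM-dd", "dd-MMM-yyyy"]

# Time formats wrapped in [..] so the time part is optional
time_formats = [
    "[ HH:mm:ss]",
    "['T'HH:mm:ss]",
    "[ HH:mm:ss.SSS]",
    "['T'HH:mm:ss.SSS]",
]

date_time_formats = build_formats(date_formats, time_formats)
print(len(date_time_formats))  # 8
print(date_time_formats[0])    # yyyy-MM-dd[ HH:mm:ss]
```

The resulting list can be fed to the to_timestamp and coalesce expressions from the question unchanged.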

With every time format optional, parsing works:

+-------------------+-------------------+
|date_string        |parsed_date        |
+-------------------+-------------------+
|2023-10-15T13:45:30|2023-10-15 13:45:30|
|2023-10-15 13:45:30|2023-10-15 13:45:30|
|2023-10-15         |2023-10-15 00:00:00|
|20231015           |2023-10-15 00:00:00|
|15-Oct-2023        |2023-10-15 00:00:00|
|10/15/2023         |2023-10-15 00:00:00|
|15/10/2023         |2023-10-15 00:00:00|
|2023.10.15         |2023-10-15 00:00:00|
|Oct 15, 2023       |2023-10-15 00:00:00|
|15 Oct 2023        |2023-10-15 00:00:00|
|2023/10/15         |2023-10-15 00:00:00|
|15-10-2023         |2023-10-15 00:00:00|
|10-15-2023         |2023-10-15 00:00:00|
|15.10.2023         |2023-10-15 00:00:00|
|10.15.2023         |2023-10-15 00:00:00|
|InvalidDate        |NULL               |
|NULL               |NULL               |
+-------------------+-------------------+
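
Another route is the one the error message itself suggests: setting "spark.sql.legacy.timeParserPolicy" to "CORRECTED", so that a string the new parser rejects is treated as an invalid datetime instead of raising. A minimal sketch, assuming an existing SparkSession like the one in the question; set_time_parser_policy is a hypothetical helper name, and I have not run this variant against the script above:

```python
POLICY_KEY = "spark.sql.legacy.timeParserPolicy"
ALLOWED_POLICIES = {"EXCEPTION", "CORRECTED", "LEGACY"}


def set_time_parser_policy(spark, policy="CORRECTED"):
    """Set the datetime parser policy on an existing SparkSession."""
    if policy not in ALLOWED_POLICIES:
        raise ValueError(f"unknown policy: {policy!r}")
    # spark.conf.set changes the runtime SQL configuration of the session
    spark.conf.set(POLICY_KEY, policy)
    return policy


# Usage, with the session from the question:
# set_time_parser_policy(spark)            # "CORRECTED"
# set_time_parser_policy(spark, "LEGACY")  # pre-3.0 behavior
```

This changes the behavior for the whole session, whereas the optional [..] sections only affect the patterns that use them.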