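I'm trying to write a utility that evaluates whether date strings are well formed. I can't get it to work, because I keep getting errors like this: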
Exception has occurred: Py4JJavaError (note: full exception trace is shown but execution is paused at: <module>)
An error occurred while calling o184.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 (TID 2) (172.21.66.190 executor driver): org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
Fail to parse '2023-10-15 13:45:30' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.
...
Caused by: org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
Fail to parse '2023-10-15 13:45:30' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.
...
Caused by: java.time.format.DateTimeParseException: Text '2023-10-15T13:45:30' could not be parsed, unparsed text found at index 10
at java.base/java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:2049)
at java.base/java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1874)
at org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter.parse(TimestampFormatter.scala:193)
... 21 more
Here is the minimal script I use to reproduce the error:
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp, col, coalesce
# Initialize SparkSession
spark = SparkSession.builder.appName("DateParsingTest").master("local[*]").getOrCreate()
# Sample data for testing
data = [
("2023-10-15T13:45:30",),
("2023-10-15 13:45:30",),
("2023-10-15",),
("20231015",),
("15-Oct-2023",),
("10/15/2023",),
("15/10/2023",),
("2023.10.15",),
("Oct 15, 2023",),
("15 Oct 2023",),
("2023/10/15",),
("15-10-2023",),
("10-15-2023",),
("15.10.2023",),
("10.15.2023",),
("InvalidDate",),
(None,),
]
# Create DataFrame
df = spark.createDataFrame(data, ["date_string"])
# Define date formats
date_formats = [
"yyyy-MM-dd",
"yyyyMMdd",
"MM/dd/yyyy",
"dd-MMM-yyyy",
"dd/MM/yyyy",
"yyyy.MM.dd",
"MMM dd, yyyy",
"dd MMM yyyy",
"yyyy/MM/dd",
"dd-MM-yyyy",
"MM-dd-yyyy",
"dd.MM.yyyy",
"MM.dd.yyyy",
]
# Define time formats to append
time_formats = [
"", # No time
" HH:mm:ss",
" HH:mm:ss.SSS",
"'T'HH:mm:ss",
"'T'HH:mm:ss.SSS",
]
# Generate combined date-time formats
date_time_formats = []
for date_fmt in date_formats:
    for time_fmt in time_formats:
        date_time_formats.append(date_fmt + time_fmt)
# Parse the date strings
parsing_expressions = [to_timestamp(col("date_string"), fmt) for fmt in date_time_formats]
# Use coalesce to get the first successfully parsed timestamp
parsed_date_expr = coalesce(*parsing_expressions)
# Add the parsed date column to the DataFrame
df = df.withColumn("parsed_date", parsed_date_expr)
# Show the results
df.select("date_string", "parsed_date").show(truncate=False)
# Stop the SparkSession
spark.stop()
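The goal of this module is to evaluate the datetime strings in our data so that we can fix them where necessary. These columns often contain heterogeneous datetime formats.
I found this question, which has a similar premise, but its answers did not solve my problem.
Is what I'm trying to achieve possible?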
I think I found the problem:
to_timestamp expects a datetime format, and the culprit is the "" (no time) entry in this set of time formats:
time_formats = [
"", # No time
" HH:mm:ss",
" HH:mm:ss.SSS",
"'T'HH:mm:ss",
"'T'HH:mm:ss.SSS",
]
To give to_timestamp a datetime format it accepts, wrap the time part of each format in square brackets [..] to mark it as optional. For example:
# Define date formats
date_formats = [
"yyyy-MM-dd",
"yyyyMMdd",
"MM/dd/yyyy",
"dd-MMM-yyyy",
"dd/MM/yyyy",
"yyyy.MM.dd",
"MMM dd, yyyy",
"dd MMM yyyy",
"yyyy/MM/dd",
"dd-MM-yyyy",
"MM-dd-yyyy",
"dd.MM.yyyy",
"MM.dd.yyyy",
]
# Define time formats to append
time_formats = [
"[ HH:mm:ss]",
"['T'HH:mm:ss]",
"[ HH:mm:ss.SSS]",
"['T'HH:mm:ss.SSS]",
]
This makes every time format optional, so parsing works:
+-------------------+-------------------+
|date_string |parsed_date |
+-------------------+-------------------+
|2023-10-15T13:45:30|2023-10-15 13:45:30|
|2023-10-15 13:45:30|2023-10-15 13:45:30|
|2023-10-15 |2023-10-15 00:00:00|
|20231015 |2023-10-15 00:00:00|
|15-Oct-2023 |2023-10-15 00:00:00|
|10/15/2023 |2023-10-15 00:00:00|
|15/10/2023 |2023-10-15 00:00:00|
|2023.10.15 |2023-10-15 00:00:00|
|Oct 15, 2023 |2023-10-15 00:00:00|
|15 Oct 2023 |2023-10-15 00:00:00|
|2023/10/15 |2023-10-15 00:00:00|
|15-10-2023 |2023-10-15 00:00:00|
|10-15-2023 |2023-10-15 00:00:00|
|15.10.2023 |2023-10-15 00:00:00|
|10.15.2023 |2023-10-15 00:00:00|
|InvalidDate |NULL |
|NULL |NULL |
+-------------------+-------------------+
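For reference, the combination step from the question still applies unchanged to the bracketed time formats. Below is a minimal sketch in plain Python (no Spark needed) of how the combined patterns come out. The helper name combine_formats is hypothetical, and the date list is shortened to three entries purely for illustration.

```python
from itertools import product

# Shortened date list for illustration; the full list is given above.
date_formats = ["yyyy-MM-dd", "yyyyMMdd", "MM/dd/yyyy"]

# Time formats wrapped in [..] so the time part is optional.
time_formats = [
    "[ HH:mm:ss]",
    "['T'HH:mm:ss]",
    "[ HH:mm:ss.SSS]",
    "['T'HH:mm:ss.SSS]",
]


def combine_formats(date_formats, time_formats):
    """Hypothetical helper: append every time pattern to every date pattern.

    The order matches the nested loop in the question: all time patterns
    for the first date pattern, then all for the second, and so on.
    """
    return [d + t for d, t in product(date_formats, time_formats)]


date_time_formats = combine_formats(date_formats, time_formats)
print(len(date_time_formats))  # 12
print(date_time_formats[0])    # yyyy-MM-dd[ HH:mm:ss]
```

The resulting list can be passed to to_timestamp and coalesce exactly as in the original script. Since coalesce returns the first non-null result, the order of the date formats decides which interpretation wins for ambiguous strings.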