Exception has occurred: Py4JJavaError (note: full exception trace is shown but execution is paused at: <module>)
An error occurred while calling o184.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 (TID 2) ( executor driver): org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
Fail to parse '2023-10-15 13:45:30' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.
Caused by: org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:
Fail to parse '2023-10-15 13:45:30' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string.
Caused by: java.time.format.DateTimeParseException: Text '2023-10-15T13:45:30' could not be parsed, unparsed text found at index 10
at java.base/java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:2049)
at java.base/java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1874)
at org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter.parse(TimestampFormatter.scala:193)
... 21 more
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp, col, coalesce
# Initialize SparkSession
spark = SparkSession.builder.appName("DateParsingTest").master("local[*]").getOrCreate()
# Sample data for testing
data = [
    ("2023-10-15 13:45:30",),
    ("Oct 15, 2023",),
    ("15 Oct 2023",),
]
# Create DataFrame
df = spark.createDataFrame(data, ["date_string"])
# Define date formats
date_formats = [
    "MMM dd, yyyy",
    "dd MMM yyyy",
]
# Define time formats to append
time_formats = [
    "",  # No time
    " HH:mm:ss",
    " HH:mm:ss.SSS",
]
# Generate combined date-time formats
date_time_formats = []
for date_fmt in date_formats:
    for time_fmt in time_formats:
        date_time_formats.append(date_fmt + time_fmt)
# Parse the date strings
parsing_expressions = [to_timestamp(col("date_string"), fmt) for fmt in date_time_formats]
# Use coalesce to get the first successfully parsed timestamp
parsed_date_expr = coalesce(*parsing_expressions)
# Add the parsed date column to the DataFrame
df = df.withColumn("parsed_date", parsed_date_expr)
# Show the results
df.select("date_string", "parsed_date").show(truncate=False)
# Stop the SparkSession
spark.stop()
to_timestamp requires a datetime format. The problem is the empty time format "" (# No time) in this list:
time_formats = [
    "",  # No time
    " HH:mm:ss",
    " HH:mm:ss.SSS",
]
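As a loose analogy only (Python's datetime.strptime is not the parser Spark uses, and its pattern letters differ), the sketch below shows the same kind of strictness: a date-only pattern does not consume the trailing time, so parsing fails instead of ignoring the leftover text. The helper name parse_strict is hypothetical. In Spark the failure is reported as a SparkUpgradeException rather than a null, apparently because the pre-3.0 parser would have accepted the string.

```python
from datetime import datetime


def parse_strict(text, fmt):
    """Return a datetime if fmt consumes all of text, otherwise None."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        # Raised, for example, when unconverted text remains after the pattern.
        return None


# Date-only pattern against a string that also carries a time: no match.
date_only = parse_strict("2023-10-15 13:45:30", "%Y-%m-%d")

# Pattern that covers the time part as well: match.
with_time = parse_strict("2023-10-15 13:45:30", "%Y-%m-%d %H:%M:%S")

print(date_only)  # None
print(with_time)  # 2023-10-15 13:45:30
```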
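To fix this, wrap the time part in square brackets [] in the pattern passed to to_timestamp, which marks the time as optional for the relevant formats: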
# Define date formats
date_formats = [
    "MMM dd, yyyy",
    "dd MMM yyyy",
]
# Define time formats to append
time_formats = [
    "[ HH:mm:ss]",
    "[ HH:mm:ss.SSS]",
]
+-------------------+-------------------+
|date_string        |parsed_date        |
+-------------------+-------------------+
|2023-10-15T13:45:30|2023-10-15 13:45:30|
|2023-10-15 13:45:30|2023-10-15 13:45:30|
|2023-10-15         |2023-10-15 00:00:00|
|20231015           |2023-10-15 00:00:00|
|15-Oct-2023        |2023-10-15 00:00:00|
|10/15/2023         |2023-10-15 00:00:00|
|15/10/2023         |2023-10-15 00:00:00|
|2023.10.15         |2023-10-15 00:00:00|
|Oct 15, 2023       |2023-10-15 00:00:00|
|15 Oct 2023        |2023-10-15 00:00:00|
|2023/10/15         |2023-10-15 00:00:00|
|15-10-2023         |2023-10-15 00:00:00|
|10-15-2023         |2023-10-15 00:00:00|
|15.10.2023         |2023-10-15 00:00:00|
|10.15.2023         |2023-10-15 00:00:00|
|InvalidDate        |NULL               |
+-------------------+-------------------+
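A minimal sketch of how the bracketed time formats combine with the date formats. It is plain Python with no Spark dependency, so it only shows the generated pattern strings and does not parse anything. The variable names mirror the snippets in this post, and itertools.product is swapped in for the nested loops as a matter of taste.

```python
from itertools import product

# The two date formats listed in the post.
date_formats = ["MMM dd, yyyy", "dd MMM yyyy"]

# Time parts wrapped in [] so the pattern treats them as optional.
time_formats = ["[ HH:mm:ss]", "[ HH:mm:ss.SSS]"]

# Every date format paired with every optional time format.
date_time_formats = [d + t for d, t in product(date_formats, time_formats)]

for fmt in date_time_formats:
    print(fmt)
# MMM dd, yyyy[ HH:mm:ss]
# MMM dd, yyyy[ HH:mm:ss.SSS]
# dd MMM yyyy[ HH:mm:ss]
# dd MMM yyyy[ HH:mm:ss.SSS]
```

Because the bracketed part may be absent, each of these patterns is meant to match a date-only string as well, so no separate time-less entry is needed in the list.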