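I'm trying to convert a string month name in a pyspark DataFrame to a month number, using a user-defined function built on the datetime library.
But I get an error when I run the function.
I created the following function and applied it as shown: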
import datetime

import pyspark.sql.functions as f
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType


def convert_datetime(month_name):
    """Convert a full month name (e.g. "September") into its month number."""
    print(month_name)
    datetime_object = datetime.datetime.strptime(month_name, "%B")
    month_number = datetime_object.month
    return month_number


numericalCaseUDF = udf(lambda x: convert_datetime(x), IntegerType())

# withColumn returns a new DataFrame, so the result has to be assigned back
date_listing = date_listing.withColumn("Month", col("Month").cast("string"))
print(date_listing.schema)
date_listing.dropna().show(truncate=False)
date_listing.select(f.col("month"), numericalCaseUDF(f.col("month"))).show(truncate=False)
However, I get the following error, and I'm not sure what is wrong. Can anyone help?
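The traceback is not included above, so the actual cause can't be confirmed. One common way a UDF like this fails, assumed here rather than known, is a null or unexpected value reaching `strptime`: the `dropna()` call above only affects the DataFrame being shown, not the one passed to the UDF. A minimal sketch of a defensive version of the conversion logic (the name `month_name_to_number` is hypothetical) might look like this:

```python
import datetime
from typing import Optional


def month_name_to_number(month_name: Optional[str]) -> Optional[int]:
    """Return 1-12 for a full English month name, or None if it cannot be parsed."""
    # Spark passes SQL NULL to a Python UDF as None; strptime(None, ...) raises TypeError.
    if month_name is None:
        return None
    try:
        # "%B" matches the full month name in the current locale (English by default).
        return datetime.datetime.strptime(month_name.strip(), "%B").month
    except ValueError:
        # Anything that is not a recognizable month name maps to None.
        return None


print(month_name_to_number("September"))
print(month_name_to_number(None))
```

If a UDF is still wanted, this function could be passed to `udf(month_name_to_number, IntegerType())` in place of the lambda, so that bad rows come back as null instead of failing the whole job.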
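Don't use a UDF for something that pyspark's built-in functions already provide. UDFs should be a last resort because they are slow: every row has to be serialized between the JVM and a Python worker process. Here is an example that does the job without a Python UDF.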
from pyspark import SparkContext, SQLContext
import pyspark.sql.functions as F

sc = SparkContext('local')
sqlContext = SQLContext(sc)

data1 = [
    ["29 September 2022 20:00:00"],
    ["02 July 2019 19:00:00"],
]
df1Columns = ["date_col"]
df1 = sqlContext.createDataFrame(data=data1, schema=df1Columns)
print("Given dataframe")
df1.show(n=100, truncate=False)

# Parse the string column into a timestamp using a Spark datetime pattern
df1_parsed = df1.withColumn("parsed_date_column", F.to_timestamp(F.col("date_col"), "dd MMMM yyyy HH:mm:ss"))
print("parsed dataframe")
df1_parsed.show(n=100, truncate=False)

# Extract the month number with the built-in month() function
df1_parsed = df1_parsed.withColumn("month_number", F.month(F.col("parsed_date_column")))
print("month extracted dataframe")
df1_parsed.show(n=100, truncate=False)
Output:
Given dataframe
+--------------------------+
|date_col                  |
+--------------------------+
|29 September 2022 20:00:00|
|02 July 2019 19:00:00     |
+--------------------------+
parsed dataframe
+--------------------------+-------------------+
|date_col                  |parsed_date_column |
+--------------------------+-------------------+
|29 September 2022 20:00:00|2022-09-29 20:00:00|
|02 July 2019 19:00:00     |2019-07-02 19:00:00|
+--------------------------+-------------------+
month extracted dataframe
+--------------------------+-------------------+------------+
|date_col                  |parsed_date_column |month_number|
+--------------------------+-------------------+------------+
|29 September 2022 20:00:00|2022-09-29 20:00:00|9           |
|02 July 2019 19:00:00     |2019-07-02 19:00:00|7           |
+--------------------------+-------------------+------------+