Datetime error when applying a UDF in PySpark

Question (votes: 0, answers: 1)

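I am trying to convert month names stored as strings in a PySpark DataFrame into month numbers, using a user-defined function (UDF) built on the datetime library.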


However, running the function raises an error.

I created the following function and applied it like this:

import datetime
import pyspark.sql.functions as f
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

def convert_datetime(month_name):
  """Convert a month name string (e.g. "September") to its month number."""
  print(month_name)
  datetime_object = datetime.datetime.strptime(month_name, "%B")
  month_number = datetime_object.month
  return month_number

# Wrap the Python function as a Spark UDF that returns an integer
numericalCaseUDF = udf(lambda x:convert_datetime(x),IntegerType())
date_listing.withColumn("Month",col("Month").cast('string'))
print(date_listing.schema)
date_listing.dropna().show(truncate=False)
date_listing.select(f.col("month"),numericalCaseUDF(f.col("month"))).show(truncate=False)

However, I get the following error and I am not sure where the problem is. Can anyone help?


pyspark apache-spark-sql user-defined-functions
1 Answer
0 votes

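Do not use a UDF for functionality that already exists in pyspark.sql.functions. Python UDFs should be a last resort because they are much slower than the built-in functions. Here is an example that gets the month number without a Python UDF.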

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Start a local Spark session
spark = SparkSession.builder.master("local").getOrCreate()

data1 = [
    ["29 September 2022 20:00:00"],
    ["02 July 2019 19:00:00"],
]

df1Columns = ["date_col"]
df1 = spark.createDataFrame(data=data1, schema=df1Columns)

print("Given dataframe")
df1.show(n=100, truncate=False)

df1_parsed = df1.withColumn("parsed_date_column", F.to_timestamp(F.col("date_col"), "dd MMMM yyyy HH:mm:ss"))
print("parsed dataframe")
df1_parsed.show(n=100, truncate=False)

df1_parsed = df1_parsed.withColumn("month_number", F.month(F.col("parsed_date_column")))
print("month extracted dataframe")
df1_parsed.show(n=100, truncate=False)

Output:

Given dataframe
+--------------------------+
|date_col                  |
+--------------------------+
|29 September 2022 20:00:00|
|02 July 2019 19:00:00     |
+--------------------------+

parsed dataframe
+--------------------------+-------------------+
|date_col                  |parsed_date_column |
+--------------------------+-------------------+
|29 September 2022 20:00:00|2022-09-29 20:00:00|
|02 July 2019 19:00:00     |2019-07-02 19:00:00|
+--------------------------+-------------------+

month extracted dataframe
+--------------------------+-------------------+------------+
|date_col                  |parsed_date_column |month_number|
+--------------------------+-------------------+------------+
|29 September 2022 20:00:00|2022-09-29 20:00:00|9           |
|02 July 2019 19:00:00     |2019-07-02 19:00:00|7           |
+--------------------------+-------------------+------------+
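
If a Python UDF is still needed, it helps to make the wrapped function tolerate bad input. The error screenshot from the question is not available, so the actual cause is unconfirmed. One plausible cause is that null or unparseable values reach the function: the result of dropna() in the question is never assigned, so nulls stay in date_listing, and strptime raises TypeError for None and ValueError for strings that are not month names. The following is a minimal sketch under that assumption. The name month_name_to_number is hypothetical, and the sketch assumes full English month names and an English (or C) locale.

```python
import datetime
from typing import Optional


def month_name_to_number(month_name: Optional[str]) -> Optional[int]:
    """Return 1-12 for a full month name, or None if it cannot be parsed."""
    # Spark passes SQL NULL to a Python UDF as None
    if month_name is None:
        return None
    try:
        # "%B" matches the full month name in the current locale
        return datetime.datetime.strptime(month_name.strip(), "%B").month
    except ValueError:
        # Not a recognizable month name
        return None


print(month_name_to_number("September"))
print(month_name_to_number(None))
print(month_name_to_number("not a month"))
```

In Spark this function could be wrapped as udf(month_name_to_number, IntegerType()), so that rows with missing or malformed month names produce null instead of failing the job.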