I have the following table:
| job_id| timestamp | avg_Tag_value |
|:---- |:------------------------------- | ------------- |
| j1 | 2023-03-19T01:52:00.000+0000 | 0.4 |
| j2 | 2023-03-19T01:53:00.000+0000 | 0.5 |
| j3 | 2023-03-19T01:54:00.000+0000 | 0.6 |
I want to truncate the timestamp to the hour using df = df.select(date_trunc("hour", "timestamp")).
I get an error like "AnalysisException: cannot resolve 'timestamp' given input columns: [date_trunc(hour, timestamp)];".
The timestamp column is of timestamp type, so I am not sure what is causing the error.
Can anyone help?
Thanks
I reproduced the same thing in my environment and got a similar error. To resolve it, follow this code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import date_trunc, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Define the schema; the timestamp column is read in as a string here,
# and date_trunc casts it to a timestamp.
sch1 = StructType([
    StructField("job_id", StringType(), True),
    StructField("timestamp", StringType(), True),
    StructField("avg_Tag_value", DoubleType(), True)
])

d1 = [
    ("j1", "2023-03-19T01:52:00.000+0000", 0.4),
    ("j2", "2023-03-19T01:53:00.000+0000", 0.5),
    ("j3", "2023-03-19T01:54:00.000+0000", 0.6)
]

df = spark.createDataFrame(d1, sch1)

# Truncate to the hour and keep the metric column
df = df.select(date_trunc("hour", col("timestamp")).alias("hour"), "avg_Tag_value")
df.show()
Output (with a UTC session time zone):
+-------------------+-------------+
|               hour|avg_Tag_value|
+-------------------+-------------+
|2023-03-19 01:00:00|          0.4|
|2023-03-19 01:00:00|          0.5|
|2023-03-19 01:00:00|          0.6|
+-------------------+-------------+
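As a side note, if you want the timestamp column itself stored as a real TimestampType rather than a string, you can parse it with to_timestamp before truncating. This is a minimal sketch; the format string is an assumption based on the sample values (ISO-8601 with a +0000 offset):

from pyspark.sql import SparkSession
from pyspark.sql.functions import date_trunc, to_timestamp, col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("j1", "2023-03-19T01:52:00.000+0000", 0.4)],
    ["job_id", "timestamp", "avg_Tag_value"]
)

# Parse the string into a proper TimestampType column first
# (the pattern is an assumption matching the sample data)
df = df.withColumn("timestamp", to_timestamp(col("timestamp"), "yyyy-MM-dd'T'HH:mm:ss.SSSZ"))

# date_trunc then operates on a real timestamp instead of an implicit string cast
df = df.withColumn("hour", date_trunc("hour", col("timestamp")))
df.show(truncate=False)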
Your code works for me (my guess is that you have not imported the functions module pyspark.sql.functions):
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder.getOrCreate()

# Create the input DataFrame
data = [("j1", "2023-03-19T01:52:00.000+0000", 0.4),
        ("j2", "2023-03-19T01:53:00.000+0000", 0.5),
        ("j3", "2023-03-19T01:54:00.000+0000", 0.6)]
schema = ["job_id", "timestamp", "avg_Tag_value"]
df = spark.createDataFrame(data, schema)

# Truncate the timestamp column to the hour
df = df.select(date_trunc("hour", "timestamp"))
df.show(truncate=False)
+---------------------------+
|date_trunc(hour, timestamp)|
+---------------------------+
|        2023-03-19 01:00:00|
|        2023-03-19 01:00:00|
|        2023-03-19 01:00:00|
+---------------------------+
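The error message in the question ("given input columns: [date_trunc(hour, timestamp)]") hints that the select had probably already been run once, so df only contained the truncated column and 'timestamp' could no longer be resolved. A small sketch of that failure mode, plus a withColumn variant that keeps the original columns (this reproduction is my assumption based on the error text):

from pyspark.sql import SparkSession
from pyspark.sql.functions import date_trunc, col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("j1", "2023-03-19T01:52:00.000+0000", 0.4)],
    ["job_id", "timestamp", "avg_Tag_value"]
)

# First run: df is replaced by a single column named date_trunc(hour, timestamp)
df = df.select(date_trunc("hour", "timestamp"))

# Running the same select again would now raise:
# AnalysisException: cannot resolve 'timestamp' given input columns:
# [date_trunc(hour, timestamp)]
# df = df.select(date_trunc("hour", "timestamp"))

# withColumn keeps job_id, timestamp and avg_Tag_value, so it is safe to re-run
df2 = spark.createDataFrame(
    [("j1", "2023-03-19T01:52:00.000+0000", 0.4)],
    ["job_id", "timestamp", "avg_Tag_value"]
).withColumn("hour", date_trunc("hour", col("timestamp")))
df2.show(truncate=False)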