I am trying to call partitionBy on a nested field, as shown below:
val rawJson = sqlContext.read.json(filename)
rawJson.write.partitionBy("data.dataDetails.name").parquet(filenameParquet)
When I run it I get the following error. I do see the `name` column as a field in the schema below. Is there a different format for specifying nested column names?
java.lang.RuntimeException: Partition column data.dataDetails.name not found in schema StructType(StructField(name,StringType,true), StructField(time,StringType,true), StructField(data,StructType(StructField(dataDetails,StructType(StructField(name,StringType,true), StructField(id,StringType,true)),true)),true))
Here is my JSON file:
{
  "name": "AssetName",
  "time": "2016-06-20T11:57:19.4941368-04:00",
  "data": {
    "type": "EventData",
    "dataDetails": {
      "name": "EventName",
      "id": "1234"
    }
  }
}
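This appears to be a known issue, listed here: https://issues.apache.org/jira/browse/SPARK-18084
I ran into this problem too, and I was able to work around it by un-nesting the columns in my dataset. My dataset is slightly different from yours, but this is the strategy...
Original JSON: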
{
  "name": "AssetName",
  "time": "2016-06-20T11:57:19.4941368-04:00",
  "data": {
    "type": "EventData",
    "dataDetails": {
      "name": "EventName",
      "id": "1234"
    }
  }
}
Modified JSON:
{
  "name": "AssetName",
  "time": "2016-06-20T11:57:19.4941368-04:00",
  "data_type": "EventData",
  "data_dataDetails_name": "EventName",
  "data_dataDetails_id": "1234"
}
Code to get the modified JSON:
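import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StructField, StructType}

def main(args: Array[String]): Unit = {
  ...
  // Flatten the children of "data" and keep the root-level columns alongside them
  // (the $"..." syntax needs the session's implicits in scope).
  val data = df.select((children("data", df) ++ Seq($"name", $"time")): _*)
  data.printSchema
  data.write.partitionBy("data_dataDetails_name").format("csv").save(...)
}

// Returns one aliased column per direct child of the struct column `colname`
// (one level only; deeper structs stay nested).
def children(colname: String, df: DataFrame) = {
  val parent = df.schema.fields.filter(_.name == colname).head
  val fields = parent.dataType match {
    case x: StructType => x.fields
    case _ => Array.empty[StructField]
  }
  fields.map(x => col(s"$colname.${x.name}").alias(s"${colname}_${x.name}"))
}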
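Note that `children` only un-nests one level, so for the JSON in the question `data.dataDetails` would remain a struct. A minimal sketch of a recursive variant that produces names such as `data_dataDetails_name` might look like the following. The object name `FlattenSketch` and its helpers are hypothetical, and the sketch assumes field names contain no dots and does not descend into arrays or maps.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Hypothetical helper, not part of Spark: flattens nested structs into top-level columns.
object FlattenSketch {
  // Walk the schema and return (dottedPath, underscoreAlias) for every non-struct leaf field.
  def leafPaths(schema: StructType, prefix: Seq[String] = Nil): Seq[(String, String)] =
    schema.fields.toSeq.flatMap { field =>
      val path = prefix :+ field.name
      field.dataType match {
        case nested: StructType => leafPaths(nested, path)
        case _                  => Seq((path.mkString("."), path.mkString("_")))
      }
    }

  // Select every leaf as a top-level column named after its full path.
  def flatten(df: DataFrame): DataFrame = {
    val columns: Seq[Column] =
      leafPaths(df.schema).map { case (path, alias) => col(path).alias(alias) }
    df.select(columns: _*)
  }
}
```

With this in place, `FlattenSketch.flatten(df).write.partitionBy("data_dataDetails_name")` followed by the writer of your choice should partition on the formerly nested field. Prefixing every alias with its full path also keeps the nested `name` from clashing with the root-level `name`.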
Since this feature is not available as of Spark 2.3.1, here is a workaround. Be sure to handle name conflicts between nested fields and root-level fields.
{"date":"20180808","value":{"group":"xxx","team":"yyy"}}
df.select("date","value.group","value.team")
.write
.partitionBy("date","group","team")
.parquet(filenameParquet)
The partitions end up looking like
date=20180808/group=xxx/team=yyy/part-xxx.parquet
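On the name-conflict caveat: selecting `value.group` yields a column called `group`, which would clash if the root level already had a `group` field. One possible way to pick safe names, sketched here in plain Scala with no Spark dependency, is to keep the leaf name when it is unambiguous and fall back to the underscore-joined full path otherwise. `AliasSketch` and `uniqueAliases` are hypothetical names, and the root-level `group` in the usage note is invented for illustration.

```scala
// Hypothetical helper: maps dotted column paths to unique top-level names.
object AliasSketch {
  def uniqueAliases(paths: Seq[String]): Seq[(String, String)] = {
    // Count how many paths share each leaf name.
    val leafCounts: Map[String, Int] =
      paths.groupBy(_.split('.').last).map { case (leaf, ps) => leaf -> ps.size }

    val result = paths.map { path =>
      val leaf = path.split('.').last
      // Keep the short leaf name only when no other path ends in it.
      val alias = if (leafCounts(leaf) == 1) leaf else path.replace('.', '_')
      path -> alias
    }

    // Fail loudly if the fallback names still collide.
    require(result.map(_._2).distinct.size == result.size, "alias collision")
    result
  }
}
```

For example, `AliasSketch.uniqueAliases(Seq("date", "group", "value.group", "value.team"))` keeps `date`, `group`, and `team` as they are and renames the nested one to `value_group`. The resulting pairs can then be passed to Spark as `df.select(aliases.map { case (p, a) => col(p).alias(a) }: _*)` before calling `partitionBy`.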