This is the final schema I am trying to achieve with Spark SQL:
|-- references: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- url: string (nullable = true)
I am trying to insert the data into Parquet, but I cannot create a nested JSON Row object that matches the schema above.
Here is what I tried without success:
Tried inserting the data as Object[] references = new Object[]{"1", "2", "3"}
Tried Object[] references = new Object[0] (only this one works)
Tried Object[] references = new Object[]{new Object[]{"1", "2", "3"}}
I then return it as
RowFactory.create(references)
which is where I am trying to return the Row object.
I need help creating this schema with Spark SQL in Java. I could not find any solution online.
It looks like a plain list of rows can be used as input, and the functions "array" and "struct" can then build the required schema:
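import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.*;

// Each Row holds the three flat fields of one reference.
List<Row> data = Arrays.asList(
    RowFactory.create("1", "2", "3"),
    RowFactory.create("4", "5", "6"));
StructType schema = DataTypes.createStructType(
    new StructField[]{
        DataTypes.createStructField("name", DataTypes.StringType, true),
        DataTypes.createStructField("type", DataTypes.StringType, true),
        DataTypes.createStructField("url", DataTypes.StringType, true),
    });
// spark() is assumed to be a helper that returns the active SparkSession.
Dataset<Row> plain = spark().createDataFrame(data, schema);
// Wrap the flat columns in a struct, then wrap that struct in a one-element array.
Dataset<Row> result = plain
    .withColumn("references", array(struct(col("name"), col("type"), col("url"))))
    .select("references");
result.show(false);
result.printSchema();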
The output is:
+----------+
|references|
+----------+
|[[1,2,3]] |
|[[4,5,6]] |
+----------+
root
|-- references: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- name: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- url: string (nullable = true)
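One likely contributor to the failed attempts in the question is Java's varargs handling. RowFactory.create is declared as create(Object... values), so an Object[] passed as the only argument is treated as the argument list itself instead of as one array-valued column. The sketch below shows this with plain Java only; fieldCount is a hypothetical stand-in with the same varargs shape, not a Spark method.

```java
public class VarargsSpread {
    // Hypothetical stand-in with the same varargs shape as
    // RowFactory.create(Object... values); it only reports how many
    // fields the resulting row would have.
    public static int fieldCount(Object... values) {
        return values.length;
    }

    public static void main(String[] args) {
        Object[] references = new Object[]{"1", "2", "3"};

        // The array itself becomes the varargs array: three separate fields.
        System.out.println(fieldCount(references));           // 3

        // Casting to Object wraps it: one field whose value is the whole array.
        System.out.println(fieldCount((Object) references));  // 1

        // An empty array spreads into zero fields.
        System.out.println(fieldCount(new Object[0]));        // 0

        // The outer array spreads, leaving one field that holds the inner array.
        Object[] nested = new Object[]{new Object[]{"1", "2", "3"}};
        System.out.println(fieldCount(nested));               // 1
    }
}
```

For the schema in the question, each array element would presumably need to be a Row (the Java representation of a struct value) and the array would be passed as a single argument, for example RowFactory.create((Object) new Row[]{RowFactory.create("1", "2", "3")}). That call is a suggestion based on the varargs behavior above and has not been run against Spark here.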