Converting a nested JSON string inside a Dataset into a Dataset/DataFrame in Spark Scala

Problem Description

I have a simple program whose dataset has a column resource_serialized holding a JSON string, as shown below:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object TestApp {
  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setAppName("Loading Data").setMaster("local[*]")

    val spark = SparkSession
      .builder
      .config(sparkConf)
      .appName("Test")
      .getOrCreate()

    // A single record whose resource_serialized column holds an escaped JSON string
    val json = "[{\"resource_serialized\":\"{\\\"createdOn\\\":\\\"2000-07-20 00:00:00.0\\\",\\\"genderCode\\\":\\\"0\\\"}\",\"id\":\"00529e54-0f3d-4c76-9d3\"}]"

    import spark.implicits._
    val df = spark.read.json(Seq(json).toDS)
    df.printSchema()
    df.show()
  }
}

The printed schema is:

root
 |-- id: string (nullable = true)
 |-- resource_serialized: string (nullable = true)

The dataset printed to the console is:

+--------------------+--------------------+
|                  id| resource_serialized|
+--------------------+--------------------+
|00529e54-0f3d-4c7...|{"createdOn":"200...|
+--------------------+--------------------+

The resource_serialized field holds the JSON string, which is (as seen in the debug console):

{"createdOn":"2000-07-20 00:00:00.0","genderCode":"0"}

Now I need to create a dataset/dataframe from that JSON string. How can I achieve that?

apache-spark apache-spark-sql dataset apache-spark-dataset
1 Answer

Use from_json: pass the resource_serialized column together with a schema describing the embedded JSON, and Spark will parse it into a struct column whose fields you can then select.
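Below is a minimal sketch of that approach, assuming the inner JSON always carries the two string fields from the sample record; the names innerSchema and parsed are illustrative:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Schema of the JSON embedded in resource_serialized; both fields are
// assumed to be strings, matching the sample record above.
val innerSchema = StructType(Seq(
  StructField("createdOn", StringType, nullable = true),
  StructField("genderCode", StringType, nullable = true)
))

// Parse the serialized column into a struct, then flatten it into top-level columns.
val parsed = df
  .withColumn("resource", from_json(col("resource_serialized"), innerSchema))
  .select(col("id"), col("resource.createdOn"), col("resource.genderCode"))

parsed.printSchema()
// root
//  |-- id: string (nullable = true)
//  |-- createdOn: string (nullable = true)
//  |-- genderCode: string (nullable = true)
parsed.show(false)

Note that by default from_json returns null for rows whose string does not match the given schema, so a mismatched schema fails silently rather than throwing.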

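If the inner schema is not known up front, one alternative sketch is to let Spark infer it from the serialized column itself (at the cost of an extra pass over the data) and then reuse it with from_json, so the id column stays alongside the parsed fields; parsedInferred is an illustrative name:

import org.apache.spark.sql.functions.{col, from_json}
import spark.implicits._

// Infer the schema of the embedded JSON by reading the column as a Dataset[String].
val inferredSchema = spark.read.json(df.select(col("resource_serialized")).as[String]).schema

// Parse with the inferred schema, keeping the other columns of df.
val parsedInferred = df.withColumn("resource", from_json(col("resource_serialized"), inferredSchema))
parsedInferred.select("id", "resource.*").show(false)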