I am trying to set a default value for nested columns containing nulls in Spark, but it looks like the DataFrameNaFunctions.fill function does not work on nested columns.
import spark.implicits._
case class Demographics(city: String)
case class Details(age: Int, demographics: Demographics)
case class Person(name: String, details: Details)
case class Data(person: Person)

val data = Seq(
  Data(Person("James", Details(48, demographics = Demographics("Toronto")))),
  Data(Person("Mary", Details(41, demographics = Demographics(null)))),
  Data(null)
).toDS
data.na.fill("default").show(false)
+------------------------+
|person |
+------------------------+
|{James, {48, {Toronto}}}|
|{Mary, {41, {NULL}}} |
|NULL |
+------------------------+
What I am expecting:
+------------------------+
|person |
+------------------------+
|{James, {48, {Toronto}}}|
|{Mary, {41, {default}}} |
|NULL |
+------------------------+
Does anyone know a way to do this? By the way, the main reason I want to set a value is that I need to map these rows to JVM objects that are Java beans, and those bean fields cannot be null.
val encoder = Encoders.bean(classOf[InputBeanClass])

data.map(row => {
  row
})(encoder).count()
If I run the code above, I get the following error:
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
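For reference, here is a workaround sketch I have been experimenting with (not verified to be the right approach): rebuilding the nested field with `coalesce` via `Column.withField`, which assumes Spark 3.1 or later where `withField` accepts dot-notation paths.

```scala
import org.apache.spark.sql.functions.{coalesce, col, lit}

// Replace a null city with "default" inside the nested struct,
// leaving every other field of person untouched.
val filled = data.withColumn(
  "person",
  col("person").withField(
    "details.demographics.city",
    coalesce(col("person.details.demographics.city"), lit("default"))
  )
)
filled.show(false)
```

As far as I can tell, `withField` on a null struct yields null, so the `Data(null)` row should stay NULL, but this only fixes one field at a time rather than all nested string columns at once.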