我有一个数据集,我想迭代列列表并使用两个新列丰富数据集 -
status
和 message
。
如果任何行、任何列为空,则状态应设置为
Failure
,并且消息应包含为空的列的列表:
Col : list[String] = df.coulmns.toList
df = df.withColumn("status" ,functions.lit("Success"))
.withColumn("message",functions.lit(" "))
col.forEach(f=>{
val SuccessDf = df.filter(col(f).isNotNull)
var failureDf = df.filter(col(f).isNull)
failureDf = failureDf.withColumn("status" ,functions.lit("Failure"))
.withColumn("message", concat(col("message"),f))
df = SuccessDf.union(failureDf)
}
)
上述代码的问题是,运行大约 40K、大约 18 列的数据集需要永远运行
示例消息:
" Column : " + columnName + " is null"
请检查以下解决方案。
scala> df.show(30, false) // sample data
+----------------------------+----------+------+---+----------+----------+------+-----------+
|birthDate |firstName |gender|id |lastName |middleName|salary|ssn |
+----------------------------+----------+------+---+----------+----------+------+-----------+
|1955-07-02T04:00:00.000+0000|Pennie |null |1 |Hirschmann|Carry |56172 |981-43-9345|
|1992-02-08T05:00:00.000+0000|An |Female|2 |Cowper |Amira |40203 |null |
|null |Quyen |Female|3 |Dome |Marlen |53417 |957-57-8246|
|1990-04-11T04:00:00.000+0000|Coralie |Female|4 |null |null |94727 |963-39-4885|
|1980-01-16T05:00:00.000+0000|Terrie |Female|5 |Bonar |Wava |79908 |964-49-8051|
|1990-11-24T05:00:00.000+0000|null |Female|6 |null |null |64652 |954-59-9172|
|1970-12-19T05:00:00.000+0000|Geri |Female|7 |Mosby |Tambra |38195 |968-16-4020|
|null |Patria |null |8 |null |Nancy |null |null |
|1967-11-17T05:00:00.000+0000|Terese |Female|9 |Tocque |Alfredia |91294 |967-48-7309|
|null |Wava |null |10 |null |null |56521 |null |
|1979-09-17T04:00:00.000+0000|Sophie |Female|11 |Hearn |Emerita |90920 |977-66-4483|
|1959-01-31T05:00:00.000+0000|Jodie |Female|12 |Laneham |Tabetha |90634 |923-24-9769|
|1974-02-19T04:00:00.000+0000|Marietta |Female|13 |Yansons |Mandi |93162 |900-34-8083|
|1960-09-26T04:00:00.000+0000|Caridad |Female|14 |Snelle |Maire |38859 |992-11-7062|
|1960-01-29T05:00:00.000+0000|Yasmine |Female|15 |Edworthye |Meg |76220 |922-12-9862|
|1986-12-05T05:00:00.000+0000|Chan |Female|16 |Hartas |Jani |75050 |995-51-3115|
|1961-09-29T04:00:00.000+0000|Evangeline|Female|17 |Casserley |Wanetta |62814 |926-61-3526|
|1980-02-14T05:00:00.000+0000|Elnora |Female|18 |Lipman |Kecia |71350 |950-23-9739|
|1978-11-14T05:00:00.000+0000|Adelle |Female|19 |Grigoriev |Kathyrn |60600 |923-23-5984|
|1973-11-24T05:00:00.000+0000|Mica |Female|20 |Challens |Zandra |51071 |918-66-1232|
+----------------------------+----------+------+---+----------+----------+------+-----------+
scala> val statusExpr = when(
df.columns.map(c => col(c).isNull).reduce(_ or _),
lit("Failure")
)
.otherwise(lit(""))
scala> val messageExpr = concat_ws(
", ",
df.columns.map(c => when(col(c).isNull, lit(s"Column: ${c} isNull"))):_*
)
scala> df
.withColumn("status", statusExpr)
.withColumn("message", messageExpr)
.show(30, false)
最终输出
+----------------------------+----------+------+---+----------+----------+------+-----------+-------+-----------------------------------------------------------------------------------------------------------------------+
|birthDate |firstName |gender|id |lastName |middleName|salary|ssn |status |message |
+----------------------------+----------+------+---+----------+----------+------+-----------+-------+-----------------------------------------------------------------------------------------------------------------------+
|1955-07-02T04:00:00.000+0000|Pennie |null |1 |Hirschmann|Carry |56172 |981-43-9345|Failure|Column: gender isNull |
|1992-02-08T05:00:00.000+0000|An |Female|2 |Cowper |Amira |40203 |null |Failure|Column: ssn isNull |
|null |Quyen |Female|3 |Dome |Marlen |53417 |957-57-8246|Failure|Column: birthDate isNull |
|1990-04-11T04:00:00.000+0000|Coralie |Female|4 |null |null |94727 |963-39-4885|Failure|Column: lastName isNull, Column: middleName isNull |
|1980-01-16T05:00:00.000+0000|Terrie |Female|5 |Bonar |Wava |79908 |964-49-8051| | |
|1990-11-24T05:00:00.000+0000|null |Female|6 |null |null |64652 |954-59-9172|Failure|Column: firstName isNull, Column: lastName isNull, Column: middleName isNull |
|1970-12-19T05:00:00.000+0000|Geri |Female|7 |Mosby |Tambra |38195 |968-16-4020| | |
|null |Patria |null |8 |null |Nancy |null |null |Failure|Column: birthDate isNull, Column: gender isNull, Column: lastName isNull, Column: salary isNull, Column: ssn isNull |
|1967-11-17T05:00:00.000+0000|Terese |Female|9 |Tocque |Alfredia |91294 |967-48-7309| | |
|null |Wava |null |10 |null |null |56521 |null |Failure|Column: birthDate isNull, Column: gender isNull, Column: lastName isNull, Column: middleName isNull, Column: ssn isNull|
|1979-09-17T04:00:00.000+0000|Sophie |Female|11 |Hearn |Emerita |90920 |977-66-4483| | |
|1959-01-31T05:00:00.000+0000|Jodie |Female|12 |Laneham |Tabetha |90634 |923-24-9769| | |
|1974-02-19T04:00:00.000+0000|Marietta |Female|13 |Yansons |Mandi |93162 |900-34-8083| | |
|1960-09-26T04:00:00.000+0000|Caridad |Female|14 |Snelle |Maire |38859 |992-11-7062| | |
|1960-01-29T05:00:00.000+0000|Yasmine |Female|15 |Edworthye |Meg |76220 |922-12-9862| | |
|1986-12-05T05:00:00.000+0000|Chan |Female|16 |Hartas |Jani |75050 |995-51-3115| | |
|1961-09-29T04:00:00.000+0000|Evangeline|Female|17 |Casserley |Wanetta |62814 |926-61-3526| | |
|1980-02-14T05:00:00.000+0000|Elnora |Female|18 |Lipman |Kecia |71350 |950-23-9739| | |
|1978-11-14T05:00:00.000+0000|Adelle |Female|19 |Grigoriev |Kathyrn |60600 |923-23-5984| | |
|1973-11-24T05:00:00.000+0000|Mica |Female|20 |Challens |Zandra |51071 |918-66-1232| | |
+----------------------------+----------+------+---+----------+----------+------+-----------+-------+-----------------------------------------------------------------------------------------------------------------------+