I am trying to drop rows with null values in certain columns of a DataFrame, but I get a different row count in Python and in Scala.
I did the same thing in both: in Python I get 2127178 rows, while in Scala I get 8723 rows.
For example, in Python I did:
dfplaneairport.dropna(subset=["model"], inplace= True)
dfplaneairport.dropna(subset=["engine_type"], inplace= True)
dfplaneairport.dropna(subset=["aircraft_type"], inplace= True)
dfplaneairport.dropna(subset=["status"], inplace= True)
dfplaneairport.dropna(subset=["ArrDelay"], inplace= True)
dfplaneairport.dropna(subset=["issue_date"], inplace= True)
dfplaneairport.dropna(subset=["manufacturer"], inplace= True)
dfplaneairport.dropna(subset=["type"], inplace= True)
dfplaneairport.dropna(subset=["tailnum"], inplace= True)
dfplaneairport.dropna(subset=["DepDelay"], inplace= True)
dfplaneairport.dropna(subset=["TaxiOut"], inplace= True)
dfplaneairport.shape
(2127178, 32)
And in Spark Scala I did:
dfairports = dfairports.na.drop(Seq("engine_type", "aircraft_type", "status", "model", "issue_date", "manufacturer", "type", "ArrDelay", "DepDelay", "TaxiOut", "tailnum"))
dfairports.count()
8723
I expected the same number of rows, and I don't know what I am doing wrong.
It looks like you are not calling the PySpark dropna function but the pandas one. Notice that you are passing the inplace parameter, which does not exist in the PySpark function: Spark DataFrames are immutable, so PySpark's dropna returns a new DataFrame instead of modifying the existing one.
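A minimal sketch of the difference (assuming an active SparkSession named spark; the toy data and column names are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", None), (None, "b")], ["x", "y"])

# PySpark's dropna has no inplace parameter, so this would raise a
# TypeError (unexpected keyword argument):
# df.dropna(subset=["x"], inplace=True)

# The PySpark idiom is reassignment: dropna returns a new DataFrame.
df = df.dropna(subset=["x"])
df.show()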
Here are two snippets (one in Scala, one in PySpark) that behave in exactly the same way.
Scala:
import spark.implicits._
val df = Seq(
  ("James", null, "Smith", "36636", "M", 3000),
  ("Michael", "Rose", null, "40288", "M", 4000),
  ("Robert", null, "Williams", "42114", "M", 4000),
  ("Maria", "Anne", "Jones", "39192", "F", 4000),
  ("Jen", "Mary", "Brown", null, "F", -1)
).toDF("firstname", "middlename", "lastname", "id", "gender", "salary")
df.show
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname| id|gender|salary|
+---------+----------+--------+-----+------+------+
| James| null| Smith|36636| M| 3000|
| Michael| Rose| null|40288| M| 4000|
| Robert| null|Williams|42114| M| 4000|
| Maria| Anne| Jones|39192| F| 4000|
| Jen| Mary| Brown| null| F| -1|
+---------+----------+--------+-----+------+------+
df.na.drop(Seq("middlename", "lastname")).show
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname| id|gender|salary|
+---------+----------+--------+-----+------+------+
| Maria| Anne| Jones|39192| F| 4000|
| Jen| Mary| Brown| null| F| -1|
+---------+----------+--------+-----+------+------+
PySpark:
data = [("James",None,"Smith","36636","M",3000), ("Michael","Rose",None,"40288","M",4000),
("Robert",None,"Williams","42114","M",4000),
("Maria","Anne","Jones","39192","F",4000),
("Jen","Mary","Brown",None,"F",-1)
]
df = spark.createDataFrame(data, ["firstname", "middlename", "lastname", "id", "gender", "salary"])
df.show()
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname| id|gender|salary|
+---------+----------+--------+-----+------+------+
| James| null| Smith|36636| M| 3000|
| Michael| Rose| null|40288| M| 4000|
| Robert| null|Williams|42114| M| 4000|
| Maria| Anne| Jones|39192| F| 4000|
| Jen| Mary| Brown| null| F| -1|
+---------+----------+--------+-----+------+------+
df.dropna(subset=["middlename", "lastname"]).show()
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname| id|gender|salary|
+---------+----------+--------+-----+------+------+
| Maria| Anne| Jones|39192| F| 4000|
| Jen| Mary| Brown| null| F| -1|
+---------+----------+--------+-----+------+------+
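For the question's DataFrame, the eleven pandas calls above would collapse into a single PySpark dropna with reassignment (a sketch assuming dfplaneairport is a Spark DataFrame loaded from the same data):

dfplaneairport = dfplaneairport.dropna(
    subset=["model", "engine_type", "aircraft_type", "status", "ArrDelay",
            "issue_date", "manufacturer", "type", "tailnum", "DepDelay", "TaxiOut"]
)
dfplaneairport.count()

Given identical input data, this should produce the same count as the Scala na.drop(Seq(...)) call.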