Filter out rows that do not match the schema in PySpark

Question

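I have a file named employee.csv in which the empid column is an integer and empname is a string. I read the file into a DataFrame d1 by defining a schema, and into another DataFrame d2 as-is. The data in employee.csv is as follows: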

01,A\n
02,B\n
3,C\n
D,d\n

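I want to list the rows where empid is not an integer. I cast the empid column in d2 to integer so that I could find the bad rows with subtract, but the row D,d now shows up as null,d in the output of the subtract command. How can I get the result I want?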

I also tried filtering for the rows that cannot be cast to integer, but that does not seem to work either.

d3 = d2.filter(d2["empid"].cast("int").isNull())

Please let me know how this can be achieved.

Answer 1

To return the list of empid values that are not integers, you can use filter() and isdigit():

# PySpark columns have no .str accessor, so rlike() performs the digits-only check here
out = d2.filter(~d2['empid'].rlike('^[0-9]+$')).collect()
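A minimal sketch of the same digits-only filter as a reusable helper. It assumes d2 was read with empid as a string column. The names non_integer_rows, is_digits_only and DIGITS_ONLY are hypothetical, introduced only for illustration. The pattern is an assumption about what counts as a valid empid: leading zeros pass, while signs, spaces and letters do not.

```python
import re

# Digits-only pattern (an assumption about what counts as a valid empid).
# Leading zeros such as "01" pass; signs, spaces, and letters do not.
DIGITS_ONLY = r"^[0-9]+$"


def non_integer_rows(df, column="empid", pattern=DIGITS_ONLY):
    """Return the rows of a PySpark DataFrame whose `column` is not all digits."""
    # Imported lazily so this module loads even without a Spark installation.
    from pyspark.sql import functions as F

    value = F.col(column)
    # rlike() yields NULL for NULL input, so missing values are checked explicitly.
    return df.filter(value.isNull() | ~value.rlike(pattern))


def is_digits_only(text, pattern=DIGITS_ONLY):
    """Plain-Python mirror of the same check, useful for trying out the pattern."""
    return text is not None and re.search(pattern, text) is not None
```

Under those assumptions, non_integer_rows(d2).collect() should leave only the D,d row for the sample data. The regular expression is kept in one place so it can be relaxed, for example to allow a leading minus sign, without touching the filter itself.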