PySpark - How to apply an operation to specific columns?


I'm trying to apply a rounding function to the df.summary() DataFrame, excluding the Summary column. So far I have tried using select() with a list comprehension, for example:

Code

from pyspark.sql.functions import round

df2 = df.select(*[round(column, 2).alias(column) for column in df.columns])

Output

This is the output of df2; the categorical values in the Summary column are converted to NULL, because round() implicitly casts the string column to a numeric type:

+---------+-------+-------+-------+-------+
| Summary | col 1 | col 2 | col 3 | col 4 |
+---------+-------+-------+-------+-------+
| NULL    | 0     | 0.1   | 0.2   | 0.3   |
+---------+-------+-------+-------+-------+
| NULL    | 1     | 1.1   | 1.2   | 1.3   |
+---------+-------+-------+-------+-------+
| NULL    | 2     | 2.1   | 2.2   | 2.3   |
+---------+-------+-------+-------+-------+

Desired output

I only want columns[1:] to be rounded:

+---------+-------+-------+-------+-------+
| Summary | col 1 | col 2 | col 3 | col 4 |
+---------+-------+-------+-------+-------+
| min     | 0     | 0.1   | 0.2   | 0.3   |
+---------+-------+-------+-------+-------+
| max     | 1     | 1.1   | 1.2   | 1.3   |
+---------+-------+-------+-------+-------+
| stddev  | 2     | 2.1   | 2.2   | 2.3   |
+---------+-------+-------+-------+-------+

I also tried slicing with df.columns[1:], but then the Summary column is not selected:

df2 = df.select(*[round(column, 2).alias(column) for column in df.columns[1:]])

Output

+-------+-------+-------+-------+
| col 4 | col 1 | col 2 | col 3 |
+-------+-------+-------+-------+
| 0.3   | 0     | 0.1   | 0.2   |
+-------+-------+-------+-------+
| 1.3   | 1     | 1.1   | 1.2   |
+-------+-------+-------+-------+
| 2.3   | 2     | 2.1   | 2.2   |
+-------+-------+-------+-------+
python pyspark bigdata
1 Answer

If you want to exclude the first column from the rounding operation, you can modify the code to apply the rounding selectively, only to the desired columns. You could try the following:

columns_to_round = df.columns[1:]
# Backticks are needed inside the SQL expressions because the column names contain spaces.
rounded_df = df.selectExpr("Summary", *[f"round(`{column}`, 2) as `{column}`" for column in columns_to_round])
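
As a sketch of an alternative (not from the original answer): the same result can be obtained with the DataFrame API instead of SQL expressions, which sidesteps the backtick quoting of column names containing spaces. The sample DataFrame below is hypothetical and only mimics the layout from the question.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, round as spark_round

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for df.summary(): a string Summary column followed by numeric columns.
df = spark.createDataFrame(
    [("min", 0.0, 0.123, 0.234, 0.345),
     ("max", 1.0, 1.123, 1.234, 1.345),
     ("stddev", 2.0, 2.123, 2.234, 2.345)],
    ["Summary", "col 1", "col 2", "col 3", "col 4"],
)

# Keep the first column untouched and round every remaining column to 2 decimal places.
rounded_df = df.select(
    col("Summary"),
    *[spark_round(col(c), 2).alias(c) for c in df.columns[1:]],
)
rounded_df.show()

Here round is imported under the alias spark_round to avoid shadowing Python's built-in round.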