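What is the most Pythonic way to write a function that performs row-wise aggregations (sum, min, max, mean, etc.) over a specified set of columns (column names in a list) of a pandas DataFrame while skipping NaN values?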
import pandas as pd
import numpy as np
df = pd.DataFrame({"col1": [1, np.nan, 1],
                   "col2": [2, 2, np.nan]})
def aggregate_rows(df, column_list, func):
    # Check if the specified columns exist in the DataFrame
    missing_columns = [col for col in column_list if col not in df.columns]
    if missing_columns:
        raise ValueError(f"Columns not found in DataFrame: {missing_columns}")
    # Check if func is callable
    if not callable(func):
        raise ValueError("The provided function is not callable.")
    # Apply func to each row of the specified columns, dropping NaN values first
    agg_series = df[column_list].apply(lambda row: func(row.dropna()), axis=1)
    return agg_series
df["sum"] = aggregate_rows(df, ["col1", "col2"], sum)
df["max"] = aggregate_rows(df, ["col1", "col2"], max)
df["mean"] = aggregate_rows(df, ["col1", "col2"], lambda x: x.mean())
print(df)
Result (as expected):
   col1  col2  sum  max  mean
0   1.0   2.0  3.0  2.0   1.5
1   NaN   2.0  2.0  2.0   2.0
2   1.0   NaN  1.0  1.0   1.0
However, when a row contains only NaN values:
df = pd.DataFrame({"col1": [1, np.nan, 1, np.nan],
                   "col2": [2, 2, np.nan, np.nan]})
Result:
ValueError: max() arg is an empty sequence
What is the best way to solve this?
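For context, the failure can be reproduced in isolation. After `dropna()`, an all-NaN row becomes an empty Series, and the built-in `max` raises on an empty iterable unless it is given a `default`. The following is a minimal sketch of that behaviour; the names `row`, `empty`, `raised`, and `fallback` are illustrative only and not part of the original code.

```python
import numpy as np
import pandas as pd

# An all-NaN row becomes an empty Series once NaN values are dropped.
row = pd.Series([np.nan, np.nan])
empty = row.dropna()

# The built-in max() raises ValueError on an empty iterable.
raised = False
try:
    max(empty)
except ValueError:
    raised = True

# Passing default= gives a fallback value instead of raising.
fallback = max(empty, default=np.nan)

# The built-in sum() does not raise; it returns its start value, 0.
empty_sum = sum(empty)
```

This also explains why only the `max` call fails in the example above: `sum` and `Series.mean()` both tolerate empty input.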
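You can try using numpy.sum/numpy.max/numpy.mean instead of Python's built-in functions. On an empty row they return 0 (for the sum) or NaN rather than raising: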
df["sum"] = aggregate_rows(df, ["col1", "col2"], np.sum)
df["max"] = aggregate_rows(df, ["col1", "col2"], np.max)
df["mean"] = aggregate_rows(df, ["col1", "col2"], np.mean)
print(df)
Prints:
   col1  col2  sum  max  mean
0   1.0   2.0  3.0  2.0   1.5
1   NaN   2.0  2.0  2.0   2.0
2   1.0   NaN  1.0  1.0   1.0
3   NaN   NaN  0.0  NaN   NaN
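As an alternative to `apply` with a per-row lambda, pandas' own reductions accept `axis=1` and `skipna=True`, which avoids the Python-level loop and handles all-NaN rows without raising. The sketch below is one possible variant, not the approach from the answer above: `aggregate_rows_vectorized` and `ROW_REDUCTIONS` are hypothetical names, and the function takes the reduction as a string rather than a callable. For `sum`, passing `min_count=1` yields NaN instead of 0.0 when a row has no valid values.

```python
import numpy as np
import pandas as pd

# Hypothetical whitelist of supported reduction names; extend as needed.
ROW_REDUCTIONS = {"sum", "min", "max", "mean", "median"}

def aggregate_rows_vectorized(df, column_list, how, **kwargs):
    """Row-wise reduction over column_list using pandas' built-in methods."""
    missing_columns = [col for col in column_list if col not in df.columns]
    if missing_columns:
        raise ValueError(f"Columns not found in DataFrame: {missing_columns}")
    if how not in ROW_REDUCTIONS:
        raise ValueError(f"Unsupported reduction: {how!r}")
    # Look up the DataFrame method (e.g. DataFrame.sum) and reduce across columns.
    reduction = getattr(df[column_list], how)
    return reduction(axis=1, skipna=True, **kwargs)

df = pd.DataFrame({"col1": [1, np.nan, 1, np.nan],
                   "col2": [2, 2, np.nan, np.nan]})
cols = ["col1", "col2"]

# min_count=1 makes the sum NaN (not 0.0) for rows with no valid values.
row_sum = aggregate_rows_vectorized(df, cols, "sum", min_count=1)
row_max = aggregate_rows_vectorized(df, cols, "max")
row_mean = aggregate_rows_vectorized(df, cols, "mean")
```

Whether a sum of 0.0 or NaN is the right answer for an all-NaN row depends on the use case, which is why `min_count` is left to the caller here.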