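I want to compute the sum of one column within a group_by, broken down by the values of another column. This is similar to what pl.Expr.value_counts does (see the example below), but instead of counting I want to apply a function (e.g. sum) to a specific column, in this case the Price column.
I know I could group_by on Weather + Windy and then aggregate. However, I can't do that because I have many other aggregations that must be computed on the Weather group_by alone.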
import polars as pl

df = pl.DataFrame(
    data={
        "Weather": ["Rain", "Sun", "Rain", "Sun", "Rain", "Sun", "Rain", "Sun"],
        "Price": [1, 2, 3, 4, 5, 6, 7, 8],
        "Windy": ["Y", "Y", "Y", "Y", "N", "N", "N", "N"],
    }
)
With value_counts I can get the number of rows for each Windy value per Weather group:
df_agg = (
    df.group_by("Weather")
    .agg(
        pl.col("Windy")
        .value_counts()
        .alias("Price")
    )
)
shape: (2, 2)
┌─────────┬────────────────────┐
│ Weather ┆ Price │
│ --- ┆ --- │
│ str ┆ list[struct[2]] │
╞═════════╪════════════════════╡
│ Sun ┆ [{"Y",2}, {"N",2}] │
│ Rain ┆ [{"Y",2}, {"N",2}] │
└─────────┴────────────────────┘
I would like to do something like this:
df_agg = (
    df.group_by("Weather")
    .agg(
        pl.col("Windy")
        # custom_fun_on_other_col is a made-up method, shown only to illustrate the intent
        .custom_fun_on_other_col("Price", sum)
        .alias("Price")
    )
)
And this is the result I want:
shape: (2, 2)
┌─────────┬────────────────────┐
│ Weather ┆ Price │
│ --- ┆ --- │
│ str ┆ list[struct[2]] │
╞═════════╪════════════════════╡
│ Sun ┆ [{"Y",6},{"N",14}] │
│ Rain ┆ [{"Y",4},{"N",12}] │
└─────────┴────────────────────┘
One option is to build a temporary DataFrame and then join it with the main aggregation.
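tmp = (
    df.group_by(["Weather", "Windy"])
    .agg(pl.col("Price").sum())
    # pl.struct takes its name from the first field, so the struct column is called "Windy"
    .select(pl.col("Weather"), pl.struct(["Windy", "Price"]))
    .group_by("Weather")
    # inside agg, a plain column is collected into one list per group
    .agg(pl.col("Windy"))
)

df.group_by("Weather").agg([
    # your other aggregations ...
]).join(tmp, on="Weather")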
┌─────────┬─────────────────────┐
│ Weather ┆ Windy │
│ --- ┆ --- │
│ str ┆ list[struct[2]] │
╞═════════╪═════════════════════╡
│ Rain ┆ [{"Y",4}, {"N",12}] │
│ Sun ┆ [{"N",14}, {"Y",6}] │
└─────────┴─────────────────────┘