Multiple aggregations on multiple columns in Python Polars


Following "How to implement binning with Python Polars", I can easily compute an aggregation for an individual column:

import polars as pl
import numpy as np

t, v = np.arange(0, 100, 2), np.arange(0, 100, 2)
df = pl.DataFrame({"t": t, "v0": v, "v1": v})
df = df.with_column((pl.datetime(2022,10,30) + pl.duration(seconds=df["t"])).alias("datetime")).drop("t")

df.groupby_dynamic("datetime", every="10s").agg(pl.col("v0").mean())
┌─────────────────────┬──────┐
│ datetime            ┆ v0   │
│ ---                 ┆ ---  │
│ datetime[μs]        ┆ f64  │
╞═════════════════════╪══════╡
│ 2022-10-30 00:00:00 ┆ 4.0  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-10-30 00:00:10 ┆ 14.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-10-30 00:00:20 ┆ 24.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-10-30 00:00:30 ┆ 34.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ...                 ┆ ...  │

or compute multiple aggregations, e.g.

df.groupby_dynamic("datetime", every="10s").agg([
    pl.col("v0").mean().alias("v0_binmean"),
    pl.col("v0").count().alias("v0_bincount")
])
┌─────────────────────┬────────────┬─────────────┐
│ datetime            ┆ v0_binmean ┆ v0_bincount │
│ ---                 ┆ ---        ┆ ---         │
│ datetime[μs]        ┆ f64        ┆ u32         │
╞═════════════════════╪════════════╪═════════════╡
│ 2022-10-30 00:00:00 ┆ 4.0        ┆ 5           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:10 ┆ 14.0       ┆ 5           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:20 ┆ 24.0       ┆ 5           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:30 ┆ 34.0       ┆ 5           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...                 ┆ ...        ┆ ...         │

or compute one aggregation for multiple columns, e.g.

cols = [c for c in df.columns if "datetime" not in c]
df.groupby_dynamic("datetime", every="10s").agg([
     pl.col(f"{c}").mean().alias(f"{c}_binmean")
     for c in cols
])
┌─────────────────────┬────────────┬────────────┐
│ datetime            ┆ v0_binmean ┆ v1_binmean │
│ ---                 ┆ ---        ┆ ---        │
│ datetime[μs]        ┆ f64        ┆ f64        │
╞═════════════════════╪════════════╪════════════╡
│ 2022-10-30 00:00:00 ┆ 4.0        ┆ 4.0        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:10 ┆ 14.0       ┆ 14.0       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:20 ┆ 24.0       ┆ 24.0       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:30 ┆ 34.0       ┆ 34.0       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...                 ┆ ...        ┆ ...        │

However, combining the two approaches fails:

df.groupby_dynamic("datetime", every="10s").agg([
    [
    pl.col(f"{c}").mean().alias(f"{c}_binmean"),
    pl.col(f"{c}").count().alias(f"{c}_bincount")
    ]
    for c in cols
])
Traceback (most recent call last):

  File "/tmp/ipykernel_2666/421808935.py", line 2, in <cell line: 2>
    df.groupby_dynamic("datetime", every="10s").agg([

  File ".../3.10.9/lib/python3.10/site-packages/polars/internals/dataframe/groupby.py", line 924, in agg
    .agg(aggs)

  File ".../3.10.9/lib/python3.10/site-packages/polars/internals/lazyframe/groupby.py", line 55, in agg
    raise TypeError(msg)

TypeError: expected 'Expr | Sequence[Expr]', got '<class 'list'>'

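Is there a "Polars-idiomatic" way to compute multiple statistics for multiple (or all) columns of the DataFrame in one go?

Related, pandas-specific: Python pandas groupby aggregate on multiple columns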

1 Answer

There are several ways to select multiple columns "at once" in Polars:

>>> df.select(pl.all()).columns
['v0', 'v1', 'datetime']
>>> df.select(pl.col(["v0", "v1"])).columns          # by name(s)
['v0', 'v1']
>>> df.select(pl.all().exclude("datetime")).columns  # by exclusion
['v0', 'v1']
>>> df.select(pl.exclude("datetime")).columns        # we can omit `.all()`
['v0', 'v1']

.suffix() can be used to append a string to the end of each column name:

>>> df.select(pl.exclude("datetime").mean().suffix("_binmean"))
shape: (1, 2)
┌────────────┬────────────┐
│ v0_binmean | v1_binmean │
│ ---        | ---        │
│ f64        | f64        │
╞════════════╪════════════╡
│ 49.0       | 49.0       │
└────────────┴────────────┘

This means your example can be written as:

df.groupby_dynamic("datetime", every="10s").agg(
   pl.exclude("datetime").mean().suffix("_binmean")
)
shape: (10, 3)
┌─────────────────────┬────────────┬────────────┐
│ datetime            | v0_binmean | v1_binmean │
│ ---                 | ---        | ---        │
│ datetime[μs]        | f64        | f64        │
╞═════════════════════╪════════════╪════════════╡
│ 2022-10-30 00:00:00 | 4.0        | 4.0        │
├─────────────────────┼────────────┼────────────┤
│ 2022-10-30 00:00:10 | 14.0       | 14.0       │
├─────────────────────┼────────────┼────────────┤
│ 2022-10-30 00:00:20 | 24.0       | 24.0       │

Multiple aggregations:

df.groupby_dynamic("datetime", every="10s").agg([
   pl.exclude("datetime").mean().suffix("_binmean"),
   pl.exclude("datetime").count().suffix("_bincount")
])
shape: (10, 5)
┌─────────────────────┬────────────┬────────────┬─────────────┬─────────────┐
│ datetime            | v0_binmean | v1_binmean | v0_bincount | v1_bincount │
│ ---                 | ---        | ---        | ---         | ---         │
│ datetime[μs]        | f64        | f64        | u32         | u32         │
╞═════════════════════╪════════════╪════════════╪═════════════╪═════════════╡
│ 2022-10-30 00:00:00 | 4.0        | 4.0        | 5           | 5           │
├─────────────────────┼────────────┼────────────┼─────────────┼─────────────┤
│ 2022-10-30 00:00:10 | 14.0       | 14.0       | 5           | 5           │
├─────────────────────┼────────────┼────────────┼─────────────┼─────────────┤
│ 2022-10-30 00:00:20 | 24.0       | 24.0       | 5           | 5           │
├─────────────────────┼────────────┼────────────┼─────────────┼─────────────┤
│ 2022-10-30 00:00:30 | 34.0       | 34.0       | 5           | 5           │

For the list comprehension, you need to combine the two comprehensions into a single flat list:

df.groupby_dynamic("datetime", every="10s").agg(
   [pl.col(f"{c}").mean().alias(f"{c}_binmean") for c in cols] + 
   [pl.col(f"{c}").count().alias(f"{c}_bincount") for c in cols]
)
shape: (10, 5)
┌─────────────────────┬────────────┬────────────┬─────────────┬─────────────┐
│ datetime            | v0_binmean | v1_binmean | v0_bincount | v1_bincount │
│ ---                 | ---        | ---        | ---         | ---         │
│ datetime[μs]        | f64        | f64        | u32         | u32         │
╞═════════════════════╪════════════╪════════════╪═════════════╪═════════════╡
│ 2022-10-30 00:00:00 | 4.0        | 4.0        | 5           | 5           │
├─────────────────────┼────────────┼────────────┼─────────────┼─────────────┤
│ 2022-10-30 00:00:10 | 14.0       | 14.0       | 5           | 5           │
├─────────────────────┼────────────┼────────────┼─────────────┼─────────────┤
│ 2022-10-30 00:00:20 | 24.0       | 24.0       | 5           | 5           │
├─────────────────────┼────────────┼────────────┼─────────────┼─────────────┤
│ 2022-10-30 00:00:30 | 34.0       | 34.0       | 5           | 5           │