Looking at how to do binning with Python Polars, I can easily compute an aggregation for a single column:
import polars as pl
import numpy as np
t, v = np.arange(0, 100, 2), np.arange(0, 100, 2)
df = pl.DataFrame({"t": t, "v0": v, "v1": v})
df = df.with_column((pl.datetime(2022,10,30) + pl.duration(seconds=df["t"])).alias("datetime")).drop("t")
df.groupby_dynamic("datetime", every="10s").agg(pl.col("v0").mean())
┌─────────────────────┬──────┐
│ datetime ┆ v0 │
│ --- ┆ --- │
│ datetime[μs] ┆ f64 │
╞═════════════════════╪══════╡
│ 2022-10-30 00:00:00 ┆ 4.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-10-30 00:00:10 ┆ 14.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-10-30 00:00:20 ┆ 24.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-10-30 00:00:30 ┆ 34.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ... ┆ ... │
or compute multiple aggregations, like
df.groupby_dynamic("datetime", every="10s").agg([
pl.col("v0").mean().alias("v0_binmean"),
pl.col("v0").count().alias("v0_bincount")
])
┌─────────────────────┬────────────┬─────────────┐
│ datetime ┆ v0_binmean ┆ v0_bincount │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ f64 ┆ u32 │
╞═════════════════════╪════════════╪═════════════╡
│ 2022-10-30 00:00:00 ┆ 4.0 ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:10 ┆ 14.0 ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:20 ┆ 24.0 ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:30 ┆ 34.0 ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
or compute a single aggregation over multiple columns, e.g.
cols = [c for c in df.columns if "datetime" not in c]
df.groupby_dynamic("datetime", every="10s").agg([
pl.col(f"{c}").mean().alias(f"{c}_binmean")
for c in cols
])
┌─────────────────────┬────────────┬────────────┐
│ datetime ┆ v0_binmean ┆ v1_binmean │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ f64 ┆ f64 │
╞═════════════════════╪════════════╪════════════╡
│ 2022-10-30 00:00:00 ┆ 4.0 ┆ 4.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:10 ┆ 14.0 ┆ 14.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:20 ┆ 24.0 ┆ 24.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:30 ┆ 34.0 ┆ 34.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
However, combining the two approaches fails!
df.groupby_dynamic("datetime", every="10s").agg([
[
pl.col(f"{c}").mean().alias(f"{c}_binmean"),
pl.col(f"{c}").count().alias(f"{c}_bincount")
]
for c in cols
])
Traceback (most recent call last):
File "/tmp/ipykernel_2666/421808935.py", line 2, in <cell line: 2>
df.groupby_dynamic("datetime", every="10s").agg([
File ".../3.10.9/lib/python3.10/site-packages/polars/internals/dataframe/groupby.py", line 924, in agg
.agg(aggs)
File ".../3.10.9/lib/python3.10/site-packages/polars/internals/lazyframe/groupby.py", line 55, in agg
raise TypeError(msg)
TypeError: expected 'Expr | Sequence[Expr]', got '<class 'list'>'
Is there a "polars" way to compute several statistics over several (all) columns of a dataframe in one go?
Related, pandas-specific: Python pandas groupby aggregate on multiple columns
There are different ways to select multiple columns "at once" in Polars:
>>> df.select(pl.all()).columns
['v0', 'v1', 'datetime']
>>> df.select(pl.col(["v0", "v1"])).columns # by name(s)
['v0', 'v1']
>>> df.select(pl.all().exclude("datetime")).columns # by exclusion
['v0', 'v1']
>>> df.select(pl.exclude("datetime")).columns # we can omit `.all()`
['v0', 'v1']
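You could also select by dtype; assuming the value columns share a type (they are Int64 here, coming from np.arange), something like this should work:
>>> df.select(pl.col(pl.Int64)).columns  # by dtype (assumes v0/v1 are Int64)
['v0', 'v1']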
.suffix() can be used to append a suffix to the end of each name:
>>> df.select(pl.exclude("datetime").mean().suffix("_binmean"))
shape: (1, 2)
┌────────────┬────────────┐
│ v0_binmean | v1_binmean │
│ --- | --- │
│ f64 | f64 │
╞════════════╪════════════╡
│ 49.0 | 49.0 │
└────────────┴────────────┘
Which means your example can be written as:
df.groupby_dynamic("datetime", every="10s").agg(
pl.exclude("datetime").mean().suffix("_binmean")
)
shape: (10, 3)
┌─────────────────────┬────────────┬────────────┐
│ datetime | v0_binmean | v1_binmean │
│ --- | --- | --- │
│ datetime[μs] | f64 | f64 │
╞═════════════════════╪════════════╪════════════╡
│ 2022-10-30 00:00:00 | 4.0 | 4.0 │
├─────────────────────┼────────────┼────────────┤
│ 2022-10-30 00:00:10 | 14.0 | 14.0 │
├─────────────────────┼────────────┼────────────┤
│ 2022-10-30 00:00:20 | 24.0 | 24.0 │
Multiple aggregations:
df.groupby_dynamic("datetime", every="10s").agg([
pl.exclude("datetime").mean().suffix("_binmean"),
pl.exclude("datetime").count().suffix("_bincount")
])
shape: (10, 5)
┌─────────────────────┬────────────┬────────────┬─────────────┬─────────────┐
│ datetime | v0_binmean | v1_binmean | v0_bincount | v1_bincount │
│ --- | --- | --- | --- | --- │
│ datetime[μs] | f64 | f64 | u32 | u32 │
╞═════════════════════╪════════════╪════════════╪═════════════╪═════════════╡
│ 2022-10-30 00:00:00 | 4.0 | 4.0 | 5 | 5 │
├─────────────────────┼────────────┼────────────┼─────────────┼─────────────┤
│ 2022-10-30 00:00:10 | 14.0 | 14.0 | 5 | 5 │
├─────────────────────┼────────────┼────────────┼─────────────┼─────────────┤
│ 2022-10-30 00:00:20 | 24.0 | 24.0 | 5 | 5 │
├─────────────────────┼────────────┼────────────┼─────────────┼─────────────┤
│ 2022-10-30 00:00:30 | 34.0 | 34.0 | 5 | 5 │
As for the list comprehensions - you need to combine the two into a single list:
df.groupby_dynamic("datetime", every="10s").agg(
[pl.col(f"{c}").mean().alias(f"{c}_binmean") for c in cols] +
[pl.col(f"{c}").count().alias(f"{c}_bincount") for c in cols]
)
shape: (10, 5)
┌─────────────────────┬────────────┬────────────┬─────────────┬─────────────┐
│ datetime | v0_binmean | v1_binmean | v0_bincount | v1_bincount │
│ --- | --- | --- | --- | --- │
│ datetime[μs] | f64 | f64 | u32 | u32 │
╞═════════════════════╪════════════╪════════════╪═════════════╪═════════════╡
│ 2022-10-30 00:00:00 | 4.0 | 4.0 | 5 | 5 │
├─────────────────────┼────────────┼────────────┼─────────────┼─────────────┤
│ 2022-10-30 00:00:10 | 14.0 | 14.0 | 5 | 5 │
├─────────────────────┼────────────┼────────────┼─────────────┼─────────────┤
│ 2022-10-30 00:00:20 | 24.0 | 24.0 | 5 | 5 │
├─────────────────────┼────────────┼────────────┼─────────────┼─────────────┤
│ 2022-10-30 00:00:30 | 34.0 | 34.0 | 5 | 5 │
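As an alternative sketch (same df and cols as above): agg() just needs one flat sequence of expressions, so a nested comprehension flattened into a single list also works - the columns then come out interleaved per source column (v0_binmean, v0_bincount, v1_binmean, v1_bincount):
df.groupby_dynamic("datetime", every="10s").agg([
    expr
    for c in cols
    for expr in (            # flatten the (mean, count) pair per column into one list
        pl.col(c).mean().alias(f"{c}_binmean"),
        pl.col(c).count().alias(f"{c}_bincount"),
    )
])
Note that more recent Polars releases have renamed these APIs (groupby_dynamic -> group_by_dynamic, .suffix() -> .name.suffix()); the snippets above follow the older API used in the question.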