How do I aggregate columns in Polars based on the data in other columns?


I have data like the following loaded into Polars:

uid groupid thresholds class data1 data2 data3 data4
X1 X 0.0 0 1 1 1
X2 X 0.0 0 1 1 1
X3 X 0.0 0 1 1 1
Y1 0.0 1 1 1 1
Y2 0.0 1 1 1 1
Y3 0.0 1 1 1 1
X1 X 1.0 0 0 0 0
X2 X 1.0 0 0 0 0
X3 X 1.0 0 0 0 0
Y1 1.0 1 0 0 0
Y2 1.0 1 0 0 0
Y3 1.0 1 0 0 0
X1 X 2.0 0 0 0 0
X2 X 2.0 0 0 0 0
X3 X 2.0 0 0 0 0
Y1 2.0 1 0 0 0
Y2 2.0 1 0 0 0
Y3 2.0 1 0 0 0

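I want to compute the mode of the data columns for each groupid within each thresholds value, so that the uid column is aggregated away: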

groupid thresholds class data1 data2 data3 data4
X 0.0 0 1 1 1
0.0 1 1 1 1
X 1.0 0 0 0 0
1.0 1 0 0 0
X 2.0 0 0 0 0
2.0 1 0 0 0

I tried using a window function in Polars:

lf.with_column(
    col("^data[0-9]*$")
        .mode()
        .over([col("thresholds"), col("groupid")]),
)

But I get this error:

the length of the window expression did not match that of the group

Error originated in expression: 'col("data1").mode().over([col("thresholds"), col("groupid")])'

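I also tried the over_with_options method with the WindowMapping::Join option (and then exploding the resulting columns), but on larger datasets this takes too long, and the Polars documentation notes that this operation is memory intensive.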

I think part of the problem is that I have a hard time conceptualizing what this error means.

rust aggregate mode rust-polars polars
1 answer

You can use group_by (note: not groupby) and aggregate with agg. Here is how:

import polars as pl

data = {
    "uid": ["X1", "X2", "X3", "Y1", "Y2", "Y3", "X1", "X2", "X3", "Y1", "Y2", "Y3", "X1", "X2", "X3", "Y1", "Y2", "Y3"],
    "groupid": ["X", "X", "X", "Y", "Y", "Y", "X", "X", "X", "Y", "Y", "Y", "X", "X", "X", "Y", "Y", "Y"],
    "thresholds": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0],
    "class": [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
    "data1": [None, 1, 1, 1, 1, 1, None, 0, 0, 0, 0, 0, None, 0, 0, 0, 0, 0],
    "data2": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "data3": [1, 1, 1, None, None, None, 0, 0, 0, None, None, None, 0, 0, 0, None, None, None],
    "data4": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}

df = pl.DataFrame(data)

result = df.group_by(["groupid", "thresholds"]).agg(
    [
        pl.col("class").mode().alias("class"),
        pl.col("data1").mode().alias("data1"),
        pl.col("data2").mode().alias("data2"),
        pl.col("data3").mode().alias("data3"),
        pl.col("data4").mode().alias("data4"),
    ]
)

print(result)

This gives:

shape: (6, 7)
┌─────────┬────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ groupid ┆ thresholds ┆ class     ┆ data1     ┆ data2     ┆ data3     ┆ data4     │
│ ---     ┆ ---        ┆ ---       ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│ str     ┆ f64        ┆ list[i64] ┆ list[i64] ┆ list[i64] ┆ list[i64] ┆ list[i64] │
╞═════════╪════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ X       ┆ 1.0        ┆ [0]       ┆ [0]       ┆ [0]       ┆ [0]       ┆ [0]       │
│ X       ┆ 0.0        ┆ [0]       ┆ [1]       ┆ [1]       ┆ [1]       ┆ [1]       │
│ Y       ┆ 0.0        ┆ [1]       ┆ [1]       ┆ [1]       ┆ [null]    ┆ [1]       │
│ Y       ┆ 2.0        ┆ [1]       ┆ [0]       ┆ [0]       ┆ [null]    ┆ [0]       │
│ X       ┆ 2.0        ┆ [0]       ┆ [0]       ┆ [0]       ┆ [0]       ┆ [0]       │
│ Y       ┆ 1.0        ┆ [1]       ┆ [0]       ┆ [0]       ┆ [null]    ┆ [0]       │
└─────────┴────────────┴───────────┴───────────┴───────────┴───────────┴───────────┘