I have data loaded into Polars that looks like this:
uid | groupid | thresholds | class | data1 | data2 | data3 | data4 |
---|---|---|---|---|---|---|---|
X1 | X | 0.0 | 0 | 1 | 1 | 1 | |
X2 | X | 0.0 | 0 | 1 | 1 | 1 | |
X3 | X | 0.0 | 0 | 1 | 1 | 1 | |
Y1 | Y | 0.0 | 1 | 1 | 1 | 1 | |
Y2 | Y | 0.0 | 1 | 1 | 1 | 1 | |
Y3 | Y | 0.0 | 1 | 1 | 1 | 1 | |
X1 | X | 1.0 | 0 | 0 | 0 | 0 | |
X2 | X | 1.0 | 0 | 0 | 0 | 0 | |
X3 | X | 1.0 | 0 | 0 | 0 | 0 | |
Y1 | Y | 1.0 | 1 | 0 | 0 | 0 | |
Y2 | Y | 1.0 | 1 | 0 | 0 | 0 | |
Y3 | Y | 1.0 | 1 | 0 | 0 | 0 | |
X1 | X | 2.0 | 0 | 0 | 0 | 0 | |
X2 | X | 2.0 | 0 | 0 | 0 | 0 | |
X3 | X | 2.0 | 0 | 0 | 0 | 0 | |
Y1 | Y | 2.0 | 1 | 0 | 0 | 0 | |
Y2 | Y | 2.0 | 1 | 0 | 0 | 0 | |
Y3 | Y | 2.0 | 1 | 0 | 0 | 0 |
I want to compute the mode of the data columns for each groupid within each thresholds group, so that the uid column is aggregated away:
groupid | thresholds | class | data1 | data2 | data3 | data4 |
---|---|---|---|---|---|---|
X | 0.0 | 0 | 1 | 1 | 1 | |
Y | 0.0 | 1 | 1 | 1 | 1 | |
X | 1.0 | 0 | 0 | 0 | 0 | |
Y | 1.0 | 1 | 0 | 0 | 0 | |
X | 2.0 | 0 | 0 | 0 | 0 | |
Y | 2.0 | 1 | 0 | 0 | 0 |
I have tried using a window function in Polars:
lf.with_column(
    col("^data[0-9]*$")
        .mode()
        .over([col("thresholds"), col("groupid")]),
)
but I get the error:
the length of the window expression did not match that of the group
Error originated in expression: 'col("data1").mode().over([col("thresholds"), col("groupid")])'
I tried the over_with_options method with the WindowMapping::Join option (and then exploding the resulting columns), but on a larger dataset this takes far too long, and the Polars documentation notes that this mapping is memory-intensive. I think part of the problem is that I'm having a hard time conceptualizing what this error actually means.
The length-mismatch error most likely comes from mode() itself: it can return more than one value per group (ties), so the result cannot be mapped back onto the rows of each window. Instead of a window expression, you can group_by (note: group_by, not the older groupby) and aggregate with agg. Here's how:
import polars as pl
data = {
    "uid": ["X1", "X2", "X3", "Y1", "Y2", "Y3", "X1", "X2", "X3", "Y1", "Y2", "Y3", "X1", "X2", "X3", "Y1", "Y2", "Y3"],
    "groupid": ["X", "X", "X", "Y", "Y", "Y", "X", "X", "X", "Y", "Y", "Y", "X", "X", "X", "Y", "Y", "Y"],
    "thresholds": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0],
    "class": [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
    "data1": [None, 1, 1, 1, 1, 1, None, 0, 0, 0, 0, 0, None, 0, 0, 0, 0, 0],
    "data2": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "data3": [1, 1, 1, None, None, None, 0, 0, 0, None, None, None, 0, 0, 0, None, None, None],
    "data4": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
df = pl.DataFrame(data)
# Aggregate per (groupid, thresholds); mode() returns the most frequent
# value(s) of each column as a list per group.
result = df.group_by(["groupid", "thresholds"]).agg(
    [
        pl.col("class").mode().alias("class"),
        pl.col("data1").mode().alias("data1"),
        pl.col("data2").mode().alias("data2"),
        pl.col("data3").mode().alias("data3"),
        pl.col("data4").mode().alias("data4"),
    ]
)
print(result)
This gives:
shape: (6, 7)
┌─────────┬────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ groupid ┆ thresholds ┆ class ┆ data1 ┆ data2 ┆ data3 ┆ data4 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ list[i64] ┆ list[i64] ┆ list[i64] ┆ list[i64] ┆ list[i64] │
╞═════════╪════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ X ┆ 1.0 ┆ [0] ┆ [0] ┆ [0] ┆ [0] ┆ [0] │
│ X ┆ 0.0 ┆ [0] ┆ [1] ┆ [1] ┆ [1] ┆ [1] │
│ Y ┆ 0.0 ┆ [1] ┆ [1] ┆ [1] ┆ [null] ┆ [1] │
│ Y ┆ 2.0 ┆ [1] ┆ [0] ┆ [0] ┆ [null] ┆ [0] │
│ X ┆ 2.0 ┆ [0] ┆ [0] ┆ [0] ┆ [0] ┆ [0] │
│ Y ┆ 1.0 ┆ [1] ┆ [0] ┆ [0] ┆ [null] ┆ [0] │
└─────────┴────────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
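Because mode() can return several values when there are ties, the aggregated columns come back as list[i64]. If you want plain scalar columns like in your desired output, one option is to take the first mode inside the aggregation; a regex column selector also saves repeating each data column. A minimal sketch, assuming the same df as above and that an arbitrary tie-break is acceptable:

import polars as pl

# Sketch: keep a single mode per group (arbitrary if there is a tie) and
# select all data columns at once with a regex pattern.
result = df.group_by(["groupid", "thresholds"], maintain_order=True).agg(
    pl.col("class").mode().first(),
    pl.col("^data[0-9]+$").mode().first(),
)
print(result)

Here maintain_order=True just keeps the groups in their order of first appearance; drop it if you don't care about the row order of the result.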