How to do group_by_dynamic followed by unstack?

Question

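Background

I have a Polars DataFrame ("df") consisting of a time column ("date"), possibly an ID column ("id"), and a number of numeric columns (the features). Its shape is (14852, 431).

The df holds data about financial transactions. The ID identifies the client; the date is the first day of the month in which the transactions took place. The features are averages of some kind (e.g. average amount of money spent, number of transactions, etc.).

After suitable manipulation, this df will be fed into a machine learning model for training.

Aim

Qualitatively, I am trying to do the following: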

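For each unique ID (client), create a 6-month sliding window;
Restrict the df to the transactions of that ID within that time window;
"Unstack" its rows. That is: if a given ID has 6 transactions in a given time window, i.e. we have a `restricted_df` of shape (6, 431), I need it replaced by an `unstacked_df` of shape (1, 431 * 6). In practice, if `restricted_df` has a feature named "money_spent", then `unstacked_df` should contain something like "money_spent_0", "money_spent_1", ..., "money_spent_5".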

Approach

Although I am quite new to Polars, I think I know the pieces of the puzzle. As far as I can tell, they are:

group_by_dynamic("date", every="180d", period="180d", group_by="id")

```
unstack(step=1, columns=features)
```

However, I can't really get them to work together, at least not efficiently. See the inefficient solution below.

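Problem

I think the main issue is that, as far as I understand, after `group_by_dynamic` Polars expects `.agg` to be applied to individual columns, e.g. via `pl.col("foo").some_function()`. However, Series does not provide an unstack method, so this doesn't quite work.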

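Attempted solution

A very inefficient approach is to convert the Series above into a DataFrame and then unstack that. However, this alone does not fully solve the problem. In fact, we just end up with a df that has the same 431 columns, where every cell contains a DataFrame (the one we unstacked).

This is obtained via: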

df.group_by_dynamic("date", every="180d", period="180d", group_by="id").agg(pl.col(features).apply(lambda x: pl.DataFrame(x).unstack(step=1)))

Schematically, for the feature "foo", we end up with something of the form

|       foo           |
|col_0 1, col_1 2, ...|

instead of the desired

|foo_0|foo_1|...|
|  1  |  2  |...|

To fix this, we can insert a `to_dict()` and call `unnest` at the end. This is obtained via:

df.group_by_dynamic("date", every="180d", period="180d", group_by="id").agg(pl.col(features).map_elements(lambda x: pl.DataFrame(x).unstack(step=1).to_dict())).unnest()

Question

This works, but it is clearly inefficient and feels like overkill to me. Is there a better way to get this done?

Minimal example

import numpy as np
import polars as pl
from datetime import date

# Generate fake data
ids = [1]*6 + [2]*6
start = date(2023, 1, 1)
end =  date(2023, 12, 1)
dates = pl.date_range(start, end, "1mo", name="date", eager=True)
foos = np.arange(0, 12)
bars = np.arange(12, 24)

# Generate df
df = pl.DataFrame({"id":ids, "date":dates, "foo":foos, "bar":bars})

# Print df
print(df)
┌─────┬────────────┬─────┬─────┐
│ id  ┆ date       ┆ foo ┆ bar │
│ --- ┆ ---        ┆ --- ┆ --- │
│ i64 ┆ date       ┆ i64 ┆ i64 │
╞═════╪════════════╪═════╪═════╡
│ 1   ┆ 2023-01-01 ┆ 0   ┆ 12  │
│ 1   ┆ 2023-02-01 ┆ 1   ┆ 13  │
│ 1   ┆ 2023-03-01 ┆ 2   ┆ 14  │
│ 1   ┆ 2023-04-01 ┆ 3   ┆ 15  │
│ …   ┆ …          ┆ …   ┆ …   │
│ 2   ┆ 2023-09-01 ┆ 8   ┆ 20  │
│ 2   ┆ 2023-10-01 ┆ 9   ┆ 21  │
│ 2   ┆ 2023-11-01 ┆ 10  ┆ 22  │
│ 2   ┆ 2023-12-01 ┆ 11  ┆ 23  │
└─────┴────────────┴─────┴─────┘

# Group df as required
grouped_df = df.group_by_dynamic("date", every="180d", period="180d", group_by="id")

# Check group content
for _name, group in grouped_df:
    print(group)

shape: (6, 4)
┌─────┬────────────┬─────┬─────┐
│ id  ┆ date       ┆ foo ┆ bar │
│ --- ┆ ---        ┆ --- ┆ --- │
│ i64 ┆ date       ┆ i64 ┆ i64 │
╞═════╪════════════╪═════╪═════╡
│ 1   ┆ 2023-01-01 ┆ 0   ┆ 12  │
│ 1   ┆ 2023-02-01 ┆ 1   ┆ 13  │
│ 1   ┆ 2023-03-01 ┆ 2   ┆ 14  │
│ 1   ┆ 2023-04-01 ┆ 3   ┆ 15  │
│ 1   ┆ 2023-05-01 ┆ 4   ┆ 16  │
│ 1   ┆ 2023-06-01 ┆ 5   ┆ 17  │
└─────┴────────────┴─────┴─────┘
shape: (6, 4)
┌─────┬────────────┬─────┬─────┐
│ id  ┆ date       ┆ foo ┆ bar │
│ --- ┆ ---        ┆ --- ┆ --- │
│ i64 ┆ date       ┆ i64 ┆ i64 │
╞═════╪════════════╪═════╪═════╡
│ 2   ┆ 2023-07-01 ┆ 6   ┆ 18  │
│ 2   ┆ 2023-08-01 ┆ 7   ┆ 19  │
│ 2   ┆ 2023-09-01 ┆ 8   ┆ 20  │
│ 2   ┆ 2023-10-01 ┆ 9   ┆ 21  │
│ 2   ┆ 2023-11-01 ┆ 10  ┆ 22  │
│ 2   ┆ 2023-12-01 ┆ 11  ┆ 23  │
└─────┴────────────┴─────┴─────┘

# Manipulation
result = ...

# Expected output after correct manipulation
print(result)

shape: (2, 14)
┌─────┬────────────┬───────┬───────┬───┬───────┬───────┬───────┬───────┐
│ id  ┆ date       ┆ foo_0 ┆ foo_1 ┆ … ┆ bar_2 ┆ bar_3 ┆ bar_4 ┆ bar_5 │
│ --- ┆ ---        ┆ ---   ┆ ---   ┆   ┆ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64 ┆ date       ┆ i64   ┆ i64   ┆   ┆ i64   ┆ i64   ┆ i64   ┆ i64   │
╞═════╪════════════╪═══════╪═══════╪═══╪═══════╪═══════╪═══════╪═══════╡
│ 1   ┆ 2023-01-01 ┆ 0     ┆ 1     ┆ … ┆ 14    ┆ 15    ┆ 16    ┆ 17    │
│ 2   ┆ 2023-07-01 ┆ 6     ┆ 7     ┆ … ┆ 20    ┆ 21    ┆ 22    ┆ 23    │
└─────┴────────────┴───────┴───────┴───┴───────┴───────┴───────┴───────┘

Answer 1

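It looks like `.to_struct` may be the missing piece you need?

We can use `n_field_strategy="max_width"` to ensure that all results have the same "length".

`fields=` can accept a callable, which you need in this case in order to add the column name as a prefix.

You can then unnest the resulting struct columns: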

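features = "foo", "bar"

# Uses the current API names, matching the question:
# group_by_dynamic / group_by= / .list (formerly groupby_dynamic / by= / .arr)
(df.group_by_dynamic(index_column="date", group_by="id", every="6mo")
   .agg(pl.col(features))  # one list per window and feature
   .with_columns(
      pl.col(feature)
        .list.to_struct(
           fields = lambda idx, feature=feature: f"{feature}_{idx}",
           n_field_strategy = "max_width"
        )
      for feature in features
   )
   .unnest(*features)  # expand each struct into foo_0, foo_1, ...
)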

shape: (2, 14)
┌─────┬────────────┬───────┬───────┬───┬───────┬───────┬───────┬───────┐
│ id  ┆ date       ┆ foo_0 ┆ foo_1 ┆ … ┆ bar_2 ┆ bar_3 ┆ bar_4 ┆ bar_5 │
│ --- ┆ ---        ┆ ---   ┆ ---   ┆   ┆ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64 ┆ date       ┆ i64   ┆ i64   ┆   ┆ i64   ┆ i64   ┆ i64   ┆ i64   │
╞═════╪════════════╪═══════╪═══════╪═══╪═══════╪═══════╪═══════╪═══════╡
│ 1   ┆ 2023-01-01 ┆ 0     ┆ 1     ┆ … ┆ 14    ┆ 15    ┆ 16    ┆ 17    │
│ 2   ┆ 2023-07-01 ┆ 6     ┆ 7     ┆ … ┆ 20    ┆ 21    ┆ 22    ┆ 23    │
└─────┴────────────┴───────┴───────┴───┴───────┴───────┴───────┴───────┘
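If every window is known in advance to contain exactly six rows, the same result can also be sketched without structs by indexing into the aggregated lists. This is an alternative under that assumption, not part of the answer above: `window_len` is a hypothetical name, and the data is the toy frame from the question rebuilt without NumPy.

```python
from datetime import date

import polars as pl

df = pl.DataFrame({
    "id": [1] * 6 + [2] * 6,
    "date": [date(2023, m, 1) for m in range(1, 13)],
    "foo": list(range(0, 12)),
    "bar": list(range(12, 24)),
})

features = ["foo", "bar"]
window_len = 6  # assumption: every window holds exactly this many rows

result = (
    df.group_by_dynamic("date", every="6mo", group_by="id")
    .agg(pl.col(features))  # one list per window and feature
    .select(
        "id",
        "date",
        # pick the i-th element of each list as its own column
        *[
            pl.col(f).list.get(i).alias(f"{f}_{i}")
            for f in features
            for i in range(window_len)
        ],
    )
    .sort("id")
)
print(result)
```

With windows of varying length this sketch would need extra handling for out-of-range indices, which is what `n_field_strategy="max_width"` is meant to take care of in the struct-based version.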

The reason for `feature=feature` in `lambda idx, feature=feature:` is the late-binding behavior of lambdas defined inside loops/comprehensions in Python.

https://docs.python.org/3/faq/programming.html#why-do-lambdas-defined-in-a-loop-with-different-values-all-return-the-same-result
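A small pure-Python sketch of that late-binding behavior, independent of Polars. The names `late` and `bound` are made up for illustration.

```python
features = ("foo", "bar")

# Without a default argument, every lambda looks up `feature` when it is
# called, by which time the comprehension has finished on "bar".
late = [lambda idx: f"{feature}_{idx}" for feature in features]

# A default argument is evaluated when the lambda is created, so each
# lambda keeps the value of `feature` from its own iteration.
bound = [lambda idx, feature=feature: f"{feature}_{idx}" for feature in features]

late_names = [fn(0) for fn in late]
bound_names = [fn(0) for fn in bound]
print(late_names)   # ['bar_0', 'bar_0']
print(bound_names)  # ['foo_0', 'bar_0']
```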
