Accumulating lists in Polars

Question

Suppose I have a `pl.DataFrame()` with 2 columns: the first column contains `Date` and the second contains `List[str]`.

import polars as pl

df = pl.DataFrame([
    pl.Series('Date', [2000, 2001, 2002]),
    pl.Series('Ids', [
        ['a'], 
        ['b', 'c'], 
        ['d'], 
    ])
])

Date	Ids
2000	`['a']`
2001	`['b', 'c']`
2002	`['d']`

Is it possible, in Polars, to accumulate the `List[str]` column so that each row contains its own list plus all previous lists? Like this:

Date	Ids
2000	`['a']`
2001	`['a', 'b', 'c']`
2002	`['a', 'b', 'c', 'd']`
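
For reference, the desired result can be pinned down in plain Python before looking for a Polars expression. This is only a sketch of the target semantics using `itertools.accumulate`; the variable names are illustrative and nothing here is Polars API.

```python
from itertools import accumulate

ids = [['a'], ['b', 'c'], ['d']]

# Each output row is its own list appended to everything that came before it.
# `acc + cur` builds a new list, so the input lists are left untouched.
cumulative_ids = list(accumulate(ids, lambda acc, cur: acc + cur))

print(cumulative_ids)
# [['a'], ['a', 'b', 'c'], ['a', 'b', 'c', 'd']]
```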

Answer 1

This looks like a rolling group-by?

(df.groupby_rolling(index_column="Date", period=f"{df.height}i")
   .agg(pl.col("Ids").flatten()))

shape: (3, 2)
┌──────┬─────────────────────┐
│ Date ┆ Ids                 │
│ ---  ┆ ---                 │
│ i64  ┆ list[str]           │
╞══════╪═════════════════════╡
│ 2000 ┆ ["a"]               │
│ 2001 ┆ ["a", "b", "c"]     │
│ 2002 ┆ ["a", "b", ... "d"] │
└──────┴─────────────────────┘

The `index_column` isn't particularly relevant to your use case; we're just using `Date` here because it is an `int`.

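Instead, a common approach is to add a "row count" column and use that as the index.

A cast is needed for it to work with `.groupby_rolling`, because `.with_row_count()` produces an unsigned integer column: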

(df.with_row_count()
   .with_columns(pl.col("row_nr").cast(pl.Int64))
   .groupby_rolling(index_column="row_nr", period=f"{df.height}i")
   .agg(pl.exclude("row_nr")))

shape: (3, 3)
┌────────┬────────────────────┬────────────────────────────┐
│ row_nr ┆ Date               ┆ Ids                        │
│ ---    ┆ ---                ┆ ---                        │
│ i64    ┆ list[i64]          ┆ list[list[str]]            │
╞════════╪════════════════════╪════════════════════════════╡
│ 0      ┆ [2000]             ┆ [["a"]]                    │
│ 1      ┆ [2000, 2001]       ┆ [["a"], ["b", "c"]]        │
│ 2      ┆ [2000, 2001, 2002] ┆ [["a"], ["b", "c"], ["d"]] │
└────────┴────────────────────┴────────────────────────────┘
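
A plain-Python sketch of the window rule may help show why a row count is a safer index than `Date`. It assumes that a window with period `p` on an integer index covers the index values in `(t - p, t]` for each row's value `t`, which I believe is the default (right-closed) behaviour. `rolling_flatten` is a hypothetical helper written for illustration, not a Polars function.

```python
def rolling_flatten(index, lists, period):
    """For each index value t, flatten the lists whose index lies in (t - period, t]."""
    out = []
    for t in index:
        window = [lst for i, lst in zip(index, lists) if t - period < i <= t]
        out.append([item for lst in window for item in lst])
    return out

ids = [['a'], ['b', 'c'], ['d']]
n = len(ids)  # plays the role of df.height

# Consecutive index values: a period equal to the height reaches back to the first row.
consecutive = rolling_flatten([2000, 2001, 2002], ids, period=n)

# Index values with gaps: the same period no longer reaches the earlier rows.
gapped = rolling_flatten([2000, 2005, 2010], ids, period=n)

# A row count is always consecutive, so it behaves like the first case.
by_row = rolling_flatten(list(range(n)), ids, period=n)

print(consecutive)  # [['a'], ['a', 'b', 'c'], ['a', 'b', 'c', 'd']]
print(gapped)       # [['a'], ['b', 'c'], ['d']]
print(by_row)       # [['a'], ['a', 'b', 'c'], ['a', 'b', 'c', 'd']]
```

Under that assumption, using `Date` directly only accumulates everything when its values happen to be consecutive, whereas a row count works regardless of what the dates are.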

You can use `.last()` on the columns where you only need the original value:

(df.with_row_count()
   .with_columns(pl.col("row_nr").cast(pl.Int64))
   .groupby_rolling(index_column="row_nr", period=f"{df.height}i")
   .agg(pl.col("Date").last(), pl.col("Ids").flatten()))

shape: (3, 3)
┌────────┬──────┬─────────────────────┐
│ row_nr ┆ Date ┆ Ids                 │
│ ---    ┆ ---  ┆ ---                 │
│ i64    ┆ i64  ┆ list[str]           │
╞════════╪══════╪═════════════════════╡
│ 0      ┆ 2000 ┆ ["a"]               │
│ 1      ┆ 2001 ┆ ["a", "b", "c"]     │
│ 2      ┆ 2002 ┆ ["a", "b", ... "d"] │
└────────┴──────┴─────────────────────┘

Using `.arange()` is another way to get a "row count". It produces an `int`, which lets you skip the `.cast`, and some people prefer it.

(df.with_columns(row_nr = pl.arange(0, pl.count()))
   .groupby_rolling(index_column="row_nr", period=f"{df.height}i")
   .agg(pl.col("Date").last(), pl.col("Ids").flatten()))

shape: (3, 3)
┌────────┬──────┬─────────────────────┐
│ row_nr ┆ Date ┆ Ids                 │
│ ---    ┆ ---  ┆ ---                 │
│ i64    ┆ i64  ┆ list[str]           │
╞════════╪══════╪═════════════════════╡
│ 0      ┆ 2000 ┆ ["a"]               │
│ 1      ┆ 2001 ┆ ["a", "b", "c"]     │
│ 2      ┆ 2002 ┆ ["a", "b", ... "d"] │
└────────┴──────┴─────────────────────┘

Answer 2

Here is what I have so far:

df.with_columns(
    pl.col('Ids').cumulative_eval(pl.element().list.explode().implode())
)

shape: (3, 2)
┌──────┬──────────────────────┐
│ Date ┆ Ids                  │
│ ---  ┆ ---                  │
│ i64  ┆ list[str]            │
╞══════╪══════════════════════╡
│ 2000 ┆ ["a"]                │
│ 2001 ┆ ["a", "b", "c"]      │
│ 2002 ┆ ["a", "b", "c", "d"] │
└──────┴──────────────────────┘
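
Conceptually, `cumulative_eval` evaluates the inner expression on a growing prefix of the column, one prefix per row. A rough plain-Python equivalent of the expression above is sketched below; `cumulative_eval_flatten` is a hypothetical name used only for illustration.

```python
def cumulative_eval_flatten(lists):
    """For row k, flatten rows 0..k into a single list (one growing prefix per row)."""
    out = []
    for end in range(1, len(lists) + 1):
        prefix = lists[:end]  # rows 0 .. end-1
        out.append([item for lst in prefix for item in lst])
    return out

result = cumulative_eval_flatten([['a'], ['b', 'c'], ['d']])
print(result)
# [['a'], ['a', 'b', 'c'], ['a', 'b', 'c', 'd']]
```

Because every prefix is flattened again from scratch, the work grows roughly quadratically with the number of rows, which may matter on large frames.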

If anyone has something better, I'll accept their answer.
