How do I compute a cumulative sum over conditional values in Polars or Pandas?

Question

I have a problem that I would prefer to solve with Polars, although Pandas is fine too. Suppose we have the following (sample) dataset:

import polars as pl

df = pl.from_repr("""
┌─────────────────────┬───────────┬───────────────────┐
│ date                ┆ customers ┆ is_reporting_day? │
│ ---                 ┆ ---       ┆ ---               │
│ datetime[ns]        ┆ i64       ┆ bool              │
╞═════════════════════╪═══════════╪═══════════════════╡
│ 2022-01-01 00:00:00 ┆ 3         ┆ true              │
│ 2022-01-02 00:00:00 ┆ 4         ┆ false             │
│ 2022-01-03 00:00:00 ┆ 5         ┆ false             │
│ 2022-01-04 00:00:00 ┆ 3         ┆ false             │
│ 2022-01-05 00:00:00 ┆ 2         ┆ true              │
└─────────────────────┴───────────┴───────────────────┘
""")

# df = df.to_pandas() # for a pandas solution

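What I want to get is:

If `is_reporting_day?` is `True`, keep the customers value as it is.
If `is_reporting_day?` is `False`, sum the customers over those rows (4 + 5 + 3 = 12) and add that total to the next row where `is_reporting_day?` is `True` (12 + 2 = 14).

After applying the transformation, it should look like this: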

┌─────────────────────┬───────────┬───────────────────┬─────────┐
│ date                ┆ customers ┆ is_reporting_day? ┆ cum_sum │
│ ---                 ┆ ---       ┆ ---               ┆ ---     │
│ datetime[ns]        ┆ i64       ┆ str               ┆ i64     │
╞═════════════════════╪═══════════╪═══════════════════╪═════════╡
│ 2022-01-01 00:00:00 ┆ 3         ┆ True              ┆ 3       │
│ 2022-01-05 00:00:00 ┆ 2         ┆ True              ┆ 14      │
└─────────────────────┴───────────┴───────────────────┴─────────┘

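I tried using `cum_sum()` together with `pl.when` in Polars, but that logic is wrong: it accumulates from the very beginning, i.e. from the first day (there are about 700 days in total), instead of restarting after each reporting day.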

Note: the solution should be dynamic, because the gap between reporting days varies: sometimes there is 1 non-reporting day in between, sometimes 2, and so on.
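One way to pin down the rule described above is a plain-Python sketch (not a Polars or Pandas solution). The function name `sum_until_reporting_day` is hypothetical, and dropping trailing non-reporting rows that have no later reporting day is an assumption, since nothing above says what should happen to them.

```python
from typing import Iterable, List, Tuple


def sum_until_reporting_day(
    rows: Iterable[Tuple[str, int, bool]]
) -> List[Tuple[str, int, int]]:
    """Return (date, customers, cum_sum) for each reporting day."""
    out = []
    running = 0  # customers accumulated since the previous reporting day
    for date, customers, is_reporting_day in rows:
        running += customers
        if is_reporting_day:
            out.append((date, customers, running))
            running = 0  # start a fresh window after each reporting day
    # Assumption: leftover non-reporting rows at the end are dropped.
    return out


rows = [
    ("2022-01-01", 3, True),
    ("2022-01-02", 4, False),
    ("2022-01-03", 5, False),
    ("2022-01-04", 3, False),
    ("2022-01-05", 2, True),
]
result = sum_until_reporting_day(rows)
print(result)
```

Because the running total resets on every reporting day, the size of the gap between reporting days does not matter.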

Any ideas or input would be highly appreciated. Thanks in advance!


Answer 1

One way to approach this is to create groups based on the `is_reporting_day?` column.

If we take the `date` only where the flag is `True`:

df.select(
   pl.when("is_reporting_day?").then(pl.col("date"))
)

shape: (5, 1)
┌────────────┐
│ date       │
│ ---        │
│ date       │
╞════════════╡
│ 2022-01-01 │
│ null       │
│ null       │
│ null       │
│ 2022-01-05 │
└────────────┘

Then we can `.backward_fill()` the `date`, so that each reporting date also covers the `False` rows that precede it.

date = pl.when("is_reporting_day?").then(pl.col("date"))

(df.group_by(date.backward_fill(), maintain_order=True)
   .agg(
      pl.all().last(),
      pl.sum("customers").name.suffix("_sum")  # Expr.suffix was replaced by Expr.name.suffix
   )
)

shape: (2, 4)
┌────────────┬───────────┬───────────────────┬───────────────┐
│ date       ┆ customers ┆ is_reporting_day? ┆ customers_sum │
│ ---        ┆ ---       ┆ ---               ┆ ---           │
│ date       ┆ i64       ┆ bool              ┆ i64           │
╞════════════╪═══════════╪═══════════════════╪═══════════════╡
│ 2022-01-01 ┆ 3         ┆ true              ┆ 3             │
│ 2022-01-05 ┆ 2         ┆ true              ┆ 14            │
└────────────┴───────────┴───────────────────┴───────────────┘

Answer 2

Assuming the dates are already sorted, use `groupby.agg`:

out = (df.groupby(df['is_reporting_day?'].shift(fill_value=False).cumsum(), as_index=False)
         .agg({'date': 'max', 'customers': 'sum', 'is_reporting_day?': 'max'})
      )

Output:

         date  customers  is_reporting_day?
0  2022-01-01          3               True
1  2022-01-05         14               True
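The grouper in the snippet above is easier to follow when split into two steps. This is a minimal sketch on the sample data; the names `starts_new_group` and `group_id` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=5, freq="D"),
    "customers": [3, 4, 5, 3, 2],
    "is_reporting_day?": [True, False, False, False, True],
})

# Shift the flag down one row: True now marks the first row *after* a reporting day.
starts_new_group = df["is_reporting_day?"].shift(fill_value=False)

# The running count of those markers gives one integer id per window.
group_id = starts_new_group.cumsum()

print(df.assign(group_id=group_id))

# Summing customers per id gives the per-window totals.
totals = df.groupby(group_id)["customers"].sum()
print(totals)
```

Each reporting day closes its own window, so it shares an id with the non-reporting rows that precede it.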

If you need both the original value and the sum of "customers":

out = (df.groupby(df['is_reporting_day?'].shift(fill_value=False).cumsum(), as_index=False)
         .agg(**{'date': ('date', 'max'),
                 'customers': ('customers', 'last'),
                 'is_reporting_day?': ('is_reporting_day?', 'max'),
                 'customers_sum': ('customers', 'sum'),
                })
      )

Output:

         date  customers  is_reporting_day?  customers_sum
0  2022-01-01          3               True              3
1  2022-01-05          2               True             14

Alternative:

out = (
    df.assign(date=df['date'].where(df['is_reporting_day?']).bfill())
      .groupby('date', as_index=False)  # 'date' is the group key, so it is kept as a column
      .agg(**{'customers': ('customers', 'last'),
              'is_reporting_day?': ('is_reporting_day?', 'max'),
              'customers_sum': ('customers', 'sum'),
             })
)
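One property of the back-fill variant worth knowing: rows after the last reporting day get a missing key, and `groupby` drops missing keys by default (`dropna=True`). The sketch below adds a hypothetical sixth non-reporting row to show this; `report_date` is an illustrative column name.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=6, freq="D"),
    "customers": [3, 4, 5, 3, 2, 7],
    # The sixth row is a hypothetical non-reporting day with no later reporting day.
    "is_reporting_day?": [True, False, False, False, True, False],
})

# Keep the date on reporting days only, then fill backwards.
# The trailing row has nothing to fill from, so its key stays NaT.
report_date = df["date"].where(df["is_reporting_day?"]).bfill()

out = (
    df.assign(report_date=report_date)
      .groupby("report_date", as_index=False)  # NaT keys are dropped by default
      .agg(customers=("customers", "last"), customers_sum=("customers", "sum"))
)
print(out)
```

If those trailing rows should be kept instead, passing `dropna=False` to `groupby` retains them as their own group.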

Answer 3

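# A new group starts on the first non-reporting row after a reporting day.
# Here the flag column of df1 is named is_reporting_day (without the "?").
col1 = (df1.is_reporting_day.eq(False) & df1.is_reporting_day.shift().eq(True)).cumsum()

# Keep the last row of each group and attach the group's customer total.
df1.groupby(col1, group_keys=False).apply(
    lambda dd: dd.tail(1).assign(customers2=dd['customers'].sum()))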

Output:

         date  customers  is_reporting_day  customers2
0  2022-01-01          3              True           3
4  2022-01-05          2              True          14
