How do I calculate durations in a Polars DataFrame?

Question

I have the following DataFrame:

import datetime

import polars as pl


df = pl.DataFrame(
    {
        "idx": [259, 123],
        "timestamp": [
            [
                datetime.datetime(2023, 4, 20, 1, 45),
                datetime.datetime(2023, 4, 20, 1, 51, 7),
                datetime.datetime(2023, 4, 20, 2, 29, 50),
            ],
            [
                datetime.datetime(2023, 4, 19, 6, 0, 1),
                datetime.datetime(2023, 4, 19, 6, 0, 17),
                datetime.datetime(2023, 4, 19, 6, 0, 26),
                datetime.datetime(2023, 4, 19, 19, 53, 29),
                datetime.datetime(2023, 4, 19, 19, 54, 4),
                datetime.datetime(2023, 4, 19, 19, 57, 52),
            ],
        ],
    }
)

print(df)
# Output
shape: (2, 2)
┌─────┬───────────────────────────────────────────────────────────────────┐
│ idx ┆ timestamp                                                         │
│ --- ┆ ---                                                               │
│ i64 ┆ list[datetime[μs]]                                                │
╞═════╪═══════════════════════════════════════════════════════════════════╡
│ 259 ┆ [2023-04-20 01:45:00, 2023-04-20 01:51:07, 2023-04-20 02:29:50]   │
│ 123 ┆ [2023-04-19 06:00:01, 2023-04-19 06:00:17, … 2023-04-19 19:57:52] │
└─────┴───────────────────────────────────────────────────────────────────┘

I want to know the total duration for each idx, so I did this:

df = df.with_columns(
    pl.col("timestamp")
    .apply(lambda x: [x[i + 1] - x[i] for i in range(len(x)) if i + 1 < len(x)])
    .alias("duration")
)

This gives me:

shape: (2, 2)
┌─────┬─────────────────────┐
│ idx ┆ duration            │
│ --- ┆ ---                 │
│ i64 ┆ list[duration[μs]]  │
╞═════╪═════════════════════╡
│ 259 ┆ [6m 7s, 38m 43s]    │
│ 123 ┆ [16s, 9s, … 3m 48s] │
└─────┴─────────────────────┘

Now, in Pandas, I would use total_seconds when calling apply and sum the list, like this:

df["duration"] = (
    df["timestamp"]
    .apply(
        lambda x: sum(
            [(x[i + 1] - x[i]).total_seconds() for i in range(len(x)) if i + 1 < len(x)]
        )
    )
    .astype(int)
)

which gives me the expected result:

print(df[["idx", "duration"]])
# Output

   idx  duration
0  259      2690
1  123     50271

What is the equivalent idiomatic way to do this in Polars?

Answer 1

The list type has an arr.diff method, whose result can then be summed, and the total seconds can be computed with dt.seconds:

df.select(
    "idx",
    duration=pl.col("timestamp")
        .arr.diff(null_behavior="drop")
        .arr.sum()
        .dt.seconds(),
)

┌─────┬──────────┐
│ idx ┆ duration │
│ --- ┆ ---      │
│ i64 ┆ i64      │
╞═════╪══════════╡
│ 259 ┆ 2690     │
│ 123 ┆ 50271    │
└─────┴──────────┘
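
To make the chain above concrete, here is a minimal plain-Python sketch of what it computes for a single row: take the differences between consecutive timestamps, sum them, and convert to whole seconds. The helper name total_seconds_between is hypothetical and only serves the illustration; it is not part of Polars.

```python
import datetime


def total_seconds_between(stamps):
    """Sum the gaps between consecutive timestamps, in whole seconds."""
    # Pairwise differences, analogous to a diff that drops the leading null.
    gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
    # Sum the timedeltas (starting from a zero timedelta), then convert.
    return int(sum(gaps, datetime.timedelta()).total_seconds())


# Timestamps for idx 259, copied from the question.
row_259 = [
    datetime.datetime(2023, 4, 20, 1, 45),
    datetime.datetime(2023, 4, 20, 1, 51, 7),
    datetime.datetime(2023, 4, 20, 2, 29, 50),
]

print(total_seconds_between(row_259))  # 2690
```

The two gaps are 6m 7s and 38m 43s, which add up to the 2690 seconds shown in the table.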

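In this case, the consecutive differences cancel pairwise when summed, so an equivalent expression is to subtract the first element of the list from the last one: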

duration=(
  pl.col("timestamp").arr.last() - pl.col("timestamp").arr.first()
).dt.seconds()
