Optimizing a timedelta calculation between two dates with different "weekmask" logic in Polars

Question

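I'm really impressed by the Polars library and am trying to learn it better. :)

Right now I'm trying to calculate the number of days between two dates for millions of rows in Polars, but for some rows I need to exclude certain weekdays. In Pandas/NumPy I used np.busday_count, where I could define a weekmask specifying which weekdays are counted for each condition, and exclude holidays when needed.

I'm having trouble counting the days quickly when conditions are involved, because I can't figure out how to do this in expressions.

Example dataframe: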

import numpy as np
import polars as pl

df = (pl
    .DataFrame({"Market": ["AT", "DE", "AT", "CZ", "GB", "CZ"],
              "Service": ["Standard", "Express", "Standard", "Standard", "Standard", "Standard"],
              "Day1": ["2022-01-02","2022-01-03", "2022-01-04", "2022-01-05", "2022-01-06", "2022-01-07"],
              "Day2": ["2022-01-03","2022-01-04", "2022-01-05", "2022-01-06", "2022-01-07", "2022-01-08"]
             }
            )
    .with_columns(pl.col(["Day1", "Day2"]).str.strptime(pl.Date, "%Y-%m-%d"))
)

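I was able to pass the data to np.busday_count through the struct and apply methods. However, this is much slower on the real dataset (34.4 s) than the Pandas assign version (262 ms).

Below is the code I came up with in Polars. I'm looking for a faster way to do this.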

(df
    .with_columns(
        pl.struct([pl.col("Day1"), pl.col("Day2")])
        .apply(lambda x: np.busday_count(x["Day1"], x["Day2"], weekmask='1110000'))
        .alias("Result"))
)

Edit, expected output:

┌────────┬──────────┬────────────┬────────────┬────────┐
│ Market ┆ Service  ┆ Day1       ┆ Day2       ┆ Result │
│ ---    ┆ ---      ┆ ---        ┆ ---        ┆ ---    │
│ str    ┆ str      ┆ date       ┆ date       ┆ i64    │
╞════════╪══════════╪════════════╪════════════╪════════╡
│ AT     ┆ Standard ┆ 2022-01-02 ┆ 2022-01-03 ┆ 0      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ DE     ┆ Express  ┆ 2022-01-03 ┆ 2022-01-04 ┆ 1      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ AT     ┆ Standard ┆ 2022-01-04 ┆ 2022-01-05 ┆ 1      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ CZ     ┆ Standard ┆ 2022-01-05 ┆ 2022-01-06 ┆ 1      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ GB     ┆ Standard ┆ 2022-01-06 ┆ 2022-01-07 ┆ 0      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ CZ     ┆ Standard ┆ 2022-01-07 ┆ 2022-01-08 ┆ 0      │
└────────┴──────────┴────────────┴────────────┴────────┘
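For reference, the expected values above can be reproduced with a small pure-Python sketch. It assumes the same semantics as np.busday_count: the weekmask string starts on Monday, and days are counted in the half-open interval from Day1 (inclusive) to Day2 (exclusive). The helper name count_days is hypothetical, and the sketch assumes Day1 <= Day2.

```python
from datetime import date, timedelta


def count_days(start: date, end: date, weekmask: str = "1110000") -> int:
    """Count days in [start, end) whose weekday is enabled in weekmask.

    weekmask is a 7-character string, Monday first, "1" = counted.
    Hypothetical helper for illustration; assumes start <= end.
    """
    total = 0
    day = start
    while day < end:
        # date.weekday() returns 0 for Monday ... 6 for Sunday
        if weekmask[day.weekday()] == "1":
            total += 1
        day += timedelta(days=1)
    return total


# The Day1/Day2 pairs from the example dataframe
pairs = [
    (date(2022, 1, 2), date(2022, 1, 3)),
    (date(2022, 1, 3), date(2022, 1, 4)),
    (date(2022, 1, 4), date(2022, 1, 5)),
    (date(2022, 1, 5), date(2022, 1, 6)),
    (date(2022, 1, 6), date(2022, 1, 7)),
    (date(2022, 1, 7), date(2022, 1, 8)),
]
results = [count_days(a, b) for a, b in pairs]
print(results)
```

With the mask '1110000' only Monday, Tuesday and Wednesday are counted, which is why the rows starting on Sunday, Thursday and Friday give 0.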

Answer 1

You can now do this in Polars, but you need the polars-xdt plugin:

import polars_xdt as xdt

df.with_columns(
    result = xdt.workday_count(
        'Day1',
        'Day2',
        weekend=['Thu', 'Fri', 'Sat', 'Sun'],
    )
)

Output:

Out[6]:
shape: (6, 5)
┌────────┬──────────┬────────────┬────────────┬────────┐
│ Market ┆ Service  ┆ Day1       ┆ Day2       ┆ result │
│ ---    ┆ ---      ┆ ---        ┆ ---        ┆ ---    │
│ str    ┆ str      ┆ date       ┆ date       ┆ i32    │
╞════════╪══════════╪════════════╪════════════╪════════╡
│ AT     ┆ Standard ┆ 2022-01-02 ┆ 2022-01-03 ┆ 0      │
│ DE     ┆ Express  ┆ 2022-01-03 ┆ 2022-01-04 ┆ 1      │
│ AT     ┆ Standard ┆ 2022-01-04 ┆ 2022-01-05 ┆ 1      │
│ CZ     ┆ Standard ┆ 2022-01-05 ┆ 2022-01-06 ┆ 1      │
│ GB     ┆ Standard ┆ 2022-01-06 ┆ 2022-01-07 ┆ 0      │
│ CZ     ┆ Standard ┆ 2022-01-07 ┆ 2022-01-08 ┆ 0      │
└────────┴──────────┴────────────┴────────────┴────────┘

This is several orders of magnitude faster than using map_elements (formerly apply), and it also works in Polars lazy mode.

Answer 2

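When you use map_elements in the select context, you create a Python dictionary and feed it to the lambda for every element in the column. This is expensive.

You can take advantage of vectorization by using map_batches instead of map_elements. That way the whole columns are sent to NumPy's busday_count at once.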

(df
    .with_columns(
        pl.struct("Day1", "Day2")
        .map_batches(lambda x: np.busday_count(x.struct["Day1"], x.struct["Day2"], weekmask='1110000'))
        .alias("Result"))
)
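np.busday_count takes a single weekmask per call, so when the mask depends on a per-row condition, one possible approach is to call it once per distinct mask on the matching rows and write the results back into one array. The sketch below works on plain NumPy arrays; the rule that "Express" rows count Monday to Friday is a made-up assumption for illustration and is not part of the question.

```python
import numpy as np

# Columns from the example dataframe, as NumPy arrays
service = np.array(
    ["Standard", "Express", "Standard", "Standard", "Standard", "Standard"]
)
day1 = np.array(
    ["2022-01-02", "2022-01-03", "2022-01-04",
     "2022-01-05", "2022-01-06", "2022-01-07"],
    dtype="datetime64[D]",
)
day2 = day1 + np.timedelta64(1, "D")

# Hypothetical rule: one weekmask per service level (Monday first)
weekmasks = {"Standard": "1110000", "Express": "1111100"}

result = np.zeros(len(day1), dtype=np.int64)
for name, weekmask in weekmasks.items():
    rows = service == name  # boolean selector for this condition
    # One vectorized call per distinct weekmask, not one per row
    result[rows] = np.busday_count(day1[rows], day2[rows], weekmask=weekmask)

print(result)
```

The number of NumPy calls grows with the number of distinct masks rather than with the number of rows, so the per-row Python overhead is avoided.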
