How to write a Poisson CDF as a Python Polars expression

Question

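I have a collection of Polars expressions used to generate features for an ML model. I would like to add a Poisson CDF feature to this collection while keeping lazy execution (with its benefits of speed, caching, etc.). So far I have not found an easy way to achieve this.

I have been able to get the result I want outside of the desired lazy expression framework: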

import polars as pl
from scipy.stats import poisson

df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
result = poisson.cdf(df["count"].to_numpy(), df["expected_count"].to_numpy())
df = df.with_columns(pl.Series(result).alias("poission_cdf"))

However, ideally I would like it to look like this:

df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
df = df.select(
    [
        ... # bunch of other expressions here
        poisson_cdf()
    ]
)

where poisson_cdf is some Polars expression, such as:

def poisson_cdf():
    # this is just for illustration, clearly wont work
    return scipy.stats.poisson.cdf(pl.col("count"), pl.col("expected_count")).alias("poisson_cdf")

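I also tried using a struct made up of "count" and "expected_count" and applying a custom function with apply, as the documentation suggests. However, my dataset actually has millions of rows, which leads to absurd execution times.

Any advice or guidance would be appreciated. Ideally, does an expression like this already exist? Thanks in advance!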

Answer 1

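If scipy.stats.poisson.cdf were implemented as a proper numpy universal function, you could use it directly on Polars expressions, but it is not. Fortunately, the Poisson CDF is almost the same as the regularized upper incomplete gamma function: for an integer k >= 0 and mean mu, P(X <= k) = gammaincc(k + 1, mu). scipy provides it as scipy.special.gammaincc, which can be used in Polars expressions: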

>>> import polars as pl
>>> from scipy.special import gammaincc
>>> df = pl.select(pl.arange(0, 10).alias('k'))
>>> df.with_columns(cdf=gammaincc(pl.col('k') + 1, 4.0))
shape: (10, 2)
┌─────┬──────────┐
│ k   ┆ cdf      │
│ --- ┆ ---      │
│ i64 ┆ f64      │
╞═════╪══════════╡
│ 0   ┆ 0.018316 │
│ 1   ┆ 0.091578 │
│ 2   ┆ 0.238103 │
│ 3   ┆ 0.43347  │
│ ... ┆ ...      │
│ 6   ┆ 0.889326 │
│ 7   ┆ 0.948866 │
│ 8   ┆ 0.978637 │
│ 9   ┆ 0.991868 │
└─────┴──────────┘

The result is the same as what poisson.cdf returns:

>>> from scipy.stats import poisson
>>> _.with_columns(cdf2=pl.lit(poisson.cdf(df['k'], 4)))
shape: (10, 3)
┌─────┬──────────┬──────────┐
│ k   ┆ cdf      ┆ cdf2     │
│ --- ┆ ---      ┆ ---      │
│ i64 ┆ f64      ┆ f64      │
╞═════╪══════════╪══════════╡
│ 0   ┆ 0.018316 ┆ 0.018316 │
│ 1   ┆ 0.091578 ┆ 0.091578 │
│ 2   ┆ 0.238103 ┆ 0.238103 │
│ 3   ┆ 0.43347  ┆ 0.43347  │
│ ... ┆ ...      ┆ ...      │
│ 6   ┆ 0.889326 ┆ 0.889326 │
│ 7   ┆ 0.948866 ┆ 0.948866 │
│ 8   ┆ 0.978637 ┆ 0.978637 │
│ 9   ┆ 0.991868 ┆ 0.991868 │
└─────┴──────────┴──────────┘
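
The agreement is expected: for an integer k >= 0 and mean mu, P(X <= k) equals the regularized upper incomplete gamma function Q(k + 1, mu), which is why the gammaincc call uses pl.col('k') + 1. As a rough, library-free sanity check, the sketch below sums the Poisson pmf directly. poisson_cdf_sum is a hypothetical helper name, not part of scipy or Polars.

```python
import math


def poisson_cdf_sum(k: int, mu: float) -> float:
    """P(X <= k) for X ~ Poisson(mu), by direct summation of the pmf.

    Hypothetical helper for illustration only; practical for small k.
    """
    return math.fsum(
        math.exp(-mu) * mu**i / math.factorial(i) for i in range(k + 1)
    )


# Compare against the first rows of the table above (mu = 4.0)
for k in range(4):
    print(k, round(poisson_cdf_sum(k, 4.0), 6))
```

For mu = 4.0 the printed values should agree with the cdf column above to the displayed precision.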

Answer 2

It sounds like you want to use .map() instead of .apply(), since it passes the whole column at once.

df.select([
   pl.all(),
   # ...
   pl.struct(["count", "expected_count"])
     .map(lambda x: 
        poisson.cdf(x.struct.field("count"), x.struct.field("expected_count")))
     .flatten()
     .alias("poisson_cdf")
])

shape: (5, 3)
┌───────┬────────────────┬─────────────┐
│ count | expected_count | poisson_cdf │
│ ---   | ---            | ---         │
│ i64   | f64            | f64         │
╞═══════╪════════════════╪═════════════╡
│ 9     | 7.7            | 0.75308     │
│ 2     | 0.2            | 0.998852    │
│ 3     | 0.7            | 0.994247    │
│ 4     | 1.1            | 0.994565    │
│ 5     | 7.5            | 0.241436    │
└───────┴────────────────┴─────────────┘

Answer 3

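You want to take advantage of the fact that scipy has a set of functions that are numpy ufuncs, since those still get fast columnar operation through the NumPy API.

Specifically, you want the pdtr function (scipy.special.pdtr).

You then want to use reduce rather than map or apply, since those are meant for generic Python functions and won't perform as well.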

So if we have...

df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
result = poisson.cdf(df["count"].to_numpy(), df["expected_count"].to_numpy())
df = df.with_columns(pl.Series(result).alias("poission_cdf"))

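then we can add

from scipy.special import pdtr

# pdtr(k, m) is a NumPy ufunc, so reduce applies it to whole columns at once
df = df.with_columns(
    pl.reduce(pdtr, [pl.col("count"), pl.col("expected_count")]).alias("poicdf")
)
df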

shape: (5, 4)
┌───────┬────────────────┬──────────────┬──────────┐
│ count ┆ expected_count ┆ poission_cdf ┆ poicdf   │
│ ---   ┆ ---            ┆ ---          ┆ ---      │
│ i64   ┆ f64            ┆ f64          ┆ f64      │
╞═══════╪════════════════╪══════════════╪══════════╡
│ 9     ┆ 7.7            ┆ 0.75308      ┆ 0.75308  │
│ 2     ┆ 0.2            ┆ 0.998852     ┆ 0.998852 │
│ 3     ┆ 0.7            ┆ 0.994247     ┆ 0.994247 │
│ 4     ┆ 1.1            ┆ 0.994565     ┆ 0.994565 │
│ 5     ┆ 7.5            ┆ 0.241436     ┆ 0.241436 │
└───────┴────────────────┴──────────────┴──────────┘

You can see that it gives the same answer.
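
To check, without involving Polars, that pdtr is a NumPy ufunc and matches poisson.cdf, a small sketch along these lines should work, assuming scipy and numpy are installed. The variable names are illustrative.

```python
import numpy as np
from scipy.special import pdtr
from scipy.stats import poisson

# Same sample data as in the question
k = np.array([9, 2, 3, 4, 5])
mu = np.array([7.7, 0.2, 0.7, 1.1, 7.5])

# pdtr(k, m) is the Poisson CDF evaluated elementwise over both arrays
via_pdtr = pdtr(k, mu)
via_scipy_stats = poisson.cdf(k, mu)

print(isinstance(pdtr, np.ufunc))
print(np.allclose(via_pdtr, via_scipy_stats))
```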
