Polars - compute on all other values in the window except this one

Question

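For each row, I'm trying to compute the standard deviation of the other values in the group, excluding that row's own value. One way to think about it is "what would the group's standard deviation be if this row's value were removed". An example may be easier to parse: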

import polars as pl

df = pl.DataFrame(
    {
        "answer": ["yes","yes","yes","yes","maybe","maybe","maybe"],
        "value": [5,10,7,8,6,9,10],
    }
)

┌────────┬───────┐
│ answer ┆ value │
│ ---    ┆ ---   │
│ str    ┆ i64   │
╞════════╪═══════╡
│ yes    ┆ 5     │
│ yes    ┆ 10    │
│ yes    ┆ 7     │
│ yes    ┆ 8     │
│ maybe  ┆ 6     │
│ maybe  ┆ 9     │
│ maybe  ┆ 10    │
└────────┴───────┘

I want to add a column whose first row would be std([10,7,8]) = 1.527525

I tried to hack something together, but I ended up with code that is awful to read and that also has a bug I don't know how to fix:

df.with_columns(
    (
        (pl.col("value").sum().over(pl.col("answer")) - pl.col("value"))
        / (pl.col("value").count().over(pl.col("answer")) - 1)
    ).alias("average_other")
).with_columns(
    (
        (
            (
                (pl.col("value") - pl.col("average_other")).pow(2).sum().over(pl.col("answer"))
                - (pl.col("value") - pl.col("average_other")).pow(2)
            )
            / (pl.col("value").count().over(pl.col("answer")) - 1)
        ).sqrt()
    ).alias("std_dev_other")
)

I'm not sure I'd recommend parsing it, but I'll point out at least one thing that is wrong:

(pl.col("value") - pl.col("average_other")).pow(2).sum().over(pl.col("answer"))

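I want to compare "value" in each row of the window with "average_other" from THIS row, then square and sum over the window. Instead, I am comparing "value" in each row with the "average_other" of that same row.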
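To make the mismatch concrete, here is a small plain-Python sketch (no Polars; the names loo_mean, mixed and intended are made up for illustration) that evaluates both versions for the first row of the "yes" group. It assumes the sample standard deviation, which is what the expected 1.527525 corresponds to; for that, the divisor for the n - 1 remaining values is n - 2 rather than n - 1.

```python
import math
import statistics

values = [5, 10, 7, 8]  # the "yes" group from the example
n = len(values)
total = sum(values)


def loo_mean(i):
    """Mean of the group with element i left out ("average_other")."""
    return (total - values[i]) / (n - 1)


i = 0  # look at the first row (value 5)

# What the window sum in the attempt effectively computes: every other
# value is compared with ITS OWN average_other.
mixed = sum((values[j] - loo_mean(j)) ** 2 for j in range(n) if j != i)

# What was intended: every other value is compared with row i's average_other.
intended = sum((values[j] - loo_mean(i)) ** 2 for j in range(n) if j != i)

# Sample standard deviation of the n - 1 remaining values
std_other = math.sqrt(intended / (n - 2))

# Reference value computed directly on the remaining values
reference = statistics.stdev(values[:i] + values[i + 1:])
```

Here std_other agrees with reference, while mixed is a different sum of squares, which is why the attempt cannot produce the expected column.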

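My main question is the "what is the best way to get the standard deviation while ignoring this value?" part. But I'm also interested in whether there is a way to do the comparison I got wrong above. Third, I'd welcome tips on how to write this in a way that is easy to understand.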

Answer 1

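My approach (at least at first glance) would be to create three helper columns. The first is a row index, the second is a windowed list of the values in the group, and the last is a windowed list of the row indices. Next, I'd explode by the two lists just mentioned. With that in place, you can filter out the rows where the actual row index equals the row index from the list. That lets you run std on the values, grouped by row index, with each row's own value already filtered out. You can then join that result back to the original.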

df = pl.DataFrame(
    {
        "answer": ["yes","yes","yes","yes","maybe","maybe","maybe"],
        "value": [5,10,7,8,6,9,10],
    }
)

df.with_row_index('i').join(
    df.with_row_index('i')
        .with_columns(
            # windowed lists of the group's values and row indices
            pl.col('value').over('answer', mapping_strategy='join').alias('l'),
            pl.col('i').over('answer', mapping_strategy='join').alias('il'))
        .explode(['l', 'il'])
        .filter(pl.col('i') != pl.col('il'))  # drop each row's own value
        .group_by('i').agg(pl.col('l').std().alias('std')),
    on='i').sort('i').drop('i')

shape: (7, 3)
┌────────┬───────┬──────────┐
│ answer ┆ value ┆ std      │
│ ---    ┆ ---   ┆ ---      │
│ str    ┆ i64   ┆ f64      │
╞════════╪═══════╪══════════╡
│ yes    ┆ 5     ┆ 1.527525 │
│ yes    ┆ 10    ┆ 1.527525 │ 
│ yes    ┆ 7     ┆ 2.516611 │
│ yes    ┆ 8     ┆ 2.516611 │
│ maybe  ┆ 6     ┆ 0.707107 │
│ maybe  ┆ 9     ┆ 2.828427 │
│ maybe  ┆ 10    ┆ 2.12132  │
└────────┴───────┴──────────┘

Answer 2

I came up with something similar to @DeanMacGregor's answer:

df = (
    df.with_row_index()
    .join(df.with_row_index(), on="answer")
    .filter(pl.col("index") != pl.col("index_right"))
    .group_by("answer", "index_right", maintain_order=True).agg(
        pl.col("value_right").first().alias("value"),
        pl.col("value").std().alias("stdev"),
    )
    .drop("index_right")
)

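I .join df (with a row index added) to itself on answer and drop the rows where both row indices are the same. Then I group by answer and index_right and (1) pick the first group item from value_right, and (2) compute the standard deviation of the value group.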

For the input

df = pl.DataFrame({
    "answer": ["yes", "yes", "yes", "yes", "yes", "maybe", "maybe", "maybe", "maybe"],
    "value": [5, 10, 7, 8, 4, 6, 9, 10, 4],
})

the result is:

┌────────┬───────┬──────────┐
│ answer ┆ value ┆ stdev    │
│ ---    ┆ ---   ┆ ---      │
│ str    ┆ i64   ┆ f64      │
╞════════╪═══════╪══════════╡
│ yes    ┆ 5     ┆ 2.5      │
│ yes    ┆ 10    ┆ 1.825742 │
│ yes    ┆ 7     ┆ 2.753785 │
│ yes    ┆ 8     ┆ 2.645751 │
│ yes    ┆ 4     ┆ 2.081666 │
│ maybe  ┆ 6     ┆ 3.21455  │
│ maybe  ┆ 9     ┆ 3.05505  │
│ maybe  ┆ 10    ┆ 2.516611 │
│ maybe  ┆ 4     ┆ 2.081666 │
└────────┴───────┴──────────┘
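The self-join in this answer can be pictured in plain Python as a pair of nested loops. This is only an illustration of the logic, not of how Polars executes it; the names rows and others are made up, and it uses the nine-row input from this answer.

```python
import statistics
from collections import defaultdict

answers = ["yes", "yes", "yes", "yes", "yes", "maybe", "maybe", "maybe", "maybe"]
values = [5, 10, 7, 8, 4, 6, 9, 10, 4]
rows = list(enumerate(zip(answers, values)))  # (index, (answer, value))

# "Self-join on answer" plus "filter index != index_right":
# for each target row, collect the values of the OTHER rows in its group.
others = defaultdict(list)
for idx_right, (ans_right, _) in rows:
    for idx, (ans, val) in rows:
        if ans == ans_right and idx != idx_right:
            others[idx_right].append(val)

# "group_by index_right" plus "std of value" (sample standard deviation)
stdev = [statistics.stdev(others[i]) for i in range(len(values))]
print([round(s, 6) for s in stdev])
```

The pairing step produces on the order of n * (n - 1) rows for a group of size n, which is worth keeping in mind for large groups.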

Answer 3

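I'm not sure whether my solution retains the (performance) benefits of using polars (instead of regular pandas), but I find it easier to maintain and more readable than the other answers. Because it iterates over the data by converting each Series to a list, I don't expect it to scale well, but maybe your use case doesn't need that.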

Starting with the data (credit to this answer for providing a minimal working dataset):

import polars as pl
import statistics as stats

df = pl.DataFrame(
    {
        "answer": ["yes", "yes", "yes", "yes", "maybe", "maybe", "maybe"],
        "value": [5, 10, 7, 8, 6, 9, 10],
    }
)
df

shape: (7, 2)
┌────────┬───────┐
│ answer ┆ value │
│ ---    ┆ ---   │
│ str    ┆ i64   │
╞════════╪═══════╡
│ yes    ┆ 5     │
│ yes    ┆ 10    │
│ yes    ┆ 7     │
│ yes    ┆ 8     │
│ maybe  ┆ 6     │
│ maybe  ┆ 9     │
│ maybe  ┆ 10    │
└────────┴───────┘

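Now define a custom function that does the actual calculation (credit to this answer for an example of a custom Polars aggregation function). That is, take a list of values, step through it one element at a time, and compute the standard deviation of the values in the list except the current one: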

def custom_std(args: list[pl.Series]) -> pl.Series:
    output = []
    # Iterate over the values within the group
    for idx in range(0, len(args[0])):
        # Convert the Series to a list, as Polars does not have a method that can delete individual elements
        temp = args[0].to_list()
        # Delete the value of the current row
        del temp[idx]
        # Now calculate the std. dev. on the remaining values
        # I arbitrarily chose the sample standard deviation; adjust according to your situation
        result = stats.stdev(temp)
        # Store the result in a list and go to the next iteration
        output.append(result)
    # The dtype is not correct, but I can't find how to specify that this series contains a list of floats
    return pl.Series(output, dtype=pl.Float64)

Use this function in a group_by:

gdf = df.group_by("answer", maintain_order=True).agg(pl.map_groups(function=custom_std, exprs=["value"]))
gdf

┌────────┬─────────────────────────────────┐
│ answer ┆ value                           │
│ ---    ┆ ---                             │
│ str    ┆ list[f64]                       │
╞════════╪═════════════════════════════════╡
│ yes    ┆ [1.527525, 1.527525, … 2.51661… │
│ maybe  ┆ [0.707107, 2.828427, 2.12132]   │
└────────┴─────────────────────────────────┘

To get it into the desired format, explode the resulting DataFrame:

gdf.explode("value")

┌────────┬──────────┐
│ answer ┆ value    │
│ ---    ┆ ---      │
│ str    ┆ f64      │
╞════════╪══════════╡
│ yes    ┆ 1.527525 │
│ yes    ┆ 1.527525 │
│ yes    ┆ 2.516611 │
│ yes    ┆ 2.516611 │
│ maybe  ┆ 0.707107 │
│ maybe  ┆ 2.828427 │
│ maybe  ┆ 2.12132  │
└────────┴──────────┘
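Because the per-group logic in custom_std is ordinary Python, it can be checked in isolation before it is wired into Polars. The sketch below is a standalone variant (the name leave_one_out_stdev is made up here, and it takes a plain list rather than a list of Series). It uses slicing instead of copying the list and deleting an element, and like the answer it assumes the sample standard deviation.

```python
import statistics


def leave_one_out_stdev(values: list[float]) -> list[float]:
    """For each position, the sample std. dev. of all the other values."""
    if len(values) < 3:
        # statistics.stdev needs at least two data points after removal
        raise ValueError("need at least three values per group")
    return [
        statistics.stdev(values[:i] + values[i + 1:])
        for i in range(len(values))
    ]


yes = leave_one_out_stdev([5, 10, 7, 8])
maybe = leave_one_out_stdev([6, 9, 10])
print(yes, maybe)
```

Each call still does work proportional to the square of the group size, so this is a readability aid rather than a performance fix.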
