Generic solution for passing through names of expressions with selectors in Polars

Question

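In Polars (1.7.1), passing the name of an expression through various operations does not seem to be straightforward.

However, I feel it is different enough to warrant a separate question.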

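More generally, the problem is that we want to apply expressions involving multiple columns (strictly speaking, multiple input expressions), and we want the resulting name to be carried over from the "anchor" expression, i.e. the expression on which the method is called (self._lhs_expr below).

Below is a self-contained example that illustrates this.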

import traceback
import polars as pl


@pl.api.register_expr_namespace("new")
class NewExtensions:
    _lhs_expr: pl.Expr

    def __init__(self, lhs_expr: pl.Expr):
        # Capture the expression on which the methods are to be called.
        self._lhs_expr = lhs_expr

    def reorder_1(self, weight: pl.Expr) -> pl.Expr:
        # This does not seem to work correctly, despite the superficial expectation it would.
        return weight.sort_by(self._lhs_expr).alias(self._lhs_expr.meta.output_name())

    def reorder_2(self, weight: pl.Expr) -> pl.Expr:
        # This "works". It is a kludge that may have unintended consequences, and is not necessarily generalisable to other types.
        return (self._lhs_expr * 0) + weight.sort_by(self._lhs_expr)


df = pl.DataFrame(
    data=dict(
        weight=[5.1, 4.1, 3.1, 2.1, 1.1], a=[1, 5, 3, 4, 2], b=[-3, -2, -5, -2, -4]
    )
)

# Working.
print(
    "Following works.  Note that the reordering of weight is labeled as 'a' as required."
)
print(
    df.select(
        pl.col("weight"),
        pl.col("a").new.reorder_1(pl.col("weight")),
    )
)


# First failure mode, despite the .alias() inside .reorder_1(), the output name of .new.reorder_1() is weight_reorder
print(
    "Following almost works, except the output name is not a_reorder, but weight_reorder!"
)
print(
    df.select(
        pl.col("a"),
        pl.col("weight"),
        pl.col("a").new.reorder_1(pl.col("weight")).name.suffix("_reorder"),
    )
)

# Second (exception) failure mode, using a selector means the names don't get correctly fixed.
print(
    "The following throws an exception, as it cannot determine the names of all the columns."
)
try:
    print(df.select(pl.selectors.numeric().new.reorder_1(pl.col("weight"))))
except pl.exceptions.ComputeError as e:
    print(traceback.format_exc())


# Working kludge.
print(
    "The following 'works', including the name flowing through to .suffix() but it cannot possibly be the right way to do this."
)
print(df.select(pl.selectors.numeric().new.reorder_2(pl.col("weight")).name.suffix("_")))

The issue is also visible in the Polars documentation itself, in the example for the register_expr_namespace decorator, reproduced below.

def register_expr_namespace(name: str) -> Callable[[type[NS]], type[NS]]:
    """
    Decorator for registering custom functionality with a Polars Expr.

    Parameters
    ----------
    name
        Name under which the functionality will be accessed.

    See Also
    --------
    register_dataframe_namespace: Register functionality on a DataFrame.
    register_lazyframe_namespace: Register functionality on a LazyFrame.
    register_series_namespace: Register functionality on a Series.

    Examples
    --------
    >>> @pl.api.register_expr_namespace("pow_n")
    ... class PowersOfN:
    ...     def __init__(self, expr: pl.Expr):
    ...         self._expr = expr
    ...
    ...     def next(self, p: int) -> pl.Expr:
    ...         return (p ** (self._expr.log(p).ceil()).cast(pl.Int64)).cast(pl.Int64)
    ...
    ...     def previous(self, p: int) -> pl.Expr:
    ...         return (p ** (self._expr.log(p).floor()).cast(pl.Int64)).cast(pl.Int64)
    ...
    ...     def nearest(self, p: int) -> pl.Expr:
    ...         return (p ** (self._expr.log(p)).round(0).cast(pl.Int64)).cast(pl.Int64)
    >>>
    >>> df = pl.DataFrame([1.4, 24.3, 55.0, 64.001], schema=["n"])
    >>> df.select(
    ...     pl.col("n"),
    ...     pl.col("n").pow_n.next(p=2).alias("next_pow2"),
    ...     pl.col("n").pow_n.previous(p=2).alias("prev_pow2"),
    ...     pl.col("n").pow_n.nearest(p=2).alias("nearest_pow2"),
    ... )
    shape: (4, 4)
    ┌────────┬───────────┬───────────┬──────────────┐
    │ n      ┆ next_pow2 ┆ prev_pow2 ┆ nearest_pow2 │
    │ ---    ┆ ---       ┆ ---       ┆ ---          │
    │ f64    ┆ i64       ┆ i64       ┆ i64          │
    ╞════════╪═══════════╪═══════════╪══════════════╡
    │ 1.4    ┆ 2         ┆ 1         ┆ 1            │
    │ 24.3   ┆ 32        ┆ 16        ┆ 32           │
    │ 55.0   ┆ 64        ┆ 32        ┆ 64           │
    │ 64.001 ┆ 128       ┆ 64        ┆ 64           │
    └────────┴───────────┴───────────┴──────────────┘

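In the above, everything works because the names of the result columns are set explicitly, but the default name is "literal" rather than the more intuitive name of self._expr.

This is a maddening, frustrating problem, and it is really not clear what the right solution should be.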

Answer 1

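Unfortunately, there is no right way to do what you want, because of this note:

This will undo any previous renaming operations on the expression.

Due to implementation constraints, this method can only be called as the last expression in a chain. Only one name operation per expression will work. Consider using .name.map for advanced renaming.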

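I can think of two workarounds. The first is to give your function a name parameter, so that you use it instead of trying to use .name, which you know will not work. That parameter takes a callable, just like name.map does. The second is a lot like the approach in reorder_2, but it uses a struct, which is more extensible than multiplying by 0 and adding. It works by creating a struct whose first field is the column you want to use as the default name and whose second field is the computation. Outside the struct, you rename the fields and then extract the second one. If no other naming operation is performed, your renaming of the struct fields takes effect. If someone calls .name on the result, the first field of the struct is used. Either way, the name is preserved. Since Polars does not copy the data, this is not as bad as the syntax suggests.

Here they both are: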

from typing import Callable

import polars as pl


@pl.api.register_expr_namespace("new")
class NewExtensions:
    _lhs_expr: pl.Expr

    def __init__(self, lhs_expr: pl.Expr):
        # Capture the expression on which the methods are to be called.
        self._lhs_expr = lhs_expr

    def name_alias(
        self, weight: pl.Expr, map_alias: Callable[[str], str] | None = None
    ) -> pl.Expr:
        # Take the anchor's name, optionally transform it, and alias the result.
        root_out = self._lhs_expr.meta.output_name()
        if map_alias is not None:
            root_out = map_alias(root_out)
        return weight.sort_by(self._lhs_expr).alias(root_out)

    def struct_hack(self, weight: pl.Expr) -> pl.Expr:
        # The first field carries the anchor (and its name); the second is the result.
        name = self._lhs_expr.meta.output_name()
        return (
            pl.struct(self._lhs_expr, weight.sort_by(self._lhs_expr))
            .struct.rename_fields(["zzz", name])
            .struct.field(name)
        )

Here is how each of them works in practice:

df.select(
    pl.col('a').new.struct_hack(pl.col('weight')).name.suffix('fjf'),
    pl.col('a').new.name_alias(pl.col('weight'), lambda x: f"{x}_zzz")
    )

shape: (5, 2)
┌──────┬───────┐
│ afjf ┆ a_zzz │
│ ---  ┆ ---   │
│ f64  ┆ f64   │
╞══════╪═══════╡
│ 5.1  ┆ 5.1   │
│ 1.1  ┆ 1.1   │
│ 3.1  ┆ 3.1   │
│ 2.1  ┆ 2.1   │
│ 4.1  ┆ 4.1   │
└──────┴───────┘

As a complete aside, I would suggest adding this to your functions:

def some_func(self, rhs: str | pl.Expr)->pl.Expr:
    if isinstance(rhs, str):
        rhs=pl.col(rhs)
    ...

# Now you can do
df.select(pl.col('a').new.some_func('z'))
# or if you like verbosity you can still do
df.select(pl.col('a').new.some_func(pl.col('z')))

Answer 2

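I think I have a workable solution, and would appreciate feedback.

It is a generalisation of the original multiply-by-zero-and-add approach, but the trick is to use .when, .then and .otherwise. In this case the value of .then is (in principle) never evaluated, but its name(s) are applied to the resulting expression.

Code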

import polars as pl
from typing import Union


@pl.api.register_expr_namespace("new")
class NewExtensions:
    _lhs_expr: pl.Expr

    def __init__(self, lhs_expr: pl.Expr):
        # Capture the expression on which the methods are to be called.
        self._lhs_expr = lhs_expr

    def reorder_3(self, weight: Union[pl.Expr, str]) -> pl.Expr:
        if isinstance(weight, str):
            weight = pl.col(weight)
        # This appears to work in a more generic manner than reorder_2 in the original example.
        return (
            pl.when(False) 
            .then(self._lhs_expr) 
            .otherwise(weight.sort_by(self._lhs_expr))
        )


df = pl.DataFrame(
    data=dict(
        weight=[5.1, 4.1, 3.1, 2.1, 1.1], 
             a=[1, 5, 3, 4, 2], 
             b=[-3, -2, -5, -2, -4],
             c=["cat","dog","giraffe","bear","monkey"]
    )
)

# Working with single column, specified using a string.
print(df.select(pl.col("a").new.reorder_3("weight")))

# Working with a selector, specified using pl.col(), also allowing name.suffix.
print(df.with_columns(pl.selectors.all().new.reorder_3(pl.col("weight")).name.suffix("_")))

Output

shape: (5, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ 5.1 │
│ 1.1 │
│ 3.1 │
│ 2.1 │
│ 4.1 │
└─────┘
shape: (5, 8)
┌────────┬─────┬─────┬─────────┬─────────┬─────┬─────┬─────┐
│ weight ┆ a   ┆ b   ┆ c       ┆ weight_ ┆ a_  ┆ b_  ┆ c_  │
│ ---    ┆ --- ┆ --- ┆ ---     ┆ ---     ┆ --- ┆ --- ┆ --- │
│ f64    ┆ i64 ┆ i64 ┆ str     ┆ f64     ┆ f64 ┆ f64 ┆ str │
╞════════╪═════╪═════╪═════════╪═════════╪═════╪═════╪═════╡
│ 5.1    ┆ 1   ┆ -3  ┆ cat     ┆ 1.1     ┆ 5.1 ┆ 3.1 ┆ 2.1 │
│ 4.1    ┆ 5   ┆ -2  ┆ dog     ┆ 2.1     ┆ 1.1 ┆ 1.1 ┆ 5.1 │
│ 3.1    ┆ 3   ┆ -5  ┆ giraffe ┆ 3.1     ┆ 3.1 ┆ 5.1 ┆ 4.1 │
│ 2.1    ┆ 4   ┆ -2  ┆ bear    ┆ 4.1     ┆ 2.1 ┆ 4.1 ┆ 3.1 │
│ 1.1    ┆ 2   ┆ -4  ┆ monkey  ┆ 5.1     ┆ 4.1 ┆ 2.1 ┆ 1.1 │
└────────┴─────┴─────┴─────────┴─────────┴─────┴─────┴─────┘

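I am fairly satisfied that this does the right thing with names. However, I would like to hear opinions on whether the .then() in .when().then().otherwise() is evaluated, copied, or otherwise has a significant impact on performance. My expectation is "no", because Polars seems to be very smart when evaluating expression chains, especially when operating in lazy mode. Given that, apart from taking the name, this should be a no-op!

If people think this is the best we can do, perhaps we should bring the broader issue to the attention of the Polars team. It feels like there should be a standard, hack-free way to do this.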
