Filling columns independently

Question

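I have a Python class with two data attributes: the first is a Polars time series, the second is a list of strings.

A dictionary provides a mapping from strings to functions: each element of the string list is associated with a function that returns a Polars frame (one column).

The class then has a method that creates a Polars DataFrame whose first column is the time series and whose other columns are created with these functions.

The columns are all independent of each other.

Is there a way to create this DataFrame in parallel?

Here I try to define a minimal example: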

class data_frame_constr():
    function_list: List[str]
    time_series: pl.DataFrame

    def compute_indicator_matrix(self) -> pl.DataFrame:
        for element in self.function_list:
            self.time_series.with_columns(
                mapping[element] # here is where we construct columns with the loop and mapping[element] is a custom function that returns a pl column
            )
        return self.time_series

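For example, function_list = ["square", "square_root"].

The time series is a single-column time series, and I need to create the square and square-root columns (or columns from other custom complex functions, identified by their names), but I only know the list of functions at runtime, when it is specified in the constructor.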

Answer 1

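You can use the with_columns context to provide a list of expressions, as long as the expressions are independent. (Note the plural: with_columns.) Polars will attempt to run all expressions in the list in parallel, even if the list of expressions is generated dynamically at runtime.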


def mapping(func_str: str) -> pl.Expr:
    '''Generate Expression from function string'''
    ...


def compute_indicator_matrix(self) -> pl.DataFrame:
    expr_list = [mapping(next_funct_str)
                 for next_funct_str in self.function_list]
    self.time_series = self.time_series.with_columns(expr_list)
    return self.time_series

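One caveat: a common misconception is that Polars is a general-purpose thread pool that will run any and all code in parallel. This is not true.

If any of your expressions call external libraries or custom Python bytecode functions (for example, using a lambda function, map_elements, map_batches, etc.), then your code will be subject to the Python GIL and will run single-threaded, no matter how you code it. So try to use only Polars expressions to accomplish your goals (rather than calling external libraries or Python functions).

As an example, try the following. (Choose a value of nbr_rows that will stress your computing platform.) If we run the code below, it will run in parallel, because everything is expressed using Polars expressions, without calling external libraries or custom Python code. The result is embarrassingly parallel performance.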

import polars as pl

nbr_rows = 100_000_000
df = pl.DataFrame({
    'col1': pl.repeat(2, nbr_rows, eager=True),
})

df.with_columns(
    pl.col('col1').pow(1.1).alias('exp_1.1'),
    pl.col('col1').pow(1.2).alias('exp_1.2'),
    pl.col('col1').pow(1.3).alias('exp_1.3'),
    pl.col('col1').pow(1.4).alias('exp_1.4'),
    pl.col('col1').pow(1.5).alias('exp_1.5'),
)

However, if we instead write the code using lambda functions that call Python bytecode, it will run very slowly.

import math
df.with_columns(
    pl.col('col1').map_elements(lambda x: math.pow(x, 1.1)).alias('exp_1.1'),
    pl.col('col1').map_elements(lambda x: math.pow(x, 1.2)).alias('exp_1.2'),
    pl.col('col1').map_elements(lambda x: math.pow(x, 1.3)).alias('exp_1.3'),
    pl.col('col1').map_elements(lambda x: math.pow(x, 1.4)).alias('exp_1.4'),
    pl.col('col1').map_elements(lambda x: math.pow(x, 1.5)).alias('exp_1.5'),
)
