Create a dictionary for each row in a polars DataFrame

Question

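Suppose we have the DataFrame given below. For each row, I need to create a dictionary and pass it to a UDF for some logic processing. Is there a way to achieve this using either a polars or a pyspark DataFrame?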

Answer 1

With Polars, you can use:

# Dict of lists
>>> df.transpose().to_dict(as_series=False)
{'column_0': [1.0, 100.0, 1000.0], 'column_1': [2.0, 200.0, None]}

# List of dicts
>>> df.to_dicts()
[{'Account number': 1, 'V1': 100, 'V2': 1000.0},
 {'Account number': 2, 'V1': 200, 'V2': None}]

Input DataFrame:

>>> df
shape: (2, 3)
┌────────────────┬─────┬────────┐
│ Account number ┆ V1  ┆ V2     │
│ ---            ┆ --- ┆ ---    │
│ i64            ┆ i64 ┆ f64    │
╞════════════════╪═════╪════════╡
│ 1              ┆ 100 ┆ 1000.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2              ┆ 200 ┆ null   │
└────────────────┴─────┴────────┘
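
Once the rows are available as a list of dicts, passing each one to a UDF is an ordinary loop. Below is a minimal sketch that copies the `to_dicts()` output shown above as a literal, so it runs without Polars installed. The function name `process_row` and its logic are hypothetical stand-ins for your own UDF.

```python
# Rows in the shape returned by df.to_dicts() above (copied as a literal).
rows = [
    {"Account number": 1, "V1": 100, "V2": 1000.0},
    {"Account number": 2, "V1": 200, "V2": None},
]


def process_row(row: dict) -> float:
    """Hypothetical UDF: sum V1 and V2, treating a missing value as 0."""
    return (row["V1"] or 0) + (row["V2"] or 0)


# One result per row, keyed by account number.
results = {row["Account number"]: process_row(row) for row in rows}
print(results)
```

With a real DataFrame, `rows` would simply be `df.to_dicts()`; the loop itself stays the same.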

Answer 2

In addition to @Corralien's answer, if you want to call the UDF directly from Polars, you can do the following:

import polars as pl

df = pl.DataFrame({
    'Account number': [1,2],
    'V1': [100,200],
    'V2': [1000, None]
})

shape: (2, 3)
┌────────────────┬─────┬──────┐
│ Account number ┆ V1  ┆ V2   │
│ ---            ┆ --- ┆ ---  │
│ i64            ┆ i64 ┆ i64  │
╞════════════════╪═════╪══════╡
│ 1              ┆ 100 ┆ 1000 │
│ 2              ┆ 200 ┆ null │
└────────────────┴─────┴──────┘

# Define a UDF
def my_udf(row):
    return (row['V1'] or 0) + (row['V2'] or 0)

# Add a column using the UDF
# (in Polars 0.19 and later, Expr.apply is named Expr.map_elements)
df.with_columns(
    result_udf = pl.struct(pl.all()).apply(my_udf)
)

shape: (2, 4)
┌────────────────┬─────┬──────┬────────────┐
│ Account number ┆ V1  ┆ V2   ┆ result_udf │
│ ---            ┆ --- ┆ ---  ┆ ---        │
│ i64            ┆ i64 ┆ i64  ┆ i64        │
╞════════════════╪═════╪══════╪════════════╡
│ 1              ┆ 100 ┆ 1000 ┆ 1100       │
│ 2              ┆ 200 ┆ null ┆ 200        │
└────────────────┴─────┴──────┴────────────┘

# You can also run your UDF on multiple threads
df.with_columns(
    result_udf = pl.struct(pl.all())
        .apply(my_udf, strategy='threading', return_dtype=pl.Int64)
)

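Regarding running a function on separate threads, here is what the Polars API documentation says:

This functionality is in alpha stage. It may be removed or changed without that being considered a breaking change.

'threading': run the Python function on separate threads. Use with care, as this can decrease performance. It might only speed up your code if the amount of work per element is significant and the Python function releases the GIL (e.g. by calling a C function).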

https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.apply.html#polars.Expr.apply
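
To illustrate the caveat quoted above outside of Polars, here is a rough sketch that runs a row-level function on a thread pool using only the standard library. The names `slow_udf` and `rows` are hypothetical, and `time.sleep` stands in for work that releases the GIL, which is the case where threads can help. A pure-Python, CPU-bound function would likely gain little or nothing from this.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Row dicts in the same shape as the DataFrame used in this answer.
rows = [
    {"Account number": 1, "V1": 100, "V2": 1000},
    {"Account number": 2, "V1": 200, "V2": None},
]


def slow_udf(row: dict) -> int:
    # time.sleep releases the GIL, standing in for a C call or an I/O wait.
    time.sleep(0.05)
    return (row["V1"] or 0) + (row["V2"] or 0)


# executor.map preserves input order, so results line up with rows.
with ThreadPoolExecutor(max_workers=2) as executor:
    threaded = list(executor.map(slow_udf, rows))

# Same computation without threads, for comparison.
sequential = [slow_udf(row) for row in rows]
```

Both lists hold the same values; only the wall-clock time differs, and only because the simulated work does not hold the GIL.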
