How to avoid using pl.DataFrame.iter_rows() and vectorize this instead

Question (votes: 0, answers: 1)

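I have two Polars dataframes, each containing a unique ID and the name of a utility. I am trying to build a mapping of entries between the two dataframes, and I am using polars_fuzzy_match to run a fuzzy string search over the entries. My first dataframe (wg_df) is approximately a subset of the second (eia_df). In the code below, I pass each utility_name from wg_df into fuzzy_match_score, which is run against the utility names in eia_df. Can I avoid the row-wise iteration and vectorize this?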

import polars as pl
from polars_fuzzy_match import fuzzy_match_score

# Sample data
#  wg_df is approximately a subset of eia_df.
wg_df = pl.DataFrame({"wg_id": [1, 2], "utility_name": ["Utility A", "Utility B"]})

eia_df = pl.DataFrame(
    {"eia_id": [101, 102, 103], "utility_name": ["Utility A co.", "Utility B", "utility c"]}
)

out = pl.DataFrame(
    schema=[
        ("wg_id", pl.Int64),
        ("eia_id", pl.Int64),
        ("wg_utility_name", pl.String),
        ("utility_name", pl.String),
        ("score", pl.UInt32),
    ],
)

# Iterate through each wg utility and find the best match in eia
# can this be vectorized?
for wg_id, utility in wg_df.iter_rows():
    res = (
        eia_df.with_columns(score=fuzzy_match_score(pl.col("utility_name"), utility))
        .filter(pl.col("score").is_not_null())
        .sort(by="score", descending=True)
    )
    # insert the wg_id and wg_utility_name into the results. They have to be put into the 
    res.insert_column(
        0,
        pl.Series("wg_id", [wg_id] * len(res)),
    )
    res.insert_column(2, pl.Series("wg_utility_name", [utility] * len(res)))
    out = out.vstack(res.select([col_name for col_name in out.schema]))

Tags: python, vectorization, python-polars

1 Answer (votes: 0)

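This will parallelize the work. It still loops over wg_df in Python, but each iteration only builds a lazy query plan instead of computing and stacking a result.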

def make_res(wg_id, utility, eia_df):
    return (
        eia_df.lazy()
        .select(
            pl.lit(wg_id).alias('wg_id'),
            pl.all(),
            other_utility_name = pl.lit(utility),
            score=fuzzy_match_score(pl.col("utility_name"), utility),
            )
        .filter(pl.col("score").is_not_null())
    )
    
pl.concat([
    make_res(wg_id, utility, eia_df) for wg_id, utility in wg_df.iter_rows()
]).collect()
    

shape: (2, 5)
┌───────┬────────┬───────────────┬────────────────────┬───────┐
│ wg_id ┆ eia_id ┆ utility_name  ┆ other_utility_name ┆ score │
│ ---   ┆ ---    ┆ ---           ┆ ---                ┆ ---   │
│ i32   ┆ i64    ┆ str           ┆ str                ┆ u32   │
╞═══════╪════════╪═══════════════╪════════════════════╪═══════╡
│ 1     ┆ 101    ┆ Utility A co. ┆ Utility A          ┆ 228   │
│ 2     ┆ 102    ┆ Utility B     ┆ Utility B          ┆ 228   │
└───────┴────────┴───────────────┴────────────────────┴───────┘

I don't know anything about this plugin, so I can't say whether both results are expected to score 228.

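Putting that aside, since it isn't really relevant to the question, there are two keys to this approach. The first is that we aren't vstacking on every iteration. The second is that when you concat a list of LazyFrames and then collect them, Polars will execute each LazyFrame plan in parallel.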
