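I have two Polars DataFrames, each containing a unique ID and the name of a utility. I am trying to build a mapping of entries between these two DataFrames, and I am using `polars_fuzzy_match` to do a fuzzy string search against the entries. My first DataFrame (`wg_df`) is approximately a subset of the second (`eia_df`). In the code below, I pass each `utility_name` in `wg_df` into `fuzzy_match_score`, run against the EIA utility names (the `utility_name` column of `eia_df`). Can I avoid the row-wise iteration and vectorize this?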
import polars as pl
from polars_fuzzy_match import fuzzy_match_score
# Sample data
# wg_df is approximately a subset of eia_df.
wg_df = pl.DataFrame({"wg_id": [1, 2], "utility_name": ["Utility A", "Utility B"]})
eia_df = pl.DataFrame(
    {"eia_id": [101, 102, 103], "utility_name": ["Utility A co.", "Utility B", "utility c"]}
)
out = pl.DataFrame(
    schema=[
        ("wg_id", pl.Int64),
        ("eia_id", pl.Int64),
        ("wg_utility_name", pl.String),
        ("utility_name", pl.String),
        ("score", pl.UInt32),
    ],
)
# Iterate through each wg utility and find the best match in eia
# can this be vectorized?
for wg_id, utility in wg_df.iter_rows():
    res = (
        eia_df.with_columns(score=fuzzy_match_score(pl.col("utility_name"), utility))
        .filter(pl.col("score").is_not_null())
        .sort(by="score", descending=True)
    )
    # insert the wg_id and wg_utility_name into the results. They have to be put into the
    res.insert_column(
        0,
        pl.Series("wg_id", [wg_id] * len(res)),
    )
    res.insert_column(2, pl.Series("wg_utility_name", [utility] * len(res)))
    out = out.vstack(res.select([col_name for col_name in out.schema]))
This will parallelize the work.
def make_res(wg_id, utility, eia_df):
    return (
        eia_df.lazy()
        .select(
            pl.lit(wg_id).alias("wg_id"),
            pl.all(),
            other_utility_name=pl.lit(utility),
            score=fuzzy_match_score(pl.col("utility_name"), utility),
        )
        .filter(pl.col("score").is_not_null())
    )

pl.concat([
    make_res(wg_id, utility, eia_df) for wg_id, utility in wg_df.iter_rows()
]).collect()
shape: (2, 5)
┌───────┬────────┬───────────────┬────────────────────┬───────┐
│ wg_id ┆ eia_id ┆ utility_name  ┆ other_utility_name ┆ score │
│ ---   ┆ ---    ┆ ---           ┆ ---                ┆ ---   │
│ i32   ┆ i64    ┆ str           ┆ str                ┆ u32   │
╞═══════╪════════╪═══════════════╪════════════════════╪═══════╡
│ 1     ┆ 101    ┆ Utility A co. ┆ Utility A          ┆ 228   │
│ 2     ┆ 102    ┆ Utility B     ┆ Utility B          ┆ 228   │
└───────┴────────┴───────────────┴────────────────────┴───────┘
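I don't know anything about this plugin, so I can't say whether both results being 228 is expected.
Setting that aside, since it isn't really relevant to the question, there are two keys to this approach. The first is that we aren't vstacking on every iteration. The second is that when you concat a list of LazyFrames and then collect them, Polars executes each LazyFrame's plan in parallel.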