How to get the right data from another DataFrame

Question

Instead of the actual data, I get a df whose values are query plan (expression) objects.

import polars as pl

df1 = pl.DataFrame({
    "df1_date": [20221011, 20221012, 20221013, 20221014, 20221016],
    "df1_col1": ["foo", "bar", "foo", "bar", "foo"],
})

df2 = pl.DataFrame({
    "df2_date": [20221012, 20221015, 20221018],
    "df2_col1": ["1", "2", "3"],
})

print(
    df1.lazy()
    .with_context(df2.lazy())
    .select(
        pl.col("df1_date")
        .map_elements(lambda s: pl.col("df2_date").filter(pl.col("df2_date") >= s).first())
        .alias("release_date")
    )
    .collect()
)

shape: (5, 1)
┌─────────────────────────────────┐
│ release_date                    │
│ ---                             │
│ object                          │
╞═════════════════════════════════╡
│ col("df2_date").filter([(col("… │
│ col("df2_date").filter([(col("… │
│ col("df2_date").filter([(col("… │
│ col("df2_date").filter([(col("… │
│ col("df2_date").filter([(col("… │
└─────────────────────────────────┘

In pandas, I can get what I want with:

df1 = df1.to_pandas().set_index("df1_date")
df2 = df2.to_pandas().set_index("df2_date")

df1["release_date"] = df1.index.map(
    lambda x: df2[df2.index <= x].index[-1] if len(df2[df2.index <= x]) > 0 else 0
)
print(df1)

shape: (5, 3)
┌──────────┬──────────┬──────────────┐
│ df1_date ┆ df1_col1 ┆ release_date │
│ ---      ┆ ---      ┆ ---          │
│ i64      ┆ str      ┆ i64          │
╞══════════╪══════════╪══════════════╡
│ 20221011 ┆ foo      ┆ 0            │
│ 20221012 ┆ bar      ┆ 20221012     │
│ 20221013 ┆ foo      ┆ 20221012     │
│ 20221014 ┆ bar      ┆ 20221012     │
│ 20221016 ┆ foo      ┆ 20221015     │
└──────────┴──────────┴──────────────┘

How can I get the expected result with Polars?

Answer 1

It looks like you are trying to do an asof join. In other words, a join that takes the last matching value instead of requiring an exact match.
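To make the asof idea concrete, here is a minimal pure-Python sketch of a backward asof lookup on the dates from the question. The helper name asof_backward and its default parameter are hypothetical, introduced only for illustration; this is not how Polars implements the join.

```python
from bisect import bisect_right


def asof_backward(left_keys, right_keys, default=0):
    """For each left key, return the largest right key <= that key.

    right_keys must be sorted ascending. `default` is returned when no
    right key is small enough (the question uses 0 for that case).
    """
    out = []
    for key in left_keys:
        # Number of right keys that are <= key.
        i = bisect_right(right_keys, key)
        out.append(right_keys[i - 1] if i > 0 else default)
    return out


df1_dates = [20221011, 20221012, 20221013, 20221014, 20221016]
df2_dates = [20221012, 20221015, 20221018]

release_dates = asof_backward(df1_dates, df2_dates)
print(release_dates)
# [0, 20221012, 20221012, 20221012, 20221015]
```

This matches the release_date column in the expected pandas output: every row takes the most recent df2_date that is not later than its own df1_date.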

You can do

df1 = (
    df1.lazy()
    .join_asof(df2.lazy(), left_on="df1_date", right_on="df2_date")
    .select(["df1_date", "df1_col1",
             pl.col("df2_date").fill_null(0).alias("release_date")])
    .collect()
)

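The first difference is that in Polars you don't assign a new column; you assign the whole df, so it is always just the df name on the left side of the equals sign.

join_asof replaces your index/map/lambda logic. The last step is to use fill_null to replace the nulls with 0 and then rename the column. Older versions of Polars had a bug that made the final collect fail. It is fixed in at least 0.15.1 (maybe in earlier versions too, but that is the only version I checked).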
