Polars DataFrame join_asof with (keep) null

Question

Update: this issue has been resolved.

df.join_asof(df2, on="time", by=["a", "b"])

now runs without error and returns the expected result.

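Currently, from my experiments, join_asof raises an error if there is any None (null) in the "by" columns. Is there a way to still use join_asof while keeping the rows with None (null) from the left dataframe?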

For example, I have the following dataframes:

df = pl.DataFrame(
    {"a": [1, 2, 3, 4, 5, None, 8], "b": [2, 3, 4, 5, 6, 7, None], "time": [1, 2, 3, 4, 5, 6, 7]}
)
df2 = pl.DataFrame({"a": [1, 3, 4, None], "b": [2, 4, 5, 8], "c": [2, 3, 4, 5], "time": [0, 2, 4, 6]})

If I just run the code below, it fails with an error:

df.join_asof(df2, on="time", by=["a", "b"])

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: ComputeError(Borrowed("cannot take slice"))', /home/runner/work/polars/polars/polars/polars-core/src/frame/asof_join/groups.rs:253:35

However, the following code works fine:

df.drop_nulls(["a", "b"]).join_asof(df2.drop_nulls(["a", "b"]), on="time", by=["a", "b"])

shape: (5, 4)
┌─────┬─────┬──────┬──────┐
│ a   ┆ b   ┆ time ┆ c    │
│ --- ┆ --- ┆ ---  ┆ ---  │
│ i64 ┆ i64 ┆ i64  ┆ i64  │
╞═════╪═════╪══════╪══════╡
│ 1   ┆ 2   ┆ 1    ┆ 2    │
│ 2   ┆ 3   ┆ 2    ┆ null │
│ 3   ┆ 4   ┆ 3    ┆ 3    │
│ 4   ┆ 5   ┆ 4    ┆ 4    │
│ 5   ┆ 6   ┆ 5    ┆ null │
└─────┴─────┴──────┴──────┘

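My question is how to get the following result, which is basically the result above with the extra rows appended (the rows where a or b is null in the left dataframe, df in this case)?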

┌──────┬──────┬──────┬──────┐
│ a    ┆ b    ┆ time ┆ c    │
│ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╪══════╡
│ 1    ┆ 2    ┆ 1    ┆ 2    │
│ 2    ┆ 3    ┆ 2    ┆ null │
│ 3    ┆ 4    ┆ 3    ┆ 3    │
│ 4    ┆ 5    ┆ 4    ┆ 4    │
│ 5    ┆ 6    ┆ 5    ┆ null │
│ null ┆ 7    ┆ 6    ┆ null │
│ 8    ┆ null ┆ 7    ┆ null │
└──────┴──────┴──────┴──────┘

Thanks!

Answer 1

A simple solution is to use concat with how='diagonal'. For example:

pl.concat(
    [
        df.drop_nulls(["a", "b"]).join_asof(df2.drop_nulls(["a", "b"]), on="time", by=["a", "b"]),
        df.filter(pl.col('a').is_null() | pl.col('b').is_null()),
    ],
    how='diagonal'
)

shape: (7, 4)
┌──────┬──────┬──────┬──────┐
│ a    ┆ b    ┆ time ┆ c    │
│ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╪══════╡
│ 1    ┆ 2    ┆ 1    ┆ 2    │
│ 2    ┆ 3    ┆ 2    ┆ null │
│ 3    ┆ 4    ┆ 3    ┆ 3    │
│ 4    ┆ 5    ┆ 4    ┆ 4    │
│ 5    ┆ 6    ┆ 5    ┆ null │
│ null ┆ 7    ┆ 6    ┆ null │
│ 8    ┆ null ┆ 7    ┆ null │
└──────┴──────┴──────┴──────┘

Edit:

"diagonal pl.concat seems to be quite slow if the dataframes are large?"

Does it?

import time

import polars as pl

mult = 100_000_000
df = pl.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, None, 8] * mult,
        "b": [2, 3, 4, 5, 6, 7, None] * mult,
        "time": [1, 2, 3, 4, 5, 6, 7] * mult,
    }
).sort("time")
df2 = pl.DataFrame(
    {
        "a": [1, 3, 4, None] * mult,
        "b": [2, 4, 5, 8] * mult,
        "c": [2, 3, 4, 5] * mult,
        "time": [0, 2, 4, 6] * mult,
    }
).sort("time")

not_null_df = df.drop_nulls(["a", "b"]).join_asof(
    df2.drop_nulls(["a", "b"]), on="time", by=["a", "b"]
)
is_null_df = df.filter(pl.col("a").is_null() | pl.col("b").is_null())

not_null_df
is_null_df

start = time.perf_counter()
pl.concat([not_null_df, is_null_df], how="diagonal")
print(time.perf_counter() - start)

>>> not_null_df
shape: (500000000, 4)
┌─────┬─────┬──────┬──────┐
│ a   ┆ b   ┆ time ┆ c    │
│ --- ┆ --- ┆ ---  ┆ ---  │
│ i64 ┆ i64 ┆ i64  ┆ i64  │
╞═════╪═════╪══════╪══════╡
│ 1   ┆ 2   ┆ 1    ┆ 2    │
│ 1   ┆ 2   ┆ 1    ┆ 2    │
│ 1   ┆ 2   ┆ 1    ┆ 2    │
│ 1   ┆ 2   ┆ 1    ┆ 2    │
│ ... ┆ ... ┆ ...  ┆ ...  │
│ 5   ┆ 6   ┆ 5    ┆ null │
│ 5   ┆ 6   ┆ 5    ┆ null │
│ 5   ┆ 6   ┆ 5    ┆ null │
│ 5   ┆ 6   ┆ 5    ┆ null │
└─────┴─────┴──────┴──────┘
>>> is_null_df
shape: (200000000, 3)
┌──────┬──────┬──────┐
│ a    ┆ b    ┆ time │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╡
│ null ┆ 7    ┆ 6    │
│ null ┆ 7    ┆ 6    │
│ null ┆ 7    ┆ 6    │
│ null ┆ 7    ┆ 6    │
│ ...  ┆ ...  ┆ ...  │
│ 8    ┆ null ┆ 7    │
│ 8    ┆ null ┆ 7    │
│ 8    ┆ null ┆ 7    │
│ 8    ┆ null ┆ 7    │
└──────┴──────┴──────┘

>>> pl.concat([not_null_df, is_null_df], how="diagonal")
shape: (700000000, 4)
┌─────┬──────┬──────┬──────┐
│ a   ┆ b    ┆ time ┆ c    │
│ --- ┆ ---  ┆ ---  ┆ ---  │
│ i64 ┆ i64  ┆ i64  ┆ i64  │
╞═════╪══════╪══════╪══════╡
│ 1   ┆ 2    ┆ 1    ┆ 2    │
│ 1   ┆ 2    ┆ 1    ┆ 2    │
│ 1   ┆ 2    ┆ 1    ┆ 2    │
│ 1   ┆ 2    ┆ 1    ┆ 2    │
│ ... ┆ ...  ┆ ...  ┆ ...  │
│ 8   ┆ null ┆ 7    ┆ null │
│ 8   ┆ null ┆ 7    ┆ null │
│ 8   ┆ null ┆ 7    ┆ null │
│ 8   ┆ null ┆ 7    ┆ null │
└─────┴──────┴──────┴──────┘
>>> print(time.perf_counter() - start)
6.087414998997701

That is about 6 seconds to concatenate datasets of 500,000,000 and 200,000,000 records.
