Sort within a group_by expression when taking the first row, keeping all columns

Question

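Given the following dataframe, I would like to group by "foo", sort by "bar" within each group, and then keep the whole first row.

import polars as pl

df = pl.DataFrame(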
    {
        "foo": [1, 1, 1, 2, 2, 2, 3],
        "bar": [5, 7, 6, 4, 2, 3, 1],
        "baz": [1, 2, 3, 4, 5, 6, 7],
    }
)

df_desired = pl.DataFrame({"foo": [1, 2, 3], "bar": [5, 2, 1], "baz": [1,5,7]})

>>> df_desired
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ baz │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 5   ┆ 1   │
│ 2   ┆ 2   ┆ 5   │
│ 3   ┆ 1   ┆ 7   │
└─────┴─────┴─────┘

I can do this by sorting beforehand, but that is more expensive than sorting within the groups:

df_solution = df.sort("bar").group_by("foo", maintain_order=True).first().sort(by="foo")

assert df_desired.equals(df_solution)

I can sort "bar" inside the aggregation, as in this SO answer:

>>> df.group_by("foo").agg(pl.col("bar").sort().first()).sort(by="foo")
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 5   │
│ 2   ┆ 2   │
│ 3   ┆ 1   │
└─────┴─────┘

But then I only get that one column. How can I keep the row's value of "baz"? Any additional entries in .agg([]) are independent of the newly sorted pl.col("bar").sort().

Answer 1

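You can use pl.col("bar").arg_min() to get the index of the row with the minimum "bar" in each group. That index can then be passed to pl.all().get() to access the corresponding values of every column.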

(df.group_by("foo", maintain_order=True)
   .agg(pl.all().get(pl.col("bar").arg_min()))
)

shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ baz │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 5   ┆ 1   │
│ 2   ┆ 2   ┆ 5   │
│ 3   ┆ 1   ┆ 7   │
└─────┴─────┴─────┘
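To make the mechanics concrete, here is a rough plain-Python sketch of what the arg_min-then-get idea computes per group. It is only an illustration of the logic, not how Polars implements it, and the names rows, groups, and result are made up for this example.

```python
from collections import defaultdict

# Same data as the question, as plain Python lists (illustrative only).
rows = {
    "foo": [1, 1, 1, 2, 2, 2, 3],
    "bar": [5, 7, 6, 4, 2, 3, 1],
    "baz": [1, 2, 3, 4, 5, 6, 7],
}

# Collect the row positions of each "foo" group, in order of first appearance.
groups = defaultdict(list)
for pos, key in enumerate(rows["foo"]):
    groups[key].append(pos)

result = []
for key, positions in groups.items():
    # "arg_min" step: position of the smallest "bar" within this group.
    best = min(positions, key=lambda p: rows["bar"][p])
    # "get" step: read the remaining columns at that same position.
    result.append((key, rows["bar"][best], rows["baz"][best]))

print(result)  # [(1, 5, 1), (2, 2, 5), (3, 1, 7)]
```

No sorting of the whole group is needed here: a single pass finds the minimum, and the other columns are looked up by position.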

Or, closer to your example, you can use .sort_by() inside .agg():

(df.group_by("foo", maintain_order=True)
   .agg(pl.all().sort_by("bar").first())
)

shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ baz │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 5   ┆ 1   │
│ 2   ┆ 2   ┆ 5   │
│ 3   ┆ 1   ┆ 7   │
└─────┴─────┴─────┘
