Polars map_batches on list type raises InvalidOperationError

Question

I have a Polars puzzle that I can't solve.

This works as expected:

import polars as pl

df = pl.DataFrame(
    {
        "int1": [1, 2, 3],
        "int2": [3, 2, 1]
    }
)

df.with_columns(
    pl.struct('int1', 'int2')
      .map_batches(lambda x: x.struct.field('int1') + x.struct.field('int2')).alias('int3')
)

Output:

shape: (3, 3)
┌──────┬──────┬──────┐
│ int1 ┆ int2 ┆ int3 │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╡
│ 1    ┆ 3    ┆ 4    │
│ 2    ┆ 2    ┆ 4    │
│ 3    ┆ 1    ┆ 4    │
└──────┴──────┴──────┘

But this does not:

df = pl.DataFrame(
    {
        "int1": [[1], [2], [3]],
        "int2": [[3], [2], [1]]
    }
)

df.with_columns(
    pl.struct('int1', 'int2')
      .map_batches(lambda x: x.struct.field('int1').to_list() + x.struct.field('int2').to_list()).alias('int3')
)

Output:

# InvalidOperationError: Series int3, length 1 doesn't match the DataFrame height of 3

This is the output I expect:

┌───────────┬───────────┬───────────┐
│ int1      ┆ int2      ┆ int3      │
│ ---       ┆ ---       ┆ ---       │
│ list[i64] ┆ list[i64] ┆ list[i64] │
╞═══════════╪═══════════╪═══════════╡
│ [1]       ┆ [3]       ┆ [1, 3]    │
│ [2]       ┆ [2]       ┆ [2, 2]    │
│ [3]       ┆ [1]       ┆ [3, 1]    │
└───────────┴───────────┴───────────┘

Answer 1

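I think your problem is that you are mixing up .map_batches and .map_elements.

The Polars user guide explains the difference between the two well. In short, map_batches passes the whole column to your function as a single Series, whereas map_elements calls your function once per row.

So, to get the result you want, you have to do this: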

import polars as pl

print(
    pl.DataFrame({"int1": [[1], [2], [3]], "int2": [[3], [2], [1]]}).with_columns(
        pl.struct("int1", "int2").map_elements(lambda x: x["int1"] + x["int2"]).alias("int3")
    )
)

shape: (3, 3)
┌───────────┬───────────┬───────────┐
│ int1      ┆ int2      ┆ int3      │
│ ---       ┆ ---       ┆ ---       │
│ list[i64] ┆ list[i64] ┆ list[i64] │
╞═══════════╪═══════════╪═══════════╡
│ [1]       ┆ [3]       ┆ [1, 3]    │
│ [2]       ┆ [2]       ┆ [2, 2]    │
│ [3]       ┆ [1]       ┆ [3, 1]    │
└───────────┴───────────┴───────────┘

Answer 2

You can use .list.concat():

import polars as pl

df = pl.DataFrame(
        {
            "int1": [[1], [2], [3]],
            "int2": [[3], [2], [1]]
        }
)

out = df.with_columns(pl.col("int1").list.concat(pl.col("int2")).alias("a + b"))

print(out)

shape: (3, 3)
┌───────────┬───────────┬───────────┐
│ int1      ┆ int2      ┆ a + b     │
│ ---       ┆ ---       ┆ ---       │
│ list[i64] ┆ list[i64] ┆ list[i64] │
╞═══════════╪═══════════╪═══════════╡
│ [1]       ┆ [3]       ┆ [1, 3]    │
│ [2]       ┆ [2]       ┆ [2, 2]    │
│ [3]       ┆ [1]       ┆ [3, 1]    │
└───────────┴───────────┴───────────┘

Answer 3

Strictly speaking, the idiomatic way to combine two (or more) columns into a single list column is to use concat_list:

import polars as pl

pl.DataFrame(
    {
        "int1": [1, 2, 3],
        "int2": [3, 2, 1],
    }
).with_columns(int3=pl.concat_list(pl.col("int1"), pl.col("int2")))

shape: (3, 3)
┌──────┬──────┬───────────┐
│ int1 ┆ int2 ┆ int3      │
│ ---  ┆ ---  ┆ ---       │
│ i64  ┆ i64  ┆ list[i64] │
╞══════╪══════╪═══════════╡
│ 1    ┆ 3    ┆ [1, 3]    │
│ 2    ┆ 2    ┆ [2, 2]    │
│ 3    ┆ 1    ┆ [3, 1]    │
└──────┴──────┴───────────┘
