Group a pyarrow table by multiple columns and aggregate by items in another list column


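I load data from a parquet file into a pyarrow table with the schema below. I want to group the table by ma_id and items.nan and get max(processing_ts) for each group. I have not managed to group by the nan field inside the items list.
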
import pyarrow as pa

schema: pa.Schema = pa.schema(
    [
        ("ma_id", pa.int32()),
        ("processing_ts", pa.timestamp("ms")),
        (
            "items",
            pa.list_(
                pa.struct(
                    [
                        pa.field("nan", pa.int32()),
                        pa.field("ean", pa.int32()),
                    ]
                )
            ),
        ),
    ]
)

Suppose the table contains data like this:

[
 (100, '2025-01-03 16:21:00', [{'nan': 1, 'ean': 11}, {'nan': 2, 'ean': 212}, {'nan': 3, 'ean': 3}]),
 (100, '2025-01-03 23:55:00', [{'nan': 9, 'ean': 95}, {'nan': 2, 'ean': 212}, {'nan': 9, 'ean': 95}]),
 (120, '2025-01-03 21:21:00', [{'nan': 8, 'ean': 87}, {'nan': 2, 'ean': 212}, {'nan': 9, 'ean': 95}]),
 (100, '2025-01-03 01:45:00', [{'nan': 6, 'ean': 666}, {'nan': 1, 'ean': 11}, {'nan': 7, 'ean': 711}, {'nan': 6, 'ean': 666}]),
 (120, '2025-01-03 12:38:00', [{'nan': 8, 'ean': 87}, {'nan': 9, 'ean': 95}]),
]

My goal is to get the maximum processing_ts value for each combination of ma_id and nan (taken from the items column). For the data above, the result should be:

ma_id nan max_processing_ts
100 1 '2025-01-03 16:21:00'
100 2 '2025-01-03 23:55:00'
100 3 '2025-01-03 16:21:00'
100 6 '2025-01-03 01:45:00'
100 7 '2025-01-03 01:45:00'
100 9 '2025-01-03 23:55:00'
120 2 '2025-01-03 21:21:00'
120 8 '2025-01-03 21:21:00'
120 9 '2025-01-03 21:21:00'
python-3.x pyarrow
1 Answer

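Technically, you can do this by exploding the list, flattening/unnesting the struct, and then calling group by. But that is a lot of work in pyarrow. Using Polars will make your life easier.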

import polars as pl

df = pl.from_arrow(table)

results = (
    df.explode("items")
    .unnest("items")
    .group_by("ma_id", "nan", maintain_order=True)
    .agg(pl.col("processing_ts").max().alias("max_processing_ts"))
)