I load data from a parquet file into a pyarrow table with the schema below. I want to group the table by ma_id and by items.nan, and get max(processing_ts) for each group. I haven't managed to group by the nan field inside the items list.

import pyarrow as pa
schema: pa.Schema = pa.schema(
    [
        ("ma_id", pa.int32()),
        ("processing_ts", pa.timestamp("ms")),
        (
            "items",
            pa.list_(
                pa.struct(
                    [
                        pa.field("nan", pa.int32()),
                        pa.field("ean", pa.int32()),
                    ]
                )
            ),
        ),
    ]
)
Assume the table contains data like this:
[
(100, '2025-01-03 16:21:00', [{'nan': 1, 'ean': 11}, {'nan': 2, 'ean': 212}, {'nan': 3, 'ean': 3}]),
(100, '2025-01-03 23:55:00', [{'nan': 9, 'ean': 95}, {'nan': 2, 'ean': 212}, {'nan': 9, 'ean': 95}]),
(120, '2025-01-03 21:21:00', [{'nan': 8, 'ean': 87}, {'nan': 2, 'ean': 212}, {'nan': 9, 'ean': 95}]),
(100, '2025-01-03 01:45:00', [{'nan': 6, 'ean': 666}, {'nan': 1, 'ean': 11}, {'nan': 7, 'ean': 711}, {'nan': 6, 'ean': 666}]),
(120, '2025-01-03 12:38:00', [{'nan': 8, 'ean': 87}, {'nan': 9, 'ean': 95}]),
]
My goal is to get, for each combination of ma_id and nan from the items column, the maximum processing_ts value. For the data above, the result should be:
ma_id | nan | max_processing_ts |
---|---|---|
100 | 1 | '2025-01-03 16:21:00' |
100 | 2 | '2025-01-03 23:55:00' |
100 | 3 | '2025-01-03 16:21:00' |
100 | 6 | '2025-01-03 01:45:00' |
100 | 7 | '2025-01-03 01:45:00' |
100 | 9 | '2025-01-03 23:55:00' |
120 | 2 | '2025-01-03 21:21:00' |
120 | 8 | '2025-01-03 21:21:00' |
120 | 9 | '2025-01-03 21:21:00' |
Technically, you can do this by exploding the list, flattening/unnesting the structs, and calling group_by. But that is a lot of work in pyarrow. Using Polars makes it much easier.
import polars as pl

df = pl.from_arrow(table)
results = (
    df.explode("items")
    .unnest("items")
    .group_by("ma_id", "nan", maintain_order=True)
    .agg(pl.col("processing_ts").max().alias("max_processing_ts"))
)