Is there a way to perform a pivot on a lazy Polars DataFrame?

Question

I am trying to perform a pivot operation on a lazy Polars DataFrame.

If I read the data eagerly, I can pivot it:

import polars as pl
df = pl.read_parquet(path, low_memory=True)
pivoted_df = df.pivot(index=["ind1", "ind2", "ind3", "ind4"], columns="my_signal", values="Value", aggregate_function="mean")

These lines work.

However, if I use a lazy DataFrame instead, created like this:

df = pl.scan_parquet(path, low_memory=True)

I cannot find a way to perform the same operation, because .pivot cannot be applied to a lazy object.

Important: I do not want to call df.collect() at any point, because the dataset is very large and does not fit in memory.

At the end I want to save the lazy DataFrame with pivoted_df.sink_parquet(), so again, I do not want to collect the data at any point.

Thanks in advance!

I tried using .group_by instead of .pivot:

grouped_df = df.group_by(["ind1", "ind2", "ind3", "ind4"])
transformed_df = grouped_df.agg(**{f"{col}_mean": pl.col("Value").mean() for col in pl.col("my_signal").unique()})

But I get the error 'Expr' object is not iterable, which is expected, because pl.col("my_signal").unique() is an expression rather than a list of values.

Are there any other options?

Example:

df:

┌──────┬──────┬──────┬──────┬───────────┬───────┐
│ ind1 ┆ ind2 ┆ ind3 ┆ ind4 ┆ my_signal ┆ Value │
│ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---       ┆ ---   │
│ str  ┆ str  ┆ str  ┆ str  ┆ str       ┆ i64   │
╞══════╪══════╪══════╪══════╪═══════════╪═══════╡
│ www  ┆ xxx  ┆ yyy  ┆ zzz  ┆ a         ┆ 1     │
│ www  ┆ xxx  ┆ yyy  ┆ zzz  ┆ a         ┆ 1     │
│ www  ┆ xxx  ┆ yyy  ┆ zzz  ┆ a         ┆ 1     │
│ www  ┆ xxx  ┆ yyy  ┆ zzz  ┆ b         ┆ 2     │
│ fff  ┆ xxx  ┆ yyy  ┆ zzz  ┆ b         ┆ 2     │
│ fff  ┆ xxx  ┆ yyy  ┆ zzz  ┆ b         ┆ 2     │
│ fff  ┆ xxx  ┆ yyy  ┆ zzz  ┆ c         ┆ 3     │
│ fff  ┆ xxx  ┆ yyy  ┆ zzz  ┆ c         ┆ 3     │
│ fff  ┆ xxx  ┆ yyy  ┆ zzz  ┆ c         ┆ 3     │
└──────┴──────┴──────┴──────┴───────────┴───────┘

pivoted_df:

┌──────┬──────┬──────┬──────┬──────┬─────┬──────┐
│ ind1 ┆ ind2 ┆ ind3 ┆ ind4 ┆ a    ┆ b   ┆ c    │
│ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ --- ┆ ---  │
│ str  ┆ str  ┆ str  ┆ str  ┆ f64  ┆ f64 ┆ f64  │
╞══════╪══════╪══════╪══════╪══════╪═════╪══════╡
│ www  ┆ xxx  ┆ yyy  ┆ zzz  ┆ 1.0  ┆ 2.0 ┆ null │
│ fff  ┆ xxx  ┆ yyy  ┆ zzz  ┆ null ┆ 2.0 ┆ 3.0  │
└──────┴──────┴──────┴──────┴──────┴─────┴──────┘

Answer 1

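The documentation for pivot() has a suggestion:

"Note that pivot is only available in eager mode. If you know the unique column values in advance, you can use polars.LazyFrame.groupby() to get the same result as above in lazy mode."

(In current Polars releases the method is spelled group_by.) Adapting the documentation's example to your use case: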

index = pl.col("ind1", "ind2", "ind3", "ind4")
columns = pl.col("my_signal")
values = pl.col("Value")
unique_column_values = list(df.unique("my_signal").select("my_signal").collect().to_series())
aggregate_function = lambda col: col.mean()

df.group_by(index).agg(
    aggregate_function(values.filter(columns == value)).alias(value)
    for value in unique_column_values
).collect()

┌──────┬──────┬──────┬──────┬─────┬──────┬──────┐
│ ind1 ┆ ind2 ┆ ind3 ┆ ind4 ┆ b   ┆ a    ┆ c    │
│ ---  ┆ ---  ┆ ---  ┆ ---  ┆ --- ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  ┆ str  ┆ f64 ┆ f64  ┆ f64  │
╞══════╪══════╪══════╪══════╪═════╪══════╪══════╡
│ fff  ┆ xxx  ┆ yyy  ┆ zzz  ┆ 2.0 ┆ null ┆ 3.0  │
│ www  ┆ xxx  ┆ yyy  ┆ zzz  ┆ 2.0 ┆ 1.0  ┆ null │
└──────┴──────┴──────┴──────┴─────┴──────┴──────┘
