Mutating cells in a large Polars (Python) dataframe with iter_rows causes a segmentation fault

Question

I have a large dataframe that looks like this:

import polars as pl

df_large = pl.DataFrame({'x': ['h1', 'h2', 'h2', 'h3'],
                         'y': [1, 2, 3, 4],
                         'ind1': ['0/0', '1/0', '1/1', '0/1'],
                         'ind2': ['0/1', '0/2', '1/1', '0/0']}).lazy()
df_large.collect()

|   x   |   y   | ind1  | ind2  |
|_______|_______|_______|_______|
| "h1"  |   1   | "0/0" | "0/1" |
| "h2"  |   2   | "1/0" | "0/2" |
| "h2"  |   3   | "1/1" | "1/1" |
| "h3"  |   4   | "0/1" | "0/0" |

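df_large contains coordinates [x (str), y (int)] and string values [ind1, ind2, ...] for many individuals. It is very large, so I have to read the CSV file as a lazy dataframe. I also have a small dataframe that looks like this: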

df_rep = pl.DataFrame({'x':['h1','h2','h2'],
                       'y':[1,2,2], 
                       'ind':['ind1','ind1','ind2']})
df_rep

|   x   |   y   |  ind   |
|_______|_______|________|
| "h1"  |   1   | "ind1" |
| "h2"  |   2   | "ind1" |
| "h2"  |   2   | "ind2" |

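For each row of df_rep, I need to change the value in the named ind column of df_large at the matching (x, y) coordinates.

I wrote the following code to do this: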

for row in df_rep.iter_rows():
    # row = (x, y, ind); each iteration adds another with_columns to the lazy plan
    df_large = df_large.with_columns(
        pl.when(pl.col('x') == row[0],
                pl.col('y') == row[1])
          .then(pl.col(row[2]).str.replace_all('(.)/(.)', './.'))
          .otherwise(pl.col(row[2]))
          .alias(row[2])
    )
df_large.collect()

|   x   |   y   | ind1  | ind2  |
|_______|_______|_______|_______|
| "h1"  |   1   | "./." | "0/1" |
| "h2"  |   2   | "./." | "./." |
| "h2"  |   3   | "1/1" | "1/1" |
| "h3"  |   4   | "0/1" | "0/0" |

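This approach, although slow, works on a subset of the larger dataset. When applied to the full dataset, however, Polars produces a "segmentation fault". I would appreciate feedback on how to solve this. An alternative that achieves my goal without iter_rows() would be ideal! I am a beginner with Polars and would be grateful for any feedback. I have been stuck on this problem for a while :(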

Answer 1

You can reshape the small frame with .pivot():

df_rep.with_columns(value=True).pivot(on="ind", index=["x", "y"])

shape: (2, 4)
┌─────┬─────┬──────┬──────┐
│ x   ┆ y   ┆ ind1 ┆ ind2 │
│ --- ┆ --- ┆ ---  ┆ ---  │
│ str ┆ i64 ┆ bool ┆ bool │
╞═════╪═════╪══════╪══════╡
│ h1  ┆ 1   ┆ true ┆ null │
│ h2  ┆ 2   ┆ true ┆ true │
└─────┴─────┴──────┴──────┘

You can then match the rows with a left .join() and put the when/then logic into a single .with_columns() call.

index = ["x", "y"]
other = df_rep.with_columns(value=True).pivot(on="ind", index=index)
names = other.drop(index).columns
names_right = [f"{name}_right" for name in names]

# df_large is a LazyFrame, so the right-hand side must be lazy as well
(df_large
  .join(other.lazy(), on=index, how="left")
  .with_columns(
     # a null flag (no match in df_rep) falls through to .otherwise()
     pl.when(pl.col(name_right))
       .then(pl.col(name).str.replace_all(r"(.)/(.)", "./."))
       .otherwise(pl.col(name))
       for name, name_right in zip(names, names_right)
  )
  .collect()
)

shape: (4, 6)
┌─────┬─────┬──────┬──────┬────────────┬────────────┐
│ x   ┆ y   ┆ ind1 ┆ ind2 ┆ ind1_right ┆ ind2_right │
│ --- ┆ --- ┆ ---  ┆ ---  ┆ ---        ┆ ---        │
│ str ┆ i64 ┆ str  ┆ str  ┆ bool       ┆ bool       │
╞═════╪═════╪══════╪══════╪════════════╪════════════╡
│ h1  ┆ 1   ┆ ./.  ┆ 0/1  ┆ true       ┆ null       │
│ h2  ┆ 2   ┆ ./.  ┆ ./.  ┆ true       ┆ true       │
│ h2  ┆ 3   ┆ 1/1  ┆ 1/1  ┆ null       ┆ null       │
│ h3  ┆ 4   ┆ 0/1  ┆ 0/0  ┆ null       ┆ null       │
└─────┴─────┴──────┴──────┴────────────┴────────────┘

You can then .drop() the names_right columns.
