Polars: applying a lambda and a list comprehension as in pandas. Is there a better way?

Question

pandas

df['sentences'] = df['content'].str.split(pattern2)
df['normal_text'] = df['sentences'].apply(lambda x: [re.sub(pattern3, ' ', sentence) for sentence in x])

polars

df = df.with_column(pl.col('content').apply(lambda x: re.split(pattern2, x)).alias('sentences'))
df = df.with_column(pl.col('sentences').apply(lambda x: [re.sub(pattern3, ' ', sentence) for sentence in x]).alias('normal_text'))

Is there a more elegant way to do this?

Answer 1

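This functionality is available natively in Polars through the .str namespace.

.str.split() does not currently support regular expressions. However, similar behavior can be achieved with .extract_all() and .replace_all().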

import polars as pl

df = pl.DataFrame({
   "content": [
      "o neHItw oHIIIIIth ree", 
      "fo urHIIfi veHIIIIs ix"
   ]
})

pattern2 = r"HI+"
pattern3 = r"\s"
replacement = ""

df.with_columns(sentences = 
   pl.col("content")
     .str.extract_all(rf".*?({pattern2}|$)")
)

shape: (2, 2)
┌────────────────────────┬───────────────────────────────────┐
│ content                ┆ sentences                         │
│ ---                    ┆ ---                               │
│ str                    ┆ list[str]                         │
╞════════════════════════╪═══════════════════════════════════╡
│ o neHItw oHIIIIIth ree ┆ ["o neHI", "tw oHIIIII", "th ree… │
│ fo urHIIfi veHIIIIs ix ┆ ["fo urHII", "fi veHIIII", "s ix… │
└────────────────────────┴───────────────────────────────────┘
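To see why the pattern .*?(HI+|$) yields one chunk per sentence, here is a minimal sketch using Python's standard re module rather than Polars. The lazy .*? consumes as little text as possible up to the next delimiter, and the |$ alternative captures the final chunk, which has no trailing delimiter. The names chunk_pattern, chunks and pieces are illustrative and not part of the original answer.

```python
import re

pattern2 = r"HI+"
text = "o neHItw oHIIIIIth ree"

# Lazy prefix followed by either the delimiter or the end of the string.
chunk_pattern = re.compile(rf".*?({pattern2}|$)")

# Python's re may report one extra empty match at the very end of the
# string, so empty chunks are dropped here.
chunks = [m.group(0) for m in chunk_pattern.finditer(text) if m.group(0)]

# Stripping the delimiter from each chunk leaves the split pieces.
pieces = [re.sub(pattern2, "", chunk) for chunk in chunks]

print(chunks)  # ['o neHI', 'tw oHIIIII', 'th ree']
print(pieces)  # ['o ne', 'tw o', 'th ree']
print(pieces == re.split(pattern2, text))  # True
```

One difference from re.split in this sketch: because empty chunks are dropped, a string that ends with the delimiter does not produce a trailing empty piece.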

The result is a column of type list.

We can use .list.eval() to process each list element.

(
   df.with_columns(sentences =    
      pl.col("content")
        .str.extract_all(rf".*?({pattern2}|$)")
        .list.eval(pl.element().str.replace_all(pattern2, ""))
   )
   .with_columns(normal_text = 
      pl.col("sentences").list.eval(
         pl.element().str.replace_all(pattern3, replacement)
      )
   )
)

shape: (2, 3)
┌────────────────────────┬────────────────────────────┬─────────────────────────┐
│ content                ┆ sentences                  ┆ normal_text             │
│ ---                    ┆ ---                        ┆ ---                     │
│ str                    ┆ list[str]                  ┆ list[str]               │
╞════════════════════════╪════════════════════════════╪═════════════════════════╡
│ o neHItw oHIIIIIth ree ┆ ["o ne", "tw o", "th ree"] ┆ ["one", "two", "three"] │
│ fo urHIIfi veHIIIIs ix ┆ ["fo ur", "fi ve", "s ix"] ┆ ["four", "five", "six"] │
└────────────────────────┴────────────────────────────┴─────────────────────────┘

If only the normal_text result is needed, the steps can be combined.

df.with_columns(normal_text = 
   pl.col("content").str.extract_all(rf".*?({pattern2}|$)")
     .list.eval(
        pl.element()
          .str.replace_all(pattern2, "")           
          .str.replace_all(pattern3, replacement)
     )
)

shape: (2, 2)
┌────────────────────────┬─────────────────────────┐
│ content                ┆ normal_text             │
│ ---                    ┆ ---                     │
│ str                    ┆ list[str]               │
╞════════════════════════╪═════════════════════════╡
│ o neHItw oHIIIIIth ree ┆ ["one", "two", "three"] │
│ fo urHIIfi veHIIIIs ix ┆ ["four", "five", "six"] │
└────────────────────────┴─────────────────────────┘

Performance comparison

Create larger strings for a basic comparison.

N = 2000
df = pl.DataFrame({
   "content": [
      "o neHItw oHIIIIIth ree" * N, 
      "fo urHIIfi veHIIIIs ix" * N] * N
})

Method               Time
.str + .list.eval    8.28 s
.apply()             29.9 s
