Polars: applying a lambda with a list comprehension as in pandas: is there a better way?

Question

pandas

df['sentences'] = df['content'].str.split(pattern2)
df['normal_text'] = df['sentences'].apply(lambda x: [re.sub(pattern3, ' ', sentence) for sentence in x])

Polars

df = df.with_columns(pl.col('content').map_elements(lambda x: re.split(pattern2, x)).alias('sentences'))
df = df.with_columns(pl.col('sentences').map_elements(lambda x: [re.sub(pattern3, ' ', sentence) for sentence in x]).alias('normal_text'))

Is there a more elegant way to do this?

Answer 1

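This functionality is available natively in Polars through the .str namespace. .str.split() does not support regular expressions, but .extract_all() and .replace_all() can be used to achieve similar behavior.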

import polars as pl

df = pl.DataFrame({"content": ["o neHItw oHIIIIIth ree", "fo urHIIfi veHIIIIs ix"]})

pattern2 = r"HI+"
pattern3 = r"\s"

replacement = ""

df.with_columns(
   pl.col("content").str.extract_all(rf".*?({pattern2}|$)")
     .alias("sentences")
)

shape: (2, 2)
┌────────────────────────┬────────────────────────────────────┐
│ content                ┆ sentences                          │
│ ---                    ┆ ---                                │
│ str                    ┆ list[str]                          │
╞════════════════════════╪════════════════════════════════════╡
│ o neHItw oHIIIIIth ree ┆ ["o neHI", "tw oHIIIII", "th ree"] │
│ fo urHIIfi veHIIIIs ix ┆ ["fo urHII", "fi veHIIII", "s ix"] │
└────────────────────────┴────────────────────────────────────┘

The resulting lists can then be processed with list.eval() to "extract" the desired result.

df.with_columns(
   pl.col("content").str.extract_all(rf".*?({pattern2}|$)")
     .list.eval(
        pl.element().str.replace_all(pattern2, "")
                    .str.replace_all(pattern3, replacement)
     )
     .alias("normal_text")
)

shape: (2, 2)
┌────────────────────────┬─────────────────────────┐
│ content                ┆ normal_text             │
│ ---                    ┆ ---                     │
│ str                    ┆ list[str]               │
╞════════════════════════╪═════════════════════════╡
│ o neHItw oHIIIIIth ree ┆ ["one", "two", "three"] │
│ fo urHIIfi veHIIIIs ix ┆ ["four", "five", "six"] │
└────────────────────────┴─────────────────────────┘

Performance

A basic comparison of the two approaches.

N = 2000
df = pl.DataFrame({
   "content": [
      "o neHItw oHIIIIIth ree" * N, 
      "fo urHIIfi veHIIIIs ix" * N] * N
})

Approach	Time
.str + .list.eval()	8.28 s
.map_elements()	29.9 s
