I need to find which of a set of substrings each string contains, and I am looking for an efficient way to do it.
Slow version:
import polars as pl

def search_text(queries, text):
    # Return every query that occurs somewhere in the text.
    return [query for query in queries if query in text]

pl_df = pl.DataFrame({
    "Title": ["I am aa", "I am bbob"]
})
queries = ['aa', 'bb']
# map_elements calls the Python function once per row, which is why this is slow.
pl_df = pl_df.with_columns(
    pl.col('Title')
    .map_elements(lambda text: search_text(queries, text))
    .alias('Title_match')
)
print(pl_df)
shape: (2, 2)
┌───────────┬─────────────┐
│ Title     ┆ Title_match │
│ ---       ┆ ---         │
│ str       ┆ list[str]   │
╞═══════════╪═════════════╡
│ I am aa   ┆ ["aa"]      │
│ I am bbob ┆ ["bb"]      │
└───────────┴─────────────┘
You can try .str.extract_all().
You can combine the query strings into a single regular expression:
>>> import re
...
... queries = "aa", "bb", "am"
... query = "|".join(map(re.escape, sorted(queries, key=len, reverse=True)))
...
... pl_df.with_columns(
... pl.col("Title").str.extract_all(query)
... .alias("Title_match")
... )
shape: (2, 2)
┌───────────┬──────────────┐
│ Title     ┆ Title_match  │
│ ---       ┆ ---          │
│ str       ┆ list[str]    │
╞═══════════╪══════════════╡
│ I am aa   ┆ ["am", "aa"] │
├───────────┼──────────────┤
│ I am bbob ┆ ["am", "bb"] │
└───────────┴──────────────┘
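
To see why the pattern is built with re.escape and sorted longest-first, here is a small sketch using Python's own re module as a stand-in for the Polars regex engine (an assumption: the two engines should agree on plain alternations like this one, but they are not identical in general). The helper name build_pattern and the sample queries are made up for illustration.

```python
import re

def build_pattern(queries):
    # Longest first, so "bbob" is tried before its prefix "bb".
    ordered = sorted(queries, key=len, reverse=True)
    # re.escape keeps characters such as "." literal instead of "any character".
    return "|".join(map(re.escape, ordered))

queries = ["bb", "bbob", "a.a"]  # hypothetical queries
text = "I am bbob, a.a and aXa"

pattern = build_pattern(queries)
sorted_matches = re.findall(pattern, text)

# Same alternation without the length sort: the shorter "bb" wins over "bbob".
unsorted_pattern = "|".join(map(re.escape, queries))
unsorted_matches = re.findall(unsorted_pattern, text)

# Caveat: regex matches do not overlap, so this can differ from the
# "query in text" loop when two queries share characters.
regex_hits = re.findall(build_pattern(["ab", "bc"]), "abc")
loop_hits = [q for q in ["ab", "bc"] if q in "abc"]

print(pattern)
print(sorted_matches)
print(unsorted_matches)
print(regex_hits, loop_hits)
```

Note the last caveat: if your queries can overlap inside the same text, the single-regex approach may return fewer matches than the slow version, so check whether that matters for your data.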