Python-Polars: str.find() returns an unexpected index when the string contains non-ASCII (UTF-8) characters

Question

The str.find() method in polars behaves differently from str.find() in Python and pandas. Is there a parameter for handling UTF-8 characters, or is this a bug?

Example code in Python:

import polars as pl

# Define a custom function that wraps the str.find() method
def find_substring(s, substring):
    return int(s.find(substring))

# Test df
df = pl.DataFrame({
    "text": ["testтестword",None,'']
})

# Apply the custom function to the "text" column using map_elements()
substr = 'word'
df = df.with_columns(
    pl.col('text').str.find(substr,literal=True,strict=True).alias('in_polars'),   
    pl.col("text").map_elements(lambda s: find_substring(s, substr), return_dtype=pl.Int64).alias('find_check')
)

print(df)

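Result: for the first row, the in_polars and find_check columns disagree.

The documentation lists no parameter for setting the character encoding.
Using my own function works as a workaround, but it is slow.

Can you suggest something faster that does not rely on map_elements? Thanks.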


Answer 1

There is an open issue on the GitHub tracker:

https://github.com/pola-rs/polars/issues/14190

I think we should update the documentation to make it clear that we return byte offsets.
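To make the byte-offset point concrete, here is a small sketch in plain Python (no Polars involved). It assumes, as stated above, that the offset Polars reports is a position in the UTF-8 encoding of the string; the variable names are purely illustrative.

```python
text = "testтестword"
needle = "word"

# Python's str.find() counts Unicode code points (characters).
char_offset = text.find(needle)

# Searching the UTF-8 encoded bytes counts bytes instead.
# Each Cyrillic letter here takes 2 bytes, so the offset is larger.
encoded = text.encode("utf-8")
byte_offset = encoded.find(needle.encode("utf-8"))

# A byte offset can be turned back into a character offset by
# decoding the prefix that precedes the match and measuring its length.
recovered = len(encoded[:byte_offset].decode("utf-8"))

print(char_offset, byte_offset, recovered)  # 8 12 8
```

The two offsets agree only when everything before the match is ASCII, which is why the difference shows up only with non-ASCII text.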

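As for the actual goal: it seems you want to split a string into two parts and take the right-hand side.

You could use a regular expression, e.g. with .str.extract() (df is the one-row url frame defined further down):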

df.with_columns(after = pl.col("url").str.extract(r"\?(.*)"))

shape: (1, 2)
┌────────────────────────────────────┬──────────────┐
│ url                                ┆ after        │
│ ---                                ┆ ---          │
│ str                                ┆ str          │
╞════════════════════════════════════╪══════════════╡
│ https://тестword.com/?foo=bar&id=1 ┆ foo=bar&id=1 │
└────────────────────────────────────┴──────────────┘

.str.splitn() might be another option:

df = pl.DataFrame({
    "url": ["https://тестword.com/?foo=bar&id=1"]
})

df.with_columns(
    pl.col("url").str.splitn("?", 2)
      .struct.rename_fields(["before", "after"])
      .struct.unnest()
)

shape: (1, 3)
┌────────────────────────────────────┬───────────────────────┬──────────────┐
│ url                                ┆ before                ┆ after        │
│ ---                                ┆ ---                   ┆ ---          │
│ str                                ┆ str                   ┆ str          │
╞════════════════════════════════════╪═══════════════════════╪══════════════╡
│ https://тестword.com/?foo=bar&id=1 ┆ https://тестword.com/ ┆ foo=bar&id=1 │
└────────────────────────────────────┴───────────────────────┴──────────────┘

It returns a Struct, which we rename and unnest into columns.
