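I have a column of long strings (e.g. sentences) on which I want to run the cleaning steps described below.
I currently do this as follows: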
for memo_field in self.memo_columns:
    data = data.with_columns(
        pl.col(memo_field).map_elements(
            lambda x: self.filter_field(text=x, word_dict=word_dict))
    )
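The filter_field method uses plain Python, so:
- text_sani = re.sub(r'[^a-zA-Z0-9\s\_\-\%]', ' ', text) to replace
- text_sani = text_sani.split(' ') to split
- len(re.findall(r'[A-Za-z]', x)) to count the letters in each element of the text_sani list (and similarly for digits); the ratio is the difference between the two counts divided by their sum
- an if condition to filter the list of words
It actually isn't too bad: 128M rows take about 10 minutes. Unfortunately, future files will be much bigger. On a file of roughly 300M rows, this approach gradually increases memory consumption until the OS (Ubuntu) kills the process. Also, all processing seems to take place on a single core.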
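To make the ratio step described above concrete, here is a minimal standard-library sketch. The helper name token_ratio is hypothetical; the two character classes are the ones used in the functions further down, and the sample tokens come from the toy data.

```python
import re

def token_ratio(token):
    """Return (letters - digits) / (letters + digits), or None if the token has neither."""
    letters = len(re.findall(r"[A-Za-z]", token))
    digits = len(re.findall(r"[0-9_/]", token))  # '_' and '/' count as digit-like, as in num_dec
    total = letters + digits
    return None if total == 0 else (letters - digits) / total

# Show the ratio for a few tokens taken from the toy data
for token in ["atm", "6079", "SAPF124", "04-04-2023", ""]:
    print(repr(token), token_ratio(token))
```

An all-letter token gives 1, a token with no letters gives -1, and SAPF124 (4 letters, 3 digits) gives 1/7, which falls outside the kept range of -0.7 to 0.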
I have started experimenting with Polars string expressions; the code and a toy example are provided below.
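At this point it looks like my only option is a function call to do the rest. My questions are:
- Does map_elements create a copy of the original series, thereby consuming more memory?
- ... struct ...?
Update: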
Toy data:
import polars as pl

temp = pl.DataFrame({"foo": ['COOOPS.autom.SAPF124',
                             'OSS REEE PAAA comp. BEEE  atm 6079 19000000070 04-04-2023',
                             'ABCD 600000000397/7667896-6/REG.REF.REE PREPREO/HMO',
                             'OSS REFF pagopago cost. Becf  atm 9682 50012345726 10-04-2023']
                     })
Code Functions:
import re

def num_dec(x):
    # count digit-like characters: 0-9, '_' and '/'
    return len(re.findall(r'[0-9_\/]', x))

def num_letter(x):
    # count ASCII letters
    return len(re.findall(r'[A-Za-z]', x))

def letter_dec_ratio(x):
    # (letters - digits) / (letters + digits); None when the word has neither
    if len(x) == 0:
        return None
    nl = num_letter(x)
    nd = num_dec(x)
    if (nl + nd) == 0:
        return None
    ratio = (nl - nd)/(nl + nd)
    return ratio

def filter_field(text=None, word_dict=None):
    if type(text) is not str or word_dict is None:
        return 'no memo and/or dictionary'
    if len(text) > 100:
        text = text[0:101]  # keeps the first 101 characters
    print("TEXT: ",text)
    text_sani = re.sub(r'[^a-zA-Z0-9\s\_\-\%]', ' ', text) # parse by replacing most artifacts and symbols with space
    words = text_sani.split(' ') # create words separated by spaces
    print("WORDS: ",words)
    ratios = [letter_dec_ratio(w) for w in words]
    # keep all-numeric words, mixed words with a ratio in [-0.7, 0], and all-letter words found in word_dict
    kept = [
        w.lower()
        for w, r in zip(words, ratios)
        if r is not None and (r == -1 or -0.7 <= r <= 0 or (r == 1 and w.lower() in word_dict))
    ]
    print("FINAL: ",' '.join(kept))
    return ' '.join(kept)
Code Current Implementation:
temp.with_columns(
    pl.col("foo").map_elements(
        lambda x: filter_field(text=x, word_dict=['cost','atm'])).alias('clean_foo') # baseline
)
Code Partial Attempt w/Polars:
This gets me the correct WORDS (see next code block):
temp.with_columns(
    (
        pl.col("foo")
        .str.replace_all(r'[^a-zA-Z0-9\s\_\-\%]',' ')
        .str.split(' ')
    )
)
Expected Result (at each step; see the print statements above):
TEXT: COOOPS.autom.SAPF124
WORDS: ['COOOPS', 'autom', 'SAPF124']
FINAL:
TEXT: OSS REEE PAAA comp. BEEE  atm 6079 19000000070 04-04-2023
WORDS: ['OSS', 'REEE', 'PAAA', 'comp', '', 'BEEE', '', 'atm', '6079', '19000000070', '04-04-2023']
FINAL: atm 6079 19000000070 04-04-2023
TEXT: ABCD 600000000397/7667896-6/REG.REF.REE PREPREO/HMO
WORDS: ['ABCD', '600000000397', '7667896-6', 'REG', 'REF', 'REE', 'PREPREO', 'HMO']
FINAL: 600000000397 7667896-6
TEXT: OSS REFF pagopago cost. Becf  atm 9682 50012345726 10-04-2023
WORDS: ['OSS', 'REFF', 'pagopago', 'cost', '', 'Becf', '', 'atm', '9682', '50012345726', '10-04-2023']
FINAL: cost atm 9682 50012345726 10-04-2023
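The expected output above can double as a regression check when porting the logic to Polars expressions. The sketch below is a condensed, print-free restatement of filter_field from the question, with the FINAL lines written as assertions. The names filter_field_ref and EXPECTED are hypothetical, and the sketch omits the type check and the truncation of long texts, neither of which affects the toy rows.

```python
import re

def filter_field_ref(text, word_dict):
    """Condensed restatement of the question's filter_field (no prints, no truncation)."""
    words = re.sub(r"[^a-zA-Z0-9\s_\-%]", " ", text).split(" ")
    kept = []
    for w in words:
        nl = len(re.findall(r"[A-Za-z]", w))   # letters
        nd = len(re.findall(r"[0-9_/]", w))    # digit-like characters
        if nl + nd == 0:                       # empty or symbol-only word: no ratio, dropped
            continue
        ratio = (nl - nd) / (nl + nd)
        if ratio == -1 or -0.7 <= ratio <= 0 or (ratio == 1 and w.lower() in word_dict):
            kept.append(w.lower())
    return " ".join(kept)

# Toy rows mapped to the FINAL lines shown above
EXPECTED = {
    "COOOPS.autom.SAPF124": "",
    "OSS REEE PAAA comp. BEEE  atm 6079 19000000070 04-04-2023": "atm 6079 19000000070 04-04-2023",
    "ABCD 600000000397/7667896-6/REG.REF.REE PREPREO/HMO": "600000000397 7667896-6",
    "OSS REFF pagopago cost. Becf  atm 9682 50012345726 10-04-2023": "cost atm 9682 50012345726 10-04-2023",
}

for text, final in EXPECTED.items():
    assert filter_field_ref(text, ["cost", "atm"]) == final
```

Any expression-based rewrite can be compared against the same four pairs.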
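The filtering can be implemented with Polars' native expression API as follows. I took the regular expressions from the naive implementation in the question.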
word_list = ["cost", "atm"]

# to avoid long expressions in ``pl.Expr.list.eval``
num_dec_expr = pl.element().str.count_matches(r'[0-9_\/]').cast(pl.Int32)
num_letter_expr = pl.element().str.count_matches(r'[A-Za-z]').cast(pl.Int32)
ratio_expr = (num_letter_expr - num_dec_expr) / (num_letter_expr + num_dec_expr)

(
    temp  # the toy DataFrame from the question
    .with_columns(
        pl.col("foo")
        # convert to lowercase
        .str.to_lowercase()
        # replace special characters with space
        .str.replace_all(r"[^a-z0-9\s\_\-\%]", " ")
        # split string at spaces into list of words
        .str.split(" ")
        # filter list of words
        .list.eval(
            pl.element().filter(
                # only keep non-empty string...
                pl.element().str.len_chars() > 0,
                # ...that either
                # - are in the list of words,
                # - consist only of characters related to numbers,
                # - have a ratio between -0.7 and 0
                pl.element().is_in(word_list) | num_letter_expr.eq(0) | ratio_expr.is_between(-0.7, 0)
            )
        )
        # join list of words into string
        .list.join(" ")
        .alias("foo_clean")
    )
)
shape: (4, 2)
┌───────────────────────────────────────────────────────────────┬──────────────────────────────────────┐
│ foo                                                           ┆ foo_clean                            │
│ ---                                                           ┆ ---                                  │
│ str                                                           ┆ str                                  │
╞═══════════════════════════════════════════════════════════════╪══════════════════════════════════════╡
│ COOOPS.autom.SAPF124                                          ┆                                      │
│ OSS REEE PAAA comp. BEEE  atm 6079 19000000070 04-04-2023     ┆ atm 6079 19000000070 04-04-2023      │
│ ABCD 600000000397/7667896-6/REG.REF.REE PREPREO/HMO           ┆ 600000000397 7667896-6               │
│ OSS REFF pagopago cost. Becf  atm 9682 50012345726 10-04-2023 ┆ cost atm 9682 50012345726 10-04-2023 │
└───────────────────────────────────────────────────────────────┴──────────────────────────────────────┘
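One difference from the question's logic may be worth checking before running this on real data. The sketch below mirrors both keep rules in plain Python for a single, already lowercased token. The names keep_question and keep_expr are hypothetical, and keep_expr assumes that str.count_matches counts regex matches the same way re.findall does. On the toy tokens the two rules agree, but a non-empty token with neither letters nor digit-like characters, such as a lone '-' or '%', has no ratio in the question's code and is dropped there, while the num_letter_expr.eq(0) branch keeps it.

```python
import re

WORD_LIST = ["cost", "atm"]

def counts(token):
    """Return (number of letters, number of digit-like characters)."""
    nl = len(re.findall(r"[A-Za-z]", token))
    nd = len(re.findall(r"[0-9_/]", token))
    return nl, nd

def keep_question(token):
    """Keep rule from the question's filter_field."""
    nl, nd = counts(token)
    if nl + nd == 0:
        return False  # letter_dec_ratio returns None, so the token is dropped
    ratio = (nl - nd) / (nl + nd)
    return ratio == -1 or -0.7 <= ratio <= 0 or (ratio == 1 and token in WORD_LIST)

def keep_expr(token):
    """Plain-Python mirror of the predicate passed to list.eval (an assumption, not Polars itself)."""
    if len(token) == 0:
        return False
    nl, nd = counts(token)
    # nl == 0 short-circuits, so the division only runs when nl > 0
    return token in WORD_LIST or nl == 0 or -0.7 <= (nl - nd) / (nl + nd) <= 0

for token in ["atm", "6079", "sapf124", "04-04-2023", "-", "%"]:
    print(repr(token), keep_question(token), keep_expr(token))
```

If such symbol-only tokens should be dropped as in the question, one option is to additionally require at least one digit-like character in the numeric branch.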