Here is my input:
"Once there was a (so-called) rock. it.,was not! in fact, a big rock."
I need it to output an array that looks like this:
["Once", " ", "there", " ", "was", " ", "a", ",", "so", " ", "called", ",", "rock", ".", "it", ".", "was", " ", "not", ".", "in", " ", "fact", ",", "a", " ", "big", " ", "rock"]
The input has to go through some rules for the punctuation to come out like this. The rules are:
spaceDelimiters = " -_"
commaDelimiters = ",():;\""
periodDelimiters = ".!?"
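If there is a space delimiter, it should be replaced with a space. The same goes for the comma and period delimiters. Commas take precedence over spaces, and periods take precedence over commas.
I have been able to remove all the delimiters, but I need them as separate elements of the array. There is also a hierarchy: periods take precedence over commas, which take precedence over spaces.
Maybe my approach is wrong? This is what I have: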
import re

def split(string, delimiters):
    regex_pattern = '|'.join(map(re.escape, delimiters))
    return re.split(regex_pattern, string)
It ends up all wrong. Not even close.
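For reference, a minimal sketch of why the delimiters vanish: re.split discards whatever the pattern matches unless the pattern is wrapped in a capturing group, in which case the matched text is returned as separate list items. The names DELIMITERS, RUN, split_drop and split_keep are illustrative only and are not part of the attempt above.

```python
import re

# All delimiter characters from the three rule sets, combined.
DELIMITERS = " -_" + ",():;\"" + ".!?"
# A character class matching a run of one or more delimiter characters.
RUN = "[" + re.escape(DELIMITERS) + "]+"

def split_drop(text):
    # No capturing group: the matched delimiters are thrown away.
    return re.split(RUN, text)

def split_keep(text):
    # Capturing group: each delimiter run is kept as its own item.
    return re.split("(" + RUN + ")", text)

print(split_drop("in fact, a big rock."))
print(split_keep("in fact, a big rock."))
```

Both calls leave a trailing empty string when the text ends in a delimiter, so the result still needs filtering, and the kept runs (such as ", ") still need to be reduced to a single symbol.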
Use the re library to split the text on word boundaries, then apply the replacements in order of precedence:
import re

s = "Once there was a (so-called) rock. it.,was not! in fact, a big rock."

# split the string into tokens along word boundaries
# (splitting on an empty match such as \b needs Python 3.7+)
# note: \b treats "_" as a word character, so "a_b" is not split here
regex = r"\b"
l = re.split(regex, s)

def replaceDelimeters(token: str):
    # in each token, identify whether it contains a delimiter
    spaceDelimiters = r"[^- _]*[- _]+[^- _]*"
    commaDelimiters = r"[^,():;\"]*[,():;\"]+[^,():;\"]*"
    periodDelimiters = r"[^.!?]*[.!?]+[^.!?]*"
    # substitute the replacement, highest precedence first
    token = re.sub(periodDelimiters, ".", token)
    token = re.sub(commaDelimiters, ",", token)
    token = re.sub(spaceDelimiters, " ", token)
    return token

# apply to every non-empty token
result = [replaceDelimeters(token) for token in l if token != ""]
print(result)
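An alternative that may be worth considering, sketched under the assumption that any unbroken run of delimiter characters should collapse to one symbol: split on the delimiter runs themselves with a capturing group, then map each run to its highest-precedence symbol. This also splits on "_", which a word-boundary split does not. The names SPACE, COMMA, PERIOD, RUN, normalize and tokenize are hypothetical, chosen for this example.

```python
import re

SPACE = " -_"
COMMA = ",():;\""
PERIOD = ".!?"

# One capturing group around a run of any delimiter characters, so that
# re.split keeps each run as its own list item.
RUN = re.compile("([" + re.escape(SPACE + COMMA + PERIOD) + "]+)")

def normalize(run):
    """Collapse a run of delimiters to its highest-precedence symbol."""
    if any(ch in PERIOD for ch in run):
        return "."
    if any(ch in COMMA for ch in run):
        return ","
    return " "

def tokenize(text):
    parts = RUN.split(text)
    # With a single capturing group, odd indexes hold the delimiter runs.
    tokens = [normalize(p) if i % 2 else p for i, p in enumerate(parts)]
    # Drop the empty strings produced when the text starts or ends
    # with a delimiter.
    return [t for t in tokens if t != ""]

s = "Once there was a (so-called) rock. it.,was not! in fact, a big rock."
print(tokenize(s))
```

For the sample sentence this yields the array from the question followed by a final "." for the sentence-ending period; if that last item is unwanted, it would have to be removed in a separate step.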