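I want to split a string into sentences while keeping the punctuation marks (for example: ?, !, .), and if a sentence ends with a double quote, I want to keep that too.
I used the re.split() function in Python 3 to split my string into sentences. Unfortunately, the resulting strings do not include the punctuation, and when a sentence ends with a double quote, the quote is not included either.
Here is my current code: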
import re
x = 'This is an example sentence. I want to include punctuation! What is wrong with my code? It makes me want to yell, "PLEASE HELP ME!"'
sentence = re.split(r'[\.\?\!]\s*', x)
The output I get is:
['This is an example sentence', 'I want to include punctuation', 'What is wrong with my code', 'It makes me want to yell, "PLEASE HELP ME', '"']
Try splitting on a lookbehind:
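# \s+ rather than \s*: Python 3.7+ also splits on empty matches, which would detach the closing quote.
sentences = re.split(r'(?<=[.?!])\s+', x)
print(sentences)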
['This is an example sentence.', 'I want to include punctuation!',
'What is wrong with my code?', 'It makes me want to yell, "PLEASE HELP ME!"']
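This regex splits at any point where a punctuation mark sits immediately behind the current position. It also matches and consumes the whitespace ahead of that point before continuing through the input string, so the punctuation stays with its sentence and only the whitespace is discarded.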
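To make the difference concrete, here is a small side-by-side sketch of a consuming pattern and a lookbehind pattern. It is only an illustration. It uses `\s+` for the lookbehind version so that the result does not depend on how the Python version in use treats empty matches in `re.split` (Python 3.7 and later split on them as well).

```python
import re

x = 'This is an example sentence. I want to include punctuation! What is wrong with my code? It makes me want to yell, "PLEASE HELP ME!"'

# Consuming pattern: the punctuation is part of the match, so split() throws it away.
consumed = re.split(r'[.?!]\s*', x)

# Lookbehind pattern: only the whitespace is matched. The punctuation is merely
# asserted to sit behind the split point, so it stays attached to its sentence.
kept = re.split(r'(?<=[.?!])\s+', x)

print(consumed)
print(kept)
```

The first list loses every terminator and leaves a stray `'"'` at the end. The second keeps each sentence intact, including the closing quote.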
Here is my mediocre attempt at handling the double-quote problem:
x = 'This is an example sentence. I want to include punctuation! "What is wrong with my code?" It makes me want to yell, "PLEASE HELP ME!"'
sentences = re.split(r'((?<=[.?!]")|((?<=[.?!])(?!")))\s*', x)
# The capturing groups add empty strings and None to the result, so filter them out.
print(list(filter(None, sentences)))
['This is an example sentence.', 'I want to include punctuation!',
'"What is wrong with my code?"', 'It makes me want to yell, "PLEASE HELP ME!"']
Note that it correctly splits the sentences that end with a double quote.
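A possible variation on the same idea, sketched under a few assumptions: the helper name `split_sentences` and the constant `SENTENCE_BREAK` are made up for illustration, a closing quote is always a plain `"`, and sentences are separated by at least one whitespace character. Using a non-capturing group and `\s+` means the result contains no empty strings or `None` entries, so no filtering step is needed.

```python
import re

# Hypothetical helper, not part of the original answer.
# Split on whitespace that follows ., ? or !, optionally with a closing double quote.
SENTENCE_BREAK = re.compile(r'(?:(?<=[.?!])|(?<=[.?!]"))\s+')

def split_sentences(text):
    """Return the sentences in text, keeping end punctuation and closing quotes."""
    return SENTENCE_BREAK.split(text.strip())

x = 'This is an example sentence. I want to include punctuation! "What is wrong with my code?" It makes me want to yell, "PLEASE HELP ME!"'
print(split_sentences(x))
```

Like the patterns above, this treats every `.`, `?` or `!` followed by whitespace as a sentence boundary, so abbreviations such as "Dr. Smith" would still be split.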