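I want to improve the NLTK sentence tokenizer. Unfortunately, it doesn't work well when the text leaves no space between a period and the start of the next sentence.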
from nltk.tokenize import sent_tokenize
text = "I love you.i hate you.I understand. i comprehend. i have 3.5 lines.I am bored"
sentences = sent_tokenize(text)
sentences
Output:
['I love you.i hate you.I understand.',
'i comprehend.',
'i have 3.5 lines.I am bored']
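Using a regular expression, I can split the first line into 3 separate sentences. However, I can't figure out how to capture the last sentence, which doesn't end with punctuation.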
import re
new_sentences = []
for i in sentences:
    sents = re.findall(r'\w+.*?[.?!$](?!\d)', i, flags=re.S)
    new_sentences.extend(sents)
new_sentences
Output:
['I love you.',
'i hate you.',
'I understand.',
'i comprehend.',
'i have 3.5 lines.']
I put $ in there to mark the end of the line, but it doesn't seem to care.
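A likely explanation for the behavior described above: inside a character class such as [.?!$], the $ is treated as a literal dollar sign rather than as an end-of-string anchor, so the pattern never gets a chance to match at the end of the text. The sketch below is one possible fix, not the only one. It moves $ out of the class into its own alternative. The sentences list is hard-coded from the sent_tokenize output shown in the question so that the example runs without NLTK, and the name pattern is introduced here purely for illustration.

```python
import re

# Sentences as returned by sent_tokenize in the question (hard-coded here
# so the sketch runs without NLTK installed).
sentences = [
    'I love you.i hate you.I understand.',
    'i comprehend.',
    'i have 3.5 lines.I am bored',
]

# Inside [...] a "$" is a literal dollar sign, not an anchor, so the
# end-of-string anchor is placed outside the class as a separate alternative.
# (?!\d) still keeps decimals such as "3.5" from being split.
pattern = re.compile(r'\w+.*?(?:[.?!](?!\d)|$)', flags=re.S)

new_sentences = []
for sentence in sentences:
    new_sentences.extend(pattern.findall(sentence))

print(new_sentences)
```

With this change the trailing fragment 'I am bored' should be captured as its own item, while 'i have 3.5 lines.' stays in one piece.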
Try text.split('.'). It gives you ['I love you', 'i hate you', 'I understand', ' i comprehend', ' i have 3', '5 lines', 'I am bored']
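A plain split on '.' drops the periods, leaves leading spaces, and cuts "3.5" in two. As a rough alternative sketch, re.split can split on the gap after the punctuation instead of on the punctuation itself. This assumes Python 3.7 or later, where re.split accepts patterns that can match an empty string; the names parts and split_sentences are illustrative only.

```python
import re

text = "I love you.i hate you.I understand. i comprehend. i have 3.5 lines.I am bored"

# Split at the (possibly empty) gap that follows ., ? or !, unless the next
# character is a digit, so "3.5" stays intact. The lookbehind leaves the
# punctuation attached to the sentence it ends; \s* swallows any spaces.
parts = re.split(r'(?<=[.?!])(?!\d)\s*', text)

# Drop the empty string that would appear if the text ended with punctuation.
split_sentences = [p for p in parts if p]

print(split_sentences)
```

Because the split happens on the whole text, this variant does not need the sent_tokenize pass first, though the two could also be combined.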
Is it only the last item that is missing a period?