Match punctuation or end of line

Question

I want to improve the NLTK sentence tokenizer. Unfortunately, it doesn't work well when the text leaves no space between a period and the next sentence.

from nltk.tokenize import sent_tokenize

text = "I love you.i hate you.I understand. i comprehend. i have 3.5 lines.I am bored"

sentences = sent_tokenize(text)
sentences

Output:

['I love you.i hate you.I understand.',
 'i comprehend.',
 'i have 3.5 lines.I am bored']

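So, using a regular expression, I can split the first line into 3 separate sentences. However, I can't figure out how to also get the last sentence, which doesn't end with punctuation.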

import re

new_sentences = []
for i in sentences:
    sents = re.findall(r'\w+.*?[.?!$](?!\d)', i, flags=re.S)
    new_sentences.extend(sents)
new_sentences

Output:

['I love you.',
 'i hate you.',
 'I understand.',
 'i comprehend.',
 'i have 3.5 lines.']

I put $ in there to signify the end of the line, but it doesn't seem to care.

Answer 1

Try text.split('.'). It gives you:

['I love you', 'i hate you', 'I understand', ' i comprehend', ' i have 3', '5 lines', 'I am bored']

Is it only the last item that is missing its period?
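Note that a plain split on '.' also breaks the decimal "3.5" apart and drops the punctuation. As for the regex in the question: inside a character class such as [.?!$], the $ is a literal dollar sign, not an end-of-string anchor. One possible fix, offered as a minimal sketch rather than a definitive solution, is to move $ out of the class and into an alternation. The helper name split_sentences and the hard-coded chunk list (copied from the sent_tokenize output in the question, so the example runs without NLTK) are assumptions made for illustration.

```python
import re

# Inside a character class, "$" is an ordinary character, not an anchor.
literal_dollar = re.findall(r'[$]', 'a$b')

# Either sentence punctuation not followed by a digit (so "3.5" stays intact),
# or the end of the string. Here "$" sits outside the class and acts as an anchor.
SENTENCE_RE = re.compile(r'\w+.*?(?:[.?!](?!\d)|$)', flags=re.S)


def split_sentences(chunks):
    """Hypothetical helper: re-split each chunk produced by sent_tokenize."""
    new_sentences = []
    for chunk in chunks:
        new_sentences.extend(SENTENCE_RE.findall(chunk))
    return new_sentences


# Chunks copied from the sent_tokenize output shown in the question.
sentences = [
    'I love you.i hate you.I understand.',
    'i comprehend.',
    'i have 3.5 lines.I am bored',
]

result = split_sentences(sentences)
print(result)
```

With this pattern the trailing 'I am bored' is captured even though it has no closing punctuation, because the lazy .*? keeps extending until the $ branch matches at the end of the string.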
