Match punctuation or end of line


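I want to improve on the NLTK sentence tokenizer. Unfortunately, it doesn't work very well when the text leaves no space between a period and the next sentence.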

from nltk.tokenize import sent_tokenize

text = "I love you.i hate you.I understand. i comprehend. i have 3.5 lines.I am bored"

sentences = sent_tokenize(text)
sentences

Output:

['I love you.i hate you.I understand.',
 'i comprehend.',
 'i have 3.5 lines.I am bored']

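So, using a regex, I can split the first line into 3 separate sentences. However, I can't figure out how to also capture the last sentence, which doesn't end in punctuation.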

import re

new_sentences = []
for i in sentences:
    sents = re.findall(r'\w+.*?[.?!$](?!\d)', i, flags=re.S)
    new_sentences.extend(sents)
new_sentences

Output:

['I love you.',
 'i hate you.',
 'I understand.',
 'i comprehend.',
 'i have 3.5 lines.']

I put $ in there to stand for end of line, but it doesn't seem to have any effect.

python regex nltk tokenize
1 Answer

Try text.split('.'). It gives you:
['I love you', 'i hate you', 'I understand', ' i comprehend', ' i have 3', '5 lines', 'I am bored']

Is it only the last item that is missing its period?
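
As an alternative that stays closer to the regex in the question, here is a minimal sketch. Inside a character class, $ is a literal dollar sign rather than an end-of-string anchor, so [.?!$] can never match at the end of 'I am bored'. Moving $ out of the class into an alternation lets the lazy match stop either at sentence punctuation or at the end of the string. The sentences list is copied from the question's output so the snippet runs without NLTK, and the name SENTENCE_RE is only illustrative.

```python
import re

# Copied from the sent_tokenize output shown in the question, so this
# sketch runs without NLTK installed.
sentences = [
    'I love you.i hate you.I understand.',
    'i comprehend.',
    'i have 3.5 lines.I am bored',
]

# Illustrative name. A sentence starts with word characters and runs lazily
# until either:
#   - a '.', '?' or '!' that is not followed by a digit (keeps "3.5" intact), or
#   - the end of the string ($ placed outside the character class).
SENTENCE_RE = re.compile(r'\w+.*?(?:[.?!](?!\d)|$)', flags=re.S)

new_sentences = []
for sentence in sentences:
    new_sentences.extend(SENTENCE_RE.findall(sentence))

print(new_sentences)
```

Unlike text.split('.'), this keeps the terminating punctuation and does not split 3.5 at the decimal point. It is still a heuristic: abbreviations such as "e.g." would be split, so treat it as a starting point rather than a complete tokenizer.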
