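Hi, I am trying to split a text file by paragraph. I have already split it into a list, but my goal is to extract only the paragraphs that contain a single sentence. Is there a way to do this?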
My text file is something like this:
paragraph1:
sentence
paragraph2:
sentence. sentence. sentence.
paragraph3:
sentence. sentence.
paragraph4:
sentence
I only need paragraphs 1 and 4, because they contain just one sentence. This is what I have done so far:
par = document.split(".\n")
This splits the document into a list of paragraphs. But how do I extract only the paragraphs with a single sentence? Any ideas?
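One approach might be to search the paragraph for '. ' (note the space after the '.'). Finding it would suggest that the paragraph has more than one sentence.
So you could write something like: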
if '. ' in paragraph:
    pass  # more than one sentence: don't print
else:
    print(paragraph)  # single sentence: print
Just an idea...
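To make the idea above concrete, here is a minimal sketch that applies the '. ' check to a list of paragraphs. The function name `single_sentence_paragraphs` and the sample list are assumptions made for illustration; they are not from the question.

```python
def single_sentence_paragraphs(paragraphs):
    """Keep only the paragraphs that do not contain '. ' (a period plus a space)."""
    return [p for p in paragraphs if ". " not in p]


# Hypothetical sample data mirroring the four paragraphs in the question.
paragraphs = [
    "sentence",
    "sentence. sentence. sentence.",
    "sentence. sentence.",
    "sentence",
]

for p in single_sentence_paragraphs(paragraphs):
    print(p)
```

This is only a heuristic: an abbreviation such as "e.g. " would be counted as a sentence break, and sentences ending in "?" or "!" would not be detected.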
Apply whatever condition you need before the yield. The contents of the file My_text_file.txt are:
paragraph1:
sentence
paragraph2:
sentence. sentence. sentence.
paragraph3:
sentence. sentence.
paragraph4:
sentence
My code is as follows:
import re


def gen_read_lines(file_name):
    # Yield each line of the file with surrounding whitespace stripped.
    with open(file_name) as fp:
        for line in fp:
            yield line.strip()


def is_para_starting(line):
    # A paragraph starts with a header line such as "paragraph1:".
    return bool(re.match(r"^paragraph\d+:$", line))


def gen_read_para(file_name):
    para = []
    skip = False  # True once the current paragraph has more than one sentence
    for line in gen_read_lines(file_name):
        if is_para_starting(line):
            if para and not skip:
                # Return the previous paragraph.
                yield para
            para = []
            skip = False
        if ". " in line:
            # More than one sentence: drop the whole paragraph.
            skip = True
        para.append(line)
    # Return the last paragraph, unless it is empty or was dropped.
    if para and not skip:
        yield para


if __name__ == "__main__":
    for para in gen_read_para("My_text_file.txt"):
        print("\n".join(para))
        print("~" * 10)
The output is:
paragraph1:
sentence
~~~~~~~~~~
paragraph4:
sentence
~~~~~~~~~~
Edit: added a condition to check for '. ' in the paragraph.
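If the whole file fits in memory, a more compact variant is possible. The sketch below is an alternative to the generator approach, not a replacement for it: it splits the text on the header lines with `re.split` and keeps the paragraphs whose body has no period followed by whitespace and more text. The names `HEADER`, `single_sentence_paras`, and `sample` are hypothetical, and the check is a slight generalisation of the '. ' test because it also catches a sentence break at a line end.

```python
import re

# Header lines such as "paragraph1:" on a line of their own (assumed format).
HEADER = re.compile(r"^(paragraph\d+:)[ \t]*$", re.MULTILINE)


def single_sentence_paras(text):
    """Return (header, body) pairs for paragraphs that hold a single sentence."""
    # With a capturing group, re.split keeps the headers in the result:
    # ['', 'paragraph1:', '\nsentence\n', 'paragraph2:', ...]
    parts = HEADER.split(text)
    result = []
    for header, body in zip(parts[1::2], parts[2::2]):
        body = body.strip()
        # A period followed by whitespace and more text suggests a second sentence.
        if not re.search(r"\.\s+\S", body):
            result.append((header, body))
    return result


sample = (
    "paragraph1:\nsentence\n"
    "paragraph2:\nsentence. sentence. sentence.\n"
    "paragraph3:\nsentence. sentence.\n"
    "paragraph4:\nsentence\n"
)

for header, body in single_sentence_paras(sample):
    print(header)
    print(body)
```

To run it against the real file, read the file into a string first, for example with `open("My_text_file.txt").read()`, and pass that string in place of `sample`.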