How do I extract part of a txt file based on a keyword in Python?


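I have been given a very large text file containing about 5000 HTML documents. I am trying to "search" the text file for a specific DOCNO and print all of the file's lines from there until the next </DOC> tag is encountered.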

The text file looks roughly like this:

<DOC>
<DOCNO>abc4567890</DOCNO>
contents 
more contents
<BODY> 
even more contents 
</BODY>
</DOC> 
... repeated roughly 5000 times for different DOC NO's

The output I am looking for is:

contents 
more contents
<BODY> 
even more contents 
</BODY>
</DOC> 

This is what I have been trying to implement:

doc_string = "abc4567890"

with open('myfile.txt', encoding = "utf8") as f:
    for item in f.readlines():
        if "</DOCNO>" in item:
                ID = (item [ item.find("<DOCNO>")+len("<DOCNO>") : ])
                if (ID[0:9] == doc_string):
                    print (item)
                    if "</DOC>" in item:
                       break

However, the output I get is:

<DOCNO>abc4567890</DOCNO>

Tags: python, tags, extract

1 Answer

# initialize variables:
lines = []
read_lines = False

with open('file.txt', 'r') as file:
    # iterate over each line:
    for line in file.readlines():
        # append line to lines list:
        if read_lines:
            lines.append(line)
        # set read_lines to True:
        if '</DOCNO>' in line:
            read_lines = True
        # set read_lines to False:
        if '</DOC>' in line:
            read_lines = False

# print each line:
for line in lines:
    print(line, end='')

Given your input, this will output:

contents 
more contents
<BODY> 
even more contents 
</BODY>
</DOC> 
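
Note that the snippet above starts collecting after every </DOCNO> line, so on the full file with roughly 5000 documents it would print the contents of all of them, not just the one being searched for. Below is a minimal sketch of one way to restrict it to a single DOCNO and stop at the first matching </DOC>. The function name `extract_doc` and the in-memory `sample` text are made up for illustration, and the sketch assumes each `<DOCNO>...</DOCNO>` tag sits on a single line with no extra whitespace around the ID.

```python
def extract_doc(lines, doc_id):
    """Collect the lines after <DOCNO>doc_id</DOCNO>, up to and including </DOC>."""
    # Assumption: the whole DOCNO tag is on one line, with no padding around the ID.
    target = f"<DOCNO>{doc_id}</DOCNO>"
    collected = []
    reading = False
    for line in lines:
        if reading:
            collected.append(line)
            if "</DOC>" in line:
                break  # end of the matching document, no need to read further
        elif target in line:
            reading = True  # start collecting from the next line
    return collected


# Hypothetical sample with two documents, so the sketch runs without a file on disk.
sample = """<DOC>
<DOCNO>xyz0000001</DOCNO>
other contents
</DOC>
<DOC>
<DOCNO>abc4567890</DOCNO>
contents
more contents
<BODY>
even more contents
</BODY>
</DOC>
"""

result = extract_doc(sample.splitlines(keepends=True), "abc4567890")
print("".join(result), end="")

# With the real file, the open file object can be passed directly:
# with open('myfile.txt', encoding='utf8') as f:
#     result = extract_doc(f, "abc4567890")
```

Passing the file object itself, rather than `f.readlines()`, lets Python read the file line by line, which is probably preferable for a very large file because the loop can stop as soon as the closing </DOC> is reached.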