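I am using the lxml library with Python to parse a simple XML file. In this example I want to print the text of the elements that follow each HD element in the XML below.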
<BOOK>
<HD>The Best Book Ever</HD>
<HD>Table of Contents</HD>
<EXTRACT>
<TC>I. Introduction</TC>
<TC>II. Summary</TC>
<TC>III. Topic 1</TC>
<TC>IV. Topic 2</TC>
</EXTRACT>
<HD>I. Introduction</HD>
<p>
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
<FTN>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam purus elit, suscipit eget</FTN>
</p>
<p>has been the industry standard dummy text ever since the 1500s</p>
<HD>II. Summary</HD>
<p>
<FT>data 1</FT>
data 2
<FT>data 3</FT>
</p>
<p>
<FT>data 4</FT>
data 5
<FT>data 6</FT>
</p>
<p>has been the industry standard dummy text ever since the 1500s</p>
<HD>III. Topic 1</HD>
<p>
something
<p>something else</p>
</p>
<HD>IV. Topic 2</HD>
<p>
something1
<p>something else 1</p>
</p>
<p>
something 2
<p>something else 2</p>
</p>
<HD>V. Topic 3</HD>
<p>
something not to show up
<p>because not in EXTRACT as TC</p>
</p>
</BOOK>
My Python code is shown below. It is supposed to print all the content that follows each HD tag.
import os
from lxml import etree

file_name = 'demofile2.xml'
full_file_name = os.path.abspath(os.path.join('', file_name))

def load_local_file(filename):
    dom = etree.parse(filename)
    # get all content of elements after HD tag
    TOCsHD = dom.getroot().findall('HD')
    for hd in TOCsHD:
        text = hd.text
        print(text)
        for x in hd.getnext().iter():
            print(x.text)
            print(x.tail)
        print("------------------------------")

load_local_file(full_file_name)
My output is shown below. As you can see, for II. Summary, for example, data 4, data 5, and data 6 are not printed. Can someone help me fix this? Thanks a lot!
The Best Book Ever
Table of Contents
------------------------------
Table of Contents
I. Introduction
II. Summary
III. Topic 1
IV. Topic 2
------------------------------
I. Introduction
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam purus elit, suscipit eget
------------------------------
II. Summary
data 1
data 2
data 3
------------------------------
III. Topic 1
something
something else
------------------------------
IV. Topic 2
something1
something else 1
------------------------------
V. Topic 3
something not to show up
because not in EXTRACT as TC
------------------------------
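The missing lines most likely come from getnext(), which returns only the single next sibling of an element, so the second p after an HD is never visited. A minimal sketch of the difference, using a shortened stand-in for the XML above (the inline sample string is an assumption made for illustration):

```python
from lxml import etree

# Shortened stand-in for the BOOK document (assumed sample, not the real file)
root = etree.fromstring(
    "<BOOK><HD>II. Summary</HD><p>first</p><p>second</p>"
    "<HD>III. Topic 1</HD></BOOK>"
)
hd = root.find('HD')

# getnext() returns one element: the immediate next sibling.
# iter() then walks only that element and its descendants.
next_tags = [el.tag for el in hd.getnext().iter()]
print(next_tags)     # ['p']

# itersiblings() yields every following sibling in document order.
sibling_tags = [el.tag for el in hd.itersiblings()]
print(sibling_tags)  # ['p', 'p', 'HD']
```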
You can use itersiblings, which iterates over all following siblings of an element:
import os
from lxml import etree

file_name = 'demofile2.xml'
full_file_name = os.path.abspath(os.path.join('', file_name))

def load_local_file(filename):
    dom = etree.parse(filename)
    # get all content of elements after HD tag
    TOCsHD = dom.getroot().findall('HD')
    for hd in TOCsHD:
        print("Siblings of: " + hd.text)
        theIter = hd.itersiblings()
        for x in theIter:
            # itertext() collects the text of the element and all nested tags
            print(x.tag, "".join(x.itertext()).strip().replace("\n", ""), sep=": ")
        print("------------------------------")

load_local_file(full_file_name)
I'm not sure whether this is the result you are looking for, but if you are interested in the siblings of a tag, this function will work.
Siblings of: The Best Book Ever
HD: Table of Contents
EXTRACT: I. Introduction II. Summary III. Topic 1 IV. Topic 2
HD: I. Introduction
p: Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam purus elit, suscipit eget
p: has been the industry standard dummy text ever since the 1500s
HD: II. Summary
p: data 1 data 2 data 3
p: data 4 data 5 data 6
p: has been the industry standard dummy text ever since the 1500s
HD: III. Topic 1
p: something something else
HD: IV. Topic 2
p: something1 something else 1
p: something 2 something else 2
HD: V. Topic 3
p: something not to show up because not in EXTRACT as TC
------------------------------
Siblings of: Table of Contents
EXTRACT: I. Introduction II. Summary III. Topic 1 IV. Topic 2
HD: I. Introduction
p: Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam purus elit, suscipit eget
p: has been the industry standard dummy text ever since the 1500s
HD: II. Summary
p: data 1 data 2 data 3
p: data 4 data 5 data 6
p: has been the industry standard dummy text ever since the 1500s
HD: III. Topic 1
p: something something else
HD: IV. Topic 2
p: something1 something else 1
p: something 2 something else 2
HD: V. Topic 3
p: something not to show up because not in EXTRACT as TC
------------------------------
Siblings of: I. Introduction
p: Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam purus elit, suscipit eget
p: has been the industry standard dummy text ever since the 1500s
HD: II. Summary
p: data 1 data 2 data 3
p: data 4 data 5 data 6
p: has been the industry standard dummy text ever since the 1500s
HD: III. Topic 1
p: something something else
HD: IV. Topic 2
p: something1 something else 1
p: something 2 something else 2
HD: V. Topic 3
p: something not to show up because not in EXTRACT as TC
------------------------------
Siblings of: II. Summary
p: data 1 data 2 data 3
p: data 4 data 5 data 6
p: has been the industry standard dummy text ever since the 1500s
HD: III. Topic 1
p: something something else
HD: IV. Topic 2
p: something1 something else 1
p: something 2 something else 2
HD: V. Topic 3
p: something not to show up because not in EXTRACT as TC
------------------------------
Siblings of: III. Topic 1
p: something something else
HD: IV. Topic 2
p: something1 something else 1
p: something 2 something else 2
HD: V. Topic 3
p: something not to show up because not in EXTRACT as TC
------------------------------
Siblings of: IV. Topic 2
p: something1 something else 1
p: something 2 something else 2
HD: V. Topic 3
p: something not to show up because not in EXTRACT as TC
------------------------------
Siblings of: V. Topic 3
p: something not to show up because not in EXTRACT as TC
------------------------------
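The output above repeats every later section under each heading. If only the elements between one HD and the next are wanted, the sibling loop can stop at the next HD. The following is a minimal sketch under that assumption; collect_sections is a hypothetical helper name, and the inline sample is a shortened stand-in for the real file:

```python
from lxml import etree

def collect_sections(root):
    """Map each HD heading to the text of the siblings that follow it,
    stopping at the next HD. Hypothetical helper, not part of lxml."""
    sections = {}
    for hd in root.findall('HD'):
        parts = []
        for sib in hd.itersiblings():
            if sib.tag == 'HD':
                break  # the next heading starts a new section
            # itertext() also yields text nested in child elements;
            # split/join collapses runs of whitespace into single spaces
            parts.append(" ".join("".join(sib.itertext()).split()))
        sections[hd.text] = parts
    return sections

# Shortened stand-in for the BOOK document (assumed sample)
sample = b"""<BOOK>
<HD>II. Summary</HD>
<p><FT>data 1</FT> data 2 <FT>data 3</FT></p>
<p><FT>data 4</FT> data 5 <FT>data 6</FT></p>
<HD>III. Topic 1</HD>
<p>something <p>something else</p></p>
</BOOK>"""

result = collect_sections(etree.fromstring(sample))
for heading, parts in result.items():
    print(heading)
    for part in parts:
        print("  " + part)
```

With this sample, II. Summary maps to two entries (data 1 to 3 and data 4 to 6), and III. Topic 1 maps to one. Filtering the headings against the TC entries in EXTRACT would be a further step that this sketch does not attempt.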
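Note that you also need to use itertext to get all the text inside each tag. For example, some p tags contain nested tags. If you want the text of those p tags, you need to apply itertext to get the inner text as well. You can see how this works in the line containing "".join(x.itertext()).strip().replace("\n", "").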
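To make the difference concrete, here is a small sketch comparing .text with itertext() on a p element that has nested tags. The inline sample is an assumption, and because it has no indentation the sketch replaces newlines with a space rather than an empty string:

```python
from lxml import etree

# A p element whose text is split across nested FT tags (assumed sample)
p = etree.fromstring("<p><FT>data 1</FT>\ndata 2\n<FT>data 3</FT></p>")

# .text holds only the text before the first child element; here there is none
print(repr(p.text))  # None

# itertext() yields the text and tail strings of the element and its
# descendants in document order
pieces = list(p.itertext())
print(pieces)        # ['data 1', '\ndata 2\n', 'data 3']

# Join the pieces and flatten the line breaks into one readable line
flat = "".join(pieces).strip().replace("\n", " ")
print(flat)          # data 1 data 2 data 3
```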