XML:
<?xml version="1.0"?>
<pages>
<page>
<url>http://example.com/Labs</url>
<title>Labs</title>
<subpages>
<page>
<url>http://example.com/Labs/Email</url>
<title>Email</title>
<subpages>
<page>
<url>http://example.com/Labs/Email/How_to</url>
<title>How-To</title>
</page>
</subpages>
</page>
<page>
<url>http://example.com/Labs/Social</url>
<title>Social</title>
</page>
</subpages>
</page>
<page>
<url>http://example.com/Tests</url>
<title>Tests</title>
<subpages>
<page>
<url>http://example.com/Tests/Email</url>
<title>Email</title>
<subpages>
<page>
<url>http://example.com/Tests/Email/How_to</url>
<title>How-To</title>
</page>
</subpages>
</page>
<page>
<url>http://example.com/Tests/Social</url>
<title>Social</title>
</page>
</subpages>
</page>
</pages>
Code:
# rexml is the XML string read from a URL
from xml.etree import ElementTree as ET

tree = ET.fromstring(rexml)
for node in tree.iter('page'):
    for url in node.iterfind('url'):
        print(url.text)
    for title in node.iterfind('title'):
        print(title.text)
    print('-' * 30)
Output:
http://example.com/article1
Article1
------------------------------
http://example.com/article1/subarticle1
SubArticle1
------------------------------
http://example.com/article2
Article2
------------------------------
http://example.com/article3
Article3
------------------------------
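The XML represents the tree structure of a sitemap.
I have been up and down the docs and Google all day and can't figure out how to get the depth of a node.
I tried counting the child containers, but that only works for the first parent and then breaks, because I can't figure out how to reset the count. That is probably just a hackish idea anyway.
Desired output: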
0
http://example.com/article1
Article1
------------------------------
1
http://example.com/article1/subarticle1
SubArticle1
------------------------------
0
http://example.com/article2
Article2
------------------------------
0
http://example.com/article3
Article3
------------------------------
I used lxml.html.
import lxml.html

rexml = ...

def depth(node):
    # Count the node itself plus all of its ancestors.
    d = 0
    while node is not None:
        d += 1
        node = node.getparent()
    return d

tree = lxml.html.fromstring(rexml)
for node in tree.iter('page'):
    print(depth(node))
    for url in node.iterfind('url'):
        print(url.text)
    for title in node.iterfind('title'):
        print(title.text)
    print('-' * 30)
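The standard-library ElementTree has no getparent(), so the same parent-walking idea needs a child-to-parent map built up front. The following is only a sketch under assumptions: the SAMPLE data and the page_depths name are hypothetical, and the depth counts only enclosing page elements, so that top-level pages come out as 0 as in the desired output.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, shaped like the sitemap in the question.
SAMPLE = """<pages>
  <page>
    <url>http://example.com/Labs</url>
    <title>Labs</title>
    <subpages>
      <page>
        <url>http://example.com/Labs/Email</url>
        <title>Email</title>
      </page>
    </subpages>
  </page>
  <page>
    <url>http://example.com/Tests</url>
    <title>Tests</title>
  </page>
</pages>"""

def page_depths(root):
    """Return (depth, url) pairs; depth counts the enclosing <page> elements."""
    # ElementTree elements carry no parent pointer, so record one per child.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    result = []
    for page in root.iter('page'):
        depth = 0
        node = parent_of.get(page)
        while node is not None:
            if node.tag == 'page':
                depth += 1
            node = parent_of.get(node)
        result.append((depth, page.findtext('url')))
    return result

root = ET.fromstring(SAMPLE)
for depth, url in page_depths(root):
    print(depth, url)
```

Counting only page ancestors keeps the result independent of the intermediate pages and subpages wrappers.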
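The Python ElementTree API provides iterators for depth-first traversal of an XML tree. Unfortunately, those iterators don't give the caller any depth information.
But you can write a depth-first iterator that also returns the depth of each element: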
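import xml.etree.ElementTree as ET

def depth_iter(element, tag=None):
    # Each stack entry is an iterator over the children of one open element.
    stack = []
    stack.append(iter([element]))
    while stack:
        e = next(stack[-1], None)
        if e is None:
            # This level is exhausted; go back up to the parent.
            stack.pop()
        else:
            stack.append(iter(e))
            if tag is None or e.tag == tag:
                # With this bookkeeping the starting element has depth 1.
                yield (e, len(stack) - 1)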
Note that this is more efficient than determining the depth by following the parent links (when using lxml), i.e. it is O(n) vs. O(n log n).
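As a usage sketch for the depth-tracking iterator, assuming it is started at the pages root of a sitemap like the one in the question: the root is reported at depth 1, and each nesting level adds 2 (one for subpages, one for page), so the page level is (depth - 2) // 2. The SAMPLE data and the levels name are hypothetical.

```python
import xml.etree.ElementTree as ET

def depth_iter(element, tag=None):
    # Same generator as in the answer, repeated so this sketch runs on its own.
    stack = [iter([element])]
    while stack:
        e = next(stack[-1], None)
        if e is None:
            stack.pop()
        else:
            stack.append(iter(e))
            if tag is None or e.tag == tag:
                yield (e, len(stack) - 1)

# Hypothetical sample data, shaped like the sitemap in the question.
SAMPLE = """<pages>
  <page>
    <title>Labs</title>
    <subpages>
      <page>
        <title>Email</title>
      </page>
    </subpages>
  </page>
  <page>
    <title>Tests</title>
  </page>
</pages>"""

root = ET.fromstring(SAMPLE)
# Convert the raw element depth into a 0-based page nesting level.
levels = [((depth - 2) // 2, page.findtext('title'))
          for page, depth in depth_iter(root, 'page')]
for level, title in levels:
    print(level, title)
```

The conversion formula depends on the pages/page/subpages layout; a different document structure would need a different mapping.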
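lxml is the best option, but if you have to use the standard library, don't use iter; walk the tree yourself instead, so that you always know where you are.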
from xml.etree import ElementTree as ET

# rexml is the XML string, as in the question
tree = ET.fromstring(rexml)

def sub(node, tag):
    return node.findall(tag) or []

def print_page(node, depth):
    print(depth)
    url = node.find("url")
    if url is not None:
        print(url.text)
    title = node.find("title")
    if title is not None:
        print(title.text)
    print('-' * 30)

def find_pages(node, depth=0):
    # Only direct <page> children are visited, so depth is known at each step.
    for page in sub(node, "page"):
        print_page(page, depth)
        subpage = page.find("subpages")
        if subpage is not None:
            find_pages(subpage, depth + 1)

find_pages(tree)
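If the depth is needed as data rather than as printed output, the same recursive walk can be sketched as a generator. This is only an illustration: iter_pages and the SAMPLE data are hypothetical names, not part of the answer above.

```python
import xml.etree.ElementTree as ET

def iter_pages(node, depth=0):
    """Yield (depth, url, title) for each <page>, descending through <subpages>."""
    for page in node.findall('page'):
        yield depth, page.findtext('url'), page.findtext('title')
        subpages = page.find('subpages')
        if subpages is not None:
            yield from iter_pages(subpages, depth + 1)

# Hypothetical sample data, shaped like the sitemap in the question.
SAMPLE = """<pages>
  <page>
    <url>http://example.com/Labs</url>
    <title>Labs</title>
    <subpages>
      <page>
        <url>http://example.com/Labs/Email</url>
        <title>Email</title>
        <subpages>
          <page>
            <url>http://example.com/Labs/Email/How_to</url>
            <title>How-To</title>
          </page>
        </subpages>
      </page>
    </subpages>
  </page>
  <page>
    <url>http://example.com/Tests</url>
    <title>Tests</title>
  </page>
</pages>"""

for depth, url, title in iter_pages(ET.fromstring(SAMPLE)):
    print(depth)
    print(url)
    print(title)
    print('-' * 30)
```

Yielding tuples keeps the traversal separate from the formatting, so the same walk can feed a printer, a list, or a test.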