Parsing XML: finding element subtrees without a loop

Question

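I am parsing an XML payload with ElementTree. I can't share the exact code or file because it contains sensitive information. I can successfully extract the information I need by iterating over an element (as shown in the ElementTree documentation) and appending the output to lists. For example: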

list_col_name = []
list_col_value = []

for col in root.iter('my_table'):
    # get col name
    col_name = col.find('col_name').text
    list_col_name.append(col_name)
    # get col value
    col_value = col.find('col_value').text
    list_col_value.append(col_value)

I can now put them into a dictionary and carry on with the rest of the work:

dict_ = dict(zip(list_col_name, list_col_value))

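However, I need this to run as fast as possible, and I'm wondering whether there is a way to extract list_col_name in one go (i.e. with findall() or something similar). More generally, I'd like to know how to speed up the XML parsing, if that's possible. All answers and suggestions are appreciated. Thanks in advance.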

Answer 1

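My suggestion is to parse the source file incrementally, based on the iterparse method. The reasons are that you:

don't actually need the whole parsed XML tree,
can discard elements you have already processed during incremental parsing, so memory demand is lower as well.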

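Another hint is to use the lxml library instead of ElementTree. Although both libraries provide an iterparse method, the lxml version has an additional tag parameter, so you can limit the loop to just the tags of interest.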
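For comparison, without lxml the tag filtering has to be done by hand inside the loop. The following is a minimal sketch using the standard library's xml.etree.ElementTree.iterparse; the function name pairs_from_stream and the inline SAMPLE document are illustrative assumptions, not part of the original answer.

```python
import io
import xml.etree.ElementTree as ET

# Small inline document standing in for the real file (hypothetical data).
SAMPLE = b"""<root>
  <my_table id="t1"><col_name>N1</col_name><col_value>V1</col_value></my_table>
  <my_table id="t2"><col_name>N2</col_name><col_value>V2</col_value></my_table>
</root>"""

def pairs_from_stream(source):
    """Build {col_name: col_value} incrementally from a path or file object."""
    result = {}
    key = None
    # Only "end" events are requested, so each element is complete when seen.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "col_name":
            key = elem.text
        elif elem.tag == "col_value":
            result[key] = elem.text
        elif elem.tag == "my_table":
            # Drop the children of a finished my_table to keep memory low.
            elem.clear()
    return result

print(pairs_from_stream(io.BytesIO(SAMPLE)))
```

Every element still passes through the Python loop here, which is the overhead the lxml tag parameter is meant to avoid.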

The source file I used looks like this:

<root>
  <my_table id="t1">
    <col_name>N1</col_name>
    <col_value>V1</col_value>
    <some_other_stuff>xx1</some_other_stuff>
  </my_table>
  <my_table id="t2">
    <col_name>N2</col_name>
    <col_value>V2</col_value>
    <some_other_stuff>xx1</some_other_stuff>
  </my_table>
  <my_table id="t3">
    <col_name>N3</col_name>
    <col_value>V3</col_value>
    <some_other_stuff>xx1</some_other_stuff>
  </my_table>
</root>

Actually, my source file:

includes 9 my_table elements (not 3),
has some_other_stuff repeated 8 times in each my_table, to simulate the other elements that each my_table contains.

I ran 3 tests using %timeit:

Your loop, preceded by parsing the source XML file:

from lxml import etree as et

def fn1():
    root = et.parse('Tables.xml')
    list_col_name = []
    list_col_value = []
    for col in root.iter('my_table'):
        col_name = col.find('col_name').text
        list_col_name.append(col_name)
        col_value = col.find('col_value').text
        list_col_value.append(col_value)
    return dict(zip(list_col_name, list_col_value))

Execution time: 1.74 ms.

My loop, based on iterparse, processing only the "required" elements:
```
def fn2():
    key = ''
    dict_ = {}
    context = et.iterparse('Tables.xml', tag=['my_table', 'col_name', 'col_value'])
    for action, elem in context:
        tag = elem.tag
        txt = elem.text
        if tag == 'col_name':
            key = txt
        elif tag == 'col_value':
            dict_[key] = txt
        elif tag == 'my_table':
            elem.clear()
            elem.getparent().remove(elem)
    return dict_
```
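I assume that in each my_table element col_name occurs before col_value, and that each my_table contains only one child named col_name and one named col_value.
Note also that the function above clears each my_table element and then removes it from the parsed XML tree (the getparent function is available only in the lxml version).
Another improvement is that I add each key/value pair directly to the dictionary the function returns, so zip is not needed.
Execution time: 1.33 ms. Not dramatically faster, but at least some gain is visible.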
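If the ordering assumption above cannot be guaranteed, one possible variant is to wait for the end of each my_table and read both children at that point. This is only a sketch with the standard library parser; pairs_per_table and SAMPLE are hypothetical names, and it has not been timed against fn2.

```python
import io
import xml.etree.ElementTree as ET

# col_value deliberately precedes col_name in the second table.
SAMPLE = (
    b"<root>"
    b"<my_table><col_name>N1</col_name><col_value>V1</col_value></my_table>"
    b"<my_table><col_value>V2</col_value><col_name>N2</col_name></my_table>"
    b"</root>"
)

def pairs_per_table(source):
    """Collect pairs at each my_table end event, regardless of child order."""
    result = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "my_table":
            continue
        # The children are fully parsed by the time the parent's end event fires.
        name = elem.findtext("col_name")
        if name is not None:
            result[name] = elem.findtext("col_value")
        elem.clear()
    return result

print(pairs_per_table(io.BytesIO(SAMPLE)))
```

The trade-off is two findtext lookups per table instead of a plain tag comparison, so it may be somewhat slower than the order-dependent loop.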

You can also read all col_name and col_value elements by calling findall, and then call zip:

def fn3():
    root = et.parse('Tables.xml')
    list_col_name = []
    for elem in root.findall('.//col_name'):
        list_col_name.append(elem.text)
    list_col_value = []
    for elem in root.findall('.//col_value'):
        list_col_value.append(elem.text)
    return dict(zip(list_col_name, list_col_value))

Execution time: 1.38 ms. Also faster than your original solution, but not significantly different from my first solution (fn2).


Of course, the final result depends heavily on:

the size of the input file,
how much "other stuff" each my_table element contains.
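To see how these two factors affect the numbers on your own data, one option is to generate test files of different shapes and time a candidate function with the standard timeit module (the script equivalent of %timeit). The sketch below is an assumption-laden illustration: write_sample, parse_then_findall, and the chosen sizes are hypothetical, and it uses the standard library parser rather than lxml.

```python
import os
import tempfile
import timeit
import xml.etree.ElementTree as ET

def write_sample(path, n_tables=9, n_filler=8):
    """Write a synthetic file shaped like the sample above; sizes are adjustable."""
    root = ET.Element("root")
    for i in range(1, n_tables + 1):
        table = ET.SubElement(root, "my_table", id=f"t{i}")
        ET.SubElement(table, "col_name").text = f"N{i}"
        ET.SubElement(table, "col_value").text = f"V{i}"
        for _ in range(n_filler):
            ET.SubElement(table, "some_other_stuff").text = "xx1"
    ET.ElementTree(root).write(path)

def parse_then_findall(path):
    """Candidate under test: full parse followed by two findall calls."""
    root = ET.parse(path).getroot()
    names = [e.text for e in root.findall("./my_table/col_name")]
    values = [e.text for e in root.findall("./my_table/col_value")]
    return dict(zip(names, values))

path = os.path.join(tempfile.mkdtemp(), "Tables.xml")
for n_tables in (9, 900):
    write_sample(path, n_tables=n_tables)
    # Average over 50 calls; absolute numbers will vary by machine.
    seconds = timeit.timeit(lambda: parse_then_findall(path), number=50) / 50
    print(f"{n_tables} tables: {seconds * 1e3:.3f} ms per call")
```

Swapping parse_then_findall for another candidate, or varying n_filler, gives a rough idea of where each approach starts to pay off.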

Answer 2

Consider using findall in a list comprehension to avoid the list initialization/append calls and the explicit for loop, which may marginally improve performance:

# FINDALL LIST COMPREHENSION
list_col_name = [e.text for e in root.findall('./my_table/col_name')]
list_col_value = [e.text for e in root.findall('./my_table/col_value')]

dict(zip(list_col_name, list_col_value))

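Alternatively, with lxml (a third-party library), which fully supports XPath 1.0, consider xpath(), which can assign the parsed output directly to lists, likewise avoiding the initialization/append calls and the for loop.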
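The code that accompanied this suggestion is not preserved here, so the following is only a sketch of how xpath() might be used this way, not the answer's original snippet. The name extract_pairs and the inline XML are assumptions, and the fallback branch exists solely so the example still runs where lxml is not installed.

```python
try:
    from lxml import etree  # third-party; its elements provide xpath()
except ImportError:  # fallback so the sketch still runs without lxml
    etree = None
    import xml.etree.ElementTree as ET

XML = b"""<root>
  <my_table id="t1"><col_name>N1</col_name><col_value>V1</col_value></my_table>
  <my_table id="t2"><col_name>N2</col_name><col_value>V2</col_value></my_table>
</root>"""

def extract_pairs(xml_bytes):
    """Return {col_name: col_value} without an explicit for loop where possible."""
    if etree is not None:
        root = etree.fromstring(xml_bytes)
        # text() makes xpath() return the strings themselves as a list.
        names = root.xpath("./my_table/col_name/text()")
        values = root.xpath("./my_table/col_value/text()")
    else:
        root = ET.fromstring(xml_bytes)
        names = [e.text for e in root.findall("./my_table/col_name")]
        values = [e.text for e in root.findall("./my_table/col_value")]
    return dict(zip(names, values))

print(extract_pairs(XML))
```

One caveat with text(): an empty col_name or col_value yields no text node, so the two lists can fall out of step if the data has empty elements.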
