Python lxml - How to remove empty repeated tags

Question

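I have some XML generated by a script that may or may not contain empty elements. I have been told that the XML must no longer contain empty elements. Here is an example: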

<customer>  
    <govId>
       <id>@</id>
       <idType>SSN</idType>
           <issueDate/>
           <expireDate/>
           <dob/>
           <state/>
           <county/>
           <country/>
    </govId>
    <govId>
        <id/>
        <idType/>
        <issueDate/>
        <expireDate/>
        <dob/>
        <state/>
        <county/>
        <country/>
    </govId>
</customer>

The output should look like this:

<customer>  
    <govId>
       <id>@</id>
       <idType>SSN</idType>        
    </govId>        
</customer>

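I need to remove all the empty elements. You will notice that my code removes the empty children of the first "govId" element but removes nothing from the second one. I am currently using lxml.objectify.

This is basically what I am doing: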

root = objectify.fromstring(xml)
for customer in root.customers.iterchildren():
    for e in customer.govId.iterchildren():
        if not e.text:
            customer.govId.remove(e)

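Does anyone know a way to do this with lxml objectify, or is there an easier way? I would also like to remove the second "govId" element entirely if all of its child elements are empty.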

Answer 1

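First, the problem with your code is that you iterate over customers but not over govIds. On the third line you take the first govId of each customer and iterate over its children. You would therefore need another for loop for the code to work as you intended.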
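To make the missing loop concrete, here is a minimal sketch that hard-codes exactly one level of nesting. It uses the plain etree API rather than objectify, and the names SAMPLE and strip_empty_children are made up for illustration. The fallback to the standard library is only there so the snippet runs without lxml installed; the calls it uses (fromstring, findall, remove) exist in both.

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib if lxml is missing
    import xml.etree.ElementTree as etree

# Shortened version of the XML from the question (illustrative only).
SAMPLE = """
<customer>
    <govId>
        <id>@</id>
        <idType>SSN</idType>
        <issueDate/>
        <state/>
    </govId>
    <govId>
        <id/>
        <idType/>
        <issueDate/>
        <state/>
    </govId>
</customer>
""".strip()


def strip_empty_children(root):
    """Remove empty children of every govId, then any govId left childless."""
    # findall() returns a list, so removing govIds inside the loop is safe.
    for gov_id in root.findall("govId"):
        # Copy the child list before removing anything from it.
        for child in list(gov_id):
            if not child.text:
                gov_id.remove(child)
        if len(gov_id) == 0:
            root.remove(gov_id)


root = etree.fromstring(SAMPLE)
strip_empty_children(root)
print([[child.tag for child in gov_id] for gov_id in root.findall("govId")])
```

This only handles children directly below govId; deeper nesting needs the recursive check.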

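This small sentence at the end of the question makes the problem more complicated: "I would also like to remove the second govId element entirely if all of its child elements are empty."

This means that, unless you want to hard-code a check for just one level of nesting, you need to check recursively whether an element and its children are empty. For example:

def recursively_empty(e):
    if e.text:
        return False
    return all((recursively_empty(c) for c in e.iterchildren()))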

Note: this requires Python 2.5+ because it uses the all() builtin.

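You can then change your code to something like the following, which removes every element in the document that is empty all the way down.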

# Walk over all elements in the tree and remove all
# nodes that are recursively empty
context = etree.iterwalk(root)
for action, elem in context:
    parent = elem.getparent()
    if recursively_empty(elem):
        parent.remove(elem)

Sample output:

<customer>
    <govId>
       <id>@</id>
       <idType>SSN</idType>
    </govId>
</customer>

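One thing you might want to do is refine the condition if e.text: in the recursive function. As written, it treats None and the empty string as empty, but not whitespace such as spaces and newlines. If whitespace-only text is part of your definition of "empty", use str.strip().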
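A minimal sketch of that refinement, assuming "empty" means no non-whitespace text anywhere below an element (attributes and tail text are deliberately ignored here). The helpers is_blank and prune are hypothetical names, and the pruning is done top-down from the parent, so it does not need the lxml-specific getparent() and also runs on the standard-library fallback.

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib if lxml is missing
    import xml.etree.ElementTree as etree


def is_blank(text):
    """True for None, the empty string, and whitespace-only text."""
    return text is None or not text.strip()


def recursively_empty(elem):
    """True if neither elem nor any descendant carries non-blank text."""
    if not is_blank(elem.text):
        return False
    return all(recursively_empty(child) for child in elem)


def prune(elem):
    """Remove every recursively empty descendant of elem; elem itself is kept."""
    # Copy the child list so that removals do not disturb the iteration.
    for child in list(elem):
        if recursively_empty(child):
            elem.remove(child)
        else:
            prune(child)


root = etree.fromstring(
    "<customer><govId><id>@</id><dob> </dob></govId>"
    "<govId>\n  <id/>\n  <state>\n</state>\n</govId></customer>"
)
prune(root)
print([elem.tag for elem in root.iter()])
```

Here the whitespace-only dob element and the entire second govId are dropped, while the first govId keeps its id child.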

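Edit: As @Dave pointed out, the recursive function can be improved by using a generator expression:

return all((recursively_empty(c) for c in e.getchildren()))

This does not evaluate recursively_empty(c) for all children up front; instead, each child is evaluated lazily. Since all() stops iterating at the first False result, this can mean a significant performance improvement.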
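The short-circuit behaviour is easy to check without any XML. In this small sketch, the function check and the list children are stand-ins invented for illustration; check records each call so the lazy and eager variants can be compared.

```python
calls = []


def check(value):
    """Record that value was inspected, then report whether it is empty (falsy)."""
    calls.append(value)
    return not value


children = ["", "text", "", ""]

# Generator expression: all() stops at the first False result.
calls.clear()
lazy_result = all(check(c) for c in children)
lazy_calls = len(calls)

# List comprehension: every item is checked before all() sees the list.
calls.clear()
eager_result = all([check(c) for c in children])
eager_calls = len(calls)

print(lazy_result, lazy_calls)    # False 2
print(eager_result, eager_calls)  # False 4
```

Both variants return the same answer; the generator version simply stops inspecting children once one non-empty child has been found.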

Edit 2: The expression can be optimized further by using e.iterchildren() instead of e.getchildren(). This works with both the lxml etree API and the objectify API.

Answer 2

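I found a simpler solution that does not need recursion. Here we reverse the order of iteration, so the innermost leaf nodes are visited and removed first, followed by their parents if those have become empty.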

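from lxml import etree


def remove_empty_xml(root: etree._Element) -> None:
    # Walk over elements in reverse document order so that
    # leaf nodes are visited before their parents.
    for elem in reversed(list(root.iter())):
        # Keep elements that contain non-whitespace text.
        if elem.text is not None and elem.text.strip():
            continue
        # Keep elements that still have children.
        if len(elem) > 0:
            continue
        parent = elem.getparent()
        # The root has no parent and cannot be removed this way.
        if parent is not None:
            parent.remove(elem)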
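The reversed order is enough because a parent always precedes its descendants in document order, so in the reversed list every child is handled before its parent. The sketch below checks that on a tiny tree. To stay runnable without lxml it falls back to the standard library and looks parents up in a dictionary instead of calling the lxml-specific getparent(); the name remove_empty_bottom_up and the returned list of removed tags are additions for illustration only.

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib if lxml is missing
    import xml.etree.ElementTree as etree


def remove_empty_bottom_up(root):
    """Remove elements with no non-blank text and no children, leaves first."""
    # Map each element to its parent; the root has no entry.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    removed = []
    for elem in reversed(list(root.iter())):
        if elem.text is not None and elem.text.strip():
            continue  # has real text, keep it
        if len(elem) > 0:
            continue  # still has children, keep it
        parent = parent_of.get(elem)
        if parent is None:
            continue  # never remove the root
        parent.remove(elem)
        removed.append(elem.tag)
    return removed


root = etree.fromstring("<a><b><c/></b><d>x</d></a>")
removed = remove_empty_bottom_up(root)
print(removed)
print([elem.tag for elem in root.iter()])
```

The element c is removed first, which leaves b childless by the time the loop reaches it, so b is removed as well.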
