Python lxml: how do I delete all tags between two specified tags?

Question

I have an XML file (the document.xml from a Word .docx file) and I need to remove certain parts of it.

The structure looks like this:

<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
 <w:body>
        <w:p>
            Bunch of nested tags
        </w:p>
        <w:p>
            Bunch of nested tags to delete
        </w:p>
        <w:p>
            Bunch of nested tags to delete
        </w:p>
        <w:tbl>
            Bunch of nested tags to delete
        </w:tbl>
        <w:p>
            Bunch of nested tags
        </w:p>
 </w:body>
</w:document>

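I want to delete every tag between two specified boundary tags, along with all of their content. The removal should include startTag, exclude endTag, and delete everything in between.

Both boundary tags are <w:p> tags, and between them are a number of other tags, for example <w:tbl> tags, which I also want to delete.

My problem is that I don't know how to delete all of these tags. Any help?

The desired output is: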

<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
 <w:body>
        <w:p>
            Bunch of nested tags
        </w:p>
        <w:p>
            Bunch of nested tags
        </w:p>
 </w:body>
</w:document>

Here is what I tried:

I managed to get the boundary tags:

startTag = parentBoundaryTags[3]
endTag = parentBoundaryTags[4]

The boundary tag values are:

<Element p at 0x12cf32ccfa0>
<Element p at 0x12cf32ccff0>

I tried to get the common parent of the boundary tags because, from what I have read, I need it in order to remove the elements beneath it:

common_ancestor = startTag.getparent()

The value of common_ancestor is:

<Element body at 0x12cf32cccd0>

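That makes sense to me because it matches my XML structure; it is what I expected to see.

I use getchildren() to iterate over all direct children of the <w:body> tag. I am trying to delete its direct children, starting from the child that is my startTag boundary tag.

I want to keep deleting direct children until I reach the one that is my endTag boundary tag.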

# Flag to indicate whether to start removing elements
start_removal = False

# List to store elements to be removed
elements_to_remove = []

# Iterate over the children of the common ancestor
for child in common_ancestor.getchildren():
    if child == startTag:
        start_removal = True
        elements_to_remove.append(child)
    elif child == endTag:
        start_removal = False
        break
    elif start_removal:
        elements_to_remove.append(child)

# Remove the collected elements
for element in elements_to_remove:
    common_ancestor.remove(element)

# Write the modified XML tree back to the document.xml file
tree.write(document_xml, encoding='utf-8', xml_declaration=True)

I expected this to delete all the tags between the boundary tags, but it did not delete anything at all.

Can anyone help?

Answer 1

Here is an XSLT-based solution.

It uses the so-called identity transform pattern.
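For readers new to the pattern, the following is a minimal sketch of how an identity transform behaves when one empty template is added. The toy input and the element names `root`, `keep`, and `drop` are assumptions made up for illustration; they are not part of the asker's document. The identity template copies everything, and the more specific empty template suppresses the nodes it matches.

```python
from lxml import etree

# Hypothetical toy input, not the asker's document.xml.
SOURCE = b"<root><keep>a</keep><drop>b</drop><keep>c</keep></root>"

STYLESHEET = b"""\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Identity template: copies every attribute and node unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Empty template: a more specific match that emits nothing. -->
  <xsl:template match="drop"/>
</xsl:stylesheet>
"""

# Build the transformer from the stylesheet and apply it to the input.
transform = etree.XSLT(etree.fromstring(STYLESHEET))
result = transform(etree.fromstring(SOURCE))

# Serialize the result tree to bytes for inspection.
output = etree.tostring(result)
print(output.decode())
```

Because the empty template has a higher default priority than the `@*|node()` identity template, it wins for `drop` elements, and everything else is copied through.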

Input XML

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:body>
        <w:p>Bunch of nested tags</w:p>
        <w:p>Bunch of nested tags to delete</w:p>
        <w:p>Bunch of nested tags to delete</w:p>
        <w:tbl>Bunch of nested tags to delete</w:tbl>
        <w:p>Bunch of nested tags</w:p>
    </w:body>
</w:document>

XSLT

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <xsl:output method="xml" omit-xml-declaration="no" encoding="UTF-8" indent="yes" standalone="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="w:body/w:*[position() != 1 and position() != last()]"/>
</xsl:stylesheet>

Output XML

<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:body>
        <w:p>Bunch of nested tags</w:p>
        <w:p>Bunch of nested tags</w:p>
    </w:body>
</w:document>

Python

import lxml.etree as lx

# Parse XML and XSLT
doc = lx.parse("Input.xml")
style = lx.parse("Style.xslt")
outfile = "Output.xml"

# Configure and run transformer
transformer = lx.XSLT(style)
result = transformer(doc)

# Output to file
with open(outfile, "wb") as f:
    f.write(result)
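The XSLT rule above drops every child of w:body except the first and the last, which matches the sample input. If the boundaries are arbitrary sibling elements, as in the question, a direct lxml approach may be easier to adapt. The following is a minimal sketch, assuming both boundary elements share the same parent; the helper name `remove_range` and the shortened sample document are hypothetical and exist only for illustration.

```python
from lxml import etree

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def remove_range(start, end):
    """Remove `start` and each following sibling up to, but not including, `end`.

    Hypothetical helper; assumes `start` and `end` are siblings.
    """
    parent = start.getparent()
    if parent is None or end.getparent() is not parent:
        raise ValueError("start and end must share the same parent")

    # Collect first, then remove, so the tree is not mutated while iterating.
    doomed = [start]
    for sibling in start.itersiblings():
        if sibling is end:
            break
        doomed.append(sibling)
    else:
        # The loop finished without meeting `end`.
        raise ValueError("end does not follow start")

    for element in doomed:
        parent.remove(element)


# Shortened stand-in for document.xml (illustrative content only).
xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p>keep 1</w:p><w:p>delete 1</w:p><w:p>delete 2</w:p>"
    "<w:tbl>delete 3</w:tbl><w:p>keep 2</w:p>"
    "</w:body></w:document>"
)
root = etree.fromstring(xml)
body = root[0]

# The second child is the start boundary, the fifth is the end boundary.
remove_range(body[1], body[4])

remaining = [child.text for child in body]
print(remaining)
```

Iterating with `itersiblings()` from the start element avoids scanning the whole parent, and checking the shared parent up front turns a silent no-op into an explicit error when the boundaries are not actually siblings.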
