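I have a Python script that is supposed to find existing tags in an XML document and replace them with new, more descriptive tags. The problem is that after I run the script, it only seems to catch every few instances of the text string I give it. I'm sure there is a reason for this behavior, but I can't figure it out.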
import xml.etree.ElementTree as ET
from lxml import etree

def replace_specific_line_tags(input_file, output_file, replacements):
    # Parse the XML file using lxml
    tree = etree.parse(input_file)
    root = tree.getroot()
    for target_text, replacement_tag in replacements:
        # Find all <line> tags with the specific target text under <content> and replace them with the new tag
        for line_tag in root.xpath('.//content/page/line[contains(., "{}")]'.format(target_text)):
            parent = line_tag.getparent()
            # Create the new tag with the desired tag name
            new_tag = etree.Element(replacement_tag)
            # Copy the attributes of the original <line> tag to the new tag
            for attr, value in line_tag.attrib.items():
                new_tag.set(attr, value)
            # Copy the text of the original <line> tag to the new tag
            new_tag.text = line_tag.text
            # Replace the original <line> tag with the new tag
            parent.replace(line_tag, new_tag)
    # Write the updated XML back to the file
    with open(output_file, 'wb') as f:
        tree.write(f, encoding='utf-8', xml_declaration=True)

if __name__ == '__main__':
    input_file_name = 'beforeTagEdits.xml'
    output_file_name = 'afterTagEdits.xml'
    # List of target texts and their corresponding replacement tags
    replacements = [
        ('The Washington Post', 'title'),
        # Add more target texts and their replacement tags as needed
    ]
    replace_specific_line_tags(input_file_name, output_file_name, replacements)
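Since the code runs, just not quite as expected, I tried changing some of the text strings to match exact strings known to be in the original file, but that doesn't seem to fix the problem. Here is a sample of the current XML document: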
<root>
    <content>
        <line>The Washington Post</line>
        <line>The Washington Post</line>
    </content>
</root>
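One thing worth checking against the sample above: the XPath in the script looks for `<line>` under `content/page`, while the sample shows `<line>` directly under `<content>`. Whether the real file mixes both layouts is an assumption, but a quick count along these lines would show how many lines each path reaches. This is a minimal sketch using only the standard library; `count_matches` and `SAMPLE` are hypothetical names introduced here for illustration.

```python
import xml.etree.ElementTree as ET

# Sample shaped like the one in the question: <line> sits directly under <content>.
SAMPLE = """<root>
    <content>
        <line>The Washington Post</line>
        <line>The Washington Post</line>
    </content>
</root>"""


def count_matches(root, path, target_text):
    """Count elements found by `path` whose full text contains `target_text`."""
    return sum(
        1
        for elem in root.iterfind(path)
        if target_text in "".join(elem.itertext())
    )


root = ET.fromstring(SAMPLE)
target = "The Washington Post"

# Path used by the script: only reaches <line> inside <content>/<page>.
strict = count_matches(root, ".//content/page/line", target)
# Looser path: reaches <line> anywhere in the document.
loose = count_matches(root, ".//line", target)

print(f"content/page/line: {strict}, any line: {loose}")
```

If the two counts differ on the real file, the missed lines may be sitting outside a `<page>` element rather than failing the text match.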
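You can iterate over the root with iter() and rename every tag whose text matches the search text:
import xml.etree.ElementTree as ET

xml = """<root>
    <content>
        <line>The Washington Post</line>
        <line>The Washington Post</line>
        <tag>Spiegel</tag>
    </content>
</root>"""

root = ET.fromstring(xml)

# Map each search text to its new tag name. The text has to be the key:
# a dict cannot hold the same key ('title') twice, the last one would win.
pattern = {'The Washington Post': 'title', 'Spiegel': 'title'}

for text, new_tag in pattern.items():
    for elem in root.iter():
        # Exact comparison against the element's own text
        if elem.text == text:
            elem.tag = new_tag

ET.dump(root)
Output:
<root>
    <content>
        <title>The Washington Post</title>
        <title>The Washington Post</title>
        <title>Spiegel</title>
    </content>
</root>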
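To apply the same idea to the files from the question, something like the following might work. It is a sketch under assumptions, not a drop-in replacement: `rename_matching_tags` is a hypothetical helper, it uses the standard library rather than lxml, and it compares the whitespace-stripped text for equality instead of the substring match (`contains`) used in the original script. Renaming `elem.tag` in place leaves attributes, child elements, and tail text as they were, so nothing has to be copied to a new element.

```python
import xml.etree.ElementTree as ET


def rename_matching_tags(input_file, output_file, replacements):
    """Rename every element whose stripped text equals a key in `replacements`.

    `replacements` maps target text -> new tag name (hypothetical helper,
    not part of the original script). Returns the number of renamed elements.
    """
    tree = ET.parse(input_file)
    renamed = 0
    for elem in tree.getroot().iter():
        # Ignore leading/trailing whitespace such as newlines and indentation.
        text = (elem.text or "").strip()
        if text in replacements:
            # Changing .tag keeps attributes, children and tail untouched.
            elem.tag = replacements[text]
            renamed += 1
    tree.write(output_file, encoding="utf-8", xml_declaration=True)
    return renamed
```

With the file names from the question, a call could look like `rename_matching_tags('beforeTagEdits.xml', 'afterTagEdits.xml', {'The Washington Post': 'title'})`. Comparing the returned count with the number of lines you expect to change is a quick way to spot instances that still do not match.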