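I have marked-up text in XML format. I need to add more markup, that is, wrap certain words that occur in the text in tags.
This is what I have tried so far: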
import xml.etree.ElementTree as ET

doc = '''<root><par>An <fr>example</fr> text with key words one and two</par></root>'''
profs = ['one', 'two']
tag = '<key>'
tag_cl = '</key>'

root = ET.fromstring(doc)
for child in root:
    for word in profs:
        # child.text only holds the text before the first nested element
        if word in child.text:
            child.text = child.text.replace(word, f'{tag}{word}{tag_cl}')
    print(child.text)
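This works if the text contains no nested tags. If there is a nested tag ("fr" in this example), child.text only covers the text before that first tag. Surely there must be a simple way to do what I describe. Can you give me a hint?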
Here is an XSLT 2.0 implementation of the task.
Input XML
<?xml version="1.0"?>
<root>
<par>An <fr>example</fr> text with key words one and two</par>
</root>
XSLT 2.0
<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes" encoding="utf-8"
                omit-xml-declaration="no"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="text()">
        <xsl:call-template name="OneTwoSequence"/>
    </xsl:template>

    <xsl:template name="OneTwoSequence">
        <xsl:param name="string" select="string(.)"/>
        <xsl:analyze-string select="$string" regex="one|two">
            <xsl:matching-substring>
                <key>
                    <xsl:value-of select="."/>
                </key>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>
</xsl:stylesheet>
Output
<?xml version='1.0' encoding='utf-8' ?>
<root>
<par>An
<fr>example</fr> text with key words
<key>one</key> and
<key>two</key>
</par>
</root>
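For readers following along in Python, the way xsl:analyze-string alternates between matching and non-matching substrings can be approximated with re.split and a capture group. This is only a rough analogue in Python's re module, not a statement about XSLT regex semantics, and the sample string is simply the text that follows the fr element in the input above.

```python
import re

text = " text with key words one and two"

# A capture group makes re.split keep the matched keywords in the result,
# so the pieces alternate: even indexes are plain text, odd indexes are matches.
parts = re.split(r"(one|two)", text)

for index, part in enumerate(parts):
    kind = "matching" if index % 2 else "non-matching"
    print(f"{kind:>12}: {part!r}")
```

Note that a bare `one|two` pattern also matches inside longer words such as "someone"; whether that matters depends on your text.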
You are very close, but you will have to use lxml instead of ElementTree to get there:
from lxml import html as lh

# doc is the XML string from the question
root = lh.fromstring(doc)
# locate the relevant element
target = root.xpath('//fr')[0]
# convert the relevant element to a string and copy it to a new string;
# that is a necessary step because we're going to have to delete the
# original element
target_str = lh.tostring(target).decode()
# make the necessary changes to the string
profs = ['one', 'two']
for word in profs:
    if word in target_str:
        target_str = target_str.replace(word, f'<key>{word}</key>')
# locate the destination for the new element
destination = root.xpath('//par')[0]
# remove the original target
destination.remove(target)
# insert the new string, converted into a new element
destination.insert(0, lh.fromstring(target_str))
print(lh.tostring(root))
The output should be what you expect.
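What you are looking for is the element's tail, that is, the text that follows an element's closing tag. If necessary, you can duplicate the if condition to handle the element's text as well: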
import xml.etree.ElementTree as ET

doc = '''<root><par>An <fr>example</fr> text with key words one and two</par></root>'''
profs = ['one', 'two']
tag = ET.Element('key')

root = ET.fromstring(doc)
for elem in root.iter():
    # print(elem.text)
    # print(elem.tail)
    for word in profs:
        # the text after a nested element lives in that element's tail
        if elem.tail is not None and word in elem.tail:
            tag.text = word
            elem.tail = elem.tail.replace(word, ET.tostring(tag).decode())
    if elem.tail is not None:
        print(elem.tail)
Output:
text with key words <key>one</key> and <key>two</key>
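One caveat with the snippet above: the markup is stored as a plain string inside elem.tail, so serializing the tree with ET.tostring would escape it as &lt;key&gt; rather than emit real elements. If real key elements are needed in the tree, a minimal sketch along the following lines might help. The helper name wrap_keywords is hypothetical, and whole-word matching via \b is an assumption that differs from the plain substring replace used above.

```python
import re
import xml.etree.ElementTree as ET


def wrap_keywords(root, words, tag="key"):
    """Wrap whole-word matches of `words` in real <tag> elements, in place."""
    # Assumption: match whole words only; drop the \b anchors for substring matching.
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in words) + r")\b")

    def make_keys(parts):
        # parts alternates plain text and matches: odd indexes are the matches,
        # and the plain text after each match becomes that element's tail.
        keys = []
        for i in range(1, len(parts), 2):
            key = ET.Element(tag)
            key.text = parts[i]
            key.tail = parts[i + 1]
            keys.append(key)
        return keys

    # Snapshot the elements first, because children are inserted while looping.
    for parent in list(root.iter()):
        children = list(parent)
        # Walk children in reverse so earlier insert positions stay valid.
        for index in range(len(children) - 1, -1, -1):
            child = children[index]
            if child.tail:
                parts = pattern.split(child.tail)
                if len(parts) > 1:
                    child.tail = parts[0]
                    for offset, key in enumerate(make_keys(parts), start=1):
                        parent.insert(index + offset, key)
        # Finally handle the text before the first child.
        if parent.text:
            parts = pattern.split(parent.text)
            if len(parts) > 1:
                parent.text = parts[0]
                for offset, key in enumerate(make_keys(parts)):
                    parent.insert(offset, key)


doc = '<root><par>An <fr>example</fr> text with key words one and two</par></root>'
root = ET.fromstring(doc)
wrap_keywords(root, ['one', 'two'])
print(ET.tostring(root, encoding='unicode'))
# <root><par>An <fr>example</fr> text with key words <key>one</key> and <key>two</key></par></root>
```

The sketch handles both text and tail for every element, so keywords before, inside, or after nested tags are all covered.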