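I want to be able to include custom "HTML" tags in a string, for example: "This is a <photo id="4" /> string".
In this case, the custom tag is <photo id="4" />. I'm also fine with changing the tag syntax, for example to [photo id:4] or something else, if that makes things easier.
I want to pass this string to a function that extracts the tag <photo id="4" /> and lets me convert it into a more complex template, such as <div class="photo"><img src="...." alt="..."></div>, which I can then use to replace the tag in the original string.
I'm imagining it working something like this: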
>>> content = 'This is a <photo id="4" /> string'
# Pass the string to a function that returns all the tags with the given name.
>>> tags = parse_tags('photo', content)
>>> print(tags)
[{'tag': 'photo', 'id': 4, 'raw': '<photo id="4" />'}]
# Now I know I need to render a photo with ID 4, so I can pass that to some sort of template function.
>>> rendered = render_photo(id=tags[0]['id'])
>>> print(rendered)
<div class="photo"><img src="...." alt="..."></div>
>>> content = content.replace(tags[0]['raw'], rendered)
>>> print(content)
This is a <div class="photo"><img src="...." alt="..."></div> string
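I think this is a fairly common pattern, e.g. for putting photos in blog posts, so I'm wondering whether there is a library that does something like the parse_tags function in the example above. Or do I need to write it myself?
The photo tag is just one example. I want to have tags with different names. As a different example, maybe I have a database of people and want a tag like <person name="John Doe" />. In that case, the output I want is something like {'tag': 'person', 'name': 'John Doe', 'raw': '<person name="John Doe" />'}. I could then use the name to look up that person and return a rendered template of the person's vcard or something similar.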
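If you're working with HTML5, I'd suggest looking into the xml module (etree). It lets you parse the whole document into a tree structure and manipulate tags individually (then convert the result back into an html document).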
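As a rough illustration of the tree-based approach, here is a minimal sketch using the standard library's xml.etree.ElementTree. It assumes the input is a well-formed fragment with a single root element (a bare string like the one in the question would need to be wrapped first). The function name render_photos, the /photos/ path and the alt text are made up for the example.

```python
import xml.etree.ElementTree as ET


def render_photos(fragment):
    """Rewrite every <photo id="..."/> element in a well-formed fragment."""
    root = ET.fromstring(fragment)
    # Take a snapshot of the matches first, since the tree is modified below.
    for el in list(root.iter('photo')):
        photo_id = el.get('id')
        # Reuse the element in place so the text following it (el.tail) is kept.
        el.tag = 'div'
        el.attrib.clear()
        el.set('class', 'photo')
        # Hypothetical src path and alt text, not taken from the question.
        ET.SubElement(el, 'img', {'src': '/photos/%s.jpg' % photo_id,
                                  'alt': 'photo %s' % photo_id})
    # method='html' writes void elements such as <img> without a closing tag.
    return ET.tostring(root, encoding='unicode', method='html')


result = render_photos('<p>This is a <photo id="4" /> string</p>')
print(result)
```

Because the element is converted in place rather than removed and re-inserted, the surrounding text does not have to be stitched back together by hand.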
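You can also use regular expressions to perform the text substitution. If you don't have many changes to make, this will probably be faster than loading an xml tree structure.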
import re
text = """<html><body>some text <photo> and tags <photo id="4"> more text <person name="John Doe"> yet more text"""
tags = ["photo", "person", "abc"]
patterns = "|".join([f"(<{tag} .*?>)|(<{tag}>)" for tag in tags])
matches = list(re.finditer(patterns, text))
for match in reversed(matches):
    tag = text[match.start():match.end()]
    print(match.start(), match.end(), tag)
    # substitute what you need for that tag
    text = text[:match.start()] + "***" + text[match.end():]
print(text)
This prints:
64 88 <person name="John Doe">
39 53 <photo id="4">
22 29 <photo>
<html><body>some text *** and tags *** more text *** yet more text
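Performing the replacements in reverse order ensures that the ranges found by finditer() remain valid as the text changes with each substitution.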
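The same regex idea can be bent into the shape of the parse_tags function sketched in the question. This is only one possible sketch, under a few assumptions: attribute values are always double-quoted, they never contain ">", and they come back as strings (so 'id' is '4', not 4). The render_photo helper and its src path are placeholders.

```python
import re

# Matches name="value" pairs inside a tag (double-quoted values only).
ATTR_RE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def parse_tags(name, text):
    """Return one dict per <name ...> or <name ... /> tag found in text."""
    tag_re = re.compile(r'<%s\b([^<>]*?)/?>' % re.escape(name))
    results = []
    for match in tag_re.finditer(text):
        info = {'tag': name}
        # findall returns (attribute, value) pairs, which dict.update accepts.
        info.update(ATTR_RE.findall(match.group(1)))
        info['raw'] = match.group(0)
        results.append(info)
    return results


def render_photo(id):
    # Hypothetical renderer; the src path and alt text are placeholders.
    return ('<div class="photo"><img src="/photos/%s.jpg" alt="photo %s"></div>'
            % (id, id))


content = 'This is a <photo id="4" /> string'
tags = parse_tags('photo', content)
for tag in tags:
    content = content.replace(tag['raw'], render_photo(tag['id']))
print(tags)
print(content)
```

Replacing by the 'raw' text, as the question proposes, sidesteps the offset bookkeeping entirely, at the cost of replacing every identical occurrence of the tag at once.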
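For this kind of "surgical" parsing (where you want to isolate specific tags rather than build a full hierarchical document), pyparsing's makeHTMLTags method can be very useful.
See the annotated script below, which shows how the parser is created and used in the parseTag and replaceTag methods: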
import pyparsing as pp
def make_tag_parser(tag):
    # makeHTMLTags returns 2 parsers, one for the opening tag and one for the
    # closing tag - we only need the opening tag; the parser will return parsed
    # fields of the tag itself
    tag_parser = pp.makeHTMLTags(tag)[0]
    # instead of returning parsed bits of the tag, use originalTextFor to
    # return the raw tag as token[0] (specifying asString=False will retain
    # the parsed attributes and tag name as attributes)
    parser = pp.originalTextFor(tag_parser, asString=False)
    # add one more callback to define the 'raw' attribute, copied from t[0]
    def add_raw_attr(t):
        t['raw'] = t[0]
    parser.addParseAction(add_raw_attr)
    return parser
# parseTag to find all the matches and report their attributes
def parseTag(tag, s):
    return make_tag_parser(tag).searchString(s)
content = """This is a <photo id="4" /> string"""
tag_matches = parseTag("photo", content)
for match in tag_matches:
    print(match.dump())
    print("raw: {!r}".format(match.raw))
    print("tag: {!r}".format(match.tag))
    print("id: {!r}".format(match.id))
# transform tag to perform tag->div transforms
def replaceTag(tag, transform, s):
    parser = make_tag_parser(tag)
    # add one more parse action to do transform
    parser.addParseAction(lambda t: transform.format(**t))
    return parser.transformString(s)
print(replaceTag("photo",
                 '<div class="{tag}"><img src="<src_path>/img_{id}.jpg." alt="{tag}_{id}"></div>',
                 content))
Prints:
['<photo id="4" />']
- empty: True
- id: '4'
- raw: '<photo id="4" />'
- startPhoto: ['photo', ['id', '4'], True]
  [0]:
    photo
  [1]:
    ['id', '4']
  [2]:
    True
- tag: 'photo'
raw: '<photo id="4" />'
tag: 'photo'
id: '4'
This is a <div class="photo"><img src="<src_path>/img_4.jpg." alt="photo_4"></div> string