I have to process a large archive of extremely messy HTML into Markdown; it is full of extraneous tables, spans, and inline styles.
I'm attempting this with Beautiful Soup, and my goal is basically the output of get_text(), except with anchor tags kept intact along with their href.
As an example, I want to convert:
<td>
<font><span>Hello</span><span>World</span></font><br>
<span>Foo Bar <span>Baz</span></span><br>
<span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span>
</td>
into:
Hello World
Foo Bar Baz
Example Link: <a href="https://google.com">Google</a>
My thinking so far has been to simply grab all the tags and unwrap them all if they aren't anchors, but this causes the text to be repeated several times over, since soup.find_all(True) returns recursively nested tags as separate elements:
#!/usr/bin/env python
from bs4 import BeautifulSoup

example_html = '<td><font><span>Hello</span><span>World</span></font><br><span>Foo Bar <span>Baz</span></span><br><span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span></td>'

soup = BeautifulSoup(example_html, 'lxml')
tags = soup.find_all(True)

for tag in tags:
    if tag.name == 'a':
        print("<a href='{}'>{}</a>".format(tag['href'], tag.get_text()))
    else:
        print(tag.get_text())
which returns multiple fragments/duplicates as the parser walks down the tree:
HelloWorldFoo Bar BazExample Link: Google
HelloWorldFoo Bar BazExample Link: Google
HelloWorldFoo Bar BazExample Link: Google
HelloWorld
Hello
World
Foo Bar Baz
Baz
Example Link: Google
<a href='https://google.com'>Google</a>
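For reference, here is a minimal sketch of the "unwrap everything that isn't an anchor" idea described above, run against the same example_html. It strips the anchors down to just their href and turns each <br> into a newline before serializing; this is only one possible approach and is not necessarily robust against the real archive:

from bs4 import BeautifulSoup

example_html = '<td><font><span>Hello</span><span>World</span></font><br><span>Foo Bar <span>Baz</span></span><br><span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span></td>'

soup = BeautifulSoup(example_html, 'lxml')
td = soup.find('td')

# keep only href on each anchor
for a in td.find_all('a'):
    a.attrs = {'href': a['href']}

# turn <br> into real line breaks
for br in td.find_all('br'):
    br.replace_with('\n')

# unwrap every remaining tag that is not an anchor
for tag in td.find_all(True):
    if tag.name != 'a':
        tag.unwrap()

print(td.decode_contents())
# HelloWorld
# Foo Bar Baz
# Example Link: <a href="https://google.com">Google</a>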
One possible way to solve this is to introduce some special handling for a elements when printing out an element's text. You can do that by overriding the _all_strings() method, yielding the string representation of any a descendant element and skipping the navigable strings inside a elements. Something along these lines:
from bs4 import BeautifulSoup, NavigableString, CData, Tag


class MyBeautifulSoup(BeautifulSoup):
    def _all_strings(self, strip=False, types=(NavigableString, CData)):
        for descendant in self.descendants:
            # return "a" string representation if we encounter it
            if isinstance(descendant, Tag) and descendant.name == 'a':
                yield str(descendant)

            # skip an inner text node inside "a"
            if isinstance(descendant, NavigableString) and descendant.parent.name == 'a':
                continue

            # default behavior
            if ((types is None and not isinstance(descendant, NavigableString)) or
                    (types is not None and type(descendant) not in types)):
                continue

            if strip:
                descendant = descendant.strip()
                if len(descendant) == 0:
                    continue
            yield descendant
Demo:
In [1]: data = """
...: <td>
...: <font><span>Hello</span><span>World</span></font><br>
...: <span>Foo Bar <span>Baz</span></span><br>
...: <span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span>
...: </td>
...: """
In [2]: soup = MyBeautifulSoup(data, "lxml")
In [3]: print(soup.get_text())
HelloWorld
Foo Bar Baz
Example Link: <a href="https://google.com" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;" target="_blank">Google</a>
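Since get_text() simply joins whatever _all_strings() yields, you can also pass separator and strip to get output closer to the line-by-line form in the question. A sketch, assuming a Beautiful Soup release where get_text() forwards these arguments to _all_strings() (newer releases changed this plumbing, as a later answer notes):

In [4]: print(soup.get_text(separator='\n', strip=True))
Hello
World
Foo Bar
Baz
Example Link:
<a href="https://google.com" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;" target="_blank">Google</a>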
To consider direct children only, set recursive=False; you then need to process each 'td' and extract the text and the anchor link separately.
#!/usr/bin/env python
from bs4 import BeautifulSoup

example_html = '<td><font><span>Some Example Text</span></font><br><span>Another Example Text</span><br><span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span></td>'

soup = BeautifulSoup(example_html, 'lxml')
tags = soup.find_all(recursive=False)

for tag in tags:
    print(tag.text)
    print(tag.find('a'))
If you want the text printed on separate lines, you will have to process the spans individually.
for tag in tags:
    spans = tag.find_all('span')
    for span in spans:
        print(span.text)
    print(tag.find('a'))
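If you also want that printed anchor without the style/target noise, one option (a sketch reusing the tags list from above) is to drop every attribute except href before printing:

for tag in tags:
    for a in tag.find_all('a'):
        # keep only the href attribute so the printed tag is clean
        a.attrs = {'href': a['href']}
    print(tag.find('a'))  # <a href="https://google.com">Google</a>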
The accepted solution did not work for me (I ran into the same problem as @alextre, probably due to a version change). However, I managed to solve it by adapting and overriding the get_text() method instead of _all_strings():
from bs4 import BeautifulSoup, NavigableString, CData, Tag


class MyBeautifulSoup(BeautifulSoup):
    def get_text(self, separator='', strip=False, types=(NavigableString,)):
        text_parts = []
        for element in self.descendants:
            if isinstance(element, NavigableString):
                # skip the text node inside an <a>; it was already added together
                # with its href when the <a> tag itself was reached
                if element.parent.name == 'a':
                    continue
                text_parts.append(str(element))
            elif isinstance(element, Tag):
                if element.name == 'a' and 'href' in element.attrs:
                    # emit the anchor text followed by its target in parentheses
                    text_parts.append(element.get_text(separator=separator, strip=strip))
                    text_parts.append('(' + element['href'] + ')')
                elif isinstance(element, types):
                    text_parts.append(element.get_text(separator=separator, strip=strip))
        return separator.join(text_parts)
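A quick usage sketch, assuming example_html is the snippet from the question (the expected output also assumes the skip of anchor-internal text nodes shown in the loop above):

soup = MyBeautifulSoup(example_html, 'lxml')  # example_html as defined in the question
print(soup.get_text())
# HelloWorldFoo Bar BazExample Link: Google(https://google.com)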
If anyone wants to avoid overriding or decorating the class... IMHO a good-enough approach is to iterate over all descendants of the root element and append (for example) a span element containing the link's href as a child of the <a>, before doing any get_text() call. So, using the OP's example:
from bs4 import BeautifulSoup, Tag

example_html = '<td><font><span>Hello</span><span>World</span></font><br><span>Foo Bar <span>Baz</span></span><br><span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span></td>'

soup = BeautifulSoup(example_html, 'html.parser')

for el in soup.descendants:
    if isinstance(el, Tag):
        if el.name == 'a' and 'href' in el.attrs:
            # append a span holding the href so it survives get_text()
            new_span = soup.new_tag('span')
            new_span.string = ' (' + el.attrs['href'] + ')'
            el.insert(len(el.contents), new_span)

print(soup.get_text())  # HelloWorldFoo Bar BazExample Link: Google (https://google.com)
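Since the end goal here is Markdown, a small variant of the same idea (a sketch, not part of the original answer) is to replace each anchor with a Markdown-style link before calling get_text(); find_all() returns a plain list, so it is safe to modify the tree while looping over it:

from bs4 import BeautifulSoup

soup = BeautifulSoup(example_html, 'html.parser')  # example_html as above
for a in soup.find_all('a'):
    # swap the whole <a> tag for a Markdown link built from its text and href
    a.replace_with('[{}]({})'.format(a.get_text(), a['href']))

print(soup.get_text())  # HelloWorldFoo Bar BazExample Link: [Google](https://google.com)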