BeautifulSoup：跳过 html 元素

Question

我有以下 html 结构：这只是其中的一部分，但我认为这个片段足以解释我的问题。

<tr>
<td> Color Digest </td>
<td> AgAkAZwCJgMZ </td>
</tr>
<tr>
<td> Color Digest </td>
<td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>

我有以下代码来提取“Color Digest”标签的下一个兄弟

for td in soupPage.html.findAll('td'):
    if td.text == 'Color Digest':
        if td.nextSibling.text != " ":
            a = set()
            a = "[" + td.nextSibling.text.strip(",") + "]"
            print a

但我想跳过

<td> AgAkAZwCJgMZ </td>

并获取

<td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>

中的值

我可以遵循的最好的 beautifulsoup 机制是什么？

Answer 1

您可以通过以下方式实现这一目标：

import re
from bs4 import BeautifulSoup

html = """
<tr>
<td> Color Digest </td>
<td> AgAkAZwCJgMZ </td>
</tr>
<tr>
<td> Color Digest </td>
<td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>
"""

soup = BeautifulSoup(html)
output = soup.find_all('td', text = re.compile(" Color Digest "))[1].find_next('td').text


print(output)

Answer 2

更多 html（例如整个表格）将有助于编写更健壮的内容。

如果您知道要排除的字符串。使用 bs4 4.7.1

from bs4 import BeautifulSoup as bs

html = '''
<tr>
<td> Color Digest </td>
<td> AgAkAZwCJgMZ </td>
</tr>
<tr>
<td> Color Digest </td>
<td> 2,36,156,38,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, </td>
</tr>
'''
soup = bs(html, 'lxml')
elems = [item.text for item in soup.select('td:contains("Color Digest") + td:not(:contains("AgAkAZwCJgMZ"))')]
print(elems)

如果不这样做，则在返回的列表上使用索引

elems = [item.text for item in soup.select('td:contains("Color Digest") + td')][1]

BeautifulSoup：跳过 html 元素

问题描述投票：0回答：2

2个回答

最新问题

BeautifulSoup：跳过 html 元素

问题描述 投票：0回答：2

2个回答

最新问题

问题描述投票：0回答：2