BeautifulSoup4: find all non-nested matches

Question

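I'm struggling to set up a simple search that returns all outermost elements matching my query in an HTML document. I'm asking here in the hope that there is a simple bs4 function for this, but I haven't found one.

Consider the following HTML example, where I want every outermost <div> with class "wanted" (I expect a list of 2):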

import bs4

text = """
<div>
    <div class="inner">
        <div class="wanted">
            I want this.
            <div class="wanted">
                I don't want that!
            </div>
        </div>
    </div>
    <div class="inner">
        <div class="wanted">
            I want this too.
        </div>
    </div>
</div>"""

soup = bs4.BeautifulSoup(text, 'lxml')

# 1. Trying all at once
fetched = soup.findAll('div', class_='wanted')
print(len(fetched))  # 3

fetched = soup.findAll('div', class_='wanted', recursive=False)
print(len(fetched))  # 0

fetched = soup.findChildren('div', class_='wanted')
print(len(fetched))  # 3

fetched = soup.findChildren('div', class_='wanted', recursive=False)
print(len(fetched))  # 0


# 2. Trying one after the other
fetched = []
fetched0 = soup.find('div', class_='wanted')

while fetched0:
    fetched.append(fetched0)
    descendants = list(fetched0.descendants)
    fetched0 = descendants[-1].findNext('div', class_='wanted')

print(len(fetched))  # 2  Hurra!

# 3. Destructive method: if you don't care about the parents of this element
fetched = []
fetched0 = soup.find('div', class_='wanted')
while fetched0:
    fetched.append(fetched0.extract())
    fetched0 = soup.find('div', class_='wanted')
print(len(fetched))  # 2, but soup has been modified

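So part # 1 does not give the expected result. What, then, is the difference between findAll and findChildren? And given the nesting here, findNextSibling is not relevant.

Part # 2 works, but why does it take so much code? Isn't there a more elegant solution? As for part # 3, I have to be careful about the side effects.

What would you suggest for this search? Did I really find the shortest way? Is there some CSS select magic I could use?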

Answer 1

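In addition to its other arguments, find_all also accepts a function. Inside that function you can call find_parents() to check that the tag has no ancestor div with the same class. Use find_parents() rather than a check on the direct parent, because it inspects every ancestor, so you only get the outermost "wanted" divs.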

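from bs4 import BeautifulSoup

def top_most_wanted(tag):
    # Accept only <div class="wanted"> tags...
    if tag.name != "div" or "wanted" not in tag.get("class", []):
        return False
    # ...that have no <div class="wanted"> among their ancestors.
    return not tag.find_parents("div", class_="wanted")

soup = BeautifulSoup(text, 'html.parser')
print(soup.find_all(top_most_wanted))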
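On the "CSS select magic" part of the question, the same ancestor check can probably be written as a selector. This is only a sketch: it assumes a bs4 version whose select() is backed by Soup Sieve (4.7 or later) and relies on :not() accepting a complex selector. The html string and the outermost name are made up for the illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the document from the question (illustrative only).
html = """
<div>
    <div class="inner">
        <div class="wanted">
            I want this.
            <div class="wanted">I don't want that!</div>
        </div>
    </div>
    <div class="inner">
        <div class="wanted">I want this too.</div>
    </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Keep every div.wanted that is NOT a descendant of another div.wanted.
outermost = soup.select("div.wanted:not(div.wanted div.wanted)")
print(len(outermost))  # 2
```

Like the function filter, this leaves the tree untouched, but it still tests every div.wanted in the document, nested ones included.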

Answer 2

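I ended up doing the following, which has the advantage of not being destructive. I haven't had time to benchmark it, but I hope it avoids testing every nested element the way @Bitto-Bennichan's answer does; I'm not actually sure that it does. In any case, it does what I want: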

all_fetched = []
fetched = soup.find('div', class_='wanted')

while fetched is not None:
    all_fetched.append(fetched)
    try:
        last = list(fetched.descendants)[-1]
    except IndexError:
        break
    fetched = last.findNext('div', class_='wanted')
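If the goal is to be sure that nested elements are never tested, one alternative is to walk the tree by hand and stop descending as soon as a tag matches. The sketch below is not the approach above but a different technique, built on find_all(True, recursive=False), which returns only the direct child tags. The helper name iter_outermost and its parameters are hypothetical.

```python
from bs4 import BeautifulSoup

def iter_outermost(root, name, css_class):
    """Yield matching tags in document order, never descending into a match."""
    # recursive=False restricts the search to direct children of `root`.
    for child in root.find_all(True, recursive=False):
        if child.name == name and css_class in child.get("class", []):
            yield child  # match found: its subtree is skipped entirely
        else:
            yield from iter_outermost(child, name, css_class)

# Shortened copy of the document from the question (illustrative only).
html = """
<div>
    <div class="inner">
        <div class="wanted">
            I want this.
            <div class="wanted">I don't want that!</div>
        </div>
    </div>
    <div class="inner">
        <div class="wanted">I want this too.</div>
    </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
found = list(iter_outermost(soup, "div", "wanted"))
print(len(found))  # 2
```

Because it recurses once per nesting level, a very deeply nested document could hit Python's recursion limit; an explicit stack would avoid that at the cost of a few more lines.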
