Why does BeautifulSoup's find_all() method stop after an HTML comment tag?

Question

I am using BeautifulSoup to parse this website:

https://www.baseball-reference.com/postseason/1905_WS.shtml

The page contains the following element:

<div id="all_post_pitching_NYG" class="table_wrapper">

As a wrapper, this element should contain the following elements:

(1) <div class="section_heading assoc_post_pitching_NYG as_controls" id="post_pitching_NYG_sh">

(2) <div class="placeholder"></div>

(3) a very long HTML comment

(4) <div class="topscroll_div assoc_post_pitching_NYG">

(5) <div class="table_container is_setup" id="div_post_pitching_NYG">

(6) <div class="footer no_hide_long" id="tfooter_post_pitching_NYG">

I have been using:

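import requests
from bs4 import BeautifulSoup

url = "https://www.baseball-reference.com/postseason/1905_WS.shtml"
response = requests.get(url)
response.raise_for_status()

soup = BeautifulSoup(response.content, "html.parser")

# First wrapper div whose id starts with "all_post_pitching_"
pitching = soup.find_all("div", id=lambda x: x and x.startswith("all_post_pitching_"))[0]
# Iterating over a Tag yields its direct children (tags, strings, and comments)
for div in pitching:
    print(div)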

But it only prints up to the very long green HTML comment and never prints element (4) or anything after it. What am I doing wrong?

Answer 1

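Check the BeautifulSoup documentation on special strings:

"Tag, NavigableString, and BeautifulSoup cover almost everything you'll see in an HTML or XML file, but there are a few leftover bits. The main one you'll probably encounter is the comment."

In the HTML the server returns, the missing div elements are inside that comment, so BeautifulSoup stores them as one Comment string instead of parsing them as tags.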
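The effect of a comment on find_all() can be reproduced without fetching the page. The sketch below uses a made-up HTML snippet (the id all_demo and the markup are illustrative stand-ins, not copied from the site) and shows that a commented-out div appears as a single bs4.Comment child rather than as a tag.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical snippet standing in for the real page: the second div is commented out.
html = """
<div id="all_demo" class="table_wrapper">
  <div class="placeholder"></div>
  <!-- <div class="table_container" id="div_demo"><table></table></div> -->
</div>
"""

soup = BeautifulSoup(html, "html.parser")
wrapper = soup.find("div", id="all_demo")

# Only the div outside the comment is found as a tag.
found_classes = [div["class"] for div in wrapper.find_all("div")]
print(found_classes)

# The commented-out markup is a single Comment child of the wrapper.
comments = [child for child in wrapper.children if isinstance(child, Comment)]
print(len(comments), type(comments[0]).__name__)
```

Because Comment is a subclass of NavigableString, its contents are plain text as far as the parse tree is concerned, which is why the div inside it is never returned.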

A simple solution may be to strip the comment markers from the HTML string so that the commented-out markup becomes visible to BeautifulSoup:

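import requests
from bs4 import BeautifulSoup

url = 'https://www.baseball-reference.com/postseason/1905_WS.shtml'
# Strip the comment markers so the commented-out markup is parsed as regular tags
html = requests.get(url).text.replace('<!--', '').replace('-->', '')
soup = BeautifulSoup(html, 'html.parser')

pitching = soup.select('div[id^="all_post_pitching_"]')[0]

# Print every descendant div of the wrapper, numbered from 1
for e, div in enumerate(pitching.select('div'), 1):
    print(e, div)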

A more specific alternative is to use bs4.Comment.
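A minimal sketch of the bs4.Comment approach, assuming the missing markup sits inside a comment within the wrapper. The helper name parse_commented_markup and the sample HTML (including the XYZ ids) are hypothetical and chosen for illustration; on the real page you would build the soup from the downloaded response instead.

```python
from bs4 import BeautifulSoup, Comment


def parse_commented_markup(wrapper):
    """Re-parse every HTML comment inside `wrapper` and return the resulting soups.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    # string=<callable> matches text nodes; Comment is a NavigableString subclass.
    comments = wrapper.find_all(string=lambda text: isinstance(text, Comment))
    return [BeautifulSoup(comment, "html.parser") for comment in comments]


# Made-up sample standing in for the real page.
html = """
<div id="all_post_pitching_XYZ" class="table_wrapper">
  <div class="placeholder"></div>
  <!--
  <div class="table_container" id="div_post_pitching_XYZ">
    <table><tr><td>demo</td></tr></table>
  </div>
  -->
</div>
"""

soup = BeautifulSoup(html, "html.parser")
wrapper = soup.select_one('div[id^="all_post_pitching_"]')

# Each fragment is a separate parse tree built from one comment's text.
for fragment in parse_commented_markup(wrapper):
    for div in fragment.find_all("div"):
        print(div.get("id"))
```

Compared with replacing the comment markers in the whole document, this only re-parses comments inside the wrapper you care about and leaves genuine comments elsewhere on the page untouched.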
