Problem extracting JavaScript content when using requests-html

Question

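I'm currently building a web scraper, and for the most part it works well. I've been using Beautiful Soup to extract HTML content; to extract JavaScript-rendered content, I've just started using requests-html.

Unfortunately, I'm having trouble extracting JavaScript data from the site "https://goglobal.com/", specifically the counters that read "100+ countries", "2500+ employees", and "$3 billion saved...". The code does not extract the values correctly, although it seems to work fine on other sites that load dynamic content.

To isolate the problem, I wrote the following script. However, the values from the goglobal site still come out wrong.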

from requests_html import HTMLSession
import time
session = HTMLSession()
url = "https://goglobal.com/"
r = session.get(url)

r.html.render(wait=10)
time.sleep(10)
print(r.html.html)

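For reference, I located the values in the printed output by searching for "counter-number".

My questions are:

Why doesn't this content load correctly?
Is there a way to solve this while still using requests-html?
Could I solve this with selenium or playwright/scrapy instead?

I tried using the script above to identify and fix the problem.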

Answer 1

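requests-html is all but deprecated. For static HTML, use requests and BeautifulSoup; for hard-to-scrape or dynamic sites, use selenium or playwright.

In this case requests + bs4 is enough: the numbers you're looking for are already in the static HTML. Here's how to get them: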

import requests
from bs4 import BeautifulSoup

url = 'https://goglobal.com/'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

counters = {i.select_one('h3.title').text: i.select_one('span.counter-number').get('data-counter') for i in soup.select('div.counter-item')}
print(counters)

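The likely reason it doesn't work with requests-html is that you're looking in the wrong place. The values you're reading are animated, and the animation only starts once the element is visible (scrolled into view). The actual numbers, however, are stored in the data-counter attribute, so the equivalent requests-html code works even without rendering: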

from requests_html import HTMLSession

session = HTMLSession()
url = "https://goglobal.com/"
r = session.get(url)

counters = {i.find('h3.title', first=True).text: i.find('span.counter-number', first=True).attrs.get('data-counter') for i in r.html.find('div.counter-item')}
print(counters)

Again, requests-html is no longer maintained and I prefer requests and bs4, but either works in this case.
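To make the "wrong place" point concrete, here is a minimal offline sketch of the difference between an element's visible text and its data-counter attribute. The markup in SAMPLE_HTML is hypothetical, modelled only on the selectors used above (div.counter-item, span.counter-number, h3.title); the real page may be structured differently. It uses the standard library's html.parser instead of bs4 purely so that it runs without a network connection or third-party packages.

```python
from html.parser import HTMLParser

# Hypothetical markup modelled on the selectors used above; the real page may differ.
SAMPLE_HTML = """
<div class="counter-item">
  <span class="counter-number" data-counter="100">0</span>
  <h3 class="title">Countries</h3>
</div>
<div class="counter-item">
  <span class="counter-number" data-counter="2500">0</span>
  <h3 class="title">Employees</h3>
</div>
"""


class CounterParser(HTMLParser):
    """Collect title, visible text and data-counter for each counter item."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current = None  # dict for the counter item being read
        self._field = None    # text field we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "counter-item" in classes:
            self._current = {"title": "", "text": "", "data": None}
            self.rows.append(self._current)
        elif self._current is not None and tag == "span" and "counter-number" in classes:
            # The final value lives in the attribute, not in the element text.
            self._current["data"] = attrs.get("data-counter")
            self._field = "text"
        elif self._current is not None and tag == "h3" and "title" in classes:
            self._field = "title"

    def handle_endtag(self, tag):
        if tag in ("span", "h3"):
            self._field = None
        elif tag == "div":
            self._current = None

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] += data.strip()


def parse_counters(html):
    parser = CounterParser()
    parser.feed(html)
    return parser.rows


rows = parse_counters(SAMPLE_HTML)
for row in rows:
    print(row["title"], "| visible text:", row["text"], "| data-counter:", row["data"])
```

In this sample the visible text of each span is the animation's starting placeholder, while data-counter carries the final number, which is why reading the attribute works without rendering or waiting.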
