Unable to get article URLs with BeautifulSoup

Problem description · Votes: 1 · Answers: 2

I am using BeautifulSoup to get the URLs of the articles from this page: https://www.usnews.com/search?q=China+COVID-19&gsc.tab=0&gsc.q=China+COVID-19&gsc.page=1#gsc.tab=0&gsc.q=China%20COVID-19&gsc.page=1

I expect all of the article links to be stored in pagelinks, like this:

['https://www.usnews.com/news/health-news/articles/2020-04-08/chinas-controls-may-have-headed-off-700-000-covid-19-cases-study']
['https://www.usnews.com/news/health-news/articles/2020-03-18/trump-dubs-covid-19-chinese-virus-despite-hate-crime-risks']
.......

But the output gives me []. This is the code I used:

import requests
from bs4 import BeautifulSoup as bs

url = "https://www.usnews.com/search?q=China+COVID-19&gsc.tab=0&gsc.q=China+COVID-19&gsc.page=1#gsc.tab=0&gsc.q=China%20COVID-19&gsc.page=1"
response = requests.get(url)

#parse the content
soup_page = bs(response.text, 'lxml')
#select all the articles for a single page
#containers = soup_page.find_all("li", {'class': 'usn-site-search-item'})
containers = soup_page.find_all("div", {'class': 'usn-site-search-item-container'})
#scrape the links of the articles
pagelinks = []
for i in containers:
    url = i.find('a')
    pagelinks.append(url.get('href'))
print(pagelinks)

As you can see, I tried both containers, but neither gives me the expected result.

I would really appreciate it if someone could help me!

python web-scraping beautifulsoup request web-crawler
2 Answers
1 vote

The response you receive when calling the above URL does not contain the links you are looking for. You will probably need a headless browser such as Selenium or Puppeteer, since the page needs JavaScript to load its results.
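You can check this yourself with a quick diagnostic sketch: fetch the page with requests and look for the container class from the question in the raw HTML. Since the results are injected client-side, the class should not appear:

import requests

url = ("https://www.usnews.com/search?q=China+COVID-19"
       "&gsc.tab=0&gsc.q=China+COVID-19&gsc.page=1")
response = requests.get(url, headers={"user-agent": "Mozilla/5.0"})

# The results are rendered by JavaScript after page load, so the markup
# BeautifulSoup was asked to find is absent from the raw HTML.
print("usn-site-search-item-container" in response.text)  # expected: False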

But you can get the results you want by calling the programmable search engine URL directly.

Here is a small piece of Python code:

import requests
import json

headers = {
    "user-agent": "<User Agent Goes Here>"  # fill in a real browser user-agent string
}

# url is the programmable search engine endpoint mentioned above
res = requests.get(url, headers=headers)

# The endpoint returns a JSONP-style response: the JSON payload is wrapped
# in a callback whose name contains "api2768", so split on that and strip
# the surrounding parenthesis and trailing characters before parsing.
json_res = res.text.split("api2768")[1][1:-2]

urls = []
for url in json.loads(json_res)['results']:
    urls.append(url['url'])

P.S.: Take care of the cookies.
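One way to take care of the cookies is a requests.Session, which stores cookies from each response and sends them back automatically. A minimal sketch, reusing the headers and endpoint url from above; the warm-up request to the search page is my own assumption, not part of the original answer:

import requests

session = requests.Session()  # persists cookies across requests
session.headers.update(headers)

# Hypothetical warm-up: visiting the search page first lets the session
# collect any cookies the endpoint may expect on subsequent calls.
session.get("https://www.usnews.com/search")

res = session.get(url)  # url: the programmable search engine endpoint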


1 vote

The page is rendered with JavaScript. You need to work around that, and one way is to use Selenium.

from selenium import webdriver
from bs4 import BeautifulSoup

url = "https://www.usnews.com/search?q=China+COVID-19&gsc.tab=0&gsc.q=China+COVID-19&gsc.page=1#gsc.tab=0&gsc.q=China%20COVID-19&gsc.page=1"
driver = webdriver.Chrome()
driver.get(url)

soup = BeautifulSoup(driver.page_source,'lxml')

#the rest of your code here

driver.quit()
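For completeness, here is a minimal sketch of what "the rest of your code" could look like, reusing the container class from the question and adding an explicit wait so the JavaScript-rendered results are present before parsing (the 15-second timeout is an arbitrary choice):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

url = "https://www.usnews.com/search?q=China+COVID-19&gsc.tab=0&gsc.q=China+COVID-19&gsc.page=1"

driver = webdriver.Chrome()
try:
    driver.get(url)
    # Wait until at least one search result container is in the DOM
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located(
            (By.CLASS_NAME, "usn-site-search-item-container")))
    soup = BeautifulSoup(driver.page_source, 'lxml')
    containers = soup.find_all("div", {'class': 'usn-site-search-item-container'})
    pagelinks = [c.find('a').get('href') for c in containers]
    print(pagelinks)
finally:
    driver.quit()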

If you don't have the Selenium package installed, this is the command to run in a terminal:

pip install selenium
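Note that Selenium also needs a browser driver: recent Selenium releases (4.6+) download a matching ChromeDriver automatically via Selenium Manager, while older releases expect chromedriver on your PATH.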