I am using BeautifulSoup to scrape the article URLs from this page: https://www.usnews.com/search?q=China+COVID-19&gsc.tab=0&gsc.q=China+COVID-19&gsc.page=1#gsc.tab=0&gsc.q=China%20COVID-19&gsc.page=1
I want all the article links to be stored in pagelinks, like this:
['https://www.usnews.com/news/health-news/articles/2020-04-08/chinas-controls-may-have-headed-off-700-000-covid-19-cases-study',
 'https://www.usnews.com/news/health-news/articles/2020-03-18/trump-dubs-covid-19-chinese-virus-despite-hate-crime-risks',
 ...]
But the output gives me an empty list, []. This is the code I used:
import requests
from bs4 import BeautifulSoup as bs

#fetch the search page
response = requests.get("https://www.usnews.com/search?q=China+COVID-19&gsc.tab=0&gsc.q=China+COVID-19&gsc.page=1#gsc.tab=0&gsc.q=China%20COVID-19&gsc.page=1")

#parse the content
soup_page = bs(response.text, 'lxml')

#select all the articles for a single page
#containers = soup_page.findAll("li", {'class': 'usn-site-search-item'})
containers = soup_page.findAll("div", {'class': 'usn-site-search-item-container'})

#scrape the links of the articles
pagelinks = []
for i in containers:
    url = i.find('a')
    pagelinks.append(url.get('href'))
print(pagelinks)
As you can see, I tried both containers, but neither gave me the expected result.
I would really appreciate it if someone could help me!
The response you receive when calling the above URL does not contain the links you are looking for. Since the page needs JavaScript to load the results, you may need a headless browser such as Selenium or Puppeteer.
But you can get the results you want by using this programmable search engine URL.
Here is a small Python snippet:
import requests
import json

#the programmable search engine URL goes here (see the link above)
url = "<Programmable Search Engine URL Goes Here>"

headers = {
    "user-agent": "<User Agent Goes Here>"
}
res = requests.get(url, headers=headers)

#the response is JSON wrapped in a callback named "api2768";
#split on the callback name and strip the wrapping "(" and ");"
json_res = res.text.split("api2768")[1][1:-2]

urls = []
for url in json.loads(json_res)['results']:
    urls.append(url['url'])
P.S.: remember to take care of the cookies.
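As a minimal sketch of what handling the cookies could look like with requests, assuming you copy the real values from your browser's developer tools (the cookie name and value below are hypothetical placeholders):

import requests

#hypothetical placeholders: copy the real cookie(s) from the browser's
#network tab, along with the request URL and user agent
url = "<Programmable Search Engine URL Goes Here>"
headers = {"user-agent": "<User Agent Goes Here>"}
cookies = {"<Cookie Name>": "<Cookie Value>"}

res = requests.get(url, headers=headers, cookies=cookies)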
The webpage is loaded with JavaScript. You need a way around that, and one option is to use Selenium.
from selenium import webdriver
from bs4 import BeautifulSoup

url = "https://www.usnews.com/search?q=China+COVID-19&gsc.tab=0&gsc.q=China+COVID-19&gsc.page=1#gsc.tab=0&gsc.q=China%20COVID-19&gsc.page=1"

#let a real browser load the page so the JavaScript can run
driver = webdriver.Chrome()
driver.get(url)

#parse the fully rendered HTML instead of the raw response
soup = BeautifulSoup(driver.page_source, 'lxml')
#the rest of your code here
driver.quit()
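For completeness, here is a sketch that combines the Selenium approach with the parsing loop from the question. The explicit wait is an assumption on my part, since the search results may take a moment to render:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

url = "https://www.usnews.com/search?q=China+COVID-19&gsc.tab=0&gsc.q=China+COVID-19&gsc.page=1#gsc.tab=0&gsc.q=China%20COVID-19&gsc.page=1"

driver = webdriver.Chrome()
driver.get(url)

#wait until at least one search result container has been rendered
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "usn-site-search-item-container"))
)

#grab the rendered HTML before closing the browser
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

#same selector as in the question, now run against the rendered HTML
containers = soup.findAll("div", {'class': 'usn-site-search-item-container'})
pagelinks = [c.find('a').get('href') for c in containers]
print(pagelinks)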
If you don't have the Selenium package installed, this is the command to run in the terminal:
pip install selenium