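I'm trying to scrape Google. I have already scraped craigslist successfully with this approach, but for some reason it doesn't work for Google (and yes, I did update the class names and so on). This is what I'm trying to scrape:
I want to scrape the website descriptions: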
from selenium import webdriver

path = r"C:\Users\Skid\Desktop\chromedriver.exe"
driver = webdriver.Chrome(path)
driver.get("https://www.google.com/#q=python+webscape+google")

posts = driver.find_elements_by_class_name("r")
for post in posts:
    print(post.text)
Solved: add a short delay before scraping (import time, then call time.sleep(2) after driver.get(...)).
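The fixed delay presumably works because the results are rendered by JavaScript after the page loads, so the lookup ran before any matching elements existed. A minimal sketch of a more robust alternative is to poll until the lookup returns something, with an upper time limit. The helper name `wait_until` and its parameters are hypothetical, not part of Selenium; Selenium itself ships `WebDriverWait` for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Poll condition() until it returns a truthy value or the timeout expires.

    Hypothetical helper for illustration. Returns the truthy value, or raises
    TimeoutError if the condition is never met within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

With the script from the question, this could be used as `posts = wait_until(lambda: driver.find_elements_by_class_name("r"))`, which returns as soon as at least one element is found instead of always sleeping for two seconds.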
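You can also use the BeautifulSoup web scraping library to scrape website descriptions from Google Search results.
Learn more about what CSS selectors are and the drawbacks of using them.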
from bs4 import BeautifulSoup
import requests, lxml, json

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
}

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
# these URL params are taken from the actual Google search URL
# and transformed to a more readable format
params = {
    "q": "python web scrape google",  # query
    "gl": "us",                       # country to search from
    "hl": "en",                       # language
}

html = requests.get("https://www.google.com/search", headers=headers, params=params, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

website_description_data = []

# each ".tF2Cxc" element is one organic search result
for result in soup.select(".tF2Cxc"):
    website_name = result.select_one(".yuRUbf a")["href"]
    description = result.select_one(".lEBKkf").text

    website_description_data.append({
        "website_name": website_name,
        "description": description
    })

print(json.dumps(website_description_data, indent=2))
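One thing to watch for in the loop above: `select_one()` returns `None` when a selector matches nothing, so `.text` or `["href"]` would raise if a result has no description or if Google changes its class names. A minimal sketch of a more defensive version follows. The function name `extract_results` and the `SAMPLE_HTML` snippet are made up for illustration (the sample only mimics the class names used above, it is not real Google markup), and it uses the built-in `html.parser` so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names used in the script above.
SAMPLE_HTML = """
<div class="tF2Cxc">
  <div class="yuRUbf"><a href="https://example.com/a">A</a></div>
  <div class="lEBKkf">First description</div>
</div>
<div class="tF2Cxc">
  <div class="yuRUbf"><a href="https://example.com/b">B</a></div>
</div>
"""


def extract_results(html):
    """Collect link/description pairs, tolerating missing elements."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for result in soup.select(".tF2Cxc"):
        link = result.select_one(".yuRUbf a")
        snippet = result.select_one(".lEBKkf")
        # Skip results without a usable link instead of raising.
        if link is None or not link.has_attr("href"):
            continue
        results.append({
            "website_name": link["href"],
            # A missing description becomes None rather than an AttributeError.
            "description": snippet.get_text(strip=True) if snippet is not None else None,
        })
    return results


results = extract_results(SAMPLE_HTML)
```

Here the second sample result has no description element, so it is kept with `"description": None` rather than crashing the whole run.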
Example output:
[
  {
    "website_name": "https://practicaldatascience.co.uk/data-science/how-to-scrape-google-search-results-using-python",
    "description": "Mar 13, 2021 \u2014 First, we're using urllib.parse.quote_plus() to URL encode our search query. This will add + characters where spaces sit and ensure that the\u00a0..."
  },
  {
    "website_name": "https://stackoverflow.com/questions/38619478/google-search-web-scraping-with-python",
    "description": "You can always directly scrape Google results. To do this, you can use the URL https://google.com/search?q=<Query> this will return the top\u00a0..."
  }
  # ...
]
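As a side note on the `params` dictionary in the script: it ends up as an ordinary query string on the search URL. The sketch below reproduces that step with the standard library's `urlencode`, only to show what the final URL looks like. It assumes `requests` encodes the parameters in a comparable way; the variable names are illustrative.

```python
from urllib.parse import urlencode

# Same parameters as in the requests example.
params = {
    "q": "python web scrape google",  # query
    "gl": "us",                       # country to search from
    "hl": "en",                       # language
}

# urlencode() replaces spaces with "+" and joins key=value pairs with "&".
query_string = urlencode(params)
url = "https://www.google.com/search?" + query_string
print(url)
# https://www.google.com/search?q=python+web+scrape+google&gl=us&hl=en
```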