I have the following search query in Google:
https://www.google.com/search?q=bonpland&tbm=isch&hl=en-US&tbs=qdr:w
This search returns all images found within the last week for the search term bonpland. I now want to pull all of that HTML, or at least the image links and image redirects, into my Python console using the requests library. If I open this URL in a browser, it initially shows about 108 images. If I click on one of the images, more are loaded, and as I scroll down, more and more keep loading until roughly 450 images are shown, at which point a "Show more results" button appears. Clicking it loads another ~480 images, so say about a thousand images can be found through this query.
However, when I run the GET request in Python as shown below, only the original 49 images are returned:
import requests
from bs4 import BeautifulSoup

# Browser-like headers so Google serves the normal HTML results page
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Connection': 'keep-alive',
    'DNT': '1',
    'Accept-Language': 'en-US,en;q=0.5',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36 OPR/55.0.2994.37',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
}
response = requests.get(
    'https://www.google.com/search?q=bonpland&tbm=isch&hl=en-US&tbs=qdr:w',
    headers=headers,
)
soup = BeautifulSoup(response.text, 'html.parser')
soup
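For reference, this is how I count the results in the static HTML (a quick check that assumes every result appears as an img tag):
# Rough count of result thumbnails present in the static HTML
print(len(soup.find_all('img')))  # ~49 in my case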
Is there any way to modify the URL so that it returns all the links, or to modify the code so that all results can be retrieved with this library? I have tried changing the URL in several ways without success.
I also tried scrolling down while watching the network tab. The loading appears to be driven by a POST request that returns JSON. I can recreate the request in Python, but I cannot decode the JSON response, nor can I work out the logic to generate these requests myself:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36 OPR/55.0.2994.37',
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'https://www.google.com/',
    'X-Same-Domain': '1',
    'x-goog-ext-190139975-jspb': '["NL","ZZ","KgKka8TAAAFqmWCfx71ZfQ=="]',
    'Content-Type': 'application/x-www-form-urlencoded;charset=utf-8',
    'Origin': 'https://www.google.com',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
}
params = {
    'rpcids': 'HoAMBc',
    'source-path': '/search',
    'f.sid': '2747221314367002709',
    'bl': 'boq_visualfrontendserver_20230813.07_p1',
    'hl': 'en-US',
    'authuser': '0',
    'soc-app': '162',
    'soc-platform': '1',
    'soc-device': '1',
    '_reqid': '303769',
    'rt': 'c',
}
data = 'f.req=%5B%5B%5B%22HoAMBc%22%2C%22%5Bnull%2Cnull%2C%5B3%2Cnull%2C4294967246%2C1%2C3766%2C%5B%5B%5C%22CYxr5OmPOtywOM%5C%22%2C259%2C194%2C536870912%5D%2C%5B%5C%222BSR5sBuDzSHqM%5C%22%2C306%2C165%2C0%5D%2C%5B%5C%22ZA_122FexBY1nM%5C%22%2C268%2C188%2C34340864%5D%2C%5B%5C%22-3b9ovO7KQ_dYM%5C%22%2C275%2C183%2C10485760%5D%2C%5B%5C%220iXGzZD6KO-t_M%5C%22%2C275%2C183%2C444596224%5D%2C%5B%5C%22hO_2vHlaM5M5mM%5C%22%2C277%2C182%2C0%5D%2C%5B%5C%22eLrlE2L34a8f8M%5C%22%2C323%2C156%2C0%5D%2C%5B%5C%22Ahf1fxWknMx_AM%5C%22%2C259%2C195%2C956301312%5D%2C%5B%5C%22Nv1VenvVudaghM%5C%22%2C261%2C193%2C134217728%5D%2C%5B%5C%22Q9QbJWUxHV4hnM%5C%22%2C171%2C295%2C179568640%5D%2C%5B%5C%22FOerVX6mz_YP4M%5C%22%2C225%2C225%2C-2147483648%5D%2C%5B%5C%22kCWjgzlqhj6N8M%5C%22%2C225%2C225%2C-1257766912%5D%5D%2C%5B%5D%2C%5B%5D%2Cnull%2Cnull%2Cnull%2C0%5D%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2C%5B%5C%22bonpland%5C%22%2C%5C%22en-US%5C%22%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2C%5C%22qdr%3Aw%5C%22%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2C%5B%5D%5D%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2C%5Bnull%2C%5C%22CAM%3D%5C%22%2C%5C%22GKwCIAA%3D%5C%22%5D%5D%22%2Cnull%2C%22generic%22%5D%5D%5D&at=AAuQa1qdstatNh2yQw-sJIcvETC_%3A1692054165315&'
response = requests.post(
    'https://www.google.com/_/VisualFrontendUi/data/batchexecute',
    params=params,
    headers=headers,
    data=data,
)
response.content
This returns:
b')]}\'\n\n128460\n[["wrb.fr","HoAMBc","[null,[],null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,[],null,null,null,null,false,null,null,null,null,null,null,null,null,null,null,null,null,null,[null,[[\\"/search?q\\\\u003dbonpland\\\\u0026source\\\\u003dlmns\\",null,null,\\"All\\",false,null,null,null,null,\\"WEB\\",[0],null,null,0],[\\"/search?q\\\\u003dbonpland\\\\u0026source\\\\u003dlmns\\\\u0026tbm\\\\u003disch\\",null,null,\\"Images\\",true,null,null,null,null,\\"IMAGES\\",[6],null,null,6]],[[\\"//maps.google.com/maps?q\\\\u003dbonpland\\\\u0026source\\\\u003dlmns\\\\u0026entry\\\\u003dmt\\",null,null,\\"Maps\\",false,null,null,null,null,\\"MAPS\\",[8],null,null,8],[\\"/search?q\\\\u003dbonpland\\\\u0026source\\\\u003dlmns\\\\u0026tbm\\\\u003dvid\\",null,null,\\"Videos\\",false,null,null,null,null,\\"VIDEOS\\",[13],null,null,13],[\\"/search?q\\\\u003dbonpland\\\\u0026source\\\\u003dlmns\\\\u0026tbm\\\\u003dnws\\",null,null,\\"News\\",false,null,null,null,null,\\"NEWS\\",[10],null,null,10],[\\"/search?q\\\\u003dbonpland\\\\u0026source\\\\u003dlmns\\\\u0026tbm\\\\u003dbks\\",null,null,\\"Books\\",false,null,null,null,null,\\"BOOKS\\",[2],null,null,2],[\\"/travel/flights?q\\\\u003dbonpland\\\\u0026source\\\\u003dlmns\\\\u0026tbm\\\\u003dflm\\",null,null,\\"Flights\\",false,null,null,null,null,\\"FLIGHTS\\",[20],null,null,20],[\\"/search?q\\\\u003dbonpland\\\\u0026source\\\\u003dlmns\\\\u0026tbm\\\\u003dfin\\",null,null,\\"Finance\\",false,null,null,null,null,\\"FINANCE\\",[22],null,null,22]]],0,null,null,null,null,null,null,null,null,null,true,null,null,null,[[false],false,null,null,null,null,[true,false],true,null,0.564668],false,[[{\\"444381080\\":[]}],[[[[{\\"444383007\\":[7,null,null,null,null,null,null,\\"b-GRID_STATE0\\",-1,null,null,null,[\\"GRID_STATE0\\",null,null,null,null,null,1,[],null,null,null,[4,null,4294966996,1,3766,[[\\"mroB5K80ptCTuM\\",259,194,16777216],[\\"c1GanxYo-04FqM\\",299,168,117440512],[\\"CwPZkiIyxII1IM\\",259,194,524288],[\\"K-EOlTyweDVg8M\\",275,183,16777216],[\\"7aHfaSFG7gX8iM\\",225,225,-1874853888],[\\"cgME6WrQ91TqbM\\",264,191,262144],[\\"zYwz1EEHHPaiqM\\",225,225,-1090519040],[\\"eTmT6kDpI3uIQM\\",223,226,0],[\\"Q0rGwJ3za5hy4M\\",276,183,50593792],[\\"eiNmj70lbdjNUM\\",260,194,17563648],[\\"7g0dolqpZILMhM\\",300,168,1040187392],[\\"UfVRJEEcMgRyyM\\",183,275,-1074003968],[\\"7v0HRR8xWKSvLM\\",215,234,2097152],[\\"wpRPgCuU5zJAsM\\",300,168,-1788084224],[\\"ttZT8wyA9AIcpM\\",193,261,17039360]],null,null,null,null,null,0],null,null,null,null,[true,null,null,\\"CAQ\\\\u003d\\",\\"GJADIAA\\\\u003d\\"],null,null,null,null,null,null,null,null,null,20],[[1692054866047747,117621638,1745567696],null,null,null,null,[[1]]]]}],[[[[{\\"444383007\\":[1,[0,\\"cjxY8tC9TPK3gM\\",[\\"https://encrypted-tbn0.gstatic.com/images?q\\\\u003dtbn:ANd9GcQkdUVoe0SeMM_uE_oUKymwnw4XFeg5IQ_a0xmxkByykYSCPGI1icA-E1WsxqOzfqSEvb8\\\\u0026usqp\\\\u003dCAU\\",159,316],[\\"https://www.rematadores.com/rematadores/remates/2023/27986_5.jpg\\",576,1140],null,0,\\"rgb(240,240,221)\\",null,false,null,null,null,null,null,null,null,null,null,null,null,null,false,false,null,false,{\\"2001\\":[null,null,null,0,0,0,0,true,false],\\"2003\\":[\\"6 days ago\\",\\"N-SEEMfLWqwvkM
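The furthest I get with this response is peeling off the batchexecute envelope. Below is a minimal sketch, assuming the usual layout (a )]}' anti-XSSI prefix, chunk-length lines between JSON array chunks, and wrb.fr frames whose third element is a doubly JSON-encoded payload); parse_batchexecute is my own helper name, and the format is undocumented and may change:
import json

def parse_batchexecute(raw: bytes):
    """Peel the batchexecute envelope: skip the )]}' prefix and the
    chunk-length lines, then JSON-decode each array chunk."""
    text = raw.decode('utf-8')
    decoder = json.JSONDecoder()
    frames, idx = [], 0
    while True:
        start = text.find('[[', idx)  # next array chunk
        if start == -1:
            break
        try:
            chunk, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            idx = start + 2
            continue
        frames.extend(chunk)
        idx = end

    payloads = []
    for frame in frames:
        # RPC results look like ["wrb.fr", "<rpcid>", "<json payload string>", ...]
        if frame and frame[0] == 'wrb.fr' and len(frame) > 2 and frame[2]:
            payloads.append(json.loads(frame[2]))
    return payloads

payloads = parse_batchexecute(response.content)
That at least yields Python lists instead of raw bytes, but the logic for generating the follow-up requests is still the open problem.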
With the approaches below, you should be able to retrieve the image URLs from Google's image search results.
The first approach only extracts the image URLs that are embedded directly in the initial HTML. It will not pick up dynamically loaded images or the "Show more results" batches.
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
}
query = 'spaghetti'
url = f"https://www.google.com/search?q={query}&tbm=isch&hl=en-US&tbs=qdr:w"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

# Collect the src of every img tag embedded in the static HTML
image_urls = []
for img in soup.find_all('img'):
    try:
        image_urls.append(img['src'])
    except KeyError:
        pass
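If you then want the files on disk, here is a minimal follow-up sketch (the thumbs/ directory name is arbitrary; note that some src values are inline base64 data: URIs rather than fetchable links, so they are skipped):
import os

os.makedirs('thumbs', exist_ok=True)
for i, u in enumerate(image_urls):
    if u.startswith('http'):  # skip inline data: URIs
        with open(os.path.join('thumbs', f'{i}.jpg'), 'wb') as f:
            f.write(requests.get(u, headers=headers).content)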
If you want to retrieve all the images, including the dynamically loaded ones and the "Show more results" batches, you will need a different approach. One option is to use a browser automation tool such as Selenium to scroll the page and load everything. Here is an example using Selenium.
This approach requires Selenium and a WebDriver (e.g. ChromeDriver). You can install Selenium with pip install selenium and download the WebDriver that matches your browser. If you use a different browser, replace webdriver.Chrome() with webdriver.Firefox() or webdriver.Edge().
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Set up the WebDriver (assuming you have chromedriver on your PATH)
driver = webdriver.Chrome()
driver.get(url)  # the same search URL built in the snippet above

# Scroll to the bottom of the page repeatedly to trigger lazy loading
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the next batch of thumbnails time to load
    try:
        # Click the "Show more results" button whenever it is present
        driver.find_element(By.CSS_SELECTOR, '.mye4qd').click()
    except Exception:
        pass
    # Check if all images are loaded
    if len(driver.find_elements(By.CSS_SELECTOR, '.rg_i')) >= 1000:  # maximum limit
        break

# Extract the thumbnail URLs
image_urls = []
for img in driver.find_elements(By.CSS_SELECTOR, '.rg_i'):
    image_urls.append(img.get_attribute('src'))

# Close the browser
driver.quit()
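One caveat worth adding: many of the .rg_i thumbnails come back as base64 data: URIs rather than http(s) links, so you may want to split the results afterwards (a small post-processing sketch):
# Separate fetchable links from inline base64 thumbnails
http_urls = [u for u in image_urls if u and u.startswith('http')]
data_uris = [u for u in image_urls if u and u.startswith('data:')]
print(len(http_urls), 'direct links,', len(data_uris), 'inline thumbnails')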