How do I scrape Google search results (at scale)?

Question

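As far as I know, Google doesn't like being scraped. A month ago, when I was getting ready to start this project, I found a similar question on Stack Overflow (I can't find it now). Someone said that using a proxy is one approach, so I got ProxyMesh. I assume I need to send requests at a "random rate" to look human; if I spam a request every 0.1 seconds, I'll probably get blocked. I also read that I need to use a browser, or something with JavaScript enabled, again to mimic human behavior. To get a lot of results quickly I'd like to run multiple bots at once, but I don't know how to do that yet (I only recently started learning).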

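So my question is: how should I go about this?

The parts I'm unsure about:

1. A browser for scraping? Do I need one? Is it efficient? (So far I've only used bs4 and requests.)

2. Will a timer make a difference? Or is Google sophisticated enough to notice that too?

3. Running multiple bots. I get 100 or 220 IP addresses per day from the proxy (with a data cap, I think). I figure that if I run multiple bots and give each one a unique IP address, I can scrape results quickly? But I don't know how to run multiple bots or anything like that...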

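How I plan to use the bots (so you can better understand what I mean):

Use case 1:

Search for a specific website.
Open that website and get the data I need.
Repeat. (I have a lot of websites to search for, say a million.)

Use case 2:

Search for a broad keyword.
Open the first 50-100 websites in the search results and get the data I need.
Repeat.

I want to do both.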

Answer 1

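Try Node.js with puppeteer, and run this code as a starting point. It gives you the links for a given Google search. Using different IP addresses is great, but also try turning off location services, clearing your cache regularly, and clearing anything else Google could use to uniquely identify you. Spreading the work across multiple IP addresses saves time but costs processing power, so there is always a trade-off. Good luck.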

const puppeteer = require('puppeteer');

// Opens one page of Google results for `search` and prints the result links.
// `pageIndex` is zero-based; each page holds 10 results.
async function googleBot(search, pageIndex) {
    const browser = await puppeteer.launch({
        headless: false,
        defaultViewport: null,
        args: ['--start-fullscreen'],
        timeout: 1000 * 30
    });
    try {
        const pages = await browser.pages();
        const page = pages[0];

        // log page crashes; pending calls then reject and the browser is closed in `finally`
        page.on('error', error => {
            console.log('FORCED CLOSE DUE TO ERROR: ' + error);
        });

        // dismiss any alerts
        page.on('dialog', async dialog => {
            await dialog.dismiss();
        });

        // URL-encode the query so spaces and "&" do not break the URL
        const url = 'https://www.google.com/search?q=' + encodeURIComponent(search) + '&start=' + pageIndex * 10;

        // ".LC20lb" was the class of result titles when this was written; expect it to change
        const links = await runOnURLAndReturn(page, url, '.LC20lb', () => {
            return new Promise(resolve => {
                // wait a random 3-6 seconds before reading the results
                setTimeout(() => {
                    const links = [];
                    const searchResults = document.querySelectorAll('.LC20lb');
                    for (const result of searchResults) {
                        if (result.parentNode.tagName === 'A') {
                            links.push(result.parentNode.href);
                        }
                    }
                    resolve(links);
                }, 3000 + (Math.random() * 3000));
            });
        });

        console.log(links.join('\n'));
    }
    catch (error) {
        console.log('FORCED CLOSE DUE TO ERROR: ' + error);
    }
    finally {
        await browser.close();
    }
}

// Loads `url`, waits for `loadedSelector`, then runs `fn` inside the page.
// Returns an empty list on any failure (timeout, block page, navigation error).
async function runOnURLAndReturn(page, url, loadedSelector, fn) {
    try {
        await page.goto(url, {timeout: 1000 * 30});
        await page.waitForSelector(loadedSelector, {timeout: 1000 * 30});
        return await page.evaluate(fn);
    }
    catch (error) {
        return [];
    }
}

googleBot('something', 0).catch(error => console.log('LAUNCH FAILED: ' + error));
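For anyone following along in Python rather than Node.js, here is a minimal sketch of the two ideas the code above relies on: a paginated search URL (`start` = page index × 10) and a randomized pause between requests. The function names `build_search_url` and `human_like_delay` are made up for illustration, and the page size of 10 is simply taken from the code above; this is not a complete scraper.

```python
import random
from urllib.parse import urlencode

RESULTS_PER_PAGE = 10  # assumption: same page size the JavaScript above uses


def build_search_url(query: str, page_index: int) -> str:
    """Return the results URL for a zero-based page index, with the query URL-encoded."""
    params = {"q": query, "start": page_index * RESULTS_PER_PAGE}
    return "https://www.google.com/search?" + urlencode(params)


def human_like_delay(base: float = 3.0, jitter: float = 3.0) -> float:
    """Pick a pause between base and base + jitter seconds (3-6 s by default)."""
    return base + random.random() * jitter


if __name__ == "__main__":
    print(build_search_url("web scraping & proxies", 2))
    print(f"sleep for {human_like_delay():.2f}s before the next request")
```

Pass the delay to `time.sleep` between requests; whether a random pause is enough to avoid blocking is not something this sketch can guarantee.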

Answer 2

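A browser for scraping?

It depends. If you need to do something highly dynamic, then yes. For Google Search, it isn't necessary. In some cases you need to imitate the TLS and HTTP handshakes to pass as a browser. For those cases you can use curl-impersonate or selenium-stealth, which can bypass most CAPTCHAs.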

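Running multiple bots.

Residential proxies are ideal. The best scenario is to combine them with a CAPTCHA solver (for example, 2captcha). Another approach is to rotate user agents, combined with rotating proxies and a CAPTCHA solver. Sometimes there is no need to rotate user agents if you pass additional request headers when making the request. If you need more information, read the "Reducing the chance of being blocked while web scraping" blog post.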
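A rough sketch of the rotation idea, assuming you already have a list of proxy endpoints and a few user-agent strings. The proxy URLs below are placeholders, `fetch` is a stub standing in for a real HTTP call, and the names `plan_requests` and `run_all` are made up for illustration. It pairs each query with the next proxy and user agent in round-robin order, then runs the jobs on a small thread pool.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

# Hypothetical values for illustration; substitute your provider's endpoints.
PROXIES = ["http://proxy-1.example:8080", "http://proxy-2.example:8080"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def plan_requests(queries):
    """Pair each query with the next proxy and user agent, round-robin."""
    proxy_cycle = itertools.cycle(PROXIES)
    agent_cycle = itertools.cycle(USER_AGENTS)
    return [
        {
            "query": query,
            "proxy": next(proxy_cycle),
            "headers": {
                "User-Agent": next(agent_cycle),
                "Accept-Language": "en-US,en;q=0.9",  # example of an extra header
            },
        }
        for query in queries
    ]


def fetch(job):
    """Stub: replace with a real HTTP call that uses job['proxy'] and job['headers']."""
    return f"{job['query']} via {job['proxy']}"


def run_all(queries, max_workers=2):
    """Run every planned job on a thread pool; results keep the input order."""
    jobs = plan_requests(queries)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, jobs))


if __name__ == "__main__":
    for line in run_all(["burly", "silk", "monkey"]):
        print(line)
```

Threads suit this kind of work because each job mostly waits on the network; the number of workers would normally be capped by how many distinct IP addresses you have.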

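If you don't want to use any API, have a look at my recent Stack Overflow answer, where I show how to parse Google search data from all pages with pagination using bs4.

However, the most reliable solution to your problem is most likely a third-party API, which removes the need to keep maintaining a parser yourself (CSS selector changes and so on).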

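SerpApi has a Google Search Engine Results API. It is a paid API with a free plan. The advantage is that it bypasses blocks (including CAPTCHAs) from Google and other search engines, and you don't have to build a parser from scratch and maintain it. Below is an async example that submits several requests at once and returns JSON data. This example sends only five queries. Here is a time comparison for 500 requests (not multithreaded):

Sync: 13m 43.311s
Async: 5m 3.663s

Code and example in the online IDE: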

from serpapi import GoogleSearch
from queue import Queue
import os, re, json

queries = [
    'burly',     # this is an example,
    'silk',      # the number of search queries can be much more.
    'monkey',
    'abortive',
    'hot'
]

search_queue = Queue()

# submit every query without waiting for its results
for query in queries:
    params = {
        'api_key': os.getenv('API_KEY'),  # serpapi api key
        'engine': 'google',               # search engine
        'device': 'desktop',              # device type
        'q': query,                       # search query
        'async': True                     # async batch requests
    }

    search = GoogleSearch(params)         # where data extraction happens
    results = search.get_dict()           # JSON -> Python dict

    if 'error' in results:
        print(results['error'])
        break

    print(f"add search to the queue with ID: {results['search_metadata']['id']}")
    search_queue.put(results)

data = []

# collect results from the archive, requeueing searches that are not finished yet
while not search_queue.empty():
    result = search_queue.get()
    search_id = result['search_metadata']['id']

    print(f'Get search from archive: {search_id}')
    search_archived = search.get_search_archive(search_id)

    print(f"Search ID: {search_id}, Status: {search_archived['search_metadata']['status']}")

    if re.search(r'Cached|Success', search_archived['search_metadata']['status']):  # processing, success, error
        for result in search_archived.get('organic_results', []):
            data.append({
                'title': result.get('title'),
                'link': result.get('link'),
                'displayed_link': result.get('displayed_link'),
                'snippet': result.get('snippet')
            })
    else:
        print(f'Requeue search: {search_id}')
        search_queue.put(result)

print(json.dumps(data, indent=2, ensure_ascii=False))
print('all searches completed')

Example output:

[
  {
    "title": "Burly Definition & Meaning - Merriam-Webster",
    "link": "https://www.merriam-webster.com/dictionary/burly",
    "displayed_link": "https://www.merriam-webster.com › dictionary › burly",
    "snippet": "The meaning of BURLY is strongly and heavily built : husky. How to use burly in a sentence."
  },
  {
    "title": "Burly Definition & Meaning - Dictionary.com",
    "link": "https://www.dictionary.com/browse/burly",
    "displayed_link": "https://www.dictionary.com › browse › burly",
    "snippet": "adjective, bur·li·er, bur·li·est. large in bodily size; stout; sturdy."
  },
  {
    "title": "BURLY Brewing Company | Castle Rock Brewery | Castle ...",
    "link": "http://www.burlybrewing.com/",
    "displayed_link": "http://www.burlybrewing.com",
    "snippet": "Welcome to BURLY Brewing, where craft beer is our specialty. Check out our live events, rotating food trucks, and fun for everyone."
  },
  {
    "title": "Burly - Definition, Meaning & Synonyms - Vocabulary.com",
    "link": "https://www.vocabulary.com/dictionary/burly",
    "displayed_link": "https://www.vocabulary.com › dictionary › burly",
    "snippet": "The adjective burly describes someone (usually male) who is muscular and beefy. Types of people that you might describe as burly?"
  },
  # ...
]
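One thing to be aware of in the script above: an unfinished search is put back on the queue and polled again immediately. A minimal sketch of the same requeue pattern with a pause between rounds and an upper bound on retries might look like this. `drain` and its `get_status` callback are hypothetical names, the callback stands in for whatever returns a search's status, and the finished statuses `Cached` and `Success` are taken from the regular expression in the script (matched exactly here rather than as substrings).

```python
import time
from collections import deque


def drain(search_ids, get_status, pause=0.5, max_rounds=20):
    """Poll each search until it finishes; requeue unfinished ones with a pause.

    Returns (finished_ids, still_pending_ids).
    """
    pending = deque(search_ids)
    finished = []
    rounds = 0
    while pending and rounds < max_rounds:
        # one round: check every search that is currently pending exactly once
        for _ in range(len(pending)):
            search_id = pending.popleft()
            if get_status(search_id) in ("Cached", "Success"):
                finished.append(search_id)
            else:
                pending.append(search_id)  # not ready yet, try again next round
        rounds += 1
        if pending:
            time.sleep(pause)  # avoid hammering the API while results are processing
    return finished, list(pending)


if __name__ == "__main__":
    attempts = {}

    def fake_status(search_id):
        # pretend every search needs two polls before it is ready
        attempts[search_id] = attempts.get(search_id, 0) + 1
        return "Success" if attempts[search_id] >= 2 else "Processing"

    print(drain(["id-1", "id-2"], fake_status, pause=0.1))
```

The `max_rounds` cap keeps a search that ends in an error state from being requeued forever.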

Disclaimer: I work for SerpApi.

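Answer 3

If you are trying to scrape Google search results at scale, it may be worth using a ready-made solution such as [link text missing] or Bright Data. The latter, for example, includes a SERP API for fetching search engine result data, and also offers rotating proxies and similar tools to get past CAPTCHAs and site blocks.

Answer 4

Doing this at scale is not easy. To explain simply how it works in production, I wrote [link text missing] to explain the request lifecycle. I also work for SerpApi, where we scale the API to serve billions of requests. Hope this helps.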
