Configuring Google Custom Search to work like google.search()

Question

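I have a relatively large project in which searching Google has returned the best results for our missing values. Using the google search module in Python gives me exactly the results I need. When I try to use Custom Search to lift the query limit, the results returned are not what I need. The code below (suggested in "Searching in Google with Python") returns exactly what I need, which is the same as what I get when searching on the Google website, but it gets blocked for making too many HTTP requests...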

import http.cookiejar
import urllib.request

from bs4 import BeautifulSoup
from google import search

def google_scrape(url):
    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    thepage = opener.open(url)
    soup = BeautifulSoup(thepage, "html.parser")
    return soup.title.text

i = 1
# queries = ['For. Policy Econ.','Int. J. Soc. For.','BMC Int Health Hum. Rights',
#            'Environ. Health Persp','Environ. Entomol.','Sociol. Rural.','Ecol. Soc.']

search_results = []    
abbrevs_searched = []   
url_results = []  

error_names = []
error = []

# Note: names_to_search is simply a longer version of the commented-out queries list.
for abbreviation in names_to_search:   
    query = abbreviation
    for url in search(query, num=2,stop=1):
        try:
            a = google_scrape(url)
            print(str(i) + ". " + a)
            search_results.append(a)
            abbrevs_searched.append(query)
            url_results.append(url)
            print(url)
            print(" ")
        except Exception as e:
            error_names.append(query)
            error.append(e)  # keep the exception itself, not the query again
            print("\n\n***************", " Exception: ", e)
        i += 1

I set up the Google Custom Search Engine code as follows...

import urllib
from bs4 import BeautifulSoup
import http.cookiejar
from googleapiclient.discovery import build  # "apiclient" is the legacy alias of this package

# List of names to search on Google
names_to_search = set(search_list_1 + search_list)
service = build('customsearch', 'v1', developerKey="AIz**********************")
rse = service.cse().list(q="For. Policy Econ.", cx='*******************').execute()
rse

My Google Custom Search Engine is set to search Google.com. So far, apart from the Google.com site, all other settings are at their defaults.

Answer 1

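As far as I know, the problem is not a limitation of the Python module but the fact that Google does not allow scripts to scrape its pages. When I ran your program (using the google module) I got HTTP Error 503. That happens because, after too many requests in a short time, Google asks for a CAPTCHA confirmation, and there is no module that can get past a CAPTCHA. So the only other option is to use the Google Custom Search API. The problem is that it is intended for searching your own pages.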
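Slowing down will not get past a CAPTCHA, but it may reduce how quickly a script runs into the 503 in the first place. A minimal sketch, assuming a caller-supplied `fetch` function (a hypothetical name; anything that raises `urllib.error.HTTPError` on failure would do) and doubling delays chosen purely for illustration:

```python
import time
import urllib.error


def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on HTTP 503, wait and retry with doubling delays."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            # Re-raise anything that is not a 503, and a 503 on the last attempt.
            if err.code != 503 or attempt == retries:
                raise
            # Waits base_delay, then 2x, then 4x, ... between attempts.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable only so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.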

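With Google Custom Search, you can create a search engine for your own website, blog, or collection of websites. Read more.

There is a way to search the entire web, as Bangkokian explains in his answer: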

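To create a Google Custom Search Engine that searches the entire web:

1. From the Google Custom Search homepage, click "Create a Custom Search Engine".

2. Enter a name and description for your search engine.

3. Under "Define your search engine", in the "Sites to search" box, enter at least one valid URL (for now, just put www.anyurl.com to get past this screen. More on this later).

4. Choose the CSE edition you want and accept the Terms of Service, then click "Next". Select the layout option you want, then click "Next".

5. Click any of the links under the "Next steps" section to navigate to the Control panel.

6. In the left-hand menu, under "Control Panel", click "Basics".

7. In the "Search Preferences" section, select "Search the entire web but emphasize included sites".

8. Click Save Changes.

9. In the left-hand menu, under "Control Panel", click "Sites".

10. Delete the site you entered during the initial setup process.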

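You have already created a custom search engine, so in Google Custom Search you need to click the search engine you already have (it is probably named "Google").

Then, in the Search Preferences section, select "Search the entire web but emphasize included sites" (step 7) and click the Add button. Type http://www.example.org/, set it to include only that specific page, and click Save. After that, select your old site and click Delete, then click Update to save the changes.

Unfortunately, this will not give the same results as searching on the web: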

Note that results may not match the results you get by searching on Google Web Search. Read more.
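To compare the two result sets side by side, it helps to reduce the Custom Search response to the same title and URL pairs that the question collects. A small sketch, assuming the response is the dict returned by `service.cse().list(...).execute()` and that matches appear under an `items` key with `title` and `link` fields; `extract_results` and the sample data are illustrative names and values, not real output:

```python
def extract_results(response):
    """Return (title, link) pairs from a Custom Search JSON response dict."""
    # "items" may be missing when nothing matched, so default to an empty list.
    return [(item.get("title"), item.get("link"))
            for item in response.get("items", [])]


# Hypothetical, hand-written response used only to show the expected shape.
sample = {
    "items": [
        {"title": "Example Journal", "link": "http://www.example.org/journal"},
    ]
}
print(extract_results(sample))
```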

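In addition, you can only use the free edition:

This article applies only to the free, basic custom search engine. You cannot set Google Site Search to search the entire web. Read more.

There is a limit of 100 search queries per day:

For CSE users, the API provides 100 search queries per day for free. Read more.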
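With an allowance that small, it is worth making sure a long list of abbreviations never spends a query twice. A rough sketch, assuming a caller-supplied `lookup` callable that performs one API query; `QueryBudget` is a hypothetical helper, and the counter lives only for the current process, so resetting it each day is left out:

```python
class QueryBudget:
    """Cache answers and refuse new lookups once the allowance is spent."""

    def __init__(self, lookup, daily_limit=100):
        self.lookup = lookup            # callable that performs one API query
        self.daily_limit = daily_limit  # 100 matches the free quota quoted above
        self.used = 0
        self.cache = {}

    def search(self, query):
        if query in self.cache:         # repeated queries cost nothing
            return self.cache[query]
        if self.used >= self.daily_limit:
            raise RuntimeError("daily query allowance exhausted")
        self.used += 1
        self.cache[query] = self.lookup(query)
        return self.cache[query]
```

Wrapping the API call this way would look like `budget = QueryBudget(my_lookup)` followed by `budget.search("For. Policy Econ.")`, where `my_lookup` stands for whatever function issues the actual request.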

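The only other option is to use another search engine's API. The only free one seems to be the FAROO API.

Edit: You can use the Selenium WebDriver in Python to imitate a browser. There are options to use the Firefox, Chrome, Edge, or Safari web drivers (it actually opens Chrome and performs the search), but this is annoying because you do not really want to see the browser. There is a solution, though: you can use PhantomJS.

PhantomJS is a headless WebKit scriptable with a JavaScript API.

Download it from here. Extract it and see how to use it in the example below (I wrote a simple class you can use; you only need to change the path to PhantomJS):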

import time
from urllib.parse import quote_plus
from selenium import webdriver


class Browser:

    def __init__(self, path, initiate=True, implicit_wait_time = 10, explicit_wait_time = 2):
        self.path = path
        self.implicit_wait_time = implicit_wait_time    # http://www.aptuz.com/blog/selenium-implicit-vs-explicit-waits/
        self.explicit_wait_time = explicit_wait_time    # http://www.aptuz.com/blog/selenium-implicit-vs-explicit-waits/
        if initiate:
            self.start()
        return

    def start(self):
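        # Use the path stored on the instance rather than the global variable.
        # Note: webdriver.PhantomJS exists in Selenium 3.x and was removed in Selenium 4.
        self.driver = webdriver.PhantomJS(self.path)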
        self.driver.implicitly_wait(self.implicit_wait_time)
        return

    def end(self):
        self.driver.quit()
        return

    def go_to_url(self, url, wait_time = None):
        if wait_time is None:
            wait_time = self.explicit_wait_time
        self.driver.get(url)
        print('[*] Fetching results from: {}'.format(url))
        time.sleep(wait_time)
        return

    def get_search_url(self, query, page_num=0, per_page=10, lang='en'):
        query = quote_plus(query)
        # "hl" is Google's interface-language parameter (the original used "nl").
        url = 'https://www.google.hr/search?q={}&num={}&start={}&hl={}'.format(query, per_page, page_num*per_page, lang)
        return url

    def scrape(self):
        # The XPath might change in the future.
        links = self.driver.find_elements_by_xpath("//h3[@class='r']/a[@href]")  # all links inside h3 tags with class "r"
        results = []
        for link in links:
            d = {'url': link.get_attribute('href'),
                 'title': link.text}
            results.append(d)
        return results

    def search(self, query, page_num=0, per_page=10, lang='en', wait_time = None):
        if wait_time is None:
            wait_time = self.explicit_wait_time
        url = self.get_search_url(query, page_num, per_page, lang)
        self.go_to_url(url, wait_time)
        results = self.scrape()
        return results




path = '<YOUR PATH TO PHANTOMJS>/phantomjs-2.1.1-windows/bin/phantomjs.exe'  # set your path to phantomjs
br = Browser(path)
results = br.search('For. Policy Econ.')
for r in results:
    print(r)

br.end()
