Given a list of websites, search them and return information in Python

Question

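I wrote a function that returns a list of URLs for a given company name. I would like to know how to search through this list of URLs and find information on whether the company is owned by another company.

Example: the company "Marketo" was acquired by Adobe.

I want to return whether a company has been acquired, and by whom.

This is what I have so far: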

import requests
from googlesearch import search
from bs4 import BeautifulSoup as BS


def get_url(company_name):
    url_list = []
    for url in search(company_name, stop=10):
        url_list.append(url)
    return url_list


test1 = get_url('Marketo')
print(test1[7])


r = requests.get(test1[7])
html = r.text
soup = BS(html, 'lxml')
stuff = soup.find_all('a')


print(stuff)

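I am new to web scraping, and I don't know how to actually search each URL (assuming I can) and find the information I am looking for.

The value of test1 is as follows: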

['https://www.marketo.com/', 'https://www.marketo.com/software/marketing-automation/', 'https://blog.marketo.com/', 'https://www.marketo.com/software/', 'https://www.marketo.com/company/', 'https://www.marketo.com/solutions/pricing/', 'https://www.marketo.com/solutions/', 'https://en.wikipedia.org/wiki/Marketo', 'https://www.linkedin.com/company/marketo', 'https://www.cmswire.com/digital-marketing/what-is-marketo-a-marketers-guide/']

Answer 1

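I want to return whether a company has been acquired, and by whom

You can scrape the Crunchbase website for this information. The drawback is that your search is limited to their site. To extend this, you could include some other sites as well.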

import requests
from bs4 import BeautifulSoup
import re
while True:
    print()
    organization_name=input('Enter organization_name: ').strip().lower()
    crunchbase_url='https://www.crunchbase.com/organization/'+organization_name
    headers={
        'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
    }
    r=requests.get(crunchbase_url,headers=headers)
    if r.status_code == 404:
        print('This organization is not available\n')
    else:
        soup=BeautifulSoup(r.text,'html.parser')
        overview_h2=soup.find('h2',text=re.compile('Overview'))
        try:
            possible_acquired_by_span=overview_h2.find_next('span',class_='bigValueItemLabelOrData')
            if possible_acquired_by_span.text.strip() == 'Acquired by':
                acquired_by=possible_acquired_by_span.find_next('span',class_='bigValueItemLabelOrData').text.strip()
            else:
                acquired_by=False
        except Exception as e:
            acquired_by=False
            # uncomment below line if you want to see the error
            # print(e)
        if acquired_by:
            print('Acquired By: '+acquired_by+'\n')
        else:
            print('No acquisition information available\n')

    again=input('Do You Want To Continue? ').strip().lower()
    if  again not in ['y','yes']:
        break

Sample output:

Enter organization_name: Marketo
Acquired By: Adobe Systems

Do You Want To Continue? y

Enter organization_name: Facebook
No acquisition information available

Do You Want To Continue? y

Enter organization_name: FakeCompany
This organization is not available

Do You Want To Continue? n

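Notes

Before deploying this in any commercial project, read the Crunchbase terms and get their consent.
Also check out the Crunchbase API; I think that would be the legitimate way to do what you are asking.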
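
To extend the search beyond Crunchbase, as suggested above, one option is to scan the visible text of each page in the URL list for an "acquired by" phrase. The following is only a minimal sketch: the function name find_acquirers and the regular expression are assumptions made for illustration, and the pattern will miss differently worded sentences (for example "Adobe bought Marketo").

```python
import re

# Assumed pattern: "acquired by" followed by one or more capitalized words.
# Name parts are joined by single spaces so a match does not run across lines.
ACQUIRED_BY = re.compile(r"\b[Aa]cquired by\s+([A-Z][\w&-]*(?: [A-Z][\w&-]*)*)")


def find_acquirers(text):
    """Return the distinct names that follow 'acquired by', in order of appearance."""
    seen = []
    for name in ACQUIRED_BY.findall(text):
        if name not in seen:
            seen.append(name)
    return seen


print(find_acquirers("Marketo was acquired by Adobe Systems."))
```

The page text for each URL could come from the question's own tools, for example BS(requests.get(url).text, 'lxml').get_text(' '), and an empty list would mean no such phrase was found on that page.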

Answer 2

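You can find this information on a website like Crunchbase.

The steps to get it are as follows:

Build the URL that contains the information about the target company. Suppose you have found the URL with the information you need:

url = 'https://www.example.com/infoaboutmycompany.html'

Use Selenium to get the HTML, because the site does not allow you to scrape the page directly. Something like this:

from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source

Use BeautifulSoup to get the text from the div that contains the information. It has a specific class, which you can easily find in the HTML:

bsobj = BeautifulSoup(html, 'lxml')
res = bsobj.find('div', {'class':'alpha beta gamma'})
res.text.strip()

The code to get it is fewer than 10 lines.

Of course, this requires changing your list from a list of URLs to a list of companies, which will hopefully be covered by that site. For Marketo, it works.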
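
Switching from a list of URLs to a list of companies could be sketched as below. The base URL follows the Crunchbase pattern used in the other answers. The slug rule (lowercase, words joined by hyphens) and the helper names company_slug and company_urls are assumptions for illustration, so a given company's real page may use a different slug.

```python
import re

# URL pattern as used in the other answers in this thread.
BASE_URL = "https://www.crunchbase.com/organization/{0}"


def company_slug(name):
    """Lowercase the name and join its alphanumeric words with hyphens (assumed rule)."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return "-".join(words)


def company_urls(names):
    """Build one candidate URL per company name."""
    return [BASE_URL.format(company_slug(name)) for name in names]


print(company_urls(["Marketo", "Adobe Systems"]))
```

Each resulting URL can then be passed to driver.get(url) as in the steps above.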

Answer 3

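As the other answers mention, Crunchbase is a good place to get this kind of information, but you will need a headless browser, driven by something like Selenium, to scrape Crunchbase.

Installing Selenium is fairly easy if you are on Ubuntu. Selenium requires a driver to interface with the chosen browser. Firefox, for example, requires geckodriver.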

pip install selenium
sudo pip3 install selenium --upgrade

Install the latest version of geckodriver

wget https://github.com/mozilla/geckodriver/releases/download/v0.24.0/geckodriver-v0.24.0-linux64.tar.gz
tar -xvzf geckodriver*
chmod +x geckodriver

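Add the driver to your PATH so that other tools can find it, or move it to a directory where all your software is installed; otherwise an error is thrown ('geckodriver' executable needs to be in PATH).

mv geckodriver /usr/bin/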

Code

from bs4 import BeautifulSoup as BS
from selenium import webdriver


baseurl = "https://www.crunchbase.com/organization/{0}"

query = input('type company name : ').strip().lower()
url = baseurl.format(query)

driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
soup = BS(html, 'lxml')
acquiredBy = soup.find('div', class_= 'flex-no-grow cb-overflow-ellipsis identifier-label').text


print(acquiredBy)

You can get other information with the same logic: just inspect the class/ID and scrape the information.
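
One caveat with the code above: soup.find returns None when no element matches, so calling .text on the result raises an AttributeError for pages without that div. A minimal sketch of a guarded lookup follows. The helper name first_identifier_label is hypothetical, the class name is taken from the code above and may no longer match Crunchbase's current markup, and html.parser is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup


def first_identifier_label(html):
    """Return the stripped text of the first div with class 'identifier-label', or None."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any div that has this class among its classes.
    label = soup.find("div", class_="identifier-label")
    return label.get_text(strip=True) if label is not None else None


page = '<div class="flex-no-grow cb-overflow-ellipsis identifier-label"> Adobe Systems </div>'
print(first_identifier_label(page))
```

Passing driver.page_source to this function keeps the rest of the script unchanged while letting the caller handle the None case.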
