Unable to scrape

Question    Votes: 0    Answers: 3


I am trying to get the list of companies from AngelList: https://angel.co/companies

I tried this code:

from bs4 import BeautifulSoup
import urllib2

headers = { 'User-Agent' : 'Mozilla/5.0' }
req = urllib2.Request('https://angel.co/companies', None, headers)
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html, "html.parser")
p1 = soup.find_all('div' , {"class"," dc59 frw44 _a _jm"})
print p1

But this returns an empty result (find_all gives back an empty list).

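I have come across similar questions where some people suggest updating BeautifulSoup and others suggest changing the parser. Nothing has worked for me.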

python html web-scraping beautifulsoup
3 Answers
7
votes

You can get all the company information as HTML, without selenium, by first fetching the request parameters from https://angel.co/company_filters/search_data:

import requests
from bs4 import BeautifulSoup

# POST endpoint that returns the search parameters (ids, total, page, hexdigest)
js = "https://angel.co/company_filters/search_data"

headers = {"X-Requested-With": "XMLHttpRequest",
           "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}

# GET endpoint that returns the company listing as HTML inside JSON
u = "https://angel.co/companies/startups?ids%5B%5D={}&total={}&page={}&sort=signal&new=false&hexdigest={}"
with requests.Session() as s:
    params = s.post(js, data={"sort": "signal"}, headers=headers).json()
    # arguments follow the placeholder order in u: ids, total, page, hexdigest
    companies = s.get(u.format("&ids%5B%5D=".join(map(str, params["ids"])), params["total"], params["page"], params["hexdigest"]), headers=headers)
    soup = BeautifulSoup(companies.json()["html"], "html.parser")
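The positional placeholders in u are easy to mix up. As a minimal sketch (Python 3, standard library only), the same query string can be built by name with urllib.parse.urlencode. build_startups_url is a hypothetical helper, and the parameter names are simply the ones that appear in u above, not a documented API:

```python
from urllib.parse import urlencode

BASE = "https://angel.co/companies/startups"


def build_startups_url(params):
    """Build the GET URL from the JSON dict returned by the search_data POST.

    Assumes params carries the keys used above: ids, total, page, hexdigest.
    """
    query = {
        "ids[]": params["ids"],  # doseq=True repeats the key once per id
        "total": params["total"],
        "page": params["page"],
        "sort": "signal",
        "new": "false",
        "hexdigest": params["hexdigest"],
    }
    return BASE + "?" + urlencode(query, doseq=True)


if __name__ == "__main__":
    # Made-up values, only to show the shape of the resulting URL.
    sample = {"ids": [275696, 275832], "total": 2, "page": 1, "hexdigest": "abc123"}
    print(build_startups_url(sample))
```

urlencode percent-encodes the brackets in "ids[]" as %5B%5D, which matches the hand-written template.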

You can simulate "load more" by passing the page number as you iterate:

import requests
from bs4 import BeautifulSoup
import time

# post url
js = "https://angel.co/company_filters/search_data"

# X-Requested-With is important
headers = {"X-Requested-With": "XMLHttpRequest",
           "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}


# get url
u = "https://angel.co/companies/startups?ids%5B%5D={}&total={}&page={}&sort=signal&new=false&hexdigest={}"


def get_next_pages(js, u, start_page=1):
    with requests.Session() as s:
        params = s.post(js, data={"sort": "signal","page":start_page}, headers=headers).json()
        companies = s.get(
            # arguments follow the placeholder order in u: ids, total, page, hexdigest
            u.format("&ids%5B%5D=".join(map(str, params["ids"])), params["total"], params["page"], params["hexdigest"]),
            headers=headers)
        soup = BeautifulSoup(companies.json()["html"], "html.parser")
        comps = soup.select("div.company.column")
        yield comps
        while True:
            # increment page count from previous.
            page = params["page"] + 1
            params = s.post(js, data={"sort": "signal", "page": page}, headers=headers).json()
            # keep going until we have reached the maximum queries
            if "ids" not in params:
                break
            # arguments follow the placeholder order in u: ids, total, page, hexdigest
            companies = s.get(u.format("&ids%5B%5D=".join(map(str, params["ids"])), params["total"], params["page"],
                                       params["hexdigest"]),
                              headers=headers)
            soup = BeautifulSoup(companies.json()["html"], "html.parser")
            comps = soup.select("div.company.column")
            # don't hammer with requests
            time.sleep(.3)
            yield comps

for comps in get_next_pages(js, u):
    print(comps)

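If we look at the network output in the developer tools, we can see the data that is posted each time we click "load more". This keeps going until we hit the limit.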
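The stop condition (keep requesting pages until the response no longer contains "ids") can be separated from the network code so it is easy to check on its own. A minimal sketch in Python 3: iter_pages and fake_fetch are hypothetical names, and fetch_params stands in for the s.post(...).json() call:

```python
def iter_pages(fetch_params, start_page=1):
    """Yield one params dict per page until the server stops returning ids.

    fetch_params is any callable that takes a page number and returns the
    decoded JSON dict for that page.
    """
    page = start_page
    while True:
        params = fetch_params(page)
        if "ids" not in params:
            # no more results: the limit has been reached
            break
        yield params
        # ask for the page after the one the server just reported
        page = params["page"] + 1


if __name__ == "__main__":
    # Fake endpoint standing in for the real POST: three pages, then no ids.
    def fake_fetch(page):
        if page > 3:
            return {"page": page}
        return {"ids": [page * 10, page * 10 + 1], "page": page}

    for params in iter_pages(fake_fetch):
        print(params["page"], params["ids"])
```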


A snippet of the output from running the get_next_pages loop above:

[<div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="275696" data-type="Startup" href="https://angel.co/dunwello?utm_source=companies" title="Dunwello"><img alt="Dunwello" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/275696-99335faecd2fb01467c98d5032f23cf6-thumb_jpg.jpg?buster=1393099676"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="275696" data-type="Startup" href="https://angel.co/dunwello?utm_source=companies">Dunwello</a>
</div>
<div class="pitch">
Trustworthy recommendations of individual professionals.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="275832" data-type="Startup" href="https://angel.co/groupahead?utm_source=companies" title="GroupAhead"><img alt="GroupAhead" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/275832-3541a563250008bd3f7f9b4d7fe9c33c-thumb_jpg.jpg?buster=1423077576"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="275832" data-type="Startup" href="https://angel.co/groupahead?utm_source=companies">GroupAhead</a>
</div>
<div class="pitch">
Dedicated apps for groups
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="431492" data-type="Startup" href="https://angel.co/workpop?utm_source=companies" title="Workpop"><img alt="Workpop" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/431492-c1b857e30254da60f3847d5358db5c82-thumb_jpg.jpg?buster=1404420060"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="431492" data-type="Startup" href="https://angel.co/workpop?utm_source=companies">Workpop</a>
</div>
<div class="pitch">
When can you start?
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="446358" data-type="Startup" href="https://angel.co/late-stage-pre-ipo-syndicate?utm_source=companies" title="Late Stage Pre-IPO @ Flight.vc"><img alt="Late Stage Pre-IPO @ Flight.vc" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/446358-3511ab7edb5192dad97cbccf2b67ddd7-thumb_jpg.jpg?buster=1428089778"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="446358" data-type="Startup" href="https://angel.co/late-stage-pre-ipo-syndicate?utm_source=companies">Late Stage Pre-IPO @ Flight.vc</a>
</div>
<div class="pitch">
Syndicated:  Beepi, Zirx, Boost Media, Rent the Runway, Life 360, Scripted
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="450451" data-type="Startup" href="https://angel.co/complex-polygon?utm_source=companies" title="Complex Polygon"><img alt="Complex Polygon" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/450451-4f00fd11b2d54533a5bac3cfa72acb1e-thumb_jpg.jpg?buster=1407937645"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="450451" data-type="Startup" href="https://angel.co/complex-polygon?utm_source=companies">Complex Polygon</a>
</div>
<div class="pitch">
Product studio based in San Francisco, California. 
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="457068" data-type="Startup" href="https://angel.co/21?utm_source=companies" title="21"><img alt="21" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/457068-2e7b8c417c3a70aab3026f5f0ca3d8e9-thumb_jpg.jpg?buster=1425975133"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="457068" data-type="Startup" href="https://angel.co/21?utm_source=companies">21</a>
</div>
<div class="pitch">
A bitcoin miner in every device and in every hand.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="460720" data-type="Startup" href="https://angel.co/parenthoods?utm_source=companies" title="Parenthoods"><img alt="Parenthoods" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/460720-25bc7ca7afd4f7bf0fd7842cafa1bdd1-thumb_jpg.jpg?buster=1425426951"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="460720" data-type="Startup" href="https://angel.co/parenthoods?utm_source=companies">Parenthoods</a>
</div>
<div class="pitch">
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="462906" data-type="Startup" href="https://angel.co/seed-8?utm_source=companies" title="Seed"><img alt="Seed" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/462906-f6b439e20a9d36b9e2d3792da92d160d-thumb_jpg.jpg?buster=1462318689"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="462906" data-type="Startup" href="https://angel.co/seed-8?utm_source=companies">Seed</a>
</div>
<div class="pitch">
Online Business Banking
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="470102" data-type="Startup" href="https://angel.co/zen99?utm_source=companies" title="Zen99"><img alt="Zen99" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/470102-67da791cec4374a1046c53fe99b6f05f-thumb_jpg.jpg?buster=1410560341"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="470102" data-type="Startup" href="https://angel.co/zen99?utm_source=companies">Zen99</a>
</div>
<div class="pitch">
Finance and insurance tools for freelancers
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="488240" data-type="Startup" href="https://angel.co/maven-ventures-growth-labs?utm_source=companies" title="Maven Ventures Growth Labs"><img alt="Maven Ventures Growth Labs" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/488240-d467860829cac8b1a9fbfa2d14e05789-thumb_jpg.jpg?buster=1411577330"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="488240" data-type="Startup" href="https://angel.co/maven-ventures-growth-labs?utm_source=companies">Maven Ventures Growth Labs</a>
</div>
<div class="pitch">
Get a option to invest up to $500k in the best Maven grads
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="507975" data-type="Startup" href="https://angel.co/skydio?utm_source=companies" title="Skydio"><img alt="Skydio" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/507975-aac9786d6c4cba99be634b7bc1969cf3-thumb_jpg.jpg?buster=1420952326"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="507975" data-type="Startup" href="https://angel.co/skydio?utm_source=companies">Skydio</a>
</div>
<div class="pitch">
MIT, Google[x]ers with deep prior experience doing intelligent navigation for drones
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="517240" data-type="Startup" href="https://angel.co/fin-tech-syndicate?utm_source=companies" title="Fin Tech by Flight.vc"><img alt="Fin Tech by Flight.vc" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/517240-5bc50eb42d1e40a8ad437c6bd164a5a8-thumb_jpg.jpg?buster=1414004533"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="517240" data-type="Startup" href="https://angel.co/fin-tech-syndicate?utm_source=companies">Fin Tech by Flight.vc</a>
</div>
<div class="pitch">
Investing in Financial Services and Fin-Tech that has proprietary advantages
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="521452" data-type="Startup" href="https://angel.co/channel-app?utm_source=companies" title="Channel"><img alt="Channel" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/521452-b6bc15ef040fdf37d885aea71ecad3bb-thumb_jpg.jpg?buster=1446676191"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="521452" data-type="Startup" href="https://angel.co/channel-app?utm_source=companies">Channel</a>
</div>
<div class="pitch">
Watch the world.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="443932" data-type="Startup" href="https://angel.co/healthsherpa?utm_source=companies" title="HealthSherpa"><img alt="HealthSherpa" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/443932-63c6bcbbf9ba36a7fa3e532177222c9b-thumb_jpg.jpg?buster=1462374897"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="443932" data-type="Startup" href="https://angel.co/healthsherpa?utm_source=companies">HealthSherpa</a>
</div>
<div class="pitch">
Next-generation Healthcare.gov
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="558206" data-type="Startup" href="https://angel.co/sidewire?utm_source=companies" title="Sidewire"><img alt="Sidewire" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/558206-b416bf8347c7f766b5ea1cf79123c4d2-thumb_jpg.jpg?buster=1444189112"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="558206" data-type="Startup" href="https://angel.co/sidewire?utm_source=companies">Sidewire</a>
</div>
<div class="pitch">
Where Experts Chat in Public
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="570055" data-type="Startup" href="https://angel.co/brainchild-1?utm_source=companies" title="Brainchild &amp;amp; Co."><img alt="Brainchild &amp; Co." class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/570055-cc2c2309fefa21e3ebda6229d6a0b890-thumb_jpg.jpg?buster=1420474118"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="570055" data-type="Startup" href="https://angel.co/brainchild-1?utm_source=companies">Brainchild &amp; Co.</a>
</div>
<div class="pitch">
Building services and products for consumers
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="571060" data-type="Startup" href="https://angel.co/signatures-capital?utm_source=companies" title="Signatures Capital"><img alt="Signatures Capital" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/571060-8a077d7cbac9cc7e2d81859adb8cd1c6-thumb_jpg.jpg?buster=1420664121"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="571060" data-type="Startup" href="https://angel.co/signatures-capital?utm_source=companies">Signatures Capital</a>
</div>
<div class="pitch">
Supporting founders committed to inventing the future.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="623000" data-type="Startup" href="https://angel.co/airtable?utm_source=companies" title="Airtable"><img alt="Airtable" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/623000-9d210a39051abc7accec1dc686888dcc-thumb_jpg.jpg?buster=1449952044"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="623000" data-type="Startup" href="https://angel.co/airtable?utm_source=companies">Airtable</a>
</div>
<div class="pitch">
Organize anything you can imagine
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="630861" data-type="Startup" href="https://angel.co/meerkat?utm_source=companies" title="Meerkat"><img alt="Meerkat" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/630861-820b9d4af09e110b150c9affe418d860-thumb_jpg.jpg?buster=1425688408"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="630861" data-type="Startup" href="https://angel.co/meerkat?utm_source=companies">Meerkat</a>
</div>
<div class="pitch">
Live Stream Video.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="658877" data-type="Startup" href="https://angel.co/flight-vc-syndicate?utm_source=companies" title="Flight Ventures"><img alt="Flight Ventures" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/658877-89ccd88502db9d964a651ecba6f86d9d-thumb_jpg.jpg?buster=1457552637"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="658877" data-type="Startup" href="https://angel.co/flight-vc-syndicate?utm_source=companies">Flight Ventures</a>
</div>
<div class="pitch">
Investing in the Top Companies and Entrepreneurs
</div>
</div>
</div>
</div>]
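Each element in that list has the same layout, so the fields can be pulled out with a few selectors. A minimal sketch, assuming the markup stays as shown in the output above; parse_companies is a hypothetical helper and the sample string is trimmed from the first result:

```python
from bs4 import BeautifulSoup


def parse_companies(html):
    """Return one dict (id, name, url, pitch) per div.company.column in html."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for div in soup.select("div.company.column"):
        link = div.select_one("div.name a.startup-link")
        if link is None:
            continue  # skip entries that do not have the expected layout
        pitch = div.select_one("div.pitch")
        rows.append({
            "id": link.get("data-id"),
            "name": link.get_text(strip=True),
            "url": link.get("href"),
            "pitch": pitch.get_text(strip=True) if pitch else "",
        })
    return rows


if __name__ == "__main__":
    sample = """
    <div class="company column"><div class="g-lockup"><div class="text">
    <div class="name">
    <a class="startup-link" data-id="275696" data-type="Startup" href="https://angel.co/dunwello?utm_source=companies">Dunwello</a>
    </div>
    <div class="pitch">
    Trustworthy recommendations of individual professionals.
    </div>
    </div></div></div>
    """
    for row in parse_companies(sample):
        print(row)
```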

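There are more filters and so on. If you want to see how they work, select them in the browser and watch how the requests are made under the XHR tab of the Network panel in Firebug or the developer tools.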


2
votes

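The data you want to extract is generated by JavaScript. That is why p1 is an empty list: urllib2.urlopen(req).read() gives you the server response as sent, and it does not wait for the JS to run.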
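One quick way to confirm this is to check whether the class you are searching for appears anywhere in the raw HTML the server sent. A minimal sketch in Python 3 using only the standard library; ClassCollector and classes_in are hypothetical helper names, and the sample markup is made up:

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every class name that appears in the parsed markup."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def classes_in(html):
    """Return the set of class names found in an HTML string."""
    parser = ClassCollector()
    parser.feed(html)
    parser.close()
    return parser.classes


if __name__ == "__main__":
    raw = '<html><body><div class="results loading"></div></body></html>'
    # If the class is missing from the raw response, the element is
    # added later by JavaScript and a plain HTTP fetch will never see it.
    print("frw44" in classes_in(raw))
```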

Use BeautifulSoup in combination with Selenium.

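from bs4 import BeautifulSoup
from selenium import webdriver

# let a real browser load the page so that the JavaScript runs
browser = webdriver.Firefox()
browser.get('https://angel.co/companies')
html = browser.page_source
browser.quit()

soup = BeautifulSoup(html, "html.parser")
# attrs must be a dict ("class": value); the leading space is dropped from the value
p1 = soup.find_all('div', {"class": "dc59 frw44 _a _jm"})
print(p1)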

Also, if this doesn't work (it is untested), make the class selector simpler, i.e. try searching for just dc59 and gradually make it more specific.
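A minimal sketch of that narrowing process: count the matches for each selector, from broadest to most specific, and see where the count drops to zero. count_matches is a hypothetical helper and the sample markup is made up for illustration:

```python
from bs4 import BeautifulSoup


def count_matches(html, selectors):
    """Return (selector, number of matching elements) for each CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [(sel, len(soup.select(sel))) for sel in selectors]


if __name__ == "__main__":
    sample = '<div class="dc59 frw44 _a _jm">x</div><div class="dc59">y</div>'
    selectors = [
        "div.dc59",               # broadest: any div with class dc59
        "div.dc59.frw44",         # narrower: both classes required
        "div.dc59.frw44._a._jm",  # most specific: all four classes
    ]
    for sel, n in count_matches(sample, selectors):
        print(sel, n)
```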


0
votes

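In your case, it seems that all the div elements with the class frw44 are generated dynamically with JS. You cannot get data that is generated dynamically with JavaScript by using the conventional urllib, urllib2 or requests modules (or even mechanize). You have to use selenium with Chrome, Firefox or PhantomJS to simulate a browser environment that evaluates the JavaScript in the page.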

Take a look at the Selenium bindings for Python.

I have tested and verified the following:

from bs4 import BeautifulSoup as bs
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://angel.co/companies")
html = driver.page_source
driver.quit()
soup = bs(html,"html.parser")
p1 = soup.findAll('div' , {"class":" dc59 frw44 _a _jm"})
print p1