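Python Scrapy doesn't seem to be getting data from all available URLs

Question (votes: 0, answers: 2)

I'm trying to scrape thesession.org to build a table showing how many times each tune has been added to members' tunebooks, so that I can find some popular tunes to learn. I started with the Scrapy tutorial and tried to adapt it to my purposes. The problem is that although thesession.org appears to have about 10,390 tunes, my scraper only returns data for 10 of them (only those listed on http://www.thesession.org/tunes/index.php). How can I get data for all of the tunes (or for the top hundred)? Any advice would be appreciated.

Here is what I have so far: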

items.py

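from scrapy.item import Item, Field

class tuneItem(Item):
    url = Field()    # URL of the tune page
    name1 = Field()  # tune title
    name2 = Field()  # tune type (reel, polka, ...)
    key = Field()    # key signature
    count = Field()  # number of member tunebooks containing the tune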

tune_spider.py

from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from tutorial.items import tuneItem
from scrapy.conf import settings

class tunesSpider(CrawlSpider):

    name = "irishtunes"
    allowed_domains = ["thesession.org"]
    start_urls = ["http://www.thesession.org/tunes"]
    rules = [Rule(SgmlLinkExtractor(allow=[r'/display/\d+'], deny=[r'/members/', r'/recordings/', r'/index/', r'/display/\d+/.']), 'parse_tune')]

    def parse_tune(self, response):
        x = HtmlXPathSelector(response)

        tune = tuneItem()
        tune['url'] = response.url
        tune['name1'] = x.select("//div[@id='details']//div[@class='box']/h1/text()").extract()
        tune['name2'] = x.select("//div[@id='details']//div[@class='box']/h2/text()").extract()
        tune['key']   = x.select("//div[@id='details']//div[@class='box']/p[1]/text()").extract()
        tune['count'] = x.select("//div[@id='details']//div[@class='box']/p[3]/text()").re(r'\d+')
        return tune

I run the scraper by opening a console, changing to the directory that contains the tutorial's cfg file, and running:

scrapy crawl irishtunes --set FEED_URI=scraped_data.csv --set FEED_FORMAT=csv

Here is what I get:

C:\Users\BM\Desktop\scrape\tutorial>scrapy crawl irishtunes --set FEED_URI=scrap
ed_data.csv --set FEED_FORMAT=csv
2011-11-25 22:45:47-0800 [scrapy] INFO: Scrapy 0.14.0.2841 started (bot: tutoria
l)
2011-11-25 22:45:47-0800 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogSt
ats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAut
hMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, De
faultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMi
ddleware, ChunkedTransferMiddleware, DownloaderStats
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMi
ddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddle
ware
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Enabled item pipelines:
2011-11-25 22:45:48-0800 [irishtunes] INFO: Spider opened
2011-11-25 22:45:48-0800 [irishtunes] INFO: Crawled 0 pages (at 0 pages/min), sc
raped 0 items (at 0 items/min)
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:602
3
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-11-25 22:45:48-0800 [irishtunes] DEBUG: Redirecting (301) to <GET http://ww
w.thesession.org/tunes/> from <GET http://www.thesession.org/tunes>
2011-11-25 22:45:48-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/> (referer: None)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11602> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11602>
        {'count': [u'1'],
         'key': [u'Key signature: Dmajor'],
         'name1': [u"Brendan Begley's"],
         'name2': [u'polka'],
         'url': 'http://www.thesession.org/tunes/display/11602'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11593> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11593>
        {'count': [u'3'],
         'key': [u'Key signature: Amajor'],
         'name1': [u'Carleton County Breakdown'],
         'name2': [u'reel'],
         'url': 'http://www.thesession.org/tunes/display/11593'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11597> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11597>
        {'count': [u'3'],
         'key': [u'Key signature: Dmajor'],
         'name1': [u"Kasper's Rant"],
         'name2': [u'hornpipe'],
         'url': 'http://www.thesession.org/tunes/display/11597'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11594> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11594>
        {'count': [u'5'],
         'key': [u'Key signature: Gmajor'],
         'name1': [u'The Full Of The Bag'],
         'name2': [u'hornpipe'],
         'url': 'http://www.thesession.org/tunes/display/11594'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11599> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11599>
        {'count': [u'1'],
         'key': [u'Key signature: Adorian'],
         'name1': [u'The New Steamboat'],
         'name2': [u'reel'],
         'url': 'http://www.thesession.org/tunes/display/11599'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11598> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11598>
        {'count': [u'4'],
         'key': [u'Key signature: Gmajor'],
         'name1': [u"Galen's Arrival"],
         'name2': [u'reel'],
         'url': 'http://www.thesession.org/tunes/display/11598'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11596> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11596>
        {'count': [u'2'],
         'key': [u'Key signature: Amixolydian'],
         'name1': [u'Culloden Day'],
         'name2': [u'strathspey'],
         'url': 'http://www.thesession.org/tunes/display/11596'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11595> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11595>
        {'count': [u'2'],
         'key': [u'Key signature: Aminor'],
         'name1': [u'Miss Sine Flemington'],
         'name2': [u'barndance'],
         'url': 'http://www.thesession.org/tunes/display/11595'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11600> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11600>
        {'count': [u'2'],
         'key': [u'Key signature: Dmajor'],
         'name1': [u"Joan Martin's"],
         'name2': [u'polka'],
         'url': 'http://www.thesession.org/tunes/display/11600'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200) <GET http://www.these
ssion.org/tunes/display/11601> (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11601>
        {'count': [u'2'],
         'key': [u'Key signature: Gmajor'],
         'name1': [u'My Time Inside 2005'],
         'name2': [u'waltz'],
         'url': 'http://www.thesession.org/tunes/display/11601'}
2011-11-25 22:45:49-0800 [irishtunes] INFO: Closing spider (finished)
2011-11-25 22:45:49-0800 [irishtunes] INFO: Stored csv feed (10 items) in: scrap
ed_data.csv
2011-11-25 22:45:49-0800 [irishtunes] INFO: Dumping spider stats:
        {'downloader/request_bytes': 3655,
         'downloader/request_count': 12,
         'downloader/request_method_count/GET': 12,
         'downloader/response_bytes': 31620,
         'downloader/response_count': 12,
         'downloader/response_status_count/200': 11,
         'downloader/response_status_count/301': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2011, 11, 26, 6, 45, 49, 500000),
         'item_scraped_count': 10,
         'request_depth_max': 1,
         'scheduler/memory_enqueued': 12,
         'start_time': datetime.datetime(2011, 11, 26, 6, 45, 48, 10000)}
2011-11-25 22:45:49-0800 [irishtunes] INFO: Spider closed (finished)
2011-11-25 22:45:49-0800 [scrapy] INFO: Dumping global stats:
        {}

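Edit: @reclosedev's answer got me on my way. For anyone curious about the results, here is a snapshot:

(1) The vast majority of tunes are in fewer than 10 members' tunebooks

[image]

(2) The popularity of all 10,379 tunes I was able to scrape from the site (measured by how many tunebooks each tune is included in) follows a power-law distribution

[image]

(3) Here are the tunes on the site that are in more than 1000 tunebooks, showing the names of the top-ranked tunes and how many tunebooks each is in

[image]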
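
A ranking like the one in (3) can be produced from the scraped items with a few lines of plain Python. This is only a minimal sketch: the helper names `tunebook_count` and `top_tunes` are hypothetical, and it assumes the items still have the list-valued fields shown in the log above (for example `'count': [u'5']`).

```python
def tunebook_count(item):
    """Convert the list-valued 'count' field (e.g. ['5']) to an int; 0 if missing."""
    values = item.get("count") or []
    return int(values[0]) if values else 0


def top_tunes(items, n=100):
    """Return (title, count) pairs for the n tunes found in the most tunebooks."""
    ranked = sorted(items, key=tunebook_count, reverse=True)
    return [(" ".join(item.get("name1", [])), tunebook_count(item))
            for item in ranked[:n]]


# Sample items taken from the log output above.
items = [
    {"name1": ["Brendan Begley's"], "count": ["1"]},
    {"name1": ["Carleton County Breakdown"], "count": ["3"]},
    {"name1": ["The Full Of The Bag"], "count": ["5"]},
    {"name1": ["Galen's Arrival"], "count": ["4"]},
]

for title, count in top_tunes(items, n=2):
    print(title, count)
```

Items whose `count` field came back empty are treated as 0 rather than raising an error, which keeps the sort usable even if the XPath misses on some pages.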

python web-scraping scrapy
2 Answers
6 votes

You need to add a Rule that extracts the links to all of the listing pages, so that the spider will follow them:

rules = [
    ...,  # your existing parse_tune rule
    Rule(
        SgmlLinkExtractor(
            allow=(r'/index/new\?new_start=\d+',)
        ),
        follow=True,
    ),
]
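
To see which URLs each rule would pick up, the patterns can be tried out with the standard `re` module, without running a crawl. This is a rough sketch, assuming the link extractor applies its `allow` and `deny` patterns as regular-expression searches against the full URL and that the first matching rule handles a link. The `classify` helper and the `/comments` URL are hypothetical and only serve as illustration.

```python
import re

# Patterns copied from the question's rule and from the rule suggested above.
TUNE_ALLOW = re.compile(r"/display/\d+")
TUNE_DENY = [re.compile(p) for p in
             (r"/members/", r"/recordings/", r"/index/", r"/display/\d+/.")]
PAGE_ALLOW = re.compile(r"/index/new\?new_start=\d+")


def classify(url):
    """Return 'tune', 'listing', or 'ignored' for a URL (hypothetical helper)."""
    if TUNE_ALLOW.search(url) and not any(d.search(url) for d in TUNE_DENY):
        return "tune"      # would be passed to parse_tune
    if PAGE_ALLOW.search(url):
        return "listing"   # would only be followed to find more tune links
    return "ignored"


for url in (
    "http://www.thesession.org/tunes/display/11602",
    "http://www.thesession.org/tunes/display/11602/comments",  # made-up sub-page
    "http://www.thesession.org/tunes/index/new?new_start=10",
):
    print(classify(url), url)
```

The `?` in the listing pattern has to be escaped, since an unescaped `?` would make the preceding character optional instead of matching the query-string separator.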

Edit: follow=True is not required here, because callback=None (the default) implies follow=True.
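
The default described here also explains why the original spider stopped after 10 tunes: its only rule has a callback, so links found on the tune pages were not followed. The function below is not Scrapy code, just a small stand-in that mimics the documented default (`effective_follow` is a hypothetical name).

```python
def effective_follow(callback=None, follow=None):
    """Mimic the documented Rule default: follow only when no callback is set."""
    if follow is None:
        return callback is None
    return follow


print(effective_follow(callback="parse_tune"))               # rule from the question
print(effective_follow())                                    # pagination rule, no callback
print(effective_follow(callback="parse_tune", follow=True))  # explicit override
```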


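0 votes

There are many ways to do this; I suggest the simplest one:

Run the code ten times, replacing start_urls each time, or loop over the offsets with something like range(10,100,10):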

http://www.thesession.org/tunes/index/new?new_start=10
http://www.thesession.org/tunes/index/new?new_start=20
http://www.thesession.org/tunes/index/new?new_start=30
http://www.thesession.org/tunes/index/new?new_start=40
http://www.thesession.org/tunes/index/new?new_start=50
http://www.thesession.org/tunes/index/new?new_start=60
http://www.thesession.org/tunes/index/new?new_start=70
http://www.thesession.org/tunes/index/new?new_start=80
http://www.thesession.org/tunes/index/new?new_start=90
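
Rather than typing the URLs by hand, the list can be generated. A minimal sketch, assuming the `new_start` offset advances in steps of 10 as in the URLs above; `build_start_urls` is a hypothetical helper, and the resulting list would be assigned to the spider's `start_urls`.

```python
BASE_URL = "http://www.thesession.org/tunes/index/new?new_start={}"


def build_start_urls(start=10, stop=100, step=10):
    """Build one listing-page URL per offset in range(start, stop, step)."""
    return [BASE_URL.format(offset) for offset in range(start, stop, step)]


start_urls = build_start_urls()
for url in start_urls:
    print(url)
```

Raising `stop` would extend the crawl beyond the first pages, assuming the site keeps the same paging scheme for higher offsets.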