Scrapy doesn't crawl all the content

Problem description · Votes: 0 · Answers: 1

I want to crawl a website whose URLs look like this:

  • www.test.com/category1/123456.html (an article page)
  • www.test.com/category1/123457.html ...
  • www.test.com/category2
  • www.test.com/category3 ...

Here is the code:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from scrapy.selector import Selector


    class ExampleSpider(CrawlSpider):
        name = "test"  # Spider name
        allowed_domains = ["test.com"]  # Which (sub-)domains shall be scraped?
        start_urls = ["https://test.com/"]  # Start with this one
        user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
        # Follow any link scrapy finds (that is allowed) and looks like an article page.
        rules = [Rule(LinkExtractor(allow=(r'/[a-z-]+/[0-9]+\.html$',)),
                      callback='parse_item', follow=True)]

        def parse_item(self, response):
            print('Got a response from %s.' % response.url)

            selector = Selector(response)

            # First <title> text and the concatenated article paragraphs
            title = selector.xpath('//title/text()').extract()[0]
            post = ''
            for line in selector.xpath('//div[@id="article_body"]/p/text()').extract():
                post += line

            print('TITLE: %s \n' % title)
            print('CONTENT: %s \n' % post)

Results:

    2017-11-22 12:19:19 [scrapy.core.engine] INFO: Closing spider (finished)
    2017-11-22 12:19:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 132266,
     'downloader/request_count': 315,
     'downloader/request_method_count/GET': 315,
     'downloader/response_bytes': 9204814,
     'downloader/response_count': 315,
     'downloader/response_status_count/200': 313,
     'downloader/response_status_count/301': 2,
     'dupefilter/filtered': 21126,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2017, 11, 22, 12, 19, 19, 295516),
     'log_count/DEBUG': 318,
     'log_count/INFO': 11,
     'offsite/domains': 1,
     'offsite/filtered': 312,
     'request_depth_max': 4,
     'response_received_count': 313,
     'scheduler/dequeued': 315,
     'scheduler/dequeued/memory': 315,
     'scheduler/enqueued': 315,
     'scheduler/enqueued/memory': 315,
     'start_time': datetime.datetime(2017, 11, 22, 12, 14, 41, 591030)}
    2017-11-22 12:19:19 [scrapy.core.engine] INFO: Spider closed (finished)

The crawler stops after a minute or so and only returns the most recent content! Any solution?

python scrapy web-crawler scrapy-spider
1 Answer · Votes: 0

Scrapy has treated some of your requests as duplicate links, so they were dropped: 'dupefilter/filtered': 21126.

You can add the following line to the 'settings.py' file in your Scrapy project folder:

DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'

This disables duplicate filtering completely (for the whole project), but then you have to detect and filter duplicate requests yourself.
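If you do disable it, a minimal sketch of doing the de-duplication yourself inside the spider might look like this (the 'seen_urls' set and the early return are illustrative additions, not part of the original spider):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class ExampleSpider(CrawlSpider):
        name = "test"
        allowed_domains = ["test.com"]
        start_urls = ["https://test.com/"]
        rules = [Rule(LinkExtractor(allow=(r'/[a-z-]+/[0-9]+\.html$',)),
                      callback='parse_item', follow=True)]

        # Hypothetical manual de-duplication: remember every URL that has
        # already been parsed and skip it the next time it comes around.
        seen_urls = set()

        def parse_item(self, response):
            if response.url in self.seen_urls:
                return  # this page was already processed, skip it
            self.seen_urls.add(response.url)
            print('Got a response from %s.' % response.url)
            # ... extract title and body as before ...

Note that this only avoids parsing the same page twice; with the dupefilter disabled, Scrapy will still download the duplicate requests.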
