How to use cloudscraper with Scrapy


I am trying to scrape data from a site using Scrapy, but the site is protected by Cloudflare. I found a solution using cloudscraper, and cloudscraper does indeed bypass the protection. But I don't understand how to use it together with Scrapy.

I tried writing something like this:

import scrapy
from scrapy.xlib.pydispatch import dispatcher
import cloudscraper
import requests
from scrapy.http import Request, FormRequest

class PycoderSpider(scrapy.Spider):
    name = 'armata_exper'
    start_urls = ['https://arma-models.ru/catalog/sbornye_modeli/?limit=48']

    def start_requests(self):
        url = "https://arma-models.ru/catalog/sbornye_modeli/?limit=48"
        scraper = cloudscraper.CloudScraper()
        cookie_value, user_agent = scraper.get_tokens(url)
        yield scrapy.Request(url, cookies=cookie_value, headers={'User-Agent': user_agent})

    def parse(self, response):
        ....

I get the following error:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/scrapy/utils/signal.py", line 30, in send_catch_log
    *arguments, **named)
  File "/usr/lib/python3.6/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/usr/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 343, in request_scheduled
    redirected_urls = request.meta.get('redirect_urls', [])
AttributeError: 'Response' object has no attribute 'meta'
Unhandled Error

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/scrapy/commands/crawl.py", line 58, in run
    self.crawler_process.start()
  File "/usr/lib/python3.6/site-packages/scrapy/crawler.py", line 309, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/lib64/python3.6/site-packages/twisted/internet/base.py", line 1283, in run
    self.mainLoop()
  File "/usr/lib64/python3.6/site-packages/twisted/internet/base.py", line 1292, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "/usr/lib64/python3.6/site-packages/twisted/internet/base.py", line 913, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/lib/python3.6/site-packages/scrapy/utils/reactor.py", line 41, in __call__
    return self._func(*self._a, **self._kw)
  File "/usr/lib/python3.6/site-packages/scrapy/core/engine.py", line 135, in _next_request
    self.crawl(request, spider)
  File "/usr/lib/python3.6/site-packages/scrapy/core/engine.py", line 210, in crawl
    self.schedule(request, spider)
  File "/usr/lib/python3.6/site-packages/scrapy/core/engine.py", line 216, in schedule
    if not self.slot.scheduler.enqueue_request(request):
  File "/usr/lib/python3.6/site-packages/scrapy/core/scheduler.py", line 91, in enqueue_request
    if not request.dont_filter and self.df.request_seen(request):
builtins.AttributeError: 'Response' object has no attribute 'dont_filter'

Please tell me the correct way to do this.

scrapy scrapy-splash
1 Answer

I successfully integrated Scrapy and cloudscraper by using a Scrapy downloader middleware.

This is the middleware I came up with:

import cloudscraper
from scrapy.http import HtmlResponse

class CustomCloudflareMiddleware(object):

    # One shared cloudscraper session for the whole crawl
    cloudflare_scraper = cloudscraper.create_scraper()

    def process_response(self, request, response, spider):
        # Pass normal responses through untouched
        if response.status not in (403, 503):
            return response

        spider.logger.info("Cloudflare detected. Using cloudscraper on URL: %s", request.url)
        # Re-issue the request through cloudscraper, then wrap the
        # requests-style response in a Scrapy HtmlResponse
        cflare_response = self.cloudflare_scraper.get(request.url)
        return HtmlResponse(url=request.url, body=cflare_response.text, encoding='utf-8')

I am using the `process_response` middleware method. If I detect that the response is a 403 or 503, I perform the same request with `cloudscraper`. Since `requests` responses differ from Scrapy's, we need to convert them into a Scrapy response. Otherwise, I just continue with the normal pipeline. (For simplicity, you could also drop the `if` to always go through cloudscraper, or define finer-grained conditions for when to use it.)

Finally, you have to configure the middleware in your spider. I like to do this by defining the `custom_settings` class variable:

class MyCrawler(CrawlSpider):
    name = 'mycrawlername'

    custom_settings = {
        'USER_AGENT': '...',
        'CLOSESPIDER_PAGECOUNT': 20,
        'DOWNLOADER_MIDDLEWARES': {
            'middlewares.CustomCloudflareMiddleware.CustomCloudflareMiddleware': 543,
        },
    }

    # my rules...
    # my parsing functions...

(The exact path to the middleware depends on your project structure.)
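If you'd rather enable the middleware for the whole project instead of a single spider, the same entry can go into `settings.py` (the dotted path shown is illustrative; use your project's actual module path):

```python
# settings.py -- project-wide alternative to custom_settings
DOWNLOADER_MIDDLEWARES = {
    "middlewares.CustomCloudflareMiddleware.CustomCloudflareMiddleware": 543,
}
```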

You can find my full example here.
