I'm trying to scrape data from a site with Scrapy, but the site is protected by Cloudflare. I found a solution, cloudscraper, and it really does get past the protection, but I don't understand how to use it together with Scrapy.
I tried writing something like this:
import scrapy
from scrapy.xlib.pydispatch import dispatcher
import cloudscraper
import requests
from scrapy.http import Request, FormRequest


class PycoderSpider(scrapy.Spider):
    name = 'armata_exper'
    start_urls = ['https://arma-models.ru/catalog/sbornye_modeli/?limit=48']

    def start_requests(self):
        url = "https://arma-models.ru/catalog/sbornye_modeli/?limit=48"
        scraper = cloudscraper.CloudScraper()
        cookie_value, user_agent = scraper.get_tokens(url)
        yield scrapy.Request(url, cookies=cookie_value, headers={'User-Agent': user_agent})

    def parse(self, response):
        ....
But I get this error:
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/scrapy/utils/signal.py", line 30, in send_catch_log
    *arguments, **named)
  File "/usr/lib/python3.6/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/usr/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 343, in request_scheduled
    redirected_urls = request.meta.get('redirect_urls', [])
AttributeError: 'Response' object has no attribute 'meta'
Unhandled Error
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/scrapy/commands/crawl.py", line 58, in run
    self.crawler_process.start()
  File "/usr/lib/python3.6/site-packages/scrapy/crawler.py", line 309, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/lib64/python3.6/site-packages/twisted/internet/base.py", line 1283, in run
    self.mainLoop()
  File "/usr/lib64/python3.6/site-packages/twisted/internet/base.py", line 1292, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "/usr/lib64/python3.6/site-packages/twisted/internet/base.py", line 913, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/lib/python3.6/site-packages/scrapy/utils/reactor.py", line 41, in __call__
    return self._func(*self._a, **self._kw)
  File "/usr/lib/python3.6/site-packages/scrapy/core/engine.py", line 135, in _next_request
    self.crawl(request, spider)
  File "/usr/lib/python3.6/site-packages/scrapy/core/engine.py", line 210, in crawl
    self.schedule(request, spider)
  File "/usr/lib/python3.6/site-packages/scrapy/core/engine.py", line 216, in schedule
    if not self.slot.scheduler.enqueue_request(request):
  File "/usr/lib/python3.6/site-packages/scrapy/core/scheduler.py", line 91, in enqueue_request
    if not request.dont_filter and self.df.request_seen(request):
builtins.AttributeError: 'Response' object has no attribute 'dont_filter'
Please tell me the right way to do this.
I managed to integrate Scrapy and cloudscraper by using a Scrapy downloader middleware.
This is the middleware I came up with:
import cloudscraper
import logging
from scrapy.http import HtmlResponse


class CustomCloudflareMiddleware(object):
    cloudflare_scraper = cloudscraper.create_scraper()

    def process_response(self, request, response, spider):
        request_url = request.url
        response_status = response.status
        # Anything other than a Cloudflare challenge passes through untouched.
        if response_status not in (403, 503):
            return response
        spider.logger.info("Cloudflare detected. Using cloudscraper on URL: %s", request_url)
        # Repeat the request with cloudscraper and wrap the result in a
        # Scrapy response so the rest of the pipeline can consume it.
        cflare_response = self.cloudflare_scraper.get(request_url)
        cflare_res_transformed = HtmlResponse(url=request_url, body=cflare_response.text, encoding='utf-8')
        return cflare_res_transformed
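One refinement I'd consider (my own addition, not part of the original answer) is forwarding the status code from the cloudscraper response, so that a genuine error behind Cloudflare isn't silently rewritten as a 200:

# Hypothetical variant of the conversion above, preserving the status code.
cflare_res_transformed = HtmlResponse(
    url=request_url,
    status=cflare_response.status_code,
    body=cflare_response.text,
    encoding='utf-8',
)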
I'm using the process_response middleware method. If I detect that the response is a 403 or 503, I re-issue the same request with cloudscraper. Since requests responses differ from Scrapy's, they need to be converted into Scrapy responses; otherwise I just continue down the normal pipeline. (For simplicity you could also drop the if and always go through cloudscraper, or define more fine-grained conditions for when to use it or not.)
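If every page of the site sits behind the same challenge, another possible refinement (an untested sketch of my own, not from the original answer) is to reuse the session cookies cloudscraper obtained on later outgoing requests, so pages after the first may get through without another cloudscraper round-trip:

# Inside CustomCloudflareMiddleware (hypothetical addition).
def process_request(self, request, spider):
    # cloudscraper subclasses requests.Session, so any clearance cookies it
    # solved for are sitting in its cookie jar; copy them onto the request.
    tokens = self.cloudflare_scraper.cookies.get_dict()
    if tokens and isinstance(request.cookies, dict):
        request.cookies.update(tokens)
    return None  # continue normal downloading

Registered at priority 543, this runs before Scrapy's built-in CookiesMiddleware (priority 700), which then merges the cookies into the actual request.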
Finally, you have to configure the middleware in your spider. I like to do this by defining the custom_settings class variable:
class MyCrawler(CrawlSpider):
    name = 'mycrawlername'

    custom_settings = {
        'USER_AGENT': '...',
        'CLOSESPIDER_PAGECOUNT': 20,
        'DOWNLOADER_MIDDLEWARES': {
            'middlewares.CustomCloudflareMiddleware.CustomCloudflareMiddleware': 543,
        }
    }

    # my rules...
    # my parsing functions...
(The exact path to the middleware depends on your project structure.)
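If you prefer a project-wide setup, the same registration can live in settings.py instead (a sketch assuming the middleware sits in a top-level middlewares/CustomCloudflareMiddleware.py module):

# settings.py -- equivalent to the custom_settings entry above
DOWNLOADER_MIDDLEWARES = {
    'middlewares.CustomCloudflareMiddleware.CustomCloudflareMiddleware': 543,
}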
You can find my full example here.