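Is it possible to delay the retry of a particular Scrapy request? I have a middleware that needs to defer a page's request until a later time. I know how to do a basic deferral (end of the queue) and how to delay all requests (global setting), but I want to delay only this one request. This matters most near the end of the queue, where a simple deferral makes it the next request again immediately.
One way is to add a middleware to your Spider (source, linked):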
# File: middlewares.py
from twisted.internet import reactor
from twisted.internet.defer import Deferred


class DelayedRequestsMiddleware(object):
    def process_request(self, request, spider):
        # Seconds to hold this request back; a missing or zero value means no delay
        delay_s = request.meta.get('delay_request_by', None)
        if not delay_s:
            return
        # Fire the Deferred after delay_s seconds; the request is held until then
        deferred = Deferred()
        reactor.callLater(delay_s, deferred.callback, None)
        return deferred
You can later use it in your Spider, like so:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {'middlewares.DelayedRequestsMiddleware': 123},
    }

    def start_requests(self):
        # This request will have itself delayed by 5 seconds
        yield scrapy.Request(url='http://quotes.toscrape.com/page/1/',
                             meta={'delay_request_by': 5})
        # This request will not be delayed
        yield scrapy.Request(url='http://quotes.toscrape.com/page/2/')

    def parse(self, response):
        ...  # Process results here
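The timer-plus-Deferred pattern used by the middleware above can be tried outside Scrapy. The sketch below is only an analogue built on Python's standard asyncio, not on Twisted: `wait_with_timer` and the 0.2-second delay are illustrative assumptions, not Scrapy or Twisted API. One task waits on a future that a timer resolves later, while a second task runs right away, which is the "delay only this one" behaviour the question asks for.

```python
import asyncio
import time


async def wait_with_timer(delay_s):
    """Wait on a future that a timer resolves after delay_s seconds."""
    loop = asyncio.get_running_loop()
    future = loop.create_future()
    # Comparable in spirit to reactor.callLater(delay_s, deferred.callback, None)
    loop.call_later(delay_s, future.set_result, None)
    await future


async def main():
    order = []
    start = time.monotonic()

    async def delayed():
        await wait_with_timer(0.2)
        order.append('delayed')

    async def immediate():
        order.append('immediate')

    # Both tasks start together; only the first one is held back
    await asyncio.gather(delayed(), immediate())
    return order, time.monotonic() - start


order, elapsed = asyncio.run(main())
print(order, round(elapsed, 2))
```

The undelayed task finishes first even though it was started second, because waiting on the future does not block the event loop.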
You can do this with a custom retry middleware (source); you only need to override the process_response method of the current retry middleware:
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class CustomRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            # Your delay code here, for example sleep(10) or polling server until it is alive
            return self._retry(request, reason, spider) or response
        return response
Then enable it in settings.py in place of the default RetryMiddleware:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'myproject.middlewarefilepath.CustomRetryMiddleware': 550,
}
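For the "Your delay code here" spot, one option is to compute a delay that grows with each retry. The helper below is a minimal sketch under stated assumptions: the name `retry_delay_seconds` and the `base` and `cap` values are hypothetical, and the retry count is assumed to come from `request.meta.get('retry_times', 0)`.

```python
def retry_delay_seconds(retry_times, base=2.0, cap=60.0):
    """Exponential backoff: base * 2**retry_times seconds, limited to cap.

    retry_times is the number of retries already made (0 for the first retry).
    All names and defaults here are illustrative, not part of Scrapy.
    """
    if retry_times < 0:
        raise ValueError("retry_times must be non-negative")
    return min(cap, base * (2 ** retry_times))


# Delays for the first few retries, then the capped value
delays = [retry_delay_seconds(n) for n in (0, 1, 2, 3, 10)]
print(delays)
```

The resulting number could then be attached to the retried request, for example under the `delay_request_by` meta key read by the delay middleware from the first answer, rather than passed to a blocking sleep.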
I have tried all the answers, but none of them worked.
I have come up with one possible approach:
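from scrapy.exceptions import IgnoreRequest
from twisted.internet import reactor


class DelayedRequestsMiddleware(object):
    def process_request(self, request, spider):
        # pop() clears the key so the re-scheduled request is not delayed again
        delay = request.meta.pop('delay_request', None)
        if delay:
            # Hand the request back to the engine after `delay` seconds.
            # Note: the duplicate filter may drop it unless dont_filter=True is set.
            reactor.callLater(delay, spider.crawler.engine.crawl, request=request)
            # ignore current request
            raise IgnoreRequest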
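The control flow of this approach can be checked without Scrapy. The sketch below uses stand-ins throughout: `FakeRequest`, the local `IgnoreRequest` class, and the `schedule_later` callback are all hypothetical substitutes for the Scrapy request, exception, and `reactor.callLater`. It shows why the meta key has to be removed before re-scheduling: on the second pass the request goes through instead of being delayed again.

```python
class IgnoreRequest(Exception):
    """Stand-in for scrapy.exceptions.IgnoreRequest."""


class FakeRequest:
    """Minimal stand-in for a Scrapy request: a URL plus a meta dict."""

    def __init__(self, url, meta=None):
        self.url = url
        self.meta = dict(meta or {})


def process_request(request, schedule_later):
    # Remove the key first so the re-scheduled request is not delayed forever
    delay = request.meta.pop('delay_request', None)
    if delay:
        schedule_later(delay, request)  # stands in for reactor.callLater(...)
        raise IgnoreRequest
    return None  # None means "continue processing this request"


scheduled = []  # records (delay, request) pairs instead of using a real timer
req = FakeRequest('http://quotes.toscrape.com/page/1/', meta={'delay_request': 5})

try:
    process_request(req, lambda d, r: scheduled.append((d, r)))
    first_pass = 'continued'
except IgnoreRequest:
    first_pass = 'ignored'

# Simulate the timer firing: the same request passes through the middleware again
delay, again = scheduled[0]
second_pass = process_request(again, lambda d, r: scheduled.append((d, r)))
print(first_pass, delay, second_pass)
```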
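The sleep() method suspends execution for the given number of seconds. The argument may be a floating-point number to indicate a more precise sleep time.
So you have to import the time module in your spider.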
import time
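Then you can add the sleep method wherever a delay is needed. Note that time.sleep() blocks the whole crawler while it waits, so it pauses every request rather than a single one.
time.sleep(5)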