I have a Scrapy spider that I run in a non-standard way, because sometimes I want to run it again after it finishes. Here is my code:
import logging
import os

import scrapy
import scrapy.crawler
import scrapy.linkextractors
import scrapy.signals
import scrapy.spiders
import scrapy.utils.log
import scrapy.utils.project
import twisted.internet.reactor


class LinkExtractorSpider(scrapy.spiders.CrawlSpider):
    name = 'LinkExtractor'
    rules = [
        scrapy.spiders.Rule(
            scrapy.linkextractors.LinkExtractor(),
            callback='produce_url',
            follow=True,
        )
    ]

    def start_requests(self):
        yield scrapy.Request(url=self.settings['START_URL'])

    def produce_url(self, response):
        yield {'url': response.url}


class Pipeline:
    def process_item(self, item, spider):
        logging.info(f'Check: {item}')
        return item

    def spider_closed(self, spider, reason):
        logging.info(f'Spider closed because: {reason}')
        logging.info('Cleanup code would go here')

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = Pipeline()
        crawler.signals.connect(pipeline.spider_closed, signal=scrapy.signals.spider_closed)
        return pipeline


class Crawler:
    def __init__(self):
        os.environ['SCRAPY_SETTINGS_MODULE'] = 'settings'
        scrapy.utils.log.configure_logging()
        self.settings = scrapy.utils.project.get_project_settings()
        self.runner = scrapy.crawler.CrawlerRunner(self.settings)

    def should_run_again(self):
        return False

    def callback(self, results):
        logging.info("Maybe I want to run again, maybe not")
        if self.should_run_again():
            self.runner.crawl(LinkExtractorSpider).addCallback(self.callback)
        else:
            twisted.internet.reactor.stop()

    def run(self, start_url):
        if not start_url.startswith('http'):
            start_url = 'https://' + start_url
        self.settings.set('START_URL', start_url)
        self.runner.crawl(LinkExtractorSpider).addCallback(self.callback)
        twisted.internet.reactor.run()


if __name__ == '__main__':
    Crawler().run('some_website.com')
This works fine: I see the log messages from both the callback and the pipeline's spider_closed method. However, when I interrupt the script with Ctrl-C, it exits cleanly, but none of those messages appear.
Here is my settings.py:
import logging
ITEM_PIPELINES = {
    "spider.Pipeline": 100,
}
LOG_LEVEL = logging.INFO
LOG_FORMAT = "%(name)s: [%(levelname)s] %(message)s"
LOG_STDOUT = True
ROBOTSTXT_OBEY = False
DOWNLOAD_TIMEOUT = 10
RETRY_ENABLED = False
DEPTH_LIMIT = 2
Here are the logs:
scrapy.addons: [INFO] Enabled addons:
[]
py.warnings: [WARNING] /Users/daniel/environments/poisonedparadox/lib/python3.10/site-packages/scrapy/utils/request.py:254: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.
It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.
See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
return cls(crawler)
scrapy.extensions.telnet: [INFO] Telnet Password: aaaaaaaaa
scrapy.middleware: [INFO] Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
scrapy.crawler: [INFO] Overridden settings:
{'DEPTH_LIMIT': 2,
'DOWNLOAD_TIMEOUT': 10,
'LOG_FORMAT': '%(name)s: [%(levelname)s] %(message)s',
'LOG_LEVEL': 20,
'LOG_STDOUT': True,
'RETRY_ENABLED': False}
scrapy.middleware: [INFO] Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
scrapy.middleware: [INFO] Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
scrapy.middleware: [INFO] Enabled item pipelines:
['spider.Pipeline']
scrapy.core.engine: [INFO] Spider opened
scrapy.extensions.logstats: [INFO] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
scrapy.extensions.telnet: [INFO] Telnet console listening on 127.0.0.1:6023
root: [INFO] Check: {'url': 'https://www.google.com/intl/en/policies/privacy/'}
root: [INFO] Check: {'url': 'https://www.google.com/imghp?hl=en&tab=wi'}
root: [INFO] Check: {'url': 'https://www.google.com/intl/en/policies/terms/'}
root: [INFO] Check: {'url': 'https://www.google.com/preferences?hl=en'}
root: [INFO] Check: {'url': 'https://www.google.com/advanced_search?hl=en&authuser=0'}
root: [INFO] Check: {'url': 'https://smallbusiness.withgoogle.com/?subid=us-en-et-g-awa-a-g_hpbfoot1_1!o2&utm_source=google&utm_medium=ep&utm_campaign=google_hpbfooter&utm_content=google_hpbfooter&gmbsrc=us-en_US-et-gs-z-gmb-s-z-u~sb-g4sb_srvcs-u'}
root: [INFO] Check: {'url': 'https://www.google.com/maps?hl=en&tab=wl'}
root: [INFO] Check: {'url': 'https://ads.google.com/intl/en/home/'}
root: [INFO] Check: {'url': 'https://about.google/'}
root: [INFO] Check: {'url': 'https://www.google.com/history/optout?hl=en'}
root: [INFO] Check: {'url': 'https://news.google.com/home?tab=wn&hl=en-US&gl=US&ceid=US:en'}
root: [INFO] Check: {'url': 'https://about.google/products/?tab=wh'}
root: [INFO] Check: {'url': 'https://www.youtube.com/?tab=w1'}
root: [INFO] Check: {'url': 'https://accounts.google.com/v3/signin/identifier?continue=https%3A%2F%2Fwww.google.com%2F&ec=GAZAAQ&hl=en&ifkv=AcMMx-d2GgD-YTcHQJRQ98PqHE-zo-q3UYpsuA9APS1CY6G3mBsmxU_kgZuengZK0fVjecGRTCpIug&passive=true&flowName=WebLiteSignIn&flowEntry=ServiceLogin&dsh=S-1389799524%3A1730486180270110'}
root: [INFO] Check: {'url': 'https://www.google.com/advanced_image_search?hl=en&authuser=0'}
root: [INFO] Check: {'url': 'https://www.google.com/search?q=how+to+vote+in+the+us&hl=en&gl=us&stick=H4sIAAAAAAAAAHvE6Mgt8PLHPWEpi0lrTl5jNOISdM1JTS7JzM_zyC8PyQ_LL0kVkuXigAkKCUrxc_Hqp-sbGuaWFeRmZVTl8CxiFc3IL1coyVcoA6pWyMxTKMlIVSgtBgCjAj5eXAAAAA&utm_source=discover&utm_medium=promo&utm_campaign=13'}
root: [INFO] Check: {'url': 'https://policies.google.com/privacy?hl=en&gl=us'}
root: [INFO] Check: {'url': 'https://policies.google.com/terms'}
^Cscrapy.core.downloader.handlers.http11: [WARNING] Got data loss in https://play.google.com/store/games?hl=en&tab=w8. If you want to process broken responses set the setting DOWNLOAD_FAIL_ON_DATALOSS = False -- This message won't be shown in further requests
scrapy.core.downloader.handlers.http11: [WARNING] Got data loss in https://support.google.com/. If you want to process broken responses set the setting DOWNLOAD_FAIL_ON_DATALOSS = False -- This message won't be shown in further requests
Those last few messages about data loss appear when I press Ctrl-C; the script then exits cleanly, but none of my own messages (from the callback or from spider_closed) ever appear.
Ctrl-C handling is implemented in
scrapy.utils.ossignal.install_shutdown_handlers
(source code).
That function is called by scrapy.crawler.CrawlerProcess._signal_shutdown
(source code), and the scrapy.crawler.CrawlerProcess
class is a subclass of scrapy.crawler.CrawlerRunner.
Unfortunately, your code uses the
scrapy.crawler.CrawlerRunner
class directly, so no Ctrl-C handler is installed, and the application does not react to Ctrl-C the way you expected.
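If you want to keep CrawlerRunner (for example, to chain runs the way your callback does), one option is to register a SIGINT handler yourself, mirroring what CrawlerProcess does: the first Ctrl-C requests a graceful stop (so spider_closed fires), and a second Ctrl-C forces an immediate stop. The sketch below is Scrapy-free and only demonstrates the escalating-handler pattern; the two callbacks are stand-ins. In your Crawler you would pass something like `lambda: self.runner.stop().addBoth(lambda _: twisted.internet.reactor.stop())` as the graceful callback and `twisted.internet.reactor.stop` as the forced one (hypothetical wiring, not Scrapy's exact code; in a real Twisted program, route the calls through `reactor.callFromThread` for safety).

```python
import signal

class GracefulShutdown:
    """Escalating SIGINT handler, modeled on the behavior of Scrapy's
    install_shutdown_handlers: the first Ctrl-C requests a graceful
    stop, a second one forces an immediate stop."""

    def __init__(self, graceful_stop, force_stop):
        self.graceful_stop = graceful_stop  # e.g. stop crawlers, then the reactor
        self.force_stop = force_stop        # e.g. stop the reactor outright
        self.interrupted = False
        signal.signal(signal.SIGINT, self._handle)

    def _handle(self, signum, frame):
        if not self.interrupted:
            # First Ctrl-C: shut down cleanly so closing hooks still run.
            self.interrupted = True
            self.graceful_stop()
        else:
            # Second Ctrl-C: the user is impatient, stop immediately.
            self.force_stop()

calls = []
GracefulShutdown(lambda: calls.append('graceful'), lambda: calls.append('force'))
signal.raise_signal(signal.SIGINT)  # simulate the first Ctrl-C
signal.raise_signal(signal.SIGINT)  # simulate a second Ctrl-C
print(calls)  # ['graceful', 'force']
```

Alternatively, the simplest change is to swap CrawlerRunner for CrawlerProcess, which installs these shutdown handlers for you and replaces the manual reactor.run()/reactor.stop() calls with process.start().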