spider_closed signal is not handled on interrupt

I have a Scrapy spider that I run in a somewhat unusual way, because sometimes I want to run the spider again after it has finished. Here is my code:

import logging
import os

import scrapy
import scrapy.crawler
import scrapy.linkextractors
import scrapy.signals
import scrapy.spiders
import scrapy.utils.log
import scrapy.utils.project
import twisted.internet.reactor


class LinkExtractorSpider(scrapy.spiders.CrawlSpider):
    name = 'LinkExtractor'

    rules = [
        scrapy.spiders.Rule(
            scrapy.linkextractors.LinkExtractor(),
            callback='produce_url',
            follow=True,
        )
    ]

    def start_requests(self):
        yield scrapy.Request(url=self.settings['START_URL'])

    def produce_url(self, response):
        yield {'url': response.url}


class Pipeline:
    def process_item(self, item, spider):
        logging.info(f'Check: {item}')
        # Return the item so any later pipelines receive it.
        return item

    def spider_closed(self, spider, reason):
        logging.info(f'Spider closed because: {reason}')
        logging.info('Cleanup code would go here')

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = Pipeline()
        crawler.signals.connect(pipeline.spider_closed, signal=scrapy.signals.spider_closed)
        return pipeline


class Crawler:
    def __init__(self):
        os.environ['SCRAPY_SETTINGS_MODULE'] = 'settings'
        scrapy.utils.log.configure_logging()
        self.settings = scrapy.utils.project.get_project_settings()
        self.runner = scrapy.crawler.CrawlerRunner(self.settings)

    def should_run_again(self):
        return False

    def callback(self, results):
        logging.info("Maybe I want to run again, maybe not")
        if self.should_run_again():
            self.runner.crawl(LinkExtractorSpider).addCallback(self.callback)
        else:
            twisted.internet.reactor.stop()

    def run(self, start_url):
        if not start_url.startswith('http'):
            start_url = 'https://' + start_url
        self.settings.set('START_URL', start_url)
        self.runner.crawl(LinkExtractorSpider).addCallback(self.callback)
        twisted.internet.reactor.run()

if __name__ == '__main__':
    Crawler().run('some_website.com')

This works fine, and I see the log messages from both the callback and the pipeline's spider_closed method. However, when I interrupt the script with Ctrl-C, it exits cleanly but none of those messages appear.

Here is my settings.py:

import logging

ITEM_PIPELINES = {
    "spider.Pipeline": 100,
}

LOG_LEVEL = logging.INFO
LOG_FORMAT = "%(name)s: [%(levelname)s] %(message)s"
LOG_STDOUT = True

ROBOTSTXT_OBEY = False
DOWNLOAD_TIMEOUT = 10
RETRY_ENABLED = False
DEPTH_LIMIT = 2

Here are the logs:

scrapy.addons: [INFO] Enabled addons:
[]
py.warnings: [WARNING] /Users/daniel/environments/poisonedparadox/lib/python3.10/site-packages/scrapy/utils/request.py:254: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.

It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.

See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

scrapy.extensions.telnet: [INFO] Telnet Password: aaaaaaaaa
scrapy.middleware: [INFO] Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
scrapy.crawler: [INFO] Overridden settings:
{'DEPTH_LIMIT': 2,
 'DOWNLOAD_TIMEOUT': 10,
 'LOG_FORMAT': '%(name)s: [%(levelname)s] %(message)s',
 'LOG_LEVEL': 20,
 'LOG_STDOUT': True,
 'RETRY_ENABLED': False}
scrapy.middleware: [INFO] Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
scrapy.middleware: [INFO] Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
scrapy.middleware: [INFO] Enabled item pipelines:
['spider.Pipeline']
scrapy.core.engine: [INFO] Spider opened
scrapy.extensions.logstats: [INFO] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
scrapy.extensions.telnet: [INFO] Telnet console listening on 127.0.0.1:6023
root: [INFO] Check: {'url': 'https://www.google.com/intl/en/policies/privacy/'}
root: [INFO] Check: {'url': 'https://www.google.com/imghp?hl=en&tab=wi'}
root: [INFO] Check: {'url': 'https://www.google.com/intl/en/policies/terms/'}
root: [INFO] Check: {'url': 'https://www.google.com/preferences?hl=en'}
root: [INFO] Check: {'url': 'https://www.google.com/advanced_search?hl=en&authuser=0'}
root: [INFO] Check: {'url': 'https://smallbusiness.withgoogle.com/?subid=us-en-et-g-awa-a-g_hpbfoot1_1!o2&utm_source=google&utm_medium=ep&utm_campaign=google_hpbfooter&utm_content=google_hpbfooter&gmbsrc=us-en_US-et-gs-z-gmb-s-z-u~sb-g4sb_srvcs-u'}
root: [INFO] Check: {'url': 'https://www.google.com/maps?hl=en&tab=wl'}
root: [INFO] Check: {'url': 'https://ads.google.com/intl/en/home/'}
root: [INFO] Check: {'url': 'https://about.google/'}
root: [INFO] Check: {'url': 'https://www.google.com/history/optout?hl=en'}
root: [INFO] Check: {'url': 'https://news.google.com/home?tab=wn&hl=en-US&gl=US&ceid=US:en'}
root: [INFO] Check: {'url': 'https://about.google/products/?tab=wh'}
root: [INFO] Check: {'url': 'https://www.youtube.com/?tab=w1'}
root: [INFO] Check: {'url': 'https://accounts.google.com/v3/signin/identifier?continue=https%3A%2F%2Fwww.google.com%2F&ec=GAZAAQ&hl=en&ifkv=AcMMx-d2GgD-YTcHQJRQ98PqHE-zo-q3UYpsuA9APS1CY6G3mBsmxU_kgZuengZK0fVjecGRTCpIug&passive=true&flowName=WebLiteSignIn&flowEntry=ServiceLogin&dsh=S-1389799524%3A1730486180270110'}
root: [INFO] Check: {'url': 'https://www.google.com/advanced_image_search?hl=en&authuser=0'}
root: [INFO] Check: {'url': 'https://www.google.com/search?q=how+to+vote+in+the+us&hl=en&gl=us&stick=H4sIAAAAAAAAAHvE6Mgt8PLHPWEpi0lrTl5jNOISdM1JTS7JzM_zyC8PyQ_LL0kVkuXigAkKCUrxc_Hqp-sbGuaWFeRmZVTl8CxiFc3IL1coyVcoA6pWyMxTKMlIVSgtBgCjAj5eXAAAAA&utm_source=discover&utm_medium=promo&utm_campaign=13'}
root: [INFO] Check: {'url': 'https://policies.google.com/privacy?hl=en&gl=us'}
root: [INFO] Check: {'url': 'https://policies.google.com/terms'}
^Cscrapy.core.downloader.handlers.http11: [WARNING] Got data loss in https://play.google.com/store/games?hl=en&tab=w8. If you want to process broken responses set the setting DOWNLOAD_FAIL_ON_DATALOSS = False -- This message won't be shown in further requests
scrapy.core.downloader.handlers.http11: [WARNING] Got data loss in https://support.google.com/. If you want to process broken responses set the setting DOWNLOAD_FAIL_ON_DATALOSS = False -- This message won't be shown in further requests

The last few messages about the data loss only appear when I press Ctrl-C.

python scrapy twisted
1 Answer

"However, when I interrupt the script with Ctrl-C, it exits cleanly but none of those messages appear."

Ctrl-C handling is implemented in scrapy.utils.ossignal.install_shutdown_handlers (source). That function is used to install scrapy.crawler.CrawlerProcess._signal_shutdown as the signal handler (source), and the scrapy.crawler.CrawlerProcess class is a descendant of scrapy.crawler.CrawlerRunner.

Unfortunately, your code uses the scrapy.crawler.CrawlerRunner class directly, so no Ctrl-C handler is installed; instead, the reactor's default SIGINT handler simply stops the event loop, and the application does not react to Ctrl-C the way you expected.
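The most direct fix is therefore to drive the crawl with CrawlerProcess. Here is a minimal sketch (my adaptation of the question's Crawler class, not code from the original answer; LinkExtractorSpider is the spider defined in the question) that keeps the run-again-from-the-callback pattern:

import os

import scrapy.crawler
import scrapy.utils.project
from twisted.internet import reactor


class Crawler:
    def __init__(self):
        os.environ['SCRAPY_SETTINGS_MODULE'] = 'settings'
        self.settings = scrapy.utils.project.get_project_settings()
        # CrawlerProcess subclasses CrawlerRunner; it also configures logging
        # and installs shutdown signal handlers, so Ctrl-C closes spiders
        # gracefully and spider_closed is delivered.
        self.process = scrapy.crawler.CrawlerProcess(self.settings)

    def should_run_again(self):
        return False

    def callback(self, results):
        if self.should_run_again():
            self.process.crawl(LinkExtractorSpider).addCallback(self.callback)
        else:
            reactor.stop()

    def run(self, start_url):
        if not start_url.startswith('http'):
            start_url = 'https://' + start_url
        self.settings.set('START_URL', start_url)
        self.process.crawl(LinkExtractorSpider).addCallback(self.callback)
        # stop_after_crawl=False keeps the reactor alive so that callback()
        # can schedule another crawl; we stop the reactor ourselves instead.
        self.process.start(stop_after_crawl=False)

stop_after_crawl=False mirrors the original control flow: the reactor keeps running until callback() decides there is nothing left to crawl and stops it.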

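If you would rather keep CrawlerRunner and manage the reactor yourself, you can install an equivalent handler by hand. The sketch below is my suggestion rather than part of the original answer (signal.signal, CrawlerRunner.stop() and reactor.callFromThread are real APIs; the wiring is the assumption). Only run() changes:

import signal

from twisted.internet import reactor


class Crawler:
    # __init__, should_run_again and callback unchanged from the question

    def run(self, start_url):
        if not start_url.startswith('http'):
            start_url = 'https://' + start_url
        self.settings.set('START_URL', start_url)

        def graceful_stop():
            # CrawlerRunner.stop() returns a Deferred that fires once every
            # running spider has been closed (reason 'shutdown'), so the
            # spider_closed handlers run before the reactor stops.
            self.runner.stop().addBoth(lambda _: reactor.stop())

        # Do no reactor work directly in a signal handler; hand it over to
        # the reactor loop, much like Scrapy's own _signal_shutdown does.
        signal.signal(signal.SIGINT,
                      lambda signum, frame: reactor.callFromThread(graceful_stop))

        self.runner.crawl(LinkExtractorSpider).addCallback(self.callback)
        # installSignalHandlers=False keeps our SIGINT handler; by default
        # reactor.run() replaces it with Twisted's own, which stops the
        # reactor immediately without closing spiders -- the clean-but-silent
        # exit observed in the question.
        reactor.run(installSignalHandlers=False)

With either approach, Ctrl-C should now close the spider gracefully, and the pipeline's "Spider closed because: shutdown" message should appear.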