Add the spider's name to every line of the log

Problem description · votes: 0 · answers: 3

I'm looking for a way to prefix every log line generated by Scrapy with the name of the spider that produced it. Until now I was starting each spider synchronously in a loop, so it was easy to track which spider produced which log line. But I recently refactored my code to accept a list of spiders as an argument and start them all at once through CrawlerProcess(). As a result they run asynchronously, so the logs all get mixed together.
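
For context, a minimal sketch of this kind of concurrent launch (spider names are illustrative):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
for spider_name in ['spider_a', 'spider_b']:
    # schedule every spider on the same reactor
    process.crawl(spider_name)
# start() blocks until all crawls finish; the spiders run
# concurrently, so their log lines interleave
process.start()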

I thought about adding something like [%(name)s] to the LOG_FORMAT setting, but the name that comes out is that of the module emitting the log (scrapy.core.engine, scrapy.utils.log, etc.), not the spider's name.
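
For reference, %(name)s in a LOG_FORMAT string is the standard logging logger name, which is why it shows the emitting module; Scrapy's default is:

# Scrapy's default LOG_FORMAT; %(name)s resolves to the logger (module)
# name such as scrapy.core.engine, never to the spider's name
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'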

I also tried creating an extension that would modify the crawler's settings by retrieving spider.name and inserting it into the LOG_FORMAT constant, but as far as I can tell, changing the settings while the crawler is running has no effect (and I haven't found a clean way to do it anyway, since settings are immutable at that point).

Any help would be greatly appreciated! Thank you.

  • I tried setting a custom LOG_FORMAT, but there doesn't seem to be any way to access the spider's name from it;
  • I tried using an extension to capture the crawler's settings and modify them, but they are immutable and are only evaluated when the process starts.
python web-scraping logging scrapy web-crawler
3 Answers
3 votes

You need to create a custom log formatter and set it as the project's log formatter.

Basically, you extend Scrapy's LogFormatter class and build the messages with the new format.

main2.py:

from scrapy import logformatter
import logging
import os
from twisted.python.failure import Failure
from scrapy.utils.request import referer_str

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


SCRAPEDMSG = "Scraped from %(src)s" + os.linesep + "%(item)s"
# DROPPEDMSG = "Dropped: %(exception)s" + os.linesep + "%(item)s"
CRAWLEDMSG = "Crawled (%(status)s) %(request)s%(request_flags)s (referer: %(referer)s)%(response_flags)s"
# ITEMERRORMSG = "Error processing %(item)s"
# SPIDERERRORMSG = "Spider error processing %(request)s (referer: %(referer)s)"
# DOWNLOADERRORMSG_SHORT = "Error downloading %(request)s"
# DOWNLOADERRORMSG_LONG = "Error downloading %(request)s: %(errmsg)s"


class ExampleLogFormatter(logformatter.LogFormatter):
    def crawled(self, request, response, spider):
        request_flags = f' {str(request.flags)}' if request.flags else ''
        response_flags = f' {str(response.flags)}' if response.flags else ''
        return {
            'level': logging.DEBUG,
            'msg': f'{spider.name} {CRAWLEDMSG}',
            'args': {
                'status': response.status,
                'request': request,
                'request_flags': request_flags,
                'referer': referer_str(request),
                'response_flags': response_flags,
                # backward compatibility with Scrapy logformatter below 1.4 version
                'flags': response_flags
            }
        }

    def scraped(self, item, response, spider):
        if isinstance(response, Failure):
            src = response.getErrorMessage()
        else:
            src = response
        return {
            'level': logging.DEBUG,
            'msg': f'{spider.name} {SCRAPEDMSG}',
            'args': {
                'src': src,
                'item': item,
            }
        }


if __name__ == "__main__":
    spider = 'example_spider'
    settings = get_project_settings()
    settings['USER_AGENT'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    settings['LOG_FORMATTER'] = 'tempbuffer.main2.ExampleLogFormatter'
    process = CrawlerProcess(settings)
    process.crawl(spider)
    process.start()

spider.py:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example_spider'
    allowed_domains = ['scrapingclub.com']
    start_urls = ['https://scrapingclub.com/exercise/detail_basic/']

    def parse(self, response):
        item = dict()
        item['title'] = response.xpath('//h3/text()').get()
        item['price'] = response.xpath('//div[@class="card-body"]/h4/text()').get()
        yield item

Output:

[scrapy.core.engine] DEBUG: example_spider Crawled (200) <GET https://scrapingclub.com/exercise/detail_basic/> (referer: None)
[scrapy.core.scraper] DEBUG: example_spider Scraped from <200 https://scrapingclub.com/exercise/detail_basic/>
{'title': 'Long-sleeved Jersey Top', 'price': '$12.99'}

Update:

A working, but non-global, solution (the format string is built with the spider's name at class-definition time, so it only labels the logs correctly when a single spider class runs per process):

import logging
import scrapy
from scrapy.utils.log import configure_logging


class ExampleSpider(scrapy.Spider):
    name = 'example_spider'
    allowed_domains = ['scrapingclub.com']
    start_urls = ['https://scrapingclub.com/exercise/detail_basic/']

    configure_logging(install_root_handler=False)
    logging.basicConfig(level=logging.DEBUG, format=name + ': %(levelname)s: %(message)s')

    def parse(self, response):
        item = dict()
        item['title'] = response.xpath('//h3/text()').get()
        item['price'] = response.xpath('//div[@class="card-body"]/h4/text()').get()
        yield item

Update 2: finally a fully working solution.

main2.py:

import logging
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


# create a logging filter
class ContentFilter(logging.Filter):
    def filter(self, record):
        record.spider_name = ''
        # enter the spider's name
        if hasattr(record, 'spider'):
            record.spider_name = record.spider.name

        return True

        # record.spider.name was enough for my tests, but maybe you'll need this:
        # record.spider_name = ''
        # if hasattr(record, 'crawler'):
        #     record.spider_name = record.crawler.spidercls.name
        # elif hasattr(record, 'spider'):
        #     record.spider_name = record.spider.name
        # return True


# Extend scrapy.Spider class
class Spider(scrapy.Spider):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # the new format with "spider_name" variable:
        formatter = logging.Formatter('[%(spider_name)s]: %(levelname)s: %(message)s')

        # add the new format and filter to all the handlers
        for handler in logging.root.handlers:
            handler.formatter = formatter
            handler.addFilter(ContentFilter())


if __name__ == "__main__":
    spider1 = 'example_spider'
    spider2 = 'example_spider2'
    settings = get_project_settings()
    settings['USER_AGENT'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'

    process = CrawlerProcess(settings)
    process.crawl(spider1)
    process.crawl(spider2)
    process.start()

spider.py:

from tempbuffer.main2 import Spider


# use the extended "Spider" class
class ExampleSpider(Spider):
    name = 'example_spider'
    allowed_domains = ['scrapingclub.com']
    start_urls = ['https://scrapingclub.com/exercise/detail_basic/']

    def parse(self, response):
        item = dict()
        item['price'] = response.xpath('//div[@class="card-body"]/h4/text()').get()
        yield item


# use the extended "Spider" class
class ExampleSpider2(Spider):
    name = 'example_spider2'
    allowed_domains = ['scrapingclub.com']
    start_urls = ['https://scrapingclub.com/exercise/detail_basic/']

    def parse(self, response):
        item = dict()
        item['title'] = response.xpath('//h3/text()').get()
        yield item
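
One follow-up note on this approach: the ContentFilter relies on the spider attribute that Scrapy itself attaches to its log records. If you log from your own components and want the same prefix, you can pass the spider through extra yourself — a minimal sketch (the pipeline class is hypothetical):

import logging

logger = logging.getLogger(__name__)


class ExamplePipeline:
    """Hypothetical pipeline demonstrating spider-aware logging."""

    def process_item(self, item, spider):
        # passing the spider via "extra" sets record.spider, which
        # ContentFilter then copies into record.spider_name
        logger.info('processing item', extra={'spider': spider})
        return item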

2 votes

Thanks to @SuperUser, I managed to do exactly what I needed without adding code to each spider individually. Everything happens inside an extension, more specifically inside its spider_opened method. Here is the code:

import logging

from scrapy import signals
from scrapy.exceptions import NotConfigured


class CustomLogExtension:

    class ContentFilter(logging.Filter):
        """Filter that copies the spider's name onto each log record."""
        def filter(self, record):
            record.spider_name = ''
            # enter the spider's name
            if hasattr(record, 'spider'):
                record.spider_name = record.spider.name

            return True

    @classmethod
    def from_crawler(cls, crawler):
        # first check if the extension should be enabled and raise NotConfigured otherwise
        if not crawler.settings.getbool('CUSTOM_LOG_EXTENSION'):
            raise NotConfigured

        # instantiate the extension object
        ext = cls()

        # connect the extension object to signals
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)

        # return the extension object
        return ext

    def spider_opened(self, spider):
        """Prefixes the spider's name to every log emitted."""

        formatter = logging.Formatter('[%(spider_name)s] %(asctime)s [%(name)s] %(levelname)s: %(message)s')
        # add the new format and filter to all the handlers
        for handler in logging.root.handlers:
            handler.formatter = formatter
            handler.addFilter(self.ContentFilter())
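
For completeness, the extension still has to be enabled in the project settings — a sketch, assuming the class lives in myproject/extensions.py (module path and priority are illustrative):

# settings.py
EXTENSIONS = {
    'myproject.extensions.CustomLogExtension': 500,
}
# the flag checked in from_crawler(); without it the extension
# raises NotConfigured and stays disabled
CUSTOM_LOG_EXTENSION = True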

0 votes

For those of us who are new to scrapy and don't want to invest much work or customization in defining a custom log extension, this seems achievable with custom stats plus periodic log stats.

In the parse function of each spider class, I set self.crawler.stats.set_value('spider_name', self.name), and then in settings.py I set "PERIODIC_LOG_STATS": {"include": ["spider_name"]} (along with whatever else you want the periodic log stats to output). I also define a separate CrawlerProcess for each spider.

This may be too hacky, but it has always worked for me, and it lets me stay within scrapy's own logging classes and extensions while running multiple spiders through the API. If anyone sees a reason why this is unacceptable, please let me know; as I mentioned, I'm new to scrapy :)
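
Put together, a minimal sketch of the spider side (names and URL are illustrative; PERIODIC_LOG_STATS requires a recent Scrapy version):

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example_spider'
    start_urls = ['https://scrapingclub.com/exercise/detail_basic/']

    # ask the PeriodicLog extension to include the custom stat
    custom_settings = {
        'PERIODIC_LOG_STATS': {'include': ['spider_name']},
    }

    def parse(self, response):
        # store the spider's name as a stat so the periodic log shows it
        self.crawler.stats.set_value('spider_name', self.name)
        yield {'title': response.xpath('//h3/text()').get()}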
