I'm looking for a way to prefix every log generated by Scrapy with the name of the spider that produced it. Until now I started each spider synchronously in a loop, so it was easy to track which spider produced which log. But I recently refactored my code so that it accepts a list of spiders as an argument and launches them all at once through the CrawlerProcess() function. As a result they start asynchronously, and the logs all get mixed together.
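Roughly how I start them now (a minimal sketch; the helper name is made up):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def run_spiders(spider_names):
    process = CrawlerProcess(get_project_settings())
    for name in spider_names:
        process.crawl(name)
    # all crawlers run in the same reactor, so their logs interleave
    process.start()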
I thought about adding something like [%(name)s] to the LOG_FORMAT setting, but the name that gets emitted is the module that issued the log (scrapy.core.engine, scrapy.utils.log, etc.), not the spider's name.
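For reference, this is the kind of thing I mean (a sketch; the format string is just Scrapy's default, shown for illustration):

# settings.py
# %(name)s resolves to the module that emitted the record
# (scrapy.core.engine, scrapy.utils.log, ...), not to the spider's name
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'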
I also tried creating an extension that modifies the crawler's settings by retrieving spider.name and adding it to the LOG_FORMAT constant, but as far as I can tell, changing the settings while the crawler is running has no effect (and I haven't found a clean way to do it anyway, since they are immutable).
Any help would be greatly appreciated! Thank you.
You could try using an extension to capture the crawler's settings and modify them, but they are immutable and are only evaluated when the process starts; what you need instead is a custom log format, set as the project's log formatter.
Basically, you need to extend Scrapy's LogFormatter class and set the messages to the new format.
main2.py:
from scrapy import logformatter
import logging
import os
from twisted.python.failure import Failure
from scrapy.utils.request import referer_str
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


SCRAPEDMSG = "Scraped from %(src)s" + os.linesep + "%(item)s"
# DROPPEDMSG = "Dropped: %(exception)s" + os.linesep + "%(item)s"
CRAWLEDMSG = "Crawled (%(status)s) %(request)s%(request_flags)s (referer: %(referer)s)%(response_flags)s"
# ITEMERRORMSG = "Error processing %(item)s"
# SPIDERERRORMSG = "Spider error processing %(request)s (referer: %(referer)s)"
# DOWNLOADERRORMSG_SHORT = "Error downloading %(request)s"
# DOWNLOADERRORMSG_LONG = "Error downloading %(request)s: %(errmsg)s"


class ExampleLogFormatter(logformatter.LogFormatter):
    def crawled(self, request, response, spider):
        request_flags = f' {str(request.flags)}' if request.flags else ''
        response_flags = f' {str(response.flags)}' if response.flags else ''
        return {
            'level': logging.DEBUG,
            'msg': f'{spider.name} {CRAWLEDMSG}',
            'args': {
                'status': response.status,
                'request': request,
                'request_flags': request_flags,
                'referer': referer_str(request),
                'response_flags': response_flags,
                # backward compatibility with Scrapy logformatter below 1.4 version
                'flags': response_flags
            }
        }

    def scraped(self, item, response, spider):
        if isinstance(response, Failure):
            src = response.getErrorMessage()
        else:
            src = response
        return {
            'level': logging.DEBUG,
            'msg': f'{spider.name} {SCRAPEDMSG}',
            'args': {
                'src': src,
                'item': item,
            }
        }


if __name__ == "__main__":
    spider = 'example_spider'
    settings = get_project_settings()
    settings['USER_AGENT'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    settings['LOG_FORMATTER'] = 'tempbuffer.main2.ExampleLogFormatter'
    process = CrawlerProcess(settings)
    process.crawl(spider)
    process.start()
spider.py:
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example_spider'
    allowed_domains = ['scrapingclub.com']
    start_urls = ['https://scrapingclub.com/exercise/detail_basic/']

    def parse(self, response):
        item = dict()
        item['title'] = response.xpath('//h3/text()').get()
        item['price'] = response.xpath('//div[@class="card-body"]/h4/text()').get()
        yield item
Output:
[scrapy.core.engine] DEBUG: example_spider Crawled (200) <GET https://scrapingclub.com/exercise/detail_basic/> (referer: None)
[scrapy.core.scraper] DEBUG: example_spider Scraped from <200 https://scrapingclub.com/exercise/detail_basic/>
{'title': 'Long-sleeved Jersey Top', 'price': '$12.99'}
Update:
A working (but not global) solution:
import logging
import scrapy
from scrapy.utils.log import configure_logging


class ExampleSpider(scrapy.Spider):
    name = 'example_spider'
    allowed_domains = ['scrapingclub.com']
    start_urls = ['https://scrapingclub.com/exercise/detail_basic/']

    configure_logging(install_root_handler=False)
    logging.basicConfig(level=logging.DEBUG, format=name + ': %(levelname)s: %(message)s')

    def parse(self, response):
        item = dict()
        item['title'] = response.xpath('//h3/text()').get()
        item['price'] = response.xpath('//div[@class="card-body"]/h4/text()').get()
        yield item
Update 2: finally got a working solution.
main2.py:
import logging
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


# create a logging filter
class ContentFilter(logging.Filter):
    def filter(self, record):
        record.spider_name = ''
        # enter the spider's name
        if hasattr(record, 'spider'):
            record.spider_name = record.spider.name
        return True

        # record.spider.name was enough for my tests, but maybe you'll need this:
        # record.spider_name = ''
        # if hasattr(record, 'crawler'):
        #     record.spider_name = record.crawler.spidercls.name
        # elif hasattr(record, 'spider'):
        #     record.spider_name = record.spider.name
        # return True


# Extend the scrapy.Spider class
class Spider(scrapy.Spider):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # the new format with the "spider_name" variable:
        formatter = logging.Formatter('[%(spider_name)s]: %(levelname)s: %(message)s')
        # add the new format and filter to all the handlers
        for handler in logging.root.handlers:
            handler.formatter = formatter
            handler.addFilter(ContentFilter())


if __name__ == "__main__":
    spider1 = 'example_spider'
    spider2 = 'example_spider2'
    settings = get_project_settings()
    settings['USER_AGENT'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    process = CrawlerProcess(settings)
    process.crawl(spider1)
    process.crawl(spider2)
    process.start()
spider.py:
from tempbuffer.main2 import Spider


# use the extended "Spider" class
class ExampleSpider(Spider):
    name = 'example_spider'
    allowed_domains = ['scrapingclub.com']
    start_urls = ['https://scrapingclub.com/exercise/detail_basic/']

    def parse(self, response):
        item = dict()
        item['price'] = response.xpath('//div[@class="card-body"]/h4/text()').get()
        yield item


# use the extended "Spider" class
class ExampleSpider2(Spider):
    name = 'example_spider2'
    allowed_domains = ['scrapingclub.com']
    start_urls = ['https://scrapingclub.com/exercise/detail_basic/']

    def parse(self, response):
        item = dict()
        item['title'] = response.xpath('//h3/text()').get()
        yield item
Thanks to @SuperUser, I managed to do what I needed without adding code to each spider individually. Everything happens inside an extension, more precisely inside its spider_opened method. Here is the code:
import logging

from scrapy import signals
from scrapy.exceptions import NotConfigured


class CustomLogExtension:

    class ContentFilter(logging.Filter):
        """Creates a filter that will attach the spider's name to every log record."""
        def filter(self, record):
            record.spider_name = ''
            # enter the spider's name
            if hasattr(record, 'spider'):
                record.spider_name = record.spider.name
            return True

    @classmethod
    def from_crawler(cls, crawler):
        # first check if the extension should be enabled and raise NotConfigured otherwise
        if not crawler.settings.getbool('CUSTOM_LOG_EXTENSION'):
            raise NotConfigured
        # instantiate the extension object
        ext = cls()
        # connect the extension object to signals
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        # return the extension object
        return ext

    def spider_opened(self, spider):
        """Prefixes the spider's name to every log emitted."""
        formatter = logging.Formatter('[%(spider_name)s] %(asctime)s [%(name)s] %(levelname)s: %(message)s')
        # add the new format and filter to all the handlers
        for handler in logging.root.handlers:
            handler.formatter = formatter
            handler.addFilter(self.ContentFilter())
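For completeness, the extension also has to be enabled in the project settings; a minimal sketch, assuming the class lives in a (hypothetical) myproject/extensions.py:

# settings.py
EXTENSIONS = {
    'myproject.extensions.CustomLogExtension': 500,  # module path and priority are illustrative
}
# the flag checked in from_crawler() above
CUSTOM_LOG_EXTENSION = True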
For those of us who are new to Scrapy and don't want to do a lot of work customizing a log extension, this seems achievable with custom stats plus periodic log stats.
In the parse function of each spider class I set self.crawler.stats.set_value('spider_name', self.name), and then in settings.py I set PERIODIC_LOG_STATS = {"include": ["spider_name"]} (along with anything else you want included in the periodic log stats output). I also defined a separate CrawlerProcess for each spider; a sketch of the setup follows below.
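A minimal sketch of that setup (spider and settings trimmed to the relevant parts; PERIODIC_LOG_STATS comes from Scrapy's periodic log extension, available in recent Scrapy versions):

# settings.py
PERIODIC_LOG_STATS = {"include": ["spider_name"]}

# example_spider.py (illustrative)
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example_spider'
    start_urls = ['https://scrapingclub.com/exercise/detail_basic/']

    def parse(self, response):
        # record the spider's name so the periodic stats output carries it
        self.crawler.stats.set_value('spider_name', self.name)
        yield {'title': response.xpath('//h3/text()').get()}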
This may be too hacky, but it has always worked for me, and it lets me stay within Scrapy's own logging classes and extensions while running multiple spiders through the API. If anyone sees a reason this is unacceptable, please let me know; as I mentioned, I'm new to Scrapy :)