Scrapy: store returned items in a variable to use in the main script

Question

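I'm new to Scrapy and want to try the following: extract some values from a web page, store them in a variable, and use that variable in my main script. So I followed their tutorial and changed the code for my purposes: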

import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]

    custom_settings = {
        'LOG_ENABLED': 'False',
    }

    def parse(self, response):
        global title # This would work, but there should be a better way
        title = response.css('title::text').extract_first()

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(QuotesSpider)
process.start() # the script will block here until the crawling is finished

print(title) # Verify if it works and do some other actions later on...

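This works so far, but I'm pretty sure it isn't good style, and declaring the title variable as global may even have bad side effects. If I skip that line, I of course get an "undefined variable" error :/ So I'm looking for a way to return the variable and use it in my main script.

I have read about item pipelines, but I couldn't get them to work.

Any help or ideas are much appreciated :) Thanks in advance!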

Answer 1

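As you already know, using global is not good style, especially once your requirements grow.

My suggestion is to store the title in a file or a list and use it in the main process. If you want to process the titles in another script, just open the file and read them there.

(Note: please ignore the indentation issues.)

spider.py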

import scrapy
from scrapy.crawler import CrawlerProcess

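namefile = 'namefile.txt'
current_title_session = []  # titles stored during the current session

try:
    # Titles saved by earlier runs (each line keeps its trailing newline)
    with open(namefile, 'r', encoding='utf-8') as f:
        title_in_file = f.readlines()
except FileNotFoundError:
    title_in_file = []  # first run: nothing has been saved yet

file_append = open(namefile, 'a', encoding='utf-8')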

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]

    custom_settings = {
        'LOG_ENABLED': 'False',
    }

    def parse(self, response):
        title = response.css('title::text').extract_first()
        if title + '\n' not in title_in_file and title not in current_title_session:
            file_append.write(title + '\n')
            current_title_session.append(title)


if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(QuotesSpider)
    process.start() # the script will block here until the crawling is finished
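
To illustrate the "open the file and read the titles" step from another script, here is a minimal sketch. It assumes the spider above has written one title per line to namefile.txt; the helper name `load_titles` is hypothetical and not part of Scrapy.

```python
from pathlib import Path


def load_titles(path="namefile.txt"):
    """Return the titles saved by the spider, one per line, without newlines."""
    file = Path(path)
    if not file.exists():
        return []  # the spider has not written anything yet
    with file.open(encoding="utf-8") as f:
        # Skip blank lines and strip the trailing newline from each title
        return [line.rstrip("\n") for line in f if line.strip()]


if __name__ == "__main__":
    for title in load_titles():
        print(title)
```

Reading the file only after `process.start()` has returned (or from a separate script run later) avoids seeing a half-written file.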

Answer 2

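Making the variable global should work for what you need, but as you mentioned, it isn't good style.

I would actually suggest using a separate service for communication between processes, such as Redis, so that your spider doesn't conflict with any other process.

It is very simple to set up and use; the documentation has a very simple example.

Instantiate the Redis connection inside the spider, and again in the main process (think of them as separate processes). The spider sets the variable, and the main process reads (gets) the information.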
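
The set/get idea could look roughly like the sketch below. It assumes the redis-py package and a Redis server listening on localhost:6379; the key name `quotes:title` and the helpers `publish_title` and `read_title` are made up for illustration. In the spider, `publish_title` would be called from `parse()`.

```python
# Sketch only: assumes redis-py is installed and a Redis server runs on localhost:6379.
TITLE_KEY = "quotes:title"  # hypothetical key name shared by both sides


def publish_title(client, title):
    """Spider side: store the scraped title under a well-known key."""
    client.set(TITLE_KEY, title)


def read_title(client):
    """Main-script side: fetch the title, or None if nothing was stored."""
    value = client.get(TITLE_KEY)
    # redis-py returns bytes unless the client was created with decode_responses=True
    return value.decode("utf-8") if isinstance(value, bytes) else value


if __name__ == "__main__":
    import redis  # pip install redis

    client = redis.Redis(host="localhost", port=6379)
    publish_title(client, "Quotes to Scrape")  # in practice, called from parse()
    print(read_title(client))
```

Because both helpers take the client as an argument, the spider and the main script can each create their own connection, which matches the "treat them as separate processes" advice.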
