Scrapy: returning/passing data to another module

Question (votes: 2, answers: 1)

Hi, I'd like to know how to pass the scraped results (a pandas DataFrame) back to the module that creates the spider.

from scrapy.crawler import CrawlerProcess

import mySpider as mspider

def main():
    spider1 = mspider.MySpider()
    process = CrawlerProcess()
    process.crawl(spider1)
    process.start()
    print(len(spider1.result))  # prints 0, see below

The spider:

import pandas as pd
import scrapy
from scrapy import Request

import config

class MySpider(scrapy.Spider):
    name = 'MySpider'
    allowed_domains = config.ALLOWED_DOMAINS
    result = pd.DataFrame(columns=...)  # shared class attribute

    def start_requests(self):
        yield Request(url=..., headers=config.HEADERS, callback=self.parse)

    def parse(self, response):
        # ...some code adding values to result...
        print("size: " + str(len(self.result)))  # prints 1005

While the parse method prints 1005, the value printed in main is 0. Can you tell me how to pass the value between the two?

I want to do this because I'm running multiple spiders. Once they have finished scraping, I'll merge the results and save them to a file.

from datetime import datetime

from scrapy import signals
from scrapy.crawler import CrawlerProcess

import mySpider as spider

def spider_closed(spider, reason):
    print("Size: " + str(len(spider.result)))

def main():
    now = datetime.now()
    spider1 = spider.MySpider()
    crawler_process = CrawlerProcess()
    crawler = crawler_process.create_crawler(spider1)
    crawler.signals.connect(spider_closed, signals.spider_closed)
    crawler_process.crawl(spider1)
    crawler_process.start()
Tags: python, web-scraping, scrapy, python-3.5
1 Answer (1 vote)

The main reason for this behavior is how Scrapy runs spiders: CrawlerProcess never uses the spider object you instantiated in main(). It builds its own spider instance internally, so the print(len(spider1.result)) line reads a DataFrame that .parse() never populated.
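To see the mechanism in plain Python (an editorial sketch, not part of the original answer; it assumes parse() rebinds self.result rather than mutating it in place):

import pandas as pd

class Demo:
    result = pd.DataFrame(columns=["a"])  # shared class attribute, like MySpider.result

    def add(self):
        # Rebinding self.result creates an instance attribute that
        # shadows the class-level DataFrame
        self.result = pd.concat([self.result, pd.DataFrame({"a": [1]})])

mine = Demo()    # stands in for the spider1 you created in main()
theirs = Demo()  # stands in for the instance Scrapy builds internally
theirs.add()
print(len(theirs.result))  # 1 -- like the 1005 printed in parse()
print(len(mine.result))    # 0 -- like the 0 printed in main()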

There are several ways to wait for the spider to finish. I would use the spider_closed signal:

from scrapy import signals
from scrapy.crawler import CrawlerProcess

import mySpider as mspider

def spider_closed(spider, reason):
    # `spider` is the instance Scrapy actually ran, so its
    # result DataFrame is the populated one
    print(len(spider.result))

crawler_process = CrawlerProcess(settings)  # settings: your Scrapy Settings object
crawler = crawler_process.create_crawler(mspider.MySpider)

crawler.signals.connect(spider_closed, signals.spider_closed)

crawler_process.crawl(crawler)
crawler_process.start()
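Since the goal is to merge the output of several spiders into one file, here is a minimal sketch of that pattern (editorial, with assumptions: the mySpider module and result attribute from the question; the output filename is hypothetical):

import pandas as pd
from scrapy import signals
from scrapy.crawler import CrawlerProcess

import mySpider as mspider

collected = []  # one DataFrame per finished spider

def spider_closed(spider, reason):
    collected.append(spider.result)

def main():
    process = CrawlerProcess()
    # Register every spider *class*; Scrapy instantiates each one itself
    for spider_cls in (mspider.MySpider,):  # add further spider classes here
        crawler = process.create_crawler(spider_cls)
        crawler.signals.connect(spider_closed, signals.spider_closed)
        process.crawl(crawler)
    process.start()  # blocks until every crawl has finished

    merged = pd.concat(collected, ignore_index=True)
    merged.to_csv("merged_results.csv", index=False)  # hypothetical filename

if __name__ == "__main__":
    main()

Because process.start() blocks until all crawls have finished, collected is complete by the time pd.concat runs.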