I am trying to run multiple Scrapy Python files that live in different sub-packages. Each scraper sits in its own sub-package (for example Sub_Package_1/website1/ contains main.py, myscrapy.py and scrapy_settings.json), and run.py sits at the project root.
Currently the scraping works like this: I run a main.py from cmd; it loads scrapy_settings.json and myscrapy.py, and the spider in myscrapy.py crawls the site and fills the item fields.
I want to run these main.py files sequentially from a separate Python file that loops over a list of scrapers: as soon as the first scraper has finished, the next one should start.
run.py
import os, sys
import subprocess
import time

def execute_scraper(script_path):
    print(f"starting execution of {script_path}")
    try:
        subprocess.Popen(["python", script_path], shell=True)
    except Exception as exc:
        print(f"error during execution of {script_path}: {str(exc)}")
    else:
        print(f"finished execution of {script_path}")

def main():
    fileDirectory: str = os.path.abspath(os.path.join(__file__, "../"))
    # paths to the scrapers
    path1 = os.path.join(fileDirectory, "Sub_Package_1", "website1", "main.py")
    path2 = os.path.join(fileDirectory, "Sub_Package_1", "website1", "main.py")
    list_scrapers = [path1, path2]
    for scraper in list_scrapers:
        execute_scraper(scraper)

if __name__ == "__main__":
    main()
scrapy_settings.json
{
    "DOWNLOAD_DELAY": 1,
    "COOKIES_ENABLED": false,
    "LOG_ENABLED": true,
    "LOG_LEVEL": "DEBUG",
    "RETRY_TIMES": 10,
    "RETRY_HTTP_CODES": [500, 503, 504, 400, 403, 404, 408],
    "FEED_EXPORTERS": { "csv": "scrapy.exporters.CsvItemExporter" },
    "FEED_FORMAT": "csv",
    "FEED_URI": "filename.csv"
}
main.py
from scrapy.crawler import CrawlerProcess
import myscrapy
import json
import os, sys

def start_crawler():
    # Read Scrapy settings
    fileDirectory: str = os.path.abspath(os.path.join(__file__, "../"))
    json_path = os.path.join(fileDirectory, "scrapy_settings.json")
    with open(json_path) as fsettings:
        settings = json.load(fsettings)
    # Start the crawling process
    process = CrawlerProcess(settings)
    process.crawl(myscrapy.myspider)
    process.start()
    process.stop()

if __name__ == "__main__":
    start_crawler()
myscrapy.py is just an example taken from https://www.scrapingbee.com/blog/web-scraping-with-scrapy/
# -*- coding: utf-8 -*-
import urllib.parse
import random
import scrapy
import re
from scrapy.selector import Selector
from scrapy.spiders import Spider
from datetime import datetime
#import html2text

# Fields of the Scrapy item -- similar to a Python dictionary
class myitems(scrapy.Item):
    product_url = scrapy.Field()
    price = scrapy.Field()
    title = scrapy.Field()
    img_url = scrapy.Field()

class myspider(Spider):
    name = 'ecom_spider'
    allowed_domains = ['clever-lichterman-044f16.netlify.app']
    start_urls = ['https://clever-lichterman-044f16.netlify.app/products/taba-cream.1/']

    def parse(self, response):
        item = myitems()
        item['product_url'] = response.url
        item['price'] = response.xpath("//div[@class='my-4']/span/text()").get()
        item['title'] = response.xpath('//section[1]//h2/text()').get()
        item['img_url'] = response.xpath("//div[@class='product-slider']//img/@src").get()
        return item
In both sub-packages the myscrapy.py file is almost identical; only the links differ.
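To make that concrete, the second myscrapy.py looks the same as the spider above except for these attributes (the domain and URL below are placeholders, not the real second site):

from scrapy.spiders import Spider

class myspider(Spider):
    name = 'ecom_spider'
    allowed_domains = ['second-site.example.com']                      # placeholder domain
    start_urls = ['https://second-site.example.com/products/item-1/']  # placeholder start URL
    # the item class and parse() are identical to the code shown above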
So far I have tried importing the scrapers as modules, but when I import main it raises a ModuleNotFoundError for myscrapy.py.
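Roughly, that import-based attempt looked like the sketch below (simplified; the second module path is illustrative). The error appears as soon as the sub-package's main module is imported, presumably because its plain import myscrapy is resolved against sys.path of the top-level script rather than the sub-package folder:

import importlib

scraper_modules = [
    "Sub_Package_1.website1.main",
    "Sub_Package_1.website2.main",   # illustrative path for the second scraper
]

for module_path in scraper_modules:
    # importing main executes "import myscrapy" inside it, which raises
    # ModuleNotFoundError because myscrapy.py is not on sys.path
    scraper_main = importlib.import_module(module_path)
    scraper_main.start_crawler()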
I have also tried subprocesses with both run and Popen. With subprocess.run() none of the crawlers work and the command line simply stops executing.
With subprocess.Popen(), the first crawler in the list runs fine, but once it has finished the second crawler never starts and the terminal hangs. I kept shell=True for Popen. I also tried writing a separate subprocess call for each scraper instead of running them in the for loop, but the problem is the same.
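In other words, the behaviour I am after is the one sketched below, where the loop only continues once the previous child process has exited; this function would replace execute_scraper() in run.py above (shell=True dropped here purely for illustration):

import subprocess

def execute_scraper_blocking(script_path):
    # launch the scraper and block until its process has exited,
    # so the next scraper in the list only starts afterwards
    print(f"starting execution of {script_path}")
    proc = subprocess.Popen(["python", script_path])
    proc.wait()
    print(f"finished execution of {script_path} (return code {proc.returncode})")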
For security reasons I cannot use batch (.bat) files.
Running one crawler at a time from cmd works fine.