Scrapy broad crawl returns error "filedescriptor out of range in select()"

Problem description

I am trying to run a simple broad crawl over 2,000 domains with Scrapy.

I have 4 lists of 500 domains each, and I simply run process.crawl on each of the 4 lists. For now I am only scraping their home pages.

The problem is that after roughly the 1,000th domain I start getting the following error ("filedescriptor out of range in select()"):

2019-10-20 00:24:59 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.******.com>
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
twisted.internet.error.ConnectError: An error occurred while connecting: [Failure instance: Traceback: <class 'ValueError'>: filedescriptor out of range in select()

Note: the domain name in the error has been replaced with ******.

Here is my spider:

import scrapy
class scan_scripts(scrapy.Spider):

    name = 'scan_scripts'

    custom_settings = {
        'CONCURRENT_REQUESTS': 500,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 25,
        'CONCURRENT_ITEMS': 50,
        'REACTOR_THREADPOOL_MAXSIZE': 500,
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'LOG_LEVEL': 'ERROR',
        'HTTPCACHE_ENABLED': False,
        'COOKIES_ENABLED': False,
        'SCHEDULER_PRIORITY_QUEUE': 'scrapy.pqueues.DownloaderAwarePriorityQueue',
        'RETRY_ENABLED': False,
        'DOWNLOAD_TIMEOUT': 15,
        'REDIRECT_ENABLED': False,
        'AJAXCRAWL_ENABLED': True,
        'DEPTH_PRIORITY': 1,
        'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
        'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue'
    }

    def start_requests(self):

        for domain in self.domains:
            yield scrapy.Request("https://" + domain, self.parse)

    def parse(self, response):

        print('### SCANNED: %s' % response.request.url)

Then I run this:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({})

process.crawl(scan_scripts, domains=domains_list1)
process.crawl(scan_scripts, domains=domains_list2)
process.crawl(scan_scripts, domains=domains_list3)
process.crawl(scan_scripts, domains=domains_list4)

process.start()
Tags: python, web-scraping, scrapy, web-crawler
1 Answer

A simple solution is to lower your CONCURRENT_REQUESTS setting. The "filedescriptor out of range in select()" error means the select()-based reactor (which Twisted falls back to on macOS) was asked to watch a file descriptor above its fixed limit, typically 1024. With four crawlers in one process, each allowed 500 concurrent requests, the number of open sockets can easily exceed that limit, so keep the combined concurrency of all crawlers well below it.
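
For example, a minimal sketch of the same spider with reduced concurrency; the values 100 and 10 are illustrative assumptions rather than tested thresholds, and domains_list1 is a placeholder for one of your lists:

import scrapy
from scrapy.crawler import CrawlerProcess

class scan_scripts(scrapy.Spider):

    name = 'scan_scripts'

    custom_settings = {
        # Illustrative values (assumption, not a tested threshold): with four
        # crawlers in one process, 4 x 100 concurrent requests keeps the number
        # of open sockets well under select()'s usual 1024-descriptor cap.
        'CONCURRENT_REQUESTS': 100,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 10,
        'DOWNLOAD_TIMEOUT': 15,
        'RETRY_ENABLED': False,
        'REDIRECT_ENABLED': False,
        'COOKIES_ENABLED': False,
        'LOG_LEVEL': 'ERROR',
    }

    def start_requests(self):
        for domain in self.domains:
            yield scrapy.Request("https://" + domain, self.parse)

    def parse(self, response):
        print('### SCANNED: %s' % response.request.url)

if __name__ == '__main__':
    domains_list1 = ['example.com']  # placeholder for one of your 500-domain lists
    process = CrawlerProcess({})
    process.crawl(scan_scripts, domains=domains_list1)
    process.start()

Because custom_settings lives on the spider class, all four process.crawl(scan_scripts, ...) calls pick up the lower limits automatically.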
