I am trying to run a simple broad crawl over 2,000 domains with Scrapy.
I have 4 lists of 500 domains each, and I simply call process.crawl once per list. For now I only fetch each domain's homepage.
The problem is that after roughly the 1,000th domain I start getting the following error ("filedescriptor out of range in select()"):
2019-10-20 00:24:59 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.******.com>
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 44, in process_request
defer.returnValue((yield download_func(request=request, spider=spider)))
twisted.internet.error.ConnectError: An error occurred while connecting: [Failure instance: Traceback: <class 'ValueError'>: filedescriptor out of range in select()
Note: I replaced the domain name in the error with ******.
Here is my spider:
import scrapy

class scan_scripts(scrapy.Spider):
    name = 'scan_scripts'

    custom_settings = {
        'CONCURRENT_REQUESTS': 500,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 25,
        'CONCURRENT_ITEMS': 50,
        'REACTOR_THREADPOOL_MAXSIZE': 500,
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'LOG_LEVEL': 'ERROR',
        'HTTPCACHE_ENABLED': False,
        'COOKIES_ENABLED': False,
        'SCHEDULER_PRIORITY_QUEUE': 'scrapy.pqueues.DownloaderAwarePriorityQueue',
        'RETRY_ENABLED': False,
        'DOWNLOAD_TIMEOUT': 15,
        'REDIRECT_ENABLED': False,
        'AJAXCRAWL_ENABLED': True,
        'DEPTH_PRIORITY': 1,
        'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
        'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue'
    }

    def start_requests(self):
        for domain in self.domains:
            yield scrapy.Request("https://" + domain, self.parse)

    def parse(self, response):
        print('### SCANNED: %s' % response.request.url)
Then I run this:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({})
process.crawl(scan_scripts, domains=domains_list1)
process.crawl(scan_scripts, domains=domains_list2)
process.crawl(scan_scripts, domains=domains_list3)
process.crawl(scan_scripts, domains=domains_list4)
process.start()
A simple solution is to set your CONCURRENT_REQUESTS to a smaller value. The select()-based reactor can only watch a limited number of file descriptors (typically 1024), and four crawlers each allowed up to 500 concurrent requests can easily open more sockets than that, which is why the error only shows up once enough domains are in flight.
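For example, a minimal sketch of the adjusted settings (the exact numbers are only illustrative, not tuned recommendations; keep the rest of your settings unchanged):

    # Lower overall concurrency so the total number of open sockets across
    # all four crawlers stays well under the select() limit of ~1024.
    custom_settings = {
        'CONCURRENT_REQUESTS': 100,             # was 500
        'CONCURRENT_REQUESTS_PER_DOMAIN': 25,
        'REACTOR_THREADPOOL_MAXSIZE': 100,      # was 500
        # ... other settings as before
    }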