I'm using scrapy to scrape this page, but for some reason scrapy never gets a usable response from the site. When I run the spider, I get an HTTP 500 error.
Here is my basic spider:
import scrapy


class SavingsGov(scrapy.Spider):
    name = 'savings'
    start_urls = [
        'https://savings.gov.pk/download-draws/'
    ]

    def parse(self, response):
        # Yield the value attribute of every <option> inside a <select>
        for option in response.css('select option'):
            yield {
                'url': option.css('::attr(value)').get()
            }
Here is the error I get when I run it (I also increased the retry count to 10 in settings.py):
2023-08-26 16:30:22 [scrapy.core.engine] INFO: Spider opened
2023-08-26 16:30:22 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-08-26 16:30:22 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-08-26 16:30:24 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 1 times): 500 Internal Server Error
2023-08-26 16:30:25 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 2 times): 500 Internal Server Error
2023-08-26 16:30:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 3 times): 500 Internal Server Error
2023-08-26 16:30:28 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 4 times): 500 Internal Server Error
2023-08-26 16:30:30 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 5 times): 500 Internal Server Error
2023-08-26 16:30:31 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 6 times): 500 Internal Server Error
2023-08-26 16:30:33 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 7 times): 500 Internal Server Error
2023-08-26 16:30:35 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 8 times): 500 Internal Server Error
2023-08-26 16:30:37 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 9 times): 500 Internal Server Error
2023-08-26 16:30:39 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 10 times): 500 Internal Server Error
2023-08-26 16:30:40 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://savings.gov.pk/robots.txt> (failed 11 times): 500 Internal Server Error
2023-08-26 16:30:40 [scrapy.core.engine] DEBUG: Crawled (500) <GET https://savings.gov.pk/robots.txt> (referer: None)
2023-08-26 16:30:40 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2023-08-26 16:30:41 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 1 times): 500 Internal Server Error
2023-08-26 16:30:43 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 2 times): 500 Internal Server Error
2023-08-26 16:30:44 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 3 times): 500 Internal Server Error
2023-08-26 16:30:46 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 4 times): 500 Internal Server Error
2023-08-26 16:30:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 5 times): 500 Internal Server Error
2023-08-26 16:30:49 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 6 times): 500 Internal Server Error
2023-08-26 16:30:50 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 7 times): 500 Internal Server Error
2023-08-26 16:30:52 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 8 times): 500 Internal Server Error
2023-08-26 16:30:53 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 9 times): 500 Internal Server Error
2023-08-26 16:30:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 10 times): 500 Internal Server Error
2023-08-26 16:30:56 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://savings.gov.pk/download-draws/> (failed 11 times): 500 Internal Server Error
2023-08-26 16:30:56 [scrapy.core.engine] DEBUG: Crawled (500) <GET https://savings.gov.pk/download-draws/> (referer: None)
2023-08-26 16:30:56 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 https://savings.gov.pk/download-draws/>: HTTP status code is not handled or not allowed
2023-08-26 16:30:56 [scrapy.core.engine] INFO: Closing spider (finished)
However, I can easily get a response using Python's requests module.
Here is the Python code:
import requests
response = requests.get('https://savings.gov.pk/download-draws/')
print(response.text)
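I don't know why this happens; I assume the problem is in scrapy.Request.
Is there a way to perform the request with requests and pass the response to scrapy? A better option, though, would be to debug scrapy.Request somehow.
I'm new to scrapy, so please let me know if I have misunderstood the problem.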
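This is most likely because the server rejects requests that carry scrapy's default user agent.
Try setting a custom USER_AGENT in the spider's custom_settings.
For example: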
import scrapy


class SavingsGov(scrapy.Spider):
    name = 'savings'
    start_urls = [
        'https://savings.gov.pk/download-draws/'
    ]
    # Per-spider override: send a browser-like User-Agent instead of scrapy's default
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
    }

    def parse(self, response):
        for option in response.css('select option'):
            yield {
                'url': option.css('::attr(value)').get()
            }
Partial output:
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-1500-draw-list/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-200-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-1500-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-25000-premium-bonds-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-15000-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-40000-premium-bonds-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-40000-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-25000-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-7500-draws/'}
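If you want to check the user-agent theory outside of scrapy, a rough sketch like the one below may help. It uses only the standard library and requests the same URL with two different User-Agent headers so the status codes can be compared. The helper names (build_request, fetch_status, compare_user_agents) are made up for this illustration, and the SCRAPY_LIKE_UA string only imitates the general shape of scrapy's default user agent; the version number in it is a placeholder, not taken from the question.

```python
import urllib.error
import urllib.request

# Assumption: imitates the shape of scrapy's default User-Agent;
# the version number is a placeholder.
SCRAPY_LIKE_UA = "Scrapy/2.10.0 (+https://scrapy.org)"

# The browser-like User-Agent used in the spider above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
)


def build_request(url, user_agent):
    """Build a GET request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url, user_agent, timeout=10):
    """Send the request and return the HTTP status code.

    urlopen raises HTTPError for 4xx/5xx responses, so the code is
    read from the exception in that case.
    """
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code


def compare_user_agents(url):
    """Return a dict mapping a label to the status code seen for that User-Agent."""
    return {
        "scrapy-like": fetch_status(url, SCRAPY_LIKE_UA),
        "browser": fetch_status(url, BROWSER_UA),
    }
```

Calling compare_user_agents('https://savings.gov.pk/download-draws/') and seeing different status codes for the two entries would support the explanation above; identical codes would suggest the cause lies elsewhere. If the user agent does turn out to be the cause, USER_AGENT can also be set once in the project's settings.py instead of per spider.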