I want to use scrapy-splash to fetch both the HTML and a PNG screenshot of the target page, and I need to be able to invoke it programmatically. According to the Splash docs, specifying
endpoint='render.json'
and passing the argument
'png': 1
should produce a response object ('scrapy_splash.response.SplashJsonResponse') whose .data attribute contains the decoded JSON, including a PNG screenshot of the target page.
When the spider (here named 'search') is invoked with
scrapy crawl search
the result is as expected: response.data['png'] contains the PNG data.
However, when it is invoked via Scrapy's CrawlerProcess, a different response object is returned: 'scrapy.http.response.html.HtmlResponse'. That object has no .data attribute.
Here is the code:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy_splash import SplashRequest
import base64

RUN_CRAWLERPROCESS = False

if RUN_CRAWLERPROCESS:
    from crochet import setup
    setup()

class SpiderSearch(scrapy.Spider):
    name = 'search'
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'

    def start_requests(self):
        urls = ['https://www.google.com/search?q=test', ]
        splash_args = {
            'html': 1,
            'png': 1,
            'width': 1920,
            'wait': 0.5,
            'render_all': 1,
        }
        for url in urls:
            yield SplashRequest(url=url, callback=self.parse, endpoint='render.json', args=splash_args)

    def parse(self, response):
        print(type(response))
        for result in response.xpath('//div[@class="r"]'):
            url = str(result.xpath('./a/@href').extract_first())
            yield {
                'url': url
            }
        png_bytes = base64.b64decode(response.data['png'])
        with open('google_results.png', 'wb') as f:
            f.write(png_bytes)
        splash_args = {
            'html': 1,
            'png': 1,
            'width': 1920,
            'wait': 2,
            'render_all': 1,
            'html5_media': 1,
        }
        # cue the subsequent url to be fetched (self.parse_result omitted here for brevity)
        yield SplashRequest(url=url, callback=self.parse_result, endpoint='render.json', args=splash_args)

if RUN_CRAWLERPROCESS:
    runner = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'})
    #d = runner.crawl(SpiderSearch)
    #d.addBoth(lambda _: reactor.stop())
    #reactor.run()
    runner.crawl(SpiderSearch)
    runner.start()
To reiterate: with
RUN_CRAWLERPROCESS = False
and invoking the spider with
scrapy crawl search
the response is of type
class 'scrapy_splash.response.SplashJsonResponse'
but setting
RUN_CRAWLERPROCESS = True
and running the script with CrawlerProcess yields a response of type
class 'scrapy.http.response.html.HtmlResponse'
(p.s. I ran into some trouble with ReactorNotRestartable, so adopted the crochet approach described in this post, which seems to have resolved the issue. I admit I don't understand why, but assume it's unrelated...)
Any ideas on how to debug this?
If you run this code as a standalone script, the project's settings module is never loaded, so your crawler doesn't know about the Splash middleware (which is what adds the .data attribute referenced in .parse).
You can load those settings in your script by calling get_project_settings and passing the result to CrawlerProcess:
from scrapy.utils.project import get_project_settings
# ...
project_settings = get_project_settings()
process = CrawlerProcess(project_settings)
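Alternatively, if there is no Scrapy project on disk to load settings from, the same settings can be passed to CrawlerProcess inline. A minimal sketch of the settings scrapy-splash's documentation asks for (the Splash URL and middleware priorities below follow the scrapy-splash README; the localhost address is an assumption about where your Splash instance runs):

```python
# Inline settings for a standalone script, mirroring what a project's
# settings.py would declare for scrapy-splash (per its README).
SPLASH_SETTINGS = {
    'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
    # Address of the running Splash instance (assumed here to be local).
    'SPLASH_URL': 'http://localhost:8050',
    # The SplashMiddleware is what turns the raw HTTP response into a
    # SplashJsonResponse with the .data attribute.
    'DOWNLOADER_MIDDLEWARES': {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    },
    'SPIDER_MIDDLEWARES': {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    },
    'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
}

# process = CrawlerProcess(SPLASH_SETTINGS)
```

With these settings in place, CrawlerProcess should produce the same SplashJsonResponse that `scrapy crawl search` does.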