Scrapy: parse a web page, extract the result pages, and download images

Question

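I have written a web crawler in Python using Beautiful Soup and requests to scrape images for a project, but it is very slow. I have heard that Scrapy is much faster, so I installed Scrapy and read many tutorials, but I cannot work out how to implement the crawler in the parse function of the spider script.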

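If I provide a link to the first page of the search results, it should:

1. find the number of result pages by analyzing <a> tags of a particular class,
2. get the links from <a> tags of a particular class, and
3. download the images with a particular 'id' from those links.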

I have updated the items and settings scripts as shown below. I would really appreciate any help you can give me.

items.py

import scrapy

class SampleItem(scrapy.Item):
    # define the fields for your item here like:
    images = scrapy.Field()
    image_urls = scrapy.Field()

settings.py

ITEM_PIPELINES = {
   'scrapy.pipeline.images.FilesPipeline': 1
}
FILES_STORE = '/Documents/scraped_images/'

Answer 1

Here is a minimal example that downloads images from the main page of http://books.toscrape.com/

You can put all the code in one file and run python script.py without creating a project.

You have to find the image URLs in the HTML yourself and add them to an Item (or yield them as a dictionary).

If you use FilesPipeline, you have to use file_urls instead of image_urls.
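For comparison, here is a rough sketch of the other combination, where the field name image_urls from the question does apply: ImagesPipeline together with IMAGES_STORE. The folder name scraped_images and the helper make_image_item are made up for illustration. ImagesPipeline also needs the Pillow package to be installed.

```python
# Settings for ImagesPipeline instead of FilesPipeline.
# "scraped_images" is a hypothetical folder name; use your own path.
IMAGES_SETTINGS = {
    "ITEM_PIPELINES": {"scrapy.pipelines.images.ImagesPipeline": 1},
    "IMAGES_STORE": "scraped_images",
}


def make_image_item(urls):
    """Hypothetical helper: build a dict item in the shape ImagesPipeline reads.

    ImagesPipeline looks for the "image_urls" key and writes its
    download results into an "images" key on the same item.
    """
    return {"image_urls": list(urls)}
```

The dictionary can be passed to CrawlerProcess in place of the FilesPipeline settings, and the spider would then yield make_image_item([url]) rather than {'file_urls': [url]}.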

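The path you use in FILES_STORE has to exist already; it will not be created for you. A full/ subfolder for the downloaded images will be created inside it, though.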
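One way to make sure the folder exists before the crawl starts is to create it from the script itself. This is only a sketch using the standard library; the function name ensure_store is made up for illustration.

```python
from pathlib import Path


def ensure_store(path):
    """Create the download folder (and any missing parents) if needed.

    Returns the absolute path as a string, ready to use as FILES_STORE.
    """
    store = Path(path).expanduser().resolve()
    store.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return str(store)
```

For example, 'FILES_STORE': ensure_store('scraped_images') could be used in the settings dictionary passed to CrawlerProcess.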

import scrapy
from scrapy.pipelines.files import FilesPipeline

class MySpider(scrapy.Spider):

    name = 'myspider'

    #allowed_domains = []

    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        print('url:', response.url)

        # FilesPipeline downloads any file type and stores images as they are
        # (ImagesPipeline would convert them to JPEG)
        for url in response.css('img::attr(src)').getall():
            url = response.urljoin(url)
            yield {'file_urls': [url]}


from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',

    # save in file CSV, JSON or XML
    #'FEED_FORMAT': 'csv',     # csv, json, xml
    #'FEED_URI': 'output.csv', #

    'ITEM_PIPELINES': {'scrapy.pipelines.files.FilesPipeline': 1},   # used standard FilesPipeline (download to FILES_STORE/full)

    'FILES_STORE': '.',                   # this folder has to exist before downloading
})
c.crawl(MySpider)
c.start()
