Crawl all pages with a Scrapy CrawlSpider and LinkExtractor rules


I am trying to scrape dockerhub.com with a Scrapy CrawlSpider (the vertical-crawling approach), and I need to define a rule that collects every page whose URL matches a search-pagination pattern of the form ?q=<query>&page=<number>.

I have tried using:

    rules = (
        Rule(
            LinkExtractor(allow=r'\?q=.*&page=\d+'),
            follow=True,
            process_request='set_playwright_true',
            callback='parse_page',
        ),
    )

    def parse_page(self, response):
        self.logger.info("Parsing page: %s", response.url)

        # 'some-class' is a placeholder for the result card class.
        for result in response.css('div#searchResults div.some-class'):
            yield {
                'title': result.css('h2 a::text').get(),
                'link': result.css('h2 a::attr(href)').get(),
            }
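set_playwright_true is omitted above; it is meant as a small process_request hook that flags each extracted request for Playwright rendering. A rough sketch of such a method, assuming scrapy-playwright, would be:

    def set_playwright_true(self, request, response):
        # Re-issue each extracted link request with Playwright rendering enabled.
        request.meta['playwright'] = True
        return request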

But this approach only collects a single page.

Does anyone know how to fetch every page that matches my pattern?

python web-scraping browser scrapy playwright
1 Answer

I don't see any links on the page that match your rule.
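You can confirm this yourself by running the extractor against the rendered response and logging what it finds; a quick throwaway callback inside your spider (parse_debug is just an illustrative name) might look like:

from scrapy.linkextractors import LinkExtractor

    def parse_debug(self, response):
        # Log every link the extractor would actually pick up from this page.
        links = LinkExtractor(allow=r'\?q=.*&page=\d+').extract_links(response)
        self.logger.info("Matched %d links: %s", len(links), [link.url for link in links])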

Why not just iterate over the page index directly?

import asyncio
import re

from scrapy import Request
from scrapy.exceptions import CloseSpider
from scrapy.spiders import CrawlSpider
from scrapy_playwright.page import PageMethod
from twisted.internet.error import TimeoutError

# CSS selector for the grid of image result cards on hub.docker.com.
CSS_IMG_LIST = ".MuiStack-root > a > .MuiPaper-root"

class DockerHubSpider(CrawlSpider):
    name = "dockerhub"
    allowed_domains = ["hub.docker.com"]

    def start_requests(self):
        yield Request(
            "https://hub.docker.com/search?q=python&page=1",
            callback=self.parse,
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", CSS_IMG_LIST),
                ],
            },
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        # Pause briefly without blocking the Twisted/asyncio event loop.
        await asyncio.sleep(5)

        await page.close()

        # Get the index of the current page from the end of the URL.
        index = int(re.search("([0-9]+)$", response.url).groups()[0])

        # Queue the next page with the same Playwright setup.
        yield Request(
            f"https://hub.docker.com/search?q=python&page={index+1}",
            callback=self.parse,
            errback=self.handle_error,
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", CSS_IMG_LIST),
                ],
            },
        )

    def handle_error(self, failure):
        # A Playwright timeout means the result grid never rendered, i.e. we are past the last page.
        if failure.check(TimeoutError):
            raise CloseSpider("TimeoutError encountered")

There are certainly more elegant ways to do this, but the code above will start at the first page of results for Python images and keep advancing until it reaches the last page (currently page 100), then stop.
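This assumes scrapy-playwright is already enabled for the project; if it is not, the standard settings are roughly:

# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

With those in place, the spider can be started with scrapy crawl dockerhub.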
