I am trying to scrape dockerhub.com with a crawler using the scrapy-vertical approach, and I need to define a rule that collects every page matching the following pattern:
I have tried using:
rules = (
    Rule(
        LinkExtractor(allow=r'\?q=.*&page=\d+'),
        follow=True,
        process_request='set_playwright_true',
        callback='parse_page',
    ),
)
def parse_page(self, response):
    self.logger.info("Parsing page: %s", response.url)
    for result in response.css('div#searchResults div.some-class'):
        yield {
            'title': result.css('h2 a::text').get(),
            'link': result.css('h2 a::attr(href)').get(),
        }
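Here set_playwright_true is a small process_request helper that marks each extracted request for Playwright rendering; a minimal version (assuming scrapy-playwright) looks like this:

def set_playwright_true(self, request, response):
    # Ask scrapy-playwright to render the extracted link in a browser.
    request.meta["playwright"] = True
    return request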
But this approach only collects a single page.
Does anyone know how to fetch every page matching my pattern?
I don't see any links in the page that match your rule, so there is nothing for the LinkExtractor to follow.
Why not just iterate over the page index instead?
import asyncio
import re

from scrapy import Request
from scrapy.exceptions import CloseSpider
from scrapy.spiders import CrawlSpider
from scrapy_playwright.page import PageMethod
from twisted.internet.error import TimeoutError

CSS_IMG_LIST = ".MuiStack-root > a > .MuiPaper-root"


class DockerHubSpider(CrawlSpider):
    name = "dockerhub"
    allowed_domains = ["hub.docker.com"]

    def start_requests(self):
        yield Request(
            "https://hub.docker.com/search?q=python&page=1",
            callback=self.parse,
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    # Wait until the result cards have been rendered.
                    PageMethod("wait_for_selector", CSS_IMG_LIST),
                ],
            },
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        # Let the page settle without blocking the event loop.
        await asyncio.sleep(5)
        await page.close()
        # Get the index of the current page from its URL.
        index = int(re.search(r"([0-9]+)$", response.url).group(1))
        yield Request(
            f"https://hub.docker.com/search?q=python&page={index + 1}",
            callback=self.parse,
            errback=self.handle_error,
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", CSS_IMG_LIST),
                ],
            },
        )

    def handle_error(self, failure):
        # Past the last page the selector never appears, the request
        # times out, and the spider shuts down.
        if failure.check(TimeoutError):
            raise CloseSpider("TimeoutError encountered")
There are certainly more elegant ways to do this, but the code above will start from the first page of Python images, walk forward to the last page (currently page 100), and then stop.
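As one possible refinement (a minimal sketch; the _page_request helper name is my own), the duplicated Request construction can be factored out so that start_requests and parse stay in sync:

    def _page_request(self, index):
        # Build a Playwright-rendered request for one search-results page.
        return Request(
            f"https://hub.docker.com/search?q=python&page={index}",
            callback=self.parse,
            errback=self.handle_error,
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", CSS_IMG_LIST),
                ],
            },
        )

    def start_requests(self):
        yield self._page_request(1)

    async def parse(self, response):
        await response.meta["playwright_page"].close()
        index = int(re.search(r"([0-9]+)$", response.url).group(1))
        # ... extract items here ...
        yield self._page_request(index + 1)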