Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) after crawling a certain number of pages

Problem description

I'm trying to use Scrapy-Playwright to scrape a given list of URLs, but I noticed some strange behavior: it starts crawling fine, but every time, after a certain number of pages, it stops crawling and then keeps printing logs like this:

[logstats.py:54] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-09-23 08:30:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

No matter which set of URLs I provide, it always freezes like this at that point.

Here is my spider (my_spider.py):

import scrapy
from faker import Faker
from scrapy.spiders import Rule

fake = Faker()

class MySpider(scrapy.Spider):
    name = 'my_spider'


    def start_requests(self):
        # Define the initial URL(s) to scrape
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                headers={'User-Agent': fake.user_agent()},
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    errback=self.errback,
                ),
            )

    def __init__(self):
        # List of URLs to start scraping
        self.start_urls = ['https://dummy0, ...', 'https://dummy100, ...'] # some list of URLs more than 16

        self._rules = [Rule(callback = self.parse)]

    def parse(self, response):
        page_title = response.xpath('//title/text()').get()
        yield {
            'url': response.url,
            'title': page_title
        }

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()

And my settings.py (since I'm using Playwright):

REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 60 * 1000  # 60 seconds (value is in milliseconds)

Why does this happen?

web-scraping scrapy playwright
1 Answer

The reason was that I had exhausted the resources allocated to this process. If you look closely, the default number of concurrent requests (CONCURRENT_REQUESTS in settings.py) is also 16. That's why it froze after 16 requests: the earlier requests never released their resources, because their Playwright pages were never closed. Here is the fix.

I still kept it at 16:

CONCURRENT_REQUESTS = 16 # in settings.py

But now the resources are released after use (in my_spider.py):

import scrapy
from faker import Faker
from scrapy.spiders import Spider, Rule
import asyncio

fake = Faker()

class MySpider(scrapy.Spider):
    name = 'my_spider'


    def start_requests(self):
        # Define the initial URL(s) to scrape
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                headers={'User-Agent': fake.user_agent()},
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                ),
                # errback must be a Request argument (not a meta key) for Scrapy to call it
                errback=self.errback,
            )

    def __init__(self):
        # List of URLs to start scraping
        self.start_urls = ['https://dummy0, ...', 'https://dummy100, ...'] # some list of URLs more than 16

        self._rules = [Rule(callback = self.parse)]

    def parse(self, response):
        page = response.meta.get("playwright_page")
        # This is where you'll extract the data from the crawled pages.
        # As an example, we'll just print the title of each page.
        try:
            page_title = response.xpath('//title/text()').get()
            yield {
                'url': response.url,
                'title': page_title
            }
        finally:
            # Ensure the Playwright page is closed after processing
            if page:
                asyncio.ensure_future(page.close())


    async def errback(self, failure):
        page = failure.request.meta.get("playwright_page")
        if page:
            await page.close()
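
For reference, scrapy-playwright also supports coroutine callbacks, so the same cleanup can be expressed without asyncio.ensure_future by making parse itself async and awaiting page.close(). Below is only a minimal sketch of that alternative (the spider name is hypothetical and the URL list is the same placeholder as above), not the exact code I ran:

import scrapy
from faker import Faker

fake = Faker()

class MyAsyncSpider(scrapy.Spider):
    name = 'my_async_spider'  # hypothetical name, used only for this sketch
    start_urls = ['https://dummy0, ...', 'https://dummy100, ...']  # same placeholder list as above

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                headers={'User-Agent': fake.user_agent()},
                meta=dict(playwright=True, playwright_include_page=True),
                errback=self.errback,
            )

    async def parse(self, response):
        # Await the page close directly so the browser page (and its
        # concurrency slot) is released as soon as the item is yielded.
        page = response.meta["playwright_page"]
        try:
            yield {
                'url': response.url,
                'title': response.xpath('//title/text()').get(),
            }
        finally:
            await page.close()

    async def errback(self, failure):
        # Close the page for failed requests as well.
        page = failure.request.meta.get("playwright_page")
        if page:
            await page.close()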

Additionally, in case you're interested, these are the extra settings I added in settings.py:

# Increase concurrency
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Increase timeouts
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 90 * 1000  # 90 seconds
DOWNLOAD_TIMEOUT = 120  # 120 seconds

# Retry failed requests
RETRY_ENABLED = True
RETRY_TIMES = 5

# Max Playwright contexts
PLAYWRIGHT_MAX_CONTEXTS = 4

# Logging level
LOG_LEVEL = 'DEBUG'

# Playwright download handlers
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# Other recommended project settings
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
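
If pages still pile up under load, scrapy-playwright also exposes a setting to cap how many pages each browser context keeps open. A small addition along these lines may help (the value 8 is only an example; check the scrapy-playwright docs for your installed version):

# Cap the number of open Playwright pages per browser context
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 8  # example value, tune for your workload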