How to prevent Playwright from loading non-text content?


I'm trying to implement a crawler that is responsible for crawling given pages. I don't want to scrape any non-text items, and I don't even want the headless browser to load them, since that is just wasteful and needlessly increases crawl time. To achieve this I added some rules that were supposed to drop the unwanted content, but they don't work as expected. From the logs I can clearly see that non-text content is still being loaded and is adding crawl time:

...
'playwright/request_count/resource_type/document': 1,
'playwright/request_count/resource_type/font': 1,
'playwright/request_count/resource_type/image': 20,
'playwright/request_count/resource_type/script': 6,
'playwright/request_count/resource_type/stylesheet': 3,
'playwright/response_count': 30,
'playwright/response_count/method/GET': 30,
'playwright/response_count/resource_type/document': 1,
'playwright/response_count/resource_type/font': 1,
'playwright/response_count/resource_type/image': 20,
'playwright/response_count/resource_type/script': 5,
'playwright/response_count/resource_type/stylesheet': 3,
...

How can I keep them from being loaded at all? I'm using Scrapy together with Playwright so the crawler can handle dynamic content. Here is my code:

from scrapy.spiders import Spider, Rule
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.linkextractors import IGNORED_EXTENSIONS
import tldextract
from scrapy.exceptions import CloseSpider


class LinkCrawlerSpiderSelective(Spider):
    name = 'link_crawler_selective'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = ["https://books.toscrape.com/"]  # Enter URLs to crawl
        self.hard_limit = 100  # Maximum how many items to crawl
        self.total_page_visited = 0

        # allow_domains = self.start_urls: because we only want to take the start_url and/or pages
        # which are direct children of it. Even if other pages are accessible from the start_url
        # which do not share the same context path, we will not take them.
        self.allowed_domains = []
        for url in self.start_urls:
            domain = tldextract.extract(url).registered_domain
            self.allowed_domains.append(domain)

        self._rules = [Rule(LinkExtractor(allow_domains=self.allowed_domains,
                                          deny_extensions=IGNORED_EXTENSIONS))]

    def start_requests(self):
        # Define the initial URL(s) to scrape.
        # Note: errback is a Request argument, not a meta key.
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                ),
                errback=self.errback,
            )

    def parse(self, response):
        self.total_page_visited += 1
        if self.total_page_visited > self.hard_limit:
            raise CloseSpider(f'Hard limit exceeded. Maximum number of pages to crawl is set to be: {self.hard_limit}')

        # Extract all text nodes except those within script and style tags
        text = ' '.join(response.xpath('//*[not(self::script) and not(self::style)]/text()').getall())
        yield {
            "link": response.url,
            "text": "\n".join([" ".join(str(l).split()) for l in text.split("\n") if str(l).strip()])
        }

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()
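Note: as far as I can tell, Rule objects are only consumed by CrawlSpider, so a plain Spider silently ignores self._rules; and even under a CrawlSpider, deny_extensions only filters which links get followed, not which sub-resources the browser downloads for each page. A minimal sketch of driving the extractor by hand from a plain Spider (the spider name and domain list here are illustrative):

import scrapy
from scrapy.linkextractors import LinkExtractor, IGNORED_EXTENSIONS

class ManualLinkSpider(scrapy.Spider):
    """Sketch: a plain Spider must invoke the LinkExtractor itself."""
    name = "manual_link_sketch"  # hypothetical name
    start_urls = ["https://books.toscrape.com/"]

    link_extractor = LinkExtractor(
        allow_domains=["toscrape.com"],      # illustrative domain list
        deny_extensions=IGNORED_EXTENSIONS,  # filters followed links only
    )

    def parse(self, response):
        for link in self.link_extractor.extract_links(response):
            # Each followed page still loads its own images, fonts, and
            # scripts in the browser; deny_extensions cannot prevent that.
            yield scrapy.Request(link.url, meta={"playwright": True})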

Here is my settings.py:

# settings.py

import os

SPIDER_MODULES = [f"{os.path.split(os.getcwd())[-1]}.spiders"]

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
FEED_EXPORT_ENCODING = "utf-8"


#### FOR PLAYWRIGHT
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 60 * 1000  # 60 seconds

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

DOWNLOAD_DELAY = 2  # minimum download delay 
AUTOTHROTTLE_ENABLED = True


# Scrapy requires this reactor
from scrapy.utils.reactor import install_reactor
install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')

1 Answer
Credit to wRAR for the pointer. The problem can be solved by implementing a function that blocks requests for non-text content. Keep this function inside settings.py. Here is the code to add (adjust the extensions or resource types as needed):

def should_abort_request(request):
    # Debug logging: shows every request Playwright is about to make
    print(f"Request resource type: {request.resource_type}, URL: {request.url}")

    return (
        # Block non-text resource types reported by the browser
        request.resource_type in ["image", "font", "media", "stylesheet", "script"]
        # Block image files by extension
        or any(ext in request.url for ext in [".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".ico"])
        # Block media (audio/video) files by extension
        or any(ext in request.url for ext in [".mp3", ".mp4", ".avi", ".mov", ".wav", ".flv", ".mkv", ".webm"])
        # Block font files by extension
        or any(ext in request.url for ext in [".woff", ".woff2", ".ttf", ".eot", ".otf"])
        # Block stylesheets by extension
        or ".css" in request.url
        # Block scripts by extension
        or ".js" in request.url
        # Block xhr and fetch requests
        or request.resource_type in ["xhr", "fetch"]
    )

PLAYWRIGHT_ABORT_REQUEST = should_abort_request
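
For completeness, the same blocking can be done per request instead of globally. Below is an untested sketch using scrapy-playwright's playwright_page_init_callback meta key together with Playwright's page.route(); the function names here are mine:

# Sketch: abort non-text requests with page.route() instead of the global
# PLAYWRIGHT_ABORT_REQUEST setting. Function names are illustrative.
async def block_non_text(route):
    if route.request.resource_type in ("image", "font", "media", "stylesheet", "script"):
        await route.abort()
    else:
        await route.continue_()

async def init_page(page, request):
    # scrapy-playwright calls this right after creating the page
    await page.route("**/*", block_non_text)

# Then, in the spider's start_requests():
# yield scrapy.Request(
#     url,
#     meta={
#         "playwright": True,
#         "playwright_page_init_callback": init_page,
#     },
# )

With either approach, the playwright/request_count/resource_type/* counters in the crawl stats should make it easy to confirm whether the blocking took effect.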
