I am trying to implement a crawler that is responsible for crawling a given page. I do not want to fetch any non-text items, or even have the headless browser load them, because that is wasteful and needlessly increases the crawl time. To achieve this I added some rules that should drop that unwanted content, but it is not working as expected. From the logs I can clearly see that some non-text content is still being loaded and takes extra time to crawl:
... ... ...
'playwright/request_count/resource_type/document': 1,
'playwright/request_count/resource_type/font': 1,
'playwright/request_count/resource_type/image': 20,
'playwright/request_count/resource_type/script': 6,
'playwright/request_count/resource_type/stylesheet': 3,
'playwright/response_count': 30,
'playwright/response_count/method/GET': 30,
'playwright/response_count/resource_type/document': 1,
'playwright/response_count/resource_type/font': 1,
'playwright/response_count/resource_type/image': 20,
'playwright/response_count/resource_type/script': 5,
'playwright/response_count/resource_type/stylesheet': 3,
... ... ...
How can I prevent them from being loaded? I am using Scrapy together with Playwright so it can handle dynamic content. Here is my code:
from scrapy.spiders import Spider, Rule
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.linkextractors import IGNORED_EXTENSIONS
import tldextract
from scrapy.exceptions import CloseSpider


class LinkCrawlerSpiderSelective(Spider):
    name = 'link_crawler_selective'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = ["https://books.toscrape.com/"]  # Enter URLs to crawl
        self.hard_limit = 100  # Maximum number of pages to crawl
        self.total_page_visited = 0
        # allowed_domains is derived from self.start_urls because we only want the start_url
        # and/or pages which are direct children of it. Even if other pages are reachable from
        # the start_url but do not share the same context path, we will not take them.
        self.allowed_domains = []
        for url in self.start_urls:
            domain = tldextract.extract(url).registered_domain
            self.allowed_domains.append(domain)
        self._rules = [Rule(LinkExtractor(allow_domains=self.allowed_domains, deny_extensions=IGNORED_EXTENSIONS))]

    def start_requests(self):
        # Define the initial URL(s) to scrape
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                ),
                errback=self.errback,  # errback is a Request argument, not a meta key
            )

    def parse(self, response):
        self.total_page_visited += 1
        if self.total_page_visited > self.hard_limit:
            raise CloseSpider(f'Hard limit exceeded. Maximum number of pages to crawl is set to be: {self.hard_limit}')
        # Extract all text nodes except those within script and style tags
        text = ' '.join(response.xpath('//*[not(self::script) and not(self::style)]/text()').getall())
        yield {
            "link": response.url,
            "text": "\n".join([" ".join(str(l).split()) for l in text.split("\n") if str(l).strip()])
        }

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()
And here is my settings.py:
# settings.py
import os

SPIDER_MODULES = [f"{os.path.split(os.getcwd())[-1]}.spiders"]

# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
FEED_EXPORT_ENCODING = "utf-8"

#### FOR PLAYWRIGHT
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 60 * 1000  # 60 seconds
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
DOWNLOAD_DELAY = 2  # minimum download delay
AUTOTHROTTLE_ENABLED = True

# scrapy-playwright requires the asyncio reactor
from scrapy.utils.reactor import install_reactor
install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
This can be solved by implementing a function that aborts requests for non-text content. Keep this function inside settings.py. Below is the code to add (modify the extensions or resource types as needed):
def should_abort_request(request):
    print(f"Request resource type: {request.resource_type}, URL: {request.url}")
    return (
        # Block non-text resource types outright
        request.resource_type in ["image", "font", "media", "stylesheet", "script"]
        # Block images by extension
        or any(ext in request.url for ext in [".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".ico"])
        # Block media (audio/video)
        or any(ext in request.url for ext in [".mp3", ".mp4", ".avi", ".mov", ".wav", ".flv", ".mkv", ".webm"])
        # Block fonts
        or any(ext in request.url for ext in [".woff", ".woff2", ".ttf", ".eot", ".otf"])
        # Block stylesheets
        or ".css" in request.url
        # Block scripts
        or ".js" in request.url
        # Block xhr and fetch requests
        or request.resource_type in ["xhr", "fetch"]
    )


PLAYWRIGHT_ABORT_REQUEST = should_abort_request
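With this in place, scrapy-playwright passes every request the page issues through should_abort_request and aborts the ones for which it returns True, so the browser stops downloading images, fonts, stylesheets and scripts and the corresponding playwright/response_count/resource_type/... numbers in your stats should drop accordingly. One caveat: bare substring checks such as ".js" in request.url also match URLs like data.json. If you prefer to match only on the resource type and the URL path extension, here is a minimal sketch of a tighter variant (the BLOCKED_* names are just illustrative, not part of scrapy-playwright):

from urllib.parse import urlsplit

# Resource types that never carry the page text we want (assumes only the HTML document is needed).
BLOCKED_RESOURCE_TYPES = {"image", "font", "media", "stylesheet", "script", "xhr", "fetch"}

# Extensions are checked against the URL path only, so query strings cannot cause false matches.
BLOCKED_EXTENSIONS = (
    ".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".ico",
    ".mp3", ".mp4", ".avi", ".mov", ".wav", ".flv", ".mkv", ".webm",
    ".woff", ".woff2", ".ttf", ".eot", ".otf", ".css", ".js",
)

def should_abort_request(request):
    path = urlsplit(request.url).path.lower()
    return request.resource_type in BLOCKED_RESOURCE_TYPES or path.endswith(BLOCKED_EXTENSIONS)

PLAYWRIGHT_ABORT_REQUEST = should_abort_request

Keep in mind that aborting script, xhr and fetch requests also prevents JavaScript-rendered text from appearing. That is fine for books.toscrape.com, where the text is already in the initial HTML, but for genuinely dynamic pages you may need to let scripts through.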