I'm trying to crawl a given list of URLs with Scrapy-Playwright, and I've run into some strange behavior. The crawl starts off fine, but after a certain number of pages it always stops crawling and just keeps printing log lines like this:
2024-09-23 08:30:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
No matter which set of URLs I feed it, it always freezes like this at the same point.
Here is my spider (my_spider.py):
import scrapy
from faker import Faker
from scrapy.spiders import Rule

fake = Faker()

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def start_requests(self):
        # Define the initial URL(s) to scrape
        for url in self.start_urls:
            yield scrapy.Request(url,
                headers={'User-Agent': fake.user_agent()},
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    errback=self.errback,
                ))

    def __init__(self):
        # List of URLs to start scraping
        self.start_urls = ['https://dummy0', ..., 'https://dummy100']  # a list of more than 16 URLs
        self._rules = [Rule(callback=self.parse)]

    def parse(self, response):
        page_title = response.xpath('//title/text()').get()
        yield {
            'url': response.url,
            'title': page_title,
        }

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()
And here is my settings.py (relevant because I'm using Playwright):
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 60 * 1000  # 60 seconds
Why is this happening?
The cause was that the process had used up the resources allocated to it. If you look closely, the default number of concurrent requests (CONCURRENT_REQUESTS in settings.py) is also 16. That's why the crawl froze after exactly 16 requests: each request kept its Playwright page open (because of playwright_include_page=True) and never released it, so no concurrency slot was ever freed. Here is how I fixed it.
I still keep it at 16:
CONCURRENT_REQUESTS = 16 # in settings.py
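One quick way to confirm this diagnosis before touching the spider is to temporarily lower the limit and check that the freeze point moves with it (for example, stalling after 4 pages instead of 16):
# settings.py -- temporary diagnostic only; revert afterwards
CONCURRENT_REQUESTS = 4  # with unreleased pages, the crawl should now stall after 4 requests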
But now the resources are released after use (in my_spider.py):
import scrapy
from faker import Faker
from scrapy.spiders import Rule
import asyncio

fake = Faker()

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def start_requests(self):
        # Define the initial URL(s) to scrape
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                headers={'User-Agent': fake.user_agent()},
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                ),
                # errback is a Request argument, not a meta key; inside meta it never fires
                errback=self.errback,
            )

    def __init__(self):
        # List of URLs to start scraping
        self.start_urls = ['https://dummy0', ..., 'https://dummy100']  # a list of more than 16 URLs
        self._rules = [Rule(callback=self.parse)]

    def parse(self, response):
        page = response.meta.get("playwright_page")
        # This is where you'll extract the data from the crawled pages.
        # As an example, we just yield the title of each page.
        try:
            page_title = response.xpath('//title/text()').get()
            yield {
                'url': response.url,
                'title': page_title,
            }
        finally:
            # Ensure the Playwright page is closed after processing,
            # so its concurrency slot is released.
            if page:
                asyncio.ensure_future(page.close())

    async def errback(self, failure):
        page = failure.request.meta.get("playwright_page")
        if page:
            await page.close()
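For completeness: an equivalent, arguably cleaner way to release the page, assuming the asyncio reactor configured in settings.py above, is to make the callback a coroutine and await page.close() directly instead of scheduling it with asyncio.ensure_future. A minimal sketch (placeholder names and URLs, headers omitted):
import scrapy

class MyAsyncSpider(scrapy.Spider):
    name = 'my_async_spider'
    start_urls = ['https://dummy0', 'https://dummy100']  # placeholder URLs

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta=dict(playwright=True, playwright_include_page=True),
                errback=self.errback,
            )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        title = await page.title()
        # Awaiting the close here guarantees the page is released
        # before the callback finishes, with no ensure_future needed.
        await page.close()
        return {'url': response.url, 'title': title}

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()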
Also, in case you're interested, these are the additional settings I added in settings.py:
# Concurrency (these are Scrapy's defaults, just made explicit)
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
# Increase timeouts
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 90 * 1000 # 90 seconds
DOWNLOAD_TIMEOUT = 120 # 120 seconds
# Retry failed requests
RETRY_ENABLED = True
RETRY_TIMES = 5
# Max Playwright contexts
PLAYWRIGHT_MAX_CONTEXTS = 4
# Logging level
LOG_LEVEL = 'DEBUG'
# Playwright download handlers
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# Standard settings kept from before
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
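If you prefer not to change these values project-wide, most of them can also be scoped to a single spider via Scrapy's custom_settings class attribute. A minimal sketch (I would still leave the reactor and download-handler settings in settings.py):
import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'

    # Per-spider overrides; these take precedence over settings.py
    # for this spider only.
    custom_settings = {
        "CONCURRENT_REQUESTS": 16,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
        "PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT": 90 * 1000,
        "DOWNLOAD_TIMEOUT": 120,
        "RETRY_ENABLED": True,
        "RETRY_TIMES": 5,
        "PLAYWRIGHT_MAX_CONTEXTS": 4,
    }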