I'm using Splash with Scrapy to load dynamically rendered content on a page, but it's not working as I expected. In settings.py I set these variables:
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
}
SPLASH_URL = "http://localhost:8050"
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
SPLASH_COOKIES_DEBUG = False
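These settings assume a Splash instance is actually listening at localhost:8050; if nothing answers there, the SplashRequest simply fails with a connection error. One common way to run Splash (assuming Docker is available) is:

```shell
# Start a Splash instance on port 8050 (official image)
docker run --rm -p 8050:8050 scrapinghub/splash
```

You can confirm it is up by opening http://localhost:8050 in a browser.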
The spider:
from scrapy_splash import SplashRequest

def start_requests(self):
    urls = [
        "https://callmeduy.com/san-pham/",
    ]
    for url in urls:
        yield SplashRequest(
            url=url,
            # endpoint='render.html',
            callback=self.parse,
            args={'wait': 5},
        )

def parse(self, response):
    print(response.xpath("//body").get())
    with open('res.html', 'w+') as f:
        f.write(response.xpath("//body").get())
The dynamic content hasn't loaded. Here is the response body. Please help if anyone knows what's wrong.
I couldn't get it working with Splash either, possibly because I'm not very familiar with it.
However, I do have a working solution using Scrapy with Playwright.
Here is the requirements.txt:
Scrapy==2.11.2
playwright==1.44.0
scrapy-playwright==0.0.35
beautifulsoup4==4.12.3
In settings.py:
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# scrapy-playwright also requires the asyncio Twisted reactor:
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30 * 1000  # milliseconds
PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": False,
}
And the spider:
import asyncio

import scrapy
from bs4 import BeautifulSoup


class CallmeduySpider(scrapy.Spider):
    name = "callmeduy"
    allowed_domains = ["callmeduy.com"]

    def start_requests(self):
        url = "https://callmeduy.com/san-pham"
        yield scrapy.Request(
            url,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
            ),
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        try:
            while True:
                # "lxml" requires the lxml package; "html.parser" works without it
                soup = BeautifulSoup(await page.content(), "lxml")
                wait = soup.select_one(".card-title.h5 > span span.react-loading-skeleton")
                if not wait:
                    self.logger.debug("====================================================")
                    for card in soup.select(".jss23 .row .col-12"):
                        link = card.select_one("a.jss29")
                        title = card.select_one(".card-title.h5 > span.jss31")
                        self.logger.debug(title.get_text())
                        self.logger.debug(link["href"])
                        # TODO: Probably yield another scrapy.Request() here for each product?
                    self.logger.debug("====================================================")
                    return
                self.logger.info("Waiting for skeleton to load.")
                # time.sleep() would block the event loop; use asyncio.sleep() instead
                await asyncio.sleep(5)
        finally:
            # playwright_include_page=True makes us responsible for closing the page
            await page.close()
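One detail worth flagging in a polling loop like this: `time.sleep()` inside an async callback blocks Scrapy's whole event loop, so nothing else (including the browser) makes progress during the wait; `asyncio.sleep()` yields control instead. A minimal stdlib sketch of the non-blocking polling pattern:

```python
import asyncio
import time


async def poll_until(predicate, interval=0.01, attempts=10):
    """Re-check predicate() with a non-blocking pause between tries."""
    for _ in range(attempts):
        if predicate():
            return True
        await asyncio.sleep(interval)  # yields to the event loop, unlike time.sleep()
    return False


# Demo: the condition becomes true shortly after polling starts.
deadline = time.monotonic() + 0.03
result = asyncio.run(poll_until(lambda: time.monotonic() >= deadline))
print(result)  # True once the deadline has passed
```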
Sample output:
2024-06-17 09:25:41 [callmeduy] DEBUG: ====================================================
2024-06-17 09:25:41 [callmeduy] DEBUG: Sữa Chống Nắng Bí Đao Cocoon
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1451
2024-06-17 09:25:41 [callmeduy] DEBUG: COSRX The Hyaluronic Acid 3...
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1450
2024-06-17 09:25:41 [callmeduy] DEBUG: Kem chống nắng Skin1004 Mad...
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1449
2024-06-17 09:25:41 [callmeduy] DEBUG: COSRX The Niacinamide 15% S...
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1448
2024-06-17 09:25:41 [callmeduy] DEBUG: Nacific Origin Red Salicyli...
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1446
2024-06-17 09:25:41 [callmeduy] DEBUG: Skin1004 Madagascar Centell...
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1445
2024-06-17 09:25:41 [callmeduy] DEBUG: ACNACARE GEL Mega We Care
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1444
2024-06-17 09:25:41 [callmeduy] DEBUG: Viên uống ACNACARE Mega We ...
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1443
2024-06-17 09:25:41 [callmeduy] DEBUG: Serum NNO VITE Mega We Care
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1442
2024-06-17 09:25:41 [callmeduy] DEBUG: Neutrogena Hydro Boost Acti...
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1441
2024-06-17 09:25:41 [callmeduy] DEBUG: Neutrogena Hydroboost Clean...
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1440
2024-06-17 09:25:41 [callmeduy] DEBUG: Skin Recovery Cream
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1439
2024-06-17 09:25:41 [callmeduy] DEBUG: ====================================================
As the comment in the code points out, you will probably want to follow those links through to the actual product pages (or maybe not?). You will also need to handle pagination of the results on the index page.
Still, this code should get you started with something that at least produces results.
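Note that the hrefs in the log are relative (e.g. /san-pham/1451). Inside a spider, `response.urljoin()` resolves them for you; the same resolution can be done with the stdlib:

```python
from urllib.parse import urljoin

# Resolve a relative product href against the index page URL
base = "https://callmeduy.com/san-pham"
href = "/san-pham/1451"
absolute = urljoin(base, href)
print(absolute)  # https://callmeduy.com/san-pham/1451
```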