Scrapy-Splash does not load dynamic content


I am using Splash with Scrapy to load dynamically rendered content on a page, but it is not working as I expected. In settings.py I set these variables:

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
}
SPLASH_URL="http://localhost:8050"
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
SPLASH_COOKIES_DEBUG = False

The spider:

def start_requests(self):
        urls = [
            "https://callmeduy.com/san-pham/"
        ]
        for url in urls:
            yield SplashRequest(url=url, 
                                # endpoint='render.html', 
                                callback=self.parse, 
                                args={
                                    'wait': 5
                                })

def parse(self, response):
        body = response.xpath("//body").get()
        print(body)
        # Save the rendered body for inspection; the context manager closes the file.
        with open('res.html', 'w', encoding='utf-8') as f:
            f.write(body)

The dynamic content is not loaded. Here is the response body.

If anyone knows how to fix this, please help.

python web-scraping scrapy scrapy-splash
1 Answer

I could not get this to work with Splash, probably because I am not very familiar with it.

However, I do have a working solution that uses Scrapy with Playwright.

Here is requirements.txt:

Scrapy==2.11.2
playwright==1.44.0
scrapy-playwright==0.0.35
beautifulsoup4==4.12.3

settings.py

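DOWNLOAD_HANDLERS = {
  "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
  "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30 * 1000  # milliseconds
PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {
  "headless": False,  # shows the browser window; set to True to run without a UI
}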

And the spider:

import scrapy
from bs4 import BeautifulSoup


class CallmeduySpider(scrapy.Spider):
    name = "callmeduy"
    allowed_domains = ["callmeduy.com"]

    def start_requests(self):
        url = "https://callmeduy.com/san-pham"
        yield scrapy.Request(
            url,
            meta=dict(
                playwright=True,
                # Expose the Playwright page object to the callback.
                playwright_include_page=True,
            ),
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]

        while True:
            soup = BeautifulSoup(await page.content(), "lxml")
            # Skeleton placeholders are present while the product list is still loading.
            wait = soup.select_one(".card-title.h5 > span span.react-loading-skeleton")

            if not wait:
                self.logger.debug("====================================================")
                for card in soup.select(".jss23 .row .col-12"):
                    link = card.select_one("a.jss29")
                    title = card.select_one(".card-title.h5 > span.jss31")

                    self.logger.debug(title.get_text())
                    self.logger.debug(link["href"])

                    # TODO: Probably yield another scrapy.Request() here for each product?
                self.logger.debug("====================================================")

                # Close the page explicitly; it stays open otherwise.
                await page.close()
                return
            else:
                self.logger.info("Waiting for skeleton to load.")
                # Non-blocking wait; time.sleep() would block the event loop.
                await page.wait_for_timeout(5000)

Sample output:

2024-06-17 09:25:41 [callmeduy] DEBUG: ====================================================                                                                                                                                             
2024-06-17 09:25:41 [callmeduy] DEBUG: Sữa Chống Nắng  Đao Cocoon                                                                                                                                                                     
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1451                                                                                                                                                                                   
2024-06-17 09:25:41 [callmeduy] DEBUG: COSRX The Hyaluronic Acid 3...                                                                                                                                                                   
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1450                                                                                                                                                                                   
2024-06-17 09:25:41 [callmeduy] DEBUG: Kem chống nắng Skin1004 Mad...                                                                                                                                                                   
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1449                                                                                                                                                                                   
2024-06-17 09:25:41 [callmeduy] DEBUG: COSRX The Niacinamide 15% S...                                                                                                                                                                   
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1448                                                                                                                                                                                   
2024-06-17 09:25:41 [callmeduy] DEBUG: Nacific Origin Red Salicyli...                                                                                                                                                                   
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1446                                                                                                                                                                                   
2024-06-17 09:25:41 [callmeduy] DEBUG: Skin1004 Madagascar Centell...                                                                                                                                                                   
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1445                                                                                                                                                                                   
2024-06-17 09:25:41 [callmeduy] DEBUG: ACNACARE GEL Mega We Care                                                                                                                                                                        
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1444                                                                                                                                                                                   
2024-06-17 09:25:41 [callmeduy] DEBUG: Viên uống ACNACARE Mega We ...                                                                                                                                                                   
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1443                                                                                                                                                                                   
2024-06-17 09:25:41 [callmeduy] DEBUG: Serum NNO VITE Mega We Care                                                                                                                                                                      
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1442                                                                                                                                                                                   
2024-06-17 09:25:41 [callmeduy] DEBUG: Neutrogena Hydro Boost Acti...                                                                                                                                                                   
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1441                                                                                                                                                                                   
2024-06-17 09:25:41 [callmeduy] DEBUG: Neutrogena Hydroboost Clean...                                                                                                                                                                   
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1440                                                                                                                                                                                   
2024-06-17 09:25:41 [callmeduy] DEBUG: Skin Recovery Cream                                                                                                                                                                              
2024-06-17 09:25:41 [callmeduy] DEBUG: /san-pham/1439                                                                                                                                                                                   
2024-06-17 09:25:41 [callmeduy] DEBUG: ====================================================

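As the comment in the code notes, you will probably want to follow these links to the actual product pages (or perhaps not?). You will also need to handle pagination of the results on the index page.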
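The links in the sample output are relative (for example /san-pham/1451), so they need to be resolved against the index page before they can be requested. Here is a minimal sketch of that step using only the standard library. The helper name absolute_product_urls and the INDEX_URL constant are my own and purely illustrative, and the de-duplication is an assumption about what you would want:

```python
from urllib.parse import urljoin

# Assumed base: the index page URL used in the spider above.
INDEX_URL = "https://callmeduy.com/san-pham"


def absolute_product_urls(hrefs, base=INDEX_URL):
    """Resolve relative hrefs against the index page, dropping duplicates in order."""
    seen = set()
    urls = []
    for href in hrefs:
        url = urljoin(base, href)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Inside the spider, response.urljoin(link["href"]) should give the same result for a single link, and each resolved URL could then be passed to a new scrapy.Request with a product callback of your own.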

Still, this code should at least get you started with something that produces results.
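One thing to be aware of: the polling loop in the spider has no upper bound, so it would spin forever if the skeleton never disappears. Below is a rough sketch of a bounded, non-blocking polling helper built on asyncio. The name wait_until and its parameters are hypothetical, and the predicate is whatever check you need (here, "the skeleton is gone"):

```python
import asyncio


async def wait_until(predicate, timeout=30.0, interval=0.5):
    """Poll an async predicate until it returns True or the timeout expires.

    Returns True if the predicate succeeded, False if the timeout was reached.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        if await predicate():
            return True
        if loop.time() >= deadline:
            return False
        # Yield to the event loop instead of blocking it.
        await asyncio.sleep(interval)


async def demo():
    # Hypothetical stand-in for "the skeleton has disappeared":
    # it reports success on the third check.
    calls = {"n": 0}

    async def skeleton_gone():
        calls["n"] += 1
        return calls["n"] >= 3

    async def never():
        return False

    ok = await wait_until(skeleton_gone, timeout=2.0, interval=0.01)
    timed_out = await wait_until(never, timeout=0.05, interval=0.01)
    return ok, timed_out, calls["n"]
```

In the spider, the predicate would fetch page.content() and look for the skeleton selector. Playwright's own page.wait_for_selector() with state="detached" may be a simpler way to get the same effect, though I have not tried it against this site.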
