Unable to perform infinite scrolling with Playwright on some websites


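I am crawling https://kick.com/browse/categories, which loads new category cards each time the page is scrolled. I have tried several approaches with Playwright, but none of them work.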

Any suggestions would be appreciated.

I tried this in the Chrome DevTools console, where it works well. Each time I run the following, the returned HTML gets longer and contains more href links with /category/:

    div.scrollBy(0, window.innerHeight); document.documentElement.outerHTML;

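With the Playwright approaches below, however, I always get exactly the same HTML. The HTML after scrolling (printed via page.content()) always contains 64 href links to available categories, which I believe is what is loaded initially.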
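
To make the "64 links" check repeatable, the category links in the string returned by page.content() can be counted with the standard library. This is a minimal sketch, not part of the original code: the names CategoryLinkCollector and count_category_links are hypothetical, and it assumes the links of interest are <a> tags whose href contains /category/, counted once per distinct href.

```python
from html.parser import HTMLParser


class CategoryLinkCollector(HTMLParser):
    """Collects href values of <a> tags that contain '/category/'."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "href" and value and "/category/" in value:
                self.links.add(value)


def count_category_links(html: str) -> int:
    """Return the number of distinct category links found in html."""
    collector = CategoryLinkCollector()
    collector.feed(html)
    return len(collector.links)
```

Calling count_category_links(await page.content()) after each scroll shows whether new cards actually arrived, without comparing whole HTML strings.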

This is how I set up the browser driver:

    async with async_playwright() as p:
        browser = await p.chromium.launch(
            args=[f"--proxy-server={PROXY}"], headless=True, executable_path=CHROME
        )
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
        )
        page = await context.new_page()
        await page.goto("https://kick.com/browse/categories")

What I have tried:

  1. Using scrollTo

  2. Using scrollBy

        PAGE_DOWN_JS = """
        div = document.querySelector('#main-container');
        div.scrollBy(0, window.innerHeight);
        """
        await page.wait_for_selector("#main-container")
        for _ in range(100):
            await page.evaluate(PAGE_DOWN_JS)
            await page.wait_for_load_state("networkidle", timeout=120000)

  3. Sending the PageDown key

        await page.wait_for_load_state("networkidle", timeout=60000)
        for _ in range(100):  # Adjust the range according to how much you want to scroll
            await page.keyboard.press("PageDown")

  4. Using the mouse wheel

        await page.mouse.wheel(0, 150000)
        time.sleep(10)

  5. Scrolling to the locator at the bottom of the list

        cards = page.locator("div.flex.flex-col.items-start")
        print(f"found {await cards.count()} cards")
        all_cards = await cards.all()
        await all_cards[-1].scroll_into_view_if_needed()
        cards = page.locator("div.flex.flex-col.items-start")

python web-scraping web-crawler playwright playwright-python
1 Answer

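I have identified the problem with your snippets: the code does not wait for the content loaded by each scroll to finish loading. I also added an assert that compares the page content after each scroll with the content before it: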

import asyncio
from playwright.async_api import async_playwright

# Scroll the page's scrollable container down by one viewport height.
PAGE_DOWN_JS = """
div = document.querySelector('#main-container');
div.scrollBy(0, window.innerHeight);
"""

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://kick.com/browse/categories")

        await page.wait_for_selector("#main-container")

        prev_content = ""
        for _ in range(3):
            await page.evaluate(PAGE_DOWN_JS)
            # await page.wait_for_load_state("networkidle", timeout=120000)
            # new: give the newly requested cards time to load and render.
            # wait_for_timeout is used instead of time.sleep, which would
            # block the asyncio event loop that Playwright runs on.
            await page.wait_for_timeout(10000)
            content = await page.content()
            # Fails if the scroll did not change the page content.
            assert content != prev_content
            prev_content = content

        content = await page.content()
        print(content)
        await browser.close()

asyncio.run(main())
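
A fixed pause either wastes time or is too short on a slow connection. One alternative, sketched here as an assumption rather than something tested against kick.com, is to poll until the number of cards grows. The helper name wait_for_growth and its get_count, previous, timeout_s and interval_s parameters are hypothetical; get_count is any zero-argument callable that returns an awaitable count.

```python
import asyncio
import time


async def wait_for_growth(get_count, previous, timeout_s=10.0, interval_s=0.25):
    """Poll get_count() until it exceeds previous or timeout_s elapses.

    Returns the last count that was read, so the caller can tell
    whether anything new arrived (result > previous) or not.
    """
    deadline = time.monotonic() + timeout_s
    count = await get_count()
    while count <= previous and time.monotonic() < deadline:
        # asyncio.sleep yields to the event loop instead of blocking it
        await asyncio.sleep(interval_s)
        count = await get_count()
    return count
```

With Playwright, get_count could be something like lambda: page.locator("a[href*='/category/']").count(), where the selector is a guess based on the /category/ links mentioned in the question. If the returned value equals previous, no new cards appeared within the timeout.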

You can also check whether the page has reached the bottom:

prev_content = ""
prev_scroll_height = 0
while True:
    await page.evaluate(PAGE_DOWN_JS)
    await page.wait_for_timeout(3000)
    scroll_height = await page.evaluate(
        'document.querySelector("#main-container").scrollHeight'
    )

    if scroll_height == prev_scroll_height:
        break
    prev_scroll_height = scroll_height

    content = await page.content()
    assert content != prev_content
    prev_content = content
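
The same stop condition can be wrapped in a reusable helper. The sketch below is one possible shape, not the code from the answer: the name scroll_until_stable and its pause_ms, patience and max_rounds parameters are hypothetical. It reuses the #main-container selector from above and only needs an object that offers Playwright's async evaluate and wait_for_timeout methods. Requiring several unchanged readings in a row is an added guard against a slow batch being mistaken for the end of the list, and max_rounds caps the loop on a list that never ends.

```python
import asyncio

SCROLL_JS = """
div = document.querySelector('#main-container');
div.scrollBy(0, window.innerHeight);
"""
HEIGHT_JS = 'document.querySelector("#main-container").scrollHeight'


async def scroll_until_stable(page, pause_ms=3000, patience=2, max_rounds=200):
    """Scroll until scrollHeight is unchanged for `patience` rounds in a row.

    Returns the number of scroll rounds performed.
    """
    last_height = -1
    unchanged = 0
    rounds = 0
    while rounds < max_rounds and unchanged < patience:
        await page.evaluate(SCROLL_JS)
        await page.wait_for_timeout(pause_ms)  # let new cards load
        height = await page.evaluate(HEIGHT_JS)
        rounds += 1
        if height == last_height:
            unchanged += 1
        else:
            unchanged = 0
            last_height = height
    return rounds
```

Inside main() it would be called as await scroll_until_stable(page) before reading page.content().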