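I am scraping https://kick.com/browse/categories, which loads new category cards every time the page is scrolled. I have tried several approaches with Playwright, but none of them work.
Any suggestions would be greatly appreciated.
I tried the same thing in the Chrome DevTools console and it works fine. Each time I run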
div.scrollBy(0, window.innerHeight); document.documentElement.outerHTML;
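the returned HTML gets longer and contains more and more href links with /category/ in them.
However, with the Playwright approaches below, I always get exactly the same HTML. The HTML after scrolling, as printed by page.content(), always contains 64 href links to categories, which I believe is just the initially loaded content.
This is how I set up the browser: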
async with async_playwright() as p:
    browser = await p.chromium.launch(
        args=[f"--proxy-server={PROXY}"], headless=True, executable_path=CHROME
    )
    context = await browser.new_context(
        user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
    )
    page = await context.new_page()
    await page.goto("https://kick.com/browse/categories")
What I have tried:
scrollTo
# Scroll the main container by one viewport height with JavaScript
PAGE_DOWN_JS = """
div = document.querySelector('#main-container');
div.scrollBy(0, window.innerHeight);
"""
await page.wait_for_selector("#main-container")
for _ in range(100):
    await page.evaluate(PAGE_DOWN_JS)
    await page.wait_for_load_state("networkidle", timeout=120000)
await page.wait_for_load_state("networkidle", timeout=60000)

# Press the PageDown key repeatedly
for _ in range(100):  # Adjust the range according to how much you want to scroll
    await page.keyboard.press("PageDown")

# Scroll with the mouse wheel
await page.mouse.wheel(0, 150000)
time.sleep(10)

# Scroll the last card into view
cards = page.locator("div.flex.flex-col.items-start")
print(f"found {await cards.count()} cards")
all_cards = await cards.all()
await all_cards[-1].scroll_into_view_if_needed()
cards = page.locator("div.flex.flex-col.items-start")
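I have identified the problem with your snippet: the code does not wait for the new content to finish loading after each scroll. I also added an assert that compares the page content after each scroll with the content before it: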
import asyncio
from playwright.async_api import async_playwright

PAGE_DOWN_JS = """
div = document.querySelector('#main-container');
div.scrollBy(0, window.innerHeight);
"""


async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://kick.com/browse/categories")
        await page.wait_for_selector("#main-container")
        prev_content = ""
        for _ in range(3):
            await page.evaluate(PAGE_DOWN_JS)
            # await page.wait_for_load_state("networkidle", timeout=120000)
            await page.wait_for_timeout(5000)  # new
            # new: asyncio.sleep instead of time.sleep, so the event loop is not blocked
            await asyncio.sleep(5)
            content = await page.content()
            # The HTML should have changed after every scroll
            assert content != prev_content
            prev_content = content
        content = await page.content()
        print(content)
        await browser.close()


asyncio.run(main())
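Comparing the whole HTML string works, but the string can also change for reasons that have nothing to do with new cards. As a rough sketch, you could instead count the /category/ links mentioned in the question, using only the standard library. The names CategoryLinkParser and category_links are hypothetical helpers for illustration, not part of Playwright:

```python
from html.parser import HTMLParser


class CategoryLinkParser(HTMLParser):
    """Collect the unique href values of <a> tags that contain '/category/'."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "href" and value and "/category/" in value:
                self.links.add(value)


def category_links(html):
    """Return the sorted unique category hrefs found in an HTML string."""
    parser = CategoryLinkParser()
    parser.feed(html)
    return sorted(parser.links)
```

Inside the loop you could then call category_links(await page.content()) after each scroll and check that the list keeps growing.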
You can also check whether the page has reached the end:
prev_content = ""
prev_scroll_height = 0
while True:
    await page.evaluate(PAGE_DOWN_JS)
    await page.wait_for_timeout(3000)
    scroll_height = await page.evaluate(
        'document.querySelector("#main-container").scrollHeight'
    )
    if scroll_height == prev_scroll_height:
        break
    prev_scroll_height = scroll_height
    content = await page.content()
    assert content != prev_content
    prev_content = content
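If you want to reuse that loop, it could be wrapped in a small helper along these lines. This is only a sketch: scroll_to_end, pause_ms, and max_rounds are names I made up, and the max_rounds cap is my own addition so the loop cannot run forever if the page keeps growing. The stop condition is the same scrollHeight heuristic as above, so it will also stop early if a scroll happens not to trigger a load within the pause.

```python
PAGE_DOWN_JS = """
div = document.querySelector('#main-container');
div.scrollBy(0, window.innerHeight);
"""
HEIGHT_JS = 'document.querySelector("#main-container").scrollHeight'


async def scroll_to_end(page, pause_ms=3000, max_rounds=200):
    """Scroll #main-container until its scrollHeight stops growing.

    `page` is expected to be a Playwright async Page (only evaluate() and
    wait_for_timeout() are used). Returns the number of scroll rounds done.
    """
    prev_height = -1
    for rounds in range(1, max_rounds + 1):
        await page.evaluate(PAGE_DOWN_JS)
        # Fixed pause so the lazily loaded cards have time to render
        await page.wait_for_timeout(pause_ms)
        height = await page.evaluate(HEIGHT_JS)
        if height == prev_height:
            # No growth since the previous round: assume the end was reached
            return rounds
        prev_height = height
    # Hypothetical safety cap reached before the height stabilised
    return max_rounds
```

In main() above, this would be called as await scroll_to_end(page) right after wait_for_selector, followed by a single await page.content().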