How do I scrape a web page that shows its information by scrolling instead of by pagination?


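I'm learning web scraping and I'm trying to get data from a page that loads more information as you scroll. What can I do in this case? Is there a function that makes the whole page load? I'm using Selenium and BeautifulSoup.

Here is the code: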

html = driver.page_source
bs = BeautifulSoup(html, 'html.parser')

games = bs.find_all('div', {'class': 'GVj7ae imso-medium-font qJnhT imso-ani'})
for game in games:
    print(game.get_text())

I read about a script that scrolls the page, but it doesn't work: it just gives me duplicated data. For example, if the output is

Apertura · Jornada 1 de 17
Apertura · Jornada 2 de 17
Apertura · Jornada 3 de 17
Apertura · Jornada 4 de 17
Apertura · Jornada 5 de 17

the scrolling script gives me:

Apertura · Jornada 1 de 17
Apertura · Jornada 2 de 17
Apertura · Jornada 3 de 17
Apertura · Jornada 4 de 17
Apertura · Jornada 5 de 17
Apertura · Jornada 1 de 17
Apertura · Jornada 2 de 17
Apertura · Jornada 3 de 17
Apertura · Jornada 4 de 17
Apertura · Jornada 5 de 17

This is the script:

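import time

from bs4 import BeautifulSoup
from selenium import webdriver

# `service`, `chrome_options`, and `url` are defined earlier (not shown)
driver = webdriver.Chrome(service=service, options=chrome_options)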
driver.get(url)

scroll_pause_time = 2 
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    
    time.sleep(scroll_pause_time)
    
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

html = driver.page_source
bs = BeautifulSoup(html, 'html.parser')

games = bs.find_all('div', {'class': 'GVj7ae imso-medium-font qJnhT imso-ani'})
for game in games:
    print(game.get_text())

driver.quit()

python selenium-webdriver web-scraping beautifulsoup selenium-chromedriver
1 Answer

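You are scrolling the main window, but the matches are inside a scrollable div, so that div is what needs to be scrolled.

Here is a quick example of how to do it: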

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

driver = webdriver.Chrome()
url = 'https://www.google.com/search?q=Liga+MX#sie=lg;/g/11y5frthr0;2;/m/0446bd;bs;hd;'
driver.get(url)

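# Wait for the first header; the last word of its text is parsed as the total
# (for example, the 17 in "Apertura · Jornada 1 de 17")
header = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'div[data-title]')))
total_matches = int(header.text.split(' ')[-1])

# The results live in their own scrollable container, not in the main window
container = driver.find_element(By.CSS_SELECTOR, 'div[jsname=GMIS8e]')
while True:
    # Scroll the container itself to its bottom so more items get loaded
    driver.execute_script('arguments[0].scrollTo(0, arguments[0].scrollHeight);', container)
    time.sleep(0.2)

    # Stop once at least as many headers as the parsed total are present
    if len(container.find_elements(By.CSS_SELECTOR, 'div[data-title]')) >= total_matches:
        # One final scroll so the last batch has a chance to load
        driver.execute_script('arguments[0].scrollTo(0, arguments[0].scrollHeight);', container)
        time.sleep(0.2)
        break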


match_tables = container.find_elements(By.CSS_SELECTOR, 'td.liveresults-sports-immersive__match-tile table')
print(f'{len(match_tables) = }')
driver.quit()
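
The loop above relies on knowing the total in advance and keeps running until that number is reached. When no such total is available, one possible variation is to stop once the number of loaded items stops growing, with a cap on the number of scrolls. The sketch below is only an illustration under that assumption: `scroll_until_stable` and its parameters (`stable_rounds`, `max_rounds`) are hypothetical names, not part of Selenium, and the function takes plain callables so it can be tried without a browser.

```python
import time
from typing import Callable


def scroll_until_stable(
    scroll: Callable[[], None],
    count_items: Callable[[], int],
    pause: float = 0.2,
    stable_rounds: int = 3,
    max_rounds: int = 200,
) -> int:
    """Call scroll() until count_items() stops growing; return the final count.

    Hypothetical helper: stops after `stable_rounds` consecutive scrolls that
    load nothing new, or after `max_rounds` scrolls, whichever comes first.
    """
    last_count = count_items()
    unchanged = 0
    for _ in range(max_rounds):
        scroll()
        time.sleep(pause)  # give the page time to render newly loaded items
        current = count_items()
        if current > last_count:
            last_count = current
            unchanged = 0
        else:
            unchanged += 1
            if unchanged >= stable_rounds:
                break
    return last_count


# Possible wiring with the Selenium objects from the example above
# (must run before driver.quit()):
# loaded = scroll_until_stable(
#     scroll=lambda: driver.execute_script(
#         'arguments[0].scrollTo(0, arguments[0].scrollHeight);', container),
#     count_items=lambda: len(
#         container.find_elements(By.CSS_SELECTOR, 'div[data-title]')),
# )
```

The pause and the number of "stable" rounds are guesses that would need tuning for a real page: too short a pause can make the loop stop before slow content arrives.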

That said, I would suggest using an API instead; there are many free football APIs.
