Scraping data from a website with infinite scroll and dynamically loaded content using Python Selenium


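I have seen many threads about using Selenium in Python to get data from infinite-scroll websites, but unfortunately none of them solved my problem. I think I am just missing something.

I am a beginner with Selenium.

I am trying to get the top 500 movie titles from the Filmweb ranking page. The main problem is that I only get the first 25 titles. I run the scroll script in a while loop, but probably in the wrong place.

I tried the code below: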

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

options = webdriver.EdgeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging']) 
browser = webdriver.Edge(options=options)

browser.get('https://www.filmweb.pl/ranking/film')

accept_button = WebDriverWait(browser, 10).until(
    EC.element_to_be_clickable((By.ID, "didomi-notice-agree-button"))
)
accept_button.click()

browser.implicitly_wait(30)

items = []

last_height = browser.execute_script("return document.body.scrollHeight")
while True:
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(20)

    titles = browser.find_elements(By.CLASS_NAME, "rankingType__originalTitle")
    for i, title in enumerate(titles):
        movie_dict = {f"Movie Number : {i + 1}, 'Title': {title.text}"}
        items.append(movie_dict)

    new_height = browser.execute_script("return document.body.scrollHeight")

    if new_height == last_height:
        break
    last_height = new_height

for movie_title in items:
    print(movie_title)

browser.quit()

The result I get:

{"Movie Number : 1, 'Title': The Shawshank Redemption 1994"}
{"Movie Number : 2, 'Title': Intouchables 2011"}
{"Movie Number : 3, 'Title': The Green Mile 1999"}
{"Movie Number : 4, 'Title': The Godfather 1972"}
{"Movie Number : 5, 'Title': 12 Angry Men 1957"}
{"Movie Number : 6, 'Title': 1994"}
{"Movie Number : 7, 'Title': One Flew Over the Cuckoo's Nest 1975"}
{"Movie Number : 8, 'Title': The Godfather: Part II 1974"}
{"Movie Number : 9, 'Title': The Lord of the Rings: The Return of the King 2003"}
{"Movie Number : 10, 'Title': Schindler's List 1993"}
{"Movie Number : 11, 'Title': 1994"}
{"Movie Number : 12, 'Title': La vita è bella 1997"}
{"Movie Number : 13, 'Title': The Lord of the Rings: The Two Towers 2002"}
{"Movie Number : 14, 'Title': Se7en 1995"}
{"Movie Number : 15, 'Title': Fight Club 1999"}
{"Movie Number : 16, 'Title': Goodfellas 1990"}
{"Movie Number : 17, 'Title': The Pianist 2002"}
{"Movie Number : 18, 'Title': 2019"}
{"Movie Number : 19, 'Title': Django Unchained 2012"}
{"Movie Number : 20, 'Title': A Beautiful Mind 2001"}
{"Movie Number : 21, 'Title': Inception 2010"}
{"Movie Number : 22, 'Title': The Silence of the Lambs 1991"}
{"Movie Number : 23, 'Title': The Lion King 1994"}
{"Movie Number : 24, 'Title': Scarface 1983"}
{"Movie Number : 25, 'Title': 2008"}

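Some entries contain only the year. That is because their original title sits in a different place in the page's source structure; I will deal with that later.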
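
A minimal sketch of how such year-only entries could be detected later, assuming every scraped string ends with a four-digit year as in the output above. The helper name `split_title_and_year` is hypothetical and not part of Selenium.

```python
import re

# Optional title, then a four-digit year at the very end of the string.
_TITLE_YEAR = re.compile(r"^(?:(?P<title>.+?)\s+)?(?P<year>\d{4})$")


def split_title_and_year(text):
    """Return (title, year); title is None when the text holds only a year."""
    cleaned = text.strip()
    match = _TITLE_YEAR.match(cleaned)
    if match is None:
        # No trailing year found: keep the text as the title (None if empty).
        return cleaned or None, None
    return match.group("title"), int(match.group("year"))


if __name__ == "__main__":
    for raw in ["The Shawshank Redemption 1994", "1994", "12 Angry Men 1957"]:
        print(split_title_and_year(raw))
```

Entries whose title comes back as `None` are the ones that would need a second lookup elsewhere in the page.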

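So first I would like to extract all 500 titles somehow; once I know how to solve this problem, I will move on to the other data.

Maybe someone here has run into this problem and can help me.

Thanks in advance.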

python selenium-webdriver web-scraping
1 Answer

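Your main problem is that you do not wait for the ad to close (it closes automatically within 15 seconds and cannot be closed manually). So you scroll to the bottom once, and because the ad is still open, no new blocks are rendered. After accepting cookies, I suggest adding a hard-coded wait (I did not see any option to get rid of that ad) and then implementing continuous scrolling:

  1. Get the list of titles
  2. Scroll to the last title
  3. Wait for the last title's position to become stable

More sophisticated solutions are possible, such as waiting for new elements to render after each scroll and looping until new elements stop appearing, but for getting 500 movies this solution is sufficient.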

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

options = webdriver.EdgeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
browser = webdriver.Edge(options=options)
wait = WebDriverWait(browser, 20)
action_chains = ActionChains(browser)

def wait_for_element_location_to_be_stable(element):
    # Return once the element's location has not changed for one second.
    previous_location = element.location
    start_time = time.time()
    while time.time() - start_time < 1:
        current_location = element.location
        if current_location != previous_location:
            # The element moved: remember the new position and restart the timer.
            previous_location = current_location
            start_time = time.time()
        time.sleep(0.4)

def continuous_scroll(by, selector, times):
    # Each pass moves to the last loaded element so the page renders the next batch.
    for _ in range(times):
        last_container = wait.until(EC.presence_of_all_elements_located((by, selector)))[-1]
        action_chains.move_to_element(last_container).perform()
        wait_for_element_location_to_be_stable(last_container)

browser.get('https://www.filmweb.pl/ranking/film')
browser.maximize_window()

accept_button = WebDriverWait(browser, 10).until(
    EC.element_to_be_clickable((By.ID, "didomi-notice-agree-button"))
)
accept_button.click()

# wait for ads to be closed automatically
time.sleep(20)
items = []

times = 20
selector = "rankingType__originalTitle"

continuous_scroll(By.CLASS_NAME, selector, times)
titles = browser.find_elements(By.CLASS_NAME, selector)
for i, title in enumerate(titles):
    # Store each entry as a dict with separate keys for rank and title.
    movie_dict = {"Movie Number": i + 1, "Title": title.text}
    items.append(movie_dict)

for movie_title in items:
    print(movie_title)

browser.quit()
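
The more elaborate variant mentioned in the answer (loop until new elements stop appearing) could look roughly like the sketch below. The function name `scroll_until_no_new_elements`, its `pause` and `max_rounds` parameters, and the fixed pause between rounds are assumptions for illustration, not part of the answer's code. It relies only on `find_elements` and `execute_script`, which Selenium drivers provide.

```python
import time


def scroll_until_no_new_elements(driver, by, selector, pause=1.0, max_rounds=50):
    """Scroll to the last matching element until the match count stops growing.

    `driver` is any object exposing Selenium's `find_elements` and
    `execute_script`. Returns the final list of matching elements.
    """
    previous_count = -1
    elements = driver.find_elements(by, selector)
    for _ in range(max_rounds):
        if not elements or len(elements) == previous_count:
            break  # nothing matched, or the last scroll loaded nothing new
        previous_count = len(elements)
        # Bring the last element into view so the page can load the next batch.
        driver.execute_script(
            "arguments[0].scrollIntoView({block: 'end'});", elements[-1]
        )
        # Fixed pause (an assumption); too short a pause may end the loop early.
        time.sleep(pause)
        elements = driver.find_elements(by, selector)
    return elements
```

With the objects from the script above, it would be called as `titles = scroll_until_no_new_elements(browser, By.CLASS_NAME, "rankingType__originalTitle")` once the ad has closed, in place of `continuous_scroll`. The `max_rounds` cap keeps the loop from running forever on a page that never stops loading.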