Scraping data from a website with infinite scrolling and dynamically loaded content using Python Selenium

Question

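I have seen many threads about using Selenium in Python to get data from infinitely scrolling websites, but unfortunately none of them solved my problem. I suspect I am just missing something.

I am a beginner with Selenium.

I am trying to get the titles of the top 500 movies from the Filmweb ranking page. The main problem is that I only get the first 25 titles. I run the scroll script inside a while loop, but probably in the wrong place.

This is the code I tried: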

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

options = webdriver.EdgeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging']) 
browser = webdriver.Edge(options=options)

browser.get('https://www.filmweb.pl/ranking/film')

accept_button = WebDriverWait(browser, 10).until(
    EC.element_to_be_clickable((By.ID, "didomi-notice-agree-button"))
)
accept_button.click()

browser.implicitly_wait(30)

items = []

last_height = browser.execute_script("return document.body.scrollHeight")
while True:
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(20)

    titles = browser.find_elements(By.CLASS_NAME, "rankingType__originalTitle")
    for i, title in enumerate(titles):
        movie_dict = {f"Movie Number : {i + 1}, 'Title': {title.text}"}
        items.append(movie_dict)

    new_height = browser.execute_script("return document.body.scrollHeight")

    if new_height == last_height:
        break
    last_height = new_height

for movie_title in items:
    print(movie_title)

browser.quit()

The result I get:

{"Movie Number : 1, 'Title': The Shawshank Redemption 1994"}
{"Movie Number : 2, 'Title': Intouchables 2011"}
{"Movie Number : 3, 'Title': The Green Mile 1999"}
{"Movie Number : 4, 'Title': The Godfather 1972"}
{"Movie Number : 5, 'Title': 12 Angry Men 1957"}
{"Movie Number : 6, 'Title': 1994"}
{"Movie Number : 7, 'Title': One Flew Over the Cuckoo's Nest 1975"}
{"Movie Number : 8, 'Title': The Godfather: Part II 1974"}
{"Movie Number : 9, 'Title': The Lord of the Rings: The Return of the King 2003"}
{"Movie Number : 10, 'Title': Schindler's List 1993"}
{"Movie Number : 11, 'Title': 1994"}
{"Movie Number : 12, 'Title': La vita è bella 1997"}
{"Movie Number : 13, 'Title': The Lord of the Rings: The Two Towers 2002"}
{"Movie Number : 14, 'Title': Se7en 1995"}
{"Movie Number : 15, 'Title': Fight Club 1999"}
{"Movie Number : 16, 'Title': Goodfellas 1990"}
{"Movie Number : 17, 'Title': The Pianist 2002"}
{"Movie Number : 18, 'Title': 2019"}
{"Movie Number : 19, 'Title': Django Unchained 2012"}
{"Movie Number : 20, 'Title': A Beautiful Mind 2001"}
{"Movie Number : 21, 'Title': Inception 2010"}
{"Movie Number : 22, 'Title': The Silence of the Lambs 1991"}
{"Movie Number : 23, 'Title': The Lion King 1994"}
{"Movie Number : 24, 'Title': Scarface 1983"}
{"Movie Number : 25, 'Title': 2008"}

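Some entries contain only the year. That is because their original title is located elsewhere in the page's source structure; I will handle that later.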
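For reference, year-only entries like numbers 6, 11, 18 and 25 above can be detected with a small parsing helper. This is only a minimal sketch: it assumes the scraped text always ends with a four-digit year, as in the output shown, and the name `split_title_year` is made up for illustration. It does not look up the missing title, it only flags that one is absent.

```python
import re

# Matches text such as "The Green Mile 1999" or a bare "1994".
# Assumption: the scraped text always ends with a four-digit year.
_TITLE_YEAR = re.compile(r"^(?:(?P<title>.+?)\s+)?(?P<year>\d{4})$")


def split_title_year(text):
    """Split scraped text into (title, year).

    The title is None when the text holds only a year; the year is None
    when the text does not end with four digits.
    """
    cleaned = text.strip()
    match = _TITLE_YEAR.match(cleaned)
    if match is None:
        return cleaned or None, None
    return match.group("title"), int(match.group("year"))
```

An entry whose title comes back as `None` is one whose title has to be read from the other place in the page structure.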

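So first I want to extract all 500 titles; once I know how to solve the current problem, I will work with the other data.

Maybe someone here has run into this problem and can help me.

Thanks in advance.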

Answer 1

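Your main problem is that you do not wait for the ad to close (it closes automatically within 15 seconds and cannot be closed manually). As a result, you scroll to the bottom once, and because the ad is still open, no new blocks are rendered. After accepting the cookies, I suggest adding a hard-coded wait (I did not see any way to get rid of that ad) and then implementing a continuous scroll:

Get the list of titles
Scroll to the last title
Wait for the position of the last title to become stable

A more sophisticated solution is possible, for example waiting for new elements to be rendered after each scroll and looping until new elements stop appearing, but for getting 500 movies this solution is sufficient.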

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

options = webdriver.EdgeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
browser = webdriver.Edge(options=options)
wait = WebDriverWait(browser, 20)
action_chains = ActionChains(browser)

def wait_for_element_location_to_be_stable(element):
    # Poll the element's position until it has not changed for about 1 second.
    previous_location = element.location
    start_time = time.time()
    while time.time() - start_time < 1:
        current_location = element.location
        if current_location != previous_location:
            # The element moved: remember the new position and restart the timer.
            previous_location = current_location
            start_time = time.time()
        time.sleep(0.4)

def continuous_scroll(by, selector, times):
    # Each pass moves to the last loaded element so the page renders the next block.
    for _ in range(times):
        last_container = wait.until(EC.presence_of_all_elements_located((by, selector)))[-1]
        action_chains.move_to_element(last_container).perform()
        wait_for_element_location_to_be_stable(last_container)

browser.get('https://www.filmweb.pl/ranking/film')
browser.maximize_window()

accept_button = WebDriverWait(browser, 10).until(
    EC.element_to_be_clickable((By.ID, "didomi-notice-agree-button"))
)
accept_button.click()

# wait for ads to be closed automatically
time.sleep(20)
items = []

times = 20
selector = "rankingType__originalTitle"

continuous_scroll(By.CLASS_NAME, selector, times)
titles = browser.find_elements(By.CLASS_NAME, selector)
for i, title in enumerate(titles):
    # Store each movie as a dict with its rank and the scraped title text.
    movie_dict = {"Movie Number": i + 1, "Title": title.text}
    items.append(movie_dict)

for movie_title in items:
    print(movie_title)

browser.quit()
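
The more sophisticated variant mentioned above (loop until new elements stop appearing) could look roughly like the sketch below. It is not tested against Filmweb. The names `scroll_until_loaded` and `scroll_ranking` and the parameters `target`, `max_idle_rounds` and `pause` are illustrative assumptions, and the cookie click and the wait for the ad to close are still needed before calling it. The loop logic is kept free of Selenium so it can be checked on its own; `scroll_ranking` shows one possible way to wire it to a WebDriver.

```python
import time


def scroll_until_loaded(count_items, scroll_to_last, target=500,
                        max_idle_rounds=3, pause=1.0):
    """Scroll until `target` items exist or no new items appear.

    count_items    -- callable returning how many items are currently loaded
    scroll_to_last -- callable that scrolls the last loaded item into view
    Returns the final item count.
    """
    seen = count_items()
    idle_rounds = 0
    while seen < target and idle_rounds < max_idle_rounds:
        scroll_to_last()
        time.sleep(pause)  # give the page time to render the next block
        current = count_items()
        if current > seen:
            seen = current
            idle_rounds = 0  # progress was made, reset the idle counter
        else:
            idle_rounds += 1  # nothing new appeared in this round
    return seen


def scroll_ranking(browser, by, selector, target=500):
    """Hypothetical Selenium wiring; `browser` is an already opened WebDriver."""
    def count_items():
        return len(browser.find_elements(by, selector))

    def scroll_to_last():
        elements = browser.find_elements(by, selector)
        if elements:
            browser.execute_script("arguments[0].scrollIntoView();", elements[-1])

    return scroll_until_loaded(count_items, scroll_to_last, target=target)
```

With the script above this would replace the fixed `times = 20`, for example `scroll_ranking(browser, By.CLASS_NAME, selector)`, and the loop stops on its own once 500 titles are loaded or the page stops producing new ones.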
