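I have seen many threads about using Selenium in Python to get data from infinite-scroll websites, but unfortunately none of them solved my problem. I think I am just missing something.
I am a beginner with Selenium.
I am trying to get the titles of the top 500 movies from the Filmweb ranking page. The main problem is that I only get the first 25 titles. I run the scroll script inside a while loop, but probably in the wrong place.
This is the code I tried: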
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

options = webdriver.EdgeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
browser = webdriver.Edge(options=options)
browser.get('https://www.filmweb.pl/ranking/film')
accept_button = WebDriverWait(browser, 10).until(
    EC.element_to_be_clickable((By.ID, "didomi-notice-agree-button"))
)
accept_button.click()
browser.implicitly_wait(30)
items = []
last_height = browser.execute_script("return document.body.scrollHeight")
while True:
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(20)
    titles = browser.find_elements(By.CLASS_NAME, "rankingType__originalTitle")
    for i, title in enumerate(titles):
        movie_dict = {f"Movie Number : {i + 1}, 'Title': {title.text}"}
        items.append(movie_dict)
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
for movie_title in items:
    print(movie_title)
browser.quit()
The result I get:
{"Movie Number : 1, 'Title': The Shawshank Redemption 1994"}
{"Movie Number : 2, 'Title': Intouchables 2011"}
{"Movie Number : 3, 'Title': The Green Mile 1999"}
{"Movie Number : 4, 'Title': The Godfather 1972"}
{"Movie Number : 5, 'Title': 12 Angry Men 1957"}
{"Movie Number : 6, 'Title': 1994"}
{"Movie Number : 7, 'Title': One Flew Over the Cuckoo's Nest 1975"}
{"Movie Number : 8, 'Title': The Godfather: Part II 1974"}
{"Movie Number : 9, 'Title': The Lord of the Rings: The Return of the King 2003"}
{"Movie Number : 10, 'Title': Schindler's List 1993"}
{"Movie Number : 11, 'Title': 1994"}
{"Movie Number : 12, 'Title': La vita è bella 1997"}
{"Movie Number : 13, 'Title': The Lord of the Rings: The Two Towers 2002"}
{"Movie Number : 14, 'Title': Se7en 1995"}
{"Movie Number : 15, 'Title': Fight Club 1999"}
{"Movie Number : 16, 'Title': Goodfellas 1990"}
{"Movie Number : 17, 'Title': The Pianist 2002"}
{"Movie Number : 18, 'Title': 2019"}
{"Movie Number : 19, 'Title': Django Unchained 2012"}
{"Movie Number : 20, 'Title': A Beautiful Mind 2001"}
{"Movie Number : 21, 'Title': Inception 2010"}
{"Movie Number : 22, 'Title': The Silence of the Lambs 1991"}
{"Movie Number : 23, 'Title': The Lion King 1994"}
{"Movie Number : 24, 'Title': Scarface 1983"}
{"Movie Number : 25, 'Title': 2008"}
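Some entries contain only the year. That is because the original title of those movies sits in a different place in the page structure; I will deal with that later.
So first I want to extract all 500 titles somehow, and once I know how to solve the current problem I will move on to the other data.
Maybe someone here has run into this problem and can help me.
Thanks in advance.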
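Your main problem is that you do not wait for the ad to close (it closes automatically within 15 seconds and cannot be closed manually). So you scroll to the bottom once, and because the ad is still open, no new block is rendered. The page height therefore does not change and your loop exits after the first pass.
A more sophisticated solution is possible, for example waiting for new elements to render after each scroll and looping until new elements stop appearing, but for getting 500 movies the solution below is enough.
After accepting the cookies, I suggest adding a hard-coded wait (I did not see any way to get rid of that ad) and then implementing continuous scrolling: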
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

options = webdriver.EdgeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
browser = webdriver.Edge(options=options)
wait = WebDriverWait(browser, 20)
action_chains = ActionChains(browser)


def wait_for_element_location_to_be_stable(element):
    # Return once the element's position has not changed for about 1 second.
    previous_location = element.location
    start_time = time.time()
    while time.time() - start_time < 1:
        current_location = element.location
        if current_location != previous_location:
            # The element moved, so restart the stability timer.
            previous_location = current_location
            start_time = time.time()
        time.sleep(0.4)


def continuous_scroll(by, selector, times):
    # Each pass moves to the last matching element, which triggers loading of the next block.
    for i in range(times):
        last_container = wait.until(EC.presence_of_all_elements_located((by, selector)))[-1]
        action_chains.move_to_element(last_container).perform()
        wait_for_element_location_to_be_stable(last_container)


browser.get('https://www.filmweb.pl/ranking/film')
browser.maximize_window()
accept_button = WebDriverWait(browser, 10).until(
    EC.element_to_be_clickable((By.ID, "didomi-notice-agree-button"))
)
accept_button.click()
# wait for ads to be closed automatically
time.sleep(20)

items = []
times = 20  # 25 titles are present initially; each scroll is expected to load another block
selector = "rankingType__originalTitle"
continuous_scroll(By.CLASS_NAME, selector, times)

# Collect the titles once, after all scrolling is done, so nothing is appended twice.
titles = browser.find_elements(By.CLASS_NAME, selector)
for i, title in enumerate(titles):
    # Same output format as in the question (a set holding one formatted string).
    movie_dict = {f"Movie Number : {i + 1}, 'Title': {title.text}"}
    items.append(movie_dict)
for movie_title in items:
    print(movie_title)
browser.quit()
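For the more sophisticated variant mentioned above (loop until new elements stop appearing), a minimal sketch could look like the following. The function name `scroll_until_no_new_items` and its parameters are hypothetical, not part of Selenium. It takes two callables so the stopping logic stays independent of the browser, and it assumes that `pause` is long enough for the page to render the next block.

```python
import time
from typing import Callable, Optional


def scroll_until_no_new_items(
    count_items: Callable[[], int],
    scroll_once: Callable[[], None],
    max_rounds: int = 50,
    pause: float = 1.0,
    target: Optional[int] = None,
) -> int:
    """Scroll until the item count stops growing, `target` is reached,
    or `max_rounds` scrolls were done. Returns the last item count seen."""
    seen = count_items()
    for _ in range(max_rounds):
        if target is not None and seen >= target:
            break  # enough items are already on the page
        scroll_once()
        time.sleep(pause)  # assumption: the next block renders within `pause` seconds
        current = count_items()
        if current <= seen:
            break  # nothing new appeared, treat this as the end of the list
        seen = current
    return seen
```

With the objects from the code above, `count_items` could be `lambda: len(browser.find_elements(By.CLASS_NAME, selector))`, `scroll_once` could be `lambda: action_chains.move_to_element(browser.find_elements(By.CLASS_NAME, selector)[-1]).perform()`, and `target` could be 500. If the page renders slowly, too short a `pause` would make the loop stop early, so this probably still needs the initial wait for the ad.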