I am scraping the top 250 movies on IMDb, but my code only puts the first movie into the dictionary.
This is the code I wrote:
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

url = 'https://m.imdb.com/chart/top/'
driver = webdriver.Edge()
driver.get(url)
title = []
year = []
container = driver.find_elements(By.XPATH, './/ul[contains(@role, "presentation")]')
for x in container:
    try:
        title.append(x.find_element(By.XPATH, './/a[contains(@class, "title")]').text)
        year.append(x.find_element(By.XPATH, './/span[contains(@class, "title")]').text)
    except:
        pass
    # print(x.text)
df_movie = pd.DataFrame({'title' : title, 'year' : year})
df_movie.to_csv(r'C:\Users\martha\OneDrive\Python\imbd_project.csv', index=False)
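Some of your locators are incorrect, and you need to add a wait. Your container locator most likely matches the `ul` list itself rather than each `li` entry, so the loop runs once and `find_element` returns only the first match inside it. The code below has been tested and works.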
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
import pandas as pd

url = 'https://m.imdb.com/chart/top/'
driver = webdriver.Chrome()
driver.get(url)
title = []
year = []
runtime = []
rating = []
# Wait up to 20 seconds for the list items (one per movie) to become visible.
wait = WebDriverWait(driver, 20)
containers = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "ul.ipc-metadata-list > li")))
for container in containers:
    title.append(container.find_element(By.CSS_SELECTOR, 'h3').text)
    # The metadata spans hold year, runtime, and rating, in that order.
    metadata = container.find_elements(By.CSS_SELECTOR, "div.cli-title-metadata > span")
    year.append(metadata[0].text)
    runtime.append(metadata[1].text)
    rating.append(metadata[2].text)
df_movie = pd.DataFrame({"title": title, "year": year, "runtime": runtime, "rating": rating})
df_movie.to_csv(r'C:\Users\martha\OneDrive\Python\imbd_project.csv', index=False)
driver.quit()  # close the browser once the CSV has been written
Note: I only tested this on the first 5 movies to keep the run time short.
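The loop above indexes `metadata[0]`, `metadata[1]`, and `metadata[2]` directly, which would raise an `IndexError` for any entry that has fewer than three spans. Here is a minimal sketch of a more defensive version of that step, which also shows one way to cap the run at the first few movies. The helper names `split_metadata` and `build_rows` are hypothetical (not part of Selenium or pandas), and the second sample entry is made up for illustration. The helpers work on plain strings so they can be tried without a browser.

```python
from itertools import islice

# Field order assumed from the answer above: year, runtime, rating.
FIELDS = ("year", "runtime", "rating")


def split_metadata(texts, fields=FIELDS, missing=""):
    """Map a list of span texts onto named fields, padding short lists."""
    padded = list(texts[:len(fields)]) + [missing] * (len(fields) - len(texts))
    return dict(zip(fields, padded))


def build_rows(entries, limit=None):
    """Build one dict per movie from (title, span_texts) pairs.

    `limit` caps how many entries are processed; None means all of them.
    """
    rows = []
    for title, texts in islice(entries, limit):
        row = {"title": title}
        row.update(split_metadata(texts))
        rows.append(row)
    return rows


# Illustrative input; the second entry is invented and lacks a rating.
sample = [
    ("The Shawshank Redemption", ["1994", "2h 22m", "R"]),
    ("Example Movie", ["2001", "1h 30m"]),
]
rows = build_rows(sample)
first_only = build_rows(sample, limit=1)
```

In the Selenium loop, each pair would be built as `(container.find_element(By.CSS_SELECTOR, 'h3').text, [s.text for s in metadata])`, and `pd.DataFrame(rows)` accepts the resulting list of dicts directly. Passing `limit=5` reproduces the shortened test run mentioned in the note.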