I tried to scrape a website with Selenium Python and save it to CSV, but it only prints and saves the first result


I'm scraping the top 250 movies on IMDB, but it only gives me the first movie in the dictionary.

Here is the code I wrote:

from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

url = 'https://m.imdb.com/chart/top/'
driver = webdriver.Edge()
driver.get(url)

title = []
year = []

container = driver.find_elements(By.XPATH, './/ul[contains(@role, "presentation")]')

for x in container:
    try:
        title.append(x.find_element(By.XPATH, './/a[contains(@class, "title")]').text)
        year.append(x.find_element(By.XPATH, './/span[contains(@class, "title")]').text)
    except:
        pass
# print(x.text)

df_movie = pd.DataFrame({'title' : title, 'year' : year})
df_movie.to_csv(r'C:\Users\martha\OneDrive\Python\imbd_project.csv', index=False)
python selenium-webdriver web-scraping
1 Answer

Some of your locators are incorrect, and you need to add explicit waits. The code below has been tested and works.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
import pandas as pd

url = 'https://m.imdb.com/chart/top/'
driver = webdriver.Chrome()
driver.get(url)

title = []
year = []
runtime = []
rating = []

wait = WebDriverWait(driver, 20)
containers = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "ul.ipc-metadata-list > li")))

for container in containers:
    title.append(container.find_element(By.CSS_SELECTOR, 'h3').text)
    # The metadata spans appear in a fixed order: year, runtime, rating
    metadata = container.find_elements(By.CSS_SELECTOR, "div.cli-title-metadata > span")
    year.append(metadata[0].text)
    runtime.append(metadata[1].text)
    rating.append(metadata[2].text)

df_movie = pd.DataFrame({"title": title, "year": year, "runtime": runtime, "rating": rating})
df_movie.to_csv(r'C:\Users\martha\OneDrive\Python\imbd_project.csv', index=False)

Note: I only tested with the first 5 movies to shorten the run time.
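To limit a test run like that, you can slice the container list before the loop (e.g. `containers[:5]`). A minimal standalone sketch of the slicing and DataFrame/CSV step, using placeholder rows in place of live Selenium elements (the data and the `imdb_test.csv` filename are illustrative, not from the answer):

```python
import pandas as pd

# Placeholder (title, year, runtime, rating) rows standing in for values
# scraped from each container; in the real script these come from Selenium.
rows = [(f"Movie {i}", str(1990 + i), f"{90 + i}m", "PG") for i in range(1, 251)]

# Keep only the first 5 entries to shorten a test run,
# mirroring slicing the Selenium result with containers[:5].
rows = rows[:5]

# Transpose the row tuples into one list per column.
title, year, runtime, rating = map(list, zip(*rows))

df_movie = pd.DataFrame({"title": title, "year": year, "runtime": runtime, "rating": rating})
df_movie.to_csv("imdb_test.csv", index=False)
print(len(df_movie))  # → 5
```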
