I am new to Python and Selenium. I am trying to scrape text from the website audible.in/search using a for loop. This is the code I am running:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
web = "https://www.audible.in/search"
path = "C:/Users/vikas/Downloads/chromedriver-win64/chromedriver-win64/chromedriver"
driver = webdriver.Chrome(path)
driver.get(web)
container = WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.CLASS_NAME, "adbl-impression-container")))
products = WebDriverWait(container, 5).until(EC.presence_of_all_elements_located((By.XPATH, "./div/span/ul/li")))
book_title = []
book_author = []
book_length = []
for product in products:
    book_title.append(product.find_element("xpath",'//h3[contains(@class, "bc-heading")]').text)
    book_author.append(product.find_element("xpath",'//a[contains(@href, "author")]').text)
    book_length.append(product.find_element("xpath",'//li[contains(@class, "runtimeLabel")]').text)
df = pd.DataFrame({'title':book_title, 'author':book_author, 'length':book_length})
df.to_csv('books.csv')
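When I run the code, I expect every h3, a, and li found in the loop to be appended to the lists. There are 20 elements in total, but what I get is the first element 20 times. What am I doing wrong here? Please help.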
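If your Python and Selenium versions are up to date (3.12.4 and 4.22.0 respectively at the time of writing), you no longer need to specify the path to the Chrome driver executable. Selenium Manager, which has shipped with Selenium since 4.6, locates a matching driver automatically.
I would also suggest a different approach to navigating the page, as follows: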
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver import ChromeOptions
import pandas as pd

URL = "https://www.audible.in/search"

# Run Chrome without opening a browser window
options = ChromeOptions()
options.add_argument("--headless")

# The context manager quits the browser when the block exits
with webdriver.Chrome(options=options) as driver:
    driver.get(URL)
    wait = WebDriverWait(driver, 5)
    db = {
        "Title": [],
        "Author": [],
        "Duration": []
    }
    # Each li.productListItem is one book; search inside it, not the whole page
    for li in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "li.productListItem"))):
        title = li.get_attribute("aria-label")
        db["Title"].append(title)
        # Strip the visible label prefix so only the value remains
        author = li.find_element(By.CSS_SELECTOR, "li.bc-list-item.authorLabel").text.replace("Written by:", "").strip()
        db["Author"].append(author)
        length = li.find_element(By.CSS_SELECTOR, "li.bc-list-item.runtimeLabel").text.replace("Length:", "").strip()
        db["Duration"].append(length)
    print(pd.DataFrame(db).to_string(index=False))
Output:
Title                                             Author                   Duration
War of Lanka (Hindi Edition)                      Amish Tripathi           17 hrs and 15 mins
Panchatantra (Hindi Edition)                      Vishnu Sharma            5 hrs and 41 mins
Forge Your Future                                 A. P. J. Kalam           5 hrs and 37 mins
Meri Gita                                         Devdutt Pattanaik        7 hrs and 39 mins
Kautilya Arthshastra (Hindi Edition)              Chanakya                 6 hrs
41 Anmol Kahaniya [41 Priceless Stories]          Premchand                15 hrs and 28 mins
The Guy Next Door (Hindi Edition)                 Nylla C                  36 hrs and 43 mins
Kalki Puran (Hindi Edition)                       Dr. Vinay                6 hrs and 10 mins
The Quick and Easy Way to Effective Speaking      Dale Carnegie            6 hrs and 39 mins
Jadugarni [Sorceress]                             Surender Mohan Pathak    8 hrs and 24 mins
21 Shreshth Kahaniyan Prem Chand (Hindi Edition)  Munshi Premchand         7 hrs and 44 mins
Gita                                              Pranay                   4 hrs and 40 mins
The Path to Self-Love                             Ruby Dhal                6 hrs and 55 mins
The Art of Influencing People                     Virender Kapoor          38 mins
Haar Jeet [Lose Win]                              Surender Mohan Pathak    8 hrs and 34 mins
The Sandman, Act I (Hindi Edition)                Neil Gaiman, Dirk Maggs  11 hrs and 3 mins
The Housemaid Is Watching                         Freida McFadden          11 hrs and 42 mins
Red Circle Society (Hindi Edition)                Surender Mohan Pathak    3 hrs and 15 mins
Pride and Prejudice                               Jane Austen              11 hrs and 35 mins
Crystal Lodge (Hindi Edition)                     Surender Mohan Pathak    11 hrs and 49 mins
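
As for why the loop in the question returns the first book 20 times: an XPath that starts with `//` is evaluated from the document root even when `find_element` is called on a child element, so every iteration matches the first `h3`, `a`, and `li` on the whole page. Prefixing the expression with a dot (`.//`) makes it relative to the current `product`. Below is a minimal sketch of that change, assuming `products` is the list of `<li>` elements located in the question; the helper name `extract_books` is hypothetical and not part of Selenium.

```python
def extract_books(products):
    """Collect title, author, and length from each product element.

    `products` is assumed to be the list of WebElements found in the
    question (one <li> per book). `extract_books` is an illustrative
    helper name, not a Selenium API.
    """
    rows = []
    for product in products:
        # The leading "." scopes each XPath to this product only.
        # Without it, "//h3..." would search the whole document and
        # return the first match on the page every time.
        rows.append({
            "title": product.find_element(
                "xpath", './/h3[contains(@class, "bc-heading")]').text,
            "author": product.find_element(
                "xpath", './/a[contains(@href, "author")]').text,
            "length": product.find_element(
                "xpath", './/li[contains(@class, "runtimeLabel")]').text,
        })
    return rows
```

The returned list of dictionaries can be passed to `pd.DataFrame(rows)` and written out with `to_csv('books.csv')`, as in the question.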