Scraping content from a website with Selenium using a for loop in Python


I am new to Python and Selenium. I am trying to scrape text from the website audible.in/search using a for loop. This is the code I am running:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

web = "https://www.audible.in/search"
path = "C:/Users/vikas/Downloads/chromedriver-win64/chromedriver-win64/chromedriver"
driver = webdriver.Chrome(path)
driver.get(web)

container = WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.CLASS_NAME, "adbl-impression-container")))
products = WebDriverWait(container, 5).until(EC.presence_of_all_elements_located((By.XPATH, "./div/span/ul/li")))

book_title = []
book_author = []
book_length = []

for product in products:
    book_title.append(product.find_element("xpath",'//h3[contains(@class, "bc-heading")]').text)
    book_author.append(product.find_element("xpath",'//a[contains(@href, "author")]').text)
    book_length.append(product.find_element("xpath",'//li[contains(@class, "runtimeLabel")]').text)

df = pd.DataFrame({'title':book_title, 'author':book_author, 'length':book_length})
df.to_csv('books.csv')

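When I run the code, I expect the h3, a, and li text of every product in the loop to be appended to the lists. There are 20 products in total, but what I get is the first product's values 20 times. What am I doing wrong here? Please help.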

python python-3.x selenium-webdriver web-scraping
1 Answer

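If your Python and Selenium versions are up to date (3.12.4 and 4.22.0 respectively at the time of writing), you no longer need to specify the path to the Chrome driver executable. Since version 4.6, Selenium ships with Selenium Manager, which locates or downloads a matching driver automatically.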
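
As for why the first product is repeated 20 times: in Selenium, an XPath that begins with // is evaluated from the document root even when find_element is called on an element, so every iteration matches the first h3, a, and li on the page. Prefixing the expression with a dot (.//) makes the search relative to the current product. Below is a minimal sketch of that fix, reusing the locators from the question. The helper name extract_book and the constant names are made up for illustration, and "xpath" is the same locator strategy string the question already passes to find_element.

```python
# Relative XPaths: the leading "." anchors each search at the element
# that find_element is called on, instead of at the document root.
TITLE_XPATH = './/h3[contains(@class, "bc-heading")]'
AUTHOR_XPATH = './/a[contains(@href, "author")]'
LENGTH_XPATH = './/li[contains(@class, "runtimeLabel")]'


def extract_book(product):
    """Return title, author and length for one product element.

    `product` is assumed to be a Selenium WebElement for a single
    search result (hypothetical helper, not part of Selenium).
    """
    return {
        "title": product.find_element("xpath", TITLE_XPATH).text,
        "author": product.find_element("xpath", AUTHOR_XPATH).text,
        "length": product.find_element("xpath", LENGTH_XPATH).text,
    }
```

In the question's loop this could be used as rows = [extract_book(p) for p in products], followed by pd.DataFrame(rows).to_csv('books.csv').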

I would also suggest a different approach to navigating the page, as follows:

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver import ChromeOptions
import pandas as pd

URL = "https://www.audible.in/search"

options = ChromeOptions()
options.add_argument("--headless")

# No driver path is passed; Selenium locates the driver itself.
with webdriver.Chrome(options=options) as driver:
    driver.get(URL)
    wait = WebDriverWait(driver, 5)
    db = {
        "Title": [],
        "Author": [],
        "Duration": []
    }
    # Wait until the li elements with class productListItem (one per search result) are present.
    for li in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "li.productListItem"))):
        # The title is read from the item's aria-label attribute.
        title = li.get_attribute("aria-label")
        db["Title"].append(title)
        # CSS selectors evaluated on li only match descendants of that item.
        author = li.find_element(By.CSS_SELECTOR, "li.bc-list-item.authorLabel").text.replace("Written by:", "").strip()
        db["Author"].append(author)
        length = li.find_element(By.CSS_SELECTOR, "li.bc-list-item.runtimeLabel").text.replace("Length:", "").strip()
        db["Duration"].append(length)
    print(pd.DataFrame(db).to_string(index=False))

Output:

                                           Title                  Author           Duration
                    War of Lanka (Hindi Edition)          Amish Tripathi 17 hrs and 15 mins
                    Panchatantra (Hindi Edition)           Vishnu Sharma  5 hrs and 41 mins
                               Forge Your Future          A. P. J. Kalam  5 hrs and 37 mins
                                       Meri Gita       Devdutt Pattanaik  7 hrs and 39 mins
            Kautilya Arthshastra (Hindi Edition)                Chanakya              6 hrs
        41 Anmol Kahaniya [41 Priceless Stories]               Premchand 15 hrs and 28 mins
               The Guy Next Door (Hindi Edition)                 Nylla C 36 hrs and 43 mins
                     Kalki Puran (Hindi Edition)               Dr. Vinay  6 hrs and 10 mins
    The Quick and Easy Way to Effective Speaking           Dale Carnegie  6 hrs and 39 mins
                           Jadugarni [Sorceress]   Surender Mohan Pathak  8 hrs and 24 mins
21 Shreshth Kahaniyan Prem Chand (Hindi Edition)        Munshi Premchand  7 hrs and 44 mins
                                            Gita                  Pranay  4 hrs and 40 mins
                           The Path to Self-Love               Ruby Dhal  6 hrs and 55 mins
                   The Art of Influencing People         Virender Kapoor            38 mins
                            Haar Jeet [Lose Win]   Surender Mohan Pathak  8 hrs and 34 mins
              The Sandman, Act I (Hindi Edition) Neil Gaiman, Dirk Maggs  11 hrs and 3 mins
                       The Housemaid Is Watching         Freida McFadden 11 hrs and 42 mins
              Red Circle Society (Hindi Edition)   Surender Mohan Pathak  3 hrs and 15 mins
                             Pride and Prejudice             Jane Austen 11 hrs and 35 mins
                   Crystal Lodge (Hindi Edition)   Surender Mohan Pathak 11 hrs and 49 mins
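
The replace(...).strip() calls in the code above are there because the text of each list item carries a label such as "Written by:" or "Length:". Here is a small sketch of the same clean-up as a reusable function. The name strip_label is hypothetical, the sample strings are illustrative rather than captured from the page, and the sketch assumes Python 3.9+ (for str.removeprefix) and that the label always sits at the start of the text.

```python
def strip_label(text: str, label: str) -> str:
    """Remove a leading label such as "Length:" and surrounding whitespace."""
    return text.strip().removeprefix(label).strip()


# Illustrative inputs (assumed shape of the raw element text).
print(strip_label("Written by: Jane Austen", "Written by:"))  # Jane Austen
print(strip_label("Length: 6 hrs", "Length:"))                # 6 hrs
```

Unlike str.replace, removeprefix only touches the start of the string, so a title or author name that happens to contain the label text elsewhere is left alone.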