Problems with a venue-listing script

Problem description

I wrote a script to list the venues of a given city on a website. However, I have run into two problems I don't know how to solve:

  1. Although I wrote scrolling code, it only captures the first 15 items and does not list the rest.

  2. The script does not click the "Next" button to move to the next page.

    import time
    import csv
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.common.action_chains import ActionChains

    place = 'Abu Dhabi'
    #----------------------------------------------- Cvent
    url = "https://www-eur.cvent.com/venues/"
    driver = webdriver.Chrome()
    driver.get(url)
    driver.maximize_window()
    search_box = driver.find_element(By.CSS_SELECTOR, 'input#searchString')
    search_box.send_keys(place)
    search_box.send_keys(Keys.RETURN)
    time.sleep(5)
    visited_links = {}
    Titles = []
    pages=[]
    
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        cventlinks = []
        link_elements = driver.find_elements(By.CSS_SELECTOR,"li.w-full a")
        for link_element in link_elements:
            cventlinks.append(link_element.get_attribute("href"))
        Title_elements = driver.find_elements(By.CSS_SELECTOR,'li.w-full h3')
        for item in Title_elements:
            Titles.append(item.text)
        with open('CventLinks.csv', 'a', newline='') as csvfile:
            writer = csv.writer(csvfile)
            for title in Titles:
                if title not in visited_links:
                    writer.writerow([cventlinks[Titles.index(title)]])
                    visited_links[title] = 1
        next_button=driver.find_elements(By.CSS_SELECTOR,'a[aria-label*="Next page"]')
        ActionChains(driver).move_to_element(next_button[0]).click().perform()
        time.sleep(5)
        if driver.current_url not in pages:
            pages.append(driver.current_url)
        else:
            break
    

Could you offer some guidance?

python selenium-webdriver web-scraping
1 Answer

1. As @chitown88 said, you have a timing issue. That is why you only get about 15 URLs from the first page.

You can use an implicit wait (see the Selenium with Python documentation):

driver.implicitly_wait(10) # The driver will wait up to 10 seconds for an element to be found
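As an alternative (my suggestion, not part of the original answer), an explicit wait with WebDriverWait blocks only until a specific condition is met, which is often more predictable than an implicit wait:

# Sketch: wait explicitly for the venue cards to be present in the DOM.
# WebDriverWait and expected_conditions are standard Selenium helpers.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "li.w-full h3")))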

Then, before scrolling down, check that the element you are looking for has been found, and add a short delay:

# Check if first page is loaded
if len(pages) < 1: # Only applied to the first page, for optimization purposes
    driver.find_element(By.CSS_SELECTOR,"h1.text-d-xs")
    
# Scroll down
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(1) # Adding a delay to make sure all elements are loaded
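If a single scroll still misses lazy-loaded items (the likely reason only about 15 venues are captured), a common pattern, not from the original answer, is to keep scrolling until the page height stops growing:

# Sketch: scroll repeatedly until document.body.scrollHeight stops changing,
# i.e. no more venue cards are being lazy-loaded.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)  # give the page time to load the next batch
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height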

2. The "Next page" button seems tricky. Instead of using click(), you can use

.get_attribute("href")

to get the URL:
# Go to the next page
next_button = driver.find_elements(By.CSS_SELECTOR,'a[aria-label*="Next page"]')
next_url = next_button[0].get_attribute("href")
driver.get(next_url)
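One extra safeguard (my addition, not in the original answer): find_elements returns an empty list when nothing matches, so check it before indexing to avoid an IndexError on the last page:

# Stop cleanly when no "Next page" link exists (e.g. on the last page).
next_button = driver.find_elements(By.CSS_SELECTOR, 'a[aria-label*="Next page"]')
if not next_button:
    break
driver.get(next_button[0].get_attribute("href"))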

To avoid errors, you should also add a couple of lines that check whether you have landed on the "We didn't find any results" page, and stop the script if so. Here is what I did:

# Check if page error
if len(Titles) == 0:
    break

3. I got an error when trying to fetch the data on page 2:

writer.writerow([cventlinks[Titles.index(title)]])
                ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
IndexError: list index out of range

Titles must be reset inside the loop, like cventlinks, so that its length is reset on each iteration:

while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    cventlinks = []
    Titles  = []
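A further note (my suggestion, not part of the original answer): Titles.index(title) always returns the first match, so duplicate titles would be written with the wrong link. Iterating over both lists with zip pairs each title with its own link and sidesteps the problem:

# Sketch: iterate titles and links together so each title is paired
# with its own link, even when titles repeat.
for title, link in zip(Titles, cventlinks):
    if title not in visited_links:
        writer.writerow([link])
        visited_links[title] = 1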

Hope this helps. Have a nice day, and good luck with your project. Full script:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import csv

#----------------------------------------------- Cvent
# Initialize variables
visited_links = {}
pages=[]
url="https://www-eur.cvent.com/venues/"
place='Abu Dhabi'
 
# Driver configuration
driver = webdriver.Chrome()
driver.get(url)
driver.maximize_window()
driver.implicitly_wait(10)

# Go to the search box and search for the place
search_box = driver.find_element(By.CSS_SELECTOR, 'input#searchString')
search_box.send_keys(place)
search_box.send_keys(Keys.RETURN)
    
while True:
    # Initialize Titles list
    Titles = []
    
    # Check if the page is loaded
    if len(pages) < 1:
        driver.find_element(By.CSS_SELECTOR,"h1.text-d-xs")
        
    # Scroll down
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1) # Adding a delay to make sure all elements are loaded
    
    cventlinks = []

    # Get all links
    link_elements = driver.find_elements(By.CSS_SELECTOR,"li.w-full a")
    
    # Get all titles
    Title_elements = driver.find_elements(By.CSS_SELECTOR,'li.w-full h3')
    
    # Save Links to cventlinks list
    for link_element in link_elements:
        cventlinks.append(link_element.get_attribute("href"))
    
    # Save Titles to list
    for item in Title_elements:
        Titles.append(item.text)
     
    # Open .csv file
    with open('CventLinks.csv', 'a', newline='') as csvfile:
        writer = csv.writer(csvfile)
        
        # Go through the titles
        for title in Titles:
            if title not in visited_links:
                # Save url to .csv file
                writer.writerow([cventlinks[Titles.index(title)]])
                
                # Save title to visited_links
                visited_links[title] = 1
           
    # Check if no more page to fetch
    if len(Titles) == 0:
        break     
    else:
        # Go to the next page
        next_button = driver.find_elements(By.CSS_SELECTOR,'a[aria-label*="Next page"]')
        next_url = next_button[0].get_attribute("href")
        driver.get(next_url)
    

    if driver.current_url not in pages:
        pages.append(driver.current_url)
    else:
        break
    
driver.quit()