I wrote a script to list the venues for a given city on a website, but I ran into two problems I don't know how to solve:
1. Although I wrote scrolling code, it only captures the first 15 items and does not list the rest.
2. The script does not click the "Next" button to move on to the next page.
import time
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains

place = 'Abu Dhabi'
#----------------------------------------------- Cvent
url = "https://www-eur.cvent.com/venues/"
driver = webdriver.Chrome()
driver.get(url)
driver.maximize_window()
search_box = driver.find_element(By.CSS_SELECTOR, 'input#searchString')
search_box.send_keys(place)
search_box.send_keys(Keys.RETURN)
time.sleep(5)
visited_links = {}
Titles = []
pages = []
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    cventlinks = []
    link_elements = driver.find_elements(By.CSS_SELECTOR, "li.w-full a")
    for link_element in link_elements:
        cventlinks.append(link_element.get_attribute("href"))
    Title_elements = driver.find_elements(By.CSS_SELECTOR, 'li.w-full h3')
    for item in Title_elements:
        Titles.append(item.text)
    with open('CventLinks.csv', 'a', newline='') as csvfile:
        writer = csv.writer(csvfile)
        for title in Titles:
            if title not in visited_links:
                writer.writerow([cventlinks[Titles.index(title)]])
                visited_links[title] = 1
    next_button = driver.find_elements(By.CSS_SELECTOR, 'a[aria-label*="Next page"]')
    ActionChains(driver).move_to_element(next_button[0]).click().perform()
    time.sleep(5)
    if driver.current_url not in pages:
        pages.append(driver.current_url)
    else:
        break
Could you give me some guidance?
1. As @chitown88 said, you have a timing issue. That's why you only get about 15 URLs from the first page.
You can use an implicit wait (see the Selenium-python documentation):
driver.implicitly_wait(10)  # The driver will wait up to 10 seconds for an element to be found
Then, before scrolling down, check that the element you are looking for has been found, and add a short delay:
# Check if first page is loaded
if len(pages) < 1:  # Only apply to first page for optimization purposes
    driver.find_element(By.CSS_SELECTOR, "h1.text-d-xs")
# Scroll down
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(1)  # Adding a delay to make sure all elements are loaded
2. The next-page button seems tricky. Instead of using click(), you can use
.get_attribute("href")
to get the URL directly:
# Go to the next page
next_button = driver.find_elements(By.CSS_SELECTOR, 'a[aria-label*="Next page"]')
next_url = next_button[0].get_attribute("href")
driver.get(next_url)
To avoid errors, you should add a few lines that check whether you have landed on the "We didn't find any results" page, so the script can stop. That's what this is for:
# Check if page error
if len(Titles) == 0:
    break
3. There is an error when trying to fetch the data on page 2:
writer.writerow([cventlinks[Titles.index(title)]])
                 ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
IndexError: list index out of range
Like cventlinks, Titles must be reset inside the loop so that its length is reinitialized on every iteration:
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    cventlinks = []
    Titles = []
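As a side note (my own suggestion, not part of the fix above): `Titles.index(title)` can also pick the wrong link when two venues share the same name, since `index()` always returns the first match. Pairing each title with its link through `zip()` avoids the index lookup entirely. A small sketch, where the function name `save_new_links` is hypothetical:

```python
import csv


def save_new_links(titles, links, visited, path="CventLinks.csv"):
    """Append the link of every not-yet-seen title to the CSV file.

    zip() pairs titles and links by position, so there is no index
    lookup (and no IndexError) even if the lists drift apart or
    titles contains duplicates.
    """
    with open(path, "a", newline="") as csvfile:
        writer = csv.writer(csvfile)
        for title, link in zip(titles, links):
            if title not in visited:
                writer.writerow([link])
                visited[title] = 1
```

Inside the `while` loop you would call `save_new_links(Titles, cventlinks, visited_links)` in place of the `with open(...)` block.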
Hope this helps. Have a nice day and good luck with your project. Full script:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import csv

#----------------------------------------------- Cvent
# Initialize variables
visited_links = {}
pages = []
url = "https://www-eur.cvent.com/venues/"
place = 'Abu Dhabi'
# Driver configuration
driver = webdriver.Chrome()
driver.get(url)
driver.maximize_window()
driver.implicitly_wait(10)
# Go to the search box and search for the place
search_box = driver.find_element(By.CSS_SELECTOR, 'input#searchString')
search_box.send_keys(place)
search_box.send_keys(Keys.RETURN)
while True:
    # Initialize Titles list
    Titles = []
    # Check if the page is loaded
    if len(pages) < 1:
        driver.find_element(By.CSS_SELECTOR, "h1.text-d-xs")
    # Scroll down
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)  # Adding a delay to make sure all elements are loaded
    cventlinks = []
    # Get all links
    link_elements = driver.find_elements(By.CSS_SELECTOR, "li.w-full a")
    # Get all titles
    Title_elements = driver.find_elements(By.CSS_SELECTOR, 'li.w-full h3')
    # Save links to cventlinks list
    for link_element in link_elements:
        cventlinks.append(link_element.get_attribute("href"))
    # Save titles to list
    for item in Title_elements:
        Titles.append(item.text)
    # Open .csv file
    with open('CventLinks.csv', 'a', newline='') as csvfile:
        writer = csv.writer(csvfile)
        # Go through the titles
        for title in Titles:
            if title not in visited_links:
                # Save url to .csv file
                writer.writerow([cventlinks[Titles.index(title)]])
                # Save title to visited_links
                visited_links[title] = 1
    # Check if there are no more pages to fetch
    if len(Titles) == 0:
        break
    else:
        # Go to the next page
        next_button = driver.find_elements(By.CSS_SELECTOR, 'a[aria-label*="Next page"]')
        next_url = next_button[0].get_attribute("href")
        driver.get(next_url)
        if driver.current_url not in pages:
            pages.append(driver.current_url)
        else:
            break
driver.quit()