I wrote a script to list the venues for a given city on a website, but I ran into two problems I don't know how to solve:
1. Although I wrote scrolling code, it only captures the first 15 items and does not list the rest.
2. The script does not click the "Next" button to move on to the next page.
import time
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains

place = 'Abu Dhabi'
#----------------------------------------------- Cvent
url = "https://www-eur.cvent.com/venues/"
driver = webdriver.Chrome()
driver.get(url)
driver.maximize_window()
search_box = driver.find_element(By.CSS_SELECTOR, 'input#searchString')
search_box.send_keys(place)
search_box.send_keys(Keys.RETURN)
time.sleep(5)
visited_links = {}
Titles = []
pages = []
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    cventlinks = []
    link_elements = driver.find_elements(By.CSS_SELECTOR, "li.w-full a")
    for link_element in link_elements:
        cventlinks.append(link_element.get_attribute("href"))
    Title_elements = driver.find_elements(By.CSS_SELECTOR, 'li.w-full h3')
    for item in Title_elements:
        Titles.append(item.text)
    with open('CventLinks.csv', 'a', newline='') as csvfile:
        writer = csv.writer(csvfile)
        for title in Titles:
            if title not in visited_links:
                writer.writerow([cventlinks[Titles.index(title)]])
                visited_links[title] = 1
    next_button = driver.find_elements(By.CSS_SELECTOR, 'a[aria-label*="Next page"]')
    ActionChains(driver).move_to_element(next_button[0]).click().perform()
    time.sleep(5)
    if driver.current_url not in pages:
        pages.append(driver.current_url)
    else:
        break
Could you give me some guidance?
1. As @chitown88 said, you have a timing issue. That's why you only get about 15 URLs from the first page.
You can use an implicit wait (see the Selenium-python documentation):
driver.implicitly_wait(10)  # The driver will wait up to 10 seconds for an element to be found
Then, before scrolling down, check that the element you are looking for has been found, and add a short delay:
# Check if first page is loaded
if len(pages) < 1:  # Only apply to first page for optimization purposes
    driver.find_element(By.CSS_SELECTOR, "h1.text-d-xs")
# Scroll down
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(1)  # Adding a delay to make sure all elements are loaded
2. The next-page button seems tricky. Instead of using click(), you can use
.get_attribute("href")
to get the URL directly:
# Go to the next page
next_button = driver.find_elements(By.CSS_SELECTOR, 'a[aria-label*="Next page"]')
next_url = next_button[0].get_attribute("href")
driver.get(next_url)
To avoid errors, you should add a few lines that check whether you have landed on the "We didn't find any results" page, so the script can stop. That's what this is for:
# Check if page error
if len(Titles) == 0:
    break
3. There is an error when trying to fetch the data on page 2:
writer.writerow([cventlinks[Titles.index(title)]])
                 ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
IndexError: list index out of range
Like cventlinks, Titles must be reset inside the loop so that its length is reinitialized on every iteration:
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    cventlinks = []
    Titles = []
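As a side note (my own suggestion, not part of the fix above): `Titles.index(title)` can also pick the wrong link when two venues share the same name, since `index()` always returns the first match. Pairing each title with its link through `zip()` avoids the index lookup entirely. A small sketch, where the function name `save_new_links` is hypothetical:

```python
import csv


def save_new_links(titles, links, visited, path="CventLinks.csv"):
    """Append the link of every not-yet-seen title to the CSV file.

    zip() pairs titles and links by position, so there is no index
    lookup (and no IndexError) even if the lists drift apart or
    titles contains duplicates.
    """
    with open(path, "a", newline="") as csvfile:
        writer = csv.writer(csvfile)
        for title, link in zip(titles, links):
            if title not in visited:
                writer.writerow([link])
                visited[title] = 1
```

Inside the `while` loop you would call `save_new_links(Titles, cventlinks, visited_links)` in place of the `with open(...)` block.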
Hope this helps. Have a nice day and good luck with your project. Full script:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import csv

#----------------------------------------------- Cvent
# Initialize variables
visited_links = {}
pages = []
url = "https://www-eur.cvent.com/venues/"
place = 'Abu Dhabi'
# Driver configuration
driver = webdriver.Chrome()
driver.get(url)
driver.maximize_window()
driver.implicitly_wait(10)
# Go to the search box and search for the place
search_box = driver.find_element(By.CSS_SELECTOR, 'input#searchString')
search_box.send_keys(place)
search_box.send_keys(Keys.RETURN)
while True:
    # Initialize Titles list
    Titles = []
    # Check if the page is loaded
    if len(pages) < 1:
        driver.find_element(By.CSS_SELECTOR, "h1.text-d-xs")
    # Scroll down
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)  # Adding a delay to make sure all elements are loaded
    cventlinks = []
    # Get all links
    link_elements = driver.find_elements(By.CSS_SELECTOR, "li.w-full a")
    # Get all titles
    Title_elements = driver.find_elements(By.CSS_SELECTOR, 'li.w-full h3')
    # Save links to cventlinks list
    for link_element in link_elements:
        cventlinks.append(link_element.get_attribute("href"))
    # Save titles to list
    for item in Title_elements:
        Titles.append(item.text)
    # Open .csv file
    with open('CventLinks.csv', 'a', newline='') as csvfile:
        writer = csv.writer(csvfile)
        # Go through the titles
        for title in Titles:
            if title not in visited_links:
                # Save url to .csv file
                writer.writerow([cventlinks[Titles.index(title)]])
                # Save title to visited_links
                visited_links[title] = 1
    # Check if there are no more pages to fetch
    if len(Titles) == 0:
        break
    else:
        # Go to the next page
        next_button = driver.find_elements(By.CSS_SELECTOR, 'a[aria-label*="Next page"]')
        next_url = next_button[0].get_attribute("href")
        driver.get(next_url)
        if driver.current_url not in pages:
            pages.append(driver.current_url)
        else:
            break
driver.quit()