Problem description:
I am trying to automate a flow where I open a website, hover over the menu navigation bar, click each navigation category option in the tier-1 dropdown, visit that page, scrape the product details of the first 20 products on it, and write them to an Excel file. If the page does not contain any products, the script keeps scrolling down until it reaches the end of the page; if no product divs are found, it goes back to the top of the page and then clicks the next category in the navigation panel.
I am using Selenium (with Python) for this. My code is attached below.
The scroll_and_click_view_more function scrolls down the page, the prod_vitals function scrapes the product details specific to each page, and the prod_count function extracts the total number of products on each page and builds a summary across all pages.
Error description:
When I run the code below, everything works fine except for one thing. The first page this code scrolls through has no product details. The script scrolls down the whole page, prints that no product tiles were found on that page, and should then click the next category, but for some reason it cannot click the next category in the path. It throws a TimeoutException error and then clicks the category after that, which works fine again. The website has two categories with no product tiles, and on both of those pages the script fails to click the next available category. I have attached a screenshot of the error.
Output of my code:
['/feature/unlock-your-courage.html', '/shop/new/women', '/shop/women', '/shop/men/bags', '/shop/collection', '/shop/gift/women/bestseller', '/shop/coachworld', '/shop/coachreloved/coach-reloved']
Reached the end of the page and no product tiles were found: /feature/unlock-your-courage.html
Element with href /shop/new/women not clickable
Link:
/shop/women
Link:
/shop/men/bags
Link:
/shop/collection
Link:
/shop/gift/women/bestseller
Reached the end of the page and no product tiles were found: /shop/coachworld
Element with href /shop/coachreloved/coach-reloved not clickable
If you look at the output, the first line prints all the navigation categories available on the site. After that, the script visits every URL in that array and is able to click all of them except the second and the eighth. FYI, the first and seventh categories contain no product tiles on their pages; all the remaining links are clickable. Clicking each category and iterating through the loop happens in the WebScraper class.
What I have tried:
I tried adding time.sleep() between actions, but it still does not work. I also added a step that takes a screenshot when the timeout exception occurs, and I can see that the category is visible on the screen, yet it is still not clickable.
I have attached a screenshot of the terminal output.
My code is below:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
import time
import re
import os
import shutil
import datetime
import openpyxl
import chromedriver_autoinstaller
from openpyxl import Workbook
from openpyxl.styles import PatternFill
from openpyxl.utils.dataframe import dataframe_to_rows
#custom_path = r"c:\Users\DELL\Documents\Self_Project" # Define the custom path where you want ChromeDriver to be installed
#temp_path=chromedriver_autoinstaller.install() # Installs the ChromeDriver to a temporary directory and returns the path to that directory.
#print("Temporary path",temp_path)
#final_path = os.path.join(custom_path, "chromedriver.exe") # constructs and stores the full path to the ChromeDriver executable in the custom directory.
#shutil.move(temp_path, final_path) # Moves the ChromeDriver executable from the temporary directory to the custom directory.
#print("ChromeDriver installed at:", final_path)
date_time = datetime.datetime.now().strftime("%m%d%Y_%H%M%S")
file_name = f'CRTL_JP_staging_products_data_{date_time}.xlsx'
products_summary = []
max_count_of_products=20
def scroll_and_click_view_more(driver, href):
    flag = False
    last_height = driver.execute_script("return window.pageYOffset + window.innerHeight")
    while True:
        try:
            driver.execute_script("window.scrollBy(0, 800);")
            time.sleep(4)
            new_height1 = driver.execute_script("return window.pageYOffset + window.innerHeight")
            try:
                WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'div.product-tile')))
            except Exception as e:
                new_height = driver.execute_script("return window.pageYOffset + window.innerHeight")
                if new_height1 == last_height and flag == False:
                    print("Reached the end of the page and no product tiles were found: ", href)
                    return "No product tiles found"
                else:
                    last_height = new_height
                    continue
            div_count = 0
            flag = True
            #while div_count >= 0:
            response = driver.page_source
            soup = BeautifulSoup(response, 'html.parser')
            div_elements = soup.find_all('div', class_='product-tile')
            div_count = len(div_elements)
            if(div_count > max_count_of_products):
                return(driver.page_source)
            driver.execute_script("window.scrollBy(0, 300);")
            time.sleep(3)
            new_height = driver.execute_script("return window.pageYOffset + window.innerHeight")
            #print(new_height)
            if new_height == last_height:
                print("Reached the end of the page: ", href)
                return("Reached the end of the page.")
            else:
                last_height = new_height
        except Exception as e:
            print(e)
            break
def prod_vitals(soup, title, url):
    count_of_items = 1
    products_data = []  # Array to store all product data for our excel sheet
    for div in soup.find_all('div', class_='product-tile'):  # Iterate over each individual product-tile div tag
        if count_of_items <= max_count_of_products:
            #print(title)
            list_price = 0  # Variable to store list price
            sale_price = 0  # Variable to store sale price
            discount1 = 0  # Variable to store discount% that is displayed on the site
            discount2 = 0  # Variable to store discount% calculated manually
            count_of_items = count_of_items + 1
            res = "Incorrect"  # Variable to store result of discount1==discount2; initialized with Incorrect
            #pro_code = div.select('div.css-1fg6eq7 img')[0]['id']
            pro_name = div.select('div.product-name a.css-avqw6d p.css-1d5mpur')[0].get_text()
            pdpurl = div.select('div.css-grdrdu a.css-avqw6d')[0]['href']
            pdpurl = url + pdpurl
            element = div.select('div.salePriceWrapper span.salesPrice')  # Extract all the salesPrice span elements inside the salePriceWrapper div (ideally only one should be present), e.g. "<span class="chakra-text salesPrice false css-1gi2nbo" data-qa="m_plp_txt_pt_price_upper_rl">¥179000 </span>"
            if element:  # If sale price exists
                sale_price = float(element[0].get_text().replace('¥', '').replace(',', ''))  # Take the text of the first element in the list (the price including the yen sign), strip the yen sign and commas, and convert the result to a float
                res = "Correct"
            element = div.select('div.comparablePriceWrapper span.css-l96gil')  # Similarly extract list price
            if element:
                list_price = float(element[0].get_text().replace('¥', '').replace(',', ''))
            percent_off = div.select('div.salePriceWrapper span.css-181q1zt')  # Similarly extract the DR% off text
            if percent_off:
                percent_off = percent_off[0].get_text()
                discount1 = re.search(r'\d+', percent_off).group()  # Extract only the digits from the DR% text using re.search; the return type is a string
                discount1 = int(discount1)  # Convert the DR% characters into an integer
            else:
                percent_off = 0
            discount2 = round(((list_price - sale_price) / list_price) * 100)  # Calculate the expected DR% manually from the list price and sale price
            if(discount1 == discount2):  # Check if the DR% on the site matches the expected DR%
                res = "Correct"  # If yes then store the result as Correct, else Incorrect
            else:
                res = "Incorrect"
            products_data.append({'Product Name': pro_name, 'Product URL': pdpurl, 'Sale Price': '¥' + format(sale_price, '.2f'), 'List Price': '¥' + format(list_price, '.2f'), 'Discount on site': str(discount1) + '%', 'Actual Discount': str(discount2) + '%', 'Result': res})  # Append the extracted data to the list
        else:
            break
    time.sleep(5)
    df = pd.DataFrame(products_data, columns=['Product Name', 'Product URL', 'Sale Price', 'List Price', 'Discount on site', 'Actual Discount', 'Result'])  # Convert the list, along with specific column names, to a pandas DataFrame
    if os.path.exists(file_name):
        book = openpyxl.load_workbook(file_name)
    else:
        book = Workbook()
        default_sheet = book.active
        book.remove(default_sheet)
    sheet = book.create_sheet(title)
    for row in dataframe_to_rows(df, index=False, header=True):
        sheet.append(row)
    yellow_fill = PatternFill(start_color='FFFF00', end_color='FFFF00', fill_type='solid')
    green_fill = PatternFill(start_color='00FF00', end_color='00FF00', fill_type='solid')
    for row in range(2, sheet.max_row + 1):
        cell = sheet.cell(row=row, column=8)
        if cell.value == "Correct":
            cell.fill = green_fill
        else:
            cell.fill = yellow_fill
    book.save(file_name)
def prod_count(soup, title):
    product_count_element = soup.find('p', {'class': 'chakra-text total-count css-120gdxl', 'data-qa': 'plp_txt_resultcount'})
    if product_count_element:
        pro_count_text = product_count_element.get_text()
        pro_count_text = pro_count_text.replace(',', '')
        pro_count = re.search(r'\d+', pro_count_text).group()
        products_summary.append({'Category': title, 'Total products available': pro_count, 'Total products scraped': max_count_of_products})
class WebScraper:
    def __init__(self):
        self.url = "https://staging1-japan.coach.com/?auto=true"
        self.reloved_url = "https://staging1-japan.coach.com/shop/coachreloved/coach-reloved"
        self.driver = webdriver.Chrome()
        #options = Options()
        #options.add_argument("--lang=en")
        #self.driver = webdriver.Chrome(service=Service(r"c:\Users\DELL\Documents\Self_Project\chromedriver.exe"), options=options)

    def scrape(self):
        self.driver.get(self.url)
        self.driver.maximize_window()
        time.sleep(5)
        nav_count = 0
        soup = BeautifulSoup(self.driver.page_source, 'html.parser')
        links = soup.find('div', {'class': 'css-wnawyw'}).find_all('a', {'class': 'css-ipxypz'})
        hrefs = [link.get('href') for link in links]
        print(hrefs)
        for i, href in enumerate(hrefs):
            try:
                #print(href)
                element1 = WebDriverWait(self.driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, f'a[href="{href}"]')))
                #self.driver.execute_script("arguments[0].scrollIntoView(true);", element1)
                self.driver.execute_script("window.scrollTo(0, arguments[0].getBoundingClientRect().top + window.scrollY - 100);", element1)
                time.sleep(10)
                is_visible = self.driver.execute_script("return arguments[0].offsetParent !== null && arguments[0].getBoundingClientRect().top >= 0 && arguments[0].getBoundingClientRect().left >= 0 && arguments[0].getBoundingClientRect().bottom <= (window.innerHeight || document.documentElement.clientHeight) && arguments[0].getBoundingClientRect().right <= (window.innerWidth || document.documentElement.clientWidth);", element1)
                #print("Displayed: {element1.is_displayed()}, Visible: {is_visible}")
                WebDriverWait(self.driver, 30).until(EC.element_to_be_clickable((By.CSS_SELECTOR, f'a[href="{href}"]'))).click()
                time.sleep(3)
                response = scroll_and_click_view_more(self.driver, href)
                time.sleep(3)
                if(response != "No product tiles found" and response != "Reached the end of the page."):
                    print("Link: \n", href)
                    soup = BeautifulSoup(response, 'html.parser')
                    PLP_title = links[nav_count].get('title')
                    prod_vitals(soup, PLP_title, self.url)
                    time.sleep(5)
                    prod_count(soup, PLP_title)
                    self.driver.execute_script("window.scrollBy(0, -500);")
                else:
                    self.driver.execute_script("window.scrollTo(0,0);")
                    #element2 = WebDriverWait(self.driver, 15).until(EC.presence_of_element_located((By.CSS_SELECTOR, f'a[href="{hrefs[i+1]}"]')))
                    #self.driver.execute_script("window.scrollTo(0, arguments[0].getBoundingClientRect().top + window.scrollY - 100);", element2)
                    #time.sleep(3)
                    #is_visible = self.driver.execute_script("return arguments[0].offsetParent !== null && arguments[0].getBoundingClientRect().top >= 0 && arguments[0].getBoundingClientRect().left >= 0 && arguments[0].getBoundingClientRect().bottom <= (window.innerHeight || document.documentElement.clientHeight) && arguments[0].getBoundingClientRect().right <= (window.innerWidth || document.documentElement.clientWidth);", element2)
                    #print(f"Element href: {hrefs[i+1]}, Displayed: {element2.is_displayed()}, Visible: {is_visible}")
                    time.sleep(3)
                    continue
            except TimeoutException:
                print(f"Element with href {href} not clickable")
                self.driver.save_screenshot('timeout_exception.png')
            except Exception as e:
                print(f"An error occurred: {e}")
            nav_count += 1
        df = pd.DataFrame(products_summary, columns=['Category', 'Total products available', 'Total products scraped'])
        book = openpyxl.load_workbook(file_name)
        sheet = book.create_sheet('Summary')
        for row in dataframe_to_rows(df, index=False, header=True):
            sheet.append(row)
        book.save(file_name)
scraper = WebScraper()
scraper.scrape()
time.sleep(5)
scraper.driver.quit()
The problem is that you collect the anchor tags from https://staging1-japan.coach.com/?auto=true and save them in a list, but then, while you are on the page https://staging1-japan.coach.com/feature/unlock-your-courage.html, you try to click an anchor that lives on https://staging1-japan.coach.com/?auto=true, which is not possible. You might argue that both anchors point to the same address or look identical, but that means nothing to the browser: they are two separate anchors on two separate pages, and while you are on one page you cannot click something that exists on another page.
So one solution is to load the page you read the anchors from each time.
In the scrape method of the WebScraper class, inside the for loop
for i,href in enumerate(hrefs):
you can add the line self.driver.get(self.url).
Sorry, this is a lot of code and I can't write all of it out for you:
for i, href in enumerate(hrefs):
    try:
        ##########new line added##########
        self.driver.get(self.url)
        ##################################
        #print(href)
        element1 = WebDriverWait(self.driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, f'a[href="{href}"]')))
        #self.driver.execute_script("arguments[0].scrollIntoView(true);", element1)
        self.driver.execute_script("window.scrollTo(0, arguments[0].getBoundingClientRect().top + window.scrollY - 100);", element1)
        time.sleep(10)
The other solution is to re-read the anchors from the navigation every time: if you are sure the navigation is present on every page and the anchors are the same, grab them again from whichever page you are currently on.
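To illustrate that second approach, here is a rough sketch of how the loop body in scrape could look. It assumes (based on the selectors in your own code) that the div.css-wnawyw navigation with a.css-ipxypz links is rendered on every page; if it is missing, it falls back to reloading the start page:

for i in range(len(hrefs)):
    # Re-parse the navigation of whatever page the driver is currently on
    soup = BeautifulSoup(self.driver.page_source, 'html.parser')
    nav = soup.find('div', {'class': 'css-wnawyw'})
    if nav is None:
        # No navigation on this page, so fall back to reloading the start page
        self.driver.get(self.url)
        time.sleep(5)
        soup = BeautifulSoup(self.driver.page_source, 'html.parser')
        nav = soup.find('div', {'class': 'css-wnawyw'})
    href = nav.find_all('a', {'class': 'css-ipxypz'})[i].get('href')
    # The anchor now exists in the current DOM, so it can be clicked
    WebDriverWait(self.driver, 30).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, f'a[href="{href}"]'))).click()
    time.sleep(3)
    response = scroll_and_click_view_more(self.driver, href)
    # ... the rest of your existing loop body stays the same ...

Either way the key point is the same: locate the element again on the page you are actually on before trying to click it.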