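I'm using Selenium to scrape details from an Amazon product page ([example][1]). I've managed to scrape the product title, but I also want the URLs of all the product images. Here is my code: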
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

def search_amazon():
    driver = webdriver.Chrome(ChromeDriverManager().install())
    driver.get('https://www.amazon.com/Pendleton-Glacier-National-Queen-Blanket/dp/B003EQ4AYY/?_encoding=UTF8&pd_rd_w=dZURJ&pf_rd_p=ab102187-3a5a-49fd-b43f-4f928775aeae&pf_rd_r=PD8YGV8XA34FMYH7G9TJ&pd_rd_r=2cb55e9c-812a-43de-bf52-7e1976f5374b&pd_rd_wg=KmkoW&ref_=pd_gw_hfp13n_bbn')
    productName = driver.find_element_by_id('productTitle').text
    print(productName)
    imgList = driver.find_element_by_xpath('//*[@id="altImages"]/ul')
    options = imgList.find_elements_by_tag_name("li")
    for option in options:
        print(option.get_attribute("innerHTML"))

search_amazon()
The loop over options at the end prints the innerHTML of each LI. I can't get at the IMG src, though. This is what I tried:
for option in options:
    src = option.find_element_by_tag_name("img").get_attribute("src")
This throws a NoSuchElementException:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"img"}
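To get the actual images rather than the thumbnails, I hover over each thumbnail and read the main image that becomes selected. Adding a wait after the hover would be safer.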
from selenium.webdriver.common.action_chains import ActionChains
...
for i in driver.find_elements_by_css_selector('#altImages .imageThumbnail'):
    # Hovering over a thumbnail makes its full-size version the selected main image.
    hover = ActionChains(driver).move_to_element(i)
    hover.perform()
    print(driver.find_element_by_css_selector('.image.item.maintain-height.selected img').get_attribute('src'))
This gets the src of each actual full-size image.
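Regarding the wait mentioned above: here is a minimal, dependency-free sketch of the polling idea, so that a src is only read once it has actually changed after the hover. The helper name `wait_for_new_value` and its parameters are hypothetical, not part of Selenium; in real Selenium code, `WebDriverWait(driver, timeout).until(...)` with a callable plays the same role.

```python
import time

def wait_for_new_value(read_value, previous, timeout=5.0, poll=0.2):
    """Poll read_value() until it returns a truthy value different from previous.

    Returns the new value, or None if nothing changed before the timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = read_value()
        if value and value != previous:
            return value
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

In the hover loop, `read_value` could be a lambda that looks up `.image.item.maintain-height.selected img` and returns its `src`, with `previous` set to the src collected in the previous iteration. Note that the lambda would still raise if the element is not in the DOM at all, so this only covers the "image has not switched yet" case.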
To print the value of the src attribute of the <img> tags, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following Locator Strategies:
Using CSS_SELECTOR:
driver.get('https://www.amazon.com/Pendleton-Glacier-National-Queen-Blanket/dp/B003EQ4AYY/?_encoding=UTF8&pd_rd_w=dZURJ&pf_rd_p=ab102187-3a5a-49fd-b43f-4f928775aeae&pf_rd_r=PD8YGV8XA34FMYH7G9TJ&pd_rd_r=2cb55e9c-812a-43de-bf52-7e1976f5374b&pd_rd_wg=KmkoW&ref_=pd_gw_hfp13n_bbn')
print([my_elem.get_attribute("src") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div#altImages>ul li[data-ux-click] img")))])
Using XPATH:
driver.get('https://www.amazon.com/Pendleton-Glacier-National-Queen-Blanket/dp/B003EQ4AYY/?_encoding=UTF8&pd_rd_w=dZURJ&pf_rd_p=ab102187-3a5a-49fd-b43f-4f928775aeae&pf_rd_r=PD8YGV8XA34FMYH7G9TJ&pd_rd_r=2cb55e9c-812a-43de-bf52-7e1976f5374b&pd_rd_wg=KmkoW&ref_=pd_gw_hfp13n_bbn')
print([my_elem.get_attribute("src") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@id='altImages']/ul//li[@data-ux-click]//img")))])
Console Output:
['https://images-na.ssl-images-amazon.com/images/I/41Sj%2BO--J9L._AC_US40_.jpg', 'https://images-na.ssl-images-amazon.com/images/I/41iX14X%2BoRL._AC_US40_.jpg', 'https://images-na.ssl-images-amazon.com/images/I/41wiU-3N5JL._AC_US40_.jpg', 'https://images-na.ssl-images-amazon.com/images/I/41waNtDjTxL._AC_US40_.jpg']
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
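The URLs in the console output above all carry an `_AC_US40_` segment before the extension, which looks like a thumbnail size modifier. As a rough sketch, and assuming (this is not documented by Amazon, so treat it as an assumption to verify) that dropping that segment yields the unscaled image, the thumbnails could be converted without hovering or clicking. The function name `strip_size_modifier` is made up for illustration.

```python
import re

# Assumption: the size modifier is the "._..._" segment right before the file extension.
_SIZE_MODIFIER = re.compile(r"\._[^/.]+_(?=\.[A-Za-z0-9]+$)")

def strip_size_modifier(url):
    """Remove a trailing size modifier such as '._AC_US40_' from an image URL."""
    return _SIZE_MODIFIER.sub("", url)

thumb = "https://images-na.ssl-images-amazon.com/images/I/41Sj%2BO--J9L._AC_US40_.jpg"
print(strip_size_modifier(thumb))
```

URLs without such a segment are returned unchanged, so it should be safe to map this over the whole list; whether the resulting URL actually resolves to a larger image has to be checked against the live site.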
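When you look up the li element for each image, you should specify the element's class in the path, because not every li under //*[@id="altImages"]/ul refers to an image. That is why find_element_by_tag_name("img") raises NoSuchElementException on some of them. To get the URLs you can do this: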
def search_amazon():
    driver = webdriver.Chrome(ChromeDriverManager().install())
    driver.get('https://www.amazon.com/Pendleton-Glacier-National-Queen-Blanket/dp/B003EQ4AYY/?_encoding=UTF8&pd_rd_w=dZURJ&pf_rd_p=ab102187-3a5a-49fd-b43f-4f928775aeae&pf_rd_r=PD8YGV8XA34FMYH7G9TJ&pd_rd_r=2cb55e9c-812a-43de-bf52-7e1976f5374b&pd_rd_wg=KmkoW&ref_=pd_gw_hfp13n_bbn')
    productName = driver.find_element_by_id('productTitle').text
    print(productName)
    imgList = driver.find_element_by_xpath('//*[@id="altImages"]/ul')
    # Only keep the <li> entries that are image thumbnails.
    options = imgList.find_elements_by_xpath(".//li[contains(@class, 'imageThumbnail')]")
    for option in options:
        print(option.find_element_by_tag_name("img").get_attribute("src"))
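If you would rather not depend on the class name, another option is to tolerate li entries that have no image. A small sketch, assuming each item is a Selenium WebElement: `find_elements` (plural) returns an empty list when nothing matches instead of raising. The function name `collect_img_srcs` is hypothetical; the string `"tag name"` is the value of `By.TAG_NAME`, used here so the snippet has no imports.

```python
def collect_img_srcs(list_items):
    """Return the src of every <img> found inside the given <li> elements."""
    srcs = []
    for li in list_items:
        # find_elements returns [] for an <li> without an <img>,
        # so such entries are skipped rather than raising NoSuchElementException.
        for img in li.find_elements("tag name", "img"):  # "tag name" == By.TAG_NAME
            src = img.get_attribute("src")
            if src:
                srcs.append(src)
    return srcs
```

With the question's code this would be called as `collect_img_srcs(imgList.find_elements_by_tag_name("li"))`.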
Scraping the full-size images from an Amazon product:
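import time
from selenium.webdriver.common.by import By

image_url = []
for i in range(1, 15):
    try:
        # Click the i-th thumbnail so that its image becomes the selected main image.
        driver.find_element(By.XPATH, '*//ul/li[' + str(i) + ']/span/span/span/input').click()
        time.sleep(3)  # crude fixed wait for the main image to switch
        main = driver.find_element(By.CSS_SELECTOR, '.image.item.maintain-height.selected img').get_attribute('src')
        image_url.append(main)
        print(main)
    except Exception:
        # Skip indexes that have no clickable thumbnail.
        pass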