I am trying to scrape the SKUs and descriptions from this site: https://www.milwaukeetool.com/products/power-tools/drilling/drill-drivers
However, although the code runs, it doesn't scrape the desired elements. Does anyone know why? It looks like I'm targeting the right elements, and I've tried both requests and Selenium (shown below) but keep getting the same result.
Requests approach:
import requests
import pandas as pd
from bs4 import BeautifulSoup
link = 'https://www.milwaukeetool.com/products/power-tools/drilling/drill-drivers'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
}
res = requests.get(link, headers=headers)
soup = BeautifulSoup(res.text, "html.parser")
df = pd.DataFrame(columns=['sku', 'desc'])
for item in soup.select("#MTBody > main > section > div.product-listing-main.pt-5.md\:pt-\[30px\] > section:nth-child(1) > div > div > div > div:nth-child(2) > div > a > div.result-title__wrap.absolute.inset-0.top-auto.bg-gray-300.pt-\[5px\].md\:pt-2.px-1.md\:px-4.w-full.text-gray-800.text-center.h-\[75px\]"):
    sku = item.select_one("#MTBody > main > section > div.product-listing-main.pt-5.md\:pt-\[30px\] > section:nth-child(1) > div > div > div > div:nth-child(2) > div > a > div.result-title__wrap.absolute.inset-0.top-auto.bg-gray-300.pt-\[5px\].md\:pt-2.px-1.md\:px-4.w-full.text-gray-800.text-center.h-\[75px\] > span").get_text(strip=True)
    desc = item.select_one("#MTBody > main > section > div.product-listing-main.pt-5.md\:pt-\[30px\] > section:nth-child(1) > div > div > div > div:nth-child(2) > div > a > div.result-title__wrap.absolute.inset-0.top-auto.bg-gray-300.pt-\[5px\].md\:pt-2.px-1.md\:px-4.w-full.text-gray-800.text-center.h-\[75px\] > div.text-brandBlack.font-helvetica67.text-14.result-title.leading-none.max-h-8.overflow-hidden").get_text(strip=True)
    df = pd.concat([df, pd.DataFrame({'sku': [sku], 'desc': [desc]})], ignore_index=True)
    print(sku, desc)
df.to_csv("milwaukee.csv", index=False)
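One quick sanity check (a diagnostic sketch, reusing the result-title__wrap class from the selectors above): if the product grid is rendered client-side by JavaScript, the static HTML that requests receives contains none of these nodes, so the count below comes back as 0 even though the request itself succeeds.
import requests
from bs4 import BeautifulSoup

link = 'https://www.milwaukeetool.com/products/power-tools/drilling/drill-drivers'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'}
res = requests.get(link, headers=headers)
soup = BeautifulSoup(res.text, "html.parser")
print("status:", res.status_code)                                    # the request itself usually returns 200
print("product cards:", len(soup.select("div.result-title__wrap")))  # 0 => the grid is built by JavaScript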
Selenium approach:
import undetected_chromedriver as uc
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time
from selenium.common.exceptions import NoSuchElementException
options = Options()
driver = uc.Chrome()
website = 'https://www.milwaukeetool.com/products/power-tools/drilling/drill-drivers'
driver.get(website)
product_list = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".product-listing-main.pt-5.md\\:pt-\\[30px\\]")))
prod_num = []
prod_desc = []
for container in product_list:
    sku = container.find_element(By.CSS_SELECTOR, '.font-helvetica67.tracking-normal.uppercase.text-gray-900.text-12.result-sku.leading-none').text
    description = container.find_element(By.CSS_SELECTOR, '.text-brandBlack.font-helvetica67.text-14.result-title.leading-none.max-h-8.overflow-hidden').text
    prod_num.append(sku)
    prod_desc.append(description)
for _ in range(4):
    driver.execute_script("window.scrollBy(0, 2000);")
    time.sleep(2)
driver.quit()
print(len(prod_num))
print(len(prod_desc))
# Create a DataFrame from the scraped data
df = pd.DataFrame({'code': prod_num, 'description': prod_desc})
# Save the DataFrame to a CSV file
df.to_csv('milwtest1.csv', index=False)
print(df)
Refer to the loc attributes (CSS selectors or XPath) in the following template:
<actions>
<action_goto url="https://www.milwaukeetool.com/products/power-tools/drilling/drill-drivers" />
<action_loopineles>
<element loc="div.product-listing__results-list > div > div a" />
<action_extract tabname="dat_00000000000012ab">
<column_element colname="c01" nickname="sku">
<element loc="span.font-helvetica67" />
<!-- <element loc=".//span[contains(@class, 'font-helvetica67')]" /> -->
<transform>
<fun_replace substr="(" newstr="" />
<fun_replace substr=")" newstr="" />
</transform>
</column_element>
<column_element colname="c02" nickname="description">
<element loc="div.result-title__wrap > div.text-brandBlack" />
</column_element>
</action_extract>
</action_loopineles>
</actions>
Extracted data:
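For anyone who wants to stay in plain Python rather than use the template runner, here is a rough Selenium transcription of the same loc selectors (a sketch only: it assumes the grid finishes rendering after a few scrolls, and that the template's class names such as product-listing__results-list and font-helvetica67 are still current):
import time
import pandas as pd
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By

driver = uc.Chrome()
driver.get('https://www.milwaukeetool.com/products/power-tools/drilling/drill-drivers')
time.sleep(5)  # crude wait for the JS-rendered grid; a WebDriverWait would be more robust

# Scroll first, so lazy-loaded cards are in the DOM before scraping starts
for _ in range(4):
    driver.execute_script("window.scrollBy(0, 2000);")
    time.sleep(2)

rows = []
# <element loc="div.product-listing__results-list > div > div a" />
for card in driver.find_elements(By.CSS_SELECTOR, "div.product-listing__results-list > div > div a"):
    # sku: <element loc="span.font-helvetica67" />, plus the ( / ) replacements from <transform>
    sku = card.find_element(By.CSS_SELECTOR, "span.font-helvetica67").text.replace("(", "").replace(")", "")
    # description: <element loc="div.result-title__wrap > div.text-brandBlack" />
    desc = card.find_element(By.CSS_SELECTOR, "div.result-title__wrap > div.text-brandBlack").text
    rows.append({'sku': sku, 'description': desc})

driver.quit()
pd.DataFrame(rows).to_csv("milwaukee.csv", index=False)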