Web scraper not picking up the desired text

Question (votes: 0, answers: 1)

I'm trying to scrape the SKUs and descriptions from this site: https://www.milwaukeetool.com/products/power-tools/drilling/drill-drivers

However, although the code runs without errors, it doesn't scrape the desired elements. Does anyone know why? It looks like I'm targeting the right elements. I've tried both requests and Selenium (shown below) and keep getting the same result.

The requests approach:

import requests
import pandas as pd
from bs4 import BeautifulSoup

link = 'https://www.milwaukeetool.com/products/power-tools/drilling/drill-drivers'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
}

res = requests.get(link,headers=headers)
soup = BeautifulSoup(res.text,"html.parser")
df = pd.DataFrame(columns= ['sku','desc'])
for item in soup.select("#MTBody > main > section > div.product-listing-main.pt-5.md\:pt-\[30px\] > section:nth-child(1) > div > div > div > div:nth-child(2) > div > a > div.result-title__wrap.absolute.inset-0.top-auto.bg-gray-300.pt-\[5px\].md\:pt-2.px-1.md\:px-4.w-full.text-gray-800.text-center.h-\[75px\]"):
    sku = item.select_one("#MTBody > main > section > div.product-listing-main.pt-5.md\:pt-\[30px\] > section:nth-child(1) > div > div > div > div:nth-child(2) > div > a > div.result-title__wrap.absolute.inset-0.top-auto.bg-gray-300.pt-\[5px\].md\:pt-2.px-1.md\:px-4.w-full.text-gray-800.text-center.h-\[75px\] > span").get_text(strip=True)
    desc = item.select_one("#MTBody > main > section > div.product-listing-main.pt-5.md\:pt-\[30px\] > section:nth-child(1) > div > div > div > div:nth-child(2) > div > a > div.result-title__wrap.absolute.inset-0.top-auto.bg-gray-300.pt-\[5px\].md\:pt-2.px-1.md\:px-4.w-full.text-gray-800.text-center.h-\[75px\] > div.text-brandBlack.font-helvetica67.text-14.result-title.leading-none.max-h-8.overflow-hidden").get_text(strip=True)
    df = pd.concat([df, pd.DataFrame({'sku': [sku], 'desc': [desc]})], ignore_index=True)
    print(sku,desc)

df.to_csv("milwaukee.csv",index=False)
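A quick diagnostic before blaming the selectors is to check whether the target markup exists in the raw response at all; if the product grid is rendered client-side by JavaScript (a plausible assumption here), `requests` never sees it and every `select()` call silently returns an empty list. A minimal sketch of that check, using a canned HTML string as a stand-in for `res.text`:

```python
from bs4 import BeautifulSoup

# Stand-in for res.text: what a server response can look like when the
# product grid is injected by JavaScript (assumed structure, for illustration).
raw_html = """
<html><body>
  <div id="app"><!-- products rendered client-side at runtime --></div>
</body></html>
"""

soup = BeautifulSoup(raw_html, "html.parser")

# If the class the selectors rely on is absent from the raw HTML,
# select() matches nothing and the scraping loop simply never runs.
matches = soup.select("div.result-title__wrap")
print(len(matches))  # 0 -> the markup is not in the server response
```

Running the same `len(soup.select(...))` check against the real `res.text` tells you immediately whether the problem is the selectors or the response itself.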

The Selenium approach:

import undetected_chromedriver as uc

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time
from selenium.common.exceptions import NoSuchElementException

options = Options()
driver = uc.Chrome()

website = 'https://www.milwaukeetool.com/products/power-tools/drilling/drill-drivers'
driver.get(website)


product_list = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".product-listing-main.pt-5.md\\:pt-\\[30px\\]")))
prod_num = []
prod_desc = []

for container in product_list:
    sku = container.find_element(By.CSS_SELECTOR, '.font-helvetica67.tracking-normal.uppercase.text-gray-900.text-12.result-sku.leading-none').text
    description = container.find_element(By.CSS_SELECTOR, '.text-brandBlack.font-helvetica67.text-14.result-title.leading-none.max-h-8.overflow-hidden').text
    prod_num.append(sku)
    prod_desc.append(description)


for _ in range(4):
        driver.execute_script("window.scrollBy(0, 2000);")
        time.sleep(2)




driver.quit()
print(len(prod_num))
print(len(prod_desc))
# Create a DataFrame from the scraped data

df = pd.DataFrame({'code': prod_num, 'description': prod_desc})

# Save the DataFrame to a CSV file
df.to_csv('milwtest1.csv', index=False)

print(df)
Tags: python, selenium-webdriver, web-scraping, beautifulsoup, python-requests
1 Answer (0 votes)

Refer to the loc values (CSS selectors or XPath) in the following template:

  <actions>
    <action_goto url="https://www.milwaukeetool.com/products/power-tools/drilling/drill-drivers" />
    <action_loopineles>
      <element loc="div.product-listing__results-list > div > div a" />
      <action_extract tabname="dat_00000000000012ab">
        <column_element colname="c01" nickname="sku">
          <element loc="span.font-helvetica67" />
          <!-- <element loc=".//span[contains(@class, 'font-helvetica67')]" /> -->
          <transform>
            <fun_replace substr="(" newstr="" />
            <fun_replace substr=")" newstr="" />
          </transform>
        </column_element>
        <column_element colname="c02" nickname="description">
          <element loc="div.result-title__wrap > div.text-brandBlack" />
        </column_element>
      </action_extract>
    </action_loopineles>
  </actions>
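The template's loc selectors translate directly into plain BeautifulSoup. The sketch below runs them against a small HTML fragment that mimics the card structure the template assumes (the fragment, including the sample SKU text, is made up for illustration; the real page would first need to be fetched with a JavaScript-capable client):

```python
from bs4 import BeautifulSoup

# Minimal fragment mimicking the product-card structure the template's
# selectors assume; contents are illustrative, not taken from the live site.
html = """
<div class="product-listing__results-list">
  <div><div>
    <a href="/p/2903-20">
      <span class="font-helvetica67">(2903-20)</span>
      <div class="result-title__wrap">
        <div class="text-brandBlack">M18 FUEL 1/2 In. Drill/Driver</div>
      </div>
    </a>
  </div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for card in soup.select("div.product-listing__results-list > div > div a"):
    sku = card.select_one("span.font-helvetica67").get_text(strip=True)
    # Mirror the template's fun_replace transforms: strip parentheses from the SKU.
    sku = sku.replace("(", "").replace(")", "")
    desc = card.select_one("div.result-title__wrap > div.text-brandBlack").get_text(strip=True)
    rows.append((sku, desc))

print(rows)  # [('2903-20', 'M18 FUEL 1/2 In. Drill/Driver')]
```

Note that these selectors anchor on short, stable-looking classes (`product-listing__results-list`, `font-helvetica67`, `text-brandBlack`) rather than the long utility-class chains in the question, which makes them far less brittle.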

Extracted data (screenshot omitted):

Alternatively, you can extract the data directly from the response to the underlying request (screenshot omitted).
