Unable to scrape data from a table on a dynamic website


Okay, so I am trying to scrape the table from this website - https://www.diamondsfactory.co.uk/design/combined-band-look-diamond-engagement-ring-clrn0717801

Basically this section: the rows in the table.

This is the script I have tried so far:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Initialize the Chrome driver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Open the webpage
url = "https://www.diamondsfactory.co.uk/design/combined-band-look-diamond-engagement-ring-clrn0717801"
driver.get(url)

# Find the table row by its id
row = driver.find_elements(By.CLASS_NAME, "tdStone  odd")

# Extract the data from each cell in the row
td_elements = row.find_elements(By.TAG_NAME, "td")
data = [td.text for td in td_elements]

# Print the extracted data
print(data)

# Close the driver
driver.quit()

However, I got this error:

Traceback (most recent call last):
  File "C:\Users\Payalkumavat\scrape_diamonds.py\scrape.py", line 17, in <module>
    td_elements = row.find_elements(By.TAG_NAME, "td")
                  ^^^^^^^^^^^^^^^^^^
AttributeError: 'list' object has no attribute 'find_elements'
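For what it is worth, the AttributeError comes from find_elements (plural) returning a list, so the follow-up call has to be made on each row rather than on the list itself. By.CLASS_NAME also expects a single class name, so a value containing spaces such as "tdStone  odd" is unlikely to match anything. Below is a minimal sketch of one way around both issues. The helper name extract_rows, the locator tuples, the 10-second wait, and the assumption that the rows carry both the tdStone and odd classes are all mine and have not been checked against the live page.

```python
URL = "https://www.diamondsfactory.co.uk/design/combined-band-look-diamond-engagement-ring-clrn0717801"


def extract_rows(driver, row_locator, cell_locator):
    """Return one list of cell texts per element matched by row_locator."""
    # find_elements returns a list, so iterate over it instead of calling
    # find_elements on the list itself.
    rows = driver.find_elements(*row_locator)
    return [
        [cell.text for cell in row.find_elements(*cell_locator)]
        for row in rows
    ]


def main():
    # Imported here so extract_rows can be reused without a browser installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait
    from webdriver_manager.chrome import ChromeDriverManager

    # Assumption: each wanted row carries both classes, "tdStone" and "odd".
    row_locator = (By.CSS_SELECTOR, ".tdStone.odd")
    cell_locator = (By.TAG_NAME, "td")

    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get(URL)
        # The rows appear to be loaded after the initial page load, so wait
        # for them (10 seconds is an arbitrary choice).
        WebDriverWait(driver, 10).until(
            EC.presence_of_all_elements_located(row_locator)
        )
        for row in extract_rows(driver, row_locator, cell_locator):
            print(row)
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

Calling main() runs the scrape. Dropping .odd from the selector would presumably match every row rather than only the odd ones.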

selenium-webdriver web-scraping selenium-chromedriver
1 Answer

Note: this answer takes a different approach to the goal. (Modules used: requests, json, time)

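Based on your question, I assume you are trying to get all the information related to the tdStone class in the HTML page. I found an approach that works better than Selenium here. This is my reasoning: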

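Your target application has an endpoint under index.php?route= (URL: https://www.diamondsfactory.co.uk/index.php?route=product/product/lazyloadDiamond) that acts as a kind of API route and fetches all the details from the server. That information is then rendered into the HTML source in the elements with the tdStone class. So if we send this endpoint a request that specifies the target (for example, diamonds that match your criteria), we can get the data easily with the Python requests library and a little code. Here is my code: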

Note: to avoid rate-limiting issues, I use time.sleep(3) to space out the requests.

import json
import requests
import time
import urllib3

# Silence the InsecureRequestWarning that verify=False (used below) would trigger
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

def getData(content):
    loaded_content = json.loads(content)
    result = loaded_content['stones']
    for i in result:
        shape = i['shape']
        diamond_code = i['diamond_code']
        color = i['color']
        weight = i['weight_display']
        clarity = i['clarity']
        certificate = i['lab']
        image = i['image_url']
        polish = i['polish']
        symmetry = i['symmetry']
        price = i['csprice']
        video_url = i['video_url']
        mm = i['meas']
        depth = i['depth']
        table = i['table']

        print(f"Diamond Shape:  {shape}\nDiamond Color:  {color}\nDiamond Code:  {diamond_code}\nDiamond Weight:  {weight}\nDiamond Depth:  {depth}\nDiamond Table:  {table}\nDiamond MM:  {mm}\nDiamond Certificate:  {certificate}\nDiamond Image:  {image}\nDiamond Video:  {video_url}\nDiamond Polish:  {polish}\nDiamond Symmetry:  {symmetry}\nDiamond Price:  {price}\n====================================")

def sendRequest(url):
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:130.0) Gecko/20100101"
    }
    # Walk through result pages 1..99 and stop at the first page without stones
    for i in range(1,100):
        data = f"stone_type=LAB&ring_size=R16_7&metal_purity=GL_18K_W&stone_carat_min=0.20&stone_carat_max=30.00&stone_price_min=100&stone_price_max=5000000&active_diamond_tab=LAB&page={i}"
        # Pause between requests to stay clear of rate limiting
        time.sleep(3)
        res = requests.post(url, data=data, verify=False, headers=headers).text
        # The script treats a "stone_price_id" key as the sign that the page has stones
        if '"stone_price_id":' in res:
            getData(res)
        else:
            break

sendRequest('https://www.diamondsfactory.co.uk/index.php?route=product/product/lazyloadDiamond')
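If you would rather keep the results than print them, a variant along these lines might help. It is only a sketch: the field names are the ones read in getData above, while parse_stones, stones_to_csv, FIELDS, and the sample payload are hypothetical names and made-up data used purely for illustration.

```python
import csv
import io
import json

# Keys taken from the getData function above.
FIELDS = [
    "shape", "diamond_code", "color", "weight_display", "clarity", "lab",
    "image_url", "polish", "symmetry", "csprice", "video_url", "meas",
    "depth", "table",
]


def parse_stones(content):
    """Return one dict per stone, keeping only the keys listed in FIELDS."""
    payload = json.loads(content)
    # .get() yields None for a missing key instead of raising KeyError.
    return [
        {name: stone.get(name) for name in FIELDS}
        for stone in payload.get("stones", [])
    ]


def stones_to_csv(stones):
    """Render the parsed stones as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(stones)
    return buffer.getvalue()


# Made-up payload for illustration only; not real data from the site.
SAMPLE = json.dumps(
    {"stones": [{"shape": "MQS", "diamond_code": "X1", "csprice": "100"}]}
)

if __name__ == "__main__":
    print(stones_to_csv(parse_stones(SAMPLE)))
```

Inside sendRequest, the call to getData(res) could then be swapped for something like all_stones.extend(parse_stones(res)), with all_stones being a list created before the loop.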

If you are only interested in one kind of result, you can study these parameters to narrow the response down to what you need:

&stone_shape=MQS&stone_carat_min=0.20&stone_carat_max=30.00&stone_clarity=&stone_color=&stone_certificate=&stone_cut=&stone_polish=&stone_symmetry=&stone_fluorescence=&stone_price_min=100&stone_price_max=5000000&show_image=&show_video=&show_instock=&show_heart_arrows=&markup=&tax_class_id=10&design_id=49&image_stone=di&side_stone=&metal_purity=GL_18K_W&product_id=15265&ring_size=R16_7&active_diamond_tab=LAB&diamond_code=&edit_product=&order=asc&search=&page=1
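One way to experiment with these parameters without editing a long string by hand is to parse it into a dict, override the values you care about, and pass the dict to requests, which form-encodes a dict given as data=. The sketch below assumes a shortened copy of the string above. build_payload and BASE_QUERY are hypothetical names, and whether the server honours any particular override is untested.

```python
from urllib.parse import parse_qsl, urlencode

# Shortened copy of the parameter string shown above.
BASE_QUERY = (
    "stone_shape=MQS&stone_carat_min=0.20&stone_carat_max=30.00"
    "&stone_clarity=&stone_price_min=100&stone_price_max=5000000"
    "&metal_purity=GL_18K_W&ring_size=R16_7&active_diamond_tab=LAB"
    "&order=asc&page=1"
)


def build_payload(page, **overrides):
    """Return the form fields as a dict, with overrides and page applied."""
    # keep_blank_values=True preserves empty fields such as stone_clarity.
    params = dict(parse_qsl(BASE_QUERY, keep_blank_values=True))
    params.update(overrides)
    params["page"] = str(page)
    return params


if __name__ == "__main__":
    payload = build_payload(2, stone_price_max="2000")
    # requests.post(url, data=payload, ...) would encode this dict itself;
    # urlencode shows what the encoded body looks like.
    print(urlencode(payload))
```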

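Some key points:

  1. My script does not apply any sorting or filtering; that part is up to you.
  2. If you want to sort or filter the values, feel free to study the parameters I listed above.
  3. What is product_id? After looking at the URL you mentioned, my guess is that the product ID is the last five digits of that URL (17801 would be the product_id in clrn0717801).
  4. I used a loop to pull as many results as possible that match your criteria. (Hopefully :()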
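If the guess in point 3 holds, the ID could be pulled out of the URL as sketched below. This is purely illustrative: guess_product_id is a hypothetical helper, and the rule it encodes (last five digits of the final path segment) is only the guess stated above. The sample parameter string earlier shows product_id=15265, so the guess may well be wrong and should be checked against a real request.

```python
import re
from urllib.parse import urlparse


def guess_product_id(url):
    """Return the last five digits of the URL's final path segment, or None."""
    slug = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    # Unverified assumption: the ID is the trailing five digits of the slug.
    match = re.search(r"(\d{5})$", slug)
    return match.group(1) if match else None


if __name__ == "__main__":
    url = (
        "https://www.diamondsfactory.co.uk/design/"
        "combined-band-look-diamond-engagement-ring-clrn0717801"
    )
    print(guess_product_id(url))  # prints 17801
```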

Hope this helps.

Thanks
