Okay, so I'm trying to scrape the table from this website - https://www.diamondsfactory.co.uk/design/combined-band-look-diamond-engagement-ring-clrn0717801
Specifically this section: the rows in the table.
So far I've tried this script:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
# Initialize the Chrome driver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
# Open the webpage
url = "https://www.diamondsfactory.co.uk/design/combined-band-look-diamond-engagement-ring-clrn0717801"
driver.get(url)
# Find the table row by its class name
row = driver.find_elements(By.CLASS_NAME, "tdStone odd")
# Extract the data from each cell in the row
td_elements = row.find_elements(By.TAG_NAME, "td")
data = [td.text for td in td_elements]
# Print the extracted data
print(data)
# Close the driver
driver.quit()
However, I got this error:
Traceback (most recent call last):
  File "C:\Users\Payalkumavat\scrape_diamonds.py\scrape.py", line 17, in <module>
    td_elements = row.find_elements(By.TAG_NAME, "td")
                  ^^^^^^^^^^^^^^^^^^
AttributeError: 'list' object has no attribute 'find_elements'
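On the error itself: `find_elements` (plural) returns a list, and a list has no `find_elements` method, so each row has to be visited in a loop. `By.CLASS_NAME` also expects a single class name, so a space-separated value such as "tdStone odd" will not match as intended. A minimal sketch of the loop, assuming the rows can be matched with the CSS selector `.tdStone` (a guess based on the class name in the question; the helper name `extract_rows` is made up here):

```python
def extract_rows(driver, row_selector=".tdStone"):
    """Return one list of cell texts per row matching row_selector."""
    # "css selector" and "tag name" are the string values behind
    # By.CSS_SELECTOR and By.TAG_NAME, spelled out so that this sketch
    # does not need Selenium installed just to be imported.
    rows = driver.find_elements("css selector", row_selector)
    # find_elements returned a list, so query the cells row by row
    return [
        [td.text for td in row.find_elements("tag name", "td")]
        for row in rows
    ]
```

With the driver from the script above, this would be called as `extract_rows(driver)` after `driver.get(url)`. If the table is filled in after the page loads, the list may come back empty until the rows have rendered.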
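Note: this answer takes a different approach to achieving your goal. (Modules used: requests, json, time)
From your question, I take it you are trying to get all the information associated with the `tdStone` class in the HTML page. I found a better solution than using Selenium; here is my reasoning: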
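Your target application has an endpoint, `index.php?route=` (URL: https://www.diamondsfactory.co.uk/index.php?route=product/product/lazyloadDiamond), which acts as a kind of API route and fetches all the details from the server (all of this information is then placed in the HTML source under the `tdStone` class). So if we send this endpoint a request that specifies the target (for example, diamonds that match your criteria), we can fetch the data easily with the Python requests library and a little code. Here is my code: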
Note: to avoid rate-limiting issues, I call `time.sleep(3)` to pause for three seconds between requests.
import json
import time

import requests
import urllib3

# verify=False below skips TLS certificate checks, so silence the resulting warning
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)


def getData(content):
    # Parse the JSON response and print the details of every stone in it
    loaded_content = json.loads(content)
    result = loaded_content['stones']
    for i in result:
        shape = i['shape']
        diamond_code = i['diamond_code']
        color = i['color']
        weight = i['weight_display']
        clarity = i['clarity']
        certificate = i['lab']
        image = i['image_url']
        polish = i['polish']
        symmetry = i['symmetry']
        price = i['csprice']
        video_url = i['video_url']
        mm = i['meas']
        depth = i['depth']
        table = i['table']
        print(f"Diamond Shape: {shape}\nDiamond Color: {color}\nDiamond Code: {diamond_code}\nDiamond Weight: {weight}\nDiamond Clarity: {clarity}\nDiamond Depth: {depth}\nDiamond Table: {table}\nDiamond MM: {mm}\nDiamond Certificate: {certificate}\nDiamond Image: {image}\nDiamond Video: {video_url}\nDiamond Polish: {polish}\nDiamond Symmetry: {symmetry}\nDiamond Price: {price}\n====================================")


def sendRequest(url):
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:130.0) Gecko/20100101"
    }
    # Request pages 1 to 99, stopping at the first page that contains no stones
    for i in range(1, 100):
        data = f"stone_type=LAB&ring_size=R16_7&metal_purity=GL_18K_W&stone_carat_min=0.20&stone_carat_max=30.00&stone_price_min=100&stone_price_max=5000000&active_diamond_tab=LAB&page={i}"
        time.sleep(3)  # pause between requests to avoid rate limiting
        res = requests.post(url, data=data, verify=False, headers=headers).text
        if '"stone_price_id":' in res:
            getData(res)
        else:
            break


sendRequest('https://www.diamondsfactory.co.uk/index.php?route=product/product/lazyloadDiamond')
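If you would rather keep the results than print them, the parsing step could return rows instead. This is only a sketch: it assumes the response keeps the `stones` key and the field names used in the code above, and the names `FIELDS`, `stones_to_rows`, `rows_to_csv` and the sample payload are made up for illustration.

```python
import csv
import io
import json

# Field names taken from the answer's getData(); trim or extend as needed.
FIELDS = ["shape", "diamond_code", "color", "weight_display",
          "clarity", "lab", "csprice"]


def stones_to_rows(content):
    """Return one dict per stone, keeping only the keys listed in FIELDS."""
    stones = json.loads(content).get("stones", [])
    # .get() leaves a field as None instead of raising KeyError when it is absent
    return [{field: stone.get(field) for field in FIELDS} for stone in stones]


def rows_to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample in the assumed response shape, not real site data
sample = json.dumps({"stones": [{
    "shape": "Round", "diamond_code": "X1", "color": "D",
    "weight_display": "0.30", "clarity": "VS1", "lab": "IGI", "csprice": "250",
}]})
rows = stones_to_rows(sample)
print(rows_to_csv(rows))
```

In the loop above, `getData(res)` could then be swapped for something like `all_rows.extend(stones_to_rows(res))`, with the CSV written once at the end.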
If you are only interested in one kind of result, you can examine these parameters to narrow the response down to what you want:
&stone_shape=MQS&stone_carat_min=0.20&stone_carat_max=30.00&stone_clarity=&stone_color=&stone_certificate=&stone_cut=&stone_polish=&stone_symmetry=&stone_fluorescence=&stone_price_min=100&stone_price_max=5000000&show_image=&show_video=&show_instock=&show_heart_arrows=&markup=&tax_class_id=10&design_id=49&image_stone=di&side_stone=&metal_purity=GL_18K_W&product_id=15265&ring_size=R16_7&active_diamond_tab=LAB&diamond_code=&edit_product=&order=asc&search=&page=1
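One way to experiment with these filters is to keep them in a dict and let `urllib.parse.urlencode` build the request body. A rough sketch: the base values are copied from the request above, `build_body` is a made-up helper, and whether the server honours every parameter is an assumption to check against the live responses.

```python
from urllib.parse import parse_qsl, urlencode

# Filters copied from the request body in the answer's code
BASE_PARAMS = {
    "stone_type": "LAB",
    "ring_size": "R16_7",
    "metal_purity": "GL_18K_W",
    "stone_carat_min": "0.20",
    "stone_carat_max": "30.00",
    "stone_price_min": "100",
    "stone_price_max": "5000000",
    "active_diamond_tab": "LAB",
}


def build_body(page, **overrides):
    """Return a form-encoded body for one page, with optional filter overrides."""
    params = {**BASE_PARAMS, **overrides, "page": page}
    return urlencode(params)


# Example: marquise stones (MQS, as in the sample string) up to a lower price cap
body = build_body(1, stone_shape="MQS", stone_price_max="2000")
print(body)
# parse_qsl turns a captured query string back into (name, value) pairs
print(parse_qsl(body))
```

The resulting string can be passed as `data=body` in place of the hand-written f-string. `requests.post` also accepts a dict for `data` and encodes it itself, so the helper is mainly useful for inspecting what gets sent.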
As for `product_id`: after analyzing the URL you mentioned, my guess is that the product ID is the last five digits of that URL (`17801` would be the `product_id` for `clrn0717801`). Hope this helps.
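If that guess holds, the ID could be pulled out of a product URL as below. This is a sketch of the guess only: `guess_product_id` is a made-up helper, and the sample parameter string above carries `product_id=15265`, so the result should be checked against the request shown in the browser's network tab.

```python
import re


def guess_product_id(url):
    """Return the last five digits at the end of the URL, or None if absent."""
    # Follows the answer's guess; the mapping to product_id is unverified.
    match = re.search(r"(\d{5})/?$", url)
    return match.group(1) if match else None


url = "https://www.diamondsfactory.co.uk/design/combined-band-look-diamond-engagement-ring-clrn0717801"
print(guess_product_id(url))
```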
Thanks