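I'm trying to scrape a website for a small project, and the data I need is hidden under a #shadow-root node in the HTML. I tried to access it with Selenium using the following code: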
def expand_shadow_element(element):
    shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
    return shadow_root
url = "https://new.abb.com/products/SK615502-D"
#Initializing the webdriver
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(executable_path="/Users/ritchevy/Desktop/scraping-glassdoor/chromedriver", options=options)
timeout = 10
wait = WebDriverWait(driver, timeout)
driver.set_window_size(1120, 1000)
driver.get(url)
root1 = driver.find_element(By.CSS_SELECTOR,"pis-products-details-attribute-groups")
shadow_root1 = expand_shadow_element(root1)
shadow_container_root = shadow_root1.find_element(By.CSS_SELECTOR,"div")
When I run it, I get this error:
---> 35 shadow_container_root = shadow_root1.find_element(By.CSS_SELECTOR,"div")
36
AttributeError: 'dict' object has no attribute 'find_element'
Any idea how to solve this?
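I didn't have any problems running your original code, so I'm not sure why it isn't working for you. Since you're not running headless, do you see the expected page open in the browser? You might have to insert a time.sleep() after the driver.get(url) call to make sure you can see the browser window before the error is hit.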
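A minimal sketch of that debugging step, assuming a Selenium-style `driver` object. It pauses after loading the page and then checks what `shadowRoot` actually came back as. The helper names `load_and_pause` and `fetch_shadow_root` and the 5-second default are illustrative and not part of the original code. The dict branch reflects one plausible explanation for the error that is not confirmed here: a client library that does not wrap the browser's shadow-root reference in an element-like object.

```python
import time


def load_and_pause(driver, url, pause_seconds=5.0):
    """Open `url`, then sleep so the browser window can be inspected."""
    driver.get(url)
    time.sleep(pause_seconds)


def fetch_shadow_root(driver, host):
    """Return the shadow root attached to `host`, failing with a readable message."""
    root = driver.execute_script("return arguments[0].shadowRoot", host)
    if root is None:
        # No open shadow root on this element (yet); the page may still be rendering.
        raise LookupError("host element has no open shadow root")
    if isinstance(root, dict):
        # Assumption: a plain dict means the raw WebDriver reference was not
        # wrapped by the client library, so it has no find_element method.
        raise TypeError(f"shadowRoot came back as a raw dict: {root!r}")
    return root
```

In the question's code, `expand_shadow_element(root1)` could be swapped for `fetch_shadow_root(driver, root1)`. If the dict branch triggers, checking the installed Selenium version against the browser version would be a reasonable next step.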
I made a few small tweaks and then pulled the data out of the tables inside the shadow root (assuming this is the data you're after).
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
url = "https://new.abb.com/products/SK615502-D"
options = webdriver.ChromeOptions()
# * Use local Chrome.
# driver = webdriver.Chrome(options=options)
# * Use remote Chrome in Docker container.
# The browser capabilities are carried by `options`, so no separate
# DesiredCapabilities argument is passed.
driver = webdriver.Remote(
    command_executor="http://127.0.0.1:4444/wd/hub",
    options=options
)
wait = WebDriverWait(driver, 10)
driver.get(url)
# Find element enclosing the shadow root DOM.
#
root = driver.find_element(By.CSS_SELECTOR, "pis-products-details-attribute-groups")
# Extract the shadow root content.
#
shadow_root = driver.execute_script('return arguments[0].shadowRoot', root)
print(shadow_root)
for table in shadow_root.find_elements(By.CSS_SELECTOR, ".ext-attr-group .ext-attr-group-inner"):
    title = table.find_element(By.CSS_SELECTOR, "h4")
    print("====================================================")
    print("🟦 "+title.text)
    for row in table.find_elements(By.CSS_SELECTOR, ".ext-attr-group-content > div"):
        key = row.find_element(By.CSS_SELECTOR, ".col-md-4")
        value = row.find_element(By.CSS_SELECTOR, ".col-md-8")
        print(str(key.text)+" "+str(value.text))
I normally use a remote Selenium instance, but you can comment that out and use webdriver.Chrome(options=options) instead.
Here's what some of the data looks like:
====================================================
🟦 Ordering
Minimum Order Quantity: 1 piece
Customs Tariff Number: 85389099
Product Main Type: Accessories
====================================================
🟦 Popular Downloads
Data Sheet, Technical Information: 1SFC151007C02__
Instructions and Manuals: 1SFC151011M0201
CAD Dimensional Drawing: 2CDC001079B0201
====================================================
🟦 Dimensions
Product Net Width: 0.038 m
Product Net Depth / Length: 0.038 m
Product Net Height: 0.038 m
Product Net Weight: 0.08 kg
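If the values are needed for further processing rather than just printed, the same loop could fill a nested dictionary instead. This is a minimal sketch, assuming the selectors from the code above still match the page. The function name `collect_attribute_groups` is hypothetical, and the locator strategy is written as the literal string "css selector", which is the value of Selenium's `By.CSS_SELECTOR`, so that the sketch has no import-time dependency.

```python
CSS = "css selector"  # same string as selenium.webdriver.common.by.By.CSS_SELECTOR


def collect_attribute_groups(shadow_root):
    """Map each group title to a dict of its attribute names and values."""
    groups = {}
    for table in shadow_root.find_elements(CSS, ".ext-attr-group .ext-attr-group-inner"):
        title = table.find_element(CSS, "h4").text
        rows = {}
        for row in table.find_elements(CSS, ".ext-attr-group-content > div"):
            key = row.find_element(CSS, ".col-md-4").text
            value = row.find_element(CSS, ".col-md-8").text
            # Keys carry a trailing colon in the sample output; drop it here.
            rows[key.strip().rstrip(":")] = value.strip()
        groups[title] = rows
    return groups
```

Called as `collect_attribute_groups(shadow_root)`, this returns one inner dictionary per group (for example "Ordering" or "Dimensions"), which can then be written to JSON or CSV.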