Scraping Shadow root elements with Python Selenium


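I'm trying to scrape a website for a small project, and the data I need is hidden under a #shadow-root node in the HTML. I tried to access it with Selenium using the following code: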

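from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

def expand_shadow_element(element):
  # Ask the browser for the shadow root attached to the host element.
  shadow_root = driver.execute_script('return arguments[0].shadowRoot', element)
  return shadow_root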

url = "https://new.abb.com/products/SK615502-D"

#Initializing the webdriver
options = webdriver.ChromeOptions()

driver = webdriver.Chrome(executable_path="/Users/ritchevy/Desktop/scraping-glassdoor/chromedriver", options=options)
timeout = 10
wait = WebDriverWait(driver, timeout)

driver.set_window_size(1120, 1000)
driver.get(url)

root1 = driver.find_element(By.CSS_SELECTOR,"pis-products-details-attribute-groups")
shadow_root1 = expand_shadow_element(root1)
shadow_container_root = shadow_root1.find_element(By.CSS_SELECTOR,"div")

Running it gives me this error:

---> 35 shadow_container_root = shadow_root1.find_element(By.CSS_SELECTOR,"div")
     36 

AttributeError: 'dict' object has no attribute 'find_element'

Any idea how to solve this?

python selenium-webdriver web-scraping shadow-dom
1 Answer

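I had no problem running your original code, so I'm not sure why it isn't working for you. Since you're not running headless, do you see the desired page open in the browser? You may need to insert a time.sleep() call after the driver.get(url) call so that you can see the browser window before the error is raised.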
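One plausible cause, which is an assumption because the question does not state any versions: older Selenium releases (before 4.1) do not convert the shadow root reference returned by recent Chrome versions, so execute_script hands back a plain dict instead of an object with find_element. The sketch below prefers the shadow_root property that Selenium 4.1+ provides on WebElement and fails with a clearer message when a dict comes back. The helper name get_shadow_root is hypothetical and not part of Selenium.

```python
def get_shadow_root(driver, host):
    """Return the shadow root of `host`, or raise a descriptive error.

    `get_shadow_root` is a hypothetical helper name, not a Selenium API.
    """
    # Selenium 4.1+ defines WebElement.shadow_root, which returns a
    # ShadowRoot object supporting find_element / find_elements.
    if hasattr(type(host), "shadow_root"):
        return host.shadow_root

    # Older bindings: fall back to JavaScript, as in the question.
    root = driver.execute_script("return arguments[0].shadowRoot", host)

    # A dict here means the binding did not convert the browser's shadow
    # root reference into a usable object (assumed cause of the error).
    if isinstance(root, dict):
        raise RuntimeError(
            "execute_script returned a raw dict for the shadow root; "
            "upgrading the selenium package to 4.1 or later is the "
            "likely fix"
        )
    return root
```

With this helper, the question's `shadow_root1 = expand_shadow_element(root1)` line could become `shadow_root1 = get_shadow_root(driver, root1)`.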

I made a few small tweaks and then pulled the data from the tables inside the shadow root (assuming that is the data you want).

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By

url = "https://new.abb.com/products/SK615502-D"

options = webdriver.ChromeOptions()
# * Use local Chrome.
# driver = webdriver.Chrome(options=options)
# * Use remote Chrome in Docker container.
# Passing only options works across Selenium 4 releases; the positional
# DesiredCapabilities argument is no longer accepted by recent versions.
driver = webdriver.Remote(
  command_executor="http://127.0.0.1:4444/wd/hub",
  options=options
)

wait = WebDriverWait(driver, 10)

driver.get(url)

# Find element enclosing the shadow root DOM.
#
root = driver.find_element(By.CSS_SELECTOR, "pis-products-details-attribute-groups")

# Extract the shadow root content.
#
shadow_root = driver.execute_script('return arguments[0].shadowRoot', root)
print(shadow_root)

for table in shadow_root.find_elements(By.CSS_SELECTOR, ".ext-attr-group .ext-attr-group-inner"):
    title = table.find_element(By.CSS_SELECTOR, "h4")
    print("====================================================")
    print("🟦 "+title.text)
    for row in table.find_elements(By.CSS_SELECTOR, ".ext-attr-group-content > div"):
        key = row.find_element(By.CSS_SELECTOR, ".col-md-4")
        value = row.find_element(By.CSS_SELECTOR, ".col-md-8")
        print(str(key.text)+" "+str(value.text))

I normally use a remote Selenium instance, but you can comment that out and use webdriver.Chrome(options=options) instead.

Here is what some of the data looks like:

====================================================
🟦 Ordering
Minimum Order Quantity: 1 piece
Customs Tariff Number: 85389099
Product Main Type: Accessories
====================================================
🟦 Popular Downloads
Data Sheet, Technical Information: 1SFC151007C02__
Instructions and Manuals: 1SFC151011M0201
CAD Dimensional Drawing: 2CDC001079B0201
====================================================
🟦 Dimensions
Product Net Width: 0.038 m
Product Net Depth / Length: 0.038 m
Product Net Height: 0.038 m
Product Net Weight: 0.08 kg
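
If the values are needed for further processing rather than printing, the same loops could be wrapped in a function that returns a nested dictionary. This is a minimal sketch that reuses the CSS selectors from the answer; the function name extract_attribute_groups is hypothetical, and stripping a trailing colon from each key assumes the key text ends with one, as the sample output suggests.

```python
# The string value of selenium's By.CSS_SELECTOR, written out so the
# sketch has no import-time dependency.
CSS = "css selector"


def extract_attribute_groups(shadow_root):
    """Collect {group title: {attribute name: value}} from the shadow root.

    `extract_attribute_groups` is a hypothetical helper; `shadow_root` is
    any object with Selenium-style find_element / find_elements methods.
    """
    groups = {}
    for table in shadow_root.find_elements(
        CSS, ".ext-attr-group .ext-attr-group-inner"
    ):
        title = table.find_element(CSS, "h4").text.strip()
        rows = {}
        for row in table.find_elements(CSS, ".ext-attr-group-content > div"):
            # Assumption: the key cell text ends with ":" (see sample output).
            key = row.find_element(CSS, ".col-md-4").text.strip().rstrip(":")
            value = row.find_element(CSS, ".col-md-8").text.strip()
            rows[key] = value
        groups[title] = rows
    return groups
```

Calling `extract_attribute_groups(shadow_root)` after the page has loaded would then give one dictionary per group, which can be looked up by title or dumped with the json module.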