I'm trying to get the title 'Friday Night Lights', but I can't seem to get past the JavaScript.
I'm using Python with Selenium or BeautifulSoup. I tried WebDriverWait(driver, 10), and I'm using webdriver.Chrome.
options = webdriver.ChromeOptions()
options.add_argument("disable-infobars")
options.add_argument("start-maximized")
options.add_argument("disable-dev-shm-usage")
options.add_argument("no-sandbox")
#options.add_experimental_option("prefs", {'profile.managed_default_content_settings.javascript': 2})
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_argument("disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(options=options)
driver.get('https://www.msn.com/nl-be/nieuws/other/de-50-beste-netflix-series-volgens-the-new-york-times/ss-BB1kJC5H?rc=1&ocid=winp1taskbar&cvid=c774477be4b04494b3690631644cf5a9&ei=3#image=2')
page = requests.get('https://www.msn.com/nl-be/nieuws/other/de-50-beste-netflix-series-volgens-the-new-york-times/ss-BB1kJC5H?rc=1&ocid=winp1taskbar&cvid=c774477be4b04494b3690631644cf5a9&ei=3#image=2')
element = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, '#ViewsPageId-BB1kJC5H')))
title = soup.find_all(name="span", class_="title")
print(title)
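This returns an empty list. When I print the page source, I get the HTML from before the JavaScript runs, so the title isn't there. In the inspector, though, I see the full HTML after the JavaScript has run, and it includes the title.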
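Tricky one.
The problem is that the information sits inside a shadow DOM element, which can't be accessed directly through the normal page source or element lookups. You have to do a bit of extra work: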
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# `driver` is the webdriver.Chrome instance created in the question
driver.get('https://www.msn.com/nl-be/nieuws/other/de-50-beste-netflix-series-volgens-the-new-york-times/ss-BB1kJC5H?rc=1&ocid=winp1taskbar&cvid=c774477be4b04494b3690631644cf5a9&ei=3#image=2')
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'gallery-slideshow'))
)
# Step 1: Locate the shadow host element
shadow_host = driver.find_element(By.CSS_SELECTOR, 'gallery-slideshow')
# Step 2: Access the shadow root using JavaScript
shadow_root = driver.execute_script('return arguments[0].shadowRoot', shadow_host)
# Step 3: Interact with elements inside the shadow DOM
shadow_element = shadow_root.find_element(By.CLASS_NAME, 'metadata-container')
# The title is the first line of the container's text
print(shadow_element.text.split('\n')[0])
That should do the trick.
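If the content inside the shadow root is rendered a moment after the host element appears, the lookup in step 3 can fail intermittently. Here is a minimal sketch of the same idea wrapped in a function that also waits for the inner element. It assumes Selenium 4.1 or later with a recent Chromium-based browser, so that the `shadow_root` property on a WebElement is available. The names `get_slide_title` and `first_nonempty_line` are made up for illustration, and the selectors are the ones used above.

```python
def first_nonempty_line(text: str) -> str:
    """Return the first non-blank line of `text`, stripped, or '' if there is none."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""


def get_slide_title(driver, host_css="gallery-slideshow",
                    inner_css=".metadata-container", timeout=20):
    """Hypothetical helper: read the first line of text from inside a shadow root."""
    # Selenium is imported here so the pure helper above works without it installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    # Wait for the shadow host, as in the snippet above.
    host = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, host_css)))
    # Assumption: Selenium 4.1+; this property replaces the execute_script call.
    root = host.shadow_root
    # WebDriverWait ignores NoSuchElementException by default, so this retries
    # until the inner element exists or the timeout expires.
    inner = wait.until(lambda _driver: root.find_element(By.CSS_SELECTOR, inner_css))
    return first_nonempty_line(inner.text)
```

With the `driver` from the question already on the page, `print(get_slide_title(driver))` would replace steps 1 to 3.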