How to scrape a title from a JavaScript-driven page using Python


I am trying to scrape this website: https://www.msn.com/nl-be/nieuws/other/de-50-beste-netflix-series-volgens-the-new-york-times/ss-BB1kJC5H?rc=1&ocid=winp1taskbar&cvid=c774477be4b04494b3690631644cf5a9&ei=3#image=2

I am trying to get the title 'Friday Night Lights', but I can't seem to get past the JavaScript.

I am using Python with Selenium or BeautifulSoup. I have tried WebDriverWait(driver, 10), and I am using webdriver.Chrome.

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument("disable-infobars")
options.add_argument("start-maximized")
options.add_argument("disable-dev-shm-usage")
options.add_argument("no-sandbox")
#options.add_experimental_option("prefs", {'profile.managed_default_content_settings.javascript': 2})
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_argument("disable-blink-features=AutomationControlled")

driver = webdriver.Chrome(options=options)
driver.get('https://www.msn.com/nl-be/nieuws/other/de-50-beste-netflix-series-volgens-the-new-york-times/ss-BB1kJC5H?rc=1&ocid=winp1taskbar&cvid=c774477be4b04494b3690631644cf5a9&ei=3#image=2')
page = requests.get('https://www.msn.com/nl-be/nieuws/other/de-50-beste-netflix-series-volgens-the-new-york-times/ss-BB1kJC5H?rc=1&ocid=winp1taskbar&cvid=c774477be4b04494b3690631644cf5a9&ei=3#image=2')
element = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, '#ViewsPageId-BB1kJC5H')))

# Note: `soup` is never defined in the post; it was presumably built with BeautifulSoup from `page`.
title = soup.find_all(name="span", class_="title")
print(title)

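This returns an empty list. When I print the page source, I get the HTML as it is before the JavaScript runs, so the title is not there. In the browser inspector, however, I see the full HTML after the JavaScript has run, and that does include the title.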
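The symptom described above can be reproduced without a browser. The following is a minimal sketch using only the standard library; the two HTML strings are hypothetical stand-ins (not the real MSN markup) for what a plain HTTP request receives versus what the inspector shows after scripts have run, and `find_titles` is an illustrative helper, not part of any library.

```python
from html.parser import HTMLParser


class TitleSpanCollector(HTMLParser):
    """Collects the text of every <span class="title"> in a document."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "title" in classes:
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data


def find_titles(html):
    """Return the stripped text of all title spans found in `html`."""
    parser = TitleSpanCollector()
    parser.feed(html)
    parser.close()
    return [t.strip() for t in parser.titles]


# Hypothetical markup: what the server sends before any script has run.
STATIC_HTML = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
# Hypothetical markup: the same page after a script has filled in the content.
RENDERED_HTML = '<html><body><div id="root"><span class="title">Friday Night Lights</span></div></body></html>'

print(find_titles(STATIC_HTML))    # []
print(find_titles(RENDERED_HTML))  # ['Friday Night Lights']
```

A parser can only find what is in the string it is given, so fetching the page with `requests` and parsing that response will keep returning an empty list however long Selenium waits in the separate browser session.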

javascript python selenium-webdriver
1 Answer

Tricky.

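The problem is that the information sits inside a shadow DOM element, which cannot be reached with a normal element lookup on the page. You have to do some extra work: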

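from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# `driver` is the webdriver.Chrome instance created in the question
driver.get('https://www.msn.com/nl-be/nieuws/other/de-50-beste-netflix-series-volgens-the-new-york-times/ss-BB1kJC5H?rc=1&ocid=winp1taskbar&cvid=c774477be4b04494b3690631644cf5a9&ei=3#image=2')

# Wait until the custom element that hosts the shadow DOM is present
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'gallery-slideshow'))
)

# Step 1: Locate the shadow host element
shadow_host = driver.find_element(By.CSS_SELECTOR, 'gallery-slideshow')

# Step 2: Access the shadow root using JavaScript
shadow_root = driver.execute_script('return arguments[0].shadowRoot', shadow_host)

# Step 3: Interact with elements inside the shadow DOM
# (a CSS selector is the most portable locator inside a shadow root)
shadow_element = shadow_root.find_element(By.CSS_SELECTOR, '.metadata-container')

# The title is the first line of the container's text
print(shadow_element.text.split('\n')[0])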

That should do the trick.
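If the same lookup is needed in more than one place, the three steps from the answer can be wrapped in a small helper. This is only a sketch: the function name `first_line_in_shadow` and its parameters are made up for illustration, and it assumes a Selenium 4 driver with a Chromium-based browser, where `execute_script` returns a shadow root object that itself supports `find_element`. The locator string `"css selector"` is the value of Selenium's `By.CSS_SELECTOR`, written out so the snippet has no import of its own.

```python
from typing import Any, Optional

CSS = "css selector"  # same string value as selenium's By.CSS_SELECTOR


def first_line_in_shadow(driver: Any, host_css: str, inner_css: str) -> Optional[str]:
    """Return the first non-empty text line of an element inside a shadow root.

    Returns None when the host has no open shadow root or the element has no text.
    """
    # Step 1: locate the shadow host in the regular DOM
    host = driver.find_element(CSS, host_css)

    # Step 2: ask the browser for the host's shadow root
    # (this is null/None for closed roots or hosts without a shadow root)
    root = driver.execute_script("return arguments[0].shadowRoot", host)
    if root is None:
        return None

    # Step 3: search inside the shadow root and take the first non-empty line
    text = root.find_element(CSS, inner_css).text
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return None
```

With the driver from the answer, a call such as `first_line_in_shadow(driver, 'gallery-slideshow', '.metadata-container')` would be expected to give the same result as the final `print` line above. A missing host or inner element still raises Selenium's usual lookup exception, so the explicit wait from the answer remains necessary before calling it.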
