A specific span tag is empty, with no attributes, when using Selenium/SeleniumBase

Question

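I'm trying to scrape baseball prospect data from MLB.com using Python. When I open a browser manually and view the page's source HTML, I see JSON data containing everything I need in the single attribute, data-init-state, of a <span> tag. However, when I use Selenium or SeleniumBase, that <span> tag is completely empty, with no attributes at all: <span></span>. As a result, I can't access the data I need.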

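At first, I used Selenium's find_element() and SeleniumBase's get_element() to look for the data-init-state attribute on the span element, but I kept getting errors saying the element could not be found. Once I noticed the empty <span> tag in the HTML source, it was clear why it couldn't be found, and I started looking for the cause and a solution. So for now I'll only show how I'm getting the HTML source.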

First I tried using the regular Selenium WebDriver:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from time import sleep

url = "https://www.mlb.com/prospects"

chrome_options = Options()
chrome_options.add_argument('--headless=new')
driver = webdriver.Chrome(options=chrome_options)
driver.get(url)
sleep(10)

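prospects_source_html = driver.page_source
driver.quit()  # close the browser once the source has been captured

# Write as UTF-8 explicitly so non-ASCII characters don't fail on some platforms.
with open('prospects_source_html.html', 'w', encoding='utf-8') as file:
    file.write(prospects_source_html)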

Thinking the site might be detecting automation software and withholding the data, I then tried SeleniumBase:

from seleniumbase import SB

url = "https://www.mlb.com/prospects"

with SB(uc=True, test=True, incognito=True) as sb:
    sb.uc_open_with_reconnect(url, 8)
    sb.sleep(15)
    sb.save_page_source('prospects_source_html.html')

Thinking the site might simply need extra time to populate that data, I tried extending the sleep() to 120 seconds. Unfortunately, the result was the same.

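I tried running a headed version (removing the --headless=new option for Selenium, and using the headed=True parameter for SeleniumBase), but that made no difference either: the <span> tag was still empty.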

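While testing with the headed version, I put in a long sleep() after the page loaded. Then, instead of sitting back and letting Selenium/SeleniumBase do its thing, I manually right-clicked the page, selected "View page source", and inspected the HTML in the browser window. Interestingly, the data-init-state attribute and its JSON data were there in the <span> tag! Yet when I let the script continue and finish, the saved HTML file still had the empty <span> tag.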

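I'm confused... Is it possible that the site runs a script after the page loads to populate that span, but blocks the script from running because it detects Selenium/SeleniumBase?

Does anyone know what's going on here? Is there any way to make this work?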

Answer 1

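As far as I understand, the <span data-init-state="..."></span> tag is present when the page first loads, and its attribute is then cleared once the page has initialized. That would be consistent with what you saw: "View page source" shows the HTML as it was served, whereas page_source reflects the current DOM after the page's scripts have run.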
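To check that explanation against your own captures, it may help to test each saved HTML string for the attribute. The following is a minimal sketch, not something taken from the site: the helper name has_init_state and both sample strings are made up for illustration, standing in for the HTML as served and for the source saved by Selenium.

```python
import re

# Matches a <span ...> start tag that carries a data-init-state attribute.
INIT_STATE_SPAN = re.compile(r'<span\b[^>]*\bdata-init-state\s*=', re.IGNORECASE)


def has_init_state(html_text: str) -> bool:
    """Return True if the markup contains a <span> with data-init-state."""
    return INIT_STATE_SPAN.search(html_text) is not None


# Made-up stand-ins for the two versions of the page (not real MLB.com markup).
served_html = '<div><span data-init-state="{&quot;ok&quot;:true}"></span></div>'
live_dom_html = '<div><span></span></div>'

print(has_init_state(served_html))    # True
print(has_init_state(live_dom_html))  # False
```

Running the same check on the raw HTTP response and on the file written from page_source should show which of the two still carries the attribute.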

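To retrieve the data from the data-init-state attribute, you can skip Selenium and send a plain HTTP request, which gives you the full page HTML before any initialization has run.

Here is some example code: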

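import re
import html
import json
import requests

# Capture the value of the data-init-state attribute from the raw HTML.
pattern = re.compile(r'data-init-state="([^"]+)"')

url = "https://www.mlb.com/prospects"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early on a non-2xx response
html_content = response.content.decode()

if match := pattern.search(html_content):
    # The attribute value is HTML-escaped JSON, so unescape it before parsing.
    raw_data = html.unescape(match.group(1))
    data = json.loads(raw_data)

    # ensure_ascii=False writes non-ASCII characters as-is, so write UTF-8.
    with open("data.json", "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)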

The data from the data-init-state attribute is saved in a file named data.json.
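If the regular expression turns out to be brittle (for example, if the attribute order or quoting changes), the same extraction can be done with an HTML parser instead. The following is a minimal sketch using only the standard library. The names InitStateParser and extract_init_state are hypothetical, and the sample markup is invented, not real MLB.com output. html.parser unescapes attribute values itself, so no separate html.unescape call is needed.

```python
import json
from html.parser import HTMLParser


class InitStateParser(HTMLParser):
    """Collects the first data-init-state value found on a <span> tag."""

    def __init__(self):
        super().__init__()
        self.raw_state = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; values arrive already unescaped.
        if tag == "span" and self.raw_state is None:
            value = dict(attrs).get("data-init-state")
            if value:
                self.raw_state = value


def extract_init_state(html_text):
    """Return the parsed JSON from data-init-state, or None if it is absent."""
    parser = InitStateParser()
    parser.feed(html_text)
    parser.close()
    if parser.raw_state is None:
        return None
    return json.loads(parser.raw_state)


# Invented sample markup for illustration only.
sample = (
    '<div><span data-init-state="{&quot;players&quot;: '
    '[{&quot;name&quot;: &quot;Example&quot;}]}"></span></div>'
)
print(extract_init_state(sample))  # {'players': [{'name': 'Example'}]}
```

In the requests example, passing html_content to extract_init_state would take the place of the pattern.search and html.unescape steps.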
