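I have been scraping data from ESPN's publicly available API at https://hs-consumer-api.espncricinfo.com/. Here is an example of one of the endpoints:
v1/pages/match/scorecard?lang=en&seriesId=&matchId=
These APIs are called through AJAX/XHR requests on the espncricinfo pages.
Recently they introduced a change to the pages: JavaScript now appears to add an extra header that acts as an authentication token: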
x-hsci-auth-token: exp=1727527924~hmac=9080f72a36f2b97ec94069dca3382981b2312fb10ec800ab75959a6777f344f2
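As a result, the API can no longer be called directly. I used seleniumwire with the Chrome driver to analyze the requests and concluded that if I can extract this header and send it with my own call to the API, the call works.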
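Since the token shown above carries an `exp` field, a captured header presumably stops working after some time. The following is a minimal sketch for splitting the header value into its fields and estimating how long it stays valid. It assumes that `exp` is a Unix timestamp in seconds and that the fields are separated by `~`; neither is documented by the site, and the helper names `parse_auth_token` and `seconds_until_expiry` are made up for illustration.

```python
import time


def parse_auth_token(token):
    """Split a value like 'exp=...~hmac=...' into a dict of its fields."""
    fields = {}
    for part in token.split("~"):
        key, _, value = part.partition("=")
        fields[key] = value
    return fields


def seconds_until_expiry(token, now=None):
    """Return the seconds left before `exp`.

    Assumption: `exp` is a Unix timestamp in seconds.
    A negative result means the token has (presumably) expired.
    """
    exp = int(parse_auth_token(token)["exp"])
    current = time.time() if now is None else now
    return exp - current


sample = (
    "exp=1727527924~hmac="
    "9080f72a36f2b97ec94069dca3382981b2312fb10ec800ab75959a6777f344f2"
)
fields = parse_auth_token(sample)
print(fields["exp"], fields["hmac"])
print(seconds_until_expiry(sample))
```

If the remaining time is negative, a fresh header would have to be captured from the page before replaying the request.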
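The problem I am facing is that when I open the required URL through selenium-wire, these API calls do not happen. I can see all the other calls to the ad sites and the various tracking APIs, but not the consumer API call.
Here is the code I am using: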
from seleniumwire import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import requests
import time
# Set the path to your manually installed ChromeDriver
chromedriver_path = 'chromedriver.exe'
# Set up Chrome WebDriver with Selenium Wire
chrome_options = Options()
chrome_options.add_argument("--headless") # Optional: to run browser in headless mode
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('--ignore-certificate-errors') # Ignore certificate errors
chrome_options.add_argument('--disable-proxy-certificate-handler')
# Selenium Wire options to disable SSL verification
seleniumwire_options = {
    'verify_ssl': False  # Disable SSL verification
}
# Manually specify the ChromeDriver path
driver = webdriver.Chrome(service=Service(chromedriver_path), options=chrome_options, seleniumwire_options=seleniumwire_options)
# Navigate to the target URL
driver.get('https://www.espncricinfo.com/series/germany-women-s-t20i-tri-series-2024-1444526/germany-women-vs-italy-women-final-1444537/full-scorecard')
# Give the page's JavaScript a few seconds to fire its API calls
time.sleep(5)
req_headers = {}
req_url = ''
# Capture and filter the XHR requests after the div is loaded
for request in driver.requests:
    # Match the consumer API by host; request.response is None if no response arrived
    if ('hs-consumer-api.espncricinfo.com' in request.url
            and 'scorecard' in request.url
            and request.response is not None
            and request.response.status_code == 200):
        req_url = request.url
        # print(request.response.headers)
        req_headers = dict(request.headers)  # plain dict for the requests library
        print("Found headers for the needed URL")
        break
# Close the WebDriver session
driver.quit()
# Check if headers were successfully captured
if req_headers:
    print("Request headers captured successfully.")
else:
    print("Failed to capture request headers.")
    exit(1)
try:
    print(req_headers, req_url)
    response = requests.get(req_url, headers=req_headers)
    print(f"API Response Status: {response.status_code}")
    if response.status_code == 200:
        print("API call was successful!")
        print(response)
        print(response.json())  # Print the JSON response from the API
    else:
        print(f"API call failed with status code: {response.status_code}")
        print(response.text)
except Exception as e:
    print(f"Error during API call: {str(e)}")
If I open the same URL in a normal browser, I can see the call being made to https://hs-consumer-api.espncricinfo.com/v1/pages/match/scorecard?lang=en&seriesId=1444526&matchId=1444537
but it does not show up in the selenium-wire requests.
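One thing that may be worth ruling out is the fixed five-second sleep: if the page fires the API call later than that, the loop over `driver.requests` would never see it. Below is a rough sketch that uses selenium-wire's `driver.wait_for_request()` to block until a matching request is captured. The function name `capture_api_request`, the `API_PATTERN` constant, and the 30-second default timeout are assumptions for illustration. This does not help if the site skips the call entirely for automated or headless browsers.

```python
# Pattern passed to wait_for_request(); selenium-wire accepts a substring or a regex.
API_PATTERN = r"hs-consumer-api\.espncricinfo\.com/v1/pages/match/scorecard"


def capture_api_request(driver, page_url, api_pattern=API_PATTERN, timeout=30):
    """Load page_url and wait until a request matching api_pattern is captured.

    `driver` is expected to be a seleniumwire webdriver instance.
    wait_for_request() raises a timeout exception if nothing matches in time.
    Returns the request URL and a plain dict of its headers.
    """
    driver.get(page_url)
    request = driver.wait_for_request(api_pattern, timeout=timeout)
    return request.url, dict(request.headers)
```

With the driver from the code above, the call would be `req_url, req_headers = capture_api_request(driver, page_url)`, where `page_url` is the full-scorecard page; the returned values can then be passed to `requests.get` as before.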
selenium-wire is officially no longer maintained; however, its last release works well with blinker==1.7.0. Hope this helps.
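To confirm which blinker version is actually installed in the environment, a small check along these lines may help. It is only a sketch using the standard library's `importlib.metadata`; the helper names `installed_version` and `matches_pin` are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_pin(package, pinned):
    """True only if the package is installed at exactly the pinned version."""
    return installed_version(package) == pinned


print("blinker installed:", installed_version("blinker"))
print("matches 1.7.0:", matches_pin("blinker", "1.7.0"))
```

If the version does not match, pinning it in the project's requirements as `blinker==1.7.0` is the approach described above.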