我正在尝试在 www.statmuse.com 上循环浏览投手的比赛日志。主要问题是,由于我试图循环执行此操作,因此部分 URL 目前未知。
例如,查看 Martín Pérez 2024 年的比赛日志,网址为:https://www.statmuse.com/mlb/player/martin-perez-46483/game-log
现在,在尝试循环不同的投手时,这个 5 位数字序列(在我的示例中为 46483)是可变的,并且在每个投手的比赛日志之间发生变化。
我整理了以下代码。当然,问题是循环遍历 10000 和 99999,试图找到正确的 5 位数字序列,这导致我的 Excel 崩溃并且没有响应。谁能建议一种更有效的方法来实现这一目标?我很抱歉,这是我第一个使用 HTTP 请求和类似内容的项目,所以我确信代码是一团糟。
代码:
Dim ws As Worksheet, PLws As Worksheet
Set ws = ThisWorkbook.Sheets("Sheet1")
Set PLws = ThisWorkbook.Sheets("Pitcher List T")
Set rng = PLws.Range("B1:B1")
For Each cc In rng
Dim httpRequest As MSXML2.XMLHTTP60: Set httpRequest = New MSXML2.XMLHTTP60
Dim htmldoc As HTMLDocument: Set htmldoc = New HTMLDocument
playerName = CStr(cc.Value)
Dim baseURL As String
baseURL = "https://www.statmuse.com/mlb/player/" & playerName & "-"
Dim lastRow As Long
Dim startNumber As Long
Dim endNumber As Long
startNumber = 10000 ' this loop is the issue (i think)
endNumber = 99999 ' this loop is the issue (i think)
Dim i As Long, target As Long
For i = startNumber To endNumber ' this loop is the issue (i think)
Dim url As String
url = baseURL & CStr(i) & "/game-log"
If CheckUrlExists(url) Then
target = i
Debug.Print "the target i is: " & target
End If
Next i
Dim Murl As String
Murl = baseURL & target & "/game-log"
httpRequest.Open "GET", Murl, False
httpRequest.send
htmldoc.body.innerHTML = httpRequest.responseText
这是功能:
Public Function CheckUrlExists(url) As Boolean
On Error GoTo CheckUrlExists_Error
Dim xmlhttp As MSXML2.XMLHTTP60: Set xmlhttp = New MSXML2.XMLHTTP60
Dim htmldoc As HTMLDocument: Set htmldoc = New HTMLDocument
Dim H2el As Object
xmlhttp.Open "GET", url, False
xmlhttp.send
htmldoc.body.innerHTML = xmlhttp.responseText
If xmlhttp.Status = 200 Then
For Each H2el In htmldoc.getElementsByTagName("h2")
If InStr(1, ChangeAccent(H2el.innerText), CStr(cc.Offset(0, -1).Value)) > 0 Then
CheckUrlExists = True
End If
Next H2el
Else
CheckUrlExists = False
End If
Exit Function
CheckUrlExists_Error:
CheckUrlExists = False
End Function
谢谢你
对于每个 POST 请求,远程服务器对 200(正常)或 502(错误网关)状态代码的响应时间各不相同。在我的测试中,通常需要半秒到一秒。使用此方法验证 90,000 个 URL 是不切实际的。
确实,尝试验证大量 URL 可能会给网站服务器带来巨大负担。事实上,它可以被解释为一种小型 DDoS(分布式拒绝服务)攻击。
下面的代码使用
search
模拟 https://www.statmuse.com/ 上的 Internet Explorer
,虽然有点过时,但仍然可以完成工作。 Selenium WebDriver可能是更好的选择。
Sub Demo()
Dim t: t = Timer
Debug.Print GetURL("martin perez")
Debug.Print GetURL("lebron james")
Debug.Print Timer - t
End Sub
Function GetURL(ByVal sFullName As String) As String
Dim IE As Object
Dim doc As Object
' Initialize Internet Explorer
Set IE = CreateObject("InternetExplorer.Application")
IE.Visible = False
' Navigate to the webpage
IE.Navigate "https://www.statmuse.com/ask?q=" & Replace(sFullName, " ", "-")
' Wait for IE to finish loading the page
Do While IE.Busy Or IE.readyState <> 4
DoEvents
Loop
' Get the document object
Set doc = IE.Document
' Get the HTML content of the webpage
'Dim htmlContent As String: htmlContent = doc.DocumentElement.outerHTML
Dim oLink As Object
Set oLink = doc.getElementsByTagName("link")(0)
' Extract the href attribute value from the link tag
If Not oLink Is Nothing Then
GetURL = oLink.getAttribute("href")
Else
GetURL = ""
End If
' Close IE
IE.Quit
Set IE = Nothing
End Function
输出:
https://www.statmuse.com/mlb/player/mart%C3%ADn-p%C3%A9rez-46483
https://www.statmuse.com/nba/player/lebron-james-1780