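I'm trying to scrape this website.
If you enter an organization ID in the search bar and press "Искать", it redirects you to a separate page whose base URL is https://pb.nalog.ru/search.html, followed by the hash "#t=*&mode=search-all&queryAll=ID", where t is the current time in milliseconds (as returned by Date.getTime()).
If I take the URL generated by the macro and paste it into the browser manually, it returns the correct page, but whenever I try to do this programmatically, it returns a 404 "page not found" stub page, and the URL it reports differs from mine: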
https://pb.nalog.ru/search.html#t=1730356470622&mode=search-all&queryAll=9714055795
/search.html%23t=1730356470622&mode=search-all&queryAll=9714055795
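I assume %23 is the encoded form of #, but I'm new to this and can't say for sure. I'll do my best to answer any follow-up questions.
Here is the code in question: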
Option Explicit

Private Type SYSTEMTIME
    wYear As Integer
    wMonth As Integer
    wDayOfWeek As Integer
    wDay As Integer
    wHour As Integer
    wMinute As Integer
    wSecond As Integer
    wMilliseconds As Integer
End Type

Private Declare PtrSafe Sub GetSystemTime Lib "kernel32" (lpSystemTime As SYSTEMTIME)

Function CurrentTimeMillis() As Double
    ' Returns the milliseconds from 1970/01/01 00:00:00.0 to system UTC
    Dim st As SYSTEMTIME
    GetSystemTime st
    Dim t_Start As Date, t_Now As Date
    t_Start = DateSerial(1970, 1, 1) ' Unix epoch
    t_Now = DateSerial(st.wYear, st.wMonth, st.wDay) + _
            TimeSerial(st.wHour, st.wMinute, st.wSecond)
    ' Convert to Double before multiplying so the result cannot overflow a Long
    CurrentTimeMillis = CDbl(DateDiff("s", t_Start, t_Now)) * 1000# + st.wMilliseconds
End Function

Public Sub oopsie_doopsie()
    ' Early binding: needs references to Microsoft XML, v6.0 and Microsoft HTML Object Library
    Dim http As New XMLHTTP60
    Dim html As New HTMLDocument
    Dim curr As Double
    curr = CurrentTimeMillis(): Debug.Print curr
    With http
        .Open "GET", "https://pb.nalog.ru/search.html#t=" & curr & "&mode=search-all&queryAll=" & "9714055795" & "", False
        Debug.Print "https://pb.nalog.ru/search.html#t=" & curr & "&mode=search-all&queryAll=" & "9714055795" & ""
        DoEvents
        .send
        DoEvents
        html.body.innerHTML = .responseText
    End With
    html.getElementsByClassName ("pb-subject-status pb-subject-status--active")
End Sub
I think you need to use XMLHTTP60.setRequestHeader to set headers that the server will accept:
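With http
    .Open "GET", "https://pb.nalog.ru/search.html#t=" & curr & "&mode=search-all&queryAll=" & "9714055795" & "", False
    ' setRequestHeader is a method taking a name and a value; call it after .Open and before .send
    .setRequestHeader "Header-Name", "header value here"
    ...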
Use a tool such as Postman to study the headers that a browser sends.
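It may also help to look at the URL itself. %23 is the percent-encoded form of #, and the part of a URL after # (the fragment) is normally handled by the browser and not sent to the server. Below is a minimal sketch, assuming the 404 comes from the fragment being encoded into the request path. StripFragment and FetchWithoutFragment are hypothetical helper names, and the User-Agent value is only a placeholder, not something the site is known to require.

```vba
Option Explicit

' Returns the part of a URL that a browser would actually request,
' i.e. everything before the first "#".
Public Function StripFragment(ByVal url As String) As String
    Dim pos As Long
    pos = InStr(1, url, "#")
    If pos > 0 Then
        StripFragment = Left$(url, pos - 1)
    Else
        StripFragment = url
    End If
End Function

Public Sub FetchWithoutFragment()
    Const FULL_URL As String = _
        "https://pb.nalog.ru/search.html#t=1730356470622&mode=search-all&queryAll=9714055795"

    ' Late binding, so no library reference is needed
    Dim http As Object
    Set http = CreateObject("MSXML2.XMLHTTP.6.0")

    http.Open "GET", StripFragment(FULL_URL), False
    ' Headers can only be set after Open and before send.
    ' Placeholder value (assumption); copy real values from your browser.
    http.setRequestHeader "User-Agent", "Mozilla/5.0"
    http.send

    ' Print the HTTP status and the size of the returned document
    Debug.Print http.Status, Len(http.responseText)
End Sub
```

If this returns the search page without any results in it, the results are probably loaded by JavaScript through a separate request. The Network tab of the browser's developer tools should show that request, and that is the one to reproduce from VBA.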