I'm trying to get the table data from this page without using BeautifulSoup to parse the rendered HTML:
https://finance.vietstock.vn/ket-qua-giao-dich/vietnam.aspx?tab=thong-ke-gia&exchange=1&code=-19
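Looking in the browser's developer tools, I found the script KQGDThongKeGiaStockPaging, which is apparently a POST request to the URL https://finance.vietstock.vn/data/KQGDThongKeGiaStockPaging (pasting that URL straight into the browser shows nothing).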
I copied the request headers and payload and ran the POST request, but I always get a response whose text is the content of the URL https://finance.vietstock.vn/data/KQGDThongKeGiaStockPaging, which basically says: there's nothing here.
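I've tried ChatGPT and searched Stack Overflow questions. I wonder whether this fails because the data is loaded dynamically (I'm new to this, so I don't fully understand the term) and I may need a dynamic verification token. So I tried, within one session, making a request to the page URL to get the token, copying it into the payload, and running the POST again. It still doesn't work (I always get the content there's nothing here).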
I know I could use BeautifulSoup to read the table from the HTML, or even Selenium, but I'd like to keep it simple with GET/POST requests.
import requests
from bs4 import BeautifulSoup

# Function to extract __RequestVerificationToken from the page
def extract_verification_token(html):
    soup = BeautifulSoup(html, 'html.parser')
    token_input = soup.find('input', {'name': '__RequestVerificationToken'})
    if token_input:
        return token_input['value']
    return None

# URL and initial payload
url = "https://finance.vietstock.vn/data/KQGDThongKeGiaStockPaging"
initial_payload = {
    "page": 1,
    "pageSize": 20,
    "catID": 1,
    "stockID": -19,
    "fromDate": "2023-12-07",
    "toDate": "2024-01-07"
}

# Add headers to simulate a legitimate browser request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Referer": "https://finance.vietstock.vn/ket-qua-giao-dich/vietnam.aspx?tab=thong-ke-gia&exchange=1&code=-19",
}

# Make the initial GET request with headers
response = requests.get("https://finance.vietstock.vn/ket-qua-giao-dich/vietnam.aspx?tab=thong-ke-gia&exchange=1&code=-19", headers=headers)
if response.status_code == 200:
    verification_token = extract_verification_token(response.text)
    if verification_token:
        # Update the payload with the verification token
        initial_payload['__RequestVerificationToken'] = verification_token
        # Make the actual POST request with the updated payload and headers
        response = requests.post(url, data=initial_payload, headers=headers)
        if response.status_code == 200:
            data = response.json()
            # Now 'data' contains the tabular data, and you can process it as needed
            print(data)
        else:
            print(f"Error: {response.status_code}")
            print(response.text)
    else:
        print("Verification token not found.")
else:
    print(f"Error fetching the page: {response.status_code}")
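To get the data from the site, you need the cookies from the server plus the verification token from the HTML. Your code sends the GET and the POST as two separate requests calls, so the cookies set by the first response never reach the second. Doing both through one session fixes that: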
import requests
from bs4 import BeautifulSoup

url = "https://finance.vietstock.vn/ket-qua-giao-dich/vietnam.aspx?tab=thong-ke-gia&exchange=1&code=-19"
api_url = "https://finance.vietstock.vn/data/KQGDThongKeGiaStockPaging"

# The session keeps the cookies from the GET and sends them with the POST
with requests.Session() as s:
    s.headers.update(
        {
            "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0"
        }
    )

    # Load the page once to receive the cookies and read the hidden token
    soup = BeautifulSoup(s.get(url).content, "html.parser")
    token = soup.select_one('input[name="__RequestVerificationToken"]')["value"]

    payload = {
        "page": "1",
        "pageSize": "20",
        "catID": "1",
        "stockID": "-19",
        "fromDate": "2023-12-07",
        "toDate": "2024-01-07",
        "__RequestVerificationToken": token,
    }

    data = s.post(api_url, data=payload).json()
    print(data)
Prints:
[[{'CloseIndex': 1154.68, 'PriorIndex': 1150.72, 'Change': 3.96,
...
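If you would rather not depend on BeautifulSoup at all, the token can probably be read with the standard library alone. The following is only a minimal sketch, assuming the token still sits in a hidden input named __RequestVerificationToken as in the code above. The names extract_token and fetch_table are hypothetical helpers made up for this example, and the session argument is assumed to behave like requests.Session (get and post methods, cookies kept between calls).

```python
from html.parser import HTMLParser
from typing import Optional


class _TokenParser(HTMLParser):
    """Remembers the value of the first __RequestVerificationToken input."""

    def __init__(self) -> None:
        super().__init__()
        self.token: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Skip everything once a token has been found
        if tag != "input" or self.token is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == "__RequestVerificationToken":
            self.token = attributes.get("value")


def extract_token(html: str) -> Optional[str]:
    """Return the hidden token value, or None if the page has no such input."""
    parser = _TokenParser()
    parser.feed(html)
    return parser.token


def fetch_table(session, page_url: str, api_url: str, payload: dict):
    """GET the page, then POST to the API through the same session.

    Assumption: `session` works like requests.Session, so the cookies
    received by the GET are sent automatically with the POST.
    """
    token = extract_token(session.get(page_url).text)
    if token is None:
        raise ValueError("verification token not found in the page HTML")
    response = session.post(
        api_url,
        data={**payload, "__RequestVerificationToken": token},
    )
    response.raise_for_status()
    return response.json()
```

With requests installed, this would be called as fetch_table(s, url, api_url, payload) inside the same with requests.Session() as s: block, after setting the User-Agent header and leaving the token out of payload. The helper only takes the session as an argument so that the cookie handling stays in one place.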