Replicating the same POST request (as inspected in DeveloperTools) fails to fetch the data


I am trying to get the table data from this page without using BeautifulSoup to parse the generated HTML: https://finance.vietstock.vn/ket-qua-giao-dich/vietnam.aspx?tab=thong-ke-gia&exchange=1&code=-19

Looking in DeveloperTools, I found the script KQGDThongKeGiaStockPaging, which is apparently a POST request to the URL https://finance.vietstock.vn/data/KQGDThongKeGiaStockPaging (if I just copy and paste that URL into the browser, nothing is shown).

I copied the request headers and payload and ran the POST request, but I always get a response whose text is the content of the URL https://finance.vietstock.vn/data/KQGDThongKeGiaStockPaging, which basically says:

 there's nothing here

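I have tried ChatGPT and searched Stack Overflow questions. I wonder whether this fails because the data is loaded dynamically (I am new to this, so I don't fully understand the term) and I may need a dynamic verification token. So I tried, within one session, requesting the URL to get the token, copying it into the payload, and running the request again. It still doesn't work: I always get the content "there's nothing here".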

I know I could use BeautifulSoup, or even selenium, to read the table from the HTML, but I would like to keep it simple with GET/POST requests.

import requests
from bs4 import BeautifulSoup

# Function to extract __RequestVerificationToken from the page
def extract_verification_token(html):
    soup = BeautifulSoup(html, 'html.parser')
    token_input = soup.find('input', {'name': '__RequestVerificationToken'})
    if token_input:
        return token_input['value']
    return None

# URL and initial payload
url = "https://finance.vietstock.vn/data/KQGDThongKeGiaStockPaging"
initial_payload = {
    "page": 1,
    "pageSize": 20,
    "catID": 1,
    "stockID": -19,
    "fromDate": "2023-12-07",
    "toDate": "2024-01-07"
}

# Add headers to simulate a legitimate browser request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Referer": "https://finance.vietstock.vn/ket-qua-giao-dich/vietnam.aspx?tab=thong-ke-gia&exchange=1&code=-19",
}

# Make the initial GET request with headers
response = requests.get("https://finance.vietstock.vn/ket-qua-giao-dich/vietnam.aspx?tab=thong-ke-gia&exchange=1&code=-19", headers=headers)
if response.status_code == 200:
    verification_token = extract_verification_token(response.text)
    if verification_token:
        # Update the payload with the verification token
        initial_payload['__RequestVerificationToken'] = verification_token

        # Make the actual POST request with the updated payload and headers
        response = requests.post(url, data=initial_payload, headers=headers)

        if response.status_code == 200:
            data = response.json()
            # Now 'data' contains the tabular data, and you can process it as needed
            print(data)
        else:
            print(f"Error: {response.status_code}")
            print(response.text)
    else:
        print("Verification token not found.")
else:
    print(f"Error fetching the page: {response.status_code}")
python web-scraping post python-requests
1 Answer

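To get the data from the site, you need the cookies set by the server plus the verification token from the HTML. Using a session keeps the cookies from the GET request and sends them with the POST: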

import requests
from bs4 import BeautifulSoup

url = "https://finance.vietstock.vn/ket-qua-giao-dich/vietnam.aspx?tab=thong-ke-gia&exchange=1&code=-19"
api_url = "https://finance.vietstock.vn/data/KQGDThongKeGiaStockPaging"

with requests.Session() as s:
    s.headers.update(
        {
            "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0"
        }
    )

    soup = BeautifulSoup(s.get(url).content, "html.parser")
    token = soup.select_one('input[name="__RequestVerificationToken"]')["value"]

    payload = {
        "page": "1",
        "pageSize": "20",
        "catID": "1",
        "stockID": "-19",
        "fromDate": "2023-12-07",
        "toDate": "2024-01-07",
        "__RequestVerificationToken": token,
    }

    data = s.post(api_url, data=payload).json()
    print(data)

Prints:

[[{'CloseIndex': 1154.68, 'PriorIndex': 1150.72, 'Change': 3.96, 

...
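
The code in the question calls requests.get and requests.post as separate module-level functions, so the cookies returned by the GET are not sent with the POST; that is likely why the token alone was not enough. The following is a minimal sketch of the same flow wrapped in a reusable function, assuming the payload fields shown above. The names extract_token and fetch_page are hypothetical helpers (not part of requests or of the site), and the token is parsed with the standard-library HTMLParser only to keep the sketch dependency-free. The session argument is assumed to behave like requests.Session.

```python
from html.parser import HTMLParser

PAGE_URL = (
    "https://finance.vietstock.vn/ket-qua-giao-dich/vietnam.aspx"
    "?tab=thong-ke-gia&exchange=1&code=-19"
)
API_URL = "https://finance.vietstock.vn/data/KQGDThongKeGiaStockPaging"


class _TokenParser(HTMLParser):
    """Remembers the value of the first __RequestVerificationToken input."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (
            self.token is None
            and tag == "input"
            and attrs.get("name") == "__RequestVerificationToken"
        ):
            self.token = attrs.get("value")


def extract_token(html):
    """Return the hidden verification token, or None if the page has none."""
    parser = _TokenParser()
    parser.feed(html)
    return parser.token


def fetch_page(session, page=1, page_size=20,
               from_date="2023-12-07", to_date="2024-01-07"):
    """GET the page and POST to the API through the same session.

    Hypothetical helper: `session` should behave like requests.Session,
    so cookies set by the GET are sent again with the POST.
    """
    html = session.get(PAGE_URL).text
    token = extract_token(html)
    if token is None:
        # Fail with a clear message instead of a TypeError on None["value"].
        raise RuntimeError("Verification token not found in the page HTML")

    payload = {
        "page": str(page),
        "pageSize": str(page_size),
        "catID": "1",
        "stockID": "-19",
        "fromDate": from_date,
        "toDate": to_date,
        "__RequestVerificationToken": token,
    }
    response = session.post(API_URL, data=payload)
    response.raise_for_status()
    return response.json()
```

With requests installed, this would be called with a requests.Session() whose headers include a browser-like User-Agent, as in the answer above, for example fetch_page(s, page=2). Whether the site accepts other page or date values is not confirmed here.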