无法使用请求从下一页抓取名称

问题描述 投票:1回答:1

我正在尝试使用python脚本解析遍历网页的名称。通过当前的尝试,我可以从其登录页面获取名称。但是,我找不到使用请求和BeautifulSoup从下一页获取名称的任何想法。

website link

到目前为止,我的尝试:

import requests
from bs4 import BeautifulSoup

url = "https://proximity.niceic.com/mainform.aspx?PostCode=YO95"

with requests.Session() as s:
    r = s.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    for elem in soup.select("table#gvContractors tr:has([id*='_lblName'])"):
        name = elem.select_one("span[id*='_lblName']").get_text(strip=True)
        print(name)

我已尝试修改脚本以仅从第二页获取内容,以确保当涉及到下一页按钮时它可以正常工作,但是很遗憾,它仍然从第一页获取数据:

import requests
from bs4 import BeautifulSoup

url = "https://proximity.niceic.com/mainform.aspx?PostCode=YO95"

with requests.Session() as s:
    r = s.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['__EVENTARGUMENT'] = 'Page$Next'
    payload.pop('btnClose')
    payload.pop('btnMapClose')
    res = s.post(url,data=payload,headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36',
        'X-Requested-With':'XMLHttpRequest',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Referer': 'https://proximity.niceic.com/mainform.aspx?PostCode=YO95',
        })
    sauce = BeautifulSoup(res.text,"lxml")
    for elem in sauce.select("table#gvContractors tr:has([id*='_lblName'])"):
        name = elem.select_one("span[id*='_lblName']").get_text(strip=True)
        print(name)
python python-3.x web-scraping beautifulsoup http-post
1个回答
1
投票

导航到下一页是通过使用__VIEWSTATE光标的POST请求执行的。

如何处理请求:

  1. 向第一页发送GET请求;

  2. 解析所需的数据和__VIEWSTATE游标;

  3. 使用接收的光标为下一页准备POST请求;

  4. 运行它,解析所有数据和新光标以进入下一页。

我将不提供任何代码,因为它需要写下几乎所有搜寻器的代码。

====已添加====

您几乎完成了,但是您错过了两个重要的事情。

  1. 必须通过第一个GET请求发送标头。如果没有发送标头,则令牌会破损(很容易从视觉上进行检测-末尾还没有==)

  2. 我们需要在发送的有效载荷中添加__ ASYNCPOST。 (这很有趣:它不是布尔值True,而是字符串'true')

这里是代码。我删除了bs4并添加了lxml(我不喜欢bs4,它非常慢)。我们确切地知道我们需要发送哪些数据,因此让我们仅分析少量输入。

import re
import requests
from lxml import etree


def get_nextpage_tokens(response_body):
    """ Parse tokens from XMLHttpRequest response for making next request to next page and create payload """
    try:
        payload = dict()
        payload['ToolkitScriptManager1'] = 'UpdatePanel1|gvContractors'
        payload['__EVENTTARGET'] = 'gvContractors'
        payload['__EVENTARGUMENT'] = 'Page$Next'
        payload['__VIEWSTATEENCRYPTED'] = ''
        payload['__VIEWSTATE'] = re.search(r'__VIEWSTATE\|([^\|]+)', response_body).group(1)
        payload['__VIEWSTATEGENERATOR'] = re.search(r'__VIEWSTATEGENERATOR\|([^\|]+)', response_body).group(1)
        payload['__EVENTVALIDATION'] = re.search(r'__EVENTVALIDATION\|([^\|]+)', response_body).group(1)
        payload['__ASYNCPOST'] = 'true'
        return payload
    except:
        return None


if __name__ == '__main__':
    url = "https://proximity.niceic.com/mainform.aspx?PostCode=YO95"

    headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'Referer': 'https://proximity.niceic.com/mainform.aspx?PostCode=YO95',
            }

    with requests.Session() as s:
        page_num = 1
        r = s.get(url, headers=headers)
        parser = etree.HTMLParser()
        tree = etree.fromstring(r.text, parser)

        # Creating payload
        payload = dict()
        payload['ToolkitScriptManager1'] = 'UpdatePanel1|gvContractors'
        payload['__EVENTTARGET'] = 'gvContractors'
        payload['__EVENTARGUMENT'] = 'Page$Next'
        payload['__VIEWSTATE'] = tree.xpath("//input[@name='__VIEWSTATE']/@value")[0]
        payload['__VIEWSTATEENCRYPTED'] = ''
        payload['__VIEWSTATEGENERATOR'] = tree.xpath("//input[@name='__VIEWSTATEGENERATOR']/@value")[0]
        payload['__EVENTVALIDATION'] = tree.xpath("//input[@name='__EVENTVALIDATION']/@value")[0]
        payload['__ASYNCPOST'] = 'true'
        headers['X-Requested-With'] = 'XMLHttpRequest'

        while True:
            page_num += 1
            res = s.post(url, data=payload, headers=headers)

            print(f'page {page_num} data: {res.text}')  # FIXME: Parse data

            payload = get_nextpage_tokens(res.text)  # Creating payload for next page
            if not payload:
                # Break if we got no tokens - maybe it was last page (it mush be checked)
                break

重要

响应格式不正确的HTML。因此,您必须处理它:切割桌子或其他东西。祝你好运!

© www.soinside.com 2019 - 2024. All rights reserved.