Scraping 'N' pages with BeautifulSoup and Requests (how to get the real page number)

Question

I want to get all the titles from this website:

http://www.shyan.gov.cn/zwhd/web/webindex.action

Right now my code only scrapes one page successfully, but the site has many more pages that I could scrape.

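For example, with the URL above, when I click the link to page 2, the URL does not change at all. I looked at the page source and saw the JavaScript that advances to the next page, such as javascript:gotopage(2) or javascript:void(0). Here is my code (it fetches page 1):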

from bs4 import BeautifulSoup  # the class name is BeautifulSoup, with a capital S
import requests

url = 'http://www.shyan.gov.cn/zwhd/web/webindex.action'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
# Each title is a link directly inside a <td class="tit3"> cell
titles = soup.select('td.tit3 > a')
for title in titles:
    print(title.get_text())

How can I change my code to scrape the titles from all of the available listing pages? Thanks a lot!

Answer 1

Try using the following URL format:

http://www.shiyan.gov.cn/zwhd/web/webindex.action?keyWord=&searchType=3&page.currentpage=2&page.pagesize=15&page.pagecount=2357&docStatus=&sendOrg=

The site uses JavaScript to submit hidden page fields to the server when it requests the next page. If you view the page source, you will find:

<form action="/zwhd/web/webindex.action" id="searchForm" name="searchForm" method="post">
 <div class="item">
     <div class="titlel">
      <span>留言查询</span>
     <label class="dow"></label>
     </div>
     <input type="text" name="keyWord" id="keyword" value="" class="text"/>
     <div class="key">
        <ul>
            <li><span><input type="radio" checked="checked" value="3" name="searchType"/></span><p>编号</p></li>
            <li><span><input type="radio" value="2" name="searchType"/></span><p>关键字</p></li>
        </ul>    
     </div>
     <input type="button" class="btn1" onclick="search();" value="查询"/>
  </div>
  <input type="hidden" id="pageIndex" name="page.currentpage" value="2"/>
  <input type="hidden" id="pageSize" name="page.pagesize" value="15"/>
  <input type="hidden" id="pageCount" name="page.pagecount" value="2357"/>
  <input type="hidden" id="docStatus" name="docStatus" value=""/>
  <input type="hidden" id="sendorg" name="sendOrg" value=""/>
  </form>
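
The hidden fields above suggest one way to walk through the pages: send the same field names with a different page.currentpage each time. The following is a minimal sketch under several assumptions, not a tested scraper for this site. The helper names (build_page_params, extract_titles, read_page_count, scrape_titles) are made up for illustration. It uses a GET query string as in the URL above, even though the form declares method="post". It uses Python's built-in 'html.parser' instead of 'lxml' to avoid an extra dependency. It also assumes that the value of the hidden pageCount input is the figure you want as an upper bound, and the source does not confirm whether 2357 counts pages or records.

```python
import time

import requests
from bs4 import BeautifulSoup

# URL taken from the answer above
BASE_URL = 'http://www.shiyan.gov.cn/zwhd/web/webindex.action'


def build_page_params(page, page_size=15, page_count=''):
    """Return the form fields from the page source for a given page number."""
    return {
        'keyWord': '',
        'searchType': '3',
        'page.currentpage': str(page),
        'page.pagesize': str(page_size),
        'page.pagecount': str(page_count),
        'docStatus': '',
        'sendOrg': '',
    }


def extract_titles(html):
    """Collect the link text from every td.tit3 cell, as in the question."""
    soup = BeautifulSoup(html, 'html.parser')
    return [a.get_text(strip=True) for a in soup.select('td.tit3 > a')]


def read_page_count(html):
    """Read the hidden pageCount input; return None if it is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    field = soup.select_one('input#pageCount')
    if field is None or not field.get('value', '').isdigit():
        return None
    return int(field['value'])


def scrape_titles(first_page, last_page, page_count='', delay=1.0):
    """Fetch pages first_page..last_page (inclusive) and return all titles."""
    titles = []
    with requests.Session() as session:
        for page in range(first_page, last_page + 1):
            params = build_page_params(page, page_count=page_count)
            r = session.get(BASE_URL, params=params, timeout=10)
            r.raise_for_status()
            titles.extend(extract_titles(r.content))
            time.sleep(delay)  # pause between requests to be polite
    return titles


if __name__ == '__main__':
    # Only the first three pages here; widen the range once it works.
    for title in scrape_titles(1, 3):
        print(title)
```

To find the upper bound instead of hard-coding it, you could fetch page 1 once, pass its HTML to read_page_count, and use the result as last_page. If the server ignores the query string, sending the same dictionary with session.post(BASE_URL, data=params) would be the next thing to try.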
