I am trying to scrape the link that belongs to a button on this website. (The end goal is to enrich the data for a RAG chatbot.)
https://onlinehelp.prinect-lounge.com/Prinect_Color_Toolbox/Version2021/de_10/#t=Prinect%2Fmeasuring%2Fmeasuring-4.htm
The previous/next buttons are in the top-right corner. The link that has to be extracted from the example subpage above is this one:
href="https://onlinehelp.prinect-lounge.com/Prinect_Color_Toolbox/Version2021/de_10/Prinect/measuring/measuring-3.htm"
I tried the standard approach with BeautifulSoup:
from bs4 import BeautifulSoup
import requests

url = "https://onlinehelp.prinect-lounge.com/Prinect_Color_Toolbox/Version2021/de_10/#t=Prinect%2Fmeasuring%2Fmeasuring-4.htm"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")

# Test 1: get the full HTML element of the "previous" button
test1 = soup.find(id="browseSeqBack")
print(test1)

# Test 2: get the children of the wrapper div
# (list() so the contents are printed rather than an iterator object)
prev_div = soup.find("div", class_="brs_previous")
test2 = list(prev_div.children) if prev_div else None
print(test2)

# Test 3: read the link directly from the button's href attribute
secBackButton = soup.find(id="browseSeqBack")
href = secBackButton.get("href") if secBackButton else None
print(href)
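However, neither test 1 nor test 2 returns the full HTML section, and querying the link directly (test 3) does not return it either. Test 1 returns this section: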
<a class="wBSBackButton" data-attr="href:.l.brsBack" data-css="visibility: @.l.brsBack?'visible':'hidden'" data-rhwidget="Basic" id="browseSeqBack">
<span aria-hidden="true" class="rh-hide" data-html="@KEY_LNG.Prev"></span>
Thanks in advance :)
I get
403 ERROR: The request could not be satisfied.
Do you get the same?
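Separately from the 403, one thing that may be worth checking: everything after the # in a URL is a fragment, and the fragment is not sent to the server. So requests.get() on the URL from the question only ever asks for the frame page at .../de_10/, not the topic itself. The data-attr="href:.l.brsBack" on the button in the test 1 output suggests the href is filled in by JavaScript in the browser, which would explain why BeautifulSoup sees no href. Below is a minimal sketch that rebuilds the direct topic URL from the #t=... fragment. The function name topic_url_from_fragment is made up for illustration, and it is only an assumption that the topic page at that path is served as static HTML containing the previous/next links; the request may also still be answered with 403.

```python
from urllib.parse import parse_qs, urldefrag, urljoin


def topic_url_from_fragment(url: str) -> str:
    """Turn '<base>/#t=<topic path>' into '<base>/<topic path>'.

    Hypothetical helper: it assumes the fragment has the form
    't=<percent-encoded relative path>', as in the URL from the question.
    """
    # Split off the part after '#'; it is never sent in the HTTP request.
    base, fragment = urldefrag(url)

    # The fragment looks like a query string, so parse_qs can read it.
    # parse_qs also percent-decodes the value (%2F becomes /).
    params = parse_qs(fragment)
    topic = params.get("t", [None])[0]
    if topic is None:
        raise ValueError("no 't=' entry in the URL fragment")

    # Resolve the relative topic path against the frame page URL.
    return urljoin(base, topic)


if __name__ == "__main__":
    url = (
        "https://onlinehelp.prinect-lounge.com/Prinect_Color_Toolbox/"
        "Version2021/de_10/#t=Prinect%2Fmeasuring%2Fmeasuring-4.htm"
    )
    print(topic_url_from_fragment(url))
```

For the URL from the question this yields https://onlinehelp.prinect-lounge.com/Prinect_Color_Toolbox/Version2021/de_10/Prinect/measuring/measuring-4.htm, which has the same shape as the measuring-3.htm link the question wants to extract. Requesting that URL instead of the #t=... one and searching its HTML for the previous/next link would be the next thing to try. If the links are still missing there, the page probably needs a real browser (for example Selenium or Playwright) to run the JavaScript.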