I need to scrape a large amount of text (HTML) from mayocliniclabs.org. It has a /test_catalog with thousands of pages about biomarkers that I need to save in text format. I work in Python, but any language is fine, I just need a solution. Every time I use a Python library I fail to scrape the data: a plain request returns 403, and when I emulate a full browser environment it is very slow and the data is incomplete, it doesn't even contain the text on the page. What can you recommend? My English is not as perfect as my Python. Thank you.

What I have tried: requests, Selenium, undetected_chromedriver, emulating a full browser environment, ChatGPT.
Your target application is protected by the Akamai WAF (Web Application Firewall), which is why your requests are being detected as bot traffic. To get around this, you need to add the site's Cookie and User-Agent headers when requesting the target application. Here is how to get the site's cookie: right click --> Inspect Element --> Network tab --> reload the page --> click one of the requests to your target application --> copy the Cookie header value from the Request Headers.

Sample code with my User-Agent and Cookie:
import requests
from bs4 import BeautifulSoup

url = 'https://www.mayocliniclabs.com/test-catalog/alphabetical/A'
header = {
    "Cookie": "ApplicationGatewayAffinityCORS=3336e934db99b7286cd2fd881c2d3856; ApplicationGatewayAffinity=3336e934db99b7286cd2fd881c2d3856; shell#lang=en; SC_ANALYTICS_GLOBAL_COOKIE=f5634522708943e8ba0d8d6777908be4|False; ASP.NET_SessionId=qqybddyfc5xwjsxvnwjtftjx; ARRAffinity=4ea7283ea26b4872bb798047b927162eda6c90118196e9ab93bf45f865d495db; ARRAffinitySameSite=4ea7283ea26b4872bb798047b927162eda6c90118196e9ab93bf45f865d495db; utag_main__sn=1; utag_main__se=19%3Bexp-session; utag_main__ss=0%3Bexp-session; utag_main__st=1730403207583%3Bexp-session; utag_main_ses_id=1730401387381%3Bexp-session; utag_main__pn=2%3Bexp-session; utag_main_v_id=0192e3f5c3e10089681009ec6cb80504e001f01100bd0; utag_main_dc_visit=1; utag_main_dc_event=19%3Bexp-session; utag_main_dc_region=us-west-2%3Bexp-session; nmstat=12eb346c-2df8-0acb-1887-6985c203abd3; _vwo_uuid_v2=D48FCD5513CCE38754C0AF3F1AA74689B|fc4354467141506ada63b1963851c5fc; _vwo_ssm=1; _vis_opt_s=1%7C; _vis_opt_test_cookie=1; dmd-tag=c51c6b60-97ba-11ef-863f-430e6747046c; dmd-sid4={%22id%22:%22c526f2b0-97ba-11ef-8a57-39b2d1556f20%22%2C%22timestamp%22:1730401389000%2C%22lastUpdate%22:1730401389000}; _vwo_uuid=D48FCD5513CCE38754C0AF3F1AA74689B; _vwo_ds=3%241730401389%3A32.04078586%3A%3A; _vwo_sn=0%3A2; da_sid=2D3ECDAE8B4BAE89BA20AA13A6B58DC4DD.1|3|0|3; da_lid=1E0DFE9D9F0BEA122F71BB99E4B7C7CF6E|0|0|0; da_intState=; mdLogger=false; kampyle_userid=3012-cce1-45b9-6d42-aa06-c44b-bac8-5787; kampyleUserSession=1730401392568; kampyleSessionPageCounter=1; kampyleUserSessionsCount=1; kampyleUserPercentile=41.552186135797456; AKA_A2=A",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0"
}

# Fetch the catalog page with browser-like headers, then parse out the
# per-test links from the listing.
resp = requests.get(url, headers=header).text
soup = BeautifulSoup(resp, 'lxml')
links = soup.find_all('div', class_='rochester-item-data')
for i in links:
    print(f"https://www.mayocliniclabs.com{i.find('a')['href']}")

Output:
https://www.mayocliniclabs.com/test-catalog/overview/113437
https://www.mayocliniclabs.com/test-catalog/overview/620307
https://www.mayocliniclabs.com/test-catalog/overview/63686
https://www.mayocliniclabs.com/test-catalog/overview/113498
https://www.mayocliniclabs.com/test-catalog/overview/113490
https://www.mayocliniclabs.com/test-catalog/overview/82757
https://www.mayocliniclabs.com/test-catalog/overview/64717
https://www.mayocliniclabs.com/test-catalog/overview/82850
https://www.mayocliniclabs.com/test-catalog/overview/57707
https://www.mayocliniclabs.com/test-catalog/overview/37030
https://www.mayocliniclabs.com/test-catalog/overview/75388
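Since you mentioned you need to save thousands of pages as text, here is a minimal sketch of the follow-up step: fetch each collected overview URL with the same headers and dump the visible text to a .txt file. The `HEADERS` placeholder, output directory name, and delay value are my assumptions, not from the answer above; paste your own fresh cookie, since the Akamai cookies expire.

```python
import time
from pathlib import Path

import requests
from bs4 import BeautifulSoup

# Assumption: replace with a fresh Cookie value from your own browser
# session; the one shown in the answer above will expire.
HEADERS = {
    "Cookie": "PASTE_YOUR_COOKIE_HERE",
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0",
}

def page_text(html: str) -> str:
    """Return just the visible text of a page, one line per element."""
    soup = BeautifulSoup(html, "lxml")
    return soup.get_text(separator="\n", strip=True)

def save_pages(urls, out_dir="mayo_pages", delay=2.0):
    """Fetch each test-catalog URL and save its text as <test-id>.txt."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for url in urls:
        resp = requests.get(url, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        test_id = url.rstrip("/").split("/")[-1]
        (out / f"{test_id}.txt").write_text(page_text(resp.text))
        time.sleep(delay)  # be polite; rapid-fire requests trip the WAF again

# Example call (uncomment once HEADERS holds a valid cookie):
# save_pages(["https://www.mayocliniclabs.com/test-catalog/overview/113437"])
```

Instead of printing the links in the first script, collect them into a list and pass that list to `save_pages`.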
Let me know whether this solves your 403 error.