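How to scrape data from Bayut (DLD verified properties) without getting a 401 error?

I am scraping real estate data from Bayut using Scrapy, but I am unable to extract the green tick (DLD-verified information).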


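The information is fetched through a POST API with Basic Authentication. Hitting the API with the same headers, payload, and parameters returns 401 Unauthorized.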
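One thing the headers, payload, and parameters do not cover is cookies. As a rough diagnostic, the sketch below compares the raw `Cookie` header the browser sent (copied from the Network tab) with the cookie names the script sends. The helper names `cookie_names` and `missing_cookies` are hypothetical, and the cookie values are made up.

```python
def cookie_names(cookie_header):
    """Return the set of cookie names found in a raw Cookie request header."""
    names = set()
    for part in cookie_header.split(";"):
        name, sep, _value = part.strip().partition("=")
        if sep and name:
            names.add(name)
    return names


def missing_cookies(browser_cookie_header, script_cookie_names):
    """List cookie names the browser sent that the script did not."""
    return sorted(cookie_names(browser_cookie_header) - set(script_cookie_names))


# Example with made-up values; paste the real header from the Network tab.
browser_header = "hb-session-id=abc123; locale=en"
print(missing_cookies(browser_header, ["locale"]))  # prints ['hb-session-id']
```

Any name this reports is a candidate for what the server checks before it even looks at the Basic Authentication header.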
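  • Using Selenium works, but it is too slow for large-scale scraping (~210k properties/week).
  • I found the API request in the browser's Network tab and replicated it exactly, but I still get a 401 error. Could the website be using additional security measures (such as session-based authentication or IP restrictions)?

What I tried:
  • Scrapy (unable to fetch the verification information).
  • Postman and Python requests (401 error).
  • Selenium (works, but too slow).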
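Since Selenium works but is slow, one common compromise is to let the browser load a page once and then reuse its cookies from a plain HTTP client. The sketch below copies the list returned by Selenium's `driver.get_cookies()` into a `requests.Session`-style object. `copy_browser_cookies` is a hypothetical helper. Whether Bayut accepts a browser cookie replayed from another client (it may be tied to the user agent or IP address) is an assumption that would need testing.

```python
def copy_browser_cookies(browser_cookies, session):
    """Copy Selenium cookies into a requests-style session.

    browser_cookies: list of dicts, as returned by driver.get_cookies()
    session: object exposing .cookies.set(name, value, domain=..., path=...),
             for example requests.Session()
    Returns the cookie names that were copied, in order.
    """
    copied = []
    for cookie in browser_cookies:
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain", ""),  # empty string, not None, if absent
            path=cookie.get("path", "/"),
        )
        copied.append(cookie["name"])
    return copied


# Usage sketch (needs selenium and requests installed; not executed here):
#   driver.get("https://www.bayut.com/")
#   session = requests.Session()
#   copy_browser_cookies(driver.get_cookies(), session)
```

If this works, the browser is only needed when the cookie expires rather than for every property.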

How can I efficiently access this data? Any insights would be appreciated. Code for hitting the POST API request:

    import requests
    import base64

    # Define the URL
    url = "https://fenix-data-es2.bayut.com/_msearch"

    # Encode credentials manually (decoded: "bayut_read_user_es2:10yNmg5+6K")
    auth_string = "bayut_read_user_es2:10yNmg5+6K"
    auth_encoded = base64.b64encode(auth_string.encode()).decode()  # Convert to Base64

    # Headers with Authorization
    headers = {
        "Authorization": f"Basic {auth_encoded}",
        "accept": "*/*",
        "accept-encoding": "gzip, deflate, br, zstd",
        "accept-language": "en-US,en;q=0.9",
        "cache-control": "no-cache",
        "content-type": "application/x-ndjson",
        "origin": "https://www.bayut.com",
        "pragma": "no-cache",
        "priority": "u=1, i",
        "referer": "https://www.bayut.com/",
        "sec-ch-ua": "\"Not(A:Brand\";v=\"99\", \"Google Chrome\";v=\"133\", \"Chromium\";v=\"133\"",
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": "\"Windows\"",
        "sec-fetch-dest": "empty",
        "sec-fetch-mode": "cors",
        "sec-fetch-site": "same-site",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36"
    }

    # Query parameters (filter_path)
    params = {
        "filter_path": "took,*.took,*.suggest.*.options.text,*.suggest.*.options._source.*,*.hits.total.*,*.hits.hits._source.*,*.hits.hits._score,*.hits.hits.highlight.*,*.error,*.aggregations.*.buckets.key,*.aggregations.*.buckets.doc_count,*.aggregations.*.buckets.complex_value.hits.hits._source,*.aggregations.*.filtered_agg.facet.buckets.key,*.aggregations.*.filtered_agg.facet.buckets.doc_count,*.aggregations.*.filtered_agg.facet.buckets.complex_value.hits.hits._source"
    }

    # POST data (formatted in NDJSON format)
    post_data = """{"index":"dld_matched_property_details_prod_alias"}
    {"from":0,"size":5,"track_total_hits":10000,"query":{"bool":{"must":[{"term":{"external_id":"10228377"}}]}}}
    """

    # Sending the POST request
    response = requests.post(url, headers=headers, params=params, data=post_data)

    # Check if the request was successful
    if response.status_code == 200:
        print("Request Successful!")
        print(response.json())  # Print the response in JSON format
    else:
        print(f"Request failed with status code: {response.status_code}")
        print(response.text)  # Print the error message if any
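The URL above is an Elasticsearch `_msearch` (multi-search) endpoint. Its NDJSON body is a sequence of header-line and query-line pairs, so several lookups can share one request. The sketch below builds such a body for a list of `external_id` values, reusing the index name and query shape from the code above. `build_msearch_body` is a hypothetical helper, and whether this server accepts many searches per request (or caps their number) is an assumption.

```python
import json

INDEX = "dld_matched_property_details_prod_alias"  # index name from the question


def build_msearch_body(external_ids, size=5):
    """Build an NDJSON _msearch body with one search per external_id."""
    lines = []
    for external_id in external_ids:
        # Header line: which index this search targets.
        lines.append(json.dumps({"index": INDEX}))
        # Body line: the same term query as in the question.
        lines.append(json.dumps({
            "from": 0,
            "size": size,
            "track_total_hits": 10000,
            "query": {"bool": {"must": [
                {"term": {"external_id": str(external_id)}}
            ]}},
        }))
    if not lines:
        return ""
    # Every line, including the last, must end with a newline.
    return "\n".join(lines) + "\n"


body = build_msearch_body(["10228377", "10228378"])  # second ID is made up
print(body.count("\n"))  # prints 4
```

If the server allows it, batches of 100 IDs would turn ~210k properties per week into about 2,100 requests per week. The `_msearch` response carries a `responses` array in the same order as the searches in the body.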

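You need the hb-session-id cookie; you can get it from a POST request to /.humbucker/challenge/js/validate, which requires the x-hb-co header and the correct POST data (fingerprints in a specific order). Here is a way to do all of that: [link not preserved]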
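The following is only a rough outline of the request flow described above, not the method the answer linked to. It takes the path, cookie name, and header name from the answer. Everything else is an assumption: the host passed as `base_url`, the JSON encoding and content type of the body, and the values of `challenge_value` and `fingerprint_fields`, which are placeholders that would have to be worked out from the browser's own request.

```python
import json

VALIDATE_PATH = "/.humbucker/challenge/js/validate"  # path from the answer
SESSION_COOKIE = "hb-session-id"                      # cookie name from the answer
CHALLENGE_HEADER = "x-hb-co"                          # header name from the answer


def obtain_session_cookie(session, base_url, challenge_value, fingerprint_fields):
    """POST the challenge and return the hb-session-id value, or None.

    session: object with requests.Session-like .post() and .cookies.get()
    fingerprint_fields: list of (key, value) pairs; kept in the given order,
        because the answer says the order matters. Contents are placeholders.
    """
    # dict preserves insertion order, so the pairs serialize in list order.
    # Sending the fields as compact JSON is an assumption, not a known format.
    body = json.dumps(dict(fingerprint_fields), separators=(",", ":"))
    response = session.post(
        base_url + VALIDATE_PATH,
        headers={
            CHALLENGE_HEADER: challenge_value,
            "content-type": "application/json",  # assumed content type
        },
        data=body,
    )
    response.raise_for_status()
    return session.cookies.get(SESSION_COOKIE)


# Usage sketch (needs requests installed; values are placeholders):
#   session = requests.Session()
#   obtain_session_cookie(session, "https://www.bayut.com", "<x-hb-co value>",
#                         [("<field>", "<value>")])
#   # then send the _msearch POST through the same session
```

Sending the later `_msearch` request through the same session means the cookie is attached automatically, provided its domain covers the API host.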

python web-scraping post postman basic-authentication