403 error on some pages when extracting data from certain URLs

Problem description

Hello, can you help me? When I try to extract a JSON file from a web page, it works for some URLs on the same site, but for other URLs I get a 403 error. The URLs are:

Works: https://www.falabella.com/falabella-cl/category/cat16510006/Electrohogar?facetSelected=true&f.derived.variant.sellerId=FALABELLA%3A%3ASODIMAC%3A%3ATOTTUS&page=1

403 error: https://www.falabella.com/falabella-cl/category/cat7330051/Mujer?facetSelected=true&f.derived.variant.sellerId=FALABELLA&page=1

My sample code:

import requests
import json
from bs4 import BeautifulSoup


session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
})


def extract_json_from_falabella(url):
    try:
        response = session.get(url)
        response.raise_for_status()  # Raises HTTPError for 4xx/5xx responses

        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')

            # Next.js embeds the page data as JSON in this script tag
            script_tag = soup.find('script', id='__NEXT_DATA__')

            if script_tag:
                json_text = script_tag.string.strip()
                data = json.loads(json_text)
                return data
            else:
                print("Script with id='__NEXT_DATA__' not found.")
                return None
        else:
            print(f"Request failed: {response.status_code}")
            return None

    except requests.exceptions.HTTPError as http_err:
        print(f"HTTP error: {http_err}")
        return None
    except Exception as err:
        print(f"An error occurred: {err}")
        return None


url = "https://www.falabella.com/falabella-cl/category/cat7330051/Mujer?facetSelected=true&f.derived.variant.sellerId=FALABELLA%3A%3ASODIMAC&page=1"
data = extract_json_from_falabella(url)

if data:
    with open('falabella_data.json', 'w', encoding='utf-8') as json_file:
        json.dump(data, json_file, ensure_ascii=False, indent=4)
    print("Data saved to 'falabella_data.json'")
else:
    print("Could not extract the JSON data.")

Can you see what the problem is?

python json beautifulsoup http-status-code-403
1 Answer

This is Cloudflare protection. I don't know why it is applied to some paths and not to others, but it is a passive protection that uses TLS/JA3/HTTP2 fingerprinting to block bots and scrapers.
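One quick way to check whether a 403 comes from Cloudflare rather than from the site's own backend is to look at the response headers. The sketch below is only a heuristic: `looks_like_cloudflare_block` is a hypothetical helper name, and it assumes the blocked response carries Cloudflare's usual `Server: cloudflare` or `CF-RAY` headers, which may not hold for every configuration.

```python
def looks_like_cloudflare_block(status_code, headers):
    """Heuristic: True if a 403 response appears to have been served by Cloudflare."""
    # HTTP header names are case-insensitive, so normalise them before comparing.
    lowered = {str(name).lower(): str(value).lower() for name, value in headers.items()}
    served_by_cloudflare = (
        "cloudflare" in lowered.get("server", "") or "cf-ray" in lowered
    )
    return status_code == 403 and served_by_cloudflare
```

With the code from the question, it could be called as `looks_like_cloudflare_block(response.status_code, response.headers)` before `raise_for_status()`.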

Fortunately, in this case it can be bypassed by impersonating a browser's fingerprint with curl_cffi, which has a requests-like API.
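If you still want to read `__NEXT_DATA__` from the HTML as in the question, a minimal sketch of the same idea with curl_cffi might look like this. The function names are hypothetical, and a regular expression stands in for BeautifulSoup only to keep the example standalone; whether the page is served without a 403 depends on the site's current protection.

```python
import json
import re

# Matches the <script id="__NEXT_DATA__" ...>...</script> element embedded by Next.js.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid=["\']__NEXT_DATA__["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def parse_next_data(html):
    """Return the parsed __NEXT_DATA__ JSON from an HTML string, or None if absent."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


def fetch_next_data(url):
    """Fetch a page while impersonating Chrome and extract its __NEXT_DATA__ payload."""
    # Imported lazily so parse_next_data works even without curl_cffi installed.
    from curl_cffi import requests

    response = requests.get(url, impersonate="chrome")
    response.raise_for_status()
    return parse_next_data(response.text)
```

Compared with the question's code, the only substantive change is the `impersonate="chrome"` argument; the parsing step is the same.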

Since this site uses an API, we can retrieve the data directly as JSON instead of extracting it from the HTML.

The code below retrieves the results for this page:

https://www.falabella.com/falabella-cl/category/cat7330051/Mujer?facetSelected=true&f.derived.variant.sellerId=FALABELLA&page=1

from curl_cffi import requests

def get_pid():
    url = 'https://www.falabella.com/s/geo/v2/districts/cl?politicalId=default'
    response = requests.get(url)
    data = response.json().get('data', {})
    return data.get('politicalId')


api_url = "https://www.falabella.com/s/browse/v1/listing/cl"

# pid does not seem to change/expire so you can replace it with string value
pid = get_pid()

params = {
    'f.derived.variant.sellerId': 'FALABELLA::SODIMAC::TOTTUS',
    'facetSelected': True,
    'page': 1,
    'categoryId': 'cat7330051',
    'categoryName': 'Mujer',
    'pid': pid,
}

response = requests.get(api_url, params=params, impersonate='chrome')
data = response.json()['data']

pagination = data['pagination']
results = data['results']

print(f'{len(results) = }')
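The snippet above fetches a single page. To walk through several pages, one option is to keep increasing `page` until a request returns no results. This is a sketch under assumptions: `iter_results` and `fetch_page` are hypothetical names, it assumes the API returns an empty `results` list past the last page (not verified here), and it does not rely on the structure of the `pagination` object because that is not shown above.

```python
def iter_results(fetch_page, max_pages=50):
    """Yield results page by page until a page comes back empty.

    fetch_page is a hypothetical callable: page number -> list of results.
    max_pages is a safety cap so the loop always terminates.
    """
    for page in range(1, max_pages + 1):
        results = fetch_page(page)
        if not results:
            break
        yield from results
```

Here `fetch_page` could call `requests.get(api_url, params={**params, 'page': page}, impersonate='chrome')` and return `response.json()['data']['results']`. Adding a short delay between requests is a reasonable courtesy to the server.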

Don't forget to install curl_cffi with pip:

pip install curl_cffi --upgrade

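Note: I removed 2 parameters (pgid and zones) that don't seem to do anything. If you find any differences between these results and the ones in the HTML (__NEXT_DATA__), you can try adding them back (copy them from dev tools).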
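If you do need to add them back, one way is to copy the full API request URL from the dev tools network tab and pull the two values out of its query string. A minimal sketch, where `merge_copied_params` is a hypothetical helper and the URL you pass in is whatever your own browser sent:

```python
from urllib.parse import parse_qsl, urlsplit


def merge_copied_params(params, copied_url, keys=("pgid", "zones")):
    """Return a copy of params with the selected query parameters of copied_url added."""
    # parse_qsl also percent-decodes the values, e.g. %2C becomes a comma.
    copied = dict(parse_qsl(urlsplit(copied_url).query))
    merged = dict(params)  # leave the caller's dict untouched
    for key in keys:
        if key in copied:
            merged[key] = copied[key]
    return merged
```

The returned dict can then be passed as `params=` in the request in place of the original one.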
