LookupError: unknown encoding: 'b'utf8''


I don't know why, but when I try to scrape and parse a Walmart web page, I get a LookupError: unknown encoding: 'b'utf8''.

I have already set the encoding to utf-8 and tried stripping the BOM, following this post: lxml LookupError occurred. Args: ("unknown encoding: 'b'utf-8-sig''",)

Any help or pointers are appreciated!

Full code:

import httpx
from parsel import Selector
import json

# Fake browser-like headers
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    # Language ranges are comma-separated; only the q-value uses a semicolon
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}

response = httpx.get("https://www.walmart.com/product-page-url", headers=BASE_HEADERS)
if response.encoding is None:
    response.encoding = 'utf-8' 

# Remove BOM if present
content = response.content
if content.startswith(b'\xef\xbb\xbf'):
    content = content[3:]  # Remove the BOM

response_text = content.decode('utf-8')
sel = Selector(text=response_text)
data = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()

if data:
    data = json.loads(data)
    product = data["props"]["pageProps"]["initialData"]["data"]["product"]
    print(product)
else:
    print("No product data found.")
1 Answer

The URL is wrong.

response = httpx.get("https://www.walmart.com/product-page-url", headers=BASE_HEADERS)

This yields a 307 redirect or a 404. The content-type encoding does not matter if there is no content. Pick a supported URL to fetch.
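
To make the point above concrete, here is a minimal sketch that checks the status code and body before any decoding or parsing is attempted. The helper names describe_problem and extract_next_data are hypothetical, and a plain regular expression stands in for parsel's Selector so the snippet depends only on the standard library; it is an illustration under those assumptions, not a drop-in replacement for the code in the question.

```python
import json
import re

# Regex stand-in for the //script[@id="__NEXT_DATA__"] XPath (an assumption
# made to keep this sketch stdlib-only; the question uses parsel instead).
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)


def describe_problem(status_code, body):
    """Return a reason string if the response is unusable, else None."""
    if 300 <= status_code < 400:
        return f"redirect ({status_code}); the requested URL is not the final page"
    if status_code >= 400:
        return f"error status ({status_code})"
    if not body:
        return "empty body; nothing to decode or parse"
    return None


def extract_next_data(status_code, body):
    """Parse the __NEXT_DATA__ JSON from raw response bytes, or return None."""
    problem = describe_problem(status_code, body)
    if problem is not None:
        raise ValueError(problem)
    # The utf-8-sig codec drops a leading BOM if one is present.
    text = body.decode("utf-8-sig")
    match = NEXT_DATA_RE.search(text)
    return json.loads(match.group(1)) if match else None
```

With httpx this could be called as extract_next_data(response.status_code, response.content). Passing follow_redirects=True to httpx.get makes the client follow the 307, although that only helps if the redirect target is a real product page.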
