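Originally I was scraping this web page the normal way, but the site was updated recently, so my request now returns JavaScript in the HTML body. So I decided to change my code to pull the data from the backend via a POST request instead.
The page I am trying to scrape is https://www.tesco.ie/groceries/en-IE/shop/fresh-food/fresh-fruit/all?page=1&count=48, and my code looks like this: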
import json
import requests

# Headers sent with the POST request
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/112.0.0.0 Safari/537.36 Edg/112.0.1722.64',
    'Content-Type': 'application/json; charset=UTF-8',
    'Accept-Language': 'en-US,en;q=0.9',
    'X-Requested-With': 'XMLHttpRequest',
}

# JSON body describing the resources to fetch
payload = {
    "resources": [
        {"type": "appState",
         "params": {},
         "hash": "8608229003782371"},
        {"type": "trolleyContents",
         "params": {},
         "hash": "2574718136506441"},
        {"type": "productsByCategory",
         "params": {
             "aisle": "all",
             "department": "fresh-fruit",
             "query": {
                 "count": "48",
                 "page": "1"},
             "superdepartment": "fresh-food"
         },
         "hash": "4571228679394986"}
    ],
    "sharedParams": {
        "superdepartment": "fresh-food",
        "department": "fresh-fruit",
        "aisle": "all",
        "referer": "/groceries/en-IE/shop/fresh-food/fresh-fruit/all?page=5&count=48",
        "query": {
            "count": "48",
            "page": "1"
        }
    },
    "requiresAuthentication": "false"
}

server_url = 'https://www.tesco.ie/groceries/en-IE/resources'

with requests.Session() as s:
    data = s.post(
        server_url,
        headers=headers,
        data=json.dumps(payload)
    )
    # Prints the Response object's repr (status code only), not the body
    print(data)
This code returns
Apologies if I am just missing something simple. Thanks for your help!
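
One thing I noticed while writing this up: the payload repeats the category and paging values in several places, and the referer in my snippet says page=5 while both query blocks say page=1. Below is a minimal sketch of building the body from a single set of values so they cannot drift apart. `build_payload`, `HASHES` and `BASE_PATH` are hypothetical names I made up for illustration; the hash values are copied from the payload above, and I am only assuming they stay valid across pages. I do not know whether this mismatch is related to the response I am getting.

```python
import json

# Hash values copied verbatim from the payload above. Whether the server
# checks them, or how they are derived, is unknown (assumption).
HASHES = {
    "appState": "8608229003782371",
    "trolleyContents": "2574718136506441",
    "productsByCategory": "4571228679394986",
}

# Path prefix taken from the referer string in the original payload.
BASE_PATH = "/groceries/en-IE/shop"


def build_payload(superdepartment, department, aisle, page=1, count=48):
    """Hypothetical helper: build the POST body from one set of values."""
    # The original payload sends count and page as strings, so do the same.
    query = {"count": str(count), "page": str(page)}
    category = {
        "superdepartment": superdepartment,
        "department": department,
        "aisle": aisle,
    }
    referer = (
        f"{BASE_PATH}/{superdepartment}/{department}/{aisle}"
        f"?page={page}&count={count}"
    )
    return {
        "resources": [
            {"type": "appState", "params": {}, "hash": HASHES["appState"]},
            {"type": "trolleyContents", "params": {},
             "hash": HASHES["trolleyContents"]},
            {"type": "productsByCategory",
             "params": {**category, "query": dict(query)},
             "hash": HASHES["productsByCategory"]},
        ],
        "sharedParams": {**category, "referer": referer, "query": dict(query)},
        # Kept as the string "false" to match the original payload.
        "requiresAuthentication": "false",
    }


payload = build_payload("fresh-food", "fresh-fruit", "all", page=1, count=48)
body = json.dumps(payload)  # string that would be passed as data=... to s.post
```

With this, `body` can be passed to `s.post(server_url, headers=headers, data=body)` exactly as in my snippet. Printing `data.status_code` and `data.text` instead of `data` would also show the status and the actual response body rather than just the Response object.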