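The website https://www.supralift.com/uk/itemsearch/results uses a JavaScript-based pager that does not expose any parameters in the URL that I could change to navigate through the site.
Looking at the Network tab of Chrome DevTools, I found that the site also exposes fairly complete information under the URL /api/search/item/summary. Calling this API endpoint directly returns no data, only an error: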
{
"type": "about:blank",
"title": "Method Not Allowed",
"status": 405,
"detail": "Method 'GET' is not supported.",
"instance": "/api/search/item/summary"
}
How can I use this hidden API to scrape the website?
Thanks a lot in advance.
Without a code example it is hard to guess how exactly you approached this. Judging by the included tags, perhaps through httr::GET() or rvest::read_html_live()?
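In any case, the Network tab in DevTools shows the actual request method, which in this case is POST rather than GET, so httr::GET() is not much use here. A POST request usually needs a payload; in this case it is JSON with the search parameters (check Network > Payload in DevTools).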
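As a minimal sketch of what such a POST request looks like in httr2 (assuming httr2 is an acceptable alternative to httr): the payload below is a hypothetical, heavily trimmed stand-in for the full JSON shown under Network > Payload, and it is not verified that the server accepts a reduced body. Nothing is sent here; the request object is only built and printed.

```r
library(httr2)

# Hypothetical, trimmed-down payload for illustration only;
# copy the real JSON from DevTools (Network > Payload).
payload <- list(
  offerDetails  = list(offerType = "SALE"),
  backendSearch = FALSE
)

req <- request("https://www.supralift.com/api/search/item/summary") |>
  req_url_query(size = 10, page = 0) |>
  # adding a body switches the request method from GET to POST
  req_body_json(payload)

# Printing only shows what would be sent; req_perform(req) would send it
print(req)
```

Building the request is separate from performing it, so the URL, method and body can be checked before anything reaches the server.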
To make the same request in R, you can first copy the request as cURL and translate it with https://curlconverter.com/r/ or a similar tool. Or, if httr2 is also an option, use httr2::curl_translate().
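For illustration, curl_translate() can also take the copied command as a string instead of reading it from the clipboard. The command below is a hypothetical, shortened stand-in for what "Copy as cURL (bash)" produces; the real one carries more headers and the full payload.

```r
library(httr2)

# Hypothetical, shortened version of a "Copy as cURL (bash)" command
cmd <- r"(curl 'https://www.supralift.com/api/search/item/summary?size=10&page=0' -H 'Content-Type: application/json' --data-raw '{"backendSearch":false}')"

# Returns the equivalent httr2 code as text; no request is sent
translated <- curl_translate(cmd)
translated
```

The printed code can then be pasted into a script and adjusted, for example to change the size and page query parameters.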
Request method & Copy as cURL (if it is going to be the input for a conversion tool, you can use the bash variant even on Windows):
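The following is based on httr2::curl_translate() output, though in this case it was basically just a handy shortcut for getting the payload; using req_body_raw() automatically sets the request method to POST. httr2 also provides tools for performing requests iteratively, where the next request is based on the current response, e.g. for cycling through pages while checking whether we have reached the last one.
req_perform_iterative() is used with the iterate_with_offset() helper to generate the next request from the current response; resp_pages is an anonymous function that extracts the totalPages value from the first JSON response, and resp_complete checks whether last is set in the JSON response. The iterator stops once resp_complete returns TRUE or the maximum number of requests (max_reqs) has been made.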
library(httr2)
# prepare 1st request object, size increased
req <- request("https://www.supralift.com/api/search/item/summary") |>
req_url_query(
size = "2000",
page = "0",
) |>
req_body_raw('{"searchType":null,"bundleId":null,"identification":{"slNumber":null,"serialNumber":null,"supplierProductNumber":null,"slOrSupplierProductNumber":null},"configuration":{"buildClass":null,"manufacturer":null,"buildSeries":null,"acDriven":null,"powerUnit":null,"fuelType":null,"mastType":null,"gearBox":null,"tyres":null,"typeSearch":null},"buildDates":{"month":null,"year":{"from":null,"to":null}},"dimensions":{"overallHeight":{"from":null,"to":null},"workingHours":{"from":null,"to":null},"loadCentreOfGravity":{"from":null,"to":null},"capacity":{"from":null,"to":null},"forkLength":{"from":null,"to":null},"towingCapacity":{"from":null,"to":null}},"price":{"price":{"from":null,"to":null,"currency":"GBP"}},"cabin":{"cabin":null,"height":{"from":null,"to":null},"platformHeight":{"from":null,"to":null}},"engine":{"manufacturer":null,"power":null},"battery":{"exists":null,"manufacturer":null,"batteryType":null,"voltage":{"from":null,"to":null},"capacity":{"from":null,"to":null},"buildDates":null},"batteryCharger":{"exists":null,"manufacturer":null,"voltage":{"from":null,"to":null},"current":{"from":null,"to":null},"buildDates":{"month":null,"year":{"from":null,"to":null}}},"location":{"distance":100,"postCode":null,"region":null,"countryState":null,"country":null,"countryOrNull":null},"container":{"containerType":null,"hubhoehe8Z3":{"from":null,"to":null},"hubhoehe8Z4":{"from":null,"to":null},"hubhoehe8Z5":{"from":null,"to":null},"hubhoehe8Z6":{"from":null,"to":null},"hubhoehe8Z7":{"from":null,"to":null},"hubhoehe8Z8":{"from":null,"to":null},"hubhoehe8Z6I3":{"from":null,"to":null},"hubhoehe8Z6I4":{"from":null,"to":null},"hubhoehe8Z6I5":{"from":null,"to":null},"hubhoehe8Z6I6":{"from":null,"to":null},"hubhoehe8Z6I7":{"from":null,"to":null},"hubhoehe8Z6I8":{"from":null,"to":null},"hubhoehe9Z6I3":{"from":null,"to":null},"hubhoehe9Z6I4":{"from":null,"to":null},"hubhoehe9Z6I5":{"from":null,"to":null},"hubhoehe9Z6I6":{"from":null,"to":null},"hubhoehe9Z6I7":{"from":null
,"to":null},"hubhoehe9Z6I8":{"from":null,"to":null}},"offerDetails":{"offerBegin":null,"maxOfferAge":null,"activationDate":null,"offerFormat":null,"dealsOnly":null,"imagesOnly":null,"offerType":"SALE"},"additionalHydraulic":{"toValve":null,"complete":null},"liftAttributes":{"initialLift":null,"liftHeight":{"from":null,"to":null},"freeLift":{"from":null,"to":null},"liftPower":null},"isLicensedDealerOnly":null,"warranty":{"from":null,"to":null},"qualityRating":null,"attachments":null,"accessories":null,"customFields":null,"specialAttributes":{"explosionProof":null,"stainlessSteel":null,"autonomousMobileRobot":null},"freightTerm":null,"itemStatus":[],"backendSearch":false}', "application/json")
# perform series of requests, increase `page` parameter until
# `resp_complete` returns TRUE or when reaching `max_reqs`
resps <-
  req_perform_iterative(
    req,
    next_req = iterate_with_offset(
      param_name = "page",
      start = 0,
      resp_pages = \(resp) resp_body_json(resp)$totalPages,
      resp_complete = \(resp) resp_body_json(resp)$last
    ),
    # generate just first 2 requests as an example
    max_reqs = 2
  )
#> Iterating ■■■■■■■■■■■■■■■■ 50% | ETA: 3s
# list of responses:
resps
#> [[1]]
#> <httr2_response>
#> POST https://www.supralift.com/api/search/item/summary?size=2000&page=0
#> Status: 200 OK
#> Content-Type: application/json
#> Body: In memory (2043726 bytes)
#>
#> [[2]]
#> <httr2_response>
#> POST https://www.supralift.com/api/search/item/summary?size=2000&page=1
#> Status: 200 OK
#> Content-Type: application/json
#> Body: In memory (1933115 bytes)
All data in a single frame, 2000 (size) * 2 (max_reqs) = 4000 rows:
resps_data(resps, \(resp) resp_body_json(resp, simplifyVector = TRUE)$content) |>
tibble::as_tibble()
#> # A tibble: 4,000 × 18
#> id slNo type manufacturer powerUnit buildClass mastType liftHeight
#> <chr> <int> <chr> <chr> <chr> <chr> <chr> <int>
#> 1 66f8d101f… 1.29e7 R14S LINDE ELEKTRO SCHUBMAST… DREIFACH 6250
#> 2 66f8ccfaf… 1.29e7 EPL1… WEITERE ELEKTRO NIEDERHUB… <NA> 115
#> 3 66f8c894f… 1.29e7 27823 JUNGHEINRICH ELEKTRO WEITERE DREIFACH 5500
#> 4 66f8c886f… 1.29e7 27799 JUNGHEINRICH ELEKTRO WEITERE DREIFACH 5000
#> 5 66f8c595f… 1.29e7 T401… BOBCAT DIESEL TELESKOPA… <NA> 17000
#> 6 66f8c4e7f… 1.29e7 TFG … JUNGHEINRICH GAS VIERRADFR… DREIFACH 5000
#> 7 66f8b525f… 1.29e7 EFG … JUNGHEINRICH ELEKTRO VIERRADFR… DREIFACH 6000
#> 8 66f81676f… 1.28e7 R16G LINDE ELEKTRO SCHUBMAST… <NA> 6210
#> 9 66f80016f… 1.29e7 C500… COMBILIFT ELEKTRO SEITENSTA… DREIFACH 6000
#> 10 66f80015f… 1.29e7 H30D… LINDE DIESEL VIERRADFR… ZWEIFACH 3760
#> # ℹ 3,990 more rows
#> # ℹ 10 more variables: workingHours <int>, capacity <int>, yearOfBuild <int>,
#> # company <df[,6]>, offerType <chr>, isNew <lgl>, price <df[,4]>,
#> # qualityRating <int>, images <list>, specialRating <df[,4]>
Created on 2024-09-29 with reprex v2.1.1
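To see what the two callbacks passed to iterate_with_offset() return without contacting the server, they can be applied to a mock response. This is only a sketch: the field names (content, totalPages, last) come from the real JSON response, while the values below are made up.

```r
library(httr2)

# Mock of one page of the JSON response; values are invented for illustration
mock <- response_json(
  body = list(
    content = list(list(id = "a"), list(id = "b")),
    totalPages = 3,
    last = FALSE
  )
)

# Same callbacks as used with iterate_with_offset()
resp_pages    <- \(resp) resp_body_json(resp)$totalPages
resp_complete <- \(resp) resp_body_json(resp)$last

resp_pages(mock)     # total number of pages reported by the response
resp_complete(mock)  # FALSE means: keep requesting the next page
```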
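In your question you say that the endpoint /api/search/item/summary exposes fairly complete information. Nevertheless, the server responds with 405 Method Not Allowed. This error occurs because the server only accepts POST requests (you can see that the Allow: response header lists only POST). Here is the full POST request with its JSON body: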
POST /api/search/item/summary?size=100&page=0&sort=price,asc HTTP/2
Host: www.supralift.com
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:130.0) Gecko/20100101 Firefox/130.0
Accept: application/json, text/plain, */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Referer: https://www.supralift.com/
Content-Type: application/json
Access-Control-Allow-Origin: https://www.supralift.com
Content-Length: 2664
Origin: https://www.supralift.com
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin
Priority: u=4
Te: trailers
{"searchType":null,"bundleId":null,"identification":{"slNumber":null,"serialNumber":null,"supplierProductNumber":null,"slOrSupplierProductNumber":null},"configuration":{"buildClass":null,"manufacturer":null,"buildSeries":null,"acDriven":null,"powerUnit":null,"fuelType":null,"mastType":null,"gearBox":null,"tyres":null,"typeSearch":null},"buildDates":{"month":null,"year":{"from":null,"to":null}},"dimensions":{"overallHeight":{"from":null,"to":null},"workingHours":{"from":null,"to":null},"loadCentreOfGravity":{"from":null,"to":null},"capacity":{"from":null,"to":null},"forkLength":{"from":null,"to":null},"towingCapacity":{"from":null,"to":null}},"price":{"price":{"from":null,"to":null,"currency":"GBP"}},"cabin":{"cabin":null,"height":{"from":null,"to":null},"platformHeight":{"from":null,"to":null}},"engine":{"manufacturer":null,"power":null},"battery":{"exists":null,"manufacturer":null,"batteryType":null,"voltage":{"from":null,"to":null},"capacity":{"from":null,"to":null},"buildDates":null},"batteryCharger":{"exists":null,"manufacturer":null,"voltage":{"from":null,"to":null},"current":{"from":null,"to":null},"buildDates":{"month":null,"year":{"from":null,"to":null}}},"location":{"distance":100,"postCode":null,"region":null,"countryState":null,"country":null,"countryOrNull":null},"container":{"containerType":null,"hubhoehe8Z3":{"from":null,"to":null},"hubhoehe8Z4":{"from":null,"to":null},"hubhoehe8Z5":{"from":null,"to":null},"hubhoehe8Z6":{"from":null,"to":null},"hubhoehe8Z7":{"from":null,"to":null},"hubhoehe8Z8":{"from":null,"to":null},"hubhoehe8Z6I3":{"from":null,"to":null},"hubhoehe8Z6I4":{"from":null,"to":null},"hubhoehe8Z6I5":{"from":null,"to":null},"hubhoehe8Z6I6":{"from":null,"to":null},"hubhoehe8Z6I7":{"from":null,"to":null},"hubhoehe8Z6I8":{"from":null,"to":null},"hubhoehe9Z6I3":{"from":null,"to":null},"hubhoehe9Z6I4":{"from":null,"to":null},"hubhoehe9Z6I5":{"from":null,"to":null},"hubhoehe9Z6I6":{"from":null,"to":null},"hubhoehe9Z6I7":{"from":null,"to":null},"hubhoehe9Z6I8":{"from":null,"to":null}},"offerDetails":{"offerBegin":null,"maxOfferAge":null,"activationDate":null,"offerFormat":null,"dealsOnly":null,"imagesOnly":null,"offerType":"SALE"},"additionalHydraulic":{"toValve":null,"complete":null},"liftAttributes":{"initialLift":null,"liftHeight":{"from":null,"to":null},"freeLift":{"from":null,"to":null},"liftPower":null},"isLicensedDealerOnly":null,"warranty":{"from":null,"to":null},"qualityRating":null,"attachments":null,"accessories":null,"customFields":null,"specialAttributes":{"explosionProof":null,"stainlessSteel":null,"autonomousMobileRobot":null},"freightTerm":null,"itemStatus":[],"backendSearch":false}
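The Allow header mentioned above can also be read programmatically. The sketch below uses a mock 405 response as a stand-in for what a GET to the endpoint returns, so nothing is sent; allowed_methods() is a hypothetical helper, not part of httr2.

```r
library(httr2)

# Mock 405 response standing in for the server's reply to a GET request
resp <- response(
  status_code = 405,
  url = "https://www.supralift.com/api/search/item/summary",
  method = "GET",
  headers = list(Allow = "POST")
)

# Hypothetical helper: split the Allow header into individual methods
allowed_methods <- function(resp) {
  trimws(strsplit(resp_header(resp, "Allow"), ",")[[1]])
}

allowed_methods(resp)
```

With a live request, httr2 would turn the 405 status into an R error by default; adding req_error(is_error = \(resp) FALSE) to the request keeps the response available for inspection.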
Well, I used a proxy tool from PortSwigger called Burp Suite, which is very helpful for both developers and penetration testers.
If you are interested in Burp Suite, follow these steps:
the FoxyProxy extension for Firefox/Chrome (search Google for FoxyProxy)
the FoxyProxy extension
Settings --> search for certificates (google it for a better solution). That's it!
If you run into any problems setting up Burp Suite, don't forget to google them.
Hope this helps.
Thanks