Scraping a website with a JavaScript pager using a hidden API


The website https://www.supralift.com/uk/itemsearch/results uses a JavaScript-based pager that does not expose any parameters in the URL that I could change to navigate through the site.

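Looking at the Network tab of Chrome DevTools, I found that the site also exposes fairly complete information under the URL /api/search/item/summary. Calling this API endpoint returns no results, only the following error: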

{
  "type": "about:blank",
  "title": "Method Not Allowed",
  "status": 405,
  "detail": "Method 'GET' is not supported.",
  "instance": "/api/search/item/summary"
}

How can I use this hidden API to scrape the website?

Many thanks in advance.

r web-scraping rvest httr
2 Answers

1 vote

Without a code example it is hard to guess how exactly you approached this. Judging from the included tags, it was probably through httr::GET() or rvest::read_html_live().

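In any case, the Network tab of DevTools shows the actual request method, which in this case is POST rather than GET, so httr::GET() is of little use here. A POST request usually needs a payload; in this case it is JSON with the search parameters (check Network > Payload in DevTools).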
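Since the payload is plain JSON, the search parameters can also be edited from R before the request is sent. The sketch below is only an illustration: it works on a shortened subset of the payload, the "EUR" currency value is an assumption that has not been checked against the server, and it assumes the jsonlite package is installed.

```r
library(httr2)
library(jsonlite)

# Shortened subset of the payload copied from DevTools (illustration only)
payload <- '{"price":{"price":{"from":null,"to":null,"currency":"GBP"}},"offerDetails":{"offerType":"SALE"},"itemStatus":[],"backendSearch":false}'

# Parse into a nested list; simplifyVector = FALSE keeps the structure as-is
params <- fromJSON(payload, simplifyVector = FALSE)

# Change one search parameter ("EUR" is an assumed, untested value)
params$price$price$currency <- "EUR"

# Serialise again: auto_unbox keeps scalars as scalars,
# null = "null" writes R NULLs back as JSON null
body <- toJSON(params, auto_unbox = TRUE, null = "null")

# Attaching a body makes httr2 send the request as POST once it is performed
req <- request("https://www.supralift.com/api/search/item/summary") |>
  req_body_raw(as.character(body), "application/json")
```

Nothing is sent until the request is passed to req_perform(), so the body can be inspected first.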

To make the same request in R, you can first copy the request as cURL and translate it with https://curlconverter.com/r/ or a similar tool.

Or, if httr2 is also an option, use httr2::curl_translate().
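A minimal sketch of that translation step, assuming a much shortened stand-in for the copied command (the real one carries more headers and the full JSON payload). As far as I know, curl_translate() also needs the docopt package to be installed.

```r
library(httr2)

# Shortened stand-in for a command copied via "Copy as cURL (bash)";
# the body here is a one-field placeholder, not the real payload
cmd <- paste(
  "curl 'https://www.supralift.com/api/search/item/summary?size=100&page=0'",
  "-H 'Content-Type: application/json'",
  "--data-raw '{\"searchType\":null}'"
)

# Returns the equivalent httr2 code as text; nothing is sent to the server
translated <- curl_translate(cmd)
translated
```

The printed code can be pasted into a script and adjusted, which is how the request further down was obtained.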

Request method and Copy as cURL (if the command is going to be input for some conversion tool, pick the bash variant, even on Windows): Network tab of Chrome DevTools

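The following is based on httr2::curl_translate() output, though in this case it is basically just a convenient shortcut for getting the payload; using req_body_raw() automatically sets the request method to POST. httr2 also provides tools for performing requests iteratively, where the next request is based on the response to the current one, e.g. to cycle through pages while checking whether the last page has been reached.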

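req_perform_iterative() is used together with the iterate_with_offset() helper, which generates the next request from the current response. resp_pages is an anonymous function that extracts the totalPages value from the first JSON response, and resp_complete checks whether last is set in the JSON response. The iterator stops once resp_complete returns TRUE or the maximum number of requests (max_reqs) has been made.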

library(httr2)

# prepare 1st request object, size increased
req <- request("https://www.supralift.com/api/search/item/summary") |> 
  req_url_query(
    size = "2000",
    page = "0",
  ) |> 
  req_body_raw('{"searchType":null,"bundleId":null,"identification":{"slNumber":null,"serialNumber":null,"supplierProductNumber":null,"slOrSupplierProductNumber":null},"configuration":{"buildClass":null,"manufacturer":null,"buildSeries":null,"acDriven":null,"powerUnit":null,"fuelType":null,"mastType":null,"gearBox":null,"tyres":null,"typeSearch":null},"buildDates":{"month":null,"year":{"from":null,"to":null}},"dimensions":{"overallHeight":{"from":null,"to":null},"workingHours":{"from":null,"to":null},"loadCentreOfGravity":{"from":null,"to":null},"capacity":{"from":null,"to":null},"forkLength":{"from":null,"to":null},"towingCapacity":{"from":null,"to":null}},"price":{"price":{"from":null,"to":null,"currency":"GBP"}},"cabin":{"cabin":null,"height":{"from":null,"to":null},"platformHeight":{"from":null,"to":null}},"engine":{"manufacturer":null,"power":null},"battery":{"exists":null,"manufacturer":null,"batteryType":null,"voltage":{"from":null,"to":null},"capacity":{"from":null,"to":null},"buildDates":null},"batteryCharger":{"exists":null,"manufacturer":null,"voltage":{"from":null,"to":null},"current":{"from":null,"to":null},"buildDates":{"month":null,"year":{"from":null,"to":null}}},"location":{"distance":100,"postCode":null,"region":null,"countryState":null,"country":null,"countryOrNull":null},"container":{"containerType":null,"hubhoehe8Z3":{"from":null,"to":null},"hubhoehe8Z4":{"from":null,"to":null},"hubhoehe8Z5":{"from":null,"to":null},"hubhoehe8Z6":{"from":null,"to":null},"hubhoehe8Z7":{"from":null,"to":null},"hubhoehe8Z8":{"from":null,"to":null},"hubhoehe8Z6I3":{"from":null,"to":null},"hubhoehe8Z6I4":{"from":null,"to":null},"hubhoehe8Z6I5":{"from":null,"to":null},"hubhoehe8Z6I6":{"from":null,"to":null},"hubhoehe8Z6I7":{"from":null,"to":null},"hubhoehe8Z6I8":{"from":null,"to":null},"hubhoehe9Z6I3":{"from":null,"to":null},"hubhoehe9Z6I4":{"from":null,"to":null},"hubhoehe9Z6I5":{"from":null,"to":null},"hubhoehe9Z6I6":{"from":null,"to":null},"hubhoehe9Z6I7":{"from":null,"to":null},"hubhoehe9Z6I8":{"from":null,"to":null}},"offerDetails":{"offerBegin":null,"maxOfferAge":null,"activationDate":null,"offerFormat":null,"dealsOnly":null,"imagesOnly":null,"offerType":"SALE"},"additionalHydraulic":{"toValve":null,"complete":null},"liftAttributes":{"initialLift":null,"liftHeight":{"from":null,"to":null},"freeLift":{"from":null,"to":null},"liftPower":null},"isLicensedDealerOnly":null,"warranty":{"from":null,"to":null},"qualityRating":null,"attachments":null,"accessories":null,"customFields":null,"specialAttributes":{"explosionProof":null,"stainlessSteel":null,"autonomousMobileRobot":null},"freightTerm":null,"itemStatus":[],"backendSearch":false}', "application/json")

# perform series of requests, increase `page` parameter until 
# `resp_complete` returns TRUE or when reaching `max_reqs`
resps <- 
  req_perform_iterative(
    req, 
    next_req = iterate_with_offset(
      param_name =  "page", 
      start = 0,
      resp_pages       = \(resp) resp_body_json(resp)$totalPages,
      resp_complete    = \(resp) resp_body_json(resp)$last,
      ),
    # generate just first 2 requests as an example
    max_reqs = 2
    )
#> Iterating ■■■■■■■■■■■■■■■■ 50% | ETA: 3s

# list of responses:
resps
#> [[1]]
#> <httr2_response>
#> POST https://www.supralift.com/api/search/item/summary?size=2000&page=0
#> Status: 200 OK
#> Content-Type: application/json
#> Body: In memory (2043726 bytes)
#> 
#> [[2]]
#> <httr2_response>
#> POST https://www.supralift.com/api/search/item/summary?size=2000&page=1
#> Status: 200 OK
#> Content-Type: application/json
#> Body: In memory (1933115 bytes)

All data in a single data frame, 2000 (size) * 2 (max_reqs) = 4000 rows:

resps_data(resps, \(resp) resp_body_json(resp, simplifyVector = TRUE)$content) |> 
  tibble::as_tibble()
#> # A tibble: 4,000 × 18
#>    id           slNo type  manufacturer powerUnit buildClass mastType liftHeight
#>    <chr>       <int> <chr> <chr>        <chr>     <chr>      <chr>         <int>
#>  1 66f8d101f… 1.29e7 R14S  LINDE        ELEKTRO   SCHUBMAST… DREIFACH       6250
#>  2 66f8ccfaf… 1.29e7 EPL1… WEITERE      ELEKTRO   NIEDERHUB… <NA>            115
#>  3 66f8c894f… 1.29e7 27823 JUNGHEINRICH ELEKTRO   WEITERE    DREIFACH       5500
#>  4 66f8c886f… 1.29e7 27799 JUNGHEINRICH ELEKTRO   WEITERE    DREIFACH       5000
#>  5 66f8c595f… 1.29e7 T401… BOBCAT       DIESEL    TELESKOPA… <NA>          17000
#>  6 66f8c4e7f… 1.29e7 TFG … JUNGHEINRICH GAS       VIERRADFR… DREIFACH       5000
#>  7 66f8b525f… 1.29e7 EFG … JUNGHEINRICH ELEKTRO   VIERRADFR… DREIFACH       6000
#>  8 66f81676f… 1.28e7 R16G  LINDE        ELEKTRO   SCHUBMAST… <NA>           6210
#>  9 66f80016f… 1.29e7 C500… COMBILIFT    ELEKTRO   SEITENSTA… DREIFACH       6000
#> 10 66f80015f… 1.29e7 H30D… LINDE        DIESEL    VIERRADFR… ZWEIFACH       3760
#> # ℹ 3,990 more rows
#> # ℹ 10 more variables: workingHours <int>, capacity <int>, yearOfBuild <int>,
#> #   company <df[,6]>, offerType <chr>, isNew <lgl>, price <df[,4]>,
#> #   qualityRating <int>, images <list>, specialRating <df[,4]>

Created on 2024-09-29 with reprex v2.1.1
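To see what the two callbacks passed to iterate_with_offset() return, they can be tried offline on a hand-built response. This is only a sketch: the response is made up with httr2::response() and carries just the totalPages and last fields, with invented values.

```r
library(httr2)

# Hand-built response that mimics only the paging fields of the API reply;
# the values are invented for illustration
fake_resp <- response(
  status_code = 200,
  headers = list("Content-Type" = "application/json"),
  body = charToRaw('{"content": [], "totalPages": 3, "last": false}')
)

# Same callbacks as in the req_perform_iterative() call
resp_pages    <- \(resp) resp_body_json(resp)$totalPages
resp_complete <- \(resp) resp_body_json(resp)$last

resp_pages(fake_resp)     # total number of pages reported by the API
resp_complete(fake_resp)  # TRUE only on the last page
```

To fetch everything rather than the first two pages, max_reqs can be raised (or set to Inf), so that the iteration ends when resp_complete returns TRUE.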


0 votes

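In your question you say that the endpoint /api/search/item/summary exposes some fairly complete information. Nevertheless, the server responds with 405 Method Not Allowed. This error occurs because the server only accepts POST requests (you can see that the Allow: response header lists only POST, screenshot). Here is the full POST request with its JSON body: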

POST /api/search/item/summary?size=100&page=0&sort=price,asc HTTP/2
Host: www.supralift.com
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:130.0) Gecko/20100101 Firefox/130.0
Accept: application/json, text/plain, */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Referer: https://www.supralift.com/
Content-Type: application/json
Access-Control-Allow-Origin: https://www.supralift.com
Content-Length: 2664
Origin: https://www.supralift.com
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin
Priority: u=4
Te: trailers

{"searchType":null,"bundleId":null,"identification":{"slNumber":null,"serialNumber":null,"supplierProductNumber":null,"slOrSupplierProductNumber":null},"configuration":{"buildClass":null,"manufacturer":null,"buildSeries":null,"acDriven":null,"powerUnit":null,"fuelType":null,"mastType":null,"gearBox":null,"tyres":null,"typeSearch":null},"buildDates":{"month":null,"year":{"from":null,"to":null}},"dimensions":{"overallHeight":{"from":null,"to":null},"workingHours":{"from":null,"to":null},"loadCentreOfGravity":{"from":null,"to":null},"capacity":{"from":null,"to":null},"forkLength":{"from":null,"to":null},"towingCapacity":{"from":null,"to":null}},"price":{"price":{"from":null,"to":null,"currency":"GBP"}},"cabin":{"cabin":null,"height":{"from":null,"to":null},"platformHeight":{"from":null,"to":null}},"engine":{"manufacturer":null,"power":null},"battery":{"exists":null,"manufacturer":null,"batteryType":null,"voltage":{"from":null,"to":null},"capacity":{"from":null,"to":null},"buildDates":null},"batteryCharger":{"exists":null,"manufacturer":null,"voltage":{"from":null,"to":null},"current":{"from":null,"to":null},"buildDates":{"month":null,"year":{"from":null,"to":null}}},"location":{"distance":100,"postCode":null,"region":null,"countryState":null,"country":null,"countryOrNull":null},"container":{"containerType":null,"hubhoehe8Z3":{"from":null,"to":null},"hubhoehe8Z4":{"from":null,"to":null},"hubhoehe8Z5":{"from":null,"to":null},"hubhoehe8Z6":{"from":null,"to":null},"hubhoehe8Z7":{"from":null,"to":null},"hubhoehe8Z8":{"from":null,"to":null},"hubhoehe8Z6I3":{"from":null,"to":null},"hubhoehe8Z6I4":{"from":null,"to":null},"hubhoehe8Z6I5":{"from":null,"to":null},"hubhoehe8Z6I6":{"from":null,"to":null},"hubhoehe8Z6I7":{"from":null,"to":null},"hubhoehe8Z6I8":{"from":null,"to":null},"hubhoehe9Z6I3":{"from":null,"to":null},"hubhoehe9Z6I4":{"from":null,"to":null},"hubhoehe9Z6I5":{"from":null,"to":null},"hubhoehe9Z6I6":{"from":null,"to":null},"hubhoehe9Z6I7":{"from":null,"to":null},"hubhoehe9Z6I8":{"from":null,"to":null}},"offerDetails":{"offerBegin":null,"maxOfferAge":null,"activationDate":null,"offerFormat":null,"dealsOnly":null,"imagesOnly":null,"offerType":"SALE"},"additionalHydraulic":{"toValve":null,"complete":null},"liftAttributes":{"initialLift":null,"liftHeight":{"from":null,"to":null},"freeLift":{"from":null,"to":null},"liftPower":null},"isLicensedDealerOnly":null,"warranty":{"from":null,"to":null},"qualityRating":null,"attachments":null,"accessories":null,"customFields":null,"specialAttributes":{"explosionProof":null,"stainlessSteel":null,"autonomousMobileRobot":null},"freightTerm":null,"itemStatus":[],"backendSearch":false}
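The 405 status and the Allow header can also be read from R. A minimal sketch with httr2, using a made-up response object in place of the real server reply so that it runs offline; the probe() helper is a hypothetical name for the live variant and is not called here.

```r
library(httr2)

# Stand-in for the server's reply to a GET request; the status code and the
# Allow value follow the description above, the object itself is made up
fake_405 <- response(
  status_code = 405,
  headers = list(Allow = "POST")
)

resp_status(fake_405)           # HTTP status code of the reply
resp_header(fake_405, "Allow")  # methods the endpoint accepts

# Live variant (not run here): req_error() keeps httr2 from turning the
# 405 into an R error, so the headers can still be inspected afterwards
probe <- function(url) {
  request(url) |>
    req_error(is_error = \(resp) FALSE) |>
    req_perform()
}
```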

Which tool should you use to capture such requests?

Well, I used a proxy tool from PortSwigger called Burp Suite, which is very helpful for both developers and penetration testers.

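If you are interested in Burp Suite, follow these steps:

  • Download the Community Edition (because it is free).
  • Download the Firefox/Chrome extension called FoxyProxy (search for FoxyProxy on Google).
  • Set up the Burp proxy in your browser (with the help of the FoxyProxy extension).
  • Import the Burp CA certificate generated by Burp Suite into the target browser (in your browser: Settings --> search for certificates; google it for a more detailed walkthrough).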

That's it!

If you run into any problems setting up Burp Suite, don't forget to google them.

Hope this helps.

Thanks
