How do I navigate a JavaScript pager with read_html_live()?

Question

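I want to scrape the ad links from https://www.supralift.com/uk/itemsearch/results, a site that uses a JavaScript pager. My goal is to collect the links on the current page, click the pager's "Next" button so that the site loads the next page, collect the links again, and so on. Below is the code I wrote, but it does not work: I cannot get the pager to load the next page. How do I use the read_html_live() function correctly? Thank you very much for any suggestions.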

  library(tidyverse)
  library(rvest)
  
  ad_links_all <- tibble() # We will collect here all ad links
  n_of_pages <- 10 # For testing purposes we want to scrape just this number of first pages
  
  page <- read_html_live("https://www.supralift.com/uk/itemsearch/results")
  
  for (i in 1:n_of_pages) {
    
    # Scrape the links
    page_ad_links <- page %>% html_elements(".product-info .h-full a:has(.line-clamp-2)") %>% 
      html_attr("href") %>% str_split(., "\\?searchKey=", n = 2) %>% lapply(., `[[`, 1) %>% unlist() %>% tibble()
    
    # Collect all links here
    ad_links_all <- ad_links_all %>% bind_rows(page_ad_links)
    
    # Move to the next page
    page$click(".pagination-direction-btn:nth-child(1)")
    
    Sys.sleep(.5)
    
  }
#> Error in `private$wait_for_selector()`:
#> ! Failed to find selector ".pagination-direction-btn:nth-child(1)" in 5
#>   seconds.


<sup>Created on 2024-09-30 with [reprex v2.1.1](https://reprex.tidyverse.org)</sup>

Answer 1

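Your problem is probably caused by an incorrect selector for the "Next" button. The selector ".pagination-direction-btn:nth-child(1)" only matches a button that is the first child of its parent, so it may pick the "Previous" button instead of "Next" or, as the error message shows, match nothing at all.

You need to update the selector so that it correctly identifies the "Next" button. Try: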

page$click(".pagination-direction-btn[aria-label='Next']")

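This selector targets the button whose aria-label attribute equals "Next". Whether the site's button actually carries that attribute is an assumption, so confirm it in the browser's developer tools first. The updated code: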

library(tidyverse)
library(rvest)

ad_links_all <- tibble()
n_of_pages <- 10

page <- read_html_live("https://www.supralift.com/uk/itemsearch/results")

for (i in 1:n_of_pages) {
  
  page_ad_links <- page %>% 
    html_elements(".product-info .h-full a:has(.line-clamp-2)") %>% 
    html_attr("href") %>% 
    str_split(., "\\?searchKey=", n = 2) %>% 
    lapply(`[[`, 1) %>% 
    unlist() %>% 
    tibble()
  
  ad_links_all <- bind_rows(ad_links_all, page_ad_links)
  
  page$click(".pagination-direction-btn[aria-label='Next']")
  
  Sys.sleep(0.5)
  
}
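
To see how the two selectors differ, here is a minimal sketch on static HTML built with minimal_html(). The pager markup, including the aria-label values, is hypothetical and only illustrates the selector behaviour; the real site's HTML may look different.

```r
library(rvest)

# Hypothetical pager markup (an assumption, not copied from the real site).
pager <- minimal_html('
  <div class="pager">
    <button class="pagination-direction-btn" aria-label="Previous">Prev</button>
    <button class="pagination-direction-btn" aria-label="Next">Next</button>
  </div>
')

# :nth-child(1) keeps only a button that is the first child of its parent,
# which in this markup is the "Previous" button.
first_child <- pager %>%
  html_elements(".pagination-direction-btn:nth-child(1)") %>%
  html_attr("aria-label")

# The attribute selector picks the button by its label, wherever it sits.
by_label <- pager %>%
  html_elements(".pagination-direction-btn[aria-label='Next']") %>%
  html_attr("aria-label")

print(first_child)
print(by_label)
```

Running the same html_elements() calls against the live page object before clicking is a quick way to check which selector matches anything at all.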

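In addition, you can make sure the page has finished loading after clicking "Next" by adding:

page$wait_for_selector(".product-info")

Put this line after Sys.sleep(0.5) so that the content is loaded before the next round of scraping. Note that the error message in the question mentions private$wait_for_selector(), which suggests this method is private in rvest's live session object and may not be callable as page$wait_for_selector(). If the call fails, repeatedly check html_elements() until the expected elements appear.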
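
If a fixed Sys.sleep() turns out to be unreliable, a polling loop is one alternative. The sketch below is a minimal version in base R; wait_until() is a hypothetical helper name, not part of rvest, and the selector in the commented usage is an assumption about the page structure.

```r
# Poll until `condition()` returns TRUE or `timeout` seconds have passed.
# `wait_until` is a hypothetical helper, not an rvest function.
wait_until <- function(condition, timeout = 5, interval = 0.25) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(condition())) return(TRUE)      # condition met
    if (Sys.time() >= deadline) return(FALSE)  # gave up after the timeout
    Sys.sleep(interval)
  }
}

# Intended use inside the scraping loop (not run here, needs a live session):
# first_before <- page %>% html_elements(".product-info a") %>%
#   html_attr("href") %>% head(1)
# page$click(".pagination-direction-btn[aria-label='Next']")
# wait_until(function() {
#   first_now <- page %>% html_elements(".product-info a") %>%
#     html_attr("href") %>% head(1)
#   length(first_now) == 1 && !identical(first_now, first_before)
# })
```

Comparing the first link before and after the click checks that the list has actually changed, which is stricter than only checking that ".product-info" exists, since that element is already present on the previous page.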
