RSelenium scraping returns odd results

Question

I am trying to use RSelenium to scrape the search pages of some news sources. Here is my code:

library(rvest)
library(RSelenium)

#open the browser
rD <- rsDriver(browser=c("chrome"), chromever="73.0.3683.68")
remDr <- rD[["client"]]

#create a blank space to put the links
urlslist_final = list()

##loop through the page number at the end until done with ~1000 / 20 = 50
for (i in 1:2) { ##change this to 50

  url = paste0('https://www.npr.org/search?query=kavanaugh&page=', i)

  #navigate to it
  remDr$navigate(url)

  #get the links
  webElems <- remDr$findElements(using = "css", "[href]")
  urlslist_final[[i]] = unlist(sapply(webElems, function(x) {x$getElementAttribute("href")}))

  #don't go too fast
  Sys.sleep(runif(1, 1, 5))

} #close the loop

remDr$close()
# stop the selenium server
rD[["server"]]$stop()

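If I set i = 1 and click on the browser window after the page has loaded, I get the expected 166 links, which include the specific search-result links I am trying to scrape: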

> str(urlslist_final)
List of 1
 $ : chr [1:166] "https://media.npr.org/templates/favicon/favicon-180x180.png" "https://media.npr.org/templates/favicon/favicon-96x96.png" "https://media.npr.org/templates/favicon/favicon-32x32.png" "https://media.npr.org/templates/favicon/favicon-16x16.png" ...

However, if I just run my loop, I get only 91 links, and none of them are actual search results:

> str(urlslist_final)
List of 2
$ : chr [1:91] "https://media.npr.org/templates/favicon/favicon-180x180.png" "https://media.npr.org/templates/favicon/favicon-96x96.png" "https://media.npr.org/templates/favicon/favicon-32x32.png" "https://media.npr.org/templates/favicon/favicon-16x16.png" ...

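Can anyone help me understand why the results differ? What could I do differently? I tried using rvest, but I could not find the result links embedded in the page's scripts.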

Answer 1

Thanks to my friend Thom, here is a solution that works well:

#scroll on the page
webscroll <- remDr$findElement("css", "body")
webscroll$sendKeysToElement(list(key = "end"))

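I put this code between navigating to the page and capturing the links. Scrolling makes the site treat the session as an active user, so the search results load and I can scrape their links.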
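For reference, here is a minimal sketch of how the loop from the question might look with the scroll step in place. The fixed two-second pause after scrolling and the `n_pages` variable are my own assumptions, not part of the original answer; the pause may need tuning, or may not be needed at all.

```r
library(RSelenium)

# Start the browser (driver version taken from the question; adjust to your Chrome)
rD <- rsDriver(browser = "chrome", chromever = "73.0.3683.68")
remDr <- rD[["client"]]

n_pages <- 2  # hypothetical name; the question aims for about 50 pages
urlslist_final <- vector("list", n_pages)

for (i in seq_len(n_pages)) {
  url <- paste0("https://www.npr.org/search?query=kavanaugh&page=", i)
  remDr$navigate(url)

  # Scroll to the bottom of the page so the site loads the search results
  webscroll <- remDr$findElement(using = "css", value = "body")
  webscroll$sendKeysToElement(list(key = "end"))

  # Assumption: a short pause gives the results time to render
  Sys.sleep(2)

  # Collect the href attribute of every element that has one
  webElems <- remDr$findElements(using = "css", value = "[href]")
  urlslist_final[[i]] <- unlist(lapply(webElems, function(x) x$getElementAttribute("href")))

  # Don't go too fast
  Sys.sleep(runif(1, 1, 5))
}

remDr$close()
rD[["server"]]$stop()
```

If the link count still comes back short, comparing `length(urlslist_final[[1]])` against the 166 links seen in the manual run is a quick way to check whether the results had loaded before the links were collected.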
