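I'm working on a project where I need to scrape data from a list of web pages (in total, I plan to go through about 1000 pages). Each page is formatted very similarly, so I can write a for loop to process each one the same way. I believe the pages use some kind of JavaScript to generate their content, so this only works with the read_html_live function from the rvest package, not the more widely used read_html.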
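I'm running into problems with the for loop, which I think are related to the "live" aspect of read_html_live. Basically, if I manually run the body of the loop for each individual element, I get the results I want, but when I run the whole loop (i.e. source my R script), it seems to have trouble reading some of the URLs or extracting the elements correctly. It doesn't throw an error, but it doesn't always return the expected result, even though I know the element is there.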
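I suspect the read_html_live function does some processing in the background and the subsequent lines don't "wait" long enough for it to finish. When I add a Sys.sleep(2) line after the read_html_live call, everything works as expected, but that isn't very precise (and probably makes things slower than they need to be).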
I'd like to know whether there is a more precise, efficient, or direct way to solve this. For more context, see my reproducible example below:
require(rvest)
require(tidyverse) # for pipes
# List of sample URLs to iterate over. Each of these pages is formatted very similarly and uses JavaScript to generate certain elements of the page, so read_html_live is needed.
ALL_URLs = c("https://species-registry.canada.ca/index-en.html#/species/19-17",
"https://species-registry.canada.ca/index-en.html#/species/1096-752",
"https://species-registry.canada.ca/index-en.html#/species/1097-758",
"https://species-registry.canada.ca/index-en.html#/species/1349-1011",
"https://species-registry.canada.ca/index-en.html#/species/502-0",
"https://species-registry.canada.ca/index-en.html#/species/818-554",
"https://species-registry.canada.ca/index-en.html#/species/540-61",
"https://species-registry.canada.ca/index-en.html#/species/1004-679",
"https://species-registry.canada.ca/index-en.html#/species/963-646",
"https://species-registry.canada.ca/index-en.html#/species/332-0")
for (url in ALL_URLs) {
  # Read URL
  this_page = read_html_live(url)
  # Sys.sleep(2) # If I comment this out, the output is not as desired, but when included, it works as intended
  # Get relevant HTML element from each page
  this_page %>% html_elements("#cosewic_assessment+ div .sar-definition:nth-child(1)") %>% html_text %>%
    # Do some formatting of text for it to be more legible and consistent
    str_split_i("\n", 4) %>% trimws %>%
    # Print out
    message
}
Desired output (which is what I get with the Sys.sleep command):
Acadian Flycatcher
Acadian Redfish, Atlantic population
Acadian Redfish, Bonne Bay population
Acuteleaf Small Limestone Moss
Alaskan Brook Lamprey
Alkaline Wing-nerved Moss
Allegheny Mountain Dusky Salamander
Allegheny Mountain Dusky Salamander, Appalachian population
Allegheny Mountain Dusky Salamander, Carolinian population
American Badger
Typical output without Sys.sleep (not always the same, which suggests some randomness in the evaluation timing):
Acadian Redfish, Atlantic population
Acadian Redfish, Bonne Bay population
Acuteleaf Small Limestone Moss
Alkaline Wing-nerved Moss
The website "species-registry.canada.ca" retrieves its species information from a Web API hosted at "ecprccsarsrch.search.windows.net". We can interact with that API directly, so we don't need read_html_live() or anything similar.
The API uses an API key and a referrer policy to restrict direct access. Below I demonstrate how to use the httr2 package to create a request that the API accepts.
library(tidyverse)
library(rvest)
library(httr2)
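This function takes an id (the last part of the URL path in the ALL_URLs object) and an API key. You can retrieve the API key by visiting the website in your browser and reading it from the Network tab of the Inspector.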
create_perform_request <- function(id, key) {
  request("https://ecprccsarsrch.search.windows.net/indexes/sararegistry-species-indexen-v1/docs/search?api-version=2020-06-30") |>
    req_headers(
      Origin = "https://species-registry.canada.ca",
      Referer = "https://species-registry.canada.ca",
      "api-key" = key
    ) |>
    req_body_json(list(
      filter = paste0("id eq '", id, "'"),
      select = "id,legalStatusId,cosewicStatusId,legalCommonName,cosewicCommonName,cosewicScientificName,legalScientificName,legalPopulation,cosewicPopulation,taxonomyId,emergencyAssessment,cosewicLastAssessmentDate,imageFileName,photoCopyright,legalDateListing,legalScheduleId,underConsiderationId,lastGicDecisionDate,currentGicDecisionId,ranges,cosewicReasonDes,cosewicStatusCriteria,cosewicStatusHis,legalStatusHistory,prevCosewicCommonName,prevCosewicScientificName,ministerReceiptDate,statusChange,azSpecies,hasRelatedSpeciesOnScheduleOne,recProgress,profiles,teams,related"
    )) |>
    req_perform() |>
    resp_body_json()
}
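Since the question mentions roughly 1000 pages, it may help to separate building the request from performing it, so that the request can be inspected without touching the network and a retry policy can be attached. The sketch below is only a variant of the function above under some assumptions: build_species_request is a hypothetical name, the select list is shortened to two fields for brevity, and max_tries = 3 is an arbitrary choice.

```r
library(httr2)

# Hypothetical helper: builds (but does not send) the search request for one id.
build_species_request <- function(id, key) {
  request("https://ecprccsarsrch.search.windows.net/indexes/sararegistry-species-indexen-v1/docs/search?api-version=2020-06-30") |>
    req_headers(
      Origin = "https://species-registry.canada.ca",
      Referer = "https://species-registry.canada.ca",
      "api-key" = key
    ) |>
    req_body_json(list(
      filter = paste0("id eq '", id, "'"),
      # Shortened field list, for illustration only
      select = "id,cosewicCommonName"
    )) |>
    # Assumption: retry transient failures up to 3 times in total
    req_retry(max_tries = 3)
}

# "YOUR-API-KEY" is a placeholder, not a working key
req <- build_species_request("1096-752", "YOUR-API-KEY")
req  # prints a summary of the request without sending it
```

To actually send it, pipe the result into req_perform() |> resp_body_json(), as in the function above.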
This is the API key I used; it may not work for you. Please use your own API key instead.
api_key <- "3A1E8E87503C069448999238ABD05EE9"
resp <- create_perform_request("1096-752", api_key)
resp$value[[1]]$cosewicCommonName
#> [1] "Acadian Redfish"
Extract the IDs from your ALL_URLs object.
ids <- ALL_URLs |>
  map(url_parse) |>
  map(\(x) x$fragment) |>
  map(str_remove, "/.*/")
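To make the last step concrete: the fragment of each URL looks like /species/19-17, and the greedy pattern "/.*/" removes everything up to and including the last slash. The sketch below shows this on a single fragment, followed by a one-step variant (my own suggestion, not part of the pipeline above) that works on the raw URLs and returns a character vector rather than a list.

```r
library(stringr)

# What the str_remove() step does to one fragment
fragment <- "/species/19-17"
str_remove(fragment, "/.*/")
#> [1] "19-17"

# Alternative sketch: take everything after the last "/" of the full URL
urls <- c(
  "https://species-registry.canada.ca/index-en.html#/species/19-17",
  "https://species-registry.canada.ca/index-en.html#/species/1096-752"
)
ids_chr <- str_extract(urls, "[^/]+$")
ids_chr
#> [1] "19-17"    "1096-752"
```

Either form can be passed to map() in the next step, since map() iterates over both lists and character vectors.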
Run the request for all IDs.
responses <- map(ids, create_perform_request, api_key)
map_chr(responses, \(x) x$value[[1]]$cosewicCommonName)
#> [1] "Acadian Flycatcher" "Acadian Redfish"
#> [3] "Acadian Redfish" "Acuteleaf Small Limestone Moss"
#> [5] "Alaskan Brook Lamprey" "Alkaline Wing-nerved Moss"
#> [7] "Allegheny Mountain Dusky Salamander" "Allegheny Mountain Dusky Salamander"
#> [9] "Allegheny Mountain Dusky Salamander" "American Badger"
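With around 1000 ids, some responses might come back with an empty value list (for example, if an id matches nothing), in which case x$value[[1]] would fail with a subscript error. A possible, more defensive extraction uses purrr::pluck() with a default. The mock_responses list below is made-up data that only mimics the response shape shown above.

```r
library(purrr)

# Made-up responses: one with a match, one with an empty `value` list
mock_responses <- list(
  list(value = list(list(cosewicCommonName = "Acadian Flycatcher"))),
  list(value = list())
)

# pluck() returns the default instead of erroring when a level is missing
common_names <- map_chr(
  mock_responses,
  \(x) pluck(x, "value", 1, "cosewicCommonName", .default = NA_character_)
)
common_names
```

Here the second element comes out as NA rather than stopping the whole run, which makes it easier to spot afterwards which ids need a second look.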