Scraping data from multiple pages in R using a for loop (rvest package)

Question

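I'm working on a project where I need to scrape some data from a list of webpages (in total, I plan to go through about 1000 pages). Each webpage is formatted very similarly, so I can write a for loop to process each one the same way. I believe the pages use some kind of JavaScript to generate their content, so this only works with the read_html_live function from the rvest package, rather than the more widely used read_html.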

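I'm running into some problems with the for loop, which I think are related to the "live" aspect of read_html_live. Basically, if I run the body of the loop manually for each individual element, I get the result I want, but when I run the whole loop (i.e., source my R script), it seems to have trouble reading some of the URLs or extracting the elements correctly. It doesn't raise an error, but it doesn't always return the expected result, even though I know the element is there.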

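I wonder whether read_html_live does some processing in the background, and the subsequent lines don't "wait" long enough for it to finish. When I add a Sys.sleep(2) call after the read_html_live line, everything works as expected, but a fixed pause isn't very precise (and probably makes things slower than they need to be).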

Is there a more precise, more efficient, or more direct way to solve this? For more context, see my reproducible example below:

library(rvest)
library(tidyverse) # for pipes and str_split_i

# List of sample URLs to iterate over. Each of these pages is formatted very similarly and uses JavaScript to generate certain elements of the page, so read_html_live is needed.
ALL_URLs = c("https://species-registry.canada.ca/index-en.html#/species/19-17",
             "https://species-registry.canada.ca/index-en.html#/species/1096-752",
             "https://species-registry.canada.ca/index-en.html#/species/1097-758",
             "https://species-registry.canada.ca/index-en.html#/species/1349-1011",
             "https://species-registry.canada.ca/index-en.html#/species/502-0",
             "https://species-registry.canada.ca/index-en.html#/species/818-554",
             "https://species-registry.canada.ca/index-en.html#/species/540-61",
             "https://species-registry.canada.ca/index-en.html#/species/1004-679",
             "https://species-registry.canada.ca/index-en.html#/species/963-646",
             "https://species-registry.canada.ca/index-en.html#/species/332-0")

for (url in ALL_URLs) {
  # Read URL
  this_page = read_html_live(url)
  # Sys.sleep(2) # If I comment this out, the output is not as desired, but when included, it works as intended
  # Get relevant HTML element from each page
  this_page %>% html_elements("#cosewic_assessment+ div .sar-definition:nth-child(1)") %>% html_text %>% 
    # Do some formatting of text for it to be more legible and consistent
    str_split_i("\n", 4) %>% trimws %>% 
    # Print out
    message
}

Desired output (which I get when the Sys.sleep command is included):

Acadian Flycatcher
Acadian Redfish, Atlantic population
Acadian Redfish, Bonne Bay population
Acuteleaf Small Limestone Moss
Alaskan Brook Lamprey
Alkaline Wing-nerved Moss
Allegheny Mountain Dusky Salamander
Allegheny Mountain Dusky Salamander, Appalachian population
Allegheny Mountain Dusky Salamander, Carolinian population
American Badger

Typical output without Sys.sleep (it is not always the same, which suggests some randomness in the evaluation timing):


Acadian Redfish, Atlantic population
Acadian Redfish, Bonne Bay population
Acuteleaf Small Limestone Moss

Alkaline Wing-nerved Moss

Answer 1

The website "species-registry.canada.ca" retrieves its species information from a web API hosted at "ecprccsarsrch.search.windows.net". We can interact with that API directly, so we don't need read_html_live() or anything similar.

The API uses an API key and a referrer policy to restrict direct access. Below I demonstrate how to use the httr2 package to create a request that the API accepts.

library(tidyverse)
library(rvest)
library(httr2)

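This function takes an id (the last part of the URL path in your ALL_URLs object) and an API key. You can retrieve the API key by visiting the website in your browser and reading it from the Network tab of the Inspector.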

create_perform_request <- function(id, key) {
  request("https://ecprccsarsrch.search.windows.net/indexes/sararegistry-species-indexen-v1/docs/search?api-version=2020-06-30") |>
    req_headers(
      Origin = "https://species-registry.canada.ca",
      Referer = "https://species-registry.canada.ca",
      "api-key" = key
    ) |>
    req_body_json(list(
      filter = paste0("id eq '", id, "'"),
      select = "id,legalStatusId,cosewicStatusId,legalCommonName,cosewicCommonName,cosewicScientificName,legalScientificName,legalPopulation,cosewicPopulation,taxonomyId,emergencyAssessment,cosewicLastAssessmentDate,imageFileName,photoCopyright,legalDateListing,legalScheduleId,underConsiderationId,lastGicDecisionDate,currentGicDecisionId,ranges,cosewicReasonDes,cosewicStatusCriteria,cosewicStatusHis,legalStatusHistory,prevCosewicCommonName,prevCosewicScientificName,ministerReceiptDate,statusChange,azSpecies,hasRelatedSpeciesOnScheduleOne,recProgress,profiles,teams,related"
    )) |>
    req_perform() |>
    resp_body_json()
}

This is the API key I used; it may not work for you. Use your own API key instead.

api_key <- "3A1E8E87503C069448999238ABD05EE9"
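As an alternative to hard-coding the key in the script, it could be read from an environment variable. This is only a minimal sketch: the variable name SAR_API_KEY and the helper get_api_key are hypothetical and are not part of the API or of httr2.

```r
# Hypothetical helper: read the API key from an environment variable.
# "SAR_API_KEY" is an assumed name; set it in ~/.Renviron or with
# Sys.setenv() before running the script.
get_api_key <- function(var = "SAR_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set.")
  }
  key
}

# Example: set the variable for this session, then read it back.
Sys.setenv(SAR_API_KEY = "your-own-key")
api_key_from_env <- get_api_key()
```

The resulting value can be passed as the key argument in the same way as api_key.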

We can perform a single request like this:

resp <-
  create_perform_request("1096-752", api_key)

resp$value[[1]]$cosewicCommonName
#> [1] "Acadian Redfish"
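The question's desired output also includes the population (for example "Acadian Redfish, Atlantic population"), which cosewicCommonName alone does not carry. Below is a minimal sketch of combining the two. It assumes that the cosewicPopulation field requested in select holds that text and is missing or empty when a species has no separate population. format_species_name is a hypothetical helper, and the records are mock data rather than real API responses.

```r
# Hypothetical helper: join the common name and the population, if any.
# Assumes `record` is one element of `resp$value` and that
# `cosewicPopulation` is NULL, NA, or "" when no population applies.
format_species_name <- function(record) {
  name <- record$cosewicCommonName
  population <- record$cosewicPopulation
  if (is.null(population) || is.na(population) || !nzchar(population)) {
    return(name)
  }
  paste(name, population, sep = ", ")
}

# Mock records standing in for real API responses.
with_population <- list(
  cosewicCommonName = "Acadian Redfish",
  cosewicPopulation = "Atlantic population"
)
without_population <- list(cosewicCommonName = "American Badger")

format_species_name(with_population)
#> [1] "Acadian Redfish, Atlantic population"
format_species_name(without_population)
#> [1] "American Badger"
```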

Processing multiple IDs at once

Extract the IDs from your ALL_URLs object.

ids <- ALL_URLs |>
  map(url_parse) |>
  map(\(x) x$fragment) |>
  map(str_remove, "/.*/")
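The pipeline above returns a list of IDs. If a plain character vector is more convenient, the same IDs could be extracted with a single vectorised base R call. This is a sketch that assumes every URL ends in the ID, as the sample URLs do; extract_id is a hypothetical helper name.

```r
# Strip everything up to and including the last "/" in each URL.
# sub() is vectorised, so this returns a character vector, not a list.
extract_id <- function(urls) sub(".*/", "", urls)

sample_urls <- c(
  "https://species-registry.canada.ca/index-en.html#/species/19-17",
  "https://species-registry.canada.ca/index-en.html#/species/1096-752"
)

extract_id(sample_urls)
#> [1] "19-17"    "1096-752"
```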

Run the request for every ID.

responses <- map(ids, create_perform_request, api_key)
map_chr(responses, \(x) x$value[[1]]$cosewicCommonName)
#>  [1] "Acadian Flycatcher"                  "Acadian Redfish"                    
#>  [3] "Acadian Redfish"                     "Acuteleaf Small Limestone Moss"     
#>  [5] "Alaskan Brook Lamprey"               "Alkaline Wing-nerved Moss"          
#>  [7] "Allegheny Mountain Dusky Salamander" "Allegheny Mountain Dusky Salamander"
#>  [9] "Allegheny Mountain Dusky Salamander" "American Badger"
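For sites where no such API is available and read_html_live is still needed, the fixed Sys.sleep(2) from the question could be replaced by polling until the target element appears. This is a rough sketch rather than a feature of rvest: wait_until is a hypothetical helper, and the commented usage assumes that html_elements works on the live page object as it does in the question's loop.

```r
# Hypothetical helper: call `check()` repeatedly until it returns TRUE
# or `timeout` seconds have elapsed.
# Returns TRUE on success and FALSE on timeout.
wait_until <- function(check, timeout = 10, interval = 0.1) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(check())) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Intended use inside the loop from the question (not run here, because it
# needs a browser session; `selector` stands for the question's CSS selector):
# this_page <- read_html_live(url)
# ready <- wait_until(\() length(html_elements(this_page, selector)) > 0)
# if (!ready) warning("Timed out waiting for ", url)

# Self-contained demonstration: the condition becomes TRUE on the third call.
calls <- 0
became_ready <- wait_until(
  function() {
    calls <<- calls + 1
    calls >= 3
  },
  timeout = 5,
  interval = 0.01
)
```

Polling returns as soon as the element exists, so fast pages are not slowed down, and the timeout keeps a page that never renders from hanging the loop. An element can exist before its text is filled in, so the check may need to test the extracted text rather than only the element count.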
