Rvest unexpectedly stopped working - scraping tables


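After I had been using the following script for a while, it suddenly stopped working. I built a simple function that finds a table in a web page based on an XPath.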

library(rvest)

url <- c('http://finanzalocale.interno.it/apps/floc.php/certificati/index/codice_ente/1010020010/cod/4/anno/1999/md/0/cod_modello/CCOU/tipo_modello/U/cod_quadro/08'

find_table <- function(x){read_html(x) %>%
                          html_nodes(xpath = '//*[@id="center"]/table[2]') %>%
                          html_table() %>%
                          as.data.frame()}

table <- find_table(url)

I also tried calling httr::GET before read_html, passing the following argument:

query = list(r_date = "2017-12-22")

But nothing changed. Any ideas?

r rvest
1 answer

Well, that code doesn't work because you're missing a closing ) on the url <- line.

We'll also add httr:

library(httr)
library(rvest)

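url is the name of a base R function. Using base function names as variable names makes problems in your code harder to debug. Unless you write perfect code, it's a good idea not to use names that way.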

URL <- c('http://finanzalocale.interno.it/apps/floc.php/certificati/index/codice_ente/1010020010/cod/4/anno/1999/md/0/cod_modello/CCOU/tipo_modello/U/cod_quadro/08')
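To illustrate why reusing a base name can hide mistakes, here is a minimal sketch. The variable name `unassigned` is made up for the illustration; it stands in for what the name `url` resolves to when the assignment never ran.

```r
# `url` and `table` are both functions that ship with base R.
is.function(base::url)
is.function(base::table)

# If the `url <- ...` assignment never runs (for example because of a
# syntax error on that line), the name `url` still resolves to the base
# function instead of failing with "object not found".
unassigned <- base::url  # hypothetical stand-in for the unassigned name

# Passing a function where a string is expected fails later, with a
# message that does not point back at the missing assignment.
msg <- tryCatch(
  nchar(unassigned),
  error = function(e) conditionMessage(e)
)
print(msg)
```

With a name such as `URL` that base R does not define, the same mistake should surface immediately as an "object not found" error.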

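I don't know whether you're familiar with the "rules" of web scraping, but if you make repeated requests to this site you should use a crawl delay. Their robots.txt doesn't set one, so 5 seconds is an acceptable choice. I point this out because you may be getting rate-limited.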
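As a rough sketch of how one might check for a declared delay, the helper below scans the lines of a robots.txt file for a Crawl-delay directive and falls back to 5 seconds. The function name `get_crawl_delay` and the sample robots.txt contents are made up for illustration, and the helper deliberately ignores per-User-agent grouping.

```r
# Hypothetical helper: return the first Crawl-delay value found in the
# lines of a robots.txt file, or `default` when none is declared.
get_crawl_delay <- function(robots_lines, default = 5) {
  hits <- grep("^[[:space:]]*crawl-delay[[:space:]]*:", robots_lines,
               ignore.case = TRUE, value = TRUE)
  if (length(hits) == 0) return(default)

  # Keep everything after the first colon and try to read it as a number.
  value <- suppressWarnings(as.numeric(trimws(sub("^[^:]*:", "", hits[1]))))
  if (is.na(value)) default else value
}

# Made-up robots.txt contents, not taken from the site in question.
with_delay    <- c("User-agent: *", "Crawl-delay: 10", "Disallow: /private/")
without_delay <- c("User-agent: *", "Disallow:")

get_crawl_delay(with_delay)     # 10
get_crawl_delay(without_delay)  # 5
```

In real use the lines could come from something like `readLines()` on the site's robots.txt URL; that step is left out here so the sketch runs offline.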

find_table <- function(x, crawl_delay=5) { 

  Sys.sleep(crawl_delay) # you can put this in a loop vs here if you aren't often doing repeat gets

  # switch to httr::GET so you can get web server interaction info.
  # since you're scraping, it's expected that you use a custom user agent
  # that also supplies contact info.

  res <- GET(x, user_agent("My scraper"))

  # check to see if there's a non HTTP 200 response which there may be
  # if you're getting rate-limited

  stop_for_status(res) 

  # now, try to do the parsing. It looks like you're trying to target a
  # single table, so I switched from `html_nodes()` to `html_node()`: the
  # former returns a list of nodes, and the pipe will error out if there's
  # more than one list element.

  content(res, "parsed") %>% 
    html_node(xpath = '//*[@id="center"]/table[2]') %>%
    html_table() %>%
    as.data.frame()

}
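The difference between `html_nodes()` and `html_node()` mentioned in the comments can be seen on a small inline document. This is only a sketch: the HTML string is invented for the example and merely mimics the `id="center"` structure targeted by the XPath.

```r
library(rvest)

# Invented two-table page, parsed from a literal string (no network needed).
page <- read_html(paste0(
  "<div id='center'>",
  "<table><tr><th>a</th></tr><tr><td>1</td></tr></table>",
  "<table><tr><th>b</th></tr><tr><td>2</td></tr></table>",
  "</div>"
))

# html_nodes() returns every match as a node set (list-like).
many <- html_nodes(page, xpath = "//*[@id='center']/table")
length(many)

# html_node() returns a single node, so html_table() yields one table.
one <- html_node(page, xpath = "//*[@id='center']/table[2]")
second_table <- html_table(one)
names(second_table)
```

Passing the node set `many` to `html_table()` would instead give a list of tables, which is why a single-node selection is the safer fit before `as.data.frame()`.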

table is also a base function name (see above):

result <- find_table(URL)

Works fine for me:

str(result)
## 'data.frame':  11 obs. of  5 variables:
##  $ ENTI EROGATORI                          : chr  "Cassa DD.PP." "Istituti di previdenza amministrati dal Tesoro" "Istituto per il credito sportivo" "Aziende di credito" ...
##  $                                         : logi  NA NA NA NA NA NA ...
##  $ ACCENSIONE ACCERTAMENTI                 : chr  "4.638.500,83" "0,00" "0,00" "953.898,47" ...
##  $ ACCENSIONE RISCOSSIONI C|COMP. + RESIDUI: chr  "2.177.330,12" "0,00" "129.114,22" "848.935,84" ...
##  $ RIMBORSO IMPEGNI                        : chr  "438.696,57" "975,07" "45.584,55" "182.897,01" ...
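The amount columns come back as character strings in Italian number format (dot as thousands separator, comma as decimal mark). If numeric values are needed, a conversion along these lines may help; `parse_it_number` is a hypothetical helper name, and the sample values are copied from the output above.

```r
# Hypothetical helper: convert strings such as "4.638.500,83" to numeric
# by dropping the thousands dots and turning the decimal comma into a dot.
parse_it_number <- function(x) {
  without_thousands <- gsub(".", "", x, fixed = TRUE)
  as.numeric(gsub(",", ".", without_thousands, fixed = TRUE))
}

amounts <- parse_it_number(c("4.638.500,83", "0,00", "953.898,47"))
print(amounts)
```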