Summary: handling errors and bad pages with tryCatch and R's read_html function.
We are using R's read_html function to connect to a number of NCAA sports websites and need to identify when a page errors out. Here are some example URLs of bad pages:
- www.newburynighthawks.com (does not exist)
- http://www.clarkepride.com/sports/womens-basketball/roster/2020-21 (404 not found)
- https://lyon.edu/sports/lyon_sports.html/sports/mens-basketball/roster/2018-19 (not found)
- www.lambuth.edu/athletics/index.html (does not exist)
- https://uvi.edu/pub-relations/athletics/athletics.htm/sports/womens-basketball/roster/2018-19 (page not found)
Each of these URLs fails in its own way when passed to read_html. To handle them, I wrote a function that uses tryCatch to check whether these pages are valid:
library(httr)
library(rvest)

check_url_validity <- function(this_url) {
  good_url <- FALSE
  # titles that indicate the request landed on an error page
  bad_page_titles <- c('Page Not Found', 'Page not found', '404')
  result <- tryCatch({
    # go to url to check for a rosters page, then pull out the title and body text
    team_page <- this_url %>% GET(., timeout(2)) %>% read_html
    team_page_title <- team_page %>% html_nodes('title') %>% html_text
    team_page_body <- team_page %>% html_nodes('body') %>% html_text
    # the page is "good" only if none of the error phrases appear
    good_page <- !grepl('Page not found', team_page_title) &&
      !grepl('Page Not Found', team_page_title) &&
      !grepl('404', team_page_title) &&
      team_page_title != "" &&
      !grepl('Error 404', team_page_body)
    if (good_page) { good_url <- TRUE }
  }, error = function(e) { NA })  # connection failures leave good_url as FALSE
  return(good_url)
}
Testing this function on the URLs listed above gives the following:
these_urls <- c(
  'www.newburynighthawks.com',
  'http://www.clarkepride.com/sports/womens-basketball/roster/2020-21',
  'https://lyon.edu/sports/lyon_sports.html/sports/mens-basketball/roster/2018-19',
  'www.lambuth.edu/athletics/index.html',
  'https://uvi.edu/pub-relations/athletics/athletics.htm/sports/womens-basketball/roster/2018-19'
)

for (this_url in these_urls) {
  print(check_url_validity(this_url))
}
Some of these pages (http://www.newburynighthawks.com/) are easy to flag as bad inside the tryCatch, because there is no page at all. Others (http://www.clarkepride.com/sports/womens-basketball/roster/2020-21) rely on string matching against the body to detect that the page is broken. The overall problem is that this is a hacky solution: we are dealing with ~1000 different URLs, and we keep appending conditions to the line of code that decides whether good_page is TRUE or FALSE. We are currently up to 5 conditions, most of which use grepl to string-match phrases such as 404 and Not Found in the title and body.
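For reference, the checks currently boil down to something like the single pattern below; this only condenses the same string matching into one grepl call per field rather than solving the underlying problem (the pattern is illustrative, not exhaustive):

# Condensed sketch of the same title/body checks: one pattern instead of five conditions
bad_pattern <- 'Page [Nn]ot [Ff]ound|404'
good_page <- !grepl(bad_pattern, team_page_title) &&
  team_page_title != "" &&
  !grepl(bad_pattern, team_page_body)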
Is there a better solution than string matching on 404 and Not Found in the body to know that these pages are not good pages?
Instead of trying to read the page contents, the code below uses the httr package to issue a HEAD request. This is faster and returns all the information needed.
library(httr)

check_url_validity <- function(this_url){
  # request only the headers; a failed connection becomes an error object
  r <- tryCatch(httr::HEAD(this_url),
                error = function(e) e
  )
  if(inherits(r, "error")){
    "does not exist"
    # or: conditionMessage(r) to return the actual error message
  } else {
    # the human-readable status reason, e.g. "Not Found" or "OK"
    httr::http_status(r)$reason
  }
}
lapply(urls_vec, check_url_validity)
#[[1]]
#[1] "does not exist"
#
#[[2]]
#[1] "Not Found"
#
#[[3]]
#[1] "Not Found"
#
#[[4]]
#[1] "does not exist"
#
#[[5]]
#[1] "OK"
To return NA/FALSE/TRUE instead, the function below follows the same pattern.
check_url_validity2 <- function(this_url){
  r <- tryCatch(httr::HEAD(this_url),
                error = function(e) e
  )
  if(inherits(r, "error")){
    NA
  } else {
    httr::status_code(r) < 300
  }
}
lapply(urls_vec, check_url_validity2)
#[[1]]
#[1] NA
#
#[[2]]
#[1] FALSE
#
#[[3]]
#[1] FALSE
#
#[[4]]
#[1] NA
#
#[[5]]
#[1] TRUE
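As a usage sketch, check_url_validity2 can then be used to filter the URL vector before any read_html call; treating NA (no connection at all) as bad is an assumption here:

library(rvest)

# Scrape only the URLs that respond with a status code below 300
is_good <- vapply(urls_vec, check_url_validity2, logical(1))
good_urls <- urls_vec[!is.na(is_good) & is_good]
pages <- lapply(good_urls, read_html)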
Data
urls_vec <- c(
  "www.newburynighthawks.com",
  "http://www.clarkepride.com/sports/womens-basketball/roster/2020-21",
  "https://lyon.edu/sports/lyon_sports.html/sports/mens-basketball/roster/2018-19",
  "www.lambuth.edu/athletics/index.html",
  "https://uvi.edu/pub-relations/athletics/athletics.htm/sports/womens-basketball/roster/2018-19"
)