Summary: handling errors and bad pages with tryCatch and R's read_html function.
We are using R's read_html function to connect to a number of NCAA sports websites and need to identify when a page errors out. Here are some example URLs of bad pages:
- www.newburynighthawks.com (does not exist)
- http://www.clarkepride.com/sports/womens-basketball/roster/2020-21 (404 not found)
- https://lyon.edu/sports/lyon_sports.html/sports/mens-basketball/roster/2018-19 (not found)
- www.lambuth.edu/athletics/index.html (does not exist)
- https://uvi.edu/pub-relations/athletics/athletics.htm/sports/womens-basketball/roster/2018-19 (page not found)
Each of these URLs fails in its own way when passed to read_html. To handle them, I wrote a function that uses tryCatch to check whether these pages are valid:
library(httr)
library(rvest)

check_url_validity <- function(this_url) {
  good_url <- FALSE
  # titles that indicate the request landed on an error page
  bad_page_titles <- c('Page Not Found', 'Page not found', '404')
  result <- tryCatch({
    # go to url to check for a rosters page, then pull out the title and body text
    team_page <- this_url %>% GET(., timeout(2)) %>% read_html
    team_page_title <- team_page %>% html_nodes('title') %>% html_text
    team_page_body <- team_page %>% html_nodes('body') %>% html_text
    # the page is "good" only if none of the error phrases appear
    good_page <- !grepl('Page not found', team_page_title) &&
      !grepl('Page Not Found', team_page_title) &&
      !grepl('404', team_page_title) &&
      team_page_title != "" &&
      !grepl('Error 404', team_page_body)
    if (good_page) { good_url <- TRUE }
  }, error = function(e) { NA })  # connection failures leave good_url as FALSE
  return(good_url)
}
Testing this function on the URLs listed above gives the following:
these_urls <- c(
  'www.newburynighthawks.com',
  'http://www.clarkepride.com/sports/womens-basketball/roster/2020-21',
  'https://lyon.edu/sports/lyon_sports.html/sports/mens-basketball/roster/2018-19',
  'www.lambuth.edu/athletics/index.html',
  'https://uvi.edu/pub-relations/athletics/athletics.htm/sports/womens-basketball/roster/2018-19'
)

for (this_url in these_urls) {
  print(check_url_validity(this_url))
}
Some of these pages (http://www.newburynighthawks.com/) are easy to flag as bad inside the tryCatch, because there is no page at all. Others (http://www.clarkepride.com/sports/womens-basketball/roster/2020-21) rely on string matching against the body to detect that the page is broken. The overall problem is that this is a hacky solution: we are dealing with ~1000 different URLs, and we keep appending conditions to the line of code that decides whether good_page is TRUE or FALSE. We are currently up to 5 conditions, most of which use grepl to string-match phrases such as 404 and Not Found in the title and body.
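For reference, the checks currently boil down to something like the single pattern below; this only condenses the same string matching into one grepl call per field rather than solving the underlying problem (the pattern is illustrative, not exhaustive):

# Condensed sketch of the same title/body checks: one pattern instead of five conditions
bad_pattern <- 'Page [Nn]ot [Ff]ound|404'
good_page <- !grepl(bad_pattern, team_page_title) &&
  team_page_title != "" &&
  !grepl(bad_pattern, team_page_body)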
Is there a better solution than string matching on 404 and Not Found in the body to know that these pages are not good pages?
Instead of trying to read the page contents, the code below uses the httr package to issue a HEAD request. This is faster and returns all the information needed.
library(httr)

check_url_validity <- function(this_url){
  # request only the headers; a failed connection becomes an error object
  r <- tryCatch(httr::HEAD(this_url),
                error = function(e) e
  )
  if(inherits(r, "error")){
    "does not exist"
    # or: conditionMessage(r) to return the actual error message
  } else {
    # the human-readable status reason, e.g. "Not Found" or "OK"
    httr::http_status(r)$reason
  }
}
lapply(urls_vec, check_url_validity)
#[[1]]
#[1] "does not exist"
#
#[[2]]
#[1] "Not Found"
#
#[[3]]
#[1] "Not Found"
#
#[[4]]
#[1] "does not exist"
#
#[[5]]
#[1] "OK"
To return NA/FALSE/TRUE instead, the function below follows the same pattern.
check_url_validity2 <- function(this_url){
  r <- tryCatch(httr::HEAD(this_url),
                error = function(e) e
  )
  if(inherits(r, "error")){
    NA
  } else {
    httr::status_code(r) < 300
  }
}
lapply(urls_vec, check_url_validity2)
#[[1]]
#[1] NA
#
#[[2]]
#[1] FALSE
#
#[[3]]
#[1] FALSE
#
#[[4]]
#[1] NA
#
#[[5]]
#[1] TRUE
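As a usage sketch, check_url_validity2 can then be used to filter the URL vector before any read_html call; treating NA (no connection at all) as bad is an assumption here:

library(rvest)

# Scrape only the URLs that respond with a status code below 300
is_good <- vapply(urls_vec, check_url_validity2, logical(1))
good_urls <- urls_vec[!is.na(is_good) & is_good]
pages <- lapply(good_urls, read_html)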
Data
urls_vec <- c(
  "www.newburynighthawks.com",
  "http://www.clarkepride.com/sports/womens-basketball/roster/2020-21",
  "https://lyon.edu/sports/lyon_sports.html/sports/mens-basketball/roster/2018-19",
  "www.lambuth.edu/athletics/index.html",
  "https://uvi.edu/pub-relations/athletics/athletics.htm/sports/womens-basketball/roster/2018-19"
)