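I am trying to scrape data from a web page that contains a table and links. I can successfully download the table along with the link text under "score". However, I want to capture the complete href URL rather than the shortened one.
I presume rvest shortens the URL. I do not know how to get the full URL, which I could then loop over, as shown below, to get the required data and convert it all into a data frame.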
library(rvest)
# Load the page
odi_score_url <- read_html('http://stats.espncricinfo.com/ci/engine/records/team/match_results.html?class=2;id=2019;type=year')
urls <- odi_score_url %>%
  html_nodes('td:nth-child(7) .data-link') %>%
  html_attr("href")
links <- odi_score_url %>%
  html_nodes('td:nth-child(7) .data-link') %>%
  html_text()
# Combine `links` and `urls` into a data.frame
score_df <- data.frame(links = links, urls = urls, stringsAsFactors = FALSE)
head(score_df)
       links                          urls
1 ODI # 4074 /ci/engine/match/1153840.html
2 ODI # 4075 /ci/engine/match/1153841.html
3 ODI # 4076 /ci/engine/match/1153842.html
4 ODI # 4077 /ci/engine/match/1144997.html
5 ODI # 4078 /ci/engine/match/1144998.html
6 ODI # 4079 /ci/engine/match/1144999.html
Loop over the URLs in score_df and get the required data:
for(i in score_df) {
  text <- read_html(score_df$urls[i]) %>% # load the page
    html_nodes(".match-detail--item:nth-child(3) span , .match-detail--item:nth-child(3) h4 ,
      .stadium-details+ .match-detail--item span , .stadium-details ,
      .stadium-details+ .match-detail--item h4 , .cscore_score , .cscore_name--long") %>% # isolate the text
    html_text() # get the text
  ## Create the dataframe
}
Any help is much appreciated!
Thanks in advance.
The URLs are relative to the main page. So you can get the full URL by adding http://stats.espncricinfo.com/ to the beginning of each link. For example:
urls <- odi_score_url %>%
  html_nodes('td:nth-child(7) .data-link') %>%
  html_attr("href") %>%
  paste0("http://stats.espncricinfo.com/", .)
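As a possible alternative, here is a minimal sketch that assumes the xml2 package (which rvest builds on) is installed. `xml2::url_absolute()` resolves relative hrefs against a base URL, which avoids the doubled slash that `paste0()` produces when the href already starts with `/`. The names `base_url`, `rel_urls` and `full_urls` are only illustrative.

```r
library(xml2)

# Base URL of the results page (query string omitted; only the host matters
# here because the hrefs start with "/")
base_url <- "http://stats.espncricinfo.com/ci/engine/records/team/match_results.html"

# Two relative hrefs copied from the score_df output in the question
rel_urls <- c("/ci/engine/match/1153840.html",
              "/ci/engine/match/1153841.html")

# Resolve each relative href against the base URL
full_urls <- url_absolute(rel_urls, base_url)
print(full_urls)
```

In the real pipeline the same call could replace the `paste0()` step, for example `html_attr("href") %>% url_absolute(base_url)`.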
After rebuilding score_df with these full URLs, you can write the loop as:
text_list <- list()
for(i in seq_along(score_df$urls)) {
  text_list[[i]] <- read_html(score_df$urls[i]) %>% # load the page
    html_nodes(".match-detail--item:nth-child(3) span , .match-detail--item:nth-child(3) h4 ,
      .stadium-details+ .match-detail--item span , .stadium-details ,
      .stadium-details+ .match-detail--item h4 , .cscore_score , .cscore_name--long") %>% # isolate the text
    html_text() # get the text
  # give some nice status
  cat("Scraping link", i, "\n")
}
Or, even better, as an apply loop:
text_list <- lapply(score_df$urls, function(x) {
  # print the status first so that cat() is not the function's return value
  cat("Scraping link", x, "\n")
  text <- read_html(x) %>% # load the page
    html_nodes(".match-detail--item:nth-child(3) span , .match-detail--item:nth-child(3) h4 ,
      .stadium-details+ .match-detail--item span , .stadium-details ,
      .stadium-details+ .match-detail--item h4 , .cscore_score , .cscore_name--long") %>% # isolate the text
    html_text() # get the text
  # the data.frame is the last expression, so it is what lapply() collects
  data.frame(url = x, text = text, stringsAsFactors = FALSE)
})
Then we can use dplyr to turn the list into a data.frame:
text_df <- dplyr::bind_rows(text_list)
head(text_df)
url text
1 http://stats.espncricinfo.com//ci/engine/match/1153840.html New Zealand
2 http://stats.espncricinfo.com//ci/engine/match/1153840.html 371/7
3 http://stats.espncricinfo.com//ci/engine/match/1153840.html Sri Lanka
4 http://stats.espncricinfo.com//ci/engine/match/1153840.html 326 (49/50 ov)
5 http://stats.espncricinfo.com//ci/engine/match/1153840.html New Zealand
6 http://stats.espncricinfo.com//ci/engine/match/1153840.html 371/7
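Not sure if this is already what you want. You might want to collapse the text so that there is only one row per URL, but I think that should be easy to figure out if you need it.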
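One way the collapsing step might look, sketched in base R on a small stand-in for text_df. The first URL and its four text values are taken from the output above; the second URL's text values are made-up placeholders, and the `" | "` separator and the name `collapsed_df` are arbitrary choices for illustration.

```r
# Stand-in for text_df: the first match uses values from the output above,
# the second match uses placeholder text purely for illustration
text_df <- data.frame(
  url = c(rep("http://stats.espncricinfo.com//ci/engine/match/1153840.html", 4),
          rep("http://stats.espncricinfo.com//ci/engine/match/1153841.html", 2)),
  text = c("New Zealand", "371/7", "Sri Lanka", "326 (49/50 ov)",
           "placeholder team", "placeholder score"),
  stringsAsFactors = FALSE
)

# Paste all text pieces that share a url into a single string,
# giving one row per url
collapsed_df <- aggregate(text ~ url, data = text_df,
                          FUN = paste, collapse = " | ")
print(collapsed_df)
```

The same idea could also be applied inside the `lapply()` call by wrapping `text` in `paste(text, collapse = " | ")` before building each data.frame.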