read_html() does not return tables from a website's HTML

Question

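I am trying to extract the Team Statistics and Team Analytics (5-on-5) tables from https://www.hockey-reference.com/leagues/NHL_2025.html. When I use the read_html() function from the rvest package in R, the returned HTML does not contain these two tables. It does include the standings at the top of the page, but not the other two.

I used read_html("https://www.hockey-reference.com/leagues/NHL_2025.html"), and the html_text() output of the result does not include the tables in question. I also tried html_elements() with the tables' XPath, but got no results.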

Answer 1

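This is similar to this question: Web scraping with rvest: https://www.sports-reference.com.

The tables are stored inside HTML comments in the page source, so they are not part of the parsed document tree that html_table() and XPath queries on table elements see. There is an RSelenium solution, or you can extract the comments and parse them using rvest alone.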
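To see the effect without hitting the site, here is a minimal offline sketch. The inline HTML, the table ids and the values are made up for illustration; only read_html(), html_table(), html_elements() and html_text() are real rvest functions.

```r
library(rvest)

# Hypothetical miniature page: one visible table, one wrapped in a comment
html <- '
<html><body>
  <table id="standings">
    <tr><th>Team</th><th>PTS</th></tr>
    <tr><td>A</td><td>10</td></tr>
  </table>
  <div class="table_wrapper">
    <!--
    <table id="stats">
      <tr><th>Team</th><th>GF</th></tr>
      <tr><td>A</td><td>25</td></tr>
    </table>
    -->
  </div>
</body></html>'

page <- read_html(html)

# html_table() only sees the table that is a real element in the tree
visible <- html_table(page)

# The commented table is a comment node; its text has to be parsed again
comment_text <- page %>%
  html_elements(xpath = "//comment()") %>%
  html_text()
hidden <- html_table(read_html(comment_text))

length(visible)   # number of tables found directly
hidden[[1]]       # the table recovered from the comment
```

The same two-step idea (take the comment's text, then parse it as HTML) is what makes the commented tables on the real page reachable.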

library(rvest)

url <- "https://www.hockey-reference.com/leagues/NHL_2025.html"

page <- read_html(url)   # download and parse the page
page %>% html_table()    # only the first 2 tables, which are not commented out

# The remaining tables sit inside HTML comments within div.table_wrapper
commentedNodes <- page %>%
   html_elements("div.table_wrapper") %>%   # wrappers that hold a comment
   html_elements(xpath = "comment()")       # comment nodes directly inside each wrapper

# There are multiple nodes containing comments;
# the ones needed were picked by trial and error.
additionaltables <- lapply(commentedNodes, function(node) {
   node %>%
      html_text() %>%    # comment contents as text
      read_html() %>%    # parse that text as HTML
      html_table()       # list of tables found in the comment
})
additionaltables[[2]][[1]]
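As an alternative to picking list positions by trial and error, the recovered tables can be named by their id attribute. The sketch below assumes the commented tables carry an id; commented_tables() is a hypothetical helper, not part of rvest, and the ids in the demo HTML are invented. On the real page, inspect names(commented_tables(page)) to see the actual ids.

```r
library(rvest)

# Hypothetical helper: parse every comment under div.table_wrapper and
# return the tables found, named by each table's id attribute.
commented_tables <- function(page) {
  comments <- page %>%
    html_elements("div.table_wrapper") %>%
    html_elements(xpath = "comment()")

  tables <- list()
  for (txt in html_text(comments)) {
    if (!grepl("<table", txt, fixed = TRUE)) next  # skip comments without a table
    nodes <- html_elements(read_html(txt), "table")
    parsed <- html_table(nodes)                    # one tibble per table node
    names(parsed) <- html_attr(nodes, "id")
    tables <- c(tables, parsed)
  }
  tables
}

# Offline demo with made-up ids and values
demo <- read_html('
<html><body>
  <div class="table_wrapper"><!--
    <table id="team_stats"><tr><th>Team</th><th>GF</th></tr>
    <tr><td>A</td><td>25</td></tr></table>
  --></div>
  <div class="table_wrapper"><!--
    <table id="team_analytics"><tr><th>Team</th><th>CF</th></tr>
    <tr><td>A</td><td>300</td></tr></table>
  --></div>
</body></html>')

tabs <- commented_tables(demo)
names(tabs)        # ids of the recovered tables
tabs$team_stats    # select a table by name instead of by position
```

Selecting by name keeps working if the site adds or reorders commented blocks, whereas a positional index such as [[2]][[1]] may silently start pointing at a different table.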
