R: Web scraping a JavaScript table on Wikipedia

Question

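I am trying to scrape all the data in the table at: https://en.wikipedia.org/wiki/List_of_countries_by_firearm-related_death_rate

I tried using SelectorGadget, but I actually found Chrome's Right-click -> Inspect option easier to use. The selector I found is: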

#mw-content-text > div > table.wikitable.sortable.jquery-tablesorter

However, I get the wrong output, character(0):

library(rvest)
url <- 'https://en.wikipedia.org/wiki/List_of_countries_by_firearm-related_death_rate'
webpage <- read_html(url)
webpage %>%
  html_nodes("#mw-content-text > div > table.wikitable.sortable.jquery-tablesorter") %>%
  html_text()
#> character(0)

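I believe this is because the table is generated dynamically by JavaScript, which rvest cannot read. I have heard that RSelenium can be used to download the HTML, which can then be parsed with the rvest code above. However, RSelenium looks like a rabbit hole (starting servers, Docker, ports, etc.). Are there other, more intuitive and accessible options that I am not aware of, or is RSelenium my only option?

My goal is to write an RMarkdown report that builds a model from data scraped from one or more websites, so I am looking for an automated web scraping solution.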

Answer 1

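First, this is an HTML table, not a JavaScript table. When you inspect the page you can see every table element, which you would not if it were a JavaScript table.

Using the table's XPath works well here. You can copy it from the right-click menu while inspecting the page.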
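One likely reason the selector in the question returned character(0): the `jquery-tablesorter` class is added by a script in the browser after the page loads, so it appears in Chrome's inspector but not in the HTML that read_html() downloads. The sketch below illustrates the effect on a small hand-written snippet; `raw_html` is an assumption standing in for the real page, not a copy of it.

```r
library(rvest)

# Hand-written stand-in for the HTML the server sends (hypothetical snippet):
# the table carries only the classes "wikitable" and "sortable".
raw_html <- '
<div id="mw-content-text"><div>
  <table class="wikitable sortable">
    <tr><th>Country</th><th>Total</th></tr>
    <tr><td>Argentina</td><td>6.36</td></tr>
  </table>
</div></div>'

page <- read_html(raw_html)

# Selector as copied from the browser inspector: the script-added class
# "jquery-tablesorter" is absent from the raw HTML, so nothing matches.
browser_sel <- page %>%
  html_nodes("#mw-content-text > div > table.wikitable.sortable.jquery-tablesorter")
length(browser_sel)

# Same selector without the script-added class: the table is found.
static_sel <- page %>%
  html_nodes("#mw-content-text > div > table.wikitable.sortable")
length(static_sel)
```

If this is what happened, dropping `.jquery-tablesorter` from the selector may be enough, without any browser automation.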

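library(rvest)
# `url` is the Wikipedia address defined in the question
guns <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table[3]') %>%
  html_table()
guns <- guns[[1]]  # html_table() returns a list; keep its first element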

                  Country Total Year   Homicides    Suicides Unintentional Undetermined       Sources and notes Guns per 100 inhabitants[citation needed]
1   Argentina ! Argentina  6.36 2009 2.58 (2012) 1.57 (2009)   0.05 (2009)  2.57 (2009) Guns in Argentina[1][2]                                      10.2
2   Australia ! Australia  0.93 2013 0.16 (2013) 0.74 (2013)   0.02 (2013)  0.02 (2013)    Guns in Australia[3]                                      21.7
3       Austria ! Austria  2.63 2011 0.10 (2011) 2.43 (2011)   0.01 (2009)  0.04 (2011)      Guns in Austria[4]                                      30.4
4 Azerbaijan ! Azerbaijan  0.30    ? 0.27 (2010) 0.01 (2007)   0.02 (2007)            ?   Guns in Azerbaijan[5]                                       3.5
5    Barbados !  Barbados  3.12    ? 3.12 (2013)           ?             ?            ?     Guns in Barbados[6]                                       7.8
6      Belarus !  Belarus  0.23    ? 0.14 (2009)           ?   0.09 (1996)            ?      Guns in Belarus[7]                                       7.3
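The positional XPath (`table[3]`) depends on the page layout at the time of the answer and may stop matching if tables are added or removed. A possible alternative, sketched here on a hand-written snippet rather than the live page, is to select tables by class and keep the one whose header contains a known column. The names `raw_html`, `has_country`, and `guns_demo` are illustrative only.

```r
library(rvest)

# Hypothetical stand-in page with two tables; only the second holds the data.
raw_html <- '
<div id="mw-content-text"><div>
  <table class="infobox"><tr><th>Note</th></tr><tr><td>not the data</td></tr></table>
  <table class="wikitable sortable">
    <tr><th>Country</th><th>Total</th><th>Year</th></tr>
    <tr><td>Argentina</td><td>6.36</td><td>2009</td></tr>
    <tr><td>Australia</td><td>0.93</td><td>2013</td></tr>
  </table>
</div></div>'

# Parse every table that has the "wikitable" class into a list of data frames.
tables <- read_html(raw_html) %>%
  html_nodes("table.wikitable") %>%
  html_table()

# Keep the first table that has a "Country" column (an illustrative rule).
has_country <- vapply(tables, function(tb) "Country" %in% names(tb), logical(1))
guns_demo <- tables[[which(has_country)[1]]]
guns_demo
```

Selecting by a column name trades one assumption (table position) for another (header text), so it is worth checking which is more stable for the page at hand.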

There is still some cleanup to do. Here is an example for the Country column:

library(dplyr)
# Drop everything from the "!" onward, then trim surrounding whitespace
guns <- guns %>% mutate(Country = trimws(gsub("!.*","", Country)))
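To see what this cleanup does without downloading the page, the same expression can be tried on a few values copied from the printed table above. The numeric conversion at the end is a hypothetical extra step that is not part of the original answer; it assumes cells look like "0.05 (2009)" or "?".

```r
# Sample values copied from the Country and Unintentional columns above.
country <- c("Argentina ! Argentina", "Azerbaijan ! Azerbaijan", "Barbados !  Barbados")
unintentional <- c("0.05 (2009)", "0.02 (2007)", "?")

# Same expression as in the answer: drop everything from the "!" onward,
# then trim the leftover whitespace.
country_clean <- trimws(gsub("!.*", "", country))
country_clean

# Hypothetical extra step: keep the number before the first space and
# treat "?" as a missing value.
unintentional_num <- as.numeric(
  ifelse(unintentional == "?", NA_character_, sub(" .*", "", unintentional))
)
unintentional_num
```

The year in parentheses is discarded here; if it matters for the model, it could be extracted into its own column instead.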
