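I want to use R to web-scrape a stock's financial tables for different years. I can get the most recent period's table, which is what the page shows by default, but I also want the data for earlier years. How can I do that? Here is the code I am using: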
# Load libraries
library(tidyverse)
library(rvest)
library(readxl)
library(magrittr)
google_finance <- read_html("https://www.google.com/finance/quote/AAPL:NASDAQ?") |>
  html_node(".UulDgc") |>
  html_table()
The result is:
> google_finance |>
+ head(5)
# A tibble: 5 × 3
`(USD)` Mar 2024infoFiscal Q…¹ `Y/Y change`
<chr> <chr> <chr>
1 "RevenueThe tot… 90.75B -4.31%
2 "Operating expe… 14.37B 5.22%
3 "Net incomeComp… 23.64B -2.17%
4 "Net profit mar… 26.04 2.20%
5 "Earnings per s… 1.53 0.66%
As you can see, only the most recent period (March 2024) is returned. How can I scrape the financial tables for the earlier periods as well?
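I think you need RSelenium for this. It launches a browser and clicks the buttons for you. Here I use Firefox as the browser; you may need to change some of the defaults to get the browser set up correctly. You will also need to have the Java SDK installed.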
library(RSelenium)
library(rvest)
library(glue)
# Initiate a Remote Driver using Firefox; this step may also download the
# driver binaries it needs.
rd <- rsDriver(browser = "firefox", chromever = NULL)
# Assign client
remDr <- rd$client
url <- "https://www.google.com/finance/quote/AAPL:NASDAQ"
# Extract names of buttons
aapl_html <- read_html(url)
btn_names <- aapl_html %>%
  html_node(".zsnTKc") %>%
  html_attr("aria-owns") %>%
  strsplit(split = " ") %>%
  unlist()
# Using the Remote Driver, navigate to url of interest
remDr$navigate(url)
# In a loop, find button of interest by its xpath, click and extract table
df_ls <- lapply(
  X = btn_names,
  FUN = function(x) {
    # Find the button by its XPath
    btn <- remDr$findElement(using = "xpath", glue("//*[@id='{x}']"))
    # Nifty trick to visually see which button is being clicked
    btn$highlightElement()
    # Click the button
    btn$clickElement()
    # Wait for the new table to finish loading
    Sys.sleep(1)
    # Read the HTML after each button is clicked
    rem_aapl_html <- remDr$getPageSource()[[1]]
    # Extract the table and return it, so lapply() collects one table per button
    rem_aapl_html %>%
      read_html() %>%
      html_node(".slpEwd") %>%
      html_table()
  }
)
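After the loop, `df_ls` is a plain list holding one table per button. If you want a single data frame instead, something like the sketch below might work. `combine_period_tables` and the `button_id`, `period`, `metric`, `value` and `yoy_change` names are hypothetical ones I made up for this example, and it assumes every table has the same three-column layout as the output in the question, with the period label in the header of the second column. The mock tables only stand in for what `html_table()` returns; the second one uses placeholder values, not real figures.

```r
# Hypothetical helper (not part of RSelenium or rvest): stack the per-button
# tables into one long data frame, tagging each row with where it came from.
combine_period_tables <- function(tables, ids) {
  stopifnot(length(tables) == length(ids))
  tagged <- Map(function(tbl, id) {
    tbl <- as.data.frame(tbl, stringsAsFactors = FALSE)
    # Assumption: three columns (label, value, Y/Y change), as in the question
    stopifnot(ncol(tbl) == 3)
    # The second header carries the period label, so it differs per table;
    # keep it as data and rename by position so rbind() can line columns up.
    period_label <- names(tbl)[2]
    names(tbl) <- c("metric", "value", "yoy_change")
    data.frame(
      button_id = rep(id, nrow(tbl)),
      period = rep(period_label, nrow(tbl)),
      tbl,
      stringsAsFactors = FALSE
    )
  }, tables, ids)
  out <- do.call(rbind, unname(tagged))
  rownames(out) <- NULL
  out
}

# Mock input standing in for df_ls and btn_names
mock_tables <- list(
  data.frame(
    `(USD)` = c("Revenue", "Net income"),
    `Mar 2024` = c("90.75B", "23.64B"),
    `Y/Y change` = c("-4.31%", "-2.17%"),
    check.names = FALSE
  ),
  data.frame(
    `(USD)` = c("Revenue", "Net income"),
    `Dec 2023` = c("placeholder", "placeholder"),
    `Y/Y change` = c("placeholder", "placeholder"),
    check.names = FALSE
  )
)
mock_ids <- c("btn-1", "btn-2")

all_periods <- combine_period_tables(mock_tables, mock_ids)
print(all_periods)
```

With the real scrape you would call `combine_period_tables(df_ls, btn_names)`. The raw headers on the live page may carry extra tooltip text (the question's output shows `Mar 2024infoFiscal Q…`), so the `period` column will probably need some cleaning. When you are done, `remDr$close()` and `rd$server$stop()` shut down the browser and the Selenium server.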