I am trying to scrape Baltic Panamax Index data from a particular site. I have scraped data from other sites successfully, but the same approach does not work for this page. I am on an office connection, and the site I am downloading from shows up as a "Not secure" (plain HTTP) connection. Could that be causing the problem?
I need the "Date" and "Close" columns, and so far I have written the following scraping code:
# Baltic Panamax Index
library(rvest)
# Specify the url of the website to be scraped
con <- url("http://marine-transportation.capitallink.com/indices/baltic_exchange_history.html?ticker=BPI", "rb")
# Read the HTML code from the website
webpage <- read_html(con)
webpage
# Use CSS selectors to scrape the date section
date_data <- html_nodes(webpage, ".text .div_line:nth-child(2)")
# Convert the date data to text
date_data <- html_text(date_data)
# Have a look at the dates
head(date_data)
Desired output:
Date Close
Jan 03,2020 949
Jan 02,2020 1003
You need to send your username as a cookie in the request header to get this page. I find the httr package gives great flexibility for making this kind of request. For this site, you will need a username that is already registered with the site. Just change the user_name field in the code below and it should work:
# Use the httr package to allow flexibility with http requests
library(httr)
library(rvest)
# Set your registered username (email) here
user_name <- "[email protected]"
# Set url we need
site <- "http://marine-transportation.capitallink.com"
url <- paste0(site, "/indices/baltic_exchange_history.html?ticker=BPI")
# Request the page, sending the registered email as the clUser_email cookie.
# Cookie attributes such as expires, Max-Age, path and domain belong to a
# server's Set-Cookie response, not to an outgoing request, so only the
# name/value pair needs to be sent here.
response <- GET(url, set_cookies(clUser_email = user_name))
# Parse the HTML code from the website using rvest
webpage <- read_html(response)
# The page holds several tables; the fourth one contains the index history
date_data <- html_nodes(webpage, "table")
result <- html_table(date_data[4])[[1]]
# Tidy up: drop the repeated header row and keep the Date and Close columns
result <- result[-1, 2:3]
names(result) <- c("Date", "Close")
Now we get the result you wanted:
result
#> Date Close
#> 2 Jan 06, 2020 890.00
#> 3 Jan 03, 2020 949.00
#> 4 Jan 02, 2020 1003.00
#> 5 Dec 24, 2019 1117.00
#> 6 Dec 23, 2019 1154.00
#> 7 Dec 20, 2019 1201.00
#> 8 Dec 19, 2019 1265.00
#> 9 Dec 18, 2019 1340.00
# ....[ plus 50 more rows]....
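As an optional follow-up (a minimal sketch, assuming the "Mon DD, YYYY" dates and numeric closes shown in the printed output above), you can convert the two columns to proper types before analysing them:
# Parse the columns into proper types; "%b" assumes an English locale
result$Date  <- as.Date(result$Date, format = "%b %d, %Y")
result$Close <- as.numeric(result$Close)
# The series is then ready for, e.g., a quick plot
plot(result$Date, result$Close, type = "l",
     xlab = "Date", ylab = "BPI Close")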