Web scraping with the rvest package and the SelectorGadget tool in R

Question

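I am trying to scrape Baltic Panamax Index data from a website. I have scraped data from other sites without trouble, but the same approach does not work for this page. I am on an office connection, and the site I want to download from is flagged as a "not secure" connection. Could that be causing the problem?

I need the "Date" and "Close" columns, and so far I have written the following scraping code: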

library(rvest)

#Baltic Panamax Index
#Specifying the url for desired website to be scraped
con <- url("http://marine-transportation.capitallink.com/indices/baltic_exchange_history.html?ticker=BPI", "rb")

#Reading the HTML code from the website
webpage <- read_html(con)
webpage

#Using CSS selectors to scrape the date section
date_data <- html_nodes(webpage, ".text .div_line:nth-child(2)")

#Converting the date data to text
date_data <- html_text(date_data)

#Let's have a look at the dates
head(date_data)

Desired output:

Date          Close
Jan 03,2020   949
Jan 02,2020   1003

Answer 1

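You need to send your user name as a cookie in the request headers to get this page. I find that the httr package gives a lot of flexibility for making this kind of request. For this site you will need a user name that is already registered with the site. Just change the user_name field in the code below and it should work: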

# Use the httr package to allow flexibility with http requests
library(httr)
library(rvest)

# Set username here -----
#                       |
#             ---------------------
#             |                   |
#             v                   v
user_name  <- "your_registered_email@example.com"  # placeholder: use your own registered address

# Set url we need
site  <- "http://marine-transportation.capitallink.com"
url   <- paste0(site, "/indices/baltic_exchange_history.html?ticker=BPI")

# Obtain the page we want using user name as a cookie
response <- GET(url, set_cookies(clUser_email = user_name,
                                 expires      = "Sat, 16-Sep-2051 11:30:30 GMT",
                                 `Max-Age`    = "1000000000",
                                 path         = "/",
                                 domain       = "capitallink.com"))

# Parse the HTML code from the website using rvest
webpage       <- read_html(response)
date_data     <- html_nodes(webpage, "table")
result        <- html_table(date_data[4])[[1]]

# Tidy up the result
result        <- result[-1, 2:3]
names(result) <- c("Date", "Close")

Now we get the result you wanted:

result
#>            Date   Close
#> 2  Jan 06, 2020  890.00
#> 3  Jan 03, 2020  949.00
#> 4  Jan 02, 2020 1003.00
#> 5  Dec 24, 2019 1117.00
#> 6  Dec 23, 2019 1154.00
#> 7  Dec 20, 2019 1201.00
#> 8  Dec 19, 2019 1265.00
#> 9  Dec 18, 2019 1340.00
# ....[ plus 50 more rows]....
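
Judging by the printed values (e.g. `890.00`), both columns most likely come back from `html_table()` as character strings, so you may want to convert them before plotting or doing arithmetic. Here is a minimal sketch, assuming the dates keep the `Jan 06, 2020` format shown above and that your R session uses an English locale (`%b` matches month abbreviations in the current locale). The small data frame just stands in for the scraped `result` so the snippet runs on its own:

```r
# Stand-in for the scraped `result`, using rows copied from the output above
result <- data.frame(
  Date  = c("Jan 06, 2020", "Jan 03, 2020", "Jan 02, 2020"),
  Close = c("890.00", "949.00", "1003.00"),
  stringsAsFactors = FALSE
)

# Parse "Jan 06, 2020" style strings into Date objects
# (assumes an English locale for the month abbreviation)
result$Date <- as.Date(result$Date, format = "%b %d, %Y")

# Drop any thousands separators (a precaution) and convert to numeric
result$Close <- as.numeric(gsub(",", "", result$Close))

str(result)
```

If the `Date` column comes out as `NA`, the locale is the first thing I would check.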
