Problem with web scraping in R using rvest

I am trying to scrape political speeches from this website: https://www.narendramodi.in/category/text-speeches

Using the rvest package, I have only just started, with this code:

library(rvest)
modi <- "https://www.narendramodi.in/category/text-speeches"
html <- read_html(modi)

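However, read_html runs for hours without stopping. I can't find a solution or the reason this happens. Should I let the code run longer?

What I have tried so far: turning off my VPN; confirming that read_html works on other web pages; and checking with robotstxt, which indicates that the site allows scraping.

I expected this to work so that I could go on to use SelectorGadget to scrape the speeches on the page, but I can't get any further than this because the read_html call never finishes.

Any help or suggestions would be greatly appreciated. Thank you very much.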

r web-scraping rvest
1 Answer

Use httr to retrieve the web page, then use rvest to extract the data.

### Packages
library(httr)
library(rvest)

### Specify the url you want to get
url <- "https://www.narendramodi.in/category/text-speeches"

### Download the page with httr by specifying user-agent and encoding type
response <- GET(
  url,
  add_headers(
    `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) Gecko/20100101 Firefox/126.0",
    `Accept-Encoding` = "gzip, deflate"
  )
)

### Extract the data (titles for example) with rvest
content(response) %>%
  html_elements(xpath = '//div[@class="speechesItemLink left_class "]/a') %>% 
  html_text2()

Output:

[1] "This government, led by a son of the poor, has given top priority to the welfare of the poor: PM Modi in Kalyan"                
[2] "If you work for 10 hours then I will work for 18 hours and this is Modi's guarantee to 140 crore Indians: PM Modi in Pratapgarh"
[3] "There is no chance of Congress-SP emerging victorious in Bhadohi: PM Modi in Bhadohi, UP"                                       
[4] "CAA is a testimony to Modi's guarantee: PM Modi in Lalganj, UP"                                                                 
[5] "We've never discriminated based on religion; our schemes benefit everyone: PM Modi in Dindori"                                  
[6] "When there is a weak government like Congress, it weakens the country as well, says PM Modi in Koderma, Jharkhand"              
[7] "The priority of RJD and Congress is not you, the people, but their own vote bank: PM Modi in Hajipur"                           
[8] "Your dreams are my resolve and for this 24/7 for 2047: PM Modi in Saran"                                                        
[9] "The jungle raj of the RJD pushed Bihar back for decades: PM Modi in Muzaffarpur"
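
Since the question is about a request that never returns, it may also help to bound the request time and to check the HTTP status before parsing. The following is only a sketch under stated assumptions, not something taken from the answer above: fetch_page and extract_speeches are hypothetical helper names, the 30-second limit is an arbitrary choice, and the sample markup is made up to imitate the structure the XPath above targets. The XPath here uses contains() so that it does not depend on the exact whitespace inside the class attribute.

```r
library(httr)
library(rvest)

# Hypothetical helper: download a page, but give up after `seconds`
# instead of waiting indefinitely. The 30-second default is an assumption.
fetch_page <- function(url, seconds = 30) {
  response <- GET(
    url,
    user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) Gecko/20100101 Firefox/126.0"),
    timeout(seconds)
  )
  stop_for_status(response)  # turn HTTP 4xx/5xx responses into R errors
  read_html(content(response, as = "text", encoding = "UTF-8"))
}

# Hypothetical helper: collect the title and link of each speech entry.
extract_speeches <- function(page) {
  links <- html_elements(
    page,
    xpath = '//div[contains(@class, "speechesItemLink")]/a'
  )
  data.frame(
    title = html_text2(links),
    href = html_attr(links, "href"),
    stringsAsFactors = FALSE
  )
}

# Offline demonstration on made-up markup, so this runs without network access.
sample_page <- minimal_html('
  <div class="speechesItemLink left_class "><a href="/speech-1">First speech</a></div>
  <div class="speechesItemLink left_class "><a href="/speech-2">Second speech</a></div>
  <div class="other"><a href="/x">Not a speech</a></div>
')
speeches <- extract_speeches(sample_page)
print(speeches)

# Against the live site (not run here):
# speeches <- extract_speeches(fetch_page(url))
```

If the server does not respond within the limit, GET stops with an error rather than hanging, which at least makes the failure visible; whether the site then answers still depends on the request headers, as in the answer above.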