Problem with web scraping using rvest in R

Question

I am trying to scrape political speeches from this website: https://www.narendramodi.in/category/text-speeches

I am using the rvest package and have only just started, with this code:

library(rvest)

modi <- "https://www.narendramodi.in/category/text-speeches"
html <- read_html(modi)

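However, read_html runs for hours without finishing. I can't find a fix or an explanation for why this happens. Should I just let the code run longer?

What I have tried or checked so far: turning off my VPN; read_html works on other web pages; robotstxt indicates that the site allows scraping.

I was hoping this would work so that I could move on to using SelectorGadget to scrape the speeches on the page, but I can't get any further because read_html never returns.

Any help or suggestions would be greatly appreciated. Thank you very much.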

Answer 1

Use httr to retrieve the web page, then use rvest to extract the data.

### Packages
library(httr)
library(rvest)

### Specify the url you want to get
url <- "https://www.narendramodi.in/category/text-speeches"

### Download the page with httr by specifying user-agent and encoding type
response <- GET(
  url,
  add_headers(
    `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) Gecko/20100101 Firefox/126.0",
    `Accept-Encoding` = "gzip, deflate"
  )
)

### Extract the data (titles for example) with rvest
### Note: the class value is matched exactly, including its trailing space
content(response) %>%
  html_elements(xpath = '//div[@class="speechesItemLink left_class "]/a') %>%
  html_text2()

Output:

[1] "This government, led by a son of the poor, has given top priority to the welfare of the poor: PM Modi in Kalyan"                
[2] "If you work for 10 hours then I will work for 18 hours and this is Modi's guarantee to 140 crore Indians: PM Modi in Pratapgarh"
[3] "There is no chance of Congress-SP emerging victorious in Bhadohi: PM Modi in Bhadohi, UP"                                       
[4] "CAA is a testimony to Modi's guarantee: PM Modi in Lalganj, UP"                                                                 
[5] "We've never discriminated based on religion; our schemes benefit everyone: PM Modi in Dindori"                                  
[6] "When there is a weak government like Congress, it weakens the country as well, says PM Modi in Koderma, Jharkhand"              
[7] "The priority of RJD and Congress is not you, the people, but their own vote bank: PM Modi in Hajipur"                           
[8] "Your dreams are my resolve and for this 24/7 for 2047: PM Modi in Saran"                                                        
[9] "The jungle raj of the RJD pushed Bihar back for decades: PM Modi in Muzaffarpur"
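
The answer does not say why read_html hung, only that a request with browser-like headers succeeds. Whatever the cause, a time limit keeps a stalled request from running for hours. The following is a minimal sketch, assuming a 30-second limit is acceptable; `fetch_page` and `extract_titles` are hypothetical helper names introduced here for illustration, and the looser `contains(@class, ...)` XPath is a swapped-in alternative to the exact class match above, not the answerer's selector.

```r
library(httr)
library(rvest)

# Hypothetical helper: fetch a page with a browser-like User-Agent and a time limit.
fetch_page <- function(url, timeout_sec = 30) {
  response <- GET(
    url,
    add_headers(
      `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) Gecko/20100101 Firefox/126.0"
    ),
    timeout(timeout_sec)     # raise an error instead of waiting indefinitely
  )
  stop_for_status(response)  # turn HTTP 4xx/5xx responses into R errors
  read_html(content(response, as = "text", encoding = "UTF-8"))
}

# Hypothetical helper: collect the link text inside the speech-title containers.
# contains() tolerates extra classes and the trailing space in the class attribute.
extract_titles <- function(page) {
  page %>%
    html_elements(xpath = '//div[contains(@class, "speechesItemLink")]/a') %>%
    html_text2()
}

# Offline check of the selector against a small inline document (made-up titles).
sample_page <- read_html('<html><body>
  <div class="speechesItemLink left_class "><a href="#">First speech</a></div>
  <div class="speechesItemLink left_class "><a href="#">Second speech</a></div>
  <div class="other"><a href="#">Not a speech</a></div>
</body></html>')
titles <- extract_titles(sample_page)
print(titles)
```

Against the live site this would be called as `extract_titles(fetch_page(url))`. If the request stalls, `GET` stops with a timeout error after `timeout_sec` seconds, which can be caught with `tryCatch`.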
