I am trying to scrape political speeches from this website: https://www.narendramodi.in/category/text-speeches
Using the rvest package, I have only just started, with this code:
modi <- "https://www.narendramodi.in/category/text-speeches"
html <- read_html(modi)
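However, read_html runs for hours without ever returning. I cannot find a fix or the reason this happens. Should I let the code run longer?
What I have tried so far: turning off my VPN. I have also confirmed that read_html works on other web pages, and robotstxt indicates that the site allows scraping.
I would like this to work so that I can move on to using selectorgadget to scrape the speeches from the page, but I cannot get any further because the read_html call never finishes.
Any help or suggestions would be greatly appreciated. Thank you very much.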
Use httr to retrieve the web page, then use rvest to extract the data.
### Packages
library(httr)
library(rvest)
### Specify the url you want to get
url <- "https://www.narendramodi.in/category/text-speeches"
### Download the page with httr by specifying user-agent and encoding type
response <- GET(
  url,
  add_headers(
    `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) Gecko/20100101 Firefox/126.0",
    `Accept-Encoding` = "gzip, deflate"
  )
)
### Extract the data (titles for example) with rvest
content(response) %>%
  html_elements(xpath = '//div[@class="speechesItemLink left_class "]/a') %>%
  html_text2()
Output:
[1] "This government, led by a son of the poor, has given top priority to the welfare of the poor: PM Modi in Kalyan"
[2] "If you work for 10 hours then I will work for 18 hours and this is Modi's guarantee to 140 crore Indians: PM Modi in Pratapgarh"
[3] "There is no chance of Congress-SP emerging victorious in Bhadohi: PM Modi in Bhadohi, UP"
[4] "CAA is a testimony to Modi's guarantee: PM Modi in Lalganj, UP"
[5] "We've never discriminated based on religion; our schemes benefit everyone: PM Modi in Dindori"
[6] "When there is a weak government like Congress, it weakens the country as well, says PM Modi in Koderma, Jharkhand"
[7] "The priority of RJD and Congress is not you, the people, but their own vote bank: PM Modi in Hajipur"
[8] "Your dreams are my resolve and for this 24/7 for 2047: PM Modi in Saran"
[9] "The jungle raj of the RJD pushed Bihar back for decades: PM Modi in Muzaffarpur"
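Since the original problem was a request that ran for hours, it may be worth making the download fail fast rather than hang. The following is a minimal sketch of that idea, not a definitive implementation. It wraps the same request in two helpers, fetch_page and extract_titles, which are hypothetical names introduced here. It assumes a 30-second limit is acceptable, and it swaps the exact class match for contains() on the assumption that the class string (note the trailing space above) may vary.

```r
library(httr)
library(rvest)

# Hypothetical helper: download a page, but give up after `seconds`
# instead of waiting indefinitely.
fetch_page <- function(url, seconds = 30) {
  response <- GET(
    url,
    add_headers(
      `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) Gecko/20100101 Firefox/126.0",
      `Accept-Encoding` = "gzip, deflate"
    ),
    timeout(seconds)  # raises an error if the request exceeds the limit
  )
  stop_for_status(response)  # raises an error on HTTP 4xx/5xx responses
  read_html(content(response, as = "text", encoding = "UTF-8"))
}

# Hypothetical helper: pull the speech titles out of a parsed page.
# contains() is an assumption made here to tolerate changes in the
# class attribute; the original answer matches the class string exactly.
extract_titles <- function(page) {
  page %>%
    html_elements(xpath = '//div[contains(@class, "speechesItemLink")]/a') %>%
    html_text2()
}

# Usage (performs a network request, so it is left commented out):
# titles <- extract_titles(fetch_page("https://www.narendramodi.in/category/text-speeches"))
```

Keeping the download and the extraction in separate functions means the XPath can be checked against a small HTML string without contacting the site.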