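I am writing code to scrape the post title, the comments, and the author names from a Reddit post for a project. I can scrape the post title and the author names, but the comments are not extracted correctly.
If the post has 31 comments, each comment is extracted 31 times. Here is the code for reference: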
# load packages
library(RSelenium)
library(netstat)
# start the server
rs_driver_object <- rsDriver(browser = 'firefox',verbose = FALSE, port = free_port(), chromever = NULL)
# create a client object
remDr <- rs_driver_object$client
# open a browser
remDr$open()
# maximize window
remDr$maxWindowSize()
remDr$navigate("https://www.reddit.com/r/AnimeReviews/comments/essf1u/assassination_classroom_is_a_1010_the_charm_the/")
Sys.sleep(2)
# scroll to the end of the webpage
remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
Sys.sleep(2)
remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
load_more_comments <- remDr$findElement(using = 'xpath', '//*[@id="comment-tree"]/faceplate-partial/div[1]/button')
load_more_comments$clickElement()
#load_more_comments$refresh()
#pickup title
title <- remDr$findElement(using = 'xpath', '//*[@id="main-content"]/shreddit-title')$getElementAttribute('title')
#comments
comment_list <- remDr$findElements(using = 'tag name', 'shreddit-comment')
#print(typeof(comment_list))
for (each_comment in comment_list) {
  print(paste("Author --->", each_comment$getElementAttribute('author')))
  p_tags <- each_comment$findElements(using = "xpath", value = ".//div[3]/div/p")
  # Extract and print the text from each <p> tag
  for (p_tag in p_tags) {
    print(p_tag$getElementText())
  }
}
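Please refer to the screenshot below:
I don't know why each comment isn't extracted just once. Something seems to be wrong with how
p_tags <- each_comment$findElements(using = "xpath", value = ".//div[3]/div/p")
is working.
In short: with the code above I am using RSelenium in R to scrape Reddit comments, but they appear multiple times instead of once.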
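findElements searches the entire HTML document, even when it is called on a single element, so every iteration returns the <p> tags of all comments. You need to use findChildElements, which restricts the search to the descendants of that element. This should work (replace your last loop):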
lapply(comment_list, \(c) {
author <- unlist(c$getElementAttribute('author'))
comment <- unlist(lapply(c$findChildElements(using = "xpath", value = ".//div[3]/div/p"), \(p) {
p$getElementText()
}))
list(author = author, comment = comment)
})
#> [[1]]$author
#> [1] "dotti1999"
#>
#> [[1]]$comment
#> [1] "this shit was fucking insane"
#> [2] "honestly I adored this anime when I first watched it..."
#> ...
#> [[3]]$author
#> [1] "[deleted]"
#>
#> [[3]]$comment
#> [1] "Give me a be If premise and I will give it a watch"
#> ...
Note that this still doesn't seem to get you the replies to the comments.
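If a flat table is easier to work with than the nested list, the result of the lapply() call above can be reshaped with base R. The following is only a minimal sketch: it assumes each element has one author string and a character vector of paragraphs, as in the output shown. The results list is hand-made sample data, and to_row is a hypothetical helper name, not part of RSelenium.

```r
# Hand-made sample that mimics the structure returned by the lapply() call
# (author names and texts here are invented for illustration)
results <- list(
  list(author = "user_a", comment = c("first paragraph", "second paragraph")),
  list(author = "[deleted]", comment = "a single paragraph"),
  list(author = "user_b", comment = character(0))  # no <p> tags were found
)

# Hypothetical helper: collapse the paragraphs of one comment into one string
to_row <- function(x) {
  data.frame(
    author = x$author,
    comment = paste(x$comment, collapse = "\n"),
    stringsAsFactors = FALSE
  )
}

# One row per comment
comments_df <- do.call(rbind, lapply(results, to_row))
print(comments_df)
```

Collapsing the paragraphs keeps one row per comment; a comment with no matching paragraphs ends up with an empty string rather than being dropped.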
PBulls' answer works. Here is an alternative, although it requires an additional cleaning step.
for (each_comment in comment_list) {
  # getElementAttribute returns a list, so unlist it to get a plain string
  author <- unlist(each_comment$getElementAttribute('author'))
  # Get the HTML content of the comment
  comment_html <- unlist(each_comment$getElementAttribute("innerHTML"))
  # Extract comment text by stripping HTML tags with a regex
  comment_text <- gsub("<.*?>", "", comment_html)
  comment_text <- gsub("\n", "", comment_text)
  # Print author and comment text
  print(paste("Author --->", author))
  print(comment_text)
}
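The extra cleaning step mentioned above could look roughly like the following base R sketch. It runs on a made-up HTML fragment, not on the real Reddit markup, and it decodes only the single entity that occurs in that sample. The real innerHTML of a comment probably contains more than the body text, so further filtering may still be needed.

```r
# Hypothetical innerHTML of one comment (not taken from the real page)
comment_html <- "<div>\n  <p>First   paragraph</p>\n  <p>Second &amp; last</p>\n</div>"

# Replace each tag with a space so adjacent paragraphs do not run together
text <- gsub("<[^>]+>", " ", comment_html)
# Decode the one entity present in this sample; real pages may contain others
text <- gsub("&amp;", "&", text, fixed = TRUE)
# Collapse runs of whitespace (including newlines) and trim both ends
text <- trimws(gsub("\\s+", " ", text))
print(text)
```

Replacing tags with a space instead of an empty string avoids gluing the last word of one paragraph to the first word of the next.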