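I am learning how to use the Reddit API and am trying to extract all the comments from a specific post.
For example, consider this post: https://www.reddit.com/r/Homebrewing/comments/11dd5r3/worst_mistake_youve_made_as_a_homebrewer/
Using this R code, I think I was able to access the comments: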
library(httr)
library(jsonlite)
# Set authentication parameters
auth <- authenticate("some-key1", "some_key2")
# Set user agent (stored under a name that does not shadow httr::user_agent)
ua <- user_agent("my_app/0.1")
# Get access token; config objects are passed unnamed, the body is form-encoded
response <- POST("https://www.reddit.com/api/v1/access_token",
                 auth,
                 ua,
                 body = list(grant_type = "password",
                             username = "abc123",
                             password = "123abc"),
                 encode = "form")
# Extract access token from response
access_token <- content(response)$access_token
# Use access token to make API request
url <- "https://oauth.reddit.com/LISTING" # Replace "LISTING" with the subreddit or endpoint you want to access
headers <- c("Authorization" = paste("Bearer", access_token))
result <- GET(url, ua, add_headers(.headers = headers))
post_id <- "11dd5r3"
url <- paste0("https://oauth.reddit.com/r/Homebrewing/comments/", post_id)
# Set the user agent string
user_agent_string <- "MyApp/1.0"
# Set the authorization header
authorization_header <- paste("Bearer", access_token)
# Make the API request
response <- GET(url, add_headers(Authorization = authorization_header, `User-Agent` = user_agent_string))
# Extract the response content as a single text string
response_json <- content(response, as = "text", encoding = "UTF-8")
From here, it looks like all the comments are stored between <p> and </p> tags:
<p>Reminds me of a chemistry professor I had in college, he taught a class on polymers (really smart guy, Nobel prize voter level). When talking about glass transition temperature he suddenly stopped and told a story about how a week or two beforehand he had put some styrofoam into the oven to keep the food warm while he waited for his wife to get home. It melted and that was his example on glass transition temperature. Basically: no matter how smart or trained you are, you can still make a mistake.</p>
<p>opening the butterfly valve on the bottom of a pressurized FV with a peanut butter chocolate milk stout in it. Made the inside of my freezer look like someone diarrhea'd all over the inside of the door.</p>
Using this logic, I tried to keep only the text between these tags with a regular expression:
final <- response_json[1]
matches <- gregexpr("<p>(.*?)</p>", final)
matches_text <- regmatches(final, matches)[[1]]
I think this code partially works, but many of the returned entries are not comments:
[212] "<p>Worst mistake was buying malt hops and yeast and letting it go stale.</p>"
[213] "<p>Posts are automatically archived after 6 months.</p>"
Can someone show me a better way to do this? How can I extract only the comment text and nothing else?
Thanks!
If you want to use regex anyway, you could try a pattern such as (?<=<p>).*?(?=</p>) with perl = TRUE, for example:
> s <- "<p>xxxxx</p> <p>xyyyyyyyyy</p> <p>zzzzzzzzzzzz</p>"
> regmatches(s, gregexpr("(?<=<p>).*?(?=</p>)", s, perl = TRUE))[[1]]
[1] "xxxxx" "xyyyyyyyyy" "zzzzzzzzzzzz"
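To reuse the same pattern on other strings, it could be wrapped in a small helper. This is only a sketch: `extract_between_tags` is a hypothetical name, and it assumes the tags carry no attributes and are not nested. It still returns every <p> element it finds, so paragraphs that are not comments would come back as well.

```r
# Hypothetical helper: return the text found between <tag> and </tag>.
# Assumes plain tags without attributes; perl = TRUE is needed for lookarounds.
extract_between_tags <- function(x, tag = "p") {
  pattern <- sprintf("(?<=<%s>).*?(?=</%s>)", tag, tag)
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

s <- "<p>first comment</p> <div>ignored</div> <p>second comment</p>"
paragraphs <- extract_between_tags(s)
print(paragraphs)
```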
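Assuming the API response is in JSON format, you can use the jsonlite package in R to convert the JSON response into a data frame and then extract the comments from that data frame.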
library(jsonlite)
response <- '{"comments":[{"name":"John","email":"[email protected]","body":"This is a comment."},{"name":"Jane","email":"[email protected]","body":"Another comment."}]}'
df <- jsonlite::fromJSON(response, simplifyDataFrame = TRUE)
# fromJSON returns a list; the data frame of comments is its "comments" element
comments <- df$comments$body
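The toy JSON above is flat, whereas a comment thread is nested. The sketch below assumes the comments endpoint returns a JSON array whose second element is a "Listing" of "t1" (comment) objects, each with a `body` field and a `replies` field that is either an empty string or another Listing. The sample JSON is hand-written for illustration and `collect_bodies` is a hypothetical helper, so the field names should be checked against a real response before relying on it.

```r
library(jsonlite)

# Hand-written stand-in for a comments response (assumed shape):
# element 1 describes the post, element 2 holds the comment tree.
json <- '[
  {"kind": "Listing", "data": {"children": [
    {"kind": "t3", "data": {"title": "Example post"}}
  ]}},
  {"kind": "Listing", "data": {"children": [
    {"kind": "t1", "data": {"body": "Top-level comment", "replies": {
      "kind": "Listing", "data": {"children": [
        {"kind": "t1", "data": {"body": "A reply", "replies": ""}}
      ]}}}},
    {"kind": "t1", "data": {"body": "Another top-level comment", "replies": ""}},
    {"kind": "more", "data": {"children": ["abc"]}}
  ]}}
]'

# Recursively collect the body of every comment ("t1") in a listing.
collect_bodies <- function(listing) {
  bodies <- character(0)
  for (child in listing$data$children) {
    if (!identical(child$kind, "t1")) next  # skip non-comment entries such as "more"
    bodies <- c(bodies, child$data$body)
    replies <- child$data$replies
    if (is.list(replies)) {                 # an empty string means no replies
      bodies <- c(bodies, collect_bodies(replies))
    }
  }
  bodies
}

# simplifyVector = FALSE keeps the nested list structure intact
parsed <- fromJSON(json, simplifyVector = FALSE)
comment_bodies <- collect_bodies(parsed[[2]])
print(comment_bodies)
```

With a real request, `json` would be the text of the response, provided the endpoint actually returned JSON rather than an HTML page. Because only comment bodies are collected, page text such as "Posts are automatically archived after 6 months." would not appear in the result.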