For a project I am trying to get the sentiment of a number of news articles, using the sentimentr package. Since I have quite a few articles, I want to speed this up by using multiple cores of my processor. My current code is as follows:
library(sentimentr)
library(dplyr)

# Extract sentences
df_sentences <- text1 %>%
  select(content) %>%
  get_sentences()

# Extract sentences via lapply (pass the function, don't call it)
df_sentences2 <- text1 %>%
  select(content) %>%
  lapply(get_sentences)
text1 is a data frame containing the articles and their metadata; the content column holds the actual article text. Online I found the parallel package, which should make this possible. I tried to implement it with the code below, but unfortunately it does not seem to use more cores, since the speed stays the same.
library(sentimentr)
library(dplyr)
library(parallel)

cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(sentimentr))
clusterExport(cl, "text1")

df_sentences2 <- text1 %>%
  select(content) %>%
  parLapply(cl, ., get_sentences)

df_sentiment <- df_sentences2 %>%
  parSapply(cl, ., sentiment_by)

stopCluster(cl)
I hope someone can tell me whether I am doing this correctly, or what I have to change to make it work, since it could save me a lot of time. All help is much appreciated! Sample data is included below:
structure(list(X = 0:4, id = 17284:17288, title = c("Example Title",
"Example Title", "Example Title", "Example Title", "Example Title"
), publication = c("New York Times", "New York Times", "New York Times",
"New York Times", "New York Times"), author = c("Example Writer",
"Example Writer", "Example Writer", "Example Writer", "Example Writer"
), date = c("2016-12-31", "2015-12-31", "2014-12-31", "2013-12-31",
"2012-12-31"), year = c(2016, 2016, 2016, 2016, 2016), month = c(12,
12, 12, 12, 12), url = c(NA, NA, NA, NA, NA), content = c("This is an example sentence. This is another example sentence",
"This is an example sentence. This is another example sentence",
"This is an example sentence. This is another example sentence",
"This is an example sentence. This is another example sentence",
"This is an example sentence. This is another example sentence"
)), .Names = c("X", "id", "title", "publication", "author", "date",
"year", "month", "url", "content"), class = "data.frame", row.names = c(NA,
-5L))
Edit:
I have changed the original code to incorporate @F.Privé's comment, as shown below, but the time the operation takes stays the same. I hope someone knows what I need to change to get it working properly.
library(sentimentr)
library(dplyr)
library(parallel)

cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(sentimentr))
clusterExport(cl, "text1")

df_sentences <- text1 %>%
  pull(content) %>%
  parLapply(cl, ., get_sentences)

df_sentiment <- df_sentences %>%
  parLapply(cl, ., sentiment_by)

stopCluster(cl)
So, the best approach is to split the vector into ncores parts, so that each core handles one part of the whole computation.
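For reference, the chunking idea can be sketched with base parallel alone. This is a minimal sketch, not the package function: `texts` is a hypothetical stand-in for `text1$content`, and a trivial worker (`toupper`) takes the place of `sentimentr::get_sentences` so the example is self-contained; in practice you would load sentimentr on the workers and apply `get_sentences` instead.

```r
library(parallel)

texts <- sprintf("Example article %d.", 1:100)  # stand-in for text1$content
ncores <- 2

# Split the vector into one roughly equal chunk per core,
# so each worker gets one big task instead of many tiny ones
chunks <- split(texts, cut(seq_along(texts), ncores, labels = FALSE))

cl <- makeCluster(ncores)
# In practice: clusterEvalQ(cl, library(sentimentr)) and apply get_sentences here
res <- parLapply(cl, chunks, function(x) toupper(x))  # one task per chunk
stopCluster(cl)

# Recombine the per-chunk results in the original order
processed <- unname(do.call(c, res))
```

The point is that parallel overhead is paid once per chunk rather than once per article, which is what makes the cores actually earn their keep.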
In one of my packages I have a function that does this using foreach:
# devtools::install_github("privefl/bigstatsr")
library(bigstatsr)

res <- big_parallelize(text1[["content"]], p.FUN = function(x, ind) {
  sentimentr::get_sentences(x[ind])
}, p.combine = 'c', ind = rows_along(text1), ncores = nb_cores())

structure(res, class = c("get_sentences", "get_sentences_character", "list"))
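The scoring step can follow the same chunk-then-combine pattern; since `sentiment_by` returns a data frame per chunk, the pieces are recombined with `rbind` rather than `c`. A hedged sketch, again using a stand-in scorer (a data frame with a hypothetical `score` column) so it runs without sentimentr installed:

```r
library(parallel)

texts <- sprintf("Example article %d.", 1:100)  # stand-in for text1$content
ncores <- 2
chunks <- split(texts, cut(seq_along(texts), ncores, labels = FALSE))

cl <- makeCluster(ncores)
# Stand-in for sentimentr::sentiment_by(get_sentences(x)):
# returns one row per article, as sentiment_by does
res <- parLapply(cl, chunks, function(x) {
  data.frame(text = x, score = nchar(x) / 100)
})
stopCluster(cl)

df_sentiment <- do.call(rbind, res)  # one row per article, original order
```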