My data is a novel in plain text. I am using the tm and tidytext packages. Data preparation went smoothly, and I created the DocumentTermMatrix without any problems.
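library(tm)
library(readr)  # provides read_lines()
text <- read_lines("GoneWithTheWind2.txt")
set.seed(314)
text <- iconv(text, "UTF-8", sub = "")
# Corpus creation is not shown in the post; one document per line is assumed here
myCorpus <- VCorpus(VectorSource(text))
# mystopwords and Top200Words are my own character vectors (not shown)
myCorpus <- tm_map(myCorpus, removeWords, c(stopwords("english"),
                   stopwords("SMART"), mystopwords, Top200Words))
myDtm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))  # keep terms of length >= 1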
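However, I cannot work out how to code the inner_join between the bing lexicon and the DocumentTermMatrix so that I can run a chronological sentiment analysis over the course of the novel. I wrote the code below based on online examples, but I don't know what to group by in count(sentiment, ...) (I have left ???? as a placeholder), because neither the plain text nor the DocumentTermMatrix has a "line" column.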
bing <- get_sentiments("bing")
m <- as.matrix(myDtm)
v <- sort(rowSums(m), decreasing = TRUE)
myNames <- names(v)
d <- data.frame(term = myNames, freq = v)
wind_polarity <- d %>%
  # Inner join to the lexicon
  inner_join(bing, by = c("term" = "word")) %>%
  # Count by sentiment, ????
  count(sentiment, ????) %>%
  # Spread sentiments
  spread(sentiment, n, fill = 0) %>%
  mutate(
    # Add polarity field
    polarity = positive - negative,
    # Add line number field
    line_number = row_number())
I then plot the result with ggplot.
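I tried adding an "index" column to the text that holds the line number of each document (line), but this column disappears somewhere in the process. Any suggestions would be highly appreciated.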
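Here is a way to compute the polarity of each line (based on a minimal three-line example). You can join your dtm directly with the lexicon, which preserves the count information. Then convert the polarity information to a numeric representation and sum it per line. You can certainly rewrite the code to make it more elegant (I am not very familiar with the dplyr vocabulary, sorry). I hope it helps anyway.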
library(tm)
library(tidytext)
library(dplyr)  # provides %>%, inner_join() and mutate()
text <- c("I like coffee.",
          "I rather like tea.",
          "I hate coffee and tea, but I love orange juice.")
# One document per line; rows are terms, columns are documents
myDtm <- TermDocumentMatrix(VCorpus(VectorSource(text)),
                            control = list(removePunctuation = TRUE,
                                           stopwords = TRUE))
bing <- tidytext::get_sentiments("bing")
wind_polarity <- as.matrix(myDtm) %>%
  data.frame(terms = rownames(myDtm), ., stringsAsFactors = FALSE) %>%
  inner_join(bing, by = c("terms" = "word")) %>%
  # Encode sentiment as +1 / -1 and drop the non-numeric columns
  mutate(terms = NULL,
         polarity = ifelse(sentiment == "positive", 1, -1),
         sentiment = NULL) %>%
  # Multiply every document column by the per-term polarity
  { . * .$polarity } %>%
  mutate(polarity = NULL) %>%
  colSums
# the polarity per line, which you may plot, e.g., with base or ggplot
# X1 X2 X3
#  1  1  0
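For what it's worth, the same idea can probably be written closer to the pipeline in the question. The following is only a sketch under a few assumptions: it uses tidytext's tidy() method for tm matrices to get one row per term and document, and tidyr's pivot_wider() in place of spread(). Neither choice comes from the code above. The document column then fills the ???? slot in count(), and line_number is derived from the document id rather than from row_number().

```r
library(tm)
library(tidytext)
library(dplyr)
library(tidyr)

text <- c("I like coffee.",
          "I rather like tea.",
          "I hate coffee and tea, but I love orange juice.")

# One document per line, as in the example above
myDtm <- TermDocumentMatrix(VCorpus(VectorSource(text)),
                            control = list(removePunctuation = TRUE,
                                           stopwords = TRUE))

bing <- get_sentiments("bing")

wind_polarity <- tidy(myDtm) %>%                  # columns: term, document, count
  # Keep only terms that appear in the lexicon
  inner_join(bing, by = c("term" = "word")) %>%
  # Sum term counts per document and sentiment
  count(document, sentiment, wt = count) %>%
  # One column per sentiment; missing combinations become 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # Assumes both positive and negative words occur somewhere in the text
  mutate(line_number = as.integer(document),
         polarity = positive - negative) %>%
  # Document ids are character, so sort numerically
  arrange(line_number)

wind_polarity
```

On the three example lines this should reproduce the polarities 1, 1, 0 shown above. Lines without any lexicon word are dropped by the inner join, so they would have to be added back (for example with tidyr::complete()) if gaps matter for the plot. The result already has line_number and polarity columns, so it can go straight into ggplot.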