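The corpus package supports custom stemmers. Given a term as input, the stemming function should return that term's stem as output.
I took the example below from Stemming Words; it uses a hunspell dictionary for stemming.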
First, I define the sentences to test the function on:
sentences <- c("The color blue neutralizes orange yellow reflections.",
               "Zod stabbed me with blue Kryptonite.",
               "Because blue is your favourite colour.",
               "Red is wrong, blue is right.",
               "You and I are going to yellowstone.",
               "Van Gogh looked for some yellow at sunset.",
               "You ruined my beautiful green dress.",
               "You do not agree.",
               "There's nothing wrong with green.")
The custom stemming function is:
stem_hunspell <- function(term) {
    # look up the term in the dictionary
    stems <- hunspell::hunspell_stem(term)[[1]]
    if (length(stems) == 0) { # if there are no stems, use the original term
        stem <- term
    } else { # if there are multiple stems, use the last one
        stem <- stems[[length(stems)]]
    }
    stem
}
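To see the fallback rule in isolation, here is a minimal sketch that needs no dictionary. `pick_stem` and the `lookup` table are hypothetical stand-ins for `hunspell::hunspell_stem` and its results; only the selection rule (last candidate, otherwise the term itself) mirrors the function above.

```r
# Hypothetical lookup table standing in for the hunspell dictionary.
lookup <- list(
  reflections = c("reflection"),
  going       = c("going", "go")
)

# Same selection rule as stem_hunspell: last candidate, else the term itself.
pick_stem <- function(term) {
  stems <- lookup[[term]]   # NULL when the term is not in the table
  if (length(stems) == 0) {
    term
  } else {
    stems[[length(stems)]]
  }
}

pick_stem("going")        # "go"
pick_stem("kryptonite")   # "kryptonite" (unknown term, returned unchanged)
```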
This code
sentences <- text_tokens(sentences, stemmer = stem_hunspell)
produces:
> sentences
[[1]]
[1] "the" "color" "blue" "neutralize" "orange" "yellow"
[7] "reflection" "."
[[2]]
[1] "zod" "stabbed" "me" "with" "blue" "kryptonite"
[7] "."
[[3]]
[1] "because" "blue" "i" "your" "favourite" "colour"
[7] "."
[[4]]
[1] "re" "i" "wrong" "," "blue" "i" "right" "."
[[5]]
[1] "you" "and" "i" "are" "go"
[6] "to" "yellowstone" "."
[[6]]
[1] "van" "gogh" "look" "for" "some" "yellow" "at" "sunset" "."
[[7]]
[1] "you" "ruin" "my" "beautiful" "green" "dress"
[7] "."
[[8]]
[1] "you" "do" "not" "agree" "."
[[9]]
[1] "there" "nothing" "wrong" "with" "green" "."
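After stemming, I would like to apply other operations to the text, such as removing stop words. However, when I apply the tm function
removeWords(sentences,stopwords)
to my sentences, I get the following error: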
Error in UseMethod("removeWords", x) :
no applicable method for 'removeWords' applied to an object of class "list"
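The message comes from S3 dispatch: `text_tokens` returns a list with one character vector per sentence, and the generic has no method for class "list" (as far as I know, tm only defines `removeWords` for character vectors and `PlainTextDocument` objects). The sketch below reproduces the situation in base R. `drop_words` is a hypothetical generic used purely for illustration; it is not part of tm.

```r
# Hypothetical generic with only a character method, mimicking the situation.
drop_words <- function(x, words) UseMethod("drop_words")
drop_words.character <- function(x, words) x[!x %in% words]

tokens <- list(c("you", "do", "not", "agree", "."))

# Calling the generic on the list fails: there is no drop_words.list method.
err <- tryCatch(drop_words(tokens, "not"),
                error = function(e) conditionMessage(e))

# Applying it to each element works: every element is a character vector.
ok <- lapply(tokens, drop_words, words = "not")
```

`err` holds the "no applicable method" message, while `ok` is still a list with one token vector per sentence.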
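If I use
unlist(sentences)
I don't get the desired result, because I end up with a single chr vector of 65 elements. The desired result should be (for the first sentence, for example):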
"the color blue neutralize orange yellow reflection."
If you want to remove the stop words from each sentence, you can use lapply:
lapply(sentences, removeWords, stopwords())
#[[1]]
#[1] "" "color" "blue" "neutralize" "orange" "yellow" "reflection" "."
#[[2]]
#[1] "zod" "stabbed" "" "" "blue" "kryptonite" "."
#...
#...
However, judging from your expected output, it looks like you want to paste the text together:
lapply(sentences, paste0, collapse = " ")
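Putting the two steps together, here is a rough base R sketch. It rests on a few assumptions: `tokens` is typed in by hand from the first two stemmed sentences, `stop_list` is a hypothetical short list standing in for `tm::stopwords()`, and stop words are filtered with `%in%` rather than `removeWords`, so no empty strings are left behind. The final `gsub` is an extra step that removes the space which collapsing would otherwise leave before punctuation.

```r
# Tokens copied from the first two stemmed sentences shown above.
tokens <- list(
  c("the", "color", "blue", "neutralize", "orange", "yellow", "reflection", "."),
  c("zod", "stabbed", "me", "with", "blue", "kryptonite", ".")
)

# Hypothetical stop list, standing in for tm::stopwords().
stop_list <- c("the", "me", "with")

# Drop stop words sentence by sentence, keeping the list structure.
kept <- lapply(tokens, function(tok) tok[!tok %in% stop_list])

# Collapse each token vector into one string per sentence.
joined <- vapply(kept, paste, character(1), collapse = " ")

# Remove the space that collapsing leaves before punctuation.
result <- gsub(" ([.,!?;:])", "\\1", joined)
result
```

Using `vapply` instead of `lapply` returns a plain character vector with one element per sentence, which is usually easier to pass on to other text functions.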