运行以下脚本将生成结果
a <- c("Your work is going to fill a large part of your life, and the only way to be truly satisfied is to do what you believe is great work. And the only way to do great work is to love what you do. If you haven't found it yet, keep looking. Don't settle. As with all matters of the heart, you'll know when you find it. - Steve Jobs")
a_source <- VectorSource(a)
a_corpus <- VCorpus(a_source)
term_stats(a_corpus)
term_stats(a_corpus)
term count support
1 . 5 1
2 to 5 1
3 is 4 1
4 you 4 1
5 , 3 1
支持是单词出现的文档数,count是出现次数。如果做tf-idf,你需要两者。
library(tm)
txt <- c("Your work is going to fill a large part of your life,
and the only way to be truly satisfied is to do what you
believe is great work.
And the only way to do great work is to love what you do.
If you haven't found it yet, keep looking. Don't settle.
As with all matters of the heart, you'll know when you find it.
- Steve Jobs")
term_stats(VCorpus(VectorSource(txt)))[1:5,]
term count support
. 5 1
to 5 1
is 4 1
#Split txt into 4 docs
txt_df <- data.frame( txt = c(
"Your work is going to fill a large part of your life,
and the only way to be truly satisfied is to do what you
believe is great work." ,
"And the only way to do great work is to love what you do." ,
"If you haven't found it yet, keep looking. Don't settle." ,
"As with all matters of the heart, you'll know when you find it. -
Steve Jobs"))
term_stats(VCorpus(VectorSource(txt_df$txt)))[1:6,]
term count support
. 5 4
you 4 4
, 3 3
the 3 3
to 5 2
is 4 2
默认是按支持排序。