R: Is it possible to extract a group of words from each sentence (line) and create a data frame (or matrix)?

Question

To extract words from the sentences, I build a separate list for each word, like this:

hello <- vector("list", length(text))
for (i in seq_along(text)) {
  # gregexpr() returns a list; [[1]] keeps every match found in sentence i
  hello[[i]] <- regmatches(text[i], gregexpr("[Hh]ello", text[i]))[[1]]
}

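However, I have more than 25 words to extract, so writing the code this way takes a very long time. Is it possible to extract a whole group of strings (words) from the text data at once?

Below is just a dummy data set.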

# Inside a character class "|" is a literal character, so use [Hh] rather than [H|h]
words <- c("[Hh]ello", "you", "so", "tea", "egg")

text=c("Hello! How's you and how did saturday go?",  
       "hello, I was just texting to see if you'd decided to do anything later",
       "U dun say so early.",
       "WINNER!! As a valued network customer you have been selected" ,
       "Lol you're always so convincing.",
       "Did you catch the bus ? Are you frying an egg ? ",
       "Did you make a tea and egg?"
)

subsets<-NULL
for ( i in 1:length(text)){
.....???
   }

The expected output is as follows:

[1] Hello you
[2] hello you
[3] you
[4] you so
[5] you you egg
[6] you tea egg

Answer 1

In base R, you can do this:

regmatches(text,gregexpr(sprintf("\\b(%s)\\b",paste0(words,collapse = "|")),text))
[[1]]
[1] "Hello" "you"  

[[2]]
[1] "hello" "you"  

[[3]]
[1] "so"

[[4]]
[1] "you"

[[5]]
[1] "you" "so" 

[[6]]
[1] "you" "you" "egg"

[[7]]
[1] "you" "tea" "egg"
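
The list returned above can be turned into the data frame or matrix the question asks for. The following is a minimal sketch, not part of the original answer: the names `result` and `counts` are hypothetical, and it assumes the word patterns are written as `[Hh]ello` style regular expressions.

```r
words <- c("[Hh]ello", "you", "so", "tea", "egg")
text <- c("Hello! How's you and how did saturday go?",
          "hello, I was just texting to see if you'd decided to do anything later",
          "U dun say so early.",
          "WINNER!! As a valued network customer you have been selected",
          "Lol you're always so convincing.",
          "Did you catch the bus ? Are you frying an egg ? ",
          "Did you make a tea and egg?")

# One alternation pattern with word boundaries, as in the answer above
pattern <- sprintf("\\b(%s)\\b", paste0(words, collapse = "|"))
matches <- regmatches(text, gregexpr(pattern, text))

# Data frame: one row per sentence, all matches collapsed into one string
result <- data.frame(
  sentence = seq_along(text),
  matched = vapply(matches, paste, character(1), collapse = " "),
  stringsAsFactors = FALSE
)
print(result)

# Matrix: one row per sentence, one column per pattern, cells hold match counts
counts <- sapply(words, function(w) {
  lengths(regmatches(text, gregexpr(sprintf("\\b(%s)\\b", w), text)))
})
rownames(counts) <- seq_along(text)
print(counts)
```

The data frame keeps the matches in the order they occur in each sentence, while the count matrix is easier to feed into further analysis.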

Answer 2

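You said your word set is long. Here is a way to convert each word set into a regular expression, apply it to the corpus (the list of sentences), and get the matches back as character vectors. It is case-insensitive and checks word boundaries, so you do not have to take age ...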
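
A minimal sketch of the approach this answer describes, assuming base R only. The helper names `word_set_to_regex` and `extract_words`, the word set, and the third example sentence are hypothetical and chosen for illustration.

```r
# Hypothetical helper: turn a plain word set into one regular expression
# that only matches whole words
word_set_to_regex <- function(word_set) {
  sprintf("\\b(%s)\\b", paste(word_set, collapse = "|"))
}

# Hypothetical helper: apply the pattern to every sentence, ignoring case,
# and return one character vector of matches per sentence
extract_words <- function(corpus, word_set) {
  pattern <- word_set_to_regex(word_set)
  regmatches(corpus,
             gregexpr(pattern, corpus, ignore.case = TRUE, perl = TRUE))
}

corpus <- c("Hello! How's you and how did saturday go?",
            "Did you make a tea and egg?",
            "The team drank steaming TEA.")  # illustrative sentence
word_set <- c("hello", "you", "tea", "egg")

found <- extract_words(corpus, word_set)
print(found)
```

Because of the `\\b` anchors, `tea` is not picked up inside `team` or `steaming`, and `ignore.case = TRUE` removes the need for patterns such as `[Hh]ello`.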
