I've set up a new project in R and want to extract specific symbols from text.
X <- c("amazing tiny phone ^_^","so cute!!! <3")
I want to extract ^_^ and <3 from X in R.
Thanks!
A more straightforward approach:
X <- c("amazing tiny phone ^_^", "so cute!!! <3", "^_^ and :) are my fav symbols")
patt <- c("=d", "<3", ":o", ":(", ":)", "(y)", ":*", "^_^", ":d", ";)", ":'(")
variable <- sapply(X, function(x) {
  # indices of the symbols that occur as space-separated tokens in x
  i <- which(patt %in% strsplit(x, " ")[[1]])
  if (length(i) > 0) {
    paste(patt[i], collapse = " ")
  } else {
    NA
  }
})
names(variable) <- NULL
> variable
[1] "^_^" "<3" ":) ^_^"
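If the symbols are not always separated by spaces (say "cute<3"), one possible variant of the approach above is to test each symbol as a literal substring with grepl(fixed = TRUE). This is only a sketch: extract_symbols is a hypothetical helper name, and the fourth string is added here purely to show the NA case. Keep in mind that substring matching will also pick up symbols embedded in longer tokens, which may or may not be what you want.

```r
X <- c("amazing tiny phone ^_^", "so cute!!! <3",
       "^_^ and :) are my fav symbols", "no symbols here")
patt <- c("=d", "<3", ":o", ":(", ":)", "(y)", ":*", "^_^", ":d", ";)", ":'(")

# Hypothetical helper: returns the symbols found in one string, or NA if none
extract_symbols <- function(x, symbols) {
  # TRUE for every symbol that occurs literally (no regex) anywhere in x
  found <- vapply(symbols, function(s) grepl(s, x, fixed = TRUE), logical(1))
  if (any(found)) paste(symbols[found], collapse = " ") else NA_character_
}

variable <- unname(vapply(X, extract_symbols, character(1), symbols = patt))
variable
# [1] "^_^" "<3" ":) ^_^" NA
```

As in the original, the symbols come back in the order they appear in patt, not in the order they appear in the text.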
@GraemeForst This can be generalized using grouping and a lookahead:
group <- "[\\^\\_\\<\\>3\\:\\(\\)\\;]"
pat <- sprintf(".*[\\s\\b](%s+)(?!\\1)", group)
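group defines the character class, i.e. basically all the characters that make up the symbols we want to extract. pat defines the matching pattern. [\\s\\b] says that a possible match must be preceded by whitespace or a boundary (note that inside a character class PCRE reads \b as a backspace character rather than a word boundary, so in practice this requires whitespace). (?!\\1) is a negative lookahead on the first capture group: the captured text must not be repeated immediately after the match. Here is a demo: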
X <- c("amazing tiny phone ^_^","so cute!!! <3", "I like pizza :)", "hello beautiful ;)")
gsub(pat, "\\1", grep(pat, X, value = TRUE, perl = TRUE), perl = TRUE)
# [1] "^_^" "<3" ":)" ";)"
This can be refined and generalized further. One very simple step would be to extend group.
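As a rough illustration of extending the character class, the sketch below adds =, *, ' and - so that symbols such as :'( and :* are picked up too. The names group_ext, pat_ext and Y are made up for this example, and the \\b is dropped from the prefix class on the assumption that requiring whitespace is what is intended.

```r
# Extended character class: the original characters plus = * ' and -
# (the hyphen goes last so it is read literally)
group_ext <- "[\\^_<>3:();=*'-]"
pat_ext <- sprintf(".*\\s(%s+)(?!\\1)", group_ext)

Y <- c("I like pizza :)", "that was sad :'(", "kisses :*", "nothing to see")

# Keep only the strings that match, then reduce each one to its captured symbol
res <- gsub(pat_ext, "\\1", grep(pat_ext, Y, value = TRUE, perl = TRUE), perl = TRUE)
res
# [1] ":)" ":'(" ":*"
```

Adding letters to the class (for :D or :o) looks riskier, since ordinary words that start with those letters would then match as well.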
Old answer
You can use a regular expression:
# create the pattern to be extracted
pat = ".*(\\^\\_\\^).*|.*(\\<3).*" # escape special characters with "\\" and ".*" to specify there may be text before/after
# extract
gsub(pat, "\\1\\2", grep(pat, X, value = TRUE, perl = TRUE), perl = TRUE)
# [1] "^_^" "<3"
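Writing the alternation by hand gets tedious with more symbols, so here is a sketch of building it from a vector instead and pulling out every occurrence with gregexpr() and regmatches(). The names symbols, escaped, pat_all and found are illustrative only, and the escaping step assumes the symbols consist of punctuation and alphanumerics.

```r
symbols <- c("^_^", "<3", ":)", ";)")

# Prefix every punctuation character with a backslash so it is matched literally
escaped <- gsub("([[:punct:]])", "\\\\\\1", symbols)
pat_all <- paste(escaped, collapse = "|")

X <- c("amazing tiny phone ^_^", "so cute!!! <3", "^_^ and :) are my fav symbols")

# gregexpr() locates all matches per string; regmatches() extracts them as a list
matches <- regmatches(X, gregexpr(pat_all, X, perl = TRUE))
found <- vapply(matches, paste, character(1), collapse = " ")
found
# [1] "^_^" "<3" "^_^ :)"
```

Unlike the gsub() version, this returns the symbols in the order they occur in each string, and a string without any symbol yields "".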