I am working in R and using the replace_emoticon function from the textclean package to replace emoticons with their corresponding words:
library(textclean)
test_text <- "i had a great experience xp :P"
replace_emoticon(test_text)
[1] "i had a great e tongue sticking out erience tongue sticking out tongue sticking out "
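As shown above, the function works, but it also replaces characters that look like an emoticon even when they sit inside a word (for example, the "xp" in "experience"). I looked for a fix and found the following function override, which claims to solve the problem: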
replace_emoticon <- function(x, emoticon_dt = lexicon::hash_emoticons, ...){
    trimws(gsub(
        "\\s+",
        " ",
        mgsub_regex(x, paste0('\\b\\Q', emoticon_dt[['x']], '\\E\\b'), paste0(" ", emoticon_dt[['y']], " "))
    ))
}
replace_emoticon(test_text)
[1] "i had a great experience tongue sticking out :P"
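However, while this does fix the problem with the word "experience", it creates a new one: it no longer replaces ":P", which is an emoticon and should normally be replaced by the function.
Also, "xp" is the only case I know of, and I am not sure whether other character sequences are wrongly replaced when they occur inside a word.
Is there a way to tell the replace_emoticon function to replace "emoticons" only when they are not part of a word?
Thanks!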
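Wiktor is right: the word-boundary check is what causes the problem. The leading \b cannot match between a space and ":" because neither is a word character, so in the function that follows I dropped the leading \b and kept only the trailing one. One issue remains: an emoticon that is immediately followed by a word, with no space between them, is not replaced. The question is whether that last case matters. See the function and examples that follow.
Note: I have added this to the textclean issue tracker.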
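To make the boundary behaviour concrete, here is a small base-R check. It is only a sketch: it assumes the patterns are evaluated with PCRE (perl = TRUE), which \Q...\E requires, and the helper has_match and the test strings are made up for illustration.

```r
# \b matches only at a position between a word character and a non-word
# character (or the start/end of the string next to a word character).
# has_match is a hypothetical helper, not part of textclean.
has_match <- function(pattern, text) grepl(pattern, text, perl = TRUE)

checks <- c(
  # "xp" inside a word: no boundary on either side of it
  xp_inside_word   = has_match("\\b\\Qxp\\E\\b", "experience"),
  # "xp" on its own: boundaries on both sides
  xp_standalone    = has_match("\\b\\Qxp\\E\\b", "great xp"),
  # leading \b fails: a space and ":" are both non-word characters
  colonP_both_b    = has_match("\\b\\Q:P\\E\\b", "xp :P"),
  # trailing \b only: "P" followed by end of string is a boundary
  colonP_trailing  = has_match("\\Q:P\\E\\b", "xp :P"),
  # trailing \b fails when a word character follows the emoticon
  colonP_then_word = has_match("\\Q:P\\E\\b", ":Pnewword"),
  # trailing \b also fails after an emoticon that ends in punctuation
  smiley_trailing  = has_match("\\Q:)\\E\\b", "nice :)")
)
print(checks)
```

The last check suggests that a trailing \b alone may also skip emoticons that end in a non-word character, so that case is worth testing against the real lookup table.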
replace_emoticon2 <- function(x, emoticon_dt = lexicon::hash_emoticons, ...){
    trimws(gsub(
        "\\s+",
        " ",
        mgsub_regex(x, paste0('\\Q', emoticon_dt[['x']], '\\E\\b'), paste0(" ", emoticon_dt[['y']], " "))
    ))
}
# works
replace_emoticon2("i had a great experience xp :P")
[1] "i had a great experience tongue sticking out tongue sticking out"
replace_emoticon2("i had a great experiencexp:P:P")
[1] "i had a great experience tongue sticking out tongue sticking out tongue sticking out"
# does not work:
replace_emoticon2("i had a great experience xp :Pnewword")
[1] "i had a great experience tongue sticking out :Pnewword"
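If the goal is strictly "replace only when the emoticon is not part of a word", one possible alternative is to use lookarounds instead of \b. This is a rough base-R sketch, not the textclean implementation: the three-row table emoticons and the function replace_emoticon_sketch are hypothetical stand-ins for lexicon::hash_emoticons and replace_emoticon, and I have not run it against the full lookup table.

```r
# Hypothetical mini lookup table; the real one is lexicon::hash_emoticons.
emoticons <- data.frame(
  x = c("xp", ":P", ":)"),
  y = c("tongue sticking out", "tongue sticking out", "smiley"),
  stringsAsFactors = FALSE
)

# Hypothetical sketch: replace an emoticon only when no word character
# touches it on either side.
replace_emoticon_sketch <- function(text, emoticon_dt = emoticons) {
  for (i in seq_len(nrow(emoticon_dt))) {
    # (?<!\w) : not directly preceded by a word character
    # (?!\w)  : not directly followed by a word character
    pattern <- paste0("(?<!\\w)\\Q", emoticon_dt$x[i], "\\E(?!\\w)")
    text <- gsub(pattern, paste0(" ", emoticon_dt$y[i], " "), text, perl = TRUE)
  }
  # collapse the extra spaces introduced by the padded replacements
  trimws(gsub("\\s+", " ", text))
}

replace_emoticon_sketch("i had a great experience xp :P")
replace_emoticon_sketch("nice :)")
replace_emoticon_sketch("i had a great experience xp :Pnewword")
```

With this rule the first call should replace both "xp" and ":P" while leaving "experience" alone, and ":Pnewword" stays untouched. The trade-off is that glued input such as "experiencexp:P:P" is left unchanged, because every emoticon there touches a word character.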