从R中的推文中提取表情符号

Question

我正在从标记的Twitter数据中进行特征提取，以用于预测假推文。我已经在各种GitHub方法，R库，stackoverflow帖子上花费了大量时间，但是以某种方式我找不到提取与表情符号相关的功能的“直接”方法，例如表情符号的数量，无论该推文包含表情符号（1/0）还是特定表情符号的出现（在假/真实新闻中可能更频繁地出现）。我不确定显示可重现的代码是否有意义。

例如，

“ Ore”库提供的功能可以收集对象中的所有tweet并提取表情符号，但是如上所述，当尝试从提取中创建特征时，格式存在问题（至少对我而言）。下面的示例使用whatsapp文本示例。我将添加来自kaggle的twitter数据，以使其具有一定的可重复性。Twitter数据集：https://github.com/sherylWM/Fake-News-Detection-using-Twitter/blob/master/FinalDataSet.csv

# save this to '_chat.txt` (it require a login)
# https://www.kaggle.com/sarthaknautiyal/whatsappsample

library(ore)
library(dplyr)

emoji_src <- "https://raw.githubusercontent.com/laurenancona/twimoji/gh-pages/twitterEmojiProject/emoticon_conversion_noGraphic.csv"
emoji_fil <- basename(emoji_src)
if (!file.exists(emoji_fil)) download.file(emoji_src, emoji_fil)

emoji <- read.csv(emoji_fil, header=FALSE, stringsAsFactors = FALSE)
emoji_regex <- sprintf("(%s)", paste0(emoji$V2, collapse="|"))
compiled <- ore(emoji_regex)

chat <- readLines("_chat.txt", encoding = "UTF-8", warn = FALSE)

which(grepl(emoji_regex, chat, useBytes = TRUE))
##   [1]   8   9  10  11  13  19  20  22  23  62  65  69  73  74  75  82  83  84  87  88  90  91
##  [23]  92  93  94  95 107 108 114 115 117 119 122 123 124 125 130 135 139 140 141 142 143 144
##  [45] 146 147 150 151 153 157 159 161 162 166 169 171 174 177 178 183 184 189 191 192 195 196
##  [67] 199 200 202 206 207 209 220 221 223 224 225 226 228 229 234 235 238 239 242 244 246 247
##  [89] 248 249 250 251 253 259 260 262 263 265 274 275 280 281 282 286 287 288 291 292 293 296
## [111] 302 304 305 307 334 335 343 346 348 351 354 355 356 358 361 362 382 389 390 391 396 397
## [133] 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419
## [155] 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 442 451 452
## [177] 454 459 463 465 466 469 471 472 473 474 475 479 482 484 485 486 488 490 492 493 496 503
## [199] 505 506 507 509 517 518 519 525 526 527 528 531 535 540 543 545 548 549 557 558 559 560
## [221] 566 567 571 572 573 574 576 577 578 580 587 589 591 592 594 597 600 601 603 608 609 625
## [243] 626 627 637 638 639 640 641 643 645 749 757 764

chat_emoji_lines <- chat[which(grepl(emoji_regex, chat, useBytes = TRUE))]

found_emoji <- ore.search(compiled, chat_emoji_lines, all=TRUE)
emoji_matches <- matches(found_emoji)

str(emoji_matches, 1)
## List of 254
##  $ : chr [1:4] "\U0001f600" "\U0001f600" "\U0001f44d" "\U0001f44d"
##  $ : chr "\U0001f648"
##  $ : chr [1:2] "\U0001f44d" "\U0001f44d"
##  $ : chr "\U0001f602"
##  $ : chr [1:3] "\U0001f602" "\U0001f602" "\U0001f602"
##  $ : chr [1:4] "\U0001f44c" "\U0001f44c" "\U0001f44c" "\U0001f44c"
##  $ : chr [1:6] "\U0001f602" "\U0001f602" "\U0001f602" "\U0001f602" ...
##  $ : chr "\U0001f600"
##  $ : chr [1:5] "\U0001f604" "\U0001f604" "\U0001f604" "\U0001f603" ...
##  $ : chr "\U0001f44d"
## ...

data_frame(
  V2 = flatten_chr(emoji_matches) %>% 
    map(charToRaw) %>% 
    map(as.character) %>% 
    map(toupper) %>% 
    map(~sprintf("\\x%s", .x)) %>% 
    map_chr(paste0, collapse="")
) %>% 
  left_join(emoji) %>% 
  count(V3, sort=TRUE)
## # A tibble: 89 x 2
##                                                    V3     n
##                                                 <chr> <int>
##  1                             face with tears of joy   110
##  2                     smiling face with smiling eyes    50
##  3         face with stuck-out tongue and winking eye    43
##  4                                       musical note    42
##  5                                      birthday cake    35
##  6                    grinning face with smiling eyes    26
##  7 face with stuck-out tongue and tightly-closed eyes    24
##  8                                      grinning face    21
##  9                                            bouquet    17
## 10                                     thumbs up sign    17
## # ... with 79 more rows

来源：https://gist.github.com/hrbrmstr/e89eb173ae0333f50f94fe5086fedf8b

“ textclean”库提供了2个函数，可以用等同的单词替换表情符号。来源：https://cran.r-project.org/web/packages/textclean/textclean.pdf

来自cran-r的utf8程序包描述的另一个hit：

代码大于0xffff的字符（包括大多数表情符号，Windows支持。

有人可以使用其他方法，方向，包装/功能吗？

Answer 1

我为此在程序包rwhatsapp中编写了一个函数。

由于您的示例是whatsapp数据集，因此可以直接使用软件包（通过remotes::install_github("JBGruber/rwhatsapp")安装）进行测试

df <- rwhatsapp::rwa_read("_chat.txt")
#> Warning in readLines(x, encoding = encoding, ...): incomplete final line found
#> on '_chat.txt'
df
#> # A tibble: 392 x 6
#>    time                author    text             source       emoji  emoji_name
#>    <dttm>              <fct>     <chr>            <chr>        <list> <list>    
#>  1 2015-06-25 01:42:12 <NA>      : ‎Vishnu Gaud …  /home/johan… <NULL> <NULL>    
#>  2 2015-06-25 01:42:12 <NA>      : ‎You were added /home/johan… <NULL> <NULL>    
#>  3 2016-12-18 01:57:38 Shahain   :<‎image omitted> /home/johan… <NULL> <NULL>    
#>  4 2016-12-21 21:54:46 Pankaj S… :<‎image omitted> /home/johan… <NULL> <NULL>    
#>  5 2016-12-21 21:57:45 Shahain   :Wow             /home/johan… <NULL> <NULL>    
#>  6 2016-12-21 22:48:51 Sakshi    :<‎image omitted> /home/johan… <NULL> <NULL>    
#>  7 2016-12-21 22:49:00 Sakshi    :<‎image omitted> /home/johan… <NULL> <NULL>    
#>  8 2016-12-21 22:50:12 Neha Wip… :Awsum😀😀👍🏼👍🏼   /home/johan… <chr … <chr [4]> 
#>  9 2016-12-21 22:51:21 Sakshi    :🙈              /home/johan… <chr … <chr [1]> 
#> 10 2016-12-21 22:57:01 Ganguly   :🙂🙂👍🏻👍🏻        /home/johan… <chr … <chr [4]> 
#> # … with 382 more rows

我从文本中提取empjis，并将它们存储在列表列中，因为每个文本可以包含多个表情符号。使用unnest取消嵌套列表列。

library(tidyverse)
df %>% 
  select(time, emoji) %>% 
  unnest(emoji)
#> # A tibble: 654 x 2
#>    time                emoji
#>    <dttm>              <chr>
#>  1 2016-12-21 22:50:12 😀   
#>  2 2016-12-21 22:50:12 😀   
#>  3 2016-12-21 22:50:12 👍🏼   
#>  4 2016-12-21 22:50:12 👍🏼   
#>  5 2016-12-21 22:51:21 🙈   
#>  6 2016-12-21 22:57:01 🙂   
#>  7 2016-12-21 22:57:01 🙂   
#>  8 2016-12-21 22:57:01 👍🏻   
#>  9 2016-12-21 22:57:01 👍🏻   
#> 10 2016-12-21 23:28:51 😂   
#> # … with 644 more rows

您可以对任何文本使用此功能。您唯一需要做的就是将文本存储在data.frame列中的text中（我在这里使用tibble，因为它打印的更好）：

df <- tibble::tibble(
  text = readLines("/home/johannes/_chat.txt")
)
#> Warning in readLines("/home/johannes/_chat.txt"): incomplete final line found on
#> '/home/johannes/_chat.txt'
rwhatsapp:::rwa_add_emoji(df)
#> # A tibble: 764 x 3
#>    text                                                emoji     emoji_name
#>    <chr>                                               <list>    <list>    
#>  1 25/6/15, 1:42:12 AM: ‎Vishnu Gaud created this group <NULL>    <NULL>    
#>  2 25/6/15, 1:42:12 AM: ‎You were added                 <NULL>    <NULL>    
#>  3 18/12/16, 1:57:38 AM: Shahain: <‎image omitted>      <NULL>    <NULL>    
#>  4 21/12/16, 9:54:46 PM: Pankaj Sinha: <‎image omitted> <NULL>    <NULL>    
#>  5 21/12/16, 9:57:45 PM: Shahain: Wow                  <NULL>    <NULL>    
#>  6 21/12/16, 10:48:51 PM: Sakshi: <‎image omitted>      <NULL>    <NULL>    
#>  7 21/12/16, 10:49:00 PM: Sakshi: <‎image omitted>      <NULL>    <NULL>    
#>  8 21/12/16, 10:50:12 PM: Neha Wipro: Awsum😀😀👍🏼👍🏼    <chr [4]> <chr [4]> 
#>  9 21/12/16, 10:51:21 PM: Sakshi: 🙈                   <chr [1]> <chr [1]> 
#> 10 21/12/16, 10:57:01 PM: Ganguly: 🙂🙂👍🏻👍🏻            <chr [4]> <chr [4]> 
#> # … with 754 more rows

更多详细信息

under the hood的工作方式是使用简单的字典和匹配方法。首先，我将文本拆分为字符，然后将这些字符与行ID一起放入data.frame中（这是对unnest_tokens从tidytext的重写）：

lines <- readLines("/home/johannes/_chat.txt")
#> Warning in readLines("/home/johannes/_chat.txt"): incomplete final line found on
#> '/home/johannes/_chat.txt'
id <- seq_along(lines)
l <- stringi::stri_split_boundaries(lines, type = "character")

out <- tibble(id = rep(id, sapply(l, length)), emoji = unlist(l))

然后我将这些字符与表情符号字符的数据集进行匹配（有关更多信息，请参见?rwhatsapp::emojis：

out <- add_column(out,
                  emoji_name = rwhatsapp::emojis$name[
                    match(out$emoji,
                          rwhatsapp::emojis$emoji)
                    ])
out
#> # A tibble: 28,652 x 3
#>       id emoji emoji_name
#>    <int> <chr> <chr>     
#>  1     1 "2"   <NA>      
#>  2     1 "5"   <NA>      
#>  3     1 "/"   <NA>      
#>  4     1 "6"   <NA>      
#>  5     1 "/"   <NA>      
#>  6     1 "1"   <NA>      
#>  7     1 "5"   <NA>      
#>  8     1 ","   <NA>      
#>  9     1 " "   <NA>      
#> 10     1 "1"   <NA>      
#> # … with 28,642 more rows

现在未找到表情符号时，新列将包含表情符号或NA。删除NA只剩下表情符号。

out <- out[!is.na(out$emoji_name), ]
out
#> # A tibble: 656 x 3
#>       id emoji emoji_name                       
#>    <int> <chr> <chr>                            
#>  1     8 😀    grinning face                    
#>  2     8 😀    grinning face                    
#>  3     8 👍🏼    thumbs up: medium-light skin tone
#>  4     8 👍🏼    thumbs up: medium-light skin tone
#>  5     9 🙈    see-no-evil monkey               
#>  6    10 🙂    slightly smiling face            
#>  7    10 🙂    slightly smiling face            
#>  8    10 👍🏻    thumbs up: light skin tone       
#>  9    10 👍🏻    thumbs up: light skin tone       
#> 10    11 😂    face with tears of joy           
#> # … with 646 more rows

这种方法的缺点是您依赖表情符号数据的完整性。但是，pacakge中的数据集包括来自unicode网站（版本13）的所有已知表情符号。

从R中的推文中提取表情符号

问题描述投票：0回答：1

1个回答

更多详细信息

最新问题

从R中的推文中提取表情符号

问题描述 投票：0回答：1

1个回答

更多详细信息

最新问题

问题描述投票：0回答：1