Map strings in one vector (substrings of longer strings in another vector) onto the same substrings in the longer strings

Question

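In my data, the strings in one column (TCU_clean) are substrings of the strings in another column (TCU_clean_collpsd). A third column (w_c7_collpsd), which is the key target column, contains the same substrings with part-of-speech (PoS) tags appended. For example, the first word in TCU_clean and TCU_clean_collpsd is like; in w_c7_collpsd it appears as like_VV0.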

My goal is to trim w_c7_collpsd (and also TCU_clean_collpsd, though that is secondary) so that the words in w_c7_collpsd map exactly onto the words in TCU_clean.

This is the desired output:

     Sequ   TCU_clean                   w_c7_collpsd                                        TCU_clean_collpsd                          
    <dbl>   <chr>                       <chr>                                               <chr>                                      
1     1     like I do n't understand    like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI       like I do n't understand
2     1     sorry                       sorry_JJ                                            sorry 
3     1     like how old 's your mom    like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1    like how old 's your mom
4     2     when was that               when_RRQ was_VBDZ that_DD1                          when was that

Any help with this task is appreciated. A tidyverse solution is preferred, but other solutions are welcome too.

Reproducible data:

df <- structure(list(
  Sequ = c(1, 1, 1, 2),
  TCU_clean = c("like I do n't understand", "sorry",
                "like how old 's your mom", "when was that"),
  w_c7_collpsd = c(
    "like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI sorry_JJ like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1",
    "like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI sorry_JJ like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1",
    "like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI sorry_JJ like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1",
    "when_RRQ was_VBDZ that_DD1"),
  TCU_clean_collpsd = c(
    "like I do n't understand sorry like how old 's your mom",
    "like I do n't understand sorry like how old 's your mom",
    "like I do n't understand sorry like how old 's your mom",
    "when was that")),
  row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))

Answer 1

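A list of valid tags would help, but based on the included example, removing all matches of the regular expression _[:upper:]+\\d? (an underscore followed by at least one uppercase character, optionally followed by a digit) seems to be sufficient.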

library(dplyr)
library(stringr)

df |> 
  mutate(TCU_clean_collpsd = str_remove_all(w_c7_collpsd, "_[:upper:]+\\d?")) |> 
  select(starts_with("TCU")) |> 
  # check if TCU_clean_collpsd indeed contains TCU_clean
  mutate(match = str_detect(TCU_clean_collpsd, fixed(TCU_clean)))
#> # A tibble: 4 × 3
#>   TCU_clean                TCU_clean_collpsd                               match
#>   <chr>                    <chr>                                           <lgl>
#> 1 like I do n't understand like I do n't understand sorry like how old 's… TRUE 
#> 2 sorry                    like I do n't understand sorry like how old 's… TRUE 
#> 3 like how old 's your mom like I do n't understand sorry like how old 's… TRUE 
#> 4 when was that            when was that                                   TRUE
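
The check above only confirms that each TCU_clean string occurs in the untagged text; it does not yet cut w_c7_collpsd down to the matching words. One possible way to do that trimming is sketched below in base R. It is not part of the original answer: trim_to_target is a hypothetical helper name, and the sketch assumes that tokens are separated by spaces, that every tag has the shape described above, and that the first matching run of words is the intended one (which would be wrong if the same utterance occurred twice within one collapsed string).

```r
# Hypothetical helper (not from the original answer): return the tagged tokens
# whose untagged words form the first run that equals the words in `target`.
trim_to_target <- function(tagged, target) {
  tokens <- strsplit(tagged, " +")[[1]]               # "like_VV0", "I_PPIS1", ...
  words  <- sub("_[[:upper:]]+[0-9]?$", "", tokens)   # "like", "I", ...
  wanted <- strsplit(trimws(target), " +")[[1]]
  n <- length(wanted)
  if (n == 0 || n > length(words)) return(NA_character_)
  # Slide a window of n words over the untagged sequence.
  for (start in seq_len(length(words) - n + 1)) {
    idx <- start:(start + n - 1)
    if (identical(words[idx], wanted)) {
      return(paste(tokens[idx], collapse = " "))
    }
  }
  NA_character_  # no matching run found
}

# Small stand-in for the question's data (assumption: plain data.frame).
long_tagged <- paste(
  "like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI sorry_JJ",
  "like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1"
)
demo <- data.frame(
  TCU_clean = c("like I do n't understand", "sorry",
                "like how old 's your mom", "when was that"),
  w_c7_collpsd = c(long_tagged, long_tagged, long_tagged,
                   "when_RRQ was_VBDZ that_DD1"),
  stringsAsFactors = FALSE
)

# Trim the tagged column row by row, then derive the untagged column from it.
demo$w_c7_trimmed <- mapply(trim_to_target, demo$w_c7_collpsd,
                            demo$TCU_clean, USE.NAMES = FALSE)
demo$TCU_clean_trimmed <- gsub("_[[:upper:]]+[0-9]?", "", demo$w_c7_trimmed)
print(demo[, c("TCU_clean", "w_c7_trimmed", "TCU_clean_trimmed")])
```

Matching on whole words rather than on raw character positions is what lets the third row pick up like_II instead of the earlier like_VV0. Inside a dplyr pipeline the same helper could probably be applied with purrr::map2_chr(w_c7_collpsd, TCU_clean, trim_to_target) in a mutate() call.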
