Mapping strings in one vector, which are substrings of longer strings in another vector, to the same substrings in the longer strings


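In my data, the strings in one column (TCU_clean) are substrings of the strings in another column (TCU_clean_collpsd). In a third column (w_c7_collpsd), which is the key target column, the same substrings occur but are extended by PoS tags. For example, the first word in both TCU_clean and TCU_clean_collpsd is like; in w_c7_collpsd it is given as like_VV0.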

My goal is to trim w_c7_collpsd (and also TCU_clean_collpsd, but that is secondary) so that the words in w_c7_collpsd map exactly onto the words in TCU_clean.

This is the desired output:

     Sequ   TCU_clean                   w_c7_collpsd                                        TCU_clean_collpsd                          
    <dbl>   <chr>                       <chr>                                               <chr>                                      
1     1     like I do n't understand    like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI       like I do n't understand
2     1     sorry                       sorry_JJ                                            sorry 
3     1     like how old 's your mom    like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1    like how old 's your mom
4     2     when was that               when_RRQ was_VBDZ that_DD1                          when was that 

Any help with this task is appreciated. A tidyverse solution is preferred, but other solutions are welcome too.

Reproducible data:

df <- structure(list(
  Sequ = c(1, 1, 1, 2),
  TCU_clean = c("like I do n't understand", "sorry", "like how old 's your mom", "when was that"),
  w_c7_collpsd = c(
    "like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI sorry_JJ like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1",
    "like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI sorry_JJ like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1",
    "like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI sorry_JJ like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1",
    "when_RRQ was_VBDZ that_DD1"
  ),
  TCU_clean_collpsd = c(
    "like I do n't understand sorry like how old 's your mom",
    "like I do n't understand sorry like how old 's your mom",
    "like I do n't understand sorry like how old 's your mom",
    "when was that"
  )
), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
r dplyr purrr
1 Answer

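A list of valid tags would help, but judging by the included example, removing all matches of the regex "_[:upper:]+\\d?" (an underscore followed by at least one uppercase character, optionally followed by a digit) seems to be enough.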
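One way to check that the pattern really covers every tag in the sample is to list the distinct matches and then look for tokens that still contain an underscore once the matches are removed. The following is only a sketch on the two distinct w_c7_collpsd strings from the question; the names tagged, pattern, tags and leftover are illustrative and not part of the original data.

```r
library(stringr)

# The two distinct tagged strings from the question's sample data
tagged <- c(
  "like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI sorry_JJ like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1",
  "when_RRQ was_VBDZ that_DD1"
)

# Same pattern as in the answer: underscore, uppercase letters, optional digit
pattern <- "_[:upper:]+\\d?"

# Inventory of the distinct tags the pattern picks up
tags <- unique(unlist(str_extract_all(tagged, pattern)))
print(tags)

# Tokens that still contain an underscore after removal would point to
# tags the pattern does not fully cover
stripped <- str_remove_all(tagged, pattern)
leftover <- str_subset(unlist(str_split(stripped, " ")), "_")
print(leftover)
```

If leftover comes back empty, the pattern covers the sample. Tags of a different shape in the full data (for instance with more than one trailing digit) would show up there and would call for a wider pattern.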

library(dplyr)
library(stringr)

df |> 
  mutate(TCU_clean_collpsd = str_remove_all(w_c7_collpsd, "_[:upper:]+\\d?")) |> 
  select(starts_with("TCU")) |> 
  # check if TCU_clean_collpsd indeed contains TCU_clean
  mutate(match = str_detect(TCU_clean_collpsd, fixed(TCU_clean)))
#> # A tibble: 4 × 3
#>   TCU_clean                TCU_clean_collpsd                               match
#>   <chr>                    <chr>                                           <lgl>
#> 1 like I do n't understand like I do n't understand sorry like how old 's… TRUE 
#> 2 sorry                    like I do n't understand sorry like how old 's… TRUE 
#> 3 like how old 's your mom like I do n't understand sorry like how old 's… TRUE 
#> 4 when was that            when was that                                   TRUE
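
The check above only confirms that each TCU_clean is contained in the untagged string; it does not yet cut w_c7_collpsd down to the words of TCU_clean, which is what the desired output in the question shows. A possible sketch of that step follows, assuming every TCU_clean occurs as one contiguous run of tokens in its row's w_c7_collpsd. The helper trim_tagged is hypothetical (it is not part of any package), and the sample data is rebuilt inline so the block runs on its own.

```r
library(dplyr)
library(purrr)
library(stringr)

tag_pattern <- "_[:upper:]+\\d?"

# Hypothetical helper: return the tagged tokens whose untagged forms match
# the words of `clean` as one contiguous run; NA if no such run exists.
trim_tagged <- function(tagged, clean) {
  tagged_tokens <- strsplit(tagged, " ", fixed = TRUE)[[1]]
  plain_tokens  <- str_remove(tagged_tokens, tag_pattern)
  target        <- strsplit(clean, " ", fixed = TRUE)[[1]]
  n <- length(target)
  if (n == 0 || n > length(tagged_tokens)) return(NA_character_)

  # Candidate start positions; keep the first one where the whole run matches
  starts <- seq_len(length(tagged_tokens) - n + 1)
  hit <- detect(starts, \(i) all(plain_tokens[i:(i + n - 1)] == target))
  if (is.null(hit)) return(NA_character_)

  paste(tagged_tokens[hit:(hit + n - 1)], collapse = " ")
}

# Sample data from the question, rebuilt compactly
tagged_1 <- "like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI sorry_JJ like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1"
df <- tibble(
  Sequ = c(1, 1, 1, 2),
  TCU_clean = c("like I do n't understand", "sorry",
                "like how old 's your mom", "when was that"),
  w_c7_collpsd = c(rep(tagged_1, 3), "when_RRQ was_VBDZ that_DD1")
)

result <- df |>
  mutate(
    # trim the tagged string to the run that matches TCU_clean
    w_c7_collpsd = map2_chr(w_c7_collpsd, TCU_clean, trim_tagged),
    # derive the untagged version from the trimmed tagged string
    TCU_clean_collpsd = str_remove_all(w_c7_collpsd, tag_pattern)
  )

print(result)
```

Matching on the whole token run rather than on single words is what lets row 3 pick up like_II instead of the earlier like_VV0. The helper always takes the first matching run, so if the same utterance occurs twice within one sequence, both rows would map to the first occurrence; handling that would need extra bookkeeping per Sequ.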