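In my data, the strings in one column (TCU_clean) are substrings of the strings in another column (TCU_clean_collpsd). A third column (w_c7_collpsd), which is the key target column, contains the same substrings, but each word is extended with a PoS tag. For example, the first word in TCU_clean and TCU_clean_collpsd is like; in w_c7_collpsd it appears as like_VV0.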
My goal is to trim w_c7_collpsd (and also TCU_clean_collpsd, though that is secondary) so that the words in w_c7_collpsd map exactly onto the words in TCU_clean.
This is the desired output:
   Sequ TCU_clean                w_c7_collpsd                                     TCU_clean_collpsd
  <dbl> <chr>                    <chr>                                            <chr>
1     1 like I do n't understand like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI    like I do n't understand
2     1 sorry                    sorry_JJ                                         sorry
3     1 like how old 's your mom like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1 like how old 's your mom
4     2 when was that            when_RRQ was_VBDZ that_DD1                       when was that
Any help with this task is appreciated. A tidyverse solution is preferred, but other solutions are welcome too.
Reproducible data:
df <- structure(list(Sequ = c(1, 1, 1, 2), TCU_clean = c("like I do n't understand",
"sorry", "like how old 's your mom", "when was that"), w_c7_collpsd = c("like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI sorry_JJ like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1",
"like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI sorry_JJ like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1",
"like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI sorry_JJ like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1",
"when_RRQ was_VBDZ that_DD1"), TCU_clean_collpsd = c("like I do n't understand sorry like how old 's your mom",
"like I do n't understand sorry like how old 's your mom", "like I do n't understand sorry like how old 's your mom",
"when was that")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
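A list of valid tags would help, but judging from the included example, removing all matches of the regular expression _[:upper:]+\\d? seems to be sufficient:
library(dplyr)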
library(stringr)

df |>
  mutate(TCU_clean_collpsd = str_remove_all(w_c7_collpsd, "_[:upper:]+\\d?")) |>
  select(starts_with("TCU")) |>
  # check if TCU_clean_collpsd indeed contains TCU_clean
  mutate(match = str_detect(TCU_clean_collpsd, fixed(TCU_clean)))
#> # A tibble: 4 × 3
#> TCU_clean TCU_clean_collpsd match
#> <chr> <chr> <lgl>
#> 1 like I do n't understand like I do n't understand sorry like how old 's… TRUE
#> 2 sorry like I do n't understand sorry like how old 's… TRUE
#> 3 like how old 's your mom like I do n't understand sorry like how old 's… TRUE
#> 4 when was that when was that TRUE
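The check above only confirms that each TCU_clean string is contained in the untagged text; it does not yet cut w_c7_collpsd down to the matching words. One possible way to do that is sketched below. It rests on assumptions that are not stated in the question: rows within each Sequ are in utterance order, tokens are separated by single spaces, and the TCU_clean rows of a Sequ, joined in order, cover the collapsed string exactly. The helper columns n_words, from and to are illustrative names only.

```r
library(dplyr)
library(stringr)

# Copy of the question's data (without TCU_clean_collpsd) so the sketch runs on its own
tagged_seq1 <- "like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI sorry_JJ like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1"
df <- tibble(
  Sequ = c(1, 1, 1, 2),
  TCU_clean = c("like I do n't understand", "sorry",
                "like how old 's your mom", "when was that"),
  w_c7_collpsd = c(tagged_seq1, tagged_seq1, tagged_seq1,
                   "when_RRQ was_VBDZ that_DD1")
)

trimmed <- df |>
  group_by(Sequ) |>
  mutate(
    # number of whitespace-separated words in this row's TCU_clean
    n_words = str_count(TCU_clean, "\\S+"),
    # position of this row's last and first word within the collapsed string
    to = cumsum(n_words),
    from = to - n_words + 1,
    # keep only the tagged tokens at those positions
    w_c7_collpsd = word(w_c7_collpsd, from, to),
    # rebuild the untagged column from the trimmed one (assumed tag pattern)
    TCU_clean_collpsd = str_remove_all(w_c7_collpsd, "_[:upper:]+\\d?")
  ) |>
  ungroup() |>
  select(-n_words, -from, -to)

trimmed
```

Slicing by word position rather than by string matching keeps repeated words apart (like occurs twice in the first Sequ with different tags), but it breaks if the tokenization of TCU_clean and w_c7_collpsd ever disagrees, so a sanity check that TCU_clean_collpsd equals TCU_clean after trimming is worth keeping.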