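My task is to identify which group a sentence belongs to based on the specific words it uses, for example which color is used to describe an animal. I have a dictionary of the words I want to detect this way: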
df <- data.frame(id = c(1:5), pets = c("brown dog", "black cat", "orange cat", "black bird", "white hamster"))
dictionary <- c("black", "orange", "white", "brown", "green", "red")
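I need to match the pets against the dictionary, which indicates the category each one belongs to, so that my final df looks like this: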
final_df <- data.frame(id = c(1:5),
pets = c("brown dog", "black cat", "orange cat", "black bird", "white hamster"),
color = c("brown", "black", "orange", "black", "white"))
Using the stringr package:
library(stringr)
regex <- str_c("\\b", dictionary, "\\b", collapse = "|")
color <- str_extract(df$pets, regex)
# "brown" "black" "orange" "black" "white"
In base R:
regex <- paste0(".*(", paste0("\\b", dictionary, "\\b", collapse = "|"), ").*")
color <- sub(regex, "\\1", df$pets)
# "brown" "black" "orange" "black" "white"
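One difference between the two approaches is worth noting: str_extract() returns NA when no dictionary word is found, whereas sub() returns the input string unchanged. The following is a minimal base R sketch that gives NA for unmatched rows and attaches the result as a column, using regexpr() and regmatches(). The sixth row ("purple fish") is a hypothetical addition, not part of the original data, included only to show the no-match case.

```r
# Example data from the question, plus one hypothetical row with no dictionary color
df <- data.frame(id = 1:6,
                 pets = c("brown dog", "black cat", "orange cat",
                          "black bird", "white hamster", "purple fish"))
dictionary <- c("black", "orange", "white", "brown", "green", "red")

# One alternation of all dictionary words, anchored on word boundaries
pattern <- paste0("\\b(", paste(dictionary, collapse = "|"), ")\\b")

# regexpr() gives the position of the first match per string, or -1 if none
m <- regexpr(pattern, df$pets, perl = TRUE)
hit <- as.vector(m) != -1L

# regmatches() returns only the matched substrings, so fill them into an NA vector
color <- rep(NA_character_, nrow(df))
color[hit] <- regmatches(df$pets, m)

df$color <- color
```

Rows without a dictionary word keep NA in the color column, which makes them easy to filter with is.na(df$color).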