I need to analyze a dataset dt that has a column "subject", for example:
ID subject
1 USA(Texas)(Austin)
2 USA(California)(Sacramento)
So I would like to get the following table:
ID subject Country State Capital
1 USA(Texas)(Austin) USA Texas Austin
2 USA(California)(Sacramento) USA California Sacramento
When there is only one value in parentheses, I use this expression:
dt <- tidyr::extract(dt, subject, into = c("var1", "var2"), regex = "(.*)\\((.*)\\)", remove = FALSE)
But how do I change it when there is more than one parenthesised expression?
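Since you have several sets of parentheses to extract data from, you need to make the regex lazy, so that a capture group stops at the first closing parenthesis instead of running on to the last one.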
library(dplyr)
library(tidyr)
# (.*?) is lazy, so State stops at the first closing parenthesis
extract(dt, subject, into = c("Country", "State", "Capital"),
        regex = "(.*)\\((.*?)\\)\\((.*)\\)", remove = FALSE)
# ID subject Country State Capital
#1 1 USA(Texas)(Austin) USA Texas Austin
#2 2 USA(California)(Sacramento) USA California Sacramento
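The same extraction can be sketched in base R without relying on lazy quantifiers, by using negated character classes so that no group can run across a parenthesis. This is only an illustration, assuming every subject has exactly two parenthesised parts; the names pattern, parts and result are made up for the example and are not part of the answer above.

```r
# Sample data mirroring the question (plain character column, not a factor)
dt <- data.frame(
  ID = 1:2,
  subject = c("USA(Texas)(Austin)", "USA(California)(Sacramento)"),
  stringsAsFactors = FALSE
)

# [^()]* matches any run of characters that contains no parenthesis,
# so each group is bounded by the parentheses around it.
pattern <- "^([^()]*)\\(([^()]*)\\)\\(([^()]*)\\)$"

# regexec() returns the full match plus one entry per capture group
m <- regmatches(dt$subject, regexec(pattern, dt$subject))

# Drop the full match and bind the groups into a character matrix.
# Assumption: every row matches; a non-matching row would yield character(0).
parts <- do.call(rbind, lapply(m, function(x) x[-1]))
colnames(parts) <- c("Country", "State", "Capital")

result <- cbind(dt, as.data.frame(parts, stringsAsFactors = FALSE))
print(result)
```

Unlike the extract call, this version anchors the pattern with ^ and $, so a subject with a different number of parenthesised parts does not match at all rather than matching partially.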
Another option, using separate:
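dt %>%
  # Replace every parenthesis with a space, then trim the trailing space
  mutate(subject = trimws(gsub('[()]', ' ', subject))) %>%
  # Split on runs of whitespace; subject itself is replaced by the new columns
  separate(subject, into = c("Country", "State", "Capital"), sep = "\\s+")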
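One thing to keep in mind with the whitespace-based split is that it also splits values that contain a space themselves. The base R sketch below uses a hypothetical extra row, "USA(New York)(Albany)", which is not in the original data, to show the difference between splitting on whitespace and splitting on the parentheses directly.

```r
# Hypothetical input: the second value contains a space inside the parentheses
subject <- c("USA(Texas)(Austin)", "USA(New York)(Albany)")

# Whitespace approach: parentheses become spaces, then split on whitespace.
# "New York" is broken into two pieces, so that row yields four parts.
by_space <- strsplit(trimws(gsub("[()]", " ", subject)), "\\s+")

# Alternative: split on runs of parentheses, which keeps inner spaces intact.
# strsplit() does not emit an empty piece for a match at the end of the string.
by_paren <- strsplit(subject, "[()]+")

parts <- do.call(rbind, by_paren)
colnames(parts) <- c("Country", "State", "Capital")
print(lengths(by_space))
print(parts)
```

If the real data can contain spaces inside a value, splitting on the parentheses (or using the extract approach) is the safer choice.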
Data
dt <- structure(list(ID = 1:2, subject = structure(2:1,
.Label = c("USA(California)(Sacramento)", "USA(Texas)(Austin)"),
class = "factor")), class = "data.frame", row.names = c(NA, -2L))