Parsing a string with multiple parentheses


I need to parse a dataset dt that has a column "subject", for example:

ID    subject   

1     USA(Texas)(Austin)
2     USA(California)(Sacramento)

As a result, I would like to get the following table:

ID    subject                       Country     State        Capital   

1     USA(Texas)(Austin)            USA         Texas        Austin
2     USA(California)(Sacramento)   USA         California   Sacramento

When there is only one value in parentheses, I use this expression:

dt <- tidyr::extract(dt, subject, into = c("var1", "var2"), regex = "(.*)\\((.*)\\)", remove = FALSE)

But how do I change it when there are several parenthesized expressions?

r regex extract
1 Answer

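Because there are several parenthesized groups to extract data from, make the quantifier inside the first pair of parentheses lazy (.*?), so that it matches as little as possible.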

library(dplyr)
library(tidyr)

extract(dt, subject, into = c("Country", "State", "Capital"),
        regex = "(.*)\\((.*?)\\)\\((.*)\\)", remove = FALSE)

#  ID                     subject Country      State    Capital
#1  1          USA(Texas)(Austin)     USA      Texas     Austin
#2  2 USA(California)(Sacramento)     USA California Sacramento
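
To see what each pattern captures, here is a minimal sketch in base R. It uses regmatches() and regexec() with perl = TRUE instead of tidyr, purely so that it runs without extra packages. The third pattern, built from negated character classes, is not part of the answer above; it is an alternative that is assumed to fit only inputs shaped like the sample data.

```r
# Sample values taken from the question
subject <- c("USA(Texas)(Austin)", "USA(California)(Sacramento)")

# Helper (name is illustrative): return the capture groups for each string,
# dropping the first element, which is the full match
capture <- function(pattern, x) {
  lapply(regmatches(x, regexec(pattern, x, perl = TRUE)), function(m) m[-1])
}

# The question's one-group pattern: the greedy first (.*) also swallows
# the first parenthesized group
capture("(.*)\\((.*)\\)", subject)[[1]]
# [1] "USA(Texas)" "Austin"

# The pattern from the answer, with a lazy quantifier in the middle group
capture("(.*)\\((.*?)\\)\\((.*)\\)", subject)[[1]]
# [1] "USA"    "Texas"  "Austin"

# Alternative (an assumption, not from the answer): forbid parentheses
# inside each group, so no group can run past its own delimiters
capture("([^()]*)\\(([^()]*)\\)\\(([^()]*)\\)", subject)[[2]]
# [1] "USA"        "California" "Sacramento"
```

The same pattern strings can be passed to the regex argument of extract().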

Another option, using separate:

dt %>%
  mutate(subject = trimws(gsub('[()]', ' ', subject))) %>%
  separate(subject, into = c("Country", "State", "Capital"), sep = "\\s+")
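
Splitting on whitespace works for the sample data, but it would presumably break a value that itself contains a space. A rough sketch of a variant that splits on the parentheses directly, using base R's strsplit(); the "New York" row is a made-up input for illustration and does not come from the question.

```r
# Second value is hypothetical, added to show a name containing a space
subject <- c("USA(Texas)(Austin)", "USA(New York)(Albany)")

# Split on runs of parentheses; strsplit() does not produce a trailing
# empty string for the closing ")" at the end of each value
parts <- strsplit(subject, "[()]+")
parts[[2]]
# [1] "USA"      "New York" "Albany"

# Bind the pieces into a data frame with the column names from the question
out <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(out) <- c("Country", "State", "Capital")
```

With tidyr, the equivalent idea would be to pass a parenthesis-based pattern as sep to separate() instead of replacing the parentheses with spaces first.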

Data

dt <- structure(list(ID = 1:2, subject = structure(2:1, 
.Label = c("USA(California)(Sacramento)", "USA(Texas)(Austin)"), 
class = "factor")), class = "data.frame", row.names = c(NA, -2L))