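I have a data.frame in R with a single column whose records are broken across rows, and I want to split it into three columns using regular expressions. For example: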
dados <- c(
  "N 2022NE001264 75",
  "FRETES INTERNACIONAIS LTDA 7.500,00 C",
  "N 2022NE000286 84",
  "UNIRIO UNIVERSIDADE FEDERAL DO ESTADO RJ 2.856,71 C",
  "N 2022NE001297 48 720,00 C",
  "N 2022NE001333 16",
  "CASTRO COMERCIO LTDA 5.256,00 C",
  "N 2022NE001353 92",
  "CONSTRUCOES E INSTALACOES LTDA 734,20 C",
  "N 2022NE000279 12",
  "UNIRIO UNIVERSIDADE FEDERAL DO ESTADO RJ 180,00 C",
  "N 2022NE000293 12",
  "EQUIPAMENTOS E PRODUTOS PARA LABORATORIOS L 1.716,00 C"
)
dados_df <- data.frame(V1 = dados)
Using regex, I want to transform my data.frame into this:
nota               org                                          value
N 2022NE001264 75  FRETES INTERNACIONAIS LTDA                   7500.00 C
N 2022NE000286 84  UNIRIO UNIVERSIDADE FEDERAL DO ESTADO RJ     2856.71 C
N 2022NE001297 48                                               720.00 C
N 2022NE001333 16  CASTRO COMERCIO LTDA                         5256.00 C
N 2022NE001353 92  CONSTRUCOES E INSTALACOES LTDA               734.20 C
N 2022NE000279 12  UNIRIO UNIVERSIDADE FEDERAL DO ESTADO RJ     180.00 C
N 2022NE000293 12  EQUIPAMENTOS E PRODUTOS PARA LABORATORIOS L  1716.00 C
This is the code I have developed so far:
library(dplyr)
library(tidyr)
library(stringr)
dados_df <- dados_df %>%
  mutate(
    nota = str_extract(V1, "^N\\s\\d+NE\\d+\\s\\d+\\s*"),
    org = str_extract(V1, "(?<=\\d\\s{1,}).*(?=\\s{2,}\\d)"),
    value = str_extract(V1, "\\d+\\.?\\d*,\\d{2}\\sC$")
  )
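Based on the example, I am making the following assumptions about your data:
- every record starts with a line that contains the nota variable
- a record may continue on a second line, which then holds the org and value variables
- a record without a second line has whatever it has of the org and value variables on the first line, after nota
If these assumptions hold, then we can:
- flag the lines that match the pattern of the nota variable
- create an id variable to uniquely identify each record
- use pivot_wider() to put the two strings of each record on one row
- paste the record together
dados_df <- dados_df %>%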
  mutate(
    # flag which line we are reading, line1 or line2
    flag = ifelse(str_detect(V1, "^N\\s\\d+NE\\d+\\s\\d+\\s*"), "line1", "line2"),
    # create a record identifier: the counter increases at every line1
    id = cumsum(flag == "line1")
  ) %>%
  # pivot to get each record onto 1 line
  pivot_wider(id_cols = id, names_from = flag, values_from = V1) %>%
  # replace NA with "" to avoid problems with paste below
  replace_na(list(line2 = "")) %>%
  mutate(
    # join the 2 strings into 1
    line = paste(line1, line2),
    # put the final \\s* inside a look-ahead group to remove trailing blanks from nota
    nota = str_extract(line, "^N\\s\\d+NE\\d+\\s\\d+(?=\\s*)"),
    # the look-behind group needed to be changed here because the nota and org are now
    # part of a single string; \\s+ in the look-ahead also accepts a single space
    # between org and value
    org = str_trim(str_extract(line, "(?<=\\d\\s\\d{2}\\s).*(?=\\s+\\d)")),
    # added a look-ahead group at the end to remove any spaces created by paste
    value = str_extract(line, "\\d+\\.?\\d*,\\d{2}\\sC(?=\\s?$)")
  ) %>%
  select(nota, org, value)
dados_df
# A tibble: 7 × 3
  nota              org                                         value
  <chr>             <chr>                                       <chr>
1 N 2022NE001264 75 FRETES INTERNACIONAIS LTDA                  7.500,00 C
2 N 2022NE000286 84 UNIRIO UNIVERSIDADE FEDERAL DO ESTADO RJ     2.856,71 C
3 N 2022NE001297 48 NA                                          720,00 C
4 N 2022NE001333 16 CASTRO COMERCIO LTDA                        5.256,00 C
5 N 2022NE001353 92 CONSTRUCOES E INSTALACOES LTDA              734,20 C
6 N 2022NE000279 12 UNIRIO UNIVERSIDADE FEDERAL DO ESTADO RJ     180,00 C
7 N 2022NE000293 12 EQUIPAMENTOS E PRODUTOS PARA LABORATORIOS L 1.716,00 C
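
The desired output in the question shows the amounts with a decimal point (for example 2856.71) rather than the Brazilian format kept above (2.856,71). If a numeric column is wanted, one possible follow-up step is sketched below in base R. The helper name parse_value is made up for this illustration and is not part of the answer above. It assumes that "." only ever appears as a thousands separator, that "," is the decimal mark, and that each amount ends in " C".

```r
# Hypothetical helper (an assumption, not from the original answer):
# turn strings such as "7.500,00 C" into numeric values.
parse_value <- function(x) {
  num <- sub("\\s*C\\s*$", "", x)          # drop the trailing " C" marker
  num <- gsub(".", "", num, fixed = TRUE)  # remove thousands separators
  num <- sub(",", ".", num, fixed = TRUE)  # decimal comma -> decimal point
  as.numeric(num)                          # NA input stays NA
}

# A few values copied from the example data, plus a missing one
value <- c("7.500,00 C", "2.856,71 C", "720,00 C", NA)
parse_value(value)
```

Inside the pipeline this could be applied with mutate(value = parse_value(value)). If readr is available, readr::parse_number() with locale(decimal_mark = ",", grouping_mark = ".") should give the same numbers without a custom helper.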