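My raw data contains repeated column-name rows. I only want to extract the key fields, such as "ansemis" and "mafruit", keeping a single row of column names with the extracted data in the rows below it. However, some of the "masec(n)" values are missing from my raw data, so a given column does not always hold the same variable (as shown in Figure 1).
I want to extract the data from this table as shown in Figure 2.
I also have 77 csv files in the same format (named 1_modrapport, 2_modrapport, 3_modrapport, 4_modrapport ... 77_modrapport). I want to combine the extracted results from every csv and add a column "soil_ref" whose value is the number at the start of each *_modrapport file name: 1, 2, 3, 4, 5, ... 77.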
raw_data<-structure(list(P_usm = c("001_Pal_IRR1_N0", "P_usm", "001_Pal_IRR1_N0",
"P_usm", "001_Pal_IRR1_N0", "P_usm", "001_Pal_IRR1_N0", "P_usm",
"001_Pal_IRR1_N0"), wlieu = c("87_073_v3test", "wlieu", "87_073_v3test",
"wlieu", "87_073_v3test", "wlieu", "87_073_v3test", "wlieu",
"87_073_v3test"), ansemis = c("1980", "ansemis", "1981", "ansemis",
"1982", "ansemis", "1983", "ansemis", "1984"), CNgrain = c("7.7690000000000001",
"CNgrain", "6.4790000000000001", "CNgrain", "7.3739999999999997",
"CNgrain", "6.5549999999999997", "CNgrain", "6.5449999999999999"
), `masec(n)` = c("6.9470000000000001", "masec(n)", "7.7850000000000001",
"mafruit", "2.8279999999999998", "mafruit", "3.355", "mafruit",
"3.3410000000000002"), mafruit = c("2.3639999999999999", "mafruit",
"3.1230000000000002", NA, NA, NA, NA, NA, NA)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -9L))
My code:
library(dplyr)
library(readr)
process_csv_file <- function(file_path) {
  raw_data <- read_csv(file_path, col_types = cols(.default = "c"))
  column_names <- names(raw_data)
  num_rows <- nrow(raw_data)
  data_list <- lapply(seq(1, num_rows, by = 2), function(i) {
    if (i + 1 <= num_rows) {
      data_row <- raw_data[i + 1, ]
      tibble(
        P_usm = data_row$P_usm,
        ansemis = data_row$ansemis,
        mafruit = data_row$mafruit
      )
    }
  })
  data_combined <- bind_rows(data_list)
  return(data_combined)
}
path <- "C:/MyJavaSTICS/01_grid/Output_Results/MGIPallador_Results"
csv_files <- list.files(path, pattern = "*.csv", full.names = TRUE)
all_data <- lapply(csv_files, process_csv_file)
final_data <- bind_rows(all_data)
final_data_cleaned <- final_data %>%
  mutate(across(everything(), as.character))
print(head(final_data_cleaned))
write_csv(final_data_cleaned, "C:/MyJavaSTICS/01_grid/Output_Results/MGIPallador_Results/combined_data.csv")
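This is the final format I want: once all the data are combined, soil_ref should run from 1 to 77 rather than from 1 to 3. How should I adjust my code? Thanks in advance for your help!
Here is one approach. I group the rows into header/data pairs, reshape to long format, use each pair's header row to decide which column every value belongs to, drop the header rows, and then reshape back to wide format.
(I also add the column names as a first row, so the first group can be treated the same way as the others.)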
library(dplyr)  # mutate(.by = ) needs dplyr 1.1.0 or later
library(tidyr)  # pivot_longer(), pivot_wider()
rbind(colnames(raw_data), raw_data) |>
  # Use whatever test is most reliable to distinguish
  # header row vs data row
  mutate(header_row = P_usm == "P_usm", row_grp = cumsum(header_row)) |>
  pivot_longer(-(header_row:row_grp)) |>
  # Relabel each value with the name given by its own group's header row
  mutate(name = value[header_row], .by = c(row_grp, name)) |>
  filter(!header_row) |>
  pivot_wider(names_from = name, values_from = value)
Result
# A tibble: 5 × 9
header_row row_grp P_usm wlieu ansemis CNgrain `masec(n)` mafruit `NA`
<lgl> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 FALSE 1 001_Pal_IRR1_N0 87_073_v3test 1980 7.7690000000000001 6.94700000000… 2.3639… NA
2 FALSE 2 001_Pal_IRR1_N0 87_073_v3test 1981 6.4790000000000001 7.78500000000… 3.1230… NA
3 FALSE 3 001_Pal_IRR1_N0 87_073_v3test 1982 7.3739999999999997 NA 2.8279… NA
4 FALSE 4 001_Pal_IRR1_N0 87_073_v3test 1983 6.5549999999999997 NA 3.355 NA
5 FALSE 5 001_Pal_IRR1_N0 87_073_v3test 1984 6.5449999999999999 NA 3.3410… NA
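The result above still carries the helper columns and an empty `NA` column, and it covers a single table only. Below is a minimal sketch of how the same reshape might be wrapped up for all 77 files. It rests on a few assumptions that are not confirmed by the question: the files are called exactly `1_modrapport.csv` ... `77_modrapport.csv`, only `P_usm`, `ansemis` and `mafruit` are wanted, and every file parses like the sample `raw_data`. The names `extract_key_columns` and `combine_modrapport` are hypothetical helpers, and `process_csv_file` reuses the name from the question.

```r
library(dplyr)
library(tidyr)
library(readr)

# Hypothetical helper: applies the pair-wise reshape to one table and
# keeps only the columns asked for in the question.
extract_key_columns <- function(raw_data) {
  rbind(colnames(raw_data), raw_data) |>
    mutate(header_row = P_usm == "P_usm", row_grp = cumsum(header_row)) |>
    pivot_longer(-(header_row:row_grp)) |>
    mutate(name = value[header_row], .by = c(row_grp, name)) |>
    # Dropping cells whose header is NA avoids the empty `NA` column
    filter(!header_row, !is.na(name)) |>
    pivot_wider(names_from = name, values_from = value) |>
    select(P_usm, ansemis, mafruit)
}

# Reads one file and tags its rows with the number that starts the file name
# (assumption: names look like "12_modrapport.csv").
process_csv_file <- function(file_path) {
  raw_data <- read_csv(file_path, col_types = cols(.default = "c"))
  soil_ref <- as.integer(sub("_modrapport.*$", "", basename(file_path)))
  extract_key_columns(raw_data) |>
    mutate(soil_ref = soil_ref, .before = 1)
}

# Hypothetical wrapper: processes every matching file in `path` and sorts
# numerically, since list.files() returns names in alphabetical order
# (so "10_..." would otherwise come before "2_...").
combine_modrapport <- function(path) {
  csv_files <- list.files(path, pattern = "^[0-9]+_modrapport\\.csv$",
                          full.names = TRUE)
  bind_rows(lapply(csv_files, process_csv_file)) |>
    arrange(soil_ref)
}
```

With the folder from the question, `combine_modrapport(path)` should then give one table whose `soil_ref` runs from 1 to 77, ready to pass to `write_csv()`. Matching on `_modrapport` in the file pattern also keeps a previously written `combined_data.csv` in the same folder from being read back in.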