How can I extract numeric data from data with repeated column names?


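My raw data contain repeated column names. I only want to extract the key fields, such as "ansemis" and "mafruit", keeping a single header row with the extracted data in the rows below it. However, because some "masec(n)" values are missing from the raw data, the columns do not line up the same way in every row (see Figure 1).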

[Figure 1]

I would like to extract the data from this table as shown in Figure 2.

[Figure 2]

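I also have 77 CSV files in the same format (named sequentially 1_modrapport, 2_modrapport, 3_modrapport, 4_modrapport ... 77_modrapport). I want to combine the extracted results from every CSV and add a column "soil_ref" holding the number that precedes _modrapport in the file name (1, 2, 3, 4, 5, ... 77).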

raw_data <- structure(list(
  P_usm = c("001_Pal_IRR1_N0", "P_usm", "001_Pal_IRR1_N0", "P_usm",
            "001_Pal_IRR1_N0", "P_usm", "001_Pal_IRR1_N0", "P_usm",
            "001_Pal_IRR1_N0"),
  wlieu = c("87_073_v3test", "wlieu", "87_073_v3test", "wlieu",
            "87_073_v3test", "wlieu", "87_073_v3test", "wlieu",
            "87_073_v3test"),
  ansemis = c("1980", "ansemis", "1981", "ansemis", "1982", "ansemis",
              "1983", "ansemis", "1984"),
  CNgrain = c("7.7690000000000001", "CNgrain", "6.4790000000000001",
              "CNgrain", "7.3739999999999997", "CNgrain",
              "6.5549999999999997", "CNgrain", "6.5449999999999999"),
  `masec(n)` = c("6.9470000000000001", "masec(n)", "7.7850000000000001",
                 "mafruit", "2.8279999999999998", "mafruit", "3.355",
                 "mafruit", "3.3410000000000002"),
  mafruit = c("2.3639999999999999", "mafruit", "3.1230000000000002",
              NA, NA, NA, NA, NA, NA)
), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -9L))

My code:

library(dplyr)
library(readr)

process_csv_file <- function(file_path) {
  raw_data <- read_csv(file_path, col_types = cols(.default = "c"))
  column_names <- names(raw_data)
  num_rows <- nrow(raw_data)
  data_list <- lapply(seq(1, num_rows, by = 2), function(i) {
    if (i + 1 <= num_rows) {

      data_row <- raw_data[i + 1, ]

      tibble(
        P_usm = data_row$P_usm,
        ansemis = data_row$ansemis,
        mafruit = data_row$mafruit
      )
    }
  })
  
  data_combined <- bind_rows(data_list)
  
  return(data_combined)
}

path <- "C:/MyJavaSTICS/01_grid/Output_Results/MGIPallador_Results"

csv_files <- list.files(path, pattern = "*.csv", full.names = TRUE)

all_data <- lapply(csv_files, process_csv_file)
final_data <- bind_rows(all_data)
final_data_cleaned <- final_data %>%
  mutate(across(everything(), as.character))

print(head(final_data_cleaned))

write_csv(final_data_cleaned, "C:/MyJavaSTICS/01_grid/Output_Results/MGIPallador_Results/combined_data.csv")

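This is the final format I want: once all the data are combined, soil_ref should run from 1 to 77 rather than from 1 to 3. How should I adjust my code? Thanks in advance for your help!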

[Image: desired final format]

Tags: r, dplyr, data-binding, mutate
1 Answer

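Here is one approach: I group the rows into header/data pairs, reshape to long format, use each pair's header row to decide which column each value belongs to, drop the header rows, and then reshape back to wide format.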

(I also add the column names back as a first row, so the first group can be treated the same way as the others.)
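To see how the pairing works, here is a minimal sketch in base R on a made-up first column (the values "a", "b", "c" are placeholders, not from the question). Flagging header rows and taking a running sum gives every header and the data row beneath it the same group number.

```r
# Toy first column after the header has been added back as row 1
# (illustrative values only).
first_col <- c("P_usm", "a", "P_usm", "b", "P_usm", "c")

header_row <- first_col == "P_usm"  # TRUE on header rows
row_grp    <- cumsum(header_row)    # increases by 1 at every header row

data.frame(first_col, header_row, row_grp)
# row_grp is 1 1 2 2 3 3: each header shares a group with its data row
```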

library(dplyr)  # `.by` requires dplyr >= 1.1.0
library(tidyr)

rbind(colnames(raw_data), raw_data) |>
  # Use whatever test is most reliable to distinguish
  # header row vs data row
  mutate(header_row = P_usm == "P_usm", row_grp = cumsum(header_row)) |>
  pivot_longer(-(header_row:row_grp)) |>
  # Relabel each value with the header found in its own pair
  mutate(name = value[header_row], .by = c(row_grp, name)) |>
  filter(!header_row) |>
  pivot_wider(names_from = name, values_from = value)

Result

# A tibble: 5 × 9
  header_row row_grp P_usm           wlieu         ansemis CNgrain            `masec(n)`     mafruit `NA` 
  <lgl>        <int> <chr>           <chr>         <chr>   <chr>              <chr>          <chr>   <chr>
1 FALSE            1 001_Pal_IRR1_N0 87_073_v3test 1980    7.7690000000000001 6.94700000000… 2.3639… NA   
2 FALSE            2 001_Pal_IRR1_N0 87_073_v3test 1981    6.4790000000000001 7.78500000000… 3.1230… NA   
3 FALSE            3 001_Pal_IRR1_N0 87_073_v3test 1982    7.3739999999999997 NA             2.8279… NA   
4 FALSE            4 001_Pal_IRR1_N0 87_073_v3test 1983    6.5549999999999997 NA             3.355   NA   
5 FALSE            5 001_Pal_IRR1_N0 87_073_v3test 1984    6.5449999999999999 NA             3.3410… NA 
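
To cover the rest of the question (77 files plus a soil_ref column), the same idea could be wrapped in a function along these lines. This is only a sketch under some assumptions: the helper names `extract_one()` and `extract_all()` are made up for illustration, the files are assumed to be named like `12_modrapport.csv`, and each file is assumed to alternate header and data lines as in `raw_data`. Reading with `col_names = FALSE` keeps the first header line as an ordinary row, so no `rbind()` step is needed.

```r
library(dplyr)
library(tidyr)
library(readr)

# Hypothetical helper: tidy one file that alternates header and data lines.
extract_one <- function(file_path, keep = c("P_usm", "ansemis", "mafruit")) {
  # Every line, including the first header, becomes a row (columns X1, X2, ...).
  raw <- read_csv(file_path, col_names = FALSE,
                  col_types = cols(.default = "c"))

  raw |>
    mutate(header_row = X1 == "P_usm", row_grp = cumsum(header_row)) |>
    pivot_longer(-c(header_row, row_grp), names_to = "position") |>
    # Label each value with the header found at the same position in its pair
    mutate(column = value[header_row], .by = c(row_grp, position)) |>
    filter(!header_row, column %in% keep) |>
    select(row_grp, column, value) |>
    pivot_wider(names_from = column, values_from = value) |>
    mutate(
      # Assumption: the file name starts with the soil number
      soil_ref = as.integer(sub("_modrapport.*$", "", basename(file_path))),
      across(any_of(c("ansemis", "mafruit")), as.numeric)
    ) |>
    select(soil_ref, everything(), -row_grp)
}

# Hypothetical helper: process every *_modrapport.csv in a folder.
extract_all <- function(dir) {
  files <- list.files(dir, pattern = "_modrapport\\.csv$", full.names = TRUE)
  stopifnot(length(files) > 0)
  # list.files() sorts names as text (1, 10, 11, ...), so sort numerically here
  bind_rows(lapply(files, extract_one)) |>
    arrange(soil_ref)
}
```

Calling `extract_all(path)` with the `path` from the question should then give one table with soil_ref taken from each file name. Rows with fewer fields than the first line are padded with NA by readr, which may produce a parsing warning.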