Merging many pairs of datasets into individual datasets

Question (votes: 0, answers: 3)

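I have about 100 pairs of datasets that each need to be merged into an individual dataset. I have looked at posts showing how to merge multiple datasets at once (e.g., here, here), but my problem is different. My real-world data are stored on my hard drive and are named similarly (e.g., household2010, household2011, household2012, person2010, person2011, person2012). They do not need to be loaded into the global environment. An example is below.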

Working data:

library(tidyverse)

household2010 <- tribble(
~id, ~var2, ~var3, ~var4, ~var5,
"1",   "1",   "1",   "a",   "d",
"2",   "2",   "2",   "b",   "e",
"3",   "3",   "3",   "c",   "f"
)

person2010 <- tribble(
~id, ~var6, ~var7,
"1",   "1",   "1",
"2",   "2",   "2",
"3",   "3",   "3",
"4",   "4",   "4"
)

household2011 <- tribble(
~id, ~var8, ~var9, ~var10,
"1",   "1",   "1",    "1",
"2",   "2",   "2",    "2",
"3",   "3",   "3",    "3",
"4",   "4",   "4",    "4"
)

person2011 <- tribble(
~id, ~var11, ~var12, ~var13,
"1",   "1",   "1",    "1",
"2",   "2",   "2",    "2",
"3",   "3",   "3",    "3",
"4",   "4",   "4",    "4",
"5",   "5",   "5",    "5"
 )

I need to merge household2010 with person2010 and create a new dataset called hhperson2010. I also need to do the same for household2011 and person2011. I can do this one pair at a time:

hhperson2010 <- left_join(household2010, person2010, by = "id")

hhperson2011 <- left_join(household2011, person2011, by = "id")

This becomes unwieldy when I have more than 100 pairs. Can I use lapply to run through lists of datasets and merge them? Something like:

dflist1 <- list(household2010, household2011)

dflist2 <- list(person2010,    person2011)

lapply(function(x) left_join(dflist, dflist2, by = "id")
Tags: r, merge
3 Answers
1
vote

Maybe something like this:

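library(dplyr)

years <- 2010:2011

# For each year, look up both data frames by name and left-join them on id.
# get() requires the objects to exist in the current environment.
result <- lapply(years, 
              function(x) left_join(get(paste0("household", x)), 
                                    get(paste0("person", x)), 
                                    by = "id"))

# Name each element after the dataset it represents, e.g. hhperson2010
names(result) <- paste0("hhperson", years)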
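
If the tables are already held in two lists, as in the question's attempt, a minimal sketch with base R's Map() might look like the following. The toy data frames and the list names `household` and `person` are assumptions made for illustration, not the asker's real data.

```r
library(dplyr)

# Hypothetical toy data standing in for the real files, keyed by year
household <- list(
  `2010` = data.frame(id = c("1", "2", "3"), var2 = c("a", "b", "c")),
  `2011` = data.frame(id = c("1", "2"), var8 = c("x", "y"))
)
person <- list(
  `2010` = data.frame(id = c("1", "2", "3", "4"), var6 = c("p", "q", "r", "s")),
  `2011` = data.frame(id = c("1", "2", "3"), var11 = c("u", "v", "w"))
)

# Map() walks both lists in parallel, so household[[i]] is joined to person[[i]]
hhperson <- Map(function(hh, p) left_join(hh, p, by = "id"), household, person)

# Rename the elements to hhperson2010, hhperson2011, ...
names(hhperson) <- paste0("hhperson", names(household))
```

Unlike lapply(), which iterates over one list, Map() passes one element from each list to the function on every step, which is what a pairwise join needs. The two lists must be in the same order.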

0
votes

Here is my approach using the tidyverse and list-columns:

library(dplyr)
library(tidyr)
library(purrr)

env2listcol <- function(rdata_file)
{
  e <- new.env()
  load(rdata_file, envir = e)
  # since you know that there's only 1 df in each environment
  as.list.environment(e)[[1]] 
}

# assuming files are stored in `input` folder
tibble(value = dir("input", full.names = TRUE)) %>% 
  # split the path into directory, file name and extension
  separate(value, into=c("dir", "file", "ext"), remove=FALSE) %>% 
  # get the category and the year in separate columns
  extract(file, into=c("key", "year"), regex="([a-z]+)(\\d+)") %>% 
  # file path by category by year, remove unnecessary columns
  spread(key, value) %>% select(-dir, -ext) %>% 
  # extract dataframes from environments, and join them on id
  mutate(household=map(household, env2listcol),
         person=map(person, env2listcol),
         joined=map2(household, person, left_join, by = "id")) %>% 
  # keep the year and the joined tables, then rbind them
  # (you could also pull(joined) or imap over it)
  select(year, joined) %>% 
  unnest(joined)

#> # A tibble: 7 x 14
#>    year    id  var2  var3  var4  var5  var6  var7  var8  var9 var10 var11 var12 var13
#>   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1  2010     1     1     1     a     d     1     1  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
#> 2  2010     2     2     2     b     e     2     2  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
#> 3  2010     3     3     3     c     f     3     3  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>
#> 4  2011     1  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>     1     1     1     1     1     1
#> 5  2011     2  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>     2     2     2     2     2     2
#> 6  2011     3  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>     3     3     3     3     3     3
#> 7  2011     4  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>     4     4     4     4     4     4

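What you do with it is up to you. You can write it back to an R object (please, please... use Rds instead). You can write it back to a table (I believe that is easier to handle). You could even export it as JSON.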
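
As a rough sketch of the Rds suggestion, each joined table could be written to its own .rds file with saveRDS() and read back with readRDS(). The `hhperson` list, its contents, and the output folder below are assumptions made for illustration.

```r
# Hypothetical list of joined tables, named after the datasets wanted
hhperson <- list(
  hhperson2010 = data.frame(id = c("1", "2"), var2 = c("a", "b")),
  hhperson2011 = data.frame(id = c("1", "2", "3"), var8 = c("x", "y", "z"))
)

# Hypothetical output folder (a temporary directory, to keep the sketch harmless)
out_dir <- file.path(tempdir(), "output")
dir.create(out_dir, showWarnings = FALSE)

# Write one .rds file per joined table
for (nm in names(hhperson)) {
  saveRDS(hhperson[[nm]], file.path(out_dir, paste0(nm, ".rds")))
}

# An .rds file holds a single object, so it can be read back under any name
check <- readRDS(file.path(out_dir, "hhperson2010.rds"))
```

Because readRDS() returns the object instead of restoring it under its saved name, nothing is silently created or overwritten in the global environment, which is the usual argument for .rds over .RData.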


0
votes
Just an alternative solution:
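years <- c("2010", "2011")  # any set of years works, consecutive or not

for (x in years){
  # all.x = TRUE keeps every household row, like left_join()
  result <- merge(get(paste0("household", x)), get(paste0("person", x)),
                  by = "id", all.x = TRUE)
  name <- paste0("hhperson", x)
  assign(name, result)  # creates hhperson2010, hhperson2011, ...
  print(name)
  print(result)
}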

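You can choose between a loop and lapply depending on your further processing. If nothing more needs to be done with the datasets, I think lapply alone will serve the purpose.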
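
To make the loop versus lapply choice concrete, here is a small sketch that builds the same named list both ways. The `data` list and its toy contents are assumptions for illustration; with real data the tables would come from disk or from get().

```r
# Hypothetical toy tables, stored in one list and looked up by name
data <- list(
  household2010 = data.frame(id = c("1", "2", "3"), var2 = c("a", "b", "c")),
  person2010    = data.frame(id = c("1", "2", "3", "4"), var6 = c("p", "q", "r", "s")),
  household2011 = data.frame(id = c("1", "2"), var8 = c("x", "y")),
  person2011    = data.frame(id = c("1", "2", "3"), var11 = c("u", "v", "w"))
)
years <- c("2010", "2011")

# Helper: left-join the household and person tables for one year
join_year <- function(x) {
  merge(data[[paste0("household", x)]], data[[paste0("person", x)]],
        by = "id", all.x = TRUE)
}

# Loop version: fill a named list one element at a time
res_loop <- list()
for (x in years) {
  res_loop[[paste0("hhperson", x)]] <- join_year(x)
}

# lapply version: build the list in one call, then name it
res_lapply <- lapply(years, join_year)
names(res_lapply) <- paste0("hhperson", years)

# Both approaches produce the same result
same <- identical(res_loop, res_lapply)
```

The loop is convenient when each iteration has side effects such as printing or writing files. lapply is more compact when only the resulting list is needed.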
