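I have roughly 100 pairs of datasets that I need to merge into single datasets. I have looked at posts showing how to merge multiple datasets at once (e.g., here and here), but my problem is different. My real-world data is stored on my hard drive and the files are named similarly (e.g., household2010, household2011, household2012 and person2010, person2011, person2012). They don't need to be loaded into the global environment. An example is below.
Working data: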
library(tidyverse)
household2010 <- tribble(
~id, ~var2, ~var3, ~var4, ~var5,
"1", "1", "1", "a", "d",
"2", "2", "2", "b", "e",
"3", "3", "3", "c", "f"
)
person2010 <- tribble(
~id, ~var6, ~var7,
"1", "1", "1",
"2", "2", "2",
"3", "3", "3",
"4", "4", "4"
)
household2011 <- tribble(
~id, ~var8, ~var9, ~var10,
"1", "1", "1", "1",
"2", "2", "2", "2",
"3", "3", "3", "3",
"4", "4", "4", "4"
)
person2011 <- tribble(
~id, ~var11, ~var12, ~var13,
"1", "1", "1", "1",
"2", "2", "2", "2",
"3", "3", "3", "3",
"4", "4", "4", "4",
"5", "5", "5", "5"
)
I need to merge household2010 with person2010 and create a new dataset called hhperson2010. I also need to do the same for household2011 and person2011. I can do this one pair at a time:
hhperson2010 <- left_join(household2010, person2010, by = "id")
hhperson2011 <- left_join(household2011, person2011, by = "id")
This becomes unwieldy when I have more than 100 pairs of datasets. Can I use lapply to go through lists of datasets and merge them? Something like:
dflist1 <- list(household2010, household2011)
dflist2 <- list(person2010, person2011)
# walk both lists by position and join the matching pair
lapply(seq_along(dflist1),
       function(i) left_join(dflist1[[i]], dflist2[[i]], by = "id"))
Maybe something like this:
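years <- 2010:2011
# look up each pair by name and left-join it on id
result <- lapply(years,
                 function(x) left_join(get(paste0("household", x)),
                                       get(paste0("person", x)),
                                       by = "id"))
# name the results after the merged datasets the question asks for
names(result) <- paste0("hhperson", years)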
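Since the question says the files live on disk and need not be loaded into the global environment, the same idea can be sketched with each pair read inside the function. This is only a sketch under assumptions: the files are taken to be .rds files named like household2010.rds in a single folder, the helper join_year and the temporary folder are made up for illustration, and base R's merge(all.x = TRUE) stands in for left_join so the block has no package dependencies.

```r
# Hypothetical setup: write cut-down example tables to a temporary folder as
# .rds files so the sketch is self-contained. Real data would already be on disk.
data_dir <- file.path(tempdir(), "survey")
dir.create(data_dir, showWarnings = FALSE)
saveRDS(data.frame(id = c("1", "2", "3"), var2 = c("1", "2", "3")),
        file.path(data_dir, "household2010.rds"))
saveRDS(data.frame(id = c("1", "2", "3", "4"), var6 = c("1", "2", "3", "4")),
        file.path(data_dir, "person2010.rds"))
saveRDS(data.frame(id = c("1", "2", "3", "4"), var8 = c("1", "2", "3", "4")),
        file.path(data_dir, "household2011.rds"))
saveRDS(data.frame(id = c("1", "2", "3", "4", "5"), var11 = c("1", "2", "3", "4", "5")),
        file.path(data_dir, "person2011.rds"))

years <- 2010:2011

# Read one household/person pair and left-join it on id.
# The two tables only ever exist inside this function call.
join_year <- function(year, folder) {
  household <- readRDS(file.path(folder, paste0("household", year, ".rds")))
  person    <- readRDS(file.path(folder, paste0("person", year, ".rds")))
  merge(household, person, by = "id", all.x = TRUE)
}

hhperson <- lapply(years, join_year, folder = data_dir)
names(hhperson) <- paste0("hhperson", years)
```

Individual tables are then available as hhperson$hhperson2010 and so on, which avoids creating 100 separate objects.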
Here is my approach using tidyverse and list-columns:
library(dplyr)
library(tidyr)
library(purrr)
env2listcol <- function(rdata_file)
{
e <- new.env()
load(rdata_file, envir = e)
# since you know that there's only 1 df in each environment
as.list.environment(e)[[1]]
}
# assuming files are stored in the `input` folder
dir("input", full.names = TRUE) %>% as_tibble() %>%
  # split the path into folder, file name and extension
  separate(value, into=c("dir", "file", "ext"), remove=FALSE) %>%
  # get the category and the year in separate columns
  extract(file, into=c("key", "year"), regex="([a-z]+)(\\d+)") %>%
  # file path by category by year, remove unnecessary columns
  spread(key, value) %>% select(-dir, -ext) %>%
  # extract dataframes from environments, and join them
  mutate(household=map(household, env2listcol),
         person=map(person, env2listcol),
         joined=map2(household, person, ~ left_join(.x, .y, by = "id"))) %>%
  # keep only the year and the joined tables, so the remaining list-columns
  # do not get carried along
  select(year, joined) %>%
  # rbind joined tables, although you could pull(joined) or imap over it
  unnest(joined)
#> # A tibble: 7 x 14
#> year id var2 var3 var4 var5 var6 var7 var8 var9 var10 var11 var12 var13
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 2010 1 1 1 a d 1 1 <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 2010 2 2 2 b e 2 2 <NA> <NA> <NA> <NA> <NA> <NA>
#> 3 2010 3 3 3 c f 3 3 <NA> <NA> <NA> <NA> <NA> <NA>
#> 4 2011 1 <NA> <NA> <NA> <NA> <NA> <NA> 1 1 1 1 1 1
#> 5 2011 2 <NA> <NA> <NA> <NA> <NA> <NA> 2 2 2 2 2 2
#> 6 2011 3 <NA> <NA> <NA> <NA> <NA> <NA> 3 3 3 3 3 3
#> 7 2011 4 <NA> <NA> <NA> <NA> <NA> <NA> 4 4 4 4 4 4
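What you do with it next is up to you. You could write it back to an R object (please, please... please use Rds instead). You could write it back to a table (which I believe is easier to work with). You could even export it as JSON.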
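As a rough illustration of the first two options, the sketch below writes each joined table to its own .rds file and to a CSV file. The list hhperson, its contents and the output folder are placeholders invented for the example, not part of the answer above.

```r
# Hypothetical stand-in for a named list of joined tables.
hhperson <- list(
  hhperson2010 = data.frame(id = c("1", "2"), var2 = c("1", "2"), var6 = c("1", "2")),
  hhperson2011 = data.frame(id = c("1", "2"), var8 = c("1", "2"), var11 = c("1", "2"))
)

out_dir <- file.path(tempdir(), "joined")  # hypothetical output folder
dir.create(out_dir, showWarnings = FALSE)

for (name in names(hhperson)) {
  # One .rds file per table; readRDS() returns the object itself,
  # so whoever reads it back chooses the variable name.
  saveRDS(hhperson[[name]], file.path(out_dir, paste0(name, ".rds")))
  # Plain-text copy that tools outside R can read.
  write.csv(hhperson[[name]], file.path(out_dir, paste0(name, ".csv")),
            row.names = FALSE)
}

# Reading a table back gives the same data frame that was saved.
restored <- readRDS(file.path(out_dir, "hhperson2010.rds"))
```

Unlike load(), readRDS() does not drop an object with a fixed name into the environment, which is likely why Rds is the recommended format here. JSON export would need an extra package such as jsonlite and is not shown.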
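Just an alternative solution:
# use the years you actually have; they do not need to be consecutive
years <- c("2010", "2011")
results <- list()
for (x in years) {
  name <- paste0("hhperson", x)
  # all.x = TRUE keeps every household row, like left_join
  results[[name]] <- merge(get(paste0("household", x)), get(paste0("person", x)),
                           by = "id", all.x = TRUE)
  print(name)
  print(results[[name]])
}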
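You can choose between a loop and lapply depending on what further processing you need. If you don't have many more datasets, I think lapply alone will do the job.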