我有宽格式的大型数据集(287 个变量的 1200 个观测值),其中大多数变量在 10 年范围内随时间变化,少数变量随时间变化。时变变量标签不遵循特定模式,除了以 2011 到 2020 之间的年份结尾之外。
我想以长格式重塑它,但我很难弄清楚如何以相当快的方式做到这一点。
这是一个具有可比较特征的简单数据框:
#Packages
library(dplyr)
library(tidyr)
#Reproducibility
set.seed(123)
#Generate data frame
wide <- as.data.frame(matrix(runif(n=60, min=0, max=10), nrow=5))
#Naming rows and columns
names(wide) = c("day2011",
"night2011",
"in_between2011",
"day2012",
"night2012",
"in_between2012",
"day2013",
"night2013",
"in_between2013",
"house",
"tree",
"dog"
)
wide$id <- c("red", "blue", "yellow", "black", "green")
我正在慢慢地努力做到这一点,但我不确定它在大型数据集上是否可持续:
long_day <- pivot_longer(data = wide,
cols = starts_with("day"),
#cols_vary = "slowest",
names_to = "year",
values_to = "Day") %>%
select(-(night2011:in_between2013)) %>%
mutate(year = gsub("day", "", year))
long_night <- pivot_longer(data = wide,
cols = starts_with("night"),
#cols_vary = "slowest",
names_to = "year",
values_to = "Night") %>%
select(-(day2011:in_between2013))%>%
mutate(year = gsub("night", "", year))
long_24 <- full_join(long_day, long_night, by = c("id", "year", "house", "dog", "tree"))
用途:
wide %>%
pivot_longer(matches('20(1[1-9]|20)$'),
names_to = c('.value', 'year'),
names_pattern = "(.*?)(\\d+)")
# A tibble: 15 × 8
house tree dog id year day night in_between
<dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 1.39 0.458 2.07 red 2011 2.88 0.456 9.57
2 1.39 0.458 2.07 red 2012 9.00 8.90 7.09
3 1.39 0.458 2.07 red 2013 9.63 4.78 1.43
4 2.33 4.42 1.28 blue 2011 7.88 5.28 4.53
5 2.33 4.42 1.28 blue 2012 2.46 6.93 5.44
6 2.33 4.42 1.28 blue 2013 9.02 7.58 4.15
7 4.66 7.99 7.53 yellow 2011 4.09 8.92 6.78
8 4.66 7.99 7.53 yellow 2012 0.421 6.41 5.94
9 4.66 7.99 7.53 yellow 2013 6.91 2.16 4.14
10 2.66 1.22 8.95 black 2011 8.83 5.51 5.73
11 2.66 1.22 8.95 black 2012 3.28 9.94 2.89
12 2.66 1.22 8.95 black 2013 7.95 3.18 3.69
13 8.58 5.61 3.74 green 2011 9.40 4.57 1.03
14 8.58 5.61 3.74 green 2012 9.55 6.56 1.47
15 8.58 5.61 3.74 green 2013 0.246 2.32 1.52