Is there an easy way to aggregate (sum, mean, max, etc.) multiple variables in a data frame at the same time?
Below is some sample data:
library(lubridate)
days = 365*2
date = seq(as.Date("2000-01-01"), length = days, by = "day")
year = year(date)
month = month(date)
x1 = cumsum(rnorm(days, 0.05))
x2 = cumsum(rnorm(days, 0.05))
df1 = data.frame(date, year, month, x1, x2)
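I would like to aggregate the x1 and x2 variables from the df1 data frame by year and month at the same time. The following code aggregates the x1 variable, but is it also possible to aggregate the x2 variable in the same call?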
### aggregate variables by year month
df2=aggregate(x1 ~ year+month, data=df1, sum, na.rm=TRUE)
head(df2)
Yes, in your formula you can cbind the numeric variables to be aggregated:
aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)
year month x1 x2
1 2000 1 7.862002 -7.469298
2 2001 1 276.758209 474.384252
3 2000 2 13.122369 -128.122613
...
23 2000 12 63.436507 449.794454
24 2001 12 999.472226 922.726589
See ?aggregate, in particular the formula argument and the examples.
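As a small illustration of the formula argument, here is a sketch on a made-up data frame (toy is a hypothetical stand-in for df1, not part of the original question). It compares naming the columns with cbind() against using a dot on the left-hand side, which stands for every column not used on the right-hand side.

```r
# Hypothetical toy data: two years, three months, two rows per group
set.seed(1)
toy <- data.frame(
  year  = rep(c(2000, 2001), each = 6),
  month = rep(1:3, times = 4),
  x1    = rnorm(12),
  x2    = rnorm(12)
)

# cbind() on the left-hand side names the columns explicitly
by_cbind <- aggregate(cbind(x1, x2) ~ year + month, data = toy, FUN = sum)

# "." on the left-hand side means "all columns not on the right-hand side"
by_dot <- aggregate(. ~ year + month, data = toy, FUN = sum)

# Both calls should give the same table
all.equal(by_cbind, by_dot)
```

With df1 itself, the dot would also include the date column, so it is safer to select the wanted columns first, for example data = df1[c("year", "month", "x1", "x2")].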
With the dplyr package, you can use summarise(across()) together with the tidyselect language to aggregate multiple variables. For the sample dataset, you can do this as follows:
library(dplyr)
# summarising all non-grouping variables
df1 %>% group_by(year, month) %>% summarise(across(everything(), n_distinct))
# summarising a specific set of non-grouping variables
df1 %>% group_by(year, month) %>% summarise(across(x1:x2, sum))
df1 %>% group_by(year, month) %>% summarise(across(c(x1, x2), sum))
df1 %>% group_by(year, month) %>% summarise(across(-date, sum))
# summarising a specific set of non-grouping variables using selection helpers:
df1 %>% group_by(year, month) %>% summarise(across(starts_with('x'), sum))
df1 %>% group_by(year, month) %>% summarise(across(matches('.*[0-9]'), sum))
# summarising a specific set of non-grouping variables based on condition (class)
df1 %>% group_by(year, month) %>% summarise(across(where(is.numeric), sum))
The result of the last six options (all those that use sum):
year month x1 x2
<dbl> <dbl> <dbl> <dbl>
1 2000 1 -73.58134 -92.78595
2 2000 2 -57.81334 -152.36983
3 2000 3 122.68758 153.55243
4 2000 4 450.24980 285.56374
5 2000 5 678.37867 384.42888
6 2000 6 792.68696 530.28694
7 2000 7 908.58795 452.31222
8 2000 8 710.69928 719.35225
9 2000 9 725.06079 914.93687
10 2000 10 770.60304 863.39337
# ... with 14 more rows
You can also apply multiple functions to the selected columns:
df1 %>%
group_by(year, month) %>%
summarise(across(x1:x2, list(sum = sum, avg = mean)))
year month x1_sum x1_avg x2_sum x2_avg
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2000 1 -97.0 -3.13 110. 3.55
2 2000 2 195. 6.73 150. 5.17
3 2000 3 224. 7.24 278. 8.96
4 2000 4 175. 5.85 199. 6.64
5 2000 5 172. 5.53 199. 6.43
6 2000 6 129. 4.29 602. 20.1
7 2000 7 4.84 0.156 1063. 34.3
8 2000 8 -373. -12.0 1285. 41.5
9 2000 9 -158. -5.26 1461. 48.7
10 2000 10 131. 4.22 1608. 51.9
# … with 14 more rows
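The output names above (x1_sum, x1_avg, and so on) follow the default naming pattern of across(). As a minimal sketch, assuming dplyr >= 1.0.0 and a made-up data frame toy (hypothetical, not from the original question), the .names argument changes that pattern:

```r
library(dplyr)

# Hypothetical toy data, small enough to check by hand
toy <- data.frame(
  year  = c(2000, 2000, 2001, 2001),
  month = c(1, 1, 1, 1),
  x1    = c(1, 2, 3, 4),
  x2    = c(10, 20, 30, 40)
)

res <- toy %>%
  group_by(year, month) %>%
  summarise(
    # {.fn} is the name given in the list, {.col} is the column name
    across(x1:x2, list(sum = sum, avg = mean), .names = "{.fn}_{.col}"),
    .groups = "drop"
  )

names(res)
```

Here the function name comes first, so the columns are called sum_x1, avg_x1, sum_x2 and avg_x2.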
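Some final notes:
summarise() drops the last level of grouping, so all the examples above are still grouped by year. To remove all grouping, add an ungroup() call or set .groups = "drop" in the summarise() call.
The .by argument (available since dplyr 1.1.0) specifies the grouping for a single summarise operation, e.g. df1 %>% summarise(across(c(x1, x2), sum), .by = c(year, month))
across() also works with other dplyr verbs such as mutate() and reframe().
Before across(), these kinds of operations were done with summarise_all(), summarise_at(), summarise_if() and (even earlier) summarise_each(). These have now been superseded or deprecated in favour of across().
As mentioned in my comment above, you can also use the recast function from the reshape2 package: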
library(reshape2)
recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))
This gives you the same result.
Using the data.table package, which is fast (useful for larger datasets):
https://github.com/Rdatatable/data.table/wiki
library(data.table)
df2 <- setDT(df1)[, lapply(.SD, sum), by = .(year, month), .SDcols = c("x1","x2")]
setDF(df2) # convert back to dataframe
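Note that setDT() converts its argument by reference, so in the code above df1 itself becomes a data.table. A sketch of a variant that leaves the original data frame untouched, using as.data.table() to make a copy (toy is a hypothetical example data frame):

```r
library(data.table)

# Hypothetical toy data
toy <- data.frame(
  year  = c(2000, 2000, 2001, 2001),
  month = c(1, 1, 1, 2),
  x1    = c(1, 2, 3, 4),
  x2    = c(10, 20, 30, 40)
)

# as.data.table() copies, so toy stays a plain data.frame
dt <- as.data.table(toy)

# Sum the columns listed in .SDcols within each year/month group
res <- dt[, lapply(.SD, sum), by = .(year, month), .SDcols = c("x1", "x2")]

# Convert the result back to a plain data.frame
res <- as.data.frame(res)

class(toy)
res
```

The copy costs memory, so for very large data the by-reference setDT() may still be the better choice.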
Using the plyr package:
require(plyr)
df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))
Using summarize() from the Hmisc package (although the column headings are messy in my example):
# need to detach plyr because plyr and Hmisc both have a summarize()
detach(package:plyr)
require(Hmisc)
df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))
Where does the year() function come from?
You can also use the reshape2 package for this task:
require(reshape2)
df_melt <- melt(df1, id = c("date", "year", "month"))
dcast(df_melt, year + month ~ variable, sum)
year month x1 x2
1 2000 1 -80.83405 -224.9540159
2 2000 2 -223.76331 -288.2418017
3 2000 3 -188.83930 -481.5601913
4 2000 4 -197.47797 -473.7137420
5 2000 5 -259.07928 -372.4563522
Interestingly, the data.frame method of base R's aggregate is not shown here; the answers above use the formula interface. So, for completeness:
aggregate(
x = df1[c("x1", "x2")],
by = df1[c("year", "month")],
FUN = sum, na.rm = TRUE
)
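One practical difference between the two interfaces is how they treat missing values. The formula method defaults to na.action = na.omit, which drops every row that has an NA in any of the variables used before FUN is called, so na.rm = TRUE never sees those rows. The data.frame method keeps the rows and lets FUN handle the NAs column by column. A minimal sketch on a made-up data frame (toy is hypothetical):

```r
# Hypothetical toy data with one missing x1 value in group "a"
toy <- data.frame(
  g  = c("a", "a", "b"),
  x1 = c(1, NA, 3),
  x2 = c(10, 20, 30)
)

# Formula interface: row 2 is removed entirely, so its x2 value (20) is lost
by_formula <- aggregate(cbind(x1, x2) ~ g, data = toy, FUN = sum, na.rm = TRUE)

# data.frame method: rows are kept, na.rm = TRUE is applied per column
by_df <- aggregate(toy[c("x1", "x2")], by = toy["g"], FUN = sum, na.rm = TRUE)

# Formula interface with na.action = na.pass behaves like the data.frame method
by_pass <- aggregate(cbind(x1, x2) ~ g, data = toy, FUN = sum,
                     na.rm = TRUE, na.action = na.pass)

by_formula$x2  # 10 30
by_df$x2       # 30 30
by_pass$x2     # 30 30
```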
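The data.frame method of aggregate is more generic. Because we supply a data.frame as x and a list as by (a data.frame is also a list), it is very useful when the call has to be built dynamically: changing the columns to aggregate, or the columns to aggregate by, is as simple as this: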
colsToAggregate <- c("x1")
aggregateBy <- c("year", "month")
dummyaggfun <- function(v, na.rm = TRUE) {
c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))
}
aggregate(df1[colsToAggregate], by = df1[aggregateBy], FUN = dummyaggfun)
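When FUN returns more than one value, as dummyaggfun does, aggregate stores the results for each input column as a matrix column rather than as separate columns. A sketch of how to flatten that into ordinary columns with do.call(data.frame, ...), using a made-up data frame (toy and stats are hypothetical names):

```r
# Hypothetical toy data
toy <- data.frame(g = c("a", "a", "b", "b"), x1 = c(1, 2, 3, 5))

# A function that returns two named statistics
stats <- function(v) c(sum = sum(v), mean = mean(v))

res <- aggregate(toy["x1"], by = toy["g"], FUN = stats)
ncol(res)        # 2: g plus one matrix column x1

# Spread the matrix column into one ordinary column per statistic
flat <- do.call(data.frame, res)
names(flat)      # "g" "x1.sum" "x1.mean"
```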
With dplyr version >= 1.0.0, we can also use summarise with across to apply a function to multiple columns:
library(dplyr)
df1 %>%
group_by(year, month) %>%
summarise(across(starts_with('x'), sum))
# A tibble: 24 x 4
# Groups: year [2]
# year month x1 x2
# <dbl> <dbl> <dbl> <dbl>
# 1 2000 1 11.7 52.9
# 2 2000 2 -74.1 126.
# 3 2000 3 -132. 149.
# 4 2000 4 -130. 4.12
# 5 2000 5 -91.6 -55.9
# 6 2000 6 179. 73.7
# 7 2000 7 95.0 409.
# 8 2000 8 255. 283.
# 9 2000 9 489. 331.
#10 2000 10 719. 305.
# … with 14 more rows
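The "# Groups: year [2]" line in the output above shows that the result is still grouped by year, because summarise() only drops the last grouping level. A minimal sketch, assuming dplyr >= 1.0.0 and a made-up data frame toy (hypothetical), of how to check and remove the remaining grouping:

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  year  = c(2000, 2000, 2001),
  month = c(1, 2, 1),
  x1    = c(1, 2, 3),
  x2    = c(4, 5, 6)
)

# Default behaviour: the result is still grouped by year
grouped <- toy %>%
  group_by(year, month) %>%
  summarise(across(starts_with("x"), sum))
group_vars(grouped)

# .groups = "drop" returns an ungrouped result
flat <- toy %>%
  group_by(year, month) %>%
  summarise(across(starts_with("x"), sum), .groups = "drop")
group_vars(flat)
```

This matters when further steps follow: a later mutate() on the grouped result would be computed within each year rather than over the whole table.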
For a more flexible and faster approach to data aggregation, check out the collap function in the collapse R package, available on CRAN:
library(collapse)
# Simple aggregation with one function
head(collap(df1, x1 + x2 ~ year + month, fmean))
year month x1 x2
1 2000 1 -1.217984 4.008534
2 2000 2 -1.117777 11.460301
3 2000 3 5.552706 8.621904
4 2000 4 4.238889 22.382953
5 2000 5 3.124566 39.982799
6 2000 6 -1.415203 48.252283
# Customized: Aggregate columns with different functions
head(collap(df1, x1 + x2 ~ year + month,
custom = list(fmean = c("x1", "x2"), fmedian = "x2")))
year month fmean.x1 fmean.x2 fmedian.x2
1 2000 1 -1.217984 4.008534 3.266968
2 2000 2 -1.117777 11.460301 11.563387
3 2000 3 5.552706 8.621904 8.506329
4 2000 4 4.238889 22.382953 20.796205
5 2000 5 3.124566 39.982799 39.919145
6 2000 6 -1.415203 48.252283 48.653926
# You can also apply multiple functions to all columns
head(collap(df1, x1 + x2 ~ year + month, list(fmean, fmin, fmax)))
year month fmean.x1 fmin.x1 fmax.x1 fmean.x2 fmin.x2 fmax.x2
1 2000 1 -1.217984 -4.2460775 1.245649 4.008534 -1.720181 10.47825
2 2000 2 -1.117777 -5.0081858 3.330872 11.460301 9.111287 13.86184
3 2000 3 5.552706 0.1193369 9.464760 8.621904 6.807443 11.54485
4 2000 4 4.238889 0.8723805 8.627637 22.382953 11.515753 31.66365
5 2000 5 3.124566 -1.5985090 7.341478 39.982799 31.957653 46.13732
6 2000 6 -1.415203 -4.6072295 2.655084 48.252283 42.809211 52.31309
# When you do that, you can also return the data in a long format
head(collap(df1, x1 + x2 ~ year + month, list(fmean, fmin, fmax), return = "long"))
Function year month x1 x2
1 fmean 2000 1 -1.217984 4.008534
2 fmean 2000 2 -1.117777 11.460301
3 fmean 2000 3 5.552706 8.621904
4 fmean 2000 4 4.238889 22.382953
5 fmean 2000 5 3.124566 39.982799
6 fmean 2000 6 -1.415203 48.252283
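Note: You can use base functions such as mean, max etc. with collap, but fmean, fmax etc. are C++ based grouped functions provided by the collapse package and are significantly faster (i.e. performance on large-data aggregations is the same as data.table, while providing greater flexibility). These fast grouped functions can also be used without collap.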
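Note 2: collap also supports flexible multi-type data aggregation. You can of course do this with the custom argument, but you can also apply functions to numeric and non-numeric columns in a semi-automated way: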
# wlddev is a data set of World Bank Indicators provided in the collapse package
head(wlddev)
country iso3c date year decade region income OECD PCGDP LIFEEX GINI ODA
1 Afghanistan AFG 1961-01-01 1960 1960 South Asia Low income FALSE NA 32.292 NA 114440000
2 Afghanistan AFG 1962-01-01 1961 1960 South Asia Low income FALSE NA 32.742 NA 233350000
3 Afghanistan AFG 1963-01-01 1962 1960 South Asia Low income FALSE NA 33.185 NA 114880000
4 Afghanistan AFG 1964-01-01 1963 1960 South Asia Low income FALSE NA 33.624 NA 236450000
5 Afghanistan AFG 1965-01-01 1964 1960 South Asia Low income FALSE NA 34.060 NA 302480000
6 Afghanistan AFG 1966-01-01 1965 1960 South Asia Low income FALSE NA 34.495 NA 370250000
# This aggregates the data, applying the mean to numeric and the statistical mode to categorical columns
head(collap(wlddev, ~ iso3c + decade, FUN = fmean, catFUN = fmode))
country iso3c date year decade region income OECD PCGDP LIFEEX GINI ODA
1 Aruba ABW 1961-01-01 1962.5 1960 Latin America & Caribbean High income FALSE NA 66.58583 NA NA
2 Aruba ABW 1967-01-01 1970.0 1970 Latin America & Caribbean High income FALSE NA 69.14178 NA NA
3 Aruba ABW 1976-01-01 1980.0 1980 Latin America & Caribbean High income FALSE NA 72.17600 NA 33630000
4 Aruba ABW 1987-01-01 1990.0 1990 Latin America & Caribbean High income FALSE 23677.09 73.45356 NA 41563333
5 Aruba ABW 1996-01-01 2000.0 2000 Latin America & Caribbean High income FALSE 26766.93 73.85773 NA 19857000
6 Aruba ABW 2007-01-01 2010.0 2010 Latin America & Caribbean High income FALSE 25238.80 75.01078 NA NA
# Note that by default (argument keep.col.order = TRUE) the column order is also preserved
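Updated dplyr solution: since dplyr 1.1.0, you can use .by in summarise for inline, ad hoc grouping (the result is automatically ungrouped after the computation).
Using across (available since dplyr 1.0.0) lets you apply the same function to multiple columns at once.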
library(dplyr)
df1 %>%
summarise(across(starts_with('x'), sum), .by = c(year, month))
# A tibble: 24 x 4
# year month x1 x2
# <dbl> <dbl> <dbl> <dbl>
# 1 2000 1 11.7 52.9
# 2 2000 2 -74.1 126.
# 3 2000 3 -132. 149.
# 4 2000 4 -130. 4.12
# 5 2000 5 -91.6 -55.9
# 6 2000 6 179. 73.7
# 7 2000 7 95.0 409.
# 8 2000 8 255. 283.
# 9 2000 9 489. 331.
#10 2000 10 719. 305.
# … with 14 more rows
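One difference between .by and group_by() that is easy to miss: group_by() sorts the result by the grouping keys, while .by keeps the groups in order of first appearance. A sketch assuming dplyr >= 1.1.0 and a made-up data frame toy (hypothetical):

```r
library(dplyr)

# Hypothetical toy data where group "b" appears before group "a"
toy <- data.frame(
  g  = c("b", "a", "b", "a"),
  x1 = c(1, 2, 3, 4),
  x2 = c(10, 20, 30, 40)
)

# .by: groups appear in order of first appearance (b, then a)
via_by <- toy %>%
  summarise(across(c(x1, x2), sum), .by = g)

# group_by(): groups are sorted (a, then b)
via_group <- toy %>%
  group_by(g) %>%
  summarise(across(c(x1, x2), sum), .groups = "drop")

via_by$g
via_group$g
```

The sums are the same in both results; only the row order differs, so add arrange() if a specific order is needed.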
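Here is another way to summarise multiple columns, which is especially useful when the function needs additional arguments. You can select all columns with everything(), or a subset of columns with something like any_of(c("a", "b")).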
library(dplyr)
# toy data
df <- tibble(a = sample(c(NA, 5:7), 30, replace = TRUE),
b = sample(c(NA, 1:5), 30, replace = TRUE),
c = sample(1:5, 30, replace = TRUE),
grp = sample(1:3, 30, replace = TRUE))
df
#> # A tibble: 30 × 4
#> a b c grp
#> <int> <int> <int> <int>
#> 1 7 1 3 1
#> 2 7 4 4 2
#> 3 5 1 3 3
#> 4 7 NA 3 2
#> 5 7 2 5 2
#> 6 7 4 4 2
#> 7 7 NA 3 3
#> 8 NA 5 4 1
#> 9 5 1 1 2
#> 10 NA 3 1 2
#> # … with 20 more rows
df %>%
group_by(grp) %>%
summarise(across(everything(),
list(mean = ~mean(., na.rm = TRUE),
q75 = ~quantile(., probs = .75, na.rm = TRUE))))
#> # A tibble: 3 × 7
#> grp a_mean a_q75 b_mean b_q75 c_mean c_q75
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 6.6 7 2.88 4.25 3 4
#> 2 2 6.33 7 2.62 3.25 2.9 4
#> 3 3 5.78 6 3.33 4 3.09 4
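The ~mean(., na.rm = TRUE) formulas above are purrr-style lambdas. A sketch of the same idea written with base R anonymous functions, assuming R >= 4.1 (for the \(x) syntax) and dplyr >= 1.0.0, on a made-up data frame toy (hypothetical) that is small enough to check by hand:

```r
library(dplyr)

# Hypothetical toy data with some missing values
toy <- data.frame(
  grp = c(1, 1, 2, 2),
  a   = c(5, NA, 7, 7),
  b   = c(1, 3, NA, 5)
)

res <- toy %>%
  group_by(grp) %>%
  summarise(
    across(c(a, b),
           list(mean = \(x) mean(x, na.rm = TRUE),   # extra argument passed inside the lambda
                max  = \(x) max(x, na.rm = TRUE))),
    .groups = "drop"
  )

res
```

Passing extra arguments inside the lambda, rather than through the dots of across(), is the form the dplyr documentation recommends as of version 1.1.0.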
Late to the party, but I recently found another way to get summary statistics.
library(psych)
describe(data)
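This outputs, for each variable, the mean, minimum, maximum, standard deviation, n, standard error, kurtosis, skewness, median and range. Here data is a placeholder for your own data frame.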