Additional statistics with summarize_at in dplyr

Question  Votes: 4  Answers: 3

Is there a way to add extra statistics to a summarize_at call? For example,

iris %>% group_by(Species) %>% summarise_at(vars(), funs(mean, sd))

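This would compute the mean and standard deviation of 4 columns (8 columns in total). Suppose I also want to know how many rows are in each group, i.e., something like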

# Below is not valid syntax 
iris %>% 
  group_by(Species) %>% 
  summarise_at(vars(), funs(mean, sd)) + summarise(n())

Given that the above does not work, a kludge is

iris %>% group_by(Species) %>% summarise_at(vars(), funs(mean, sd, length))

In effect, this produces 4 copies of the count column.

Perhaps this is beyond what summarize_at and friends can conveniently handle?

r dplyr
3 Answers
10
votes

How about this:

iris %>% 
    group_by(Species) %>% 
    mutate(Count = n()) %>%
    group_by(Species, Count) %>%
    summarize_at(vars(), funs(mean, sd))
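
For readers on a newer dplyr, the same result can be sketched with across(), which was not part of the original answer. This assumes dplyr 1.0.0 or later (funs() has been deprecated since dplyr 0.8.0); the name res is only illustrative.

```r
library(dplyr)

res <- iris %>%
  group_by(Species) %>%
  summarise(
    # mean and sd of every numeric column, named <column>_<function>
    across(where(is.numeric), list(mean = mean, sd = sd)),
    # the row count goes last, so where(is.numeric) does not pick it up
    n = n()
  )

res
```

Placing n = n() after across() matters here: if it came first, where(is.numeric) would also match the new n column and summarise it.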

2
votes

We can do this in a more flexible way with data.table

library(data.table)
as.data.table(iris)[, c(n = .N, unlist(lapply(.SD, function(x) 
    list(Mean=mean(x), SD=sd(x))), recursive = FALSE)), .(Species)]
# Species  n Sepal.Length.Mean Sepal.Length.SD Sepal.Width.Mean Sepal.Width.SD Petal.Length.Mean Petal.Length.SD Petal.Width.Mean
#1:     setosa 50             5.006       0.3524897            3.428      0.3790644             1.462       0.1736640            0.246
#2: versicolor 50             5.936       0.5161711            2.770      0.3137983             4.260       0.4699110            1.326
#3:  virginica 50             6.588       0.6358796            2.974      0.3224966             5.552       0.5518947            2.026
#   Petal.Width.SD
#1:      0.1053856
#2:      0.1977527
#3:      0.2746501

Or, with dplyr, we may need to do a join

iris1 <- iris %>%
             group_by(Species) %>% 
             summarise_all(funs(mean, sd))

iris %>% 
     group_by(Species) %>% 
     summarise(n = n()) %>%
     full_join(iris1)

Or with bind_cols

iris %>%
 group_by(Species) %>% 
 summarise_all(funs(mean, sd)) %>% bind_cols(., iris %>% count(Species) %>% select(-Species))
# A tibble: 3 × 10
#     Species Sepal.Length_mean Sepal.Width_mean Petal.Length_mean Petal.Width_mean Sepal.Length_sd Sepal.Width_sd Petal.Length_sd Petal.Width_sd     n
#      <fctr>             <dbl>            <dbl>             <dbl>            <dbl>           <dbl>          <dbl>           <dbl>          <dbl> <int>
#1     setosa             5.006            3.428             1.462            0.246       0.3524897      0.3790644       0.1736640      0.1053856    50
#2 versicolor             5.936            2.770             4.260            1.326       0.5161711      0.3137983       0.4699110      0.1977527    50
#3  virginica             6.588            2.974             5.552            2.026       0.6358796      0.3224966       0.5518947      0.2746501    50
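
Since bind_cols pairs rows by position rather than by key, a keyed join may be a somewhat safer variant of the same idea. A minimal sketch, assuming dplyr 1.0.0 or later for across(); the names stats, counts and result are only illustrative.

```r
library(dplyr)

# Per-group mean and sd of every numeric column
stats <- iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), list(mean = mean, sd = sd)))

# Per-group row counts, one row per Species
counts <- count(iris, Species)

# Match rows on the grouping key instead of relying on row order
result <- left_join(stats, counts, by = "Species")

result
```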

1
vote

To specify the columns the statistics are applied to:

iris %>%   group_by(Species) %>% 
     mutate(Count = n()) %>%
     group_by(Species, Count) %>%
     summarize_at(vars(Sepal.Length), funs(mean, sd)) -> dt_stat
dt_stat

Or apply them to all columns starting with "Sepal":

iris %>%   group_by(Species) %>% 
     mutate(Count = n()) %>%
     group_by(Species, Count) %>%
     summarize_at(vars(starts_with("Sepal")), funs(mean, sd)) -> dt_stat2
dt_stat2
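
The column selection above can also be sketched without the mutate and regroup step, assuming dplyr 1.0.0 or later. This is an alternative to the answer's code rather than a restatement of it, and the name sepal_stats is only illustrative.

```r
library(dplyr)

sepal_stats <- iris %>%
  group_by(Species) %>%
  summarise(
    # only the columns whose names start with "Sepal"
    across(starts_with("Sepal"), list(mean = mean, sd = sd)),
    # number of rows in each group
    Count = n()
  )

sepal_stats
```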