Count number of rows within each group

Question

I have a data frame and I would like to count the number of rows within each group. I regularly use the aggregate function to sum data, as follows:

df2 <- aggregate(x ~ Year + Month, data = df1, sum)

Now I would like to count observations, but I can't seem to find the proper argument for FUN. Intuitively, I thought it would be as follows:

df2 <- aggregate(x ~ Year + Month, data = df1, count)

But no such luck.

Any ideas?

Some toy data:

set.seed(2)
df1 <- data.frame(x = 1:20,
                  Year = sample(2012:2014, 20, replace = TRUE),
                  Month = sample(month.abb[1:3], 20, replace = TRUE))

Answer 1

The tidyverse/dplyr way:

library(dplyr)
df1 %>% count(Year, Month)

Answer 2

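Following @Joshua's suggestion, here is one way to count the number of observations in your df data frame where Year = 2007 and Month = Nov (assuming they are columns). The condition goes before the comma so that it filters rows:

nrow(df[df$Year == 2007 & df$Month == "Nov", ])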

and with aggregate, following @GregSnow:

aggregate(x ~ Year + Month, data = df, FUN = length)
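
As a quick check of the two approaches above, here is a minimal sketch on a small hand-made data frame. The name `toy` and its values are made up for illustration and are not taken from the question. It counts one group with a logical row filter, then all groups with aggregate and length.

```r
# Hypothetical toy data; the column names mirror the question.
toy <- data.frame(x = 1:6,
                  Year = c(2007, 2007, 2007, 2008, 2008, 2007),
                  Month = c("Nov", "Nov", "Dec", "Nov", "Nov", "Nov"))

# Count one group: the logical condition goes before the comma (row filter).
n_nov_2007 <- nrow(toy[toy$Year == 2007 & toy$Month == "Nov", ])

# Equivalent shortcut: TRUE counts as 1 when summed.
n_nov_2007_sum <- sum(toy$Year == 2007 & toy$Month == "Nov")

# Count every Year/Month group at once; the x column holds the counts.
counts <- aggregate(x ~ Year + Month, data = toy, FUN = length)
print(counts)
```

Both single-group counts should agree with the matching row of `counts`.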

Answer 3

The dplyr package does this with the count/tally commands or the n() function.

First, some data:

df <- data.frame(x = rep(1:6, rep(c(1, 2, 3), 2)), year = 1993:2004, month = c(1, 1:11))

Now the count:

library(dplyr)
count(df, year, month)
#piping
df %>% count(year, month)

We can also use a slightly longer version with piping and the n() function:

df %>% 
  group_by(year, month) %>%
  summarise(number = n())

Or the tally function:

df %>% 
  group_by(year, month) %>%
  tally()

Answer 4

An old question without a data.table solution. So here goes...

Using .N:

library(data.table)
DT <- data.table(df)
DT[, .N, by = list(year, month)]

Answer 5

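The simple option to use with aggregate is the length function, which gives the length of the vector in each subset. Sometimes a more robust choice is function(x) sum( !is.na(x) ), which counts only the non-missing values.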
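
To see where the two functions differ, here is a small sketch on made-up data (`toy` is hypothetical). One detail worth knowing: the formula interface of aggregate uses na.action = na.omit by default, so rows with a missing x are dropped before length ever sees them. Passing na.action = na.pass keeps those rows and makes the difference visible.

```r
# Hypothetical data with missing values in x.
toy <- data.frame(g = c("a", "a", "b", "b", "b"),
                  x = c(1, NA, 3, 4, NA))

# length counts every row in the group, including rows where x is NA.
all_rows <- aggregate(x ~ g, data = toy, FUN = length, na.action = na.pass)

# sum(!is.na(v)) counts only the non-missing values.
non_missing <- aggregate(x ~ g, data = toy,
                         FUN = function(v) sum(!is.na(v)),
                         na.action = na.pass)

# With the default na.action (na.omit), NA rows are removed first,
# so length already behaves like the non-missing count.
default_formula <- aggregate(x ~ g, data = toy, FUN = length)

print(all_rows)
print(non_missing)
print(default_formula)
```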

Answer 6

Create a new variable Count with a value of 1 for each row:

df1["Count"] <-1

Then aggregate the data frame, summing the Count column:

df2 <- aggregate(df1[c("Count")], by=list(Year=df1$Year, Month=df1$Month), FUN=sum, na.rm=TRUE)

Answer 7

An alternative to the aggregate() function in this case is table() with as.data.frame(), which also shows which combinations of year and month have zero occurrences:

df<-data.frame(x=rep(1:6,rep(c(1,2,3),2)),year=1993:2004,month=c(1,1:11))

myAns<-as.data.frame(table(df[,c("year","month")]))

And without the zero-occurrence combinations:

myAns[which(myAns$Freq>0),]

Answer 8

If you want to include 0 counts for month-years that are missing from the data, you can use a little table magic.

data.frame(with(df1, table(Year, Month)))

For example, the toy data.frame in the question, df1, contains no observations for January 2014.

df1
    x Year Month
1   1 2012   Feb
2   2 2014   Feb
3   3 2013   Mar
4   4 2012   Jan
5   5 2014   Feb
6   6 2014   Feb
7   7 2012   Jan
8   8 2014   Feb
9   9 2013   Mar
10 10 2013   Jan
11 11 2013   Jan
12 12 2012   Jan
13 13 2014   Mar
14 14 2012   Mar
15 15 2013   Feb
16 16 2014   Feb
17 17 2014   Mar
18 18 2012   Jan
19 19 2013   Mar
20 20 2012   Jan

The base R aggregate function does not return a row for January 2014.

aggregate(x ~ Year + Month, data = df1, FUN = length)
  Year Month x
1 2012   Feb 1
2 2013   Feb 1
3 2014   Feb 5
4 2012   Jan 5
5 2013   Jan 2
6 2012   Mar 1
7 2013   Mar 3
8 2014   Mar 2

If you would like a row for this month-year with 0 as the count, the table code above returns a data.frame with counts for all month-year combinations:

data.frame(with(df1, table(Year, Month)))
  Year Month Freq
1 2012   Feb    1
2 2013   Feb    1
3 2014   Feb    5
4 2012   Jan    5
5 2013   Jan    2
6 2014   Jan    0
7 2012   Mar    1
8 2013   Mar    3
9 2014   Mar    2
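
One side effect of the table approach is that the grouping columns come back as factors and the count column is called Freq. The sketch below, on made-up data (`toy` is hypothetical), uses the responseName and stringsAsFactors arguments of as.data.frame for tables to rename the count column and then converts Year back to numeric.

```r
# Hypothetical data: 2013-Feb and 2014-Jan never occur.
toy <- data.frame(Year = c(2012, 2012, 2013, 2014),
                  Month = c("Jan", "Feb", "Jan", "Feb"))

# Cross-tabulate; every Year/Month combination gets a row, including zeros.
all_combos <- as.data.frame(table(Year = toy$Year, Month = toy$Month),
                            responseName = "n",
                            stringsAsFactors = FALSE)

# table() stores the levels as text, so convert Year back to a number.
all_combos$Year <- as.numeric(all_combos$Year)

# Keep only the combinations that were actually observed.
observed <- all_combos[all_combos$n > 0, ]

print(all_combos)
print(observed)
```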

Answer 9

An SQL solution using the sqldf package:

library(sqldf)
sqldf("SELECT Year, Month, COUNT(*) as Freq
       FROM df1
       GROUP BY Year, Month")

Answer 10

Using the collapse package in R:

library(collapse)
library(magrittr)
df %>% 
    fgroup_by(year, month) %>%
    fsummarise(number = fnobs(x))

Answer 11

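For my aggregations I usually end up wanting to see both the mean and "how big is this group" (a.k.a. length). So this is my handy snippet for those occasions: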

agg.mean <- aggregate(columnToMean ~ columnToAggregateOn1*columnToAggregateOn2, yourDataFrame, FUN="mean")
agg.count <- aggregate(columnToMean ~ columnToAggregateOn1*columnToAggregateOn2, yourDataFrame, FUN="length")
aggcount <- agg.count$columnToMean
agg <- cbind(aggcount, agg.mean)
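
A possible variant of the snippet above computes both statistics in a single aggregate call by letting FUN return a named vector. This is only a sketch on made-up data (`toy`, `g1`, `g2` and `value` are hypothetical names). The result column is a matrix, so do.call(data.frame, ...) is used to flatten it into ordinary columns.

```r
# Hypothetical data with two grouping columns and one value column.
toy <- data.frame(g1 = c("a", "a", "b", "b", "b"),
                  g2 = c("u", "u", "u", "v", "v"),
                  value = c(1, 3, 10, 2, 4))

# FUN returns a named vector, so 'value' becomes a two-column matrix
# (columns "mean" and "n") inside the aggregated data frame.
agg <- aggregate(value ~ g1 + g2, data = toy,
                 FUN = function(v) c(mean = mean(v), n = length(v)))

# Flatten the matrix column into regular columns value.mean and value.n.
flat <- do.call(data.frame, agg)
print(flat)
```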

Answer 12

library(tidyverse)

df1 %>%
  group_by(Year, Month) %>%
  summarise(count= n())

Answer 13

Considering @Ben's answer, R would throw an error if df1 does not contain an x column. This can be solved elegantly with paste:

aggregate(paste(Year, Month) ~ Year + Month, data = df1, FUN = NROW)

Similarly, it can be generalized when more than two variables are used for grouping:

aggregate(paste(Year, Month, Day) ~ Year + Month + Day, data = df1, FUN = NROW)

Answer 14

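I usually use the table function:

# 20 rows: all three columns must have the same length for data.frame()
df <- data.frame(a = rep(1:8, rep(c(1, 2, 3, 4), 2)), year = rep(2011:2020, each = 2), month = rep(c(1, 3:11), 2))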

new_data <- as.data.frame(table(df[,c("year","month")]))

Answer 15

Two very fast collapse options are GRPN and fcount. fcount is a fast version of dplyr::count and uses the same syntax. You can use add = TRUE to add the count as a column (mutate-like):

library(collapse)
fcount(df1, Year, Month) #or df1 %>% fcount(Year, Month)

#   Year Month N
# 1 2012   Feb 4
# 2 2014   Jan 3
# 3 2013   Mar 2
# 4 2013   Feb 2
# 5 2012   Jan 2
# 6 2012   Mar 2
# 7 2013   Jan 1
# 8 2014   Feb 3
# 9 2014   Mar 1

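GRPN is closer to collapse's original syntax. First, group the data with GRP. Then use GRPN. By default, GRPN creates an expanded vector that matches the original data (in dplyr, this would be equivalent to using mutate). Use expand = FALSE to output the summarized vector.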

library(collapse)
GRPN(GRP(df1, .c(Year, Month)), expand = FALSE)

A microbenchmark with a 100,000 x 3 data frame and 4997 different groups. In this run, collapse::fcount is considerably faster than the other options.

library(collapse)
library(dplyr)
library(data.table)
library(microbenchmark)

set.seed(1)
df <- data.frame(x = gl(1000, 100),
           y = rbinom(100000, 4, .5),
           z = runif(100000))
dt <- df

mb <- 
  microbenchmark(
  aggregate = aggregate(z ~ x + y, data = df, FUN = length),
  count = count(df, x, y),
  data.table = setDT(dt)[, .N, by = .(x, y)],
  'collapse::fnobs' = df %>% fgroup_by(x, y) %>% fsummarise(number = fnobs(z)),
  'collapse::GRPN' = GRPN(GRP(df, .c(x, y)), expand = FALSE),
  'collapse::fcount' = fcount(df, x, y)
)

# Unit: milliseconds
#             expr      min        lq       mean    median        uq      max neval
#        aggregate 159.5459 203.87385 227.787186 223.93050 246.36025 335.0302   100
#            count  55.1765  63.83560  74.715889  73.60195  79.20170 196.8888   100
#       data.table   8.4483  15.57120  18.308277  18.10790  20.65460  31.2666   100
#  collapse::fnobs   3.3325   4.16145   5.695979   5.18225   6.27720  22.7697   100
#   collapse::GRPN   3.0254   3.80890   4.844727   4.59445   5.50995  13.6649   100
# collapse::fcount   1.2222   1.57395   3.087526   1.89540   2.47955  22.5756   100

Answer 16

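You can use the by function, as in by(df1$Year, df1$Month, count), to produce a list of the needed aggregation. The x and freq columns in the output suggest that count here is plyr::count rather than dplyr::count.

The output will look like: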

df1$Month: Feb
     x freq
1 2012    1
2 2013    1
3 2014    5
--------------------------------------------------------------- 
df1$Month: Jan
     x freq
1 2012    5
2 2013    2
--------------------------------------------------------------- 
df1$Month: Mar
     x freq
1 2012    1
2 2013    3
3 2014    2

Answer 17

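There are plenty of wonderful answers here already, but I wanted to add one more option for those who want to add a new column to the original dataset containing the number of times that row's group occurs.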

df1$counts <- sapply(X = paste(df1$Year, df1$Month), 
                     FUN = function(x) { sum(paste(df1$Year, df1$Month) == x) })

The same could be accomplished by combining any of the above answers with the merge() function.
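
As an illustration of the merge() route, here is a minimal sketch on made-up data (`toy` is hypothetical). The count column is renamed before merging so that merge() joins only on Year and Month. The last line cross-checks the result with ave(), a base R alternative that is not mentioned in the answers above.

```r
# Hypothetical data; the column names mirror the question.
toy <- data.frame(x = 1:5,
                  Year = c(2012, 2012, 2013, 2013, 2013),
                  Month = c("Jan", "Jan", "Jan", "Feb", "Feb"))

# Step 1: one row per group with its size.
counts <- aggregate(x ~ Year + Month, data = toy, FUN = length)

# Step 2: rename the count column so it does not clash with toy$x.
names(counts)[names(counts) == "x"] <- "counts"

# Step 3: attach the group size to every original row.
# Note that merge() may reorder the rows.
with_counts <- merge(toy, counts, by = c("Year", "Month"))
print(with_counts)

# Cross-check: ave() returns the group size aligned with the original rows.
toy$counts_ave <- ave(toy$x, toy$Year, toy$Month, FUN = length)
```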

Answer 18

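If you try the aggregate solutions above and get the error:

invalid type (list) for variable

because you are using date or datetime stamps, try wrapping one or both of the variables in as.character:

aggregate(x ~ as.character(Year) + Month, data = df, FUN = length)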

Answer 19

You can also use fcount from my package timeplyr, which accepts dplyr syntax but uses collapse under the hood.

library(collapse)
library(timeplyr)
library(dplyr)
library(data.table)
library(microbenchmark)

set.seed(1)
df <- data.frame(x = gl(1000, 100),
                 y = rbinom(100000, 4, .5),
                 z = runif(100000))
dt <- df

mb <- 
  microbenchmark(
    aggregate = aggregate(z ~ x + y, data = df, FUN = length),
    count = count(df, x, y),
    data.table = setDT(dt)[, .N, by = .(x, y)],
    'collapse::fcount' = collapse::fcount(df, x, y),
    'timeplyr::fcount1' = timeplyr::fcount(df, x, y),
    'timeplyr::fcount2' = timeplyr::fcount(df, .cols = c("x", "y"), order = FALSE)
  )

mb
#> Unit: milliseconds
#>               expr     min        lq       mean    median        uq      max
#>          aggregate 84.0802 105.10615 123.593910 115.97675 134.65225 255.7676
#>              count 40.8108  50.82485  60.718189  56.81630  68.85530  97.4791
#>         data.table  3.7106   5.07485   6.273698   5.66645   6.44855  20.0465
#>   collapse::fcount  1.0118   1.37400   1.915809   1.61105   2.08465  13.9825
#>  timeplyr::fcount1  3.0390   3.74840   5.361852   4.56755   5.83405  44.0072
#>  timeplyr::fcount2  1.3787   1.98625   2.640338   2.47025   3.03450   8.6333
#>  neval
#>    100
#>    100
#>    100
#>    100
#>    100
#>    100

Created on 2023-11-22 with reprex v2.0.2
