Conditional sum across data frames in R

Question

I am trying to replicate the SUMIFS functionality in R. I have two data frames.

Data frame 1

allReported

ID       employeeGroup
1093     Bargaining Unit
1093     Management
1093     Non-Union
55       Bargaining Unit
55       Management
55       Non-Union

Data frame 2

employeeCompSummary

ID       employeeGroup      statBenefits    regularWages
1093     Management         500.00          10000.00
1093     Management         200.00          60000.00
1093     Bargaining Unit    100.00          20000.00
1093     Bargaining Unit    150.00          30000.00
1093     Non-Union          500.00          60000.00
55       Bargaining Unit    750.00          65000.00
55       Bargaining Unit    500.00          75000.00
55       Management         250.00          45000.00
55       Management         850.00          90000.00

I am trying to sum statBenefits (and later regularWages) to create a new table that gives the following result:

ID       employeeGroup          statBenefits
1093     Bargaining Unit        250.00
1093     Management             700.00
1093     Non-Union              500.00
55       Bargaining Unit        1250.00
55       Management             1100.00
55       Non-Union              0.00

I have tried the following:

library(data.table)
setDT(allReported)[, list(total=sum(statbenefits)), list(employeeCompSummary, employeeGroup)]

and got the following error:

Error in `[.data.table`(setDT(allReported), , list(total = sum(statbenefits)),  :   column or expression 1 of 'by' or 'keyby' is type list. Do not quote column names. Usage: DT[,sum(colC),by=list(colA,month(colB))]

I also tried:

sumTest <- aggregate(allReported, by = list(employeeCompSummary), sum)

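and got the following error:

Error in aggregate.data.frame(allReported, by = list(employeeCompSummary),  :   arguments must have same length

Any help anyone can provide would be greatly appreciated. I have looked at other questions that seem related to this one but could not find an answer that works. I will need to do this for several other things, so I would like to know whether there is a simple technique anyone knows of. As always, thanks to the wonderful community on Stack Overflow.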

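Edit: dput() of the two sample tables. Note that the names here are swapped relative to the tables above (allReported holds the compensation rows, employeeCompSummary the ID/group pairs):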

allReported <- structure(list(ID = c(1093, 1093, 1093, 1093, 1093, 55, 55, 55,55), employeeGroup = c("Management", "Management", "Bargaining Unit","Bargaining Unit", "Non-Union", "Bargaining Unit", "Bargaining Unit","Management", "Management"), statBenefits = c(500, 200, 100,150, 500, 750, 500, 250, 850), regularWages = c(10000, 60000,20000, 30000, 60000, 65000, 75000, 45000, 90000)), row.names = c(NA,-9L), class = c("tbl_df", "tbl", "data.frame"))

employeeCompSummary <- structure(list(ID = c(1093, 1093, 1093, 55, 55, 55), employeeGroup =c("Bargaining Unit","Management", "Non-Union", "Bargaining Unit", "Management", "Non-Union")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))


Answer 1

You can do this with the dplyr and magrittr (for %>%) packages:

library(dplyr)
library(magrittr)

df1 <- structure(list(ID = c(1093, 1093, 1093, 55, 55, 55), employeeGroup =c("Bargaining Unit","Management", "Non-Union", "Bargaining Unit", "Management", "Non-Union")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))

df2 <- structure(list(ID = c(1093, 1093, 1093, 1093, 1093, 55, 55, 55,55), employeeGroup = c("Management", "Management", "Bargaining Unit","Bargaining Unit", "Non-Union", "Bargaining Unit", "Bargaining Unit","Management", "Management"), statBenefits = c(500, 200, 100,150, 500, 750, 500, 250, 850), regularWages = c(10000, 60000,20000, 30000, 60000, 65000, 75000, 45000, 90000)), row.names = c(NA,-9L), class = c("tbl_df", "tbl", "data.frame"))

result <- left_join(df1, df2, by = c("ID", "employeeGroup")) %>%
  group_by(ID, employeeGroup) %>%
  summarize(
    # na.rm = TRUE makes groups with no matching rows sum to 0 instead of NA
    statBenefits = sum(statBenefits, na.rm = TRUE),
    regularWages = sum(regularWages, na.rm = TRUE)
  )
result
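
For readers without dplyr installed, the same left-join-then-sum logic can be sketched in base R. This is only an illustration, not part of the original answer: the names groups, comp, merged and base_result are made up here, and the sample data is rebuilt as plain data frames.

```r
# Sample data from the question, rebuilt as plain data frames
# (groups, comp, merged and base_result are illustrative names)
groups <- data.frame(
  ID = c(1093, 1093, 1093, 55, 55, 55),
  employeeGroup = c("Bargaining Unit", "Management", "Non-Union",
                    "Bargaining Unit", "Management", "Non-Union")
)
comp <- data.frame(
  ID = c(1093, 1093, 1093, 1093, 1093, 55, 55, 55, 55),
  employeeGroup = c("Management", "Management", "Bargaining Unit",
                    "Bargaining Unit", "Non-Union", "Bargaining Unit",
                    "Bargaining Unit", "Management", "Management"),
  statBenefits = c(500, 200, 100, 150, 500, 750, 500, 250, 850),
  regularWages = c(10000, 60000, 20000, 30000, 60000,
                   65000, 75000, 45000, 90000)
)

# Left join: keep every ID/group pair, even one with no compensation rows
merged <- merge(groups, comp, by = c("ID", "employeeGroup"), all.x = TRUE)

# Sum per ID/group; na.rm = TRUE is passed on to sum(), so the
# unmatched pair (55, Non-Union) ends up as 0 rather than NA
base_result <- aggregate(
  merged[c("statBenefits", "regularWages")],
  by = merged[c("ID", "employeeGroup")],
  FUN = sum, na.rm = TRUE
)
base_result
```

The non-formula interface of aggregate() is used on purpose: the formula interface drops rows containing NA by default, which would remove the unmatched group entirely.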

Answer 2

Edit based on your comment: one way is to use data.table like this

library(data.table)
dt1 <- data.table(structure(list(ID = c(1093, 1093, 1093, 1093, 1093, 55, 55, 55,55), 
               employeeGroup = c("Management", "Management", "Bargaining Unit","Bargaining Unit", "Non-Union", "Bargaining Unit", "Bargaining Unit","Management", "Management"), statBenefits = c(500, 200, 100,150, 500, 750, 500, 250, 850), regularWages = c(10000, 60000,20000, 30000, 60000, 65000, 75000, 45000, 90000)), 
          row.names = c(NA,-9L), class = c("tbl_df", "tbl", "data.frame")), key = c("ID", "employeeGroup"))

dt2 <- data.table(structure(list(ID = c(1093, 1093, 1093, 55, 55, 55), employeeGroup =c("Bargaining Unit","Management", "Non-Union", "Bargaining Unit", "Management", "Non-Union")), 
          row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame")), key = c("ID", "employeeGroup"))



dt1[dt2][, lapply(.SD, sum), .SDcols = c("statBenefits", "regularWages"), by = c("ID", "employeeGroup")]

This gives

     ID   employeeGroup statBenefits regularWages
1:   55 Bargaining Unit         1250       140000
2:   55      Management         1100       135000
3:   55       Non-Union           NA           NA
4: 1093 Bargaining Unit          250        50000
5: 1093      Management          700        70000
6: 1093       Non-Union          500        60000

You can replace the NA values with 0 afterwards
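
As a rough sketch of that last step, here are two ways the NA values could be turned into 0. This is an illustration rather than the answerer's code: it joins with on= instead of keys, builds the tables directly with data.table(), and the names cols, res and res2 are made up.

```r
library(data.table)

# Same sample data as above, built directly as data.tables
dt1 <- data.table(
  ID = c(1093, 1093, 1093, 1093, 1093, 55, 55, 55, 55),
  employeeGroup = c("Management", "Management", "Bargaining Unit",
                    "Bargaining Unit", "Non-Union", "Bargaining Unit",
                    "Bargaining Unit", "Management", "Management"),
  statBenefits = c(500, 200, 100, 150, 500, 750, 500, 250, 850),
  regularWages = c(10000, 60000, 20000, 30000, 60000,
                   65000, 75000, 45000, 90000)
)
dt2 <- data.table(
  ID = c(1093, 1093, 1093, 55, 55, 55),
  employeeGroup = c("Bargaining Unit", "Management", "Non-Union",
                    "Bargaining Unit", "Management", "Non-Union")
)
cols <- c("statBenefits", "regularWages")  # illustrative helper name

# Option 1: drop NAs while summing, so an unmatched group sums to 0
res <- dt1[dt2, on = .(ID, employeeGroup)][
  , lapply(.SD, sum, na.rm = TRUE), .SDcols = cols, by = .(ID, employeeGroup)]

# Option 2: sum as in the answer, then overwrite the NAs by reference
res2 <- dt1[dt2, on = .(ID, employeeGroup)][
  , lapply(.SD, sum), .SDcols = cols, by = .(ID, employeeGroup)]
for (col in cols) {
  set(res2, i = which(is.na(res2[[col]])), j = col, value = 0)
}

res
```

Option 1 is shorter, but it would also hide NA values that were present in the source data; Option 2 makes the replacement an explicit, separate step.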

Answer 3

I would do...

library(data.table)

# don't use setDT, since who knows if it works on tibbeldies
ar = data.table(allReported)
ecs = data.table(employeeCompSummary)

ecs[, total := ar[.SD, on=.(ID, employeeGroup), sum(x.statBenefits), by=.EACHI][, V1]]

     ID   employeeGroup total
1: 1093 Bargaining Unit   250
2: 1093      Management   700
3: 1093       Non-Union   500
4:   55 Bargaining Unit  1250
5:   55      Management  1100
6:   55       Non-Union    NA

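Even though the OP asked for a new table, this code adds a column to ecs instead. The new table and ecs would have the same set of rows, so carrying both around seems like a waste of mental energy. Dropping the column later is simple.

If you are wondering how this "update join" works, try working backwards...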

ar[ecs, on=.(ID, employeeGroup), sum(x.statBenefits), by=.EACHI]

# or

ar[ecs, on=.(ID, employeeGroup)]

Note that .SD == ecs in the original code. See ?.SD.
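
If a separate table with 0 in place of NA is preferred after all, the same by=.EACHI join can return one directly. The following is a minimal sketch under that assumption, not the answerer's code: new_tab is a made-up name, and the tables are rebuilt here with data.table() so the block runs on its own.

```r
library(data.table)

# ar: compensation rows; ecs: the ID/group pairs to report on
ar <- data.table(
  ID = c(1093, 1093, 1093, 1093, 1093, 55, 55, 55, 55),
  employeeGroup = c("Management", "Management", "Bargaining Unit",
                    "Bargaining Unit", "Non-Union", "Bargaining Unit",
                    "Bargaining Unit", "Management", "Management"),
  statBenefits = c(500, 200, 100, 150, 500, 750, 500, 250, 850),
  regularWages = c(10000, 60000, 20000, 30000, 60000,
                   65000, 75000, 45000, 90000)
)
ecs <- data.table(
  ID = c(1093, 1093, 1093, 55, 55, 55),
  employeeGroup = c("Bargaining Unit", "Management", "Non-Union",
                    "Bargaining Unit", "Management", "Non-Union")
)

# For each row of ecs, sum the matching rows of ar. An unmatched row
# sees x.statBenefits as NA, which na.rm = TRUE reduces to a sum of 0.
new_tab <- ar[ecs, on = .(ID, employeeGroup),
              .(total = sum(x.statBenefits, na.rm = TRUE)),
              by = .EACHI]
new_tab
```

The result keeps the row order of ecs, and ecs itself is left unmodified.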
