I am trying to replicate the SUMIFS function in R. I have two data frames.
Data frame 1
allReported
ID employeeGroup
1093 Bargaining Unit
1093 Management
1093 Non-Union
55 Bargaining Unit
55 Management
55 Non-Union
Data frame 2
employeeCompSummary
ID employeeGroup statBenefits regularWages
1093 Management 500.00 10000.00
1093 Management 200.00 60000.00
1093 Bargaining Unit 100.00 20000.00
1093 Bargaining Unit 150.00 30000.00
1093 Non-Union 500.00 60000.00
55 Bargaining Unit 750.00 65000.00
55 Bargaining Unit 500.00 75000.00
55 Management 250.00 45000.00
55 Management 850.00 90000.00
I am trying to sum statBenefits (and later regularWages) to create a new table that produces the following result:
ID employeeGroup statBenefits
1093 Bargaining Unit 250.00
1093 Management 700.00
1093 Non-Union 500.00
55 Bargaining Unit 1250.00
55 Management 1100.00
55 Non-Union 0.00
I have tried the following:
library(data.table)
setDT(allReported)[, list(total=sum(statbenefits)), list(employeeCompSummary, employeeGroup)]
and got the following error:
Error in `[.data.table`(setDT(allReported), , list(total = sum(statbenefits)), : column or expression 1 of 'by' or 'keyby' is type list. Do not quote column names. Usage: DT[,sum(colC),by=list(colA,month(colB))]
I also tried:
sumTest <- aggregate(allReported, by = list(employeeCompSummary), sum)
and got the following error:
Error in aggregate.data.frame(allReported, by = list(employeeCompSummary), : arguments must have same length
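Any help anyone can provide would be greatly appreciated. I have looked at other questions that seem related, but could not find an answer that works. I will be doing this for several things, so I would like to know whether there is a simple technique anyone knows of. As always, thanks to the wonderful community on Stack Overflow.
Edit: dput() of the two example tables: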
allReported <- structure(list(ID = c(1093, 1093, 1093, 1093, 1093, 55, 55, 55,55), employeeGroup = c("Management", "Management", "Bargaining Unit","Bargaining Unit", "Non-Union", "Bargaining Unit", "Bargaining Unit","Management", "Management"), statBenefits = c(500, 200, 100,150, 500, 750, 500, 250, 850), regularWages = c(10000, 60000,20000, 30000, 60000, 65000, 75000, 45000, 90000)), row.names = c(NA,-9L), class = c("tbl_df", "tbl", "data.frame"))
employeeCompSummary <- structure(list(ID = c(1093, 1093, 1093, 55, 55, 55), employeeGroup =c("Bargaining Unit","Management", "Non-Union", "Bargaining Unit", "Management", "Non-Union")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
You can do this with the dplyr and magrittr (for %>%) packages:
library(dplyr)
library(magrittr)
df1 <- structure(list(ID = c(1093, 1093, 1093, 55, 55, 55), employeeGroup =c("Bargaining Unit","Management", "Non-Union", "Bargaining Unit", "Management", "Non-Union")), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
df2 <- structure(list(ID = c(1093, 1093, 1093, 1093, 1093, 55, 55, 55,55), employeeGroup = c("Management", "Management", "Bargaining Unit","Bargaining Unit", "Non-Union", "Bargaining Unit", "Bargaining Unit","Management", "Management"), statBenefits = c(500, 200, 100,150, 500, 750, 500, 250, 850), regularWages = c(10000, 60000,20000, 30000, 60000, 65000, 75000, 45000, 90000)), row.names = c(NA,-9L), class = c("tbl_df", "tbl", "data.frame"))
result <- left_join(df1, df2, by = c("ID", "employeeGroup")) %>%
  group_by(ID, employeeGroup) %>%
  summarize(
    statBenefits = sum(statBenefits, na.rm = TRUE),
    regularWages = sum(regularWages, na.rm = TRUE)
  )
result
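To see why a group with no matching rows (such as Non-Union for ID 55) ends up as 0 rather than NA here, it may help to run the same pipeline on a cut-down input. This is only a sketch, assuming dplyr is installed; `groups`, `comp`, `joined` and `totals` are hypothetical names standing in for df1, df2 and the intermediate steps.

```r
library(dplyr)

# Hypothetical, cut-down stand-ins for df1 (groups to report) and df2 (amounts)
groups <- data.frame(ID = c(55, 55),
                     employeeGroup = c("Management", "Non-Union"))
comp <- data.frame(ID = c(55, 55),
                   employeeGroup = c("Management", "Management"),
                   statBenefits = c(250, 850))

# The left join keeps every row of `groups`; Non-Union has no match in `comp`,
# so its statBenefits is NA after the join
joined <- left_join(groups, comp, by = c("ID", "employeeGroup"))

# With na.rm = TRUE the NA is dropped before summing, and the sum of nothing is 0
totals <- joined %>%
  group_by(ID, employeeGroup) %>%
  summarize(statBenefits = sum(statBenefits, na.rm = TRUE)) %>%
  ungroup() %>%
  as.data.frame()
```

Without na.rm = TRUE the Non-Union total would be NA instead, which is what the data.table output further down shows.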
Edit based on your comment: one way is to use data.table like this:
library(data.table)
dt1 <- data.table(structure(list(ID = c(1093, 1093, 1093, 1093, 1093, 55, 55, 55,55),
employeeGroup = c("Management", "Management", "Bargaining Unit","Bargaining Unit", "Non-Union", "Bargaining Unit", "Bargaining Unit","Management", "Management"), statBenefits = c(500, 200, 100,150, 500, 750, 500, 250, 850), regularWages = c(10000, 60000,20000, 30000, 60000, 65000, 75000, 45000, 90000)),
row.names = c(NA,-9L), class = c("tbl_df", "tbl", "data.frame")), key = c("ID", "employeeGroup"))
dt2 <- data.table(structure(list(ID = c(1093, 1093, 1093, 55, 55, 55), employeeGroup =c("Bargaining Unit","Management", "Non-Union", "Bargaining Unit", "Management", "Non-Union")),
row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame")), key = c("ID", "employeeGroup"))
dt1[dt2][, lapply(.SD, sum), .SDcols = c("statBenefits", "regularWages"), by = c("ID", "employeeGroup")]
This gives
ID employeeGroup statBenefits regularWages
1: 55 Bargaining Unit 1250 140000
2: 55 Management 1100 135000
3: 55 Non-Union NA NA
4: 1093 Bargaining Unit 250 50000
5: 1093 Management 700 70000
6: 1093 Non-Union 500 60000
You can replace the NA values with 0 afterwards.
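One possible way to do that replacement in place is sketched below. The table `res` is a hypothetical stand-in for a few rows of the joined-and-summed result, not an object created by the code above.

```r
library(data.table)

# Hypothetical stand-in for part of the summed result, including the NA row
res <- data.table(
  ID = c(55, 55, 1093),
  employeeGroup = c("Management", "Non-Union", "Non-Union"),
  statBenefits = c(1100, NA, 500),
  regularWages = c(135000, NA, 60000)
)

# Overwrite NA with 0 by reference, one numeric column at a time
for (col in c("statBenefits", "regularWages")) {
  set(res, i = which(is.na(res[[col]])), j = col, value = 0)
}
```

Alternatively, passing na.rm = TRUE through to sum, as in lapply(.SD, sum, na.rm = TRUE), should give 0 for the unmatched group directly, so no clean-up step is needed.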
I would do...
library(data.table)
# don't use setDT, since who knows if it works on tibbeldies
ar = data.table(allReported)
ecs = data.table(employeeCompSummary)
ecs[, total := ar[.SD, on=.(ID, employeeGroup), sum(x.statBenefits), by=.EACHI][, V1]]
ID employeeGroup total
1: 1093 Bargaining Unit 250
2: 1093 Management 700
3: 1093 Non-Union 500
4: 55 Bargaining Unit 1250
5: 55 Management 1100
6: 55 Non-Union NA
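Even though the OP asked for a new table, this code adds a column to ecs instead. The new table and ecs would have the same set of rows, so it seems like a waste of mental energy to carry both around. Dropping the column later is simple.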
If you want to understand how this "update join" works, try working backwards...
ar[ecs, on=.(ID, employeeGroup), sum(x.statBenefits), by=.EACHI]
# or
ar[ecs, on=.(ID, employeeGroup)]
Note that .SD == ecs in the original code. See ?.SD.
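If a separate table with 0 for unmatched groups is wanted after all, a variation on the same join might look like the sketch below. It rebuilds ar and ecs from the dput() data (statBenefits only) so that it runs on its own; `newTable` is a hypothetical name, and adding na.rm = TRUE is a change from the code above, not part of it.

```r
library(data.table)

# Same values as the dput() data in the question, statBenefits only
ar <- data.table(
  ID = c(1093, 1093, 1093, 1093, 1093, 55, 55, 55, 55),
  employeeGroup = c("Management", "Management", "Bargaining Unit",
                    "Bargaining Unit", "Non-Union", "Bargaining Unit",
                    "Bargaining Unit", "Management", "Management"),
  statBenefits = c(500, 200, 100, 150, 500, 750, 500, 250, 850)
)
ecs <- data.table(
  ID = c(1093, 1093, 1093, 55, 55, 55),
  employeeGroup = c("Bargaining Unit", "Management", "Non-Union",
                    "Bargaining Unit", "Management", "Non-Union")
)

# by = .EACHI evaluates j once per row of ecs; for a row with no match in ar,
# x.statBenefits is NA, and na.rm = TRUE turns that group's sum into 0
newTable <- ar[ecs, on = .(ID, employeeGroup),
               .(statBenefits = sum(x.statBenefits, na.rm = TRUE)),
               by = .EACHI]
```

The rows of newTable follow the row order of ecs, and ecs itself is left unmodified.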