Suppose I have the following data frame:
Category = c("blue", "red", "red", "blue", "blue", "blue", "red", "red", "red","blue", "red", "red","blue","blue","red","blue","red")
Purchase = c(0,1,1,0,0,0,1,1,1,0,1,1,0,0,1,0,1)
Number = c(1,1,1,1,2,2,2,2,2,1,1,2,2,2,2,2,2)
Id = c("a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","b","b")
Country = c("NL","BE","BE","UK","UK","NL","UK","UK","UK","BE","NL","NL","BE","UK","UK","BE","NL")
df = data.frame(Id, Number,Category, Purchase, Country)
> df
Id Number Category Purchase Country
1 a 1 blue 0 NL
2 a 1 red 1 BE
3 a 1 red 1 BE
4 a 1 blue 0 UK
5 a 2 blue 0 UK
6 a 2 blue 0 NL
7 a 2 red 1 UK
8 a 2 red 1 UK
9 a 2 red 1 UK
10 b 1 blue 0 BE
11 b 1 red 1 NL
12 b 2 red 1 NL
13 b 2 blue 0 BE
14 b 2 blue 0 UK
15 b 2 red 1 UK
16 b 2 blue 0 BE
17 b 2 red 1 NL
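I want to aggregate rows where a "red" row is followed by another "red" row, grouped by Id and Number, so that Purchase is summed over those consecutive rows. My desired output is therefore: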
> desired
Id Number Category Purchase Country
1 a 1 blue 0 NL
2 a 1 red 2 BE
3 a 1 blue 0 UK
4 a 2 blue 0 UK
5 a 2 blue 0 NL
6 a 2 red 3 UK
7 b 1 blue 0 BE
8 b 1 red 1 NL
9 b 2 red 1 NL
10 b 2 blue 0 BE
11 b 2 blue 0 UK
12 b 2 red 1 UK
13 b 2 blue 0 BE
14 b 2 red 1 NL
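So the order in which the categories occur should be preserved, and only rows with Category "red" should be aggregated. Also, my real data frame has several more columns, such as the Country column, that I want to appear in the output as well, but I don't want to list all of them manually. I have tried the aggregate function and ddply, but I haven't been able to work it out.
Can someone help me with this aggregation problem, where the order of the rows has to be taken into account?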
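Here is an option with data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)) and group by the run-length id of the logical vector (Category == "red") together with 'Id', 'Number' and 'Category'. If a group has more than one row and all of its 'Category' values are "red", return the sum of 'Purchase'; otherwise return 'Purchase' unchanged.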
library(data.table)
# Sum Purchase only within runs of more than one "red" row; leave other rows as they are
setDT(df)[, .(Purchase = if(.N > 1 && all(Category == "red")) sum(Purchase)
                         else Purchase),
          by = .(grp = rleid(Category == "red"), Id, Number, Category)
          ][, grp := NULL][]
# Id Number Category Purchase
# 1: a 1 blue 0
# 2: a 1 red 2
# 3: a 1 blue 0
# 4: a 2 blue 0
# 5: a 2 blue 0
# 6: a 2 red 3
# 7: b 1 blue 0
# 8: b 1 red 1
# 9: b 2 red 1
#10: b 2 blue 0
#11: b 2 blue 0
#12: b 2 red 1
#13: b 2 blue 0
#14: b 2 red 1
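To make the grp key in the call above more concrete, here is a small base R sketch of a run-length id. It is meant to mirror what rleid returns for a single vector; the short Category vector is made up for illustration and is not the question's data.

```r
# Toy vector (hypothetical, for illustration only)
Category <- c("blue", "red", "red", "blue", "blue", "red")

# TRUE for red rows, FALSE otherwise
is_red <- Category == "red"

# rle() reports the length of each run of identical values;
# repeating the run index by its length gives one id per row
r <- rle(is_red)
run_id <- rep(seq_along(r$lengths), r$lengths)

run_id
# 1 2 2 3 3 4
```

Rows that share a run_id belong to the same consecutive run. Because consecutive blue rows also share an id, the answer sums only when the run is red.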
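# Base R: build a helper id (temp) that stays constant within a run of "red"
# rows and increases by one on every other row, then sum Purchase per group
df$temp = with(data = rle(as.character(df$Category)),
               cumsum(unlist(sapply(seq_along(values), function(i){
                   if(values[i] == "red"){
                       c(1, rep(0, lengths[i]-1))  # one increment per red run
                   }else{
                       rep(1, lengths[i])          # each blue row gets its own id
                   }}))))
aggregate(Purchase~., df, sum)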
# Id Number Category temp Purchase
#1 a 1 blue 1 0
#2 a 1 red 2 2
#3 a 1 blue 3 0
#4 a 2 blue 4 0
#5 a 2 blue 5 0
#6 a 2 red 6 3
#7 b 1 blue 7 0
#8 b 1 red 8 1
#9 b 2 red 8 1
#10 b 2 blue 9 0
#11 b 2 blue 10 0
#12 b 2 red 11 1
#13 b 2 blue 12 0
#14 b 2 red 13 1
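The temp column can also be built without looping over the runs. The following is only a sketch of an equivalent construction, not part of the answer above: the id increases on every row except a red row that directly follows another red row. The names is_red and prev_is_red are introduced here for illustration.

```r
# Category vector from the question
Category <- c("blue", "red", "red", "blue", "blue", "blue", "red", "red", "red",
              "blue", "red", "red", "blue", "blue", "red", "blue", "red")

is_red <- Category == "red"
# Was the previous row red? The first row has no predecessor.
prev_is_red <- c(FALSE, head(is_red, -1))

# Increment on every row unless it is a red row directly after a red row
temp <- cumsum(!(is_red & prev_is_red))

temp
# 1 2 2 3 4 5 6 6 6 7 8 8 9 10 11 12 13
```

Like the original temp, this id ignores Id and Number; they still separate the groups because aggregate uses them as grouping columns.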
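Here is a way to do it with dplyr.
First, I build a subgroup index that changes whenever the colour changes; together with Id and Number it defines the sub-data.frames.
Then I use do on the sub-data.frames that contain red to aggregate Purchase.
Finally, I remove the grouping and the extra column.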
library(dplyr)
df %>%
  group_by(Id, Number, subgroup = cumsum(c(TRUE, head(Category, -1) != tail(Category, -1)))) %>%
  do({if(.$Category[1] == "red") aggregate(Purchase ~ ., ., sum) else .}) %>%
  ungroup %>%
  select(-subgroup)
# # A tibble: 14 x 4
# Id Number Category Purchase
# <fctr> <dbl> <fctr> <dbl>
# 1 a 1 blue 0
# 2 a 1 red 2
# 3 a 1 blue 0
# 4 a 2 blue 0
# 5 a 2 blue 0
# 6 a 2 red 3
# 7 b 1 blue 0
# 8 b 1 red 1
# 9 b 2 red 1
# 10 b 2 blue 0
# 11 b 2 blue 0
# 12 b 2 red 1
# 13 b 2 blue 0
# 14 b 2 red 1
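The question also asks for extra columns such as Country to be carried through without naming them. Below is a minimal base R sketch of one way to do that. It assumes that keeping the first row's values for each run of reds is acceptable; the desired output is consistent with that, but the question does not state it. The names same_as_prev, grp and result are introduced here for illustration.

```r
# Data from the question
Category <- c("blue", "red", "red", "blue", "blue", "blue", "red", "red", "red",
              "blue", "red", "red", "blue", "blue", "red", "blue", "red")
Purchase <- c(0,1,1,0,0,0,1,1,1,0,1,1,0,0,1,0,1)
Number   <- c(1,1,1,1,2,2,2,2,2,1,1,2,2,2,2,2,2)
Id       <- c("a","a","a","a","a","a","a","a","a","b","b","b","b","b","b","b","b")
Country  <- c("NL","BE","BE","UK","UK","NL","UK","UK","UK",
              "BE","NL","NL","BE","UK","UK","BE","NL")
df <- data.frame(Id, Number, Category, Purchase, Country, stringsAsFactors = FALSE)

n <- nrow(df)
is_red <- df$Category == "red"

# A row continues the previous group only if both rows are red
# and they share the same Id and Number
same_as_prev <- c(FALSE,
                  is_red[-1] & is_red[-n] &
                    df$Id[-1] == df$Id[-n] &
                    df$Number[-1] == df$Number[-n])

# Every row that does not continue a group starts a new one
grp <- cumsum(!same_as_prev)

# Keep the first row of each group (with all its columns, in original order)
# and overwrite Purchase with the group sum
result <- df[!same_as_prev, ]
result$Purchase <- as.vector(tapply(df$Purchase, grp, sum))
rownames(result) <- NULL

result
```

No column other than Id, Number, Category and Purchase is referenced by name, so further columns are carried along with the first row of each group.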