Rolling sum with a conditional window

Question (votes: 1, answers: 2)

Here is an example of my data:

> d
   customer       date revenue
1:        A 2016-01-01      32
2:        A 2016-01-03      88
3:        A 2016-01-04      80
4:        A 2016-02-01      38
5:        B 2016-01-13      44
6:        B 2016-01-24      11
7:        B 2016-01-25      50
8:        B 2016-02-26      46
> dput(d)
structure(list(customer = c("A", "A", "A", "A", "B", "B", "B", 
"B"), date = structure(c(16801, 16803, 16804, 16832, 16813, 16824, 
16825, 16857), class = "Date"), revenue = c(32, 88, 80, 38, 44, 
11, 50, 46)), .Names = c("customer", "date", "revenue"), row.names = c(NA, 
-8L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000000002a60788>)

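I want to create a column, let's call it roll_sum_3days, that holds a rolling sum of the revenue that occurs afterwards for the same customer. The window size is conditional on the date column: for each row, roll_sum_3days is the sum of the revenue from rows dated strictly after that row's date and no more than 3 days later.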

The expected result would look like this:

   customer       date revenue roll_sum_3days
1:        A 2016-01-01      32            168
2:        A 2016-01-03      88             80
3:        A 2016-01-04      80              0
4:        A 2016-02-01      38              0
5:        B 2016-01-13      44              0
6:        B 2016-01-24      11             96
7:        B 2016-01-25      50             46
8:        B 2016-01-26      46              0
r dataframe data.table
2 Answers
3
votes

A possible solution:

library(data.table)
library(lubridate) # for the '%m+%' operator

d[, roll_sum_3d := .SD[.SD[, .(date, date2 = date %m+% days(3), revenue)]
                       , on = .(date > date, date <= date2)
                       ][, sum(revenue, na.rm = TRUE), by = date]$V1
  , by = customer][]

This gives:

   customer       date revenue roll_sum_3d
1:        A 2016-01-01      32         168
2:        A 2016-01-03      88          80
3:        A 2016-01-04      80           0
4:        A 2016-02-01      38           0
5:        B 2016-01-13      44           0
6:        B 2016-01-24      11          96
7:        B 2016-01-25      50          46
8:        B 2016-01-26      46           0

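What this does:

  • Groups d by customer with by = customer.
  • Adds roll_sum_3d by reference with :=.
  • Calculates roll_sum_3d by joining each group (.SD, the subset of data) with the date window of that group (.SD[, .(date, date2 = date %m+% days(3), revenue)]) using a non-equi join (on = .(date > date, date <= date2)), then summarises the revenue for each date and returns it.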
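
To make the non-equi join step easier to follow, here is a minimal sketch that runs it for customer A alone and keeps the intermediate join result. The names a, w, win_start, win_end, res and out are hypothetical and chosen for illustration only, and plain Date arithmetic (date + 3) is assumed to be an acceptable stand-in for date %m+% days(3).

```r
library(data.table)

# Customer A only, copied from the example data (hypothetical object name)
a <- data.table(
  date    = as.Date(c("2016-01-01", "2016-01-03", "2016-01-04", "2016-02-01")),
  revenue = c(32, 88, 80, 38)
)

# One window per row: strictly after win_start, up to and including win_end
w <- a[, .(win_start = date, win_end = date + 3)]

# Non-equi join: for each window, pick the rows of a that fall inside it.
# Windows without any match still yield one row, with revenue = NA.
res <- a[w, on = .(date > win_start, date <= win_end),
         .(window_start = i.win_start, revenue = x.revenue)]

# Collapse to one sum per window; na.rm = TRUE turns "no match" into 0
out <- res[, .(roll_sum_3d = sum(revenue, na.rm = TRUE)), by = window_start]
out$roll_sum_3d
# 168 80 0 0
```

Printing res shows why na.rm = TRUE is needed: the windows starting on 2016-01-04 and 2016-02-01 match nothing, so they appear as single rows with a missing revenue.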

Another option, based on @Arun's comment:

d[, roll_sum_3d := d[d[, .(customer, date, date2 = date %m+% days(3), revenue)]
                     , on = .(customer, date > date, date <= date2)
                     , sum(revenue, na.rm = TRUE), by=.EACHI]$V1][]

1
votes

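Hi, I think there is another error in your example: observation 8 cannot contribute to the sums of the two preceding observations, because it is dated in February. In any case, if you are willing to use apply() and POSIXct, I have a solution: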

df <- data.frame(customer = c("A", "A", "A", "A", "B", "B", "B", "B"),
       date = structure(c(16801, 16803, 16804, 16832, 16813, 16824, 
                          16825, 16857), class = "Date"), 
       revenue = c(32, 88, 80, 38, 44, 11, 50, 46))

df$date <- as.POSIXct(df$date)

calc <- function(x){
   date <- as.POSIXct(unlist(x["date"]),origin = "1970-01-01")
   customer <- unlist(x["customer"])
   # Choose here what to sum: rows of the same customer dated after this day and at most 3 days later
   # 86400 is the number of seconds in a day
   output <- sum(df[df$date > date & df$date <= (date+86400*3) & df$customer==customer,"revenue"])
   return(output)
   }

df$sum <- apply(df,1,calc)
# if you want to come back with your date format.
df$date <- as.Date(df$date)
df
  customer       date revenue sum
1        A 2016-01-01      32 168
2        A 2016-01-03      88  80
3        A 2016-01-04      80   0
4        A 2016-02-01      38   0
5        B 2016-01-13      44   0
6        B 2016-01-24      11  50
7        B 2016-01-25      50   0
8        B 2016-02-26      46   0

I could not keep your date format, because the > operator did not work with it.
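
As a possible variation, here is a sketch that keeps the Date class by looping over row indices with sapply() instead of over rows with apply(). The underlying assumption is that the difficulty comes from apply() coercing each row to character, whereas Date vectors themselves support > and <=. The helper name in_window is hypothetical.

```r
df <- data.frame(
  customer = c("A", "A", "A", "A", "B", "B", "B", "B"),
  date = as.Date(c("2016-01-01", "2016-01-03", "2016-01-04", "2016-02-01",
                   "2016-01-13", "2016-01-24", "2016-01-25", "2016-02-26")),
  revenue = c(32, 88, 80, 38, 44, 11, 50, 46)
)

# Iterate over row indices so the date column is never coerced to character
df$sum <- sapply(seq_len(nrow(df)), function(i) {
  # Same customer, dated strictly after row i and at most 3 days later
  in_window <- df$customer == df$customer[i] &
    df$date > df$date[i] &
    df$date <= df$date[i] + 3
  sum(df$revenue[in_window])
})

df$sum
# 168 80 0 0 0 50 0 0
```

With the data from the dput() output (row 8 dated 2016-02-26), this reproduces the sums shown in the table above, and df$date stays a Date column throughout.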
