Time-series grouping and filtering with dplyr in R

Question (votes: 1, answers: 1)

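I am trying to find a specific pattern within groups in a data frame. Take the following data frame of orders, which records the email, the person who placed the order, and the amount.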

set.seed(123)
dates = sample(seq(as.Date("2017-01-01"),as.Date("2017-12-31"), by = 'day'), 2000, replace = TRUE)
amount <- sample(-50:100, 2000, replace = TRUE)
placedorder <- sample(c(NA, NA, NA, "jeff", "alex", "steve", "amy", "john", "larry", "ryan"), 2000, replace = TRUE)
email <- sample(paste0(1:200, "@gmail.com"), 2000, replace = TRUE)
df <- data.frame(dates, email, placedorder, amount, stringsAsFactors = FALSE)

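Grouping by email address, I want to find every group in which all three of the following occur, and in which they occur in this order by date:

  1. An order was placed with a positive amount and placedorder is NA
  2. After the order in step 1, an order was placed with a negative amount and placedorder is NA
  3. After the order in step 2, an order was placed with a positive amount and placedorder is not NA

Example: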

# A tibble: 10 x 4
# Groups:   email [1]
        dates       email placedorder amount
       <date>       <chr>       <chr>  <int>
 1 2017-02-10 [email protected]        <NA>     68 # satisfies #1
 2 2017-02-27 [email protected]        <NA>    -21 # satisfies #2
 3 2017-03-07 [email protected]        jeff     -9
 4 2017-03-09 [email protected]       steve    -93
 5 2017-03-14 [email protected]       steve     22 # satisfies #3
 6 2017-03-18 [email protected]       steve    -81
 7 2017-04-28 [email protected]        <NA>    -12
 8 2017-05-06 [email protected]        <NA>      4
 9 2017-06-03 [email protected]        jeff    -40
10 2017-06-03 [email protected]       larry     13 #(this also satisfies #3)

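The rows in the example above all belong to the same email, and the three conditions are met one after another in time.

My attempt below does, I think, find the groups where each condition occurs, but it does not take the dates into account or require the conditions to occur in succession. Ideally, the result would also be filtered down to just those matching orders.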

library(dplyr)

df2 <- df %>%
  group_by(email) %>%
  filter(any(is.na(placedorder) & amount > 0),
         any(is.na(placedorder) & amount < 0),
         any(!is.na(placedorder) & amount > 0)
  )
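
To illustrate the limitation, here is a minimal sketch on hypothetical toy data (the address `a@example.com` and the names `toy` and `kept` are made up for illustration). The three conditions all occur within the group, but in the wrong order, and the `any()`-based filter still keeps every row:

```r
library(dplyr)

# Hypothetical toy group: condition 3 comes first, then 2, then 1
toy <- data.frame(
  dates = as.Date(c("2017-01-01", "2017-02-01", "2017-03-01")),
  email = "a@example.com",
  placedorder = c("jeff", NA, NA),
  amount = c(10, -5, 20),
  stringsAsFactors = FALSE
)

# Same filter as in the attempt above
kept <- toy %>%
  group_by(email) %>%
  filter(any(is.na(placedorder) & amount > 0),
         any(is.na(placedorder) & amount < 0),
         any(!is.na(placedorder) & amount > 0))

nrow(kept)  # 3: the whole group passes even though the order is wrong
```

Because each `any()` is evaluated over the whole group, the filter only checks that each condition exists somewhere in the group, not where it falls in time.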

Thanks in advance!

r filter group-by dplyr
1 answer (2 votes)

Assuming my interpretation of "first order" and "second order" is correct, here is one way to set this up in dplyr:

library(dplyr)

df %>% group_by(email) %>% 
  arrange(email, dates) %>% 
  mutate(order_num = row_number()) %>% 
  #An order was placed with a positive value and where placedorder is NA
  filter((is.na(placedorder) & amount>0) |
  # An order was placed after the first one, with a negative value and where placedorder is NA
         (is.na(placedorder) & amount <0 & order_num >1) |
  # An order was placed after the second order, with a positive value and where placedorder is not NA
        (!is.na(placedorder) & amount >0 & order_num > 2)
    )

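Update: Thanks a lot for clarifying the question. Basically, you want to "observe the customer's state" and only start tracking the next type of event once the previous type has been observed. Here is a (slightly verbose, but hopefully understandable) attempt at tracking customers as they transition through these "states":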

df %>% group_by(email) %>% 
  arrange(email, dates) %>% 
  mutate(event_1=ifelse(is.na(placedorder) & amount>0, 1, 0),
         post_event_1=cumsum(event_1),
         # only if at least one event_1 has happened
         event_2=ifelse(post_event_1>=1 & is.na(placedorder) & amount <0, 1,0),
         post_event_2=cumsum(event_2),
         # only if at least one event_2 has happened
         event_3=ifelse(post_event_2>=1 & !is.na(placedorder) & amount >0, 1, 0)) %>% 
  # only interested in first occurrence of event_1 and event_2 preceding event_3
  filter((event_1==1 & post_event_1==1) | (event_2==1 & post_event_2==1) | event_3 ==1)

# A tibble: 390 x 9
# Groups:   email [165]
        dates         email placedorder amount event_1 post_event_1 event_2 post_event_2 event_3
       <date>         <chr>       <chr>  <int>   <dbl>        <dbl>   <dbl>        <dbl>   <dbl>
 1 2017-01-29   [email protected]        <NA>     76       1            1       0            0       0
 2 2017-05-25   [email protected]        <NA>    -37       0            1       1            1       0
 3 2017-08-14   [email protected]       steve     53       0            1       0            2       1
 4 2017-12-21   [email protected]        john     92       0            2       0            4       1
 5 2017-02-08 [email protected]        <NA>     89       1            1       0            0       0
 6 2017-01-16 [email protected]        <NA>     40       1            1       0            0       0
 7 2017-03-18 [email protected]        <NA>     20       1            1       0            0       0
 8 2017-05-16 [email protected]        <NA>    -45       0            2       1            1       0
 9 2017-06-08 [email protected]       larry     46       0            2       0            2       1
10 2017-07-22 [email protected]        john     93       0            3       0            2       1
# ... with 380 more rows
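
As a sanity check, the same state-tracking logic can be run on the ten-row example from the question. This is only a sketch: the address `x@example.com` is a hypothetical stand-in for the email that is obscured in the listing, and `ex` and `res` are illustrative names.

```r
library(dplyr)

# The ten example rows from the question, with a hypothetical email address
ex <- data.frame(
  dates = as.Date(c("2017-02-10", "2017-02-27", "2017-03-07", "2017-03-09",
                    "2017-03-14", "2017-03-18", "2017-04-28", "2017-05-06",
                    "2017-06-03", "2017-06-03")),
  email = "x@example.com",
  placedorder = c(NA, NA, "jeff", "steve", "steve", "steve",
                  NA, NA, "jeff", "larry"),
  amount = c(68, -21, -9, -93, 22, -81, -12, 4, -40, 13),
  stringsAsFactors = FALSE
)

res <- ex %>%
  group_by(email) %>%
  arrange(email, dates) %>%
  mutate(event_1 = ifelse(is.na(placedorder) & amount > 0, 1, 0),
         post_event_1 = cumsum(event_1),
         # event_2 only counts once at least one event_1 has happened
         event_2 = ifelse(post_event_1 >= 1 & is.na(placedorder) & amount < 0, 1, 0),
         post_event_2 = cumsum(event_2),
         # event_3 only counts once at least one event_2 has happened
         event_3 = ifelse(post_event_2 >= 1 & !is.na(placedorder) & amount > 0, 1, 0)) %>%
  filter((event_1 == 1 & post_event_1 == 1) |
         (event_2 == 1 & post_event_2 == 1) |
         event_3 == 1)

res$amount  # 68 -21 22 13
```

The four rows kept are the ones annotated in the question as satisfying #1, #2, and #3 (twice). The running `cumsum()` columns act as "has this state been reached yet" flags, which is what enforces the ordering.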

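There are some "unfinished chains", for example when a customer progressed to state_1 but no further. I am not sure whether you want to discard these (to do so, you could count the number of observations per email and drop the emails with fewer than 3 records).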
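
One possible way to drop the unfinished chains, sketched on hypothetical toy data (the addresses and the names `orders`, `tracked`, and `complete` are made up for illustration). It relies on the filtered result still being grouped by email, so `n()` counts the rows left for each email:

```r
library(dplyr)

# Hypothetical toy data: "a" completes the chain, "b" only reaches state 1
orders <- data.frame(
  dates = as.Date(c("2017-01-01", "2017-02-01", "2017-03-01",
                    "2017-01-05", "2017-02-05")),
  email = c(rep("a@example.com", 3), rep("b@example.com", 2)),
  placedorder = c(NA, NA, "jeff", NA, "amy"),
  amount = c(50, -10, 30, 20, 15),
  stringsAsFactors = FALSE
)

# Same state tracking as above
tracked <- orders %>%
  group_by(email) %>%
  arrange(email, dates) %>%
  mutate(event_1 = ifelse(is.na(placedorder) & amount > 0, 1, 0),
         post_event_1 = cumsum(event_1),
         event_2 = ifelse(post_event_1 >= 1 & is.na(placedorder) & amount < 0, 1, 0),
         post_event_2 = cumsum(event_2),
         event_3 = ifelse(post_event_2 >= 1 & !is.na(placedorder) & amount > 0, 1, 0)) %>%
  filter((event_1 == 1 & post_event_1 == 1) |
         (event_2 == 1 & post_event_2 == 1) |
         event_3 == 1)

# Keep only emails with at least 3 remaining records (a full chain)
complete <- tracked %>%
  filter(n() >= 3)

unique(complete$email)  # "a@example.com"
```

Since an `event_3` row can only exist after a first `event_1` and a first `event_2`, filtering on `any(event_3 == 1)` instead of `n() >= 3` should select the same emails and may state the intent more directly.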
