Distinct sequence patterns with counts

Question  Votes: 3  Answers: 1

I have a sequence that looks like this

  id ep value
1  1  1     a
2  1  2     a
3  1  3     b
4  1  4     d
5  2  1     a
6  2  2     a
7  2  3     c
8  2  4     e

and what I would like to do is reduce it to this

      id    ep  value     n  time total
1      1     0      a     2    20    40
2      1     1      b     1    10    40
3      1     2      d     1    10    40
4      2     0      a     2    20    40
5      2     1      c     1    10    40
6      2     2      e     1    10    40

dplyr seems to work fine

short = df %>% group_by(id) %>%
 mutate(grp = cumsum(value != lag(value, default = value[1]))) %>%
  count(id, grp, value) %>% mutate(time = n*10) %>% group_by(id) %>% 
   mutate(total = sum(time))

However, my database is really large, and this takes forever.

Question 1

Can anyone help me translate this line into data.table code?

Question 2

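I am also interested in going back to the long format, and I would like to know which solution is the most efficient in terms of speed.

Currently, I am using this line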

short[rep(1:nrow(short), short$n), ] %>% 
  select(-n, -time, -total) %>% 
  group_by(id) %>% 
  mutate(ep = 1:n())

Any suggestions?

df = structure(list(id = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
), .Label = c("1", "2"), class = "factor"), ep = structure(c(1L, 
2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("1", "2", "3", "4"), class = "factor"), 
value = structure(c(1L, 1L, 2L, 4L, 1L, 1L, 3L, 5L), .Label = c("a", 
"b", "c", "d", "e"), class = "factor")), .Names = c("id", 
"ep", "value"), row.names = c(NA, -8L), class = "data.frame")
r dplyr data.table
1 Answer
Votes: 1

One option is to use rleid from data.table, which gives each run of consecutive equal values its own id

library(data.table)
short1 <- setDT(df)[,  .N,.(id, grp = rleid(value), value)
           ][,  time := N*10
            ][, c('total', 'ep') :=  .(sum(time), seq_len(.N) - 1), id
             ][, grp := NULL][]
short1
#   id value N time total ep
#1:  1     a 2   20    40  0
#2:  1     b 1   10    40  1
#3:  1     d 1   10    40  2
#4:  2     a 2   20    40  0
#5:  2     c 1   10    40  1
#6:  2     e 1   10    40  2
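To make the grouping step easier to follow, here is a minimal sketch of what rleid returns on a plain vector, next to a cumsum-based run id like the one in the question. The vector `value` and the names `run_id` and `run_id2` are made up for illustration.

```r
library(data.table)

# Hypothetical toy vector, not the poster's data
value <- c("a", "a", "b", "d", "a", "a")

# rleid() numbers each run of consecutive equal values, starting at 1
run_id <- rleid(value)

# The same ids built by hand: flag every position where the value changes,
# then take the cumulative sum of those flags
run_id2 <- cumsum(c(TRUE, value[-1] != value[-length(value)]))

run_id
# [1] 1 1 2 3 4 4
identical(run_id, run_id2)
# [1] TRUE
```

Note that the second run of "a" gets a new id (4) rather than being merged with the first one, which is why grouping by this id counts consecutive repeats only. The question's cumsum version starts at 0 instead of 1, which makes no difference for grouping.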

To get back to the "long" format

short1[rep(seq_len(.N), N), -c('N', 'time', 'total', 'ep'), 
             with = FALSE][, ep1 := seq_len(.N), id][]
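As a possible alternative for the expansion step, the sketch below repeats each value N times inside `j`, grouped by id, and renumbers the episodes with rowid. It rebuilds a small table similar to the question's data (with character columns instead of factors, an assumption made for simplicity) and checks that the round trip reproduces it. Whether this is faster than indexing with rep on real data is not something tested here, so it is worth benchmarking both.

```r
library(data.table)

# Small example mirroring the question's data (character instead of factor)
df <- data.table(
  id    = rep(c("1", "2"), each = 4),
  ep    = rep(1:4, times = 2),
  value = c("a", "a", "b", "d", "a", "a", "c", "e")
)

# Collapse consecutive repeats into one row per run, keeping the run length N
short1 <- df[, .N, by = .(id, grp = rleid(value), value)][, grp := NULL][]

# Expand again: repeat each value N times within its id,
# then number the rows 1, 2, ... within each id
long1 <- short1[, .(value = rep(value, N)), by = id][, ep := rowid(id)][]

# Check that the round trip gives back the original columns
all.equal(long1[, .(id, ep, value)], df)
```

This only works as a round trip because the short format keeps the runs in their original order within each id.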

A direct translation of the dplyr code into data.table would be

setDT(df)[, grp := cumsum(value != shift(value, fill = value[1])), id
   ][, .(N= .N), .(id, grp, value)
    ][, time := N*10
     ][, c('total', 'ep') :=  .(sum(time), seq_len(.N) - 1), id
       ][, grp := NULL][]
#   id value N time total ep
#1:  1     a 2   20    40  0
#2:  1     b 1   10    40  1
#3:  1     d 1   10    40  2
#4:  2     a 2   20    40  0
#5:  2     c 1   10    40  1
#6:  2     e 1   10    40  2
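One thing to be aware of with the direct translation: setDT converts df in place and `:=` adds columns by reference, so df itself ends up with a grp column (the final `grp := NULL` only removes it from the summarised result). If that is unwanted, a minimal sketch is to work on a copy. The names `dt` and `short2` are made up for illustration, and the data is a cut-down version of the question's example.

```r
library(data.table)

# Cut-down version of the question's data
df <- data.frame(
  id    = rep(c("1", "2"), each = 4),
  value = c("a", "a", "b", "d", "a", "a", "c", "e")
)

# as.data.table() makes a copy, so := below modifies dt and not df
dt <- as.data.table(df)

short2 <- dt[, grp := cumsum(value != shift(value, fill = value[1])), by = id
           ][, .(N = .N), by = .(id, grp, value)
           ][, grp := NULL][]

names(df)  # unchanged: "id" "value"
names(dt)  # the helper column stays on the copy: "id" "value" "grp"
```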