Grouping values that are contiguous in time


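I am trying to group Value entries that are contiguous in time, i.e. where one row's Endtime_ms equals the next row's Starttime_ms. However, all I have managed so far is to flag the contiguous Value entries (with "yes"). The problem is that two different groups can end up as one unbroken run of "yes" and therefore cannot be told apart: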

df %>%
  mutate(contiguous = ifelse(Endtime_ms == lead(Starttime_ms)|Starttime_ms == lag(Endtime_ms), "yes", "no"),
         grp = consecutive_id(contiguous)
  ) 
# A tibble: 20 × 5
   Value                            Starttime_ms Endtime_ms contiguous   grp
   <chr>                                   <dbl>      <dbl> <chr>      <int>
 1 "on this"                                 210        780 NA             1
 2 "okay"                                   3403       3728 no             2
 3 "cool thanks everyone um"                4221       5880 no             2
 4 "so yes in"                              5910       6900 yes            3 # one group
 5 "terms of our"                           6900       8370 yes            3 # one group
 6 "partnership"                            8370       8970 yes            3 # one group
 7 "projects"                               8970       9480 yes            3 # one group
 8 "what have we"                           9510      10080 yes            3 # another group
 9 "got on the"                            10080      11293 yes            3 # another group
10 "horizon? "                             11293      11960 yes            3 # another group
11 "let's have a look so the"              11980      13740 no             4
12 "LGBTQ plus"                            13813      16110 no             4
13 "city labs"                             16260      17070 yes            5
14 "have now"                              17070      17910 yes            5
15 "been um"                               17940      19320 no             6
16 "agreed in"                             19350      20190 yes            7
17 "terms of the"                          20190      20760 yes            7
18 "date so"                               20760      21330 yes            7
19 "we're looking at the fifteenth"        21330      22530 yes            7
20 "sixteenth"                             22860      23490 NA             8

The desired output looks like this:

   Value                            Starttime_ms Endtime_ms contiguous   grp
   <chr>                                   <dbl>      <dbl> <chr>      <int>
 1 "on this"                                 210        780 NA             1
 2 "okay"                                   3403       3728 no             2
 3 "cool thanks everyone um"                4221       5880 no             2
 4 "so yes in"                              5910       6900 yes            3 
 5 "terms of our"                           6900       8370 yes            3 
 6 "partnership"                            8370       8970 yes            3 
 7 "projects"                               8970       9480 yes            3 
 8 "what have we"                           9510      10080 yes            4
 9 "got on the"                            10080      11293 yes            4
10 "horizon? "                             11293      11960 yes            4
11 "let's have a look so the"              11980      13740 no             4
12 "LGBTQ plus"                            13813      16110 no             5
13 "city labs"                             16260      17070 yes            6
14 "have now"                              17070      17910 yes            6
15 "been um"                               17940      19320 no             7
16 "agreed in"                             19350      20190 yes            8
17 "terms of the"                          20190      20760 yes            8
18 "date so"                               20760      21330 yes            8
19 "we're looking at the fifteenth"        21330      22530 yes            8
20 "sixteenth"                             22860      23490 NA             9

Data:

df <- structure(list(
  Value = c("on this", "okay", "cool thanks everyone um", "so yes in",
            "terms of our", "partnership", "projects", "what have we",
            "got on the", "horizon? ", "let's have a look so the",
            "LGBTQ plus", "city labs", "have now", "been um", "agreed in",
            "terms of the", "date so", "we're looking at the fifteenth",
            "sixteenth"),
  Starttime_ms = c(210, 3403, 4221, 5910, 6900, 8370, 8970, 9510, 10080,
                   11293, 11980, 13813, 16260, 17070, 17940, 19350, 20190,
                   20760, 21330, 22860),
  Endtime_ms = c(780, 3728, 5880, 6900, 8370, 8970, 9480, 10080, 11293,
                 11960, 13740, 16110, 17070, 17910, 19320, 20190, 20760,
                 21330, 22530, 23490)),
  row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"))
Tags: r, dplyr

1 Answer

You can use the following:

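library(dplyr)

mutate(df,
       # c1: this row starts exactly where the previous row ends
       c1 = Starttime_ms == lag(Endtime_ms, default = -1),
       # c2: this row ends exactly where the next row starts
       c2 = Endtime_ms == lead(Starttime_ms, default = -1),
       contiguous = c1 | c2,
       # add 1 whenever a new chain starts directly after another chain;
       # default = FALSE keeps the first row from producing NA
       grp1 = consecutive_id(contiguous) +
         cumsum(!c1 & c1 != c2 & contiguous == lag(contiguous, default = FALSE))) |>
  select(-c(c1, c2))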

# A tibble: 20 × 5
   Value                            Starttime_ms Endtime_ms contiguous  grp1
   <chr>                                   <dbl>      <dbl> <lgl>      <int>
 1 "on this"                                 210        780 FALSE          1
 2 "okay"                                   3403       3728 FALSE          1
 3 "cool thanks everyone um"                4221       5880 FALSE          1
 4 "so yes in"                              5910       6900 TRUE           2
 5 "terms of our"                           6900       8370 TRUE           2
 6 "partnership"                            8370       8970 TRUE           2
 7 "projects"                               8970       9480 TRUE           2
 8 "what have we"                           9510      10080 TRUE           3
 9 "got on the"                            10080      11293 TRUE           3
10 "horizon? "                             11293      11960 TRUE           3
11 "let's have a look so the"              11980      13740 FALSE          4
12 "LGBTQ plus"                            13813      16110 FALSE          4
13 "city labs"                             16260      17070 TRUE           5
14 "have now"                              17070      17910 TRUE           5
15 "been um"                               17940      19320 FALSE          6
16 "agreed in"                             19350      20190 TRUE           7
17 "terms of the"                          20190      20760 TRUE           7
18 "date so"                               20760      21330 TRUE           7
19 "we're looking at the fifteenth"        21330      22530 TRUE           7
20 "sixteenth"                             22860      23490 FALSE          8

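The cumsum term is needed to split up independent chains that sit inside a single run of contiguous TRUE values. Note that row 12 contains an error; this presumably refers to the desired output in the question, where rows 11 and 12 are both non-contiguous yet receive different group numbers.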
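To see why the cumsum term matters, here is a minimal sketch on hypothetical toy data (the data frame `toy` and the helper columns `run_only` and `chain_start` are made up for illustration and are not part of the original answer). Rows 1-3 form one chain and rows 4-5 a second one, so every row is flagged as contiguous and `consecutive_id()` alone cannot separate them. The sketch assumes dplyr 1.1.0 or later, where `consecutive_id()` is available.

```r
library(dplyr)

# Hypothetical toy data: rows 1-3 chain together (0-10, 10-20, 20-30),
# rows 4-5 form a second chain (35-45, 45-55) after a 5 ms gap.
toy <- data.frame(
  Starttime_ms = c(0, 10, 20, 35, 45),
  Endtime_ms   = c(10, 20, 30, 45, 55)
)

res <- toy |>
  mutate(
    c1 = Starttime_ms == lag(Endtime_ms, default = -1),   # touches previous row
    c2 = Endtime_ms == lead(Starttime_ms, default = -1),  # touches next row
    contiguous = c1 | c2,
    # run-based id only: every row is TRUE, so everything lands in one group
    run_only = consecutive_id(contiguous),
    # a row opens a new chain right after another chain when it does not
    # touch the previous row, touches the next one, and the previous row
    # was itself contiguous
    chain_start = !c1 & c2 & lag(contiguous, default = FALSE),
    grp = run_only + cumsum(chain_start)
  )

res$run_only  # 1 1 1 1 1
res$grp       # 1 1 1 2 2
```

Without `default = FALSE` in the last `lag()` call, the first row of this toy data would evaluate to NA and `cumsum()` would propagate that NA to every group id, which is why the default is set explicitly here.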
