如何根据行之间的时间间隔和与其他变量相关的其他两个条件选择DF的特定行

问题描述 投票:1回答:1

我有一个数据框架df,它总结了特定动物物种个体的观察结果。列DateTime告诉你何时看到动物,看到它的Observer专栏,Animal专栏告诉你哪个特定的个体(他们可以被识别)。

df<-data.frame(DateTime=c("2016-08-01 12:04:07","2016-08-01 12:06:07","2016-08-01 12:06:58","2016-08-01 13:12:12","2016-08-01 14:04:07","2016-08-01 13:12:45","2016-08-01 15:04:07","2016-08-01 17:13:16","2016-08-01 17:21:16","2016-08-01 17:21:34","2016-08-01 17:23:42","2016-08-01 17:27:16","2016-08-01 17:27:22","2016-08-01 17:28:01","2016-08-01 17:29:28","2016-08-01 17:28:08","2016-08-01 17:28:15"),
               Observer=c("Peter","Sophie","Peter","Peter","Sophie","Sophie","Peter","Sophie","Sophie","Sophie","Peter","Peter","Peter","Andreu","Sophie","Anna","Peter"),
               Animal=c(1,2,1,1,2,1,2,1,2,1,1,2,2,2,1,2,2))
df$DateTime<- as.POSIXct(df$DateTime, format= "%Y-%m-%d %H:%M:%S", tz= "UTC")

df
              DateTime Observer Animal
1  2016-08-01 12:04:07    Peter      1
2  2016-08-01 12:06:07   Sophie      2
3  2016-08-01 12:06:58    Peter      1
4  2016-08-01 13:12:12    Peter      1
5  2016-08-01 14:04:07   Sophie      2
6  2016-08-01 13:12:45   Sophie      1
7  2016-08-01 15:04:07    Peter      2
8  2016-08-01 17:13:16   Sophie      1
9  2016-08-01 17:21:16   Sophie      2
10 2016-08-01 17:21:34   Sophie      1
11 2016-08-01 17:23:42    Peter      1
12 2016-08-01 17:27:16    Peter      2
13 2016-08-01 17:27:22    Peter      2
14 2016-08-01 17:28:01   Andreu      2
15 2016-08-01 17:29:28   Sophie      1
16 2016-08-01 17:28:08     Anna      2
17 2016-08-01 17:28:15    Peter      2

由于计算动物的方法,同一个人不能在不到60秒的时间内看到同一个人,而另一个人则不会。

出于特定目的,我需要创建一个df,其中每当有人看到特定的个体时,我会在接下来的60秒观察其他人时删除行(如果同一个人在不到60秒的时间内看到相同的动物我直接删除该行。我们可以在第12行和第13行中看到这个例子,但是我在Other_observers列中添加了这些删除行的信息,这些列总结了看到这种动物的其他人的数量,以及Who,它总结了他们的名字。

我想得到的是:

df
              DateTime Observer Ind Other_observers         Who
1  2016-08-01 12:04:07    Peter   1               0          NA
2  2016-08-01 12:06:07   Sophie   2               0          NA
3  2016-08-01 12:06:58    Peter   1               0          NA
4  2016-08-01 13:12:12    Peter   1               1      Sophie
5  2016-08-01 14:04:07   Sophie   2               0          NA
6  2016-08-01 15:04:07    Peter   2               0          NA
7  2016-08-01 17:13:16   Sophie   1               0          NA
8  2016-08-01 17:21:16   Sophie   2               0          NA
9  2016-08-01 17:21:34   Sophie   1               0          NA
10 2016-08-01 17:23:42    Peter   1               0          NA
11 2016-08-01 17:27:16    Peter   2               2 Andreu Anna
12 2016-08-01 17:28:15    Peter   2               0          NA
13 2016-08-01 17:29:28   Sophie   1               0          NA

有谁知道怎么做?

r dplyr lubridate
1个回答
1
投票

我再次考虑了一下,我认为我有一个(也更简单)解决方案,没有我们讨论过的限制。我添加了一些额外的观察来检查边缘情况。

library(tidyverse)

df <- tribble(
  ~DateTime, ~Observer, ~Animal,
  "2016-08-01 12:04:07",   "Peter",       1,
  "2016-08-01 12:06:07",  "Sophie",       2,
  "2016-08-01 12:06:58",   "Peter",       1,
  "2016-08-01 13:12:12",   "Peter",       1,
  "2016-08-01 14:04:07",  "Sophie",       2,
  "2016-08-01 13:12:45",  "Sophie",       1,
  "2016-08-01 15:04:07",   "Peter",       2,
  "2016-08-01 17:13:16",  "Sophie",       1,
  "2016-08-01 17:21:16",  "Sophie",       2,
  "2016-08-01 17:21:34",  "Sophie",       1,
  "2016-08-01 17:23:42",   "Peter",       1,
  "2016-08-01 17:27:16",   "Peter",       2,
  "2016-08-01 17:27:22",   "Peter",       2,
  "2016-08-01 17:28:01",  "Andreu",       2,
  "2016-08-01 17:29:28",  "Sophie",       1,
  "2016-08-01 17:28:08",    "Anna",       2,
  "2016-08-01 17:28:15",   "Peter",       2,
  "2016-08-01 17:28:17",   "Peter",       2,
  "2016-08-01 17:28:21",   "Peter",       2,
  "2016-08-01 17:28:21",   "Anna",        2,
) %>%
  mutate(DateTime = as.POSIXct(DateTime, format= "%Y-%m-%d %H:%M:%S", tz= "UTC"))

min_diff = as.difftime(60, units = c("secs"))

cumsum_reset <- function(s, x, reset) {
  ns <- s + x
  if (ns > reset) return(0)
  ns
}

df_wrangled <- df %>%
  arrange(DateTime) %>%
  group_by(Animal) %>%
  mutate(
    # Time difference to laste observation of this animal
    Diff = replace_na(DateTime - lag(DateTime, 1), 0),
    # Cumulative time since first observation, resets to 0 when more than `min_diff`
    CumDiff = accumulate(Diff, cumsum_reset, reset = min_diff),
    # Group observations within the `min_diff` period
    ObsGroup = cumsum(CumDiff == 0)
  ) %>%
  group_by(ObsGroup, add = TRUE) %>%
  summarize(
    Other_observers = length(unique(Observer)) - 1,
    Who = paste(unique(setdiff(Observer, Observer[1])), collapse = " "),
    DateTime = DateTime[1],
    Observer = Observer[1]
  ) %>%
  ungroup()

print(df_wrangled, n = Inf)
#> # A tibble: 13 x 6
#>    Animal ObsGroup Other_observers Who         DateTime            Observer
#>     <dbl>    <int>           <dbl> <chr>       <dttm>              <chr>   
#>  1      1        1               0 ""          2016-08-01 12:04:07 Peter   
#>  2      1        2               0 ""          2016-08-01 12:06:58 Peter   
#>  3      1        3               1 Sophie      2016-08-01 13:12:12 Peter   
#>  4      1        4               0 ""          2016-08-01 17:13:16 Sophie  
#>  5      1        5               0 ""          2016-08-01 17:21:34 Sophie  
#>  6      1        6               0 ""          2016-08-01 17:23:42 Peter   
#>  7      1        7               0 ""          2016-08-01 17:29:28 Sophie  
#>  8      2        1               0 ""          2016-08-01 12:06:07 Sophie  
#>  9      2        2               0 ""          2016-08-01 14:04:07 Sophie  
#> 10      2        3               0 ""          2016-08-01 15:04:07 Peter   
#> 11      2        4               0 ""          2016-08-01 17:21:16 Sophie  
#> 12      2        5               2 Andreu Anna 2016-08-01 17:27:16 Peter   
#> 13      2        6               1 Anna        2016-08-01 17:28:17 Peter

reprex package创建于2019-04-30(v0.2.1)

旧解决方案:

这是使用优秀的fuzzyjoin package的一种解决方案。基本上我只是将它们与min_dist分开,加入观察。

这里有一些棘手的边缘情况,我没有解决。例如,如果一个观察者记录一个动物的观察结果,每隔30秒记录5分钟,我相信只要它们<1分钟,它们都将被过滤掉,除了第一次观察。这可能不是你想要的,但我现在不确定如何解决这个问题。

library(tidyverse)
library(fuzzyjoin)
df<-data.frame(DateTime=c("2016-08-01 12:04:07","2016-08-01 12:06:07","2016-08-01 12:06:58","2016-08-01 13:12:12","2016-08-01 14:04:07","2016-08-01 13:12:45","2016-08-01 15:04:07","2016-08-01 17:13:16","2016-08-01 17:21:16","2016-08-01 17:21:34","2016-08-01 17:23:42","2016-08-01 17:27:16","2016-08-01 17:27:22","2016-08-01 17:28:01","2016-08-01 17:29:28","2016-08-01 17:28:08","2016-08-01 17:28:15"),
               Observer=c("Peter","Sophie","Peter","Peter","Sophie","Sophie","Peter","Sophie","Sophie","Sophie","Peter","Peter","Peter","Andreu","Sophie","Anna","Peter"),
               Animal=c(1,2,1,1,2,1,2,1,2,1,1,2,2,2,1,2,2))
df$DateTime<- as.POSIXct(df$DateTime, format= "%Y-%m-%d %H:%M:%S", tz= "UTC")


min_diff = as.difftime(1, units = c("mins"))

df_wrangled <- df %>%
  as_tibble() %>%
  arrange(DateTime) %>%
  # Add a unique id for each observation
  mutate(id = 1:n()) %>%
  fuzzy_left_join(
    x = .,
    y = .,
    by = c("Animal", "DateTime"),
    match_fun = list(
      `==`,
      function(x, y) y - x < min_diff & y - x > 0
    )
  ) %>%
  # Remove observations that occured within `min_diff`
  filter(!(id.x %in% id.y)) %>%
  # Remove observations by same observer within `min_diff`
  filter(ifelse(is.na(Observer.y), TRUE, Observer.x != Observer.y)) %>%
  group_by(DateTime.x, Observer.x, Animal.x, id.x) %>%
  summarize(
    Other_observers = length(na.omit(Observer.y)),
    Who = paste(Observer.y, collapse = " ")
  ) %>%
  ungroup()

print(df_wrangled, n = Inf)
#> # A tibble: 12 x 6
#>    DateTime.x          Observer.x Animal.x  id.x Other_observers Who       
#>    <dttm>              <fct>         <dbl> <int>           <int> <chr>     
#>  1 2016-08-01 12:04:07 Peter             1     1               0 NA        
#>  2 2016-08-01 12:06:07 Sophie            2     2               0 NA        
#>  3 2016-08-01 12:06:58 Peter             1     3               0 NA        
#>  4 2016-08-01 13:12:12 Peter             1     4               1 Sophie    
#>  5 2016-08-01 14:04:07 Sophie            2     6               0 NA        
#>  6 2016-08-01 15:04:07 Peter             2     7               0 NA        
#>  7 2016-08-01 17:13:16 Sophie            1     8               0 NA        
#>  8 2016-08-01 17:21:16 Sophie            2     9               0 NA        
#>  9 2016-08-01 17:21:34 Sophie            1    10               0 NA        
#> 10 2016-08-01 17:23:42 Peter             1    11               0 NA        
#> 11 2016-08-01 17:27:16 Peter             2    12               2 Andreu An…
#> 12 2016-08-01 17:29:28 Sophie            1    17               0 NA

reprex package创建于2019-04-30(v0.2.1)

© www.soinside.com 2019 - 2024. All rights reserved.