我有一个数据框架df
,它总结了特定动物物种个体的观察结果。列DateTime
告诉你何时看到动物,看到它的Observer
专栏,Animal
专栏告诉你哪个特定的个体(他们可以被识别)。
df<-data.frame(DateTime=c("2016-08-01 12:04:07","2016-08-01 12:06:07","2016-08-01 12:06:58","2016-08-01 13:12:12","2016-08-01 14:04:07","2016-08-01 13:12:45","2016-08-01 15:04:07","2016-08-01 17:13:16","2016-08-01 17:21:16","2016-08-01 17:21:34","2016-08-01 17:23:42","2016-08-01 17:27:16","2016-08-01 17:27:22","2016-08-01 17:28:01","2016-08-01 17:29:28","2016-08-01 17:28:08","2016-08-01 17:28:15"),
Observer=c("Peter","Sophie","Peter","Peter","Sophie","Sophie","Peter","Sophie","Sophie","Sophie","Peter","Peter","Peter","Andreu","Sophie","Anna","Peter"),
Animal=c(1,2,1,1,2,1,2,1,2,1,1,2,2,2,1,2,2))
df$DateTime<- as.POSIXct(df$DateTime, format= "%Y-%m-%d %H:%M:%S", tz= "UTC")
df
DateTime Observer Animal
1 2016-08-01 12:04:07 Peter 1
2 2016-08-01 12:06:07 Sophie 2
3 2016-08-01 12:06:58 Peter 1
4 2016-08-01 13:12:12 Peter 1
5 2016-08-01 14:04:07 Sophie 2
6 2016-08-01 13:12:45 Sophie 1
7 2016-08-01 15:04:07 Peter 2
8 2016-08-01 17:13:16 Sophie 1
9 2016-08-01 17:21:16 Sophie 2
10 2016-08-01 17:21:34 Sophie 1
11 2016-08-01 17:23:42 Peter 1
12 2016-08-01 17:27:16 Peter 2
13 2016-08-01 17:27:22 Peter 2
14 2016-08-01 17:28:01 Andreu 2
15 2016-08-01 17:29:28 Sophie 1
16 2016-08-01 17:28:08 Anna 2
17 2016-08-01 17:28:15 Peter 2
由于计算动物的方法,同一个人不能在不到60秒的时间内看到同一个人,而另一个人则不会。
出于特定目的,我需要创建一个df
,其中每当有人看到特定的个体时,我会在接下来的60秒观察其他人时删除行(如果同一个人在不到60秒的时间内看到相同的动物我直接删除该行。我们可以在第12行和第13行中看到这个例子,但是我在Other_observers
列中添加了这些删除行的信息,这些列总结了看到这种动物的其他人的数量,以及Who
,它总结了他们的名字。
我想得到的是:
df
DateTime Observer Ind Other_observers Who
1 2016-08-01 12:04:07 Peter 1 0 NA
2 2016-08-01 12:06:07 Sophie 2 0 NA
3 2016-08-01 12:06:58 Peter 1 0 NA
4 2016-08-01 13:12:12 Peter 1 1 Sophie
5 2016-08-01 14:04:07 Sophie 2 0 NA
6 2016-08-01 15:04:07 Peter 2 0 NA
7 2016-08-01 17:13:16 Sophie 1 0 NA
8 2016-08-01 17:21:16 Sophie 2 0 NA
9 2016-08-01 17:21:34 Sophie 1 0 NA
10 2016-08-01 17:23:42 Peter 1 0 NA
11 2016-08-01 17:27:16 Peter 2 2 Andreu Anna
12 2016-08-01 17:28:15 Peter 2 0 NA
13 2016-08-01 17:29:28 Sophie 1 0 NA
有谁知道怎么做?
我再次考虑了一下,我认为我有一个(也更简单)解决方案,没有我们讨论过的限制。我添加了一些额外的观察来检查边缘情况。
library(tidyverse)
df <- tribble(
~DateTime, ~Observer, ~Animal,
"2016-08-01 12:04:07", "Peter", 1,
"2016-08-01 12:06:07", "Sophie", 2,
"2016-08-01 12:06:58", "Peter", 1,
"2016-08-01 13:12:12", "Peter", 1,
"2016-08-01 14:04:07", "Sophie", 2,
"2016-08-01 13:12:45", "Sophie", 1,
"2016-08-01 15:04:07", "Peter", 2,
"2016-08-01 17:13:16", "Sophie", 1,
"2016-08-01 17:21:16", "Sophie", 2,
"2016-08-01 17:21:34", "Sophie", 1,
"2016-08-01 17:23:42", "Peter", 1,
"2016-08-01 17:27:16", "Peter", 2,
"2016-08-01 17:27:22", "Peter", 2,
"2016-08-01 17:28:01", "Andreu", 2,
"2016-08-01 17:29:28", "Sophie", 1,
"2016-08-01 17:28:08", "Anna", 2,
"2016-08-01 17:28:15", "Peter", 2,
"2016-08-01 17:28:17", "Peter", 2,
"2016-08-01 17:28:21", "Peter", 2,
"2016-08-01 17:28:21", "Anna", 2,
) %>%
mutate(DateTime = as.POSIXct(DateTime, format= "%Y-%m-%d %H:%M:%S", tz= "UTC"))
min_diff = as.difftime(60, units = c("secs"))
cumsum_reset <- function(s, x, reset) {
ns <- s + x
if (ns > reset) return(0)
ns
}
df_wrangled <- df %>%
arrange(DateTime) %>%
group_by(Animal) %>%
mutate(
# Time difference to laste observation of this animal
Diff = replace_na(DateTime - lag(DateTime, 1), 0),
# Cumulative time since first observation, resets to 0 when more than `min_diff`
CumDiff = accumulate(Diff, cumsum_reset, reset = min_diff),
# Group observations within the `min_diff` period
ObsGroup = cumsum(CumDiff == 0)
) %>%
group_by(ObsGroup, add = TRUE) %>%
summarize(
Other_observers = length(unique(Observer)) - 1,
Who = paste(unique(setdiff(Observer, Observer[1])), collapse = " "),
DateTime = DateTime[1],
Observer = Observer[1]
) %>%
ungroup()
print(df_wrangled, n = Inf)
#> # A tibble: 13 x 6
#> Animal ObsGroup Other_observers Who DateTime Observer
#> <dbl> <int> <dbl> <chr> <dttm> <chr>
#> 1 1 1 0 "" 2016-08-01 12:04:07 Peter
#> 2 1 2 0 "" 2016-08-01 12:06:58 Peter
#> 3 1 3 1 Sophie 2016-08-01 13:12:12 Peter
#> 4 1 4 0 "" 2016-08-01 17:13:16 Sophie
#> 5 1 5 0 "" 2016-08-01 17:21:34 Sophie
#> 6 1 6 0 "" 2016-08-01 17:23:42 Peter
#> 7 1 7 0 "" 2016-08-01 17:29:28 Sophie
#> 8 2 1 0 "" 2016-08-01 12:06:07 Sophie
#> 9 2 2 0 "" 2016-08-01 14:04:07 Sophie
#> 10 2 3 0 "" 2016-08-01 15:04:07 Peter
#> 11 2 4 0 "" 2016-08-01 17:21:16 Sophie
#> 12 2 5 2 Andreu Anna 2016-08-01 17:27:16 Peter
#> 13 2 6 1 Anna 2016-08-01 17:28:17 Peter
由reprex package创建于2019-04-30(v0.2.1)
旧解决方案:
这是使用优秀的fuzzyjoin package的一种解决方案。基本上我只是将它们与min_dist
分开,加入观察。
这里有一些棘手的边缘情况,我没有解决。例如,如果一个观察者记录一个动物的观察结果,每隔30秒记录5分钟,我相信只要它们<1分钟,它们都将被过滤掉,除了第一次观察。这可能不是你想要的,但我现在不确定如何解决这个问题。
library(tidyverse)
library(fuzzyjoin)
df<-data.frame(DateTime=c("2016-08-01 12:04:07","2016-08-01 12:06:07","2016-08-01 12:06:58","2016-08-01 13:12:12","2016-08-01 14:04:07","2016-08-01 13:12:45","2016-08-01 15:04:07","2016-08-01 17:13:16","2016-08-01 17:21:16","2016-08-01 17:21:34","2016-08-01 17:23:42","2016-08-01 17:27:16","2016-08-01 17:27:22","2016-08-01 17:28:01","2016-08-01 17:29:28","2016-08-01 17:28:08","2016-08-01 17:28:15"),
Observer=c("Peter","Sophie","Peter","Peter","Sophie","Sophie","Peter","Sophie","Sophie","Sophie","Peter","Peter","Peter","Andreu","Sophie","Anna","Peter"),
Animal=c(1,2,1,1,2,1,2,1,2,1,1,2,2,2,1,2,2))
df$DateTime<- as.POSIXct(df$DateTime, format= "%Y-%m-%d %H:%M:%S", tz= "UTC")
min_diff = as.difftime(1, units = c("mins"))
df_wrangled <- df %>%
as_tibble() %>%
arrange(DateTime) %>%
# Add a unique id for each observation
mutate(id = 1:n()) %>%
fuzzy_left_join(
x = .,
y = .,
by = c("Animal", "DateTime"),
match_fun = list(
`==`,
function(x, y) y - x < min_diff & y - x > 0
)
) %>%
# Remove observations that occured within `min_diff`
filter(!(id.x %in% id.y)) %>%
# Remove observations by same observer within `min_diff`
filter(ifelse(is.na(Observer.y), TRUE, Observer.x != Observer.y)) %>%
group_by(DateTime.x, Observer.x, Animal.x, id.x) %>%
summarize(
Other_observers = length(na.omit(Observer.y)),
Who = paste(Observer.y, collapse = " ")
) %>%
ungroup()
print(df_wrangled, n = Inf)
#> # A tibble: 12 x 6
#> DateTime.x Observer.x Animal.x id.x Other_observers Who
#> <dttm> <fct> <dbl> <int> <int> <chr>
#> 1 2016-08-01 12:04:07 Peter 1 1 0 NA
#> 2 2016-08-01 12:06:07 Sophie 2 2 0 NA
#> 3 2016-08-01 12:06:58 Peter 1 3 0 NA
#> 4 2016-08-01 13:12:12 Peter 1 4 1 Sophie
#> 5 2016-08-01 14:04:07 Sophie 2 6 0 NA
#> 6 2016-08-01 15:04:07 Peter 2 7 0 NA
#> 7 2016-08-01 17:13:16 Sophie 1 8 0 NA
#> 8 2016-08-01 17:21:16 Sophie 2 9 0 NA
#> 9 2016-08-01 17:21:34 Sophie 1 10 0 NA
#> 10 2016-08-01 17:23:42 Peter 1 11 0 NA
#> 11 2016-08-01 17:27:16 Peter 2 12 2 Andreu An…
#> 12 2016-08-01 17:29:28 Sophie 1 17 0 NA
由reprex package创建于2019-04-30(v0.2.1)