I'm new to programming and to R, and I'm a bit stuck. I have the following data table.
Date |ONIstatus
01/10/1993 |Average
01/11/1993 |Average
01/12/1993 |Average
01/01/1994 |Average
01/02/1994 |High
01/03/1994 |High
01/04/1994 |High
01/05/1994 |High
01/06/1994 |Low
01/07/1994 |Low
01/08/1994 |Average
01/09/1994 |Average
01/10/1994 |Average
01/11/1994 |Average
01/12/1994 |High
01/01/1995 |High
01/02/1995 |Low
01/03/1995 |Low
01/04/1995 |Low
01/05/1995 |Low
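I want to extract start and end dates based on the sequence of events in the "ONIstatus" column. The start date is the first entry of each run of identical ONIstatus values, and the end date is the date on which the next run begins. For example, the desired output for the first few runs would be: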
Start Date | End Date | ONIstatus
01/10/1993 | 01/02/1994 | Average
01/02/1994 | 01/06/1994 | High
01/06/1994 | 01/08/1994 | Low
01/08/1994 | 01/12/1994 | Average
01/12/1994 | 01/02/1995 | High
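And so on. I want to do this for the whole data set, which has a few hundred entries.
I have been trying to do this with dplyr and rle, but without much luck.
Hope this helps!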
s <- rle(as.character(df$ONIstatus))
df_final <- data.frame(ONIstatus = s$values, length = s$lengths)
#end index
df_final$end <- cumsum(df_final$length)
df_final$desired_end <- df_final$end +1
#start index
df_final$start <- df_final$end - df_final$length + 1
#start_date & end_date calculation based on start & end index
df_final$start_date <- df$Date[df_final$start]
df_final$end_date <- df$Date[df_final$desired_end]
#final output
df_final <- na.omit(df_final[,c('ONIstatus','start_date','end_date')])
df_final
The output is:
ONIstatus start_date end_date
1 Average 01/10/1993 01/02/1994
2 High 01/02/1994 01/06/1994
3 Low 01/06/1994 01/08/1994
4 Average 01/08/1994 01/12/1994
5 High 01/12/1994 01/02/1995
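The same run boundaries can also be found without rle, by comparing each status with the one on the previous row. The following is only a minimal sketch: `find_runs` is a hypothetical helper name, and the sample data is rebuilt by hand as character columns rather than factors.

```r
# Sample data rebuilt from the question: 20 monthly dates, six runs of statuses
df <- data.frame(
  Date = format(seq(as.Date("1993-10-01"), by = "month", length.out = 20),
                "%d/%m/%Y"),
  ONIstatus = rep(c("Average", "High", "Low", "Average", "High", "Low"),
                  times = c(4, 4, 2, 4, 2, 4)),
  stringsAsFactors = FALSE
)

# find_runs() is a hypothetical helper, not part of any package
find_runs <- function(dates, status) {
  n <- length(status)
  # TRUE wherever the status differs from the previous row;
  # the first row always starts a run
  is_start <- c(TRUE, status[-1] != status[-n])
  starts <- which(is_start)
  # each run ends where the next run starts, so the last run
  # (which has no following run) is left out
  data.frame(
    start_date = dates[starts[-length(starts)]],
    end_date   = dates[starts[-1]],
    ONIstatus  = status[starts[-length(starts)]],
    stringsAsFactors = FALSE
  )
}

runs <- find_runs(df$Date, df$ONIstatus)
runs
```

Working on character vectors avoids the factor indexing in the rle version, at the cost of having to convert factor columns with as.character() first.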
#sample data
> dput(df)
structure(list(Date = structure(c(15L, 17L, 19L, 1L, 3L, 5L,
7L, 9L, 11L, 12L, 13L, 14L, 16L, 18L, 20L, 2L, 4L, 6L, 8L, 10L
), .Label = c("01/01/1994", "01/01/1995", "01/02/1994", "01/02/1995",
"01/03/1994", "01/03/1995", "01/04/1994", "01/04/1995", "01/05/1994",
"01/05/1995", "01/06/1994", "01/07/1994", "01/08/1994", "01/09/1994",
"01/10/1993", "01/10/1994", "01/11/1993", "01/11/1994", "01/12/1993",
"01/12/1994"), class = "factor"), ONIstatus = structure(c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 1L, 1L, 1L, 2L, 2L, 3L,
3L, 3L, 3L), .Label = c("Average", "High", "Low"), class = "factor")), .Names = c("Date",
"ONIstatus"), class = "data.frame", row.names = c(NA, -20L))
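We can use the tidyverse. Because the same status occurs in several separate runs, group by a run counter rather than by ONIstatus alone.
library(dplyr)
library(lubridate)
df %>%
  # parse the dates and number each run of identical consecutive statuses
  mutate(Date = dmy(as.character(Date)),
         ONIstatus = as.character(ONIstatus),
         run = cumsum(ONIstatus != lag(ONIstatus, default = first(ONIstatus)))) %>%
  group_by(run, ONIstatus) %>%
  summarise(StartDate = min(Date)) %>%
  ungroup() %>%
  # a run ends where the next one starts; the last run has no end and is dropped
  mutate(EndDate = lead(StartDate)) %>%
  filter(!is.na(EndDate)) %>%
  mutate(StartDate = format(StartDate, "%d/%m/%Y"),
         EndDate = format(EndDate, "%d/%m/%Y")) %>%
  select(StartDate, EndDate, ONIstatus)
# A tibble: 5 x 3
#  StartDate  EndDate    ONIstatus
#  <chr>      <chr>      <chr>
#1 01/10/1993 01/02/1994 Average
#2 01/02/1994 01/06/1994 High
#3 01/06/1994 01/08/1994 Low
#4 01/08/1994 01/12/1994 Average
#5 01/12/1994 01/02/1995 High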