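I would like to split an xts object of precipitation depths into chunks so that I can delineate individual events from a continuous observation record for subsequent analysis.
Suppose I am working with this data: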
datetime <- seq(as.POSIXct("2020-01-01 00:00", tz = "UTC"),
                by = "1 min",
                length.out = 1440)
vals <- rep(0, length(datetime))
x <- xts::xts(vals, order.by = datetime)
# fill xts object with some random values
# event #1
zoo::coredata(x["2020-01-01 00:30/2020-01-01 02:35"]) <- runif(126, min = 0.01, max = 0.2)
# event #2
zoo::coredata(x["2020-01-01 08:45/2020-01-01 12:50"]) <- runif(246, min = 0.01, max = 0.2)
# event #3
zoo::coredata(x["2020-01-01 17:15/2020-01-01 17:30"]) <- runif(16, min = 0.01, max = 0.2)
zoo::coredata(x["2020-01-01 18:15/2020-01-01 19:00"]) <- runif(46, min = 0.01, max = 0.2)
zoo::coredata(x["2020-01-01 22:30/2020-01-01 23:00"]) <- runif(31, min = 0.01, max = 0.2)
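To delineate events, I want them to meet the following criterion:
An event starts with the first value > 0 and ends with the last value > 0 that is not followed by any further precipitation within the next 4 hours.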
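From my research, I need a factor vector of `length(x)` whose levels correspond to an "event ID", to use as input for `split.zoo`. Event #1 is identified by level 1, event #2 by level 2, and so on. I am not interested in the precipitation breaks themselves, so they can simply be mapped to 0 (or even dropped at this point).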
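To make the role of the grouping vector concrete, here is a minimal sketch on toy data. The names `toy`, `g_toy` and `parts` are made up for illustration, and the grouping vector is written by hand rather than derived from the data:

```r
library(xts)

# Six one-minute observations; 0 marks a dry minute (toy data, not the real record)
idx <- seq(as.POSIXct("2020-01-01 00:00", tz = "UTC"),
           by = "1 min", length.out = 6)
toy <- xts(c(0, 0.1, 0.2, 0, 0.3, 0.1), order.by = idx)

# Hand-written grouping vector: 0 = break, 1 and 2 = event IDs
g_toy <- c(0, 1, 1, 0, 2, 2)

# One list component per distinct value of the grouping vector
parts <- split(toy, g_toy)
names(parts)
```

The open problem is therefore only how to compute such a vector efficiently for the real series.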
My expected result would be a list of zoo/xts objects containing the individual events:
# looking for `g` in the end, making use of some efficient rolling approach
split(x, g) |> str()
#> List of 4
#> $ 0:'zoo' series from 2020-01-01 to 2020-01-01 23:59:00
#> Data: num [1:722, 1] 0 0 0 0 0 0 0 0 0 0 ...
#> Index: POSIXct[1:722], format: "2020-01-01 00:00:00" "2020-01-01 00:01:00" ...
#> $ 1:'zoo' series from 2020-01-01 00:30:00 to 2020-01-01 02:35:00
#> Data: num [1:126, 1] 0.1737 0.0958 0.0491 0.1861 0.1877 ...
#> Index: POSIXct[1:126], format: "2020-01-01 00:30:00" "2020-01-01 00:31:00" ...
#> $ 2:'zoo' series from 2020-01-01 08:45:00 to 2020-01-01 12:50:00
#> Data: num [1:246, 1] 0.1136 0.1473 0.0433 0.1311 0.1741 ...
#> Index: POSIXct[1:246], format: "2020-01-01 08:45:00" "2020-01-01 08:46:00" ...
#> $ 3:'zoo' series from 2020-01-01 17:15:00 to 2020-01-01 23:00:00
#> Data: num [1:346, 1] 0.1614 0.0632 0.1216 0.1888 0.0967 ...
#> Index: POSIXct[1:346], format: "2020-01-01 17:15:00" "2020-01-01 17:16:00" ...
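Since my real-world data is high-resolution and spans several decades, I would prefer a fast approach. Ideally, the solution would work for data at various temporal resolutions, i.e. also for 5-minute or hourly data.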
We can use `na.trim`, `rle` and `split`. First define functions to convert from 0 to NA and from NA to 0 for use later on. Then convert all zeroes to NAs and use `na.trim` to trim the NAs off the ends. Then find runs of at least 240 NAs (change this to something else if not using minutes), take the cumulative sum and convert back from runs to data to get the grouping vector `g` for splitting. After splitting, trim the NAs in each component and convert the NAs back to zero. Finally, insert the original data as component 0.
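The run-length threshold depends on the sampling step, so for other resolutions it can be derived rather than hard-coded. The following is only a sketch: `min_dry_run` and `gap_secs` are hypothetical names, and it assumes a regularly spaced POSIXct index.

```r
library(xts)

# Hypothetical helper: number of consecutive dry observations spanning `gap_secs`
min_dry_run <- function(x, gap_secs = 4 * 3600) {
  # typical spacing of the index in seconds (assumes a regular POSIXct index)
  step_secs <- median(diff(as.numeric(index(x))))
  ceiling(gap_secs / step_secs)
}

# 5-minute series covering one day
idx5 <- seq(as.POSIXct("2020-01-01 00:00", tz = "UTC"),
            by = "5 min", length.out = 288)
x5 <- xts(rep(0, 288), order.by = idx5)

min_dry_run(x5)
#> [1] 48
```

For the one-minute data in the question this gives 240, the value used in the `rle` comparison.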
library(xts)
zero2na <- function(x) replace(x, x == 0, NA)
na2zero <- function(x) replace(x, is.na(x), 0)
x_na <- x |> zero2na() |> na.trim()
r <- rle(is.na(as.vector(x_na)))
r$values <- cumsum(r$values & r$lengths >= 240)
g <- inverse.rle(r) + 1
L <- lapply(split(x_na, g), \(x) x |> na.trim() |> na2zero())
L <- c(list("0" = x), L)
# check
str(L)
giving:
List of 4
$ 0:An xts object on 2020-01-01 / 2020-01-01 23:59:00 containing:
Data: double [1440, 1]
Index: POSIXct,POSIXt [1440] (TZ: "UTC")
$ 1:‘zoo’ series from 2020-01-01 00:30:00 to 2020-01-01 02:35:00
Data: num [1:126, 1] 0.0646 0.1598 0.0877 0.1778 0.1887 ...
Index: POSIXct[1:126], format: "2020-01-01 00:30:00" "2020-01-01 00:31:00" ...
$ 2:‘zoo’ series from 2020-01-01 08:45:00 to 2020-01-01 12:50:00
Data: num [1:246, 1] 0.0393 0.0273 0.037 0.1411 0.1277 ...
Index: POSIXct[1:246], format: "2020-01-01 08:45:00" "2020-01-01 08:46:00" ...
$ 3:‘zoo’ series from 2020-01-01 17:15:00 to 2020-01-01 23:00:00
Data: num [1:346, 1] 0.1746 0.0965 0.1114 0.1931 0.1572 ...
Index: POSIXct[1:346], format: "2020-01-01 17:15:00" "2020-01-01 17:16:00" ...
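If a grouping vector of `length(x)` with breaks mapped to 0 is wanted, as described in the question, one possible alternative is to work on the timestamps of the wet observations instead of on run lengths. This is a sketch and not the approach shown above: `event_id` and its `gap` argument are hypothetical names, the index is assumed to be POSIXct (so that differences are in seconds), and the toy series only mimics the wet periods of the question by row number.

```r
library(xts)

# Hypothetical helper: event ID per observation, 0 = dry break.
# `gap` is the dry spell in seconds that must be exceeded to separate two events.
event_id <- function(x, gap = 4 * 3600) {
  n <- NROW(x)
  wet <- which(as.vector(coredata(x)) > 0)
  if (length(wet) == 0L) return(integer(n))
  # a new event starts where the time since the previous wet value exceeds `gap`
  new_event <- c(TRUE, diff(as.numeric(index(x)[wet])) > gap)
  first <- wet[new_event]                 # first wet row of each event
  last  <- wet[c(new_event[-1], TRUE)]    # last wet row of each event
  # number of event starts at or before each row (0 before the first event)
  id <- findInterval(seq_len(n), first)
  # rows after the last wet value of their event belong to a break
  pos <- which(id > 0L)
  id[pos[pos > last[id[pos]]]] <- 0L
  id
}

# Toy series with the same wet periods as in the question (row 1 = 00:00)
idx <- seq(as.POSIXct("2020-01-01 00:00", tz = "UTC"),
           by = "1 min", length.out = 1440)
vals <- numeric(1440)
wet_rows <- c(31:156, 526:771, 1036:1051, 1096:1141, 1351:1381)
vals[wet_rows] <- runif(length(wet_rows), min = 0.01, max = 0.2)
x_toy <- xts(vals, order.by = idx)

g <- event_id(x_toy)
table(g)
split(x_toy, g) |> str()
```

Because the comparison is made on timestamps, the same call should work for 5-minute or hourly data without changing a run-length threshold. Short dry spells inside an event keep the event's ID, so they stay part of that component.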