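I have a question about how to count occurrences of a specified pattern in a dataset in R.
I am currently working with continuous glucose monitoring datasets. In short, each dataset has 1500 to 2000 observations (each observation is a plasma glucose value measured every 5 minutes over 6 days).
I need to count the occurrences of glucose values below 3.9 that last 15 minutes or more in a row and less than 120 minutes (> 3 observations and < 24 consecutive observations with values < 3.9).
I have created a new variable, a factor coded 1 or 0, indicating whether the plasma glucose value is below 3.9.
I would then like to count the number of runs with more than three consecutive 1s and fewer than twenty-four consecutive 1s.
Is there a function in R for this, or what is the simplest way to do it?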
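I'm not sure my data structure matches yours, but the following code may still be useful.
I assume a data structure containing a measurement, a person ID, and a measurement ID.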
library(dplyr)
# create dummy data
set.seed(123)
data_test = data.frame(measure = rnorm(100, 3.5,2), person_id = rep(1:10, each = 10), measure_id = rep(1:10, 10))
data_test$below_criterion = 0 # indicator for measures below crit-value
data_test$below_criterion[which(data_test$measure < 3.9)] = 1 # indicator for measures below crit-value
# indicator, that shows if the current measurement is the first one below crit_val in a possible series
# shift columns, to compare current value with previous one
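data_test = data_test %>% group_by(person_id) %>% mutate(prev_below_crit = lag(below_criterion, default = 0)) # previous value within each person; the first measurement has no predecessor, so it is treated as above the threshold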
data_test$start_of_run = 0 # create the indicator variable
data_test$start_of_run[which(data_test$below_criterion == 1 & data_test$prev_below_crit == 0)] = 1 # if current value is below crit and previous value is above, this is the start of a series
data_test = data_test %>% group_by(person_id) %>% mutate(grouper = cumsum(start_of_run)) # helper-variable to group all the possible series within a person
data_test = data_test %>% select(measure, person_id, measure_id, below_criterion, grouper) # get rid of the previous created helper-variables
data_results = data_test %>% group_by(person_id, grouper) %>% summarise(count_below_crit = sum(below_criterion)) # count the length of each series by summing up all below_crit indicators within a person and series
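data_results = data_results %>% group_by(person_id) %>% filter(count_below_crit >= 3 & count_below_crit < 24) %>% summarise(n_runs = n()) # count the runs per person that last at least 15 minutes (3 readings) and less than 120 minutes (fewer than 24 readings); persons without such a run do not appear in the result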
data_results
data.frame(data_test)
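A more compact alternative may be base R's rle(), which run-length encodes a vector. The sketch below is only an illustration under a few assumptions: the readings are already sorted by time, one reading is taken every 5 minutes, and "15 minutes or more but less than 120 minutes" is read as 3 to 23 consecutive readings. The function name count_low_runs and its parameters are made up for this example.

```r
# Count runs of consecutive readings below `threshold` whose length lies
# between `min_len` and `max_len` (inclusive). Names are illustrative only.
count_low_runs <- function(values, threshold = 3.9, min_len = 3, max_len = 23) {
  runs <- rle(values < threshold)  # run-length encode the TRUE/FALSE indicator
  # keep only runs of TRUE (below threshold) with a length in the desired range
  sum(runs$values & runs$lengths >= min_len & runs$lengths <= max_len)
}

# Toy series: one run of four low readings and one run of two (too short)
glucose <- c(5.1, 3.5, 3.2, 3.8, 3.0, 4.5, 3.7, 3.6, 6.0)
count_low_runs(glucose)

# Per-person counts for a data frame with `person_id` and `measure` columns
df <- data.frame(person_id = rep(1:2, each = 5),
                 measure = c(3.0, 3.1, 3.2, 5.0, 5.0, 5.0, 3.0, 5.0, 3.0, 5.0))
tapply(df$measure, df$person_id, count_low_runs)
```

For the toy series the function returns 1, since only the run of four readings is long enough. If your definition of the bounds differs (for example, strictly more than 3 readings), adjust min_len and max_len accordingly.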