How to get the sd of variable "b" for a day when the value of variable "a" exceeds a threshold

Question

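I have an hourly data set with 3 variables (time, a, b), and I want to look at the standard deviation of "b" on the specific days where "a" has an outlier. The idea is: if the value of variable "a" exceeds some threshold (99 in the example below), what is the standard deviation of variable "b" over that whole day, and what is the sd of "b" for the day before and the day after? Here is an example to clarify the question: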

set.seed(1)
df = data.frame("time" =  seq( 
 from = as.POSIXct("2016-05-01 00:00", tz = "Europe/Berlin"), 
 to = as.POSIXct("2016-05-04 23:00", tz = "Europe/Berlin"),
 by = "hour"),  "a" = runif(96, min=0, max=100), "b" = runif(96, min=1200, 
 max=30000))

If this is the data, I would like to write a command along these lines:

test = data.frame("time" = df$time, "extreme" = ifelse(df$a> 99, sd(#take the sd of "b" for the day where df$a>99 occured) & sd(#and for the day before and after), 0 ))

test = subset(test, test$extreme>0) # to have a data frame with the important values only

I appreciate any help.

Answer 1

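If you want to find the day on which a exceeds the threshold, take all values for the day before, that day, and the day after, and then compute the standard deviation of b over them: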

library(lubridate)  # provides day()

threshold_day <- day(df[df$a > 99, ]$time)
threshold_days <- c(threshold_day - 1, threshold_day, threshold_day + 1)
outlier_days <- df[day(df$time) %in% threshold_days, ]
outlier_days$sd_b <- sd(outlier_days$b)
head(outlier_days)
#                  time        a        b     sd_b
# 1 2016-05-01 00:00:00 26.55087 14311.90 7730.978
# 2 2016-05-01 01:00:00 37.21239 13010.42 7730.978
# 3 2016-05-01 02:00:00 57.28534 24553.06 7730.978
# 4 2016-05-01 03:00:00 90.82078 18622.08 7730.978
# 5 2016-05-01 04:00:00 20.16819 20056.05 7730.978
# 6 2016-05-01 05:00:00 89.83897 11372.08 7730.978

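Note that this only covers the day itself and the day after, because there is no data for the day before. A column holding the standard deviation is also usually not very useful, since it is a single repeated value, but I think this is what you asked for. Please clarify if you meant something else.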

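If you only want the standard deviations, and you want them grouped by day, just split by day and apply sd. Again, you will only get two days (two groups), because the threshold is exceeded on the first day of your data, so the previous day cannot be included (there is no data for April).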

tapply(outlier_days$b, day(outlier_days$time), sd)

If you really want it grouped but still inside the data frame, you could merge the result back in, but you are probably better off using dplyr:

library(dplyr)      # filter(), group_by(), mutate(), %>%
library(lubridate)  # day()

threshold_day <- day(filter(df, a > 99)$time)
threshold_days <- c(threshold_day - 1, threshold_day, threshold_day + 1)
filter(df, day(time) %in% threshold_days) %>%
    group_by(day(time)) %>%
    mutate(sd_b = sd(b))

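Of course, if you post another reprex with different data, say one whose dates span several months, this will fail unless it is modified to fit that input. That is why it is important to test against the full range of expected inputs. For example, for data covering more than one month you need to group by the full date, not just the day of the month. (Swap day() for date() and you will get results that work for that data.)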
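To illustrate the month-boundary point, here is a minimal base R sketch. The data frame `df2`, its date range, and the forced exceedance on 2016-06-01 are all made up for illustration and are not from the question. It keys each row on its full calendar date, so that "the day before June 1" resolves to May 31 rather than to day-of-month 0.

```r
# Hypothetical data spanning a month boundary (not the asker's data).
set.seed(1)
df2 <- data.frame(
  time = seq(
    from = as.POSIXct("2016-05-31 00:00", tz = "Europe/Berlin"),
    to   = as.POSIXct("2016-06-02 23:00", tz = "Europe/Berlin"),
    by   = "hour"
  )
)
df2$a <- pmin(runif(nrow(df2), min = 0, max = 100), 98)  # keep everything below the threshold
df2$b <- runif(nrow(df2), min = 1200, max = 30000)

# Force a single exceedance on 2016-06-01 so the example is deterministic.
df2$a[format(df2$time, "%Y-%m-%d %H") == "2016-06-01 12"] <- 99.5

# Full calendar date in the data's own time zone.
df2$date <- as.Date(format(df2$time, "%Y-%m-%d"))

# Dates with an exceedance, plus the day before and the day after.
hit_dates <- unique(df2$date[df2$a > 99])
window    <- unique(c(hit_dates - 1, hit_dates, hit_dates + 1))

# Standard deviation of b per calendar date inside that window.
in_window  <- df2$date %in% window
sd_by_date <- tapply(df2$b[in_window], format(df2$date[in_window]), sd)
sd_by_date
```

With day-of-month arithmetic, the day before June 1 would be 0 and match nothing; with `Date` arithmetic it is 2016-05-31, so all three days appear in `sd_by_date`.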

Answer 2

As already pointed out in the comments, you have only 1 case where a > 99, so the result is NA. Nevertheless, here is code that gives you this value:

library(tidyverse)
df %>% filter(a > 99) %>% mutate(sd_b = sd(b))

Result:

             time        a        b      sd_b
1 2016-05-01 17:00:00 99.19061 13626.44  NaN

Note that if you have a larger data set in which b may contain NAs, you have to take that into account.
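A small sketch of both cases, using a made-up vector `b` rather than the question's data: `sd()` returns NA when any input is missing unless `na.rm = TRUE` is passed, and it also returns NA for a single observation, which is why filtering down to one row gives no usable value.

```r
# Hypothetical values with one missing observation.
b <- c(1200, 1500, NA, 1800)

sd_all <- sd(b)                # NA, because one input is missing
sd_obs <- sd(b, na.rm = TRUE)  # 300, computed from 1200, 1500, 1800

# A single observation has no sample standard deviation.
sd_one <- sd(1200)             # NA
```

In the dplyr pipeline above, the same argument would go inside the call, as in `mutate(sd_b = sd(b, na.rm = TRUE))`.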

Answer 3

Thanks for your help, @Dan Hall. I used some of your commands to arrive at the right answer:

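library(dplyr)      # group_by(), mutate(), %>%
library(lubridate)  # date()
library(Hmisc)      # Lag() with a `shift` argument (assumed to be Hmisc::Lag)

# Add additional variable with the daily sd of "b"
df_augmented = df %>% group_by(date(time)) %>%
    mutate(sd_price = sd(b))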

#Filter the dates plus minus one day where the value is a>99
sd.extreme = data.frame("time" = df_augmented$time, 
                    "date" = date(df_augmented$time),
                    "sd_b_lagday" = ifelse(df_augmented$a>99, 
                                    Lag(df_augmented$sd_price, shift = 24) , 0),
                    "sd_b_day" = ifelse(df_augmented$a>99, 
                                 df_augmented$sd_price , 0),
                    "sd_b_leadday" = ifelse(df_augmented$a>99, 
                                     Lag(df_augmented$sd_price, shift = -24) , 0)
                    )

sd.extreme = subset(sd.extreme, sd.extreme$sd_b_day >0)

sd.extreme = sd.extreme[!duplicated(sd.extreme$date) ,]    

sd.extreme = sd.extreme[,-1]
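For comparison, the same table can be sketched in base R without shifting by 24 rows, which only lines up when every day has exactly 24 hourly rows. This is an alternative under that motivation, not the approach used above; `day_key`, `daily_sd`, `sd_for`, and `sd_extreme` are names chosen here for illustration. It computes one sd per calendar date and then looks up the neighbouring dates by name, returning NA where a neighbouring day has no data.

```r
# Rebuild the example data from the question.
set.seed(1)
df <- data.frame(
  time = seq(
    from = as.POSIXct("2016-05-01 00:00", tz = "Europe/Berlin"),
    to   = as.POSIXct("2016-05-04 23:00", tz = "Europe/Berlin"),
    by   = "hour"
  ),
  a = runif(96, min = 0, max = 100),
  b = runif(96, min = 1200, max = 30000)
)

# One sd of b per calendar date, as a named numeric vector.
day_key  <- format(df$time, "%Y-%m-%d")
daily_sd <- vapply(split(df$b, day_key), sd, numeric(1))

# Dates on which a exceeds the threshold at least once.
hit <- as.Date(unique(day_key[df$a > 99]))

# Look up the sd for a vector of dates; dates without data yield NA.
sd_for <- function(d) unname(daily_sd[format(d)])

sd_extreme <- data.frame(
  date         = hit,
  sd_b_lagday  = sd_for(hit - 1),
  sd_b_day     = sd_for(hit),
  sd_b_leadday = sd_for(hit + 1)
)
sd_extreme
```

Because the lookup is by date rather than by row offset, gaps in the hourly series or a daylight-saving change do not misalign the previous and next day.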
