Discrepancy between manually computed and ggplot-computed boxplot statistics for outlier removal

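I have several datasets combined into one data frame, and I want to remove the outliers from them. While trying different ways to compute the upper and lower thresholds, I found a discrepancy between the ggplot boxplots and the results of a manual calculation. I would like to (1) understand the discrepancy and (2) find a convenient way to remove outliers from multiple similar datasets with dplyr.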

Four datasets covering the 2x2 combinations of two variables (Var1, SSW) are given below:

library(tidyverse)

values_A30 <- c(0.2079762,0.2605029,0.3054334,0.304067,0.8696487,0.3470931,0.2560001,0.3096838,0.2887556,0.3741472,0.2178375,0.2234682,0.2628923,0.2745458,0.5438208,0.7068278,0.6492924,1.100175,0.2740491,0.2849299,0.7562737,0.3009749,0.2598575,0.3460925,0.3265929,0.4208336,0.4353992,1.132036,0.3856708,0.1978752,0.3676808,0.4196799,0.4486595,0.3282394,0.3725664,0.385373,0.3680049,0.7875058,0.8098903,0.741165,1.260887,0.3521471,0.3883195,1.17124,0.3225514,0.3492051)
values_B30 <- c(0.2598824,0.3147266,0.3876806,0.3740659,0.9880903,0.3491571,0.2852879,0.3659836,0.3562278,0.3574071,0.2793339,0.2765582,0.326236,0.3305683,0.628697,0.7359492,0.6954842,1.139923,0.3106868,0.3187189,0.9236551,0.3218849,0.2722268,0.3102944,0.3590789,0.4290484,0.3649334,1.133538,0.3815261,0.313504,0.4090641,0.4127804,0.4103117,0.3039001,0.3421307,0.3383706,0.3697731,0.6795609,0.8174759,0.730511,1.248585,0.3350673,0.3678199,1.025086,0.3550109,0.2992851)
values_A32 <- c(0.3031411,0.6585525,0.2774704,0.3185133,0.3657107,0.36731,0.2690659,0.3000714,0.2638143,0.3952846,0.260601,0.2873786,0.3522794,0.4528319,0.2959548,0.3085563,0.2821835,0.28403,0.3282855,0.4996997,0.4005206,0.8866824,0.4036912,0.3818493,0.4250281,0.4804805,0.3840721,0.4288454,0.3920388,0.5721854,0.3303645,0.3137673,0.4255052,0.4639104,0.3755455,0.4013699,0.4690261,0.4198166,0.4578243,0.6717564)
values_B32 <- c(0.3597136,0.7568497,0.3340147,0.3257469,0.3921928,0.4232309,0.2661836,0.3098475,0.3049883,0.5052187,0.311451,0.3089702,0.367432,0.5030153,0.3493206,0.3470694,0.3631118,0.3742462,0.4100476,0.5922369,0.3922594,0.7923606,0.385271,0.3919856,0.4243319,0.4642854,0.3340272,0.3854504,0.3563194,0.5574781,0.3542073,0.3310583,0.4260903,0.5463172,0.3810555,0.3576101,0.4161085,0.4094533,0.4390219,0.6388255)

bpdata <- bind_rows(
  data.frame(Var1 = "A", SSW = 30, Value = values_A30),
  data.frame(Var1 = "B", SSW = 30, Value = values_B30),
  data.frame(Var1 = "A", SSW = 32, Value = values_A32),
  data.frame(Var1 = "B", SSW = 32, Value = values_B32)
  ) 

I usually start with a ggplot boxplot to get a visual impression.

# test plot full
ggplot(bpdata, aes(SSW, Value, group = SSW)) + 
  geom_boxplot() + 
  facet_wrap(~ Var1, scales = "free_y") +
  scale_y_continuous(limits = c(0, 1.5),
                     breaks = seq(0, 1.5, by = 0.1))

All four boxplots have some values marked as outliers, which I want to remove. To make the discrepancy easier to discuss, I pick the first combination (Var1 = A, SSW = 30).

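To remove the outliers with dplyr, I need the upper threshold (and, for other data, the lower one) in the data frame, so my first approach was to calculate them manually following the description on the geom_boxplot help page: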

# manual calculation
bpstats_man <- bpdata |>
  filter(Var1 == "A", SSW == 30) |>
  summarise(
    Qu1 = quantile(Value, 0.25),
    Qu3 = quantile(Value, 0.75),
    IQR = IQR(Value)
  ) |>
  mutate(ymin = Qu1 - (1.5 * IQR),
         ymax = Qu3 + (1.5 * IQR))

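However, this result (ymin = -0.05051965 and ymax = 0.8623606) differs considerably from the limits shown in the plot. For a direct comparison, I also extracted the geom_boxplot statistics. As expected, ymin and ymax there match the plot (ymin = 0.1978752 and ymax = 0.8098903).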

# extract stats
bpstats_gg <- ggplot_build(
  ggplot(bpdata |> filter(Var1 == "A", SSW == 30),
         aes(x=SSW, y = Value)) +
    geom_boxplot()
)$data[[1]]

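In the end, I would like to (1) understand why ymin and ymax differ from the manual calculation and (2) find a convenient way to obtain the limits, i.e., either by calculating them manually or by extracting them from the geom_boxplot statistics. I am aiming for an easy-to-follow way to remove outliers from many different sets of Value grouped by Var1 and SSW. I think there might be a way using nest() and unnest() with ggplot_build, but that is still hard for me to grasp (any hints on where to find a good tutorial are appreciated).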

ggplot2 dplyr boxplot outliers
1 Answer

It may be easier to see what is happening if you use a very simple range of values:

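Value <- 0:100
Qu1 <- Value |> quantile(0.25)  # first quartile: 25
Qu3 <- Value |> quantile(0.75)  # third quartile: 75
iqr <- Value |> IQR()           # interquartile range: 50

# Named lower/upper rather than min/max to avoid masking base::min and base::max
lower <- Qu1 - (1.5 * iqr)
upper <- Qu3 + (1.5 * iqr)

cat(lower, upper)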

This gives:

-50 150
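
Both numbers lie outside the data, which only runs from 0 to 100. According to the geom_boxplot help page, a whisker extends to the most extreme observation that is no further than 1.5 * IQR from the hinge, so the plotted ymin and ymax are actual data points inside the fences, not the fences themselves. A minimal sketch of that rule follows; `whisker_limits` is a hypothetical helper written for illustration, not part of ggplot2, and it assumes the default quantile type.

```r
# Hypothetical helper (not a ggplot2 function): returns the 1.5 * IQR fences
# and the whisker ends, i.e. the most extreme data points inside the fences.
whisker_limits <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  lower_fence <- q[1] - coef * iqr
  upper_fence <- q[2] + coef * iqr
  # Keep only the observations that lie within the fences
  inside <- x[x >= lower_fence & x <= upper_fence]
  c(lower_fence = lower_fence, upper_fence = upper_fence,
    ymin = min(inside), ymax = max(inside))
}

# Fences are -50 and 150, but the whiskers stop at the data: 0 and 100
whisker_limits(0:100)

# With one extreme value added, 500 falls outside the upper fence,
# so the upper whisker still ends at 100
whisker_limits(c(0:100, 500))
```

Applied to the question's data, the same reasoning would explain why the manual ymax (a fence) is larger than the ymax reported by ggplot_build (the largest observation below that fence).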
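
For the second part of the question, the manually computed fences should be enough to drop outliers per group, because a point is flagged as an outlier exactly when it falls outside them. The sketch below assumes the default `coef = 1.5` and the default quantile type. The `toy` data frame is made up for illustration and only mimics the column names of `bpdata`.

```r
library(dplyr)

# Toy data standing in for bpdata (same column names, made-up values)
toy <- data.frame(
  Var1  = rep(c("A", "B"), each = 6),
  SSW   = 30,
  Value = c(1, 2, 3, 4, 5, 100,      # 100 lies far above the upper fence
            10, 11, 12, 13, 14, 15)  # no outliers in this group
)

cleaned <- toy |>
  group_by(Var1, SSW) |>
  mutate(
    # Fences are computed separately for each Var1/SSW combination
    lower_fence = quantile(Value, 0.25, names = FALSE) - 1.5 * IQR(Value),
    upper_fence = quantile(Value, 0.75, names = FALSE) + 1.5 * IQR(Value)
  ) |>
  filter(Value >= lower_fence, Value <= upper_fence) |>
  ungroup()

# Rows remaining per group after outlier removal
cleaned |> count(Var1, SSW)
```

Replacing `toy` with `bpdata` should apply the same filter to all four Var1/SSW combinations at once, without going through ggplot_build or nest()/unnest().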