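I am using {ggplot2} to plot the means and confidence intervals of a grouped variable, and I noticed that I get slightly different results depending on whether I compute the confidence intervals by hand or use stat_summary().
Does anyone know what might cause this difference?
Reproducible code is below (note: the numeric variable x is deliberately skewed to mimic my actual dataset, and I wonder whether that is what causes the problem).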
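# Packages used below: dplyr for %>%, group_by() and summarise(); ggplot2 for plotting
library(dplyr)
library(ggplot2)
# Generate data
# Number of observations per group
n_per_group <- 50
# Generate skewed data (gamma distributions are right-skewed)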
group1 <- rgamma(n_per_group, shape = 2, scale = 1)
group2 <- rgamma(n_per_group, shape = 3, scale = 1.5)
group3 <- rgamma(n_per_group, shape = 4, scale = 2)
# Combine data into a single data frame
df <- data.frame(
  y = rep(c("Group 1", "Group 2", "Group 3"), each = n_per_group),
  x = c(group1, group2, group3)
)
# Using stat_summary()
df %>%
  ggplot(., aes(x = x, y = y, group = y)) +
  stat_summary(fun = "mean", geom = "point") +
  stat_summary(fun.data = "mean_se",
               geom = "errorbar",
               width = 0.1) +
  scale_x_continuous(breaks = seq(1, 10, by = 0.5))
# By hand
df %>%
  group_by(y) %>%
  summarise(mean = mean(x, na.rm = TRUE),
            std.dev = sd(x, na.rm = TRUE),
            n = n(),
            se = std.dev / sqrt(n)) %>%
  ggplot(., aes(y = y)) +
  geom_errorbar(aes(xmin = mean - 1.96 * se,
                    xmax = mean + 1.96 * se),
                width = 0.1) +
  geom_point(aes(x = mean)) +
  scale_x_continuous(breaks = seq(1, 10, by = 0.5))
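By default, ggplot2's mean_se() function extends the error bars 1 standard error on each side of the mean, not 1.96 standard errors as in your by-hand calculation. You can make the two plots match like this: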
df %>%
  ggplot(., aes(x = x, y = y, group = y)) +
  stat_summary(fun = "mean", geom = "point") +
  stat_summary(fun.data = \(x) mean_se(x, mult = 1.96),
               geom = "errorbar",
               width = 0.1) +
  scale_x_continuous(breaks = seq(1, 10, by = 0.5))
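The \(x) shorthand above needs R 4.1 or later. As an alternative sketch, the multiplier can be passed through the fun.args argument of stat_summary() instead of an anonymous function. The data frame below is regenerated only so that the snippet runs on its own, and the seed value is an arbitrary choice that is not part of the original question.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, chosen only so this sketch is repeatable
n_per_group <- 50
df <- data.frame(
  y = rep(c("Group 1", "Group 2", "Group 3"), each = n_per_group),
  x = c(rgamma(n_per_group, shape = 2, scale = 1),
        rgamma(n_per_group, shape = 3, scale = 1.5),
        rgamma(n_per_group, shape = 4, scale = 2))
)

# fun.args is forwarded to mean_se(), so mult = 1.96 replaces the default of 1
p <- ggplot(df, aes(x = x, y = y, group = y)) +
  stat_summary(fun = "mean", geom = "point") +
  stat_summary(fun.data = mean_se,
               fun.args = list(mult = 1.96),
               geom = "errorbar",
               width = 0.1)
p
```

This should draw the same error bars as the version with the anonymous function.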
You can type mean_se in the console to see its definition:
# mean_se
function (x, mult = 1)
{
    x <- stats::na.omit(x)
    se <- mult * sqrt(stats::var(x)/length(x))
    mean <- mean(x)
    data_frame0(y = mean, ymin = mean - se, ymax = mean + se,
        .size = 1)
}
# <bytecode: 0x10bd2b7d8>
# <environment: namespace:ggplot2>
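To check numerically that the multiplier is the whole story, the two calculations can be compared on a single sample. This is a minimal sketch: the sample, the seed, and the variable names by_hand, default_se and scaled_se are illustrative and not from the original post.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed, only so the sketch is repeatable
x <- rgamma(50, shape = 2, scale = 1)

# By-hand interval, computed the same way as in the question
se <- sd(x) / sqrt(length(x))
by_hand <- c(mean(x) - 1.96 * se, mean(x) + 1.96 * se)

# mean_se() with its default multiplier (1) and with mult = 1.96
default_se <- mean_se(x)
scaled_se <- mean_se(x, mult = 1.96)

# The scaled bounds should agree with the by-hand bounds
bounds_match <- isTRUE(all.equal(by_hand, c(scaled_se$ymin, scaled_se$ymax)))
bounds_match

# Ratio of the two interval widths; it should equal the multiplier
width_ratio <- (scaled_se$ymax - scaled_se$ymin) /
  (default_se$ymax - default_se$ymin)
width_ratio
```

If the bounds match and the width ratio comes out at 1.96, the skew in the data plays no role in the discrepancy: both approaches use the same mean and the same standard error and differ only in the multiplier.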