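I am using stby from summarytools to compute weighted descriptive statistics by group. However, when I do this, I get different answers than when I first filter on the grouping variable and then apply the descr function from summarytools. See below: mydf is my unfiltered data frame, and score is a 0-10 variable whose mean I want.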
##when I filter first and split my df
filtered_male <- mydf %>% filter(gender == 1)
with(filtered_male, stby(score, gender, descr, weights = weight))
Weighted Descriptive Statistics
score by gender
Data Frame: filtered_male
Weights: weight
N: 838
1
--------------- ------------
Mean 6.86
Std.Dev 2.93
Min 0.00
Median 8.00
Max 10.00
MAD 2.97
CV 0.43
N.Valid 1509584.07
Pct.Valid 99.70
##when I don't split my df
with(mydf, stby(score, gender, descr, weights = weight, simplify = TRUE))
Weighted Descriptive Statistics
score by gender
Data Frame: mydf
Weights: weight
N: 838
1 2
--------------- ------------ ------------
Mean 7.01 6.79
Std.Dev 2.81 3.02
Min 0.00 0.00
Median 8.00 8.00
Max 10.00 10.00
MAD 2.97 2.97
CV 0.40 0.45
N.Valid 1715494.12 1379339.65
Pct.Valid 56.05 45.07
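Any ideas about why this happens, or how I can fix it to get the correct weighted means? (I have checked the answers by hand, and the mean I get when filtering first is the correct one.)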
Until an official fix is available, you can use the following workaround to produce a valid stby object:
### Packages
library(dplyr)
library(purrr)
library(summarytools)
### Data
data(mtcars)
### Output with summarytools (the weighted values in this object get overwritten below)
st <- with(mtcars, stby(qsec, cyl, descr, weights = wt, simplify = TRUE))
### Fix the output with corrected values
mtcars %>%
  group_by(cyl) %>%
  # compute the weighted statistics separately within each group
  group_map(~ descr(.x$qsec, weights = .x$wt)) %>%
  # overwrite each group's table in st with the per-group result
  walk2(.y = seq_along(.), function(x, y) { st[[y]][, ] <<- x[, ] })
### Bonus, add missing N number for each group
attributes(st[[1]])$data_info$N.Obs <- paste(
  map_int(seq_along(st), ~ attributes(st[[.x]])$data_info$N.Obs),
  collapse = ","
)
Output:
Weighted Descriptive Statistics
qsec by cyl
Data Frame: mtcars
Weights: wt
N: 11,7,14
4 6 8
--------------- -------- -------- --------
Mean 19.38 18.12 16.89
Std.Dev 1.72 1.59 1.13
Min 16.70 15.50 14.50
Median 19.24 18.46 17.34
Max 22.90 20.22 18.00
MAD 1.09 2.00 0.71
CV 0.09 0.09 0.07
N.Valid 25.14 21.82 55.99
Pct.Valid 100.00 100.00 100.00
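If you want to check the weighted means independently of summarytools, here is a minimal sketch in base R. It assumes the same mtcars example as above (qsec weighted by wt, grouped by cyl); the names groups, wmeans and wsums are only illustrative. The per-group weighted means and weight totals it prints can be compared against the Mean and N.Valid rows of the table.

```r
# Cross-check of per-group weighted statistics using base R only.
data(mtcars)

# Split the data frame into one piece per value of cyl.
groups <- split(mtcars, mtcars$cyl)

# Weighted mean of qsec within each group, using wt as the weights.
wmeans <- sapply(groups, function(g) weighted.mean(g$qsec, g$wt))

# Sum of the weights within each group (comparable to N.Valid for weighted output).
wsums <- sapply(groups, function(g) sum(g$wt))

print(round(wmeans, 2))
print(round(wsums, 2))
```

The same split() and weighted.mean() pattern should carry over to the original data by substituting mydf, score, gender and weight, assuming there are no missing weights.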