Why do the following two grouping approaches produce different samples? I assumed both would return the same sample.
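library(dplyr)

small <- data.frame(
  id = 1:100,
  gender = rep(c('male', 'female'))
)

# Approach 1: persistent grouping with group_by()
set.seed(123)
small |>
  group_by(gender) |>
  slice_sample(n = 10, replace = FALSE)

# Approach 2: per-operation grouping with the by argument
set.seed(123)
small |>
  slice_sample(n = 10, replace = FALSE, by = gender)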
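Basically, when you use .by (spelled by in slice_sample()), the groups are kept in order of first appearance, whereas when you use group_by(), the groups are sorted. Since 'male' appears before 'female' in the data but sorts after it, the two calls go through the groups in a different order, so the same seed gives each group different random draws. That explains the difference in results.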
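To make the effect of group order concrete, here is a minimal base R sketch. It only mimics the idea and is not dplyr's internal code; sample_by_group() is a hypothetical helper written for this illustration. It draws ids group by group after a single set.seed() call, once with the groups sorted and once in order of first appearance.

```r
# Illustration only: mimics the idea, not dplyr's actual implementation.
small <- data.frame(
  id = 1:100,
  gender = rep(c("male", "female"), times = 50)
)

# Hypothetical helper: set the seed once, then draw n ids per group,
# visiting the groups in the order given by group_order.
sample_by_group <- function(data, group_order, n = 10, seed = 123) {
  set.seed(seed)
  draws <- lapply(group_order, function(g) {
    sample(data$id[data$gender == g], size = n)
  })
  names(draws) <- group_order
  draws
}

sorted_order <- sort(unique(small$gender))  # "female" first, as with group_by()
first_seen_order <- unique(small$gender)    # "male" first, as with by

by_sorted <- sample_by_group(small, sorted_order)
by_first_seen <- sample_by_group(small, first_seen_order)

# Whichever group is visited first consumes the first random draws,
# so the same within-group positions are picked for a different group.
by_sorted$female
by_first_seen$male
```

In this sketch the first group visited always receives the same within-group positions, so the ids drawn for 'female' under the sorted order sit at the same positions as the ids drawn for 'male' under the first-appearance order.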
My package timeplyr actually has arguments to control this behaviour.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
small <- data.frame(
  id = 1:100,
  gender = rep(c('male', 'female'))
)
set.seed(123)
res1 <- small |>
  group_by(gender) |>
  slice_sample(n = 10, replace = FALSE)
set.seed(123)
res2 <- small |>
  slice_sample(n = 10, replace = FALSE, by = gender)
library(timeplyr)
#>
#> Attaching package: 'timeplyr'
#> The following object is masked from 'package:dplyr':
#>
#> desc
res3 <- small |>
  fslice_sample(n = 10, replace = FALSE, .by = gender, seed = 123, sort_groups = TRUE)
res4 <- small |>
  fslice_sample(n = 10, replace = FALSE, .by = gender, seed = 123, sort_groups = FALSE)
identical(as.data.frame(res1), res3)
#> [1] TRUE
identical(as.data.frame(res2), res4)
#> [1] TRUE
Created on 2024-08-01 with reprex v2.0.2