slice_sample produces different samples on grouped .data


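Why do the following two grouping methods produce different samples? I assumed both would return the same sample, since they use the same seed and the same grouping variable.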

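library(dplyr)

# 100 rows; data.frame() recycles the two labels, so 'male' and 'female' alternate
small <- data.frame(
  id = 1:100,
  gender = rep(c('male', 'female'))
)

# Method 1: persistent grouping with group_by()
set.seed(123)
small |>
  group_by(gender) |>
  slice_sample(n = 10, replace = FALSE)

# Method 2: per-operation grouping with the by argument
set.seed(123)
small |>
  slice_sample(n = 10, replace = FALSE, by = gender)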

Tags: r, dplyr, tidyverse, sampling

1 Answer

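Basically, when you use .by (spelled by in the slice_*() functions), groups are ordered by first appearance, whereas when you use group_by(), groups are sorted. Since 'male' appears before 'female' in the data but 'female' sorts first, the two calls visit the groups in opposite orders. With the same seed the random draws are consumed in the same sequence, so they end up assigned to different groups in each call, which explains the difference in results.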
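The ordering difference can be checked directly, without any sampling. The following is a minimal sketch, assuming dplyr 1.1.0 or later (needed for .by in summarise()); the variable names sorted_keys and appearance_keys are only illustrative.

```r
library(dplyr)

small <- data.frame(
  id = 1:100,
  gender = rep(c("male", "female"), times = 50)
)

# group_by() stores the group keys in sorted order
sorted_keys <- small |>
  group_by(gender) |>
  group_keys() |>
  pull(gender)

# .by keeps the groups in order of first appearance in the data
appearance_keys <- small |>
  summarise(n = n(), .by = gender) |>
  pull(gender)

sorted_keys      # "female" "male"
appearance_keys  # "male" "female"
```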

My package timeplyr actually has arguments to control this behavior.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
small <- data.frame(
  id = 1:100,
  gender = rep(c('male', 'female'))
)

set.seed(123)
res1 <- small |> 
  group_by(gender) |> 
  slice_sample(n = 10, replace = F)

set.seed(123)
res2 <- small |> 
  slice_sample(n = 10, replace = F, by = gender)

library(timeplyr)
#> 
#> Attaching package: 'timeplyr'
#> The following object is masked from 'package:dplyr':
#> 
#>     desc

res3 <- small |> 
  fslice_sample(n = 10, replace = F, .by = gender, seed = 123, sort_groups = TRUE)
res4 <- small |> 
  fslice_sample(n = 10, replace = F, .by = gender, seed = 123, sort_groups = FALSE)

identical(as.data.frame(res1), res3)
#> [1] TRUE
identical(as.data.frame(res2), res4)
#> [1] TRUE

Created on 2024-08-01 with reprex v2.0.2
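If you would rather stay within dplyr, one possible workaround is to sort the data by the grouping column first, so that order of first appearance coincides with sorted order. This is only a sketch: it assumes dplyr 1.1.0 or later and that slice_sample() draws group by group in key order, which I have not checked across versions. The names grouped and by_sorted are illustrative.

```r
library(dplyr)

small <- data.frame(
  id = 1:100,
  gender = rep(c("male", "female"), times = 50)
)

# Reference result: group_by() visits the groups in sorted key order
set.seed(123)
grouped <- small |>
  group_by(gender) |>
  slice_sample(n = 10, replace = FALSE) |>
  ungroup()

# Sorting by gender first makes 'female' the first group to appear,
# so by = gender should visit the groups in the same order as above
set.seed(123)
by_sorted <- small |>
  arrange(gender) |>
  slice_sample(n = 10, replace = FALSE, by = gender)

# Compare the sampled ids of the two approaches
identical(grouped$id, by_sorted$id)
```

If the comparison returns FALSE on your dplyr version, the assumption about draw order does not hold there, and an explicit option such as timeplyr's sort_groups is the safer route.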
