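I can't find an existing method, or work out how to write new code, for stratified sampling from the sampling framework in a dataset against two different population benchmark distributions. Since I'm not 100% sure I'm using the right terminology, I'll explain more concretely with a simplified example: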
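I have a dataset of panel members whose gender and education level I know: the sampling framework. I want to draw a sample from it using stratified sampling. I know the population distribution of gender and the population distribution of education, but not their joint distribution (and I'm not willing to assume that education is distributed identically across genders). With stratified sampling, I want to end up with a sample that is (roughly) representative on both benchmarks.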
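To make the "unknown joint distribution" point concrete, here is a small sketch of how much freedom two margins leave. It assumes a sample of n = 300 with the 50/50 gender and 40/30/30 education benchmarks used in this question; row_tot, col_tot, lower and upper are illustrative names. Each cell count can lie anywhere between max(0, row total + column total - n) and min(row total, column total), so many different joint tables agree with both margins.

```r
# Bounds on each gender x education cell when only the margins are known.
n <- 300
row_tot <- c(M = 150, F = 150)                               # gender targets
col_tot <- c("1. Low" = 120, "2. Mid" = 90, "3. High" = 90)  # education targets

# smallest count a cell can have without breaking either margin
lower <- pmax(outer(row_tot, col_tot, "+") - n, 0)
# largest count a cell can have: it cannot exceed its row or column total
upper <- outer(row_tot, col_tot, pmin)

print(lower)
print(upper)
```

With these numbers every lower bound is 0, so the margins alone say very little about any single cell.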
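The code below shows how I sample on a single distribution (gender). I know the sampling package has functions that simplify stratified sampling, but as far as I understand they cannot stratify on two marginal distributions.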
library(dplyr)

N = 1000 # framework size
n = 300 # sample size

# create sampling framework
framework = data.frame(id = seq_len(N),
                       gender = sample(c("M","F"), N, replace = TRUE, prob = c(0.3, 0.7)),
                       education = sample(c("1. Low", "2. Mid", "3. High"), N, replace = TRUE, prob = c(0.2, 0.3, 0.5)))

# create population benchmarks
pop_gender = data.frame(gender = c("M", "F"),
                        prop = c(0.5, 0.5))
pop_education = data.frame(education = c("1. Low", "2. Mid", "3. High"),
                           prop = c(0.4, 0.3, 0.3))

# loop through strata (in this case just M/F) and select sample
selected = c() # empty selection vector (starting from NA would carry an NA into the ids)
for(i in pop_gender$gender){
  # subset framework to stratum
  framework_sel = framework %>%
    filter(gender == i)
  # select sample from stratum
  selected_i = sample(framework_sel$id, # sample from ids
                      n*pop_gender$prop[pop_gender$gender == i], # sample size within stratum
                      replace = FALSE)
  selected = c(selected, selected_i)
}

# pull sample from framework
sample = framework %>%
  filter(id %in% selected)

# compare sample to population
prop.table(table(sample$gender))
prop.table(table(sample$education))
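To be clear: I want to end up with a sample that matches the population on both gender and education.
I'd appreciate any insights!
A further issue, which I left out of this simplified example, is that the framework may well not have enough people in some strata to reach the intended stratum sample size.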
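As a rough way to see whether that issue bites, the sketch below compares the intended size of each marginal stratum with the number of people the framework actually holds. It rebuilds a framework like the one in the question; the seed, the helper check_margin and the column names wanted, available and shortfall are all hypothetical. Passing this check is necessary but not sufficient, since an individual gender by education cell can still run short.

```r
set.seed(1)  # hypothetical seed, only so that this sketch is reproducible
N <- 1000    # framework size
n <- 300     # sample size
framework <- data.frame(
  id = seq_len(N),
  gender = sample(c("M", "F"), N, replace = TRUE, prob = c(0.3, 0.7)),
  education = sample(c("1. Low", "2. Mid", "3. High"), N,
                     replace = TRUE, prob = c(0.2, 0.3, 0.5))
)

# population benchmarks as named vectors, so strata are matched by label
pop_gender <- c(M = 0.5, F = 0.5)
pop_education <- c("1. Low" = 0.4, "2. Mid" = 0.3, "3. High" = 0.3)

# hypothetical helper: intended vs. available counts for one margin
check_margin <- function(values, target_prop, n) {
  available <- as.integer(table(factor(values, levels = names(target_prop))))
  wanted <- as.integer(round(n * target_prop))
  data.frame(stratum = names(target_prop),
             wanted = wanted,
             available = available,
             shortfall = pmax(wanted - available, 0L))
}

check_gender <- check_margin(framework$gender, pop_gender, n)
check_education <- check_margin(framework$education, pop_education, n)
print(check_gender)
print(check_education)
```

A non-zero shortfall means the intended size for that stratum cannot be reached from this framework, whatever the sampling scheme.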
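The code below samples from framework, stratified by gender and education. The stratum sample sizes come from a random two-way table whose margins are given by the probabilities in pop_gender and pop_education. But even so, the final result is still random and is not exactly the wanted result.

N <- 1000 # framework size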
n <- 300 # sample size

# create population benchmarks
pop_gender = data.frame(gender = c("M", "F"),
                        prop = c(0.5, 0.5))
pop_education = data.frame(education = c("1. Low", "2. Mid", "3. High"),
                           prop = c(0.4, 0.3, 0.3))

library(sampling)

# make results reproducible, the code below
# uses randomness twice, once in the call to
# stats::r2dtable and again in the call
# to sampling::strata
set.seed(2023)

# total to sample by gender and education
marg_gender <- pop_gender$prop * n
marg_education <- pop_education$prop * n

# account for the possibility that the framework does not
# have sufficient people in some strata to be sampled
# to the intended stratum sample size.
tbl <- table(framework[-1])
# note: table() sorts the gender levels as F, M while pop_gender lists M first;
# the two props are equal here, so the mismatch is harmless in this example
marg_gender <- pmin(marg_gender, rowSums(tbl))
marg_education <- pmin(marg_education, colSums(tbl))

# Random 2-way table with given marginals
sample_sizes <- r2dtable(1L, marg_gender, marg_education) |> unlist()

# stratified sampling without replacement
# note: size is matched to the strata by position, so its order has to agree
# with the order in which strata() enumerates the strata
s <- strata(framework, c("gender", "education"), size = sample_sizes, method = "srswor")
# extract the sampled rows to a new data.frame
sample2 <- getdata(framework, s)

# see the results, the final proportions are
# not exactly the wanted proportions
# first gender
cbind(
  wanted = pop_gender$prop,
  prop = prop.table(table(sample2$gender)) |> round(2)
)
#>   wanted prop
#> F    0.5 0.55
#> M    0.5 0.45

# and education
cbind(
  wanted = pop_education$prop,
  prop = prop.table(table(sample2$education)) |> round(2)
)
#>         wanted prop
#> 1. Low     0.4 0.30
#> 2. Mid     0.3 0.36
#> 3. High    0.3 0.35

head(sample2)
#>      id gender education ID_unit     Prob Stratum
#> 10   10      F    2. Mid      10 0.328125       1
#> 101 101      F    2. Mid     101 0.328125       1
#> 116 116      F    2. Mid     116 0.328125       1
#> 120 120      F    2. Mid     120 0.328125       1
#> 138 138      F    2. Mid     138 0.328125       1
#> 146 146      F    2. Mid     146 0.328125       1
Created on 2023-11-18.

Data

The data frame framework is created after a call to set.seed, which makes the results reproducible.

N <- 1000 # framework size
n <- 300 # sample size
prob_gender <- c(0.3, 0.7)
prob_education <- c(0.2, 0.3, 0.5)

# create sampling framework
set.seed(1)
framework = data.frame(id = seq_len(N),
                       gender = sample(c("M","F"), N, replace = TRUE, prob = prob_gender),
                       education = sample(c("1. Low", "2. Mid", "3. High"), N,
                                          replace = TRUE, prob = prob_education))
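One thing worth checking in this kind of approach is that each stratum size ends up attached to the stratum it was drawn for. As a hedged illustration, the sketch below redoes the same idea (a random two-way table of sizes from r2dtable, then simple random sampling without replacement within each cell) in base R only, looking cells up by label rather than by position. The names target_gender, target_education, available, sizes, take and sample3 are illustrative and not part of the original answer, and capping at the available count is just one possible way to handle cells that are too small.

```r
set.seed(1)
N <- 1000  # framework size
n <- 300   # sample size
framework <- data.frame(
  id = seq_len(N),
  gender = sample(c("M", "F"), N, replace = TRUE, prob = c(0.3, 0.7)),
  education = sample(c("1. Low", "2. Mid", "3. High"), N,
                     replace = TRUE, prob = c(0.2, 0.3, 0.5))
)

# target counts per margin, named so that cells can be matched by label
target_gender <- round(c(M = 0.5, F = 0.5) * n)
target_education <- round(c("1. Low" = 0.4, "2. Mid" = 0.3, "3. High" = 0.3) * n)

# people available per cell, with levels in the same order as the targets
available <- unclass(table(
  gender = factor(framework$gender, levels = names(target_gender)),
  education = factor(framework$education, levels = names(target_education))
))

# random 2-way table whose row and column sums equal the targets
set.seed(2023)
sizes <- r2dtable(1L, target_gender, target_education)[[1]]
dimnames(sizes) <- dimnames(available)

# assumption of this sketch: never ask a cell for more people than it holds
take <- pmin(sizes, available)

picked <- integer(0)
for (g in rownames(take)) {
  for (e in colnames(take)) {
    ids <- framework$id[framework$gender == g & framework$education == e]
    # index into ids instead of calling sample(ids, k), because sample()
    # treats a single number x as the range 1:x
    picked <- c(picked, ids[sample.int(length(ids), take[g, e])])
  }
}
sample3 <- framework[framework$id %in% picked, ]

# achieved counts next to the targets
print(rbind(target = target_gender,
            got = table(sample3$gender)[names(target_gender)]))
print(rbind(target = target_education,
            got = table(sample3$education)[names(target_education)]))
```

When no cell is capped, both margins are met exactly. When a cell is capped, the sample falls short in that row and column, and one option is to redraw the table of sizes until every cell is feasible.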