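I have a dataset with two groups of people: vaccinated and unvaccinated. In the vaccinated group, each row represents a unique ID with its own unique T0. In the unvaccinated group, each ID may appear in multiple rows, each associated with a different T0 (long format). Every row contains three variables: the number of PCP visits, specialty visits, and lab visits.
My goal is to sample from this dataset so that the resulting data has exactly one row per unique ID, and the mean numbers of PCP, specialty, and lab visits are similar between the vaccinated and unvaccinated groups. How can I achieve this in R? I think it will involve at least some stratified sampling of the unvaccinated group, since each person there can have multiple records.
Here is some R code to create example data: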
set.seed(123)
data <- data.frame(
ID = c(1:10, rep(11:20, 4)), # Vaccinated IDs are unique, Unvaccinated have repeats
group = c(rep("vaccinated", 10), rep("unvaccinated", 40)),
T0 = rep(1:10, 5),
PCP_visits = sample(0:10, 50, replace = TRUE),
specialty_visits = sample(0:5, 50, replace = TRUE),
lab_visits = sample(0:8, 50, replace = TRUE)
)
ID group T0 PCP_visits specialty_visits lab_visits
1 1 vaccinated 1 2 0 3
2 2 vaccinated 2 2 5 0
3 3 vaccinated 3 9 4 5
4 4 vaccinated 4 1 0 2
5 5 vaccinated 5 5 1 7
6 6 vaccinated 6 10 3 2
7 7 vaccinated 7 4 3 7
8 8 vaccinated 8 3 5 0
9 9 vaccinated 9 5 5 6
10 10 vaccinated 10 8 2 6
11 11 unvaccinated 1 9 5 6
12 12 unvaccinated 2 10 5 5
13 13 unvaccinated 3 4 0 6
14 14 unvaccinated 4 2 5 4
15 15 unvaccinated 5 10 1 5
16 16 unvaccinated 6 8 0 7
17 17 unvaccinated 7 8 1 4
18 18 unvaccinated 8 8 3 6
19 19 unvaccinated 9 2 4 3
20 20 unvaccinated 10 7 4 2
21 11 unvaccinated 1 9 5 8
22 12 unvaccinated 2 6 2 6
23 13 unvaccinated 3 9 0 5
24 14 unvaccinated 4 8 3 8
25 15 unvaccinated 5 2 5 6
26 16 unvaccinated 6 3 0 1
27 17 unvaccinated 7 0 5 2
28 18 unvaccinated 8 10 0 7
29 19 unvaccinated 9 6 2 3
30 20 unvaccinated 10 4 5 6
31 11 unvaccinated 1 9 3 3
32 12 unvaccinated 2 6 0 0
33 13 unvaccinated 3 8 5 7
34 14 unvaccinated 4 8 5 3
35 15 unvaccinated 5 9 2 8
36 16 unvaccinated 6 6 5 7
37 17 unvaccinated 7 10 4 5
38 18 unvaccinated 8 4 2 3
39 19 unvaccinated 9 6 5 7
40 20 unvaccinated 10 4 1 2
41 11 unvaccinated 1 10 4 3
42 12 unvaccinated 2 5 4 3
43 13 unvaccinated 3 8 2 5
44 14 unvaccinated 4 1 1 0
45 15 unvaccinated 5 4 1 3
46 16 unvaccinated 6 7 1 8
47 17 unvaccinated 7 1 3 6
48 18 unvaccinated 8 0 1 7
49 19 unvaccinated 9 8 1 4
50 20 unvaccinated 10 10 5 1
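With package sampling, function strata, you can sample one row per ID. As for the means you ask for: because the row kept for each unvaccinated ID is drawn at random, the sampled statistics should match those of the full data on average. The two groups will therefore have similar means whenever they come from the same distribution, as they do in the example data, but any single draw can still differ by chance.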
set.seed(123)
data <- data.frame(
ID = c(1:10, rep(11:20, 4)), # Vaccinated IDs are unique, Unvaccinated have repeats
group = c(rep("vaccinated", 10), rep("unvaccinated", 40)),
T0 = rep(1:10, 5),
PCP_visits = sample(0:10, 50, replace = TRUE),
specialty_visits = sample(0:5, 50, replace = TRUE),
lab_visits = sample(0:8, 50, replace = TRUE)
)
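library(sampling)
# One sampled row per stratum: 20 strata (each ID within its group), each of size 1
sizes <- matrix(1L, nrow = 10L, ncol = 2L)
# Stratify by ID and group; with size 1, sampling with or without replacement is equivalent
i_strata <- strata(data, c("ID", "group"), size = sizes, method = "srswr")
# ID_unit holds the row numbers of the selected units in the original data
data[i_strata$ID_unit, ]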
#> ID group T0 PCP_visits specialty_visits lab_visits
#> 1 1 vaccinated 1 2 0 3
#> 2 2 vaccinated 2 2 5 0
#> 3 3 vaccinated 3 9 4 5
#> 4 4 vaccinated 4 1 0 2
#> 5 5 vaccinated 5 5 1 7
#> 6 6 vaccinated 6 10 3 2
#> 7 7 vaccinated 7 4 3 7
#> 8 8 vaccinated 8 3 5 0
#> 9 9 vaccinated 9 5 5 6
#> 10 10 vaccinated 10 8 2 6
#> 11 11 unvaccinated 1 9 5 6
#> 32 12 unvaccinated 2 6 0 0
#> 43 13 unvaccinated 3 8 2 5
#> 24 14 unvaccinated 4 8 3 8
#> 25 15 unvaccinated 5 2 5 6
#> 46 16 unvaccinated 6 7 1 8
#> 37 17 unvaccinated 7 10 4 5
#> 28 18 unvaccinated 8 10 0 7
#> 19 19 unvaccinated 9 2 4 3
#> 50 20 unvaccinated 10 10 5 1
Created on 2025-01-02 with reprex v2.1.1
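To check how close the group means actually are after sampling, something like the following base R sketch may help. It is not part of the answer above: the helpers `sample_one_per_id` and `mean_gap` are hypothetical names introduced here, and the "repeat the draw and keep the best-balanced one" step is an added heuristic, assuming that balance is measured as the sum of absolute differences between the group means. Note that picking the best of many draws is no longer a purely random sample.

```r
set.seed(123)
data <- data.frame(
  ID = c(1:10, rep(11:20, 4)),
  group = c(rep("vaccinated", 10), rep("unvaccinated", 40)),
  T0 = rep(1:10, 5),
  PCP_visits = sample(0:10, 50, replace = TRUE),
  specialty_visits = sample(0:5, 50, replace = TRUE),
  lab_visits = sample(0:8, 50, replace = TRUE)
)

visit_cols <- c("PCP_visits", "specialty_visits", "lab_visits")

# Hypothetical helper: keep one randomly chosen row per ID.
# sample.int() is used so that IDs with a single row are handled safely.
sample_one_per_id <- function(d) {
  rows <- split(seq_len(nrow(d)), d$ID)
  idx <- vapply(rows, function(r) r[sample.int(length(r), 1L)], integer(1))
  d[sort(unname(idx)), ]
}

# Hypothetical helper: sum of absolute differences between the two group means
# (an assumed balance criterion, chosen only for illustration).
mean_gap <- function(d) {
  m <- aggregate(d[visit_cols], by = list(group = d$group), FUN = mean)
  sum(abs(unlist(m[1, visit_cols]) - unlist(m[2, visit_cols])))
}

# A single random draw and its group means
one_draw <- sample_one_per_id(data)
aggregate(one_draw[visit_cols], by = list(group = one_draw$group), FUN = mean)

# Optional heuristic: repeat the draw and keep the best-balanced sample
draws <- replicate(200, sample_one_per_id(data), simplify = FALSE)
gaps <- vapply(draws, mean_gap, numeric(1))
best <- draws[[which.min(gaps)]]
aggregate(best[visit_cols], by = list(group = best$group), FUN = mean)
```

If the groups differ systematically in real data, random selection alone will not make their means similar; in that case a matching or weighting approach would be the more appropriate tool, which is beyond this sketch.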