Stratified sampling based on multiple predictors in R


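I have a dataset containing two groups of people: vaccinated and unvaccinated. In the vaccinated group, each row represents a unique ID with a corresponding unique T0. In the unvaccinated group, each ID may appear in multiple rows, each associated with a different T0 (long format). Each row contains three variables: the number of PCP visits, the number of specialty visits, and the number of lab visits.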

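My goal is to sample from this dataset so that the resulting data has one row per unique ID, and the means of PCP visits, specialty visits, and lab visits are similar between the vaccinated and unvaccinated groups. How can I achieve this in R? I think it will involve at least some stratified sampling of the unvaccinated group, since each person can have multiple records.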

Here is some R code to create example data:

set.seed(123)
data <- data.frame(
  ID = c(1:10, rep(11:20, 4)), # Vaccinated IDs are unique, Unvaccinated have repeats
  group = c(rep("vaccinated", 10), rep("unvaccinated", 40)),
  T0 = rep(1:10, 5),
  PCP_visits = sample(0:10, 50, replace = TRUE),
  specialty_visits = sample(0:5, 50, replace = TRUE),
  lab_visits = sample(0:8, 50, replace = TRUE)
)
   ID        group T0 PCP_visits specialty_visits lab_visits
1   1   vaccinated  1          2                0          3
2   2   vaccinated  2          2                5          0
3   3   vaccinated  3          9                4          5
4   4   vaccinated  4          1                0          2
5   5   vaccinated  5          5                1          7
6   6   vaccinated  6         10                3          2
7   7   vaccinated  7          4                3          7
8   8   vaccinated  8          3                5          0
9   9   vaccinated  9          5                5          6
10 10   vaccinated 10          8                2          6
11 11 unvaccinated  1          9                5          6
12 12 unvaccinated  2         10                5          5
13 13 unvaccinated  3          4                0          6
14 14 unvaccinated  4          2                5          4
15 15 unvaccinated  5         10                1          5
16 16 unvaccinated  6          8                0          7
17 17 unvaccinated  7          8                1          4
18 18 unvaccinated  8          8                3          6
19 19 unvaccinated  9          2                4          3
20 20 unvaccinated 10          7                4          2
21 11 unvaccinated  1          9                5          8
22 12 unvaccinated  2          6                2          6
23 13 unvaccinated  3          9                0          5
24 14 unvaccinated  4          8                3          8
25 15 unvaccinated  5          2                5          6
26 16 unvaccinated  6          3                0          1
27 17 unvaccinated  7          0                5          2
28 18 unvaccinated  8         10                0          7
29 19 unvaccinated  9          6                2          3
30 20 unvaccinated 10          4                5          6
31 11 unvaccinated  1          9                3          3
32 12 unvaccinated  2          6                0          0
33 13 unvaccinated  3          8                5          7
34 14 unvaccinated  4          8                5          3
35 15 unvaccinated  5          9                2          8
36 16 unvaccinated  6          6                5          7
37 17 unvaccinated  7         10                4          5
38 18 unvaccinated  8          4                2          3
39 19 unvaccinated  9          6                5          7
40 20 unvaccinated 10          4                1          2
41 11 unvaccinated  1         10                4          3
42 12 unvaccinated  2          5                4          3
43 13 unvaccinated  3          8                2          5
44 14 unvaccinated  4          1                1          0
45 15 unvaccinated  5          4                1          3
46 16 unvaccinated  6          7                1          8
47 17 unvaccinated  7          1                3          6
48 18 unvaccinated  8          0                1          7
49 19 unvaccinated  9          8                1          4
50 20 unvaccinated 10         10                5          1
r sampling matchit
1 Answer

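You can use the function strata from the package sampling to sample one row per ID. As for the means you ask about, since the sample is random, the statistics should be the same on average.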

set.seed(123)
data <- data.frame(
  ID = c(1:10, rep(11:20, 4)), # Vaccinated IDs are unique, Unvaccinated have repeats
  group = c(rep("vaccinated", 10), rep("unvaccinated", 40)),
  T0 = rep(1:10, 5),
  PCP_visits = sample(0:10, 50, replace = TRUE),
  specialty_visits = sample(0:5, 50, replace = TRUE),
  lab_visits = sample(0:8, 50, replace = TRUE)
)

library(sampling)

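# One draw per stratum: the 20 non-empty (ID, group) combinations each get size 1
sizes <- matrix(1L, nrow = 10L, ncol = 2L)
# ID_unit in the result holds the row numbers of the selected units in data
i_strata <- strata(data, c("ID", "group"), size = sizes, method = "srswr")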

data[i_strata$ID_unit, ]
#>    ID        group T0 PCP_visits specialty_visits lab_visits
#> 1   1   vaccinated  1          2                0          3
#> 2   2   vaccinated  2          2                5          0
#> 3   3   vaccinated  3          9                4          5
#> 4   4   vaccinated  4          1                0          2
#> 5   5   vaccinated  5          5                1          7
#> 6   6   vaccinated  6         10                3          2
#> 7   7   vaccinated  7          4                3          7
#> 8   8   vaccinated  8          3                5          0
#> 9   9   vaccinated  9          5                5          6
#> 10 10   vaccinated 10          8                2          6
#> 11 11 unvaccinated  1          9                5          6
#> 32 12 unvaccinated  2          6                0          0
#> 43 13 unvaccinated  3          8                2          5
#> 24 14 unvaccinated  4          8                3          8
#> 25 15 unvaccinated  5          2                5          6
#> 46 16 unvaccinated  6          7                1          8
#> 37 17 unvaccinated  7         10                4          5
#> 28 18 unvaccinated  8         10                0          7
#> 19 19 unvaccinated  9          2                4          3
#> 50 20 unvaccinated 10         10                5          1

Created on 2025-01-02 with reprex v2.1.1
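
To see how close the group means actually come out, here is a minimal base R sketch. It is not part of the original answer: sample_one_per_id is a hypothetical helper that stands in for the strata call, and the check uses aggregate and replicate. Note that drawing one row per ID at random keeps the unvaccinated means close, on average, to the means over all unvaccinated rows; it does not by itself pull them toward the vaccinated means, so any single draw is worth inspecting.

```r
set.seed(123)
data <- data.frame(
  ID = c(1:10, rep(11:20, 4)),
  group = c(rep("vaccinated", 10), rep("unvaccinated", 40)),
  T0 = rep(1:10, 5),
  PCP_visits = sample(0:10, 50, replace = TRUE),
  specialty_visits = sample(0:5, 50, replace = TRUE),
  lab_visits = sample(0:8, 50, replace = TRUE)
)

visit_cols <- c("PCP_visits", "specialty_visits", "lab_visits")

# Hypothetical helper (an assumption, not from the sampling package):
# pick one row at random for each ID
sample_one_per_id <- function(d) {
  rows <- split(seq_len(nrow(d)), d$ID)
  # Index via sample.int so that IDs with a single row are handled correctly
  picked <- vapply(rows, function(i) i[sample.int(length(i), 1L)], integer(1))
  d[picked, ]
}

one_per_id <- sample_one_per_id(data)

# Group means of the three visit counts in this one sample
sample_means <- aggregate(one_per_id[visit_cols],
                          by = list(group = one_per_id$group),
                          FUN = mean)
print(sample_means)

# Repeat the draw for the unvaccinated group to see what "on average" means
unvacc <- data[data$group == "unvaccinated", ]
draw_means <- replicate(1000, colMeans(sample_one_per_id(unvacc)[visit_cols]))

print(rowMeans(draw_means))          # average of the sample means over 1000 draws
print(colMeans(unvacc[visit_cols]))  # means over all unvaccinated rows
```

If the means in sample_means differ more than you are willing to accept, random selection alone is not enough, and the row chosen for each unvaccinated ID would have to depend on the visit counts themselves.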
