Using a for loop, while loop, tidyverse, or a package to create a dataset whose characteristics match a previous dataset (sampling)


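I am working with panel data. We assessed children in 2019 and again in 2020, so I have two datasets (2019 and 2020). I want to create a third dataset, drawn from the second dataset (2020), whose characteristics match those of the first dataset (2019). This third dataset will have fewer participants, but they will share the characteristics of their 2019 "peers": the proportion of boys and girls will be roughly the same as in 2019, the mothers' ages will be roughly the same, and so on.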


Code:

df_2019 = structure(list(asqse_quest = c(24, 24, 24, 24, 24, 24, 24, 24, 
                                         24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 
                                         24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 
                                         24, 24, 24, 24, 24, 24, 24, 24, 24, 24), year_completed_cat = structure(c(2L, 
                                                                                                                   2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
                                                                                                                   2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
                                                                                                                   2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
                                                                                                                   2L), levels = c("18", "19", "20", "21", "22", "23", "24"), class = "factor"), 
                         sex_male = c(1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 
                                      1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 
                                      0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0), momage = c(36, 
                                                                                                  39, 22, 20, 29, 40, 31, 37, 29, 38, 24, 35, 32, 30, 32, 31, 
                                                                                                  29, 21, 28, 29, 40, 21, 38, 29, 28, 33, 25, 25, 30, 29, 25, 
                                                                                                  27, 28, 31, 24, 28, 35, 29, 17, 35, 32, 29, 27, 24, 29, 25, 
                                                                                                  28, 24, 21, 26), momed = c(4, 4, 2, 2, 4, 3, 2, 3, 2, 4, 
                                                                                                                             3, 4, 4, 4, 4, 4, 3, 4, 3, 4, 4, 2, 2, 4, 4, 4, 4, 4, 4, 
                                                                                                                             4, 2, 4, 3, 3, 3, 3, 4, 4, 2, 4, 4, 3, 2, 2, 3, 4, 4, 3, 
                                                                                                                             2, 4), income = c(4, 4, 2, 3, 4, 1, 2, 5, 4, 4, 5, 4, 4, 
                                                                                                                                               4, 4, 4, 4, 2, 3, 3, 4, 2, 3, 4, 4, 4, 5, 4, 3, 3, 4, 4, 
                                                                                                                                               3, 4, 1, 4, 2, 4, 3, 4, 4, 3, 4, 3, 4, 4, 4, 3, 4, 4)), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                                                           -50L))


df_2020 = structure(list(asqse_quest = c(24, 24, 24, 24, 24, 24, 24, 24, 
                                         24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 
                                         24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 
                                         24, 24, 24, 24, 24, 24, 24, 24, 24, 24), year_completed_cat = structure(c(3L, 
                                                                                                                   3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
                                                                                                                   3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
                                                                                                                   3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
                                                                                                                   3L), levels = c("18", "19", "20", "21", "22", "23", "24"), class = "factor"), 
                         sex_male = c(1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 
                                      0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 
                                      1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1), momage = c(23, 
                                                                                                  26, 33, 34, 29, 26, 23, 29, 40, 36, 33, 18, 31, 31, 31, 32, 
                                                                                                  34, 35, 29, 37, 19, 30, 33, 25, 32, 35, 37, 27, 23, 29, 28, 
                                                                                                  26, 30, 27, 38, 28, 29, 39, 26, 25, 29, 39, 35, 32, 20, 38, 
                                                                                                  31, 27, 28, 23), momed = c(2, 4, 4, 3, 4, 3, 2, 2, 3, 4, 
                                                                                                                             1, 2, 2, 4, 4, 4, 4, 2, 4, 4, 2, 4, 4, 4, 2, 4, 4, 2, 4, 
                                                                                                                             2, 1, 4, 3, 2, 4, 4, 4, 2, 4, 2, 4, 4, 4, 4, 2, 4, 4, 4, 
                                                                                                                             4, 1), income = c(2, 4, 4, 4, 4, 5, 3, 2, 2, 4, 1, 3, 4, 
                                                                                                                                               5, 1, 4, 3, 1, 4, 5, 5, 4, 4, 4, 3, 4, 4, 2, 4, 5, 1, 4, 
                                                                                                                                               4, 1, 4, 4, 4, 4, 3, 4, 4, 4, 5, 4, 2, 4, 4, 4, 4, 4)), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                                                           -50L))

Created on 2024-07-12 with reprex v2.1.0

r random while-loop tidyverse matching
1 Answer

You can try the MatchIt package, which provides a function for propensity score matching.

We first combine the two datasets with `bind_rows`, assigning an id that distinguishes them:

library(dplyr)

data <- bind_rows(df_2019, df_2020, .id = "year") |>
  mutate(year = +(year == 1)) # 1 = 2019 (cases), 0 = 2020 (controls)

Rows with year == 1 are your cases (the 2019 data); rows with year == 0 are your controls (the 2020 data).
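The recoding step is compact, so here is a minimal sketch of what it does. It assumes that `bind_rows(.id = "year")` labels unnamed inputs with the strings "1" and "2"; `year_id` is an illustrative stand-in for that column, not something taken from the data above.

```r
# Stand-in for the id column that bind_rows(.id = "year") is assumed to create:
# "1" for rows from the first data frame, "2" for rows from the second.
year_id <- c("1", "1", "2", "2")

# year_id == 1 yields TRUE for first-data-frame rows; unary plus turns the
# logical vector into integers (TRUE -> 1, FALSE -> 0).
treat <- +(year_id == 1)

print(treat)
```

An equivalent and perhaps more readable spelling would be `as.integer(year_id == "1")`.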

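To find controls that match the cases as closely as possible, we can use the `matchit` function. It has many arguments; for brevity we will mostly rely on the defaults.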

library(MatchIt)

We first try exact matching on year of completion, sex, and mother's age to see whether we get lucky.

match_obj <- matchit(year ~ asqse_quest+year_completed_cat+sex_male+momage+momed+income,
                     data = data, 
                     exact= ~ year_completed_cat+sex_male+momage,
                     replace = FALSE)

#Error in `matchit()`:
#! No matches were found.
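One way to see why no exact match can exist is to check whether the two years share any value of the exact-matching variable. The sketch below uses small illustrative factors (`y2019`, `y2020`) that mimic `year_completed_cat` in the question's data, where every 2019 row is "19" and every 2020 row is "20"; with the real data you would pass the actual columns instead.

```r
# Illustrative stand-ins for df_2019$year_completed_cat and
# df_2020$year_completed_cat (same levels as in the question's data).
lv <- c("18", "19", "20", "21", "22", "23", "24")
y2019 <- factor(rep("19", 5), levels = lv)
y2020 <- factor(rep("20", 5), levels = lv)

# Values that occur in both years; exact matching needs at least one.
shared <- intersect(as.character(y2019), as.character(y2020))

print(length(shared))
```

If `shared` is empty, no stratum contains both a case and a control, so exact matching on that variable cannot succeed.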

This is not surprising, because the two datasets share no value of year of completion at all. Let's make the matching criteria less strict:

match_obj <- matchit(year ~ asqse_quest+year_completed_cat+sex_male+momage+momed+income,
                     data = data, 
                     exact= ~ sex_male+momage,
                     replace = FALSE)

No error this time, but we do get a warning:

#Warning message:
#Fewer control units than treated units in some `exact` strata; not all treated units will get a match. 
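The warning can be reasoned about by hand. Assuming 1:1 matching without replacement and no caliper, each exact stratum (here a combination of `sex_male` and `momage`) can contribute at most min(treated, control) pairs. The sketch below counts that upper bound on a small made-up data frame `toy`; it is a plain base R tally, not a call into MatchIt.

```r
# Made-up rows: year 1 = cases, year 0 = controls.
toy <- data.frame(
  year     = c(1, 1, 1, 0, 0, 0, 0),
  sex_male = c(1, 1, 0, 1, 0, 0, 1),
  momage   = c(29, 29, 31, 29, 31, 31, 40)
)

# One stratum per observed sex_male x momage combination.
strata <- interaction(toy$sex_male, toy$momage, drop = TRUE)

# Rows = strata, columns = year ("0" controls, "1" cases).
counts <- table(strata, toy$year)

# Each stratum can supply at most min(cases, controls) matched pairs.
pairs_possible <- sum(pmin(counts[, "0"], counts[, "1"]))

print(counts)
print(pairs_possible)
```

Strata where the control count is below the case count are the ones the warning refers to: some cases there necessarily stay unmatched.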

That's fine. Now summarize the result.

summary(match_obj)
...
Sample Sizes:
          Control Treated
All            50      50
Matched        25      25
Unmatched      25      25
Discarded       0       0

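The output shows that we found matches for 25 of the original 50 controls. Other useful information is reported as well, but I have omitted it here for simplicity. Now use `match.data` to retrieve the matched controls together with their matched cases.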

matched_data <- match.data(match_obj)

Now we just filter out the cases, leaving only the matched controls:

df_2020_new <- filter(matched_data, year==0)
head(df_2020_new)

   asqse_quest year_completed_cat sex_male momage momed income
1           24                 20        1     23     2      2
2           24                 20        1     26     4      4
3           24                 20        1     33     4      4
4           24                 20        1     34     3      4
5           24                 20        0     29     4      4
6           24                 20        1     26     3      5
7           24                 20        0     23     2      3
8           24                 20        1     29     2      2
9           24                 20        0     40     3      2
10          24                 20        1     36     4      4
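Since the goal in the question is a 2020 subset that resembles the 2019 cohort, it may be worth comparing the two groups after matching. The sketch below is one simple way to do that with base R; `compare_groups`, `ref`, and `mat` are hypothetical names and made-up values, standing in for `df_2019` and `df_2020_new`. Note that the matched controls were paired with the matched 2019 cases only, so agreement with the full 2019 sample is not guaranteed.

```r
# Hypothetical helper: means of selected variables in a reference group
# and in a matched group, side by side.
compare_groups <- function(reference, matched, vars) {
  data.frame(
    variable       = vars,
    reference_mean = sapply(vars, function(v) mean(reference[[v]])),
    matched_mean   = sapply(vars, function(v) mean(matched[[v]])),
    row.names      = NULL
  )
}

# Made-up stand-ins for df_2019 (reference) and df_2020_new (matched).
ref <- data.frame(sex_male = c(1, 0, 1, 1), momage = c(29, 31, 29, 35))
mat <- data.frame(sex_male = c(1, 0, 1),    momage = c(29, 31, 30))

balance <- compare_groups(ref, mat, c("sex_male", "momage"))
print(balance)
```

With the real data, the call would be along the lines of `compare_groups(df_2019, df_2020_new, c("sex_male", "momage", "momed", "income"))`.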

See the help page for `matchit` to learn how to change the matching method. There are too many details to cover here, but this is the basic idea.
