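My dataset contains about 7,000 families, and for each family I have the parents' income and the children's income. I want to run a simple linear regression of the children's income on the parents' income. However, I need to make sure this regression is run for each family.
Example dataset: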
income_parents <- c(1000, 15000, 4500, 7000, 6500, 2500, 3500, 9000, 1200)
income_children <- c(1200, 7500, 2500, 8000, 5500, 7500, 3250, 7500, 850)
family_name <- c("Miller", "Smith", "Clark", "Powell", "Brown", "Jone", "Garcia", "Williams", "Lopez")
df <- data.frame(income_parents, income_children, family_name)
After grouping by family_name, I run the following regression:
library(dplyr)
df_AR <- df %>% group_by(family_name)
AR_1 <- lm(income_children ~ income_parents, data = df_AR)
summary(AR_1)
Now I am wondering whether lm() takes the nested data structure into account. If not, how should I change the code so that it does?
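This won't do what you expect. lm() is quite old and knows nothing about features from newer libraries such as dplyr: group_by() only attaches grouping metadata to the data frame, so lm() still fits a single model to all rows.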
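A quick way to check this, as a minimal sketch using the example data from the question: fit the same formula on the plain and on the grouped data frame and compare the coefficients. The names fit_plain and fit_grouped are just illustrative.

```r
library(dplyr)

# Same example data as in the question.
df <- data.frame(
  income_parents  = c(1000, 15000, 4500, 7000, 6500, 2500, 3500, 9000, 1200),
  income_children = c(1200, 7500, 2500, 8000, 5500, 7500, 3250, 7500, 850),
  family_name     = c("Miller", "Smith", "Clark", "Powell", "Brown",
                      "Jone", "Garcia", "Williams", "Lopez")
)

# One fit on the plain data frame, one on the grouped version.
fit_plain   <- lm(income_children ~ income_parents, data = df)
fit_grouped <- lm(income_children ~ income_parents,
                  data = group_by(df, family_name))

# lm() does not look at the grouping, so both calls fit one pooled model
# and the coefficients should match.
all.equal(coef(fit_plain), coef(fit_grouped))
```

If the two sets of coefficients agree, the grouping had no effect on the fit.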
I believe you can accomplish what you want by adding the family name as an indicator variable in the model. Something like:
model <- lm(income_children ~ family_name + family_name:income_parents, data = df)
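This effectively creates a mini model for each family. The first part gives each family its own intercept (with R's default contrasts, expressed as an offset from the reference family), while the interaction term gives each family's slope for income_parents.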
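To see that the interaction model really reproduces per-family fits, here is a rough sketch on made-up data. The toy data frame, its three families "A", "B", "C" and the simulated incomes are all hypothetical; the question's example has only one row per family, so the slopes cannot be estimated there. I also add "0 +" to the formula, which is my own tweak so that each family's intercept is reported directly instead of as an offset.

```r
# Hypothetical data: three families with four observations each.
set.seed(1)
toy <- data.frame(
  family_name    = rep(c("A", "B", "C"), each = 4),
  income_parents = rep(c(1000, 2000, 3000, 4000), times = 3)
)
toy$income_children <- 500 + 0.5 * toy$income_parents +
  rnorm(nrow(toy), sd = 100)

# "0 +" removes the global intercept, so family_name contributes one
# intercept per family and the interaction one slope per family.
combined <- lm(income_children ~ 0 + family_name + family_name:income_parents,
               data = toy)

# Separate simple regression using only family "A".
separate_A <- lm(income_children ~ income_parents,
                 data = subset(toy, family_name == "A"))

# Intercept and slope for family "A" from both approaches.
coef(combined)[c("family_nameA", "family_nameA:income_parents")]
coef(separate_A)
```

The point estimates should agree between the two approaches. The standard errors generally will not, because the combined model pools the residual variance across all families.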
If you want to stick with the many-models approach, you can use tidyr's nest().
library(dplyr)
library(tidyr)
library(purrr)

# Fit one simple regression to a single family's data.
model_one <- function(data) {
  lm(income_children ~ income_parents, data = data)
}

models <- df %>%
  group_by(family_name) %>%
  nest() %>%
  mutate(model = map(data, model_one))

models
# # A tibble: 9 x 3
# # Groups:   family_name [9]
#   family_name data             model
#   <fct>       <list>           <list>
# 1 Miller      <tibble [1 × 2]> <lm>
# 2 Smith       <tibble [1 × 2]> <lm>
# 3 Clark       <tibble [1 × 2]> <lm>
# 4 Powell      <tibble [1 × 2]> <lm>
# 5 Brown       <tibble [1 × 2]> <lm>
# 6 Jone        <tibble [1 × 2]> <lm>
# 7 Garcia      <tibble [1 × 2]> <lm>
# 8 Williams    <tibble [1 × 2]> <lm>
# 9 Lopez       <tibble [1 × 2]> <lm>
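You'll notice that this output isn't very useful yet. You can summarize each model concisely with broom::glance and then unnest the result. It isn't very interesting here, because each model has only a single data point.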
summarized <- models %>%
  mutate(summary = map(model, broom::glance)) %>%
  unnest(summary)

# Drop the still-nested columns for display.
summarized %>% select(-data, -model)
#   family_name r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC deviance df.residual
#   <fct>           <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>    <dbl>       <int>
# 1 Miller              0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 2 Smith               0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 3 Clark               0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 4 Powell              0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 5 Brown               0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 6 Jone                0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 7 Garcia              0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 8 Williams            0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 9 Lopez               0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
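With several observations per family the same pattern becomes more informative. The following is only a sketch on hypothetical data (the toy tibble, its families "A" to "C" and the simulated incomes are invented for illustration); it uses broom::tidy instead of broom::glance to pull out each family's slope.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical data: three families with five observations each.
set.seed(42)
toy <- tibble(
  family_name     = rep(c("A", "B", "C"), each = 5),
  income_parents  = rep(seq(1000, 5000, by = 1000), times = 3),
  income_children = 800 + 0.6 * income_parents + rnorm(15, sd = 200)
)

slopes <- toy %>%
  group_by(family_name) %>%
  nest() %>%
  ungroup() %>%
  # One lm per family, then one coefficient table per model.
  mutate(
    model = map(data, ~ lm(income_children ~ income_parents, data = .x)),
    coefs = map(model, broom::tidy)
  ) %>%
  unnest(coefs) %>%
  # Keep only the slope row for each family.
  filter(term == "income_parents") %>%
  select(family_name, estimate, std.error, p.value)

slopes
```

Each row of slopes then holds one family's estimated slope with its standard error and p-value.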
For more details, I recommend the Many Models chapter of R for Data Science.