Is using `group_by` enough to run a grouped linear regression?


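My dataset contains about 7,000 families, and for each family I have the parents' income and the children's income. I want to run a simple linear regression of the children's income on the parents' income. However, I need to make sure this regression is run separately for each family.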

Example dataset:

income_parents <- c(1000, 15000, 4500, 7000, 6500, 2500, 3500, 9000, 1200)
income_children <- c(1200, 7500, 2500, 8000, 5500,  7500, 3250, 7500, 850)
family_name <- c("Miller", "Smith", "Clark", "Powell", "Brown", "Jone", "Garcia", "Williams", "Lopez")

df <- data.frame(income_parents, income_children, family_name)

After grouping by family_name, I run the following regression:

library(dplyr)

df_AR <- df %>% group_by(family_name)
AR_1 <- lm(income_children ~ income_parents, data = df_AR)
summary(AR_1)

Now I'm wondering whether lm() takes the nested data structure into account. If not, how do I change the code so that it does?

r group-by dplyr linear-regression
1 Answer

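This won't do what you expect. lm() predates dplyr and knows nothing about grouped data frames: it ignores the grouping added by group_by() and fits a single model to all rows.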
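One way to check this, as a minimal sketch using the example data from the question: fit the same formula on the plain and on the grouped data frame and compare the coefficients. The names fit_plain and fit_grouped are illustrative only.

```r
library(dplyr)

# Example data from the question.
df <- data.frame(
  income_parents  = c(1000, 15000, 4500, 7000, 6500, 2500, 3500, 9000, 1200),
  income_children = c(1200, 7500, 2500, 8000, 5500, 7500, 3250, 7500, 850),
  family_name     = c("Miller", "Smith", "Clark", "Powell", "Brown",
                      "Jone", "Garcia", "Williams", "Lopez")
)

# Same formula, once on the plain data frame and once on the grouped one.
fit_plain   <- lm(income_children ~ income_parents, data = df)
fit_grouped <- lm(income_children ~ income_parents,
                  data = group_by(df, family_name))

# lm() pools all rows either way, so the two sets of coefficients
# should be identical: one intercept and one slope for the whole data set.
all.equal(coef(fit_plain), coef(fit_grouped))
```

If the grouping had any effect, the second fit would return more than one intercept and slope; it does not.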

I believe you can get what you want by adding the family name as an indicator variable in the model. Something like:

model <- lm(income_children ~ family_name + family_name:income_parents, data = df)

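This effectively fits a mini-model for each family. The family_name term gives each family its own intercept (with R's default contrasts, expressed as offsets from the baseline family), and the interaction term gives each family its own slope for income_parents.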
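To see the "mini-model per family" idea in action, here is a small sketch on hypothetical toy data with several observations per family (the data frame toy and the names combined and fit_a are made up for illustration). It uses a `0 +` variant of the formula, which is an assumption on my part rather than part of the answer above, so that each family_name coefficient is read directly as that family's intercept.

```r
# Hypothetical toy data: two families with three observations each.
toy <- data.frame(
  family_name     = rep(c("A", "B"), each = 3),
  income_parents  = c(1000, 2000, 3000, 1000, 2000, 3000),
  income_children = c(1100, 1900, 3200, 500, 1400, 1800)
)

# One combined model. `0 +` drops the global intercept, so each
# family_name coefficient is that family's own intercept.
combined <- lm(income_children ~ 0 + family_name + family_name:income_parents,
               data = toy)

# Separate fit on family A only, for comparison.
fit_a <- lm(income_children ~ income_parents,
            data = subset(toy, family_name == "A"))

coef(combined)  # intercepts and slopes for both families
coef(fit_a)     # intercept and slope for family A alone
```

The point estimates for family A agree between the two fits. The standard errors generally do not, because the combined model pools the residual variance across all families.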

If you'd rather stick with the many-models approach, you can use nest:

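library(dplyr)
library(tidyr)
library(purrr)

# Fit the regression on one family's data.
model_one <- function(data) {
  lm(income_children ~ income_parents, data = data)
}

# Nest the rows of each family into a list-column, then fit one model per family.
models <- df %>%
  group_by(family_name) %>%
  nest() %>%
  mutate(model = map(data, model_one))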

models
# # A tibble: 9 x 3
# # Groups:   family_name [9]
#   family_name data             model 
#   <fct>       <list>           <list>
# 1 Miller      <tibble [1 × 2]> <lm>  
# 2 Smith       <tibble [1 × 2]> <lm>  
# 3 Clark       <tibble [1 × 2]> <lm>  
# 4 Powell      <tibble [1 × 2]> <lm>  
# 5 Brown       <tibble [1 × 2]> <lm>  
# 6 Jone        <tibble [1 × 2]> <lm>  
# 7 Garcia      <tibble [1 × 2]> <lm>  
# 8 Williams    <tibble [1 × 2]> <lm>  
# 9 Lopez       <tibble [1 × 2]> <lm>  

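You'll notice this output isn't very useful yet. You can summarize each model concisely with broom::glance and then unnest the result. It's not very interesting here, because each model has only one data point.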

summarized <- models %>%
  mutate(summary = map(model, broom::glance)) %>%
  unnest(summary)

# Drop the still-nested columns for display.
summarized %>% select(-data, -model)
#   family_name r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC deviance df.residual
#   <fct>           <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>    <dbl>       <int>
# 1 Miller              0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 2 Smith               0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 3 Clark               0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 4 Powell              0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 5 Brown               0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 6 Jone                0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 7 Garcia              0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 8 Williams            0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
# 9 Lopez               0             0   NaN        NA      NA     1    Inf  -Inf  -Inf        0           0
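With more than one observation per family the same pattern becomes more informative. The following is a sketch only, on hypothetical toy data (toy and per_family are made-up names), and it swaps broom::glance for broom::tidy so that the per-family intercepts and slopes are returned instead of model-level statistics.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical toy data: two families with three observations each.
toy <- data.frame(
  family_name     = rep(c("A", "B"), each = 3),
  income_parents  = c(1000, 2000, 3000, 1000, 2000, 3000),
  income_children = c(1100, 1900, 3200, 500, 1400, 1800)
)

per_family <- toy %>%
  group_by(family_name) %>%
  nest() %>%
  mutate(
    # One lm fit per family, stored in a list-column.
    model = map(data, ~ lm(income_children ~ income_parents, data = .x)),
    # One small coefficient table per model.
    coefs = map(model, broom::tidy)
  ) %>%
  select(family_name, coefs) %>%
  unnest(coefs) %>%
  ungroup()

# One row per family and term; the estimate column holds the
# intercept and the income_parents slope for each family.
per_family
```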

For more details, I recommend the Many Models chapter of R for Data Science.
