Linear regression in R: getting different results depending on dummy coding

Question

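I am running a linear regression with one continuous predictor and one categorical (two-level) predictor. I have included the categorical factor as a named factor ("Control" and "BD") and, as a check, as a dummy-coded factor ("0" and "1"). Running the same linear regression command gives different results for the continuous predictor, but identical results for the categorical factor and the interaction.

I read in a csv that has "Group" and "GrpNum" as separate columns. The "Group" column lists the group assigned to each case (Control, BD), and the GrpNum column is the dummy code for that factor (0 = Control). I get different results, but only for the effect of the continuous predictor, depending on which categorical factor I use, even though the two factors are identical.

I tested with a different dummy coding (1 = Control, 0 = BD) and found that this reproduces the problem. However, when I check whether any of these factors is ordered, R reports that they are not.

Because the effects differ, I don't know which analysis to trust as "correct". I know the order may change the intercept, but that is not very important. What worries me is the differing value of the ContX effect in the example below.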

> lmtest1 <- lm(Outcome ~ GrpNum * ContX, data = data)
> summary(lmtest1)

Call:
lm(formula = Outcome ~ GrpNum * ContX, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.1330 -3.7301  0.0448  4.1718 11.1743 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -13.7037     4.2195  -3.248  0.00202 **
GrpNum        14.0967     5.1699   2.727  0.00865 **
ContX          0.4670     0.1364   3.423  0.00120 **
GrpNum:ContX  -0.4179     0.1639  -2.550  0.01372 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.111 on 53 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:  0.2008,    Adjusted R-squared:  0.1556 
F-statistic:  4.44 on 3 and 53 DF,  p-value: 0.007413

Note that the coefficient for ContX is 0.4670, which is a significant effect.

> lmtest2 <- lm(Outcome ~ GrpNum2 * ContX, data = data)
> summary(lmtest2)

Call:
lm(formula = Outcome ~ GrpNum2 * ContX, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.1330 -3.7301  0.0448  4.1718 11.1743 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)   
(Intercept)     0.39294    2.98729   0.132  0.89585   
GrpNum2       -14.09666    5.16990  -2.727  0.00865 **
ContX           0.04908    0.09087   0.540  0.59137   
GrpNum2:ContX   0.41791    0.16392   2.550  0.01372 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.111 on 53 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:  0.2008,    Adjusted R-squared:  0.1556 
F-statistic:  4.44 on 3 and 53 DF,  p-value: 0.007413

The variable "GrpNum2" is the reverse coding of "GrpNum". Here the coefficient for ContX is 0.0491, which is not significant.

> is.ordered(data$GrpNum)
[1] FALSE
> is.ordered(data$GrpNum2)
[1] FALSE
> is.ordered(data$Group)
[1] FALSE

All of the factors are unordered.

To reproduce:

GrpNum1 <- as.factor(rep(c(1,0), 20))
GrpNum2 <- as.factor(rep(c(0,1), 20))
Outcome <- rnorm(40, 0, 1)
ContX <- rnorm(40, 10, 5)
data <- data.frame(GrpNum1, GrpNum2, Outcome, ContX)
lmtest1 <- lm(Outcome ~ GrpNum1 * ContX, data = data)
lmtest2 <- lm(Outcome ~ GrpNum2 * ContX, data = data)
summary(lmtest1)
summary(lmtest2)

Answer 1

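I think you are being tripped up by the interaction term. Once the model contains a GrpNum:ContX term, you can no longer ask "what is the effect of ContX?"; you have to ask "what is the effect of ContX at a given value of GrpNum?" The ContX row of the coefficient table is the slope of ContX in the baseline (reference) group only. Here is a toy example in which, if GrpNum1 == 1, I set the effect of ContX on Outcome to

Outcome = 2.13 * ContX + 0.78

and, if GrpNum1 == 0, Outcome is zero. I also meant to add some noise to each Outcome value, but as the reprex was run the noise is not applied, hence the "essentially perfect fit" warnings. In the first fit ContX looks insignificant, with a coefficient of essentially 0, whereas in the second fit it has a tiny p-value and a coefficient of 2.13. This is because a different level of the grouping factor is chosen as the baseline. Both fits produce the same predicted values, as shown at the end of the reprex. Neither fit is more "correct" than the other. If you fit without the interaction, you will find that (in this case) the coding of the categorical variable does not matter for the fitted results, up to sign.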

GrpNum1 <- as.factor(rep(c(1,0), 20))
GrpNum2 <- as.factor(rep(c(0,1), 20))
set.seed(123)
ContX <- rnorm(40, 10, 5)
Outcome <- ifelse(GrpNum1 == 1, ContX * 2.13 + 0.78, 0)
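# Note: this assigns to `Outcom`, not `Outcome`, so the noise is never added;
# that is why summary() warns of an essentially perfect fit below.
Outcom <- Outcome + rnorm(40, 0, 2)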
data <- data.frame(GrpNum1, GrpNum2, Outcome, ContX)
lmtest1 <- lm(Outcome ~ GrpNum1 * ContX, data = data)
lmtest2 <- lm(Outcome ~ GrpNum2 * ContX , data = data)
summary(lmtest1)
#> Warning in summary.lm(lmtest1): essentially perfect fit: summary may be
#> unreliable
#> 
#> Call:
#> lm(formula = Outcome ~ GrpNum1 * ContX, data = data)
#> 
#> Residuals:
#>        Min         1Q     Median         3Q        Max 
#> -3.599e-14  3.460e-16  1.145e-15  1.810e-15  4.162e-15 
#> 
#> Coefficients:
#>                  Estimate Std. Error    t value Pr(>|t|)    
#> (Intercept)    -4.143e-15  3.282e-15 -1.262e+00    0.215    
#> GrpNum11        7.800e-01  5.304e-15  1.471e+14   <2e-16 ***
#> ContX          -2.546e-16  2.999e-16 -8.490e-01    0.402    
#> GrpNum11:ContX  2.130e+00  4.742e-16  4.492e+15   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 6.484e-15 on 36 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 5.475e+31 on 3 and 36 DF,  p-value: < 2.2e-16
summary(lmtest2)
#> Warning in summary.lm(lmtest2): essentially perfect fit: summary may be
#> unreliable
#> 
#> Call:
#> lm(formula = Outcome ~ GrpNum2 * ContX, data = data)
#> 
#> Residuals:
#>        Min         1Q     Median         3Q        Max 
#> -4.177e-14 -1.760e-15  1.376e-15  1.787e-15  2.963e-14 
#> 
#> Coefficients:
#>                  Estimate Std. Error    t value Pr(>|t|)    
#> (Intercept)     7.800e-01  5.705e-15  1.367e+14   <2e-16 ***
#> GrpNum21       -7.800e-01  7.262e-15 -1.074e+14   <2e-16 ***
#> ContX           2.130e+00  5.029e-16  4.236e+15   <2e-16 ***
#> GrpNum21:ContX -2.130e+00  6.492e-16 -3.281e+15   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 8.878e-15 on 36 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 2.92e+31 on 3 and 36 DF,  p-value: < 2.2e-16
data$Pred1 <- predict(lmtest1, newdata = data)
data$Pred2 <- predict(lmtest2, newdata = data)
head(data)
#>   GrpNum1 GrpNum2  Outcome     ContX         Pred1         Pred2
#> 1       1       0 16.11093  7.197622  1.611093e+01  1.611093e+01
#> 2       0       1  0.00000  8.849113 -6.395346e-15 -3.552714e-15
#> 3       1       0 38.68024 17.793542  3.868024e+01  3.868024e+01
#> 4       0       1  0.00000 10.352542 -6.778048e-15 -3.552714e-15
#> 5       1       0 23.45691 10.646439  2.345691e+01  2.345691e+01
#> 6       0       1  0.00000 18.575325 -8.871177e-15  0.000000e+00

Created on 2024-11-11 with reprex v2.1.1
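
To make the "effect of ContX within each group" point concrete, here is a minimal sketch rather than a definitive recipe. It assumes toy data like the reprex, but with the noise really added, and the names g1, g2, d, fit1 and fit2 are made up for illustration. It shows that either fit gives you both within-group slopes once you add the interaction coefficient to the ContX coefficient, that dropping the interaction only flips the sign of the group term, and that the same arithmetic links the two summaries printed in the question.

```r
set.seed(123)

# Toy data in the spirit of the reprex, with noise actually added (assumption)
g1    <- factor(rep(c(1, 0), 20))   # levels "0", "1"; baseline is "0"
g2    <- factor(rep(c(0, 1), 20))   # same grouping with the labels reversed
ContX <- rnorm(40, 10, 5)
Outcome <- ifelse(g1 == "1", 2.13 * ContX + 0.78, 0) + rnorm(40, 0, 2)
d <- data.frame(g1, g2, ContX, Outcome)

fit1 <- lm(Outcome ~ g1 * ContX, data = d)
fit2 <- lm(Outcome ~ g2 * ContX, data = d)
b1 <- coef(fit1)
b2 <- coef(fit2)

# Slope of ContX within each group, recovered from fit1 alone
slope_g1_is_0 <- b1[["ContX"]]                      # baseline group of fit1
slope_g1_is_1 <- b1[["ContX"]] + b1[["g11:ContX"]]  # main effect + interaction

# fit2 reports the same two slopes with the roles swapped
all.equal(slope_g1_is_1, b2[["ContX"]])
all.equal(slope_g1_is_0, b2[["ContX"]] + b2[["g21:ContX"]])

# Without the interaction, recoding only flips the sign of the group term
add1 <- coef(lm(Outcome ~ g1 + ContX, data = d))
add2 <- coef(lm(Outcome ~ g2 + ContX, data = d))
all.equal(add1[["ContX"]], add2[["ContX"]])
all.equal(add1[["g11"]], -add2[["g21"]])

# The same arithmetic on the estimates printed in the question
0.4670 + (-0.4179)   # about 0.0491, the ContX estimate in the second summary
-13.7037 + 14.0967   # about 0.393, the intercept in the second summary
```

So the two summaries in the question describe one and the same fitted model. If you want the ContX row to describe a particular group, make that group the reference level, for example with relevel() on the factor, and read the other group's slope as ContX plus the interaction term.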
