I want to simulate data for a multiple linear regression with four predictor variables, where I can freely specify the true (standardized) regression coefficients and the explained variance (R²). I found a solution that satisfies these first two points, but it rests on the assumption that all predictors are uncorrelated with each other (see the code below). To obtain standardized regression coefficients, I sample the predictor variables from a population with mean 0 and variance 1.
# Population variance/covariance matrix of the four predictor variables
sigma.1 <- matrix(c(1, 0, 0, 0,
                    0, 1, 0, 0,
                    0, 0, 1, 0,
                    0, 0, 0, 1), nrow = 4, ncol = 4)
# Population means of the four predictor variables
mu.1 <- rep(0,4)
# Specify sample size, true regression coefficients, and explained variance
n.obs <- 50000 # to avoid sampling error problems
intercept <- 0.5
beta <- c(0.4, 0.3, 0.25, 0.25)
r2 <- 0.30
# Create sample with four predictor variables
library(MASS)
sample1 <- as.data.frame(mvrnorm(n = n.obs, mu.1, sigma.1, empirical=FALSE))
# Add error variable based on desired r2
var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2)*((1 - r2)/r2)
sample1$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
# Add y variable based on true coefficients and desired r2
sample1$y <- intercept + beta[1]*sample1$V1 + beta[2]*sample1$V2 +
beta[3]*sample1$V3 + beta[4]*sample1$V4 + sample1$epsilon
# Inspect model
summary(lm(y~V1+V2+V3+V4, data=sample1))
Call:
lm(formula = y ~ V1 + V2 + V3 + V4, data = sample1)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0564 -0.6310 -0.0048  0.6339  3.7119 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.496063   0.004175  118.82   <2e-16 ***
V1          0.402588   0.004189   96.11   <2e-16 ***
V2          0.291636   0.004178   69.81   <2e-16 ***
V3          0.247347   0.004171   59.30   <2e-16 ***
V4          0.253810   0.004175   60.79   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9335 on 49995 degrees of freedom
Multiple R-squared:  0.299,  Adjusted R-squared:  0.299 
F-statistic:  5332 on 4 and 49995 DF,  p-value: < 2.2e-16
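As a sanity check on the uncorrelated case, the algebra behind `var.epsilon` can be verified directly: with standardized, uncorrelated predictors, Var(ŷ) = Σβᵢ², and solving R² = Var(ŷ)/(Var(ŷ) + σ²ε) for σ²ε gives exactly the expression used in the code. A minimal sketch of this arithmetic (in Python, purely for illustration of the formula):

```python
# Verify that var.epsilon = sum(beta^2) * (1 - r2) / r2 yields the
# desired population R^2 when the predictors are uncorrelated.
beta = [0.4, 0.3, 0.25, 0.25]
r2 = 0.30

var_yhat = sum(b**2 for b in beta)        # Var(y-hat) = sum of squared betas
var_epsilon = var_yhat * (1 - r2) / r2    # error variance implied by desired R^2
r2_implied = var_yhat / (var_yhat + var_epsilon)

print(round(var_epsilon, 4))  # 0.875
print(round(r2_implied, 4))   # 0.3
```

This matches the simulation above: the fitted residual standard error 0.9335 is close to sqrt(0.875) ≈ 0.9354, and the multiple R-squared is close to the target 0.30.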
Question: if my predictor variables are correlated, i.e. if I specify their variance/covariance matrix with nonzero off-diagonal elements, then the resulting R² and regression coefficients deviate substantially from what I want them to be, for example when using
sigma.1 <- matrix(c(1, 0.25, 0.25, 0.25,
                    0.25, 1, 0.25, 0.25,
                    0.25, 0.25, 1, 0.25,
                    0.25, 0.25, 0.25, 1), nrow = 4, ncol = 4)
Any suggestions? Thanks!
After thinking the problem through, I found the answer myself.

The code above first samples the predictor variables with the specified degree of intercorrelation, then adds an error column based on the desired R² value, and finally combines everything into the y column.

So far, the lines creating the error term were simply
var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2)*((1 - r2)/r2)
sample1$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
It therefore assumes that each beta coefficient contributes its full share of explained variance to y (i.e., that there is no intercorrelation among the predictors). But if the x variables are correlated, each beta no longer contributes its share independently (!): the predictors share variance, so the variance of the predicted part of y grows, and the error variance must grow with it to keep the desired R².

By how much? Simply adjust the creation of the error term as follows:
var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2+cor(sample1$V1, sample1$V2))*((1 - r2)/r2)
So by adding cor(sample1$V1, sample1$V2), the degree to which the predictors are intercorrelated is added to the error variance. With intercorrelations of 0.25, for example, obtained by using

sigma.1 <- matrix(c(1, 0.25, 0.25, 0.25,
                    0.25, 1, 0.25, 0.25,
                    0.25, 0.25, 1, 0.25,
                    0.25, 0.25, 0.25, 1), nrow = 4, ncol = 4)

cor(sample1$V1, sample1$V2) will be approximately 0.25, and that value is added to the variance of the error term.
Assuming all intercorrelations are equal, as here, this lets you specify any degree of intercorrelation between the independent variables together with the true standardized regression coefficients and the desired R².
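A caveat worth noting: adding a single pairwise correlation to the sum of squared betas is an approximation. For standardized predictors with correlation matrix Σ, the exact variance of the linear predictor is βᵀΣβ = Σβᵢ² + 2ρΣ_{i<j}βᵢβⱼ (for equal intercorrelations ρ), so the exact error variance is βᵀΣβ·(1 − r2)/r2. With the betas used here the shortcut happens to land close to the target, which is why the simulated R² below comes out near, but not exactly at, 0.15. A quick comparison of the two (Python, purely for the arithmetic):

```python
import numpy as np

# Compare the exact error variance (beta' Sigma beta) with the
# single-correlation shortcut used in the answer, for rho = 0.35.
beta = np.array([0.4, 0.3, 0.25, 0.25])
rho, r2 = 0.35, 0.15
sigma = np.full((4, 4), rho)
np.fill_diagonal(sigma, 1.0)

var_yhat_exact = beta @ sigma @ beta                      # beta' Sigma beta
var_eps_exact = var_yhat_exact * (1 - r2) / r2            # hits R^2 = 0.15 exactly
var_eps_approx = (np.sum(beta**2) + rho) * (1 - r2) / r2  # the shortcut above

r2_exact = var_yhat_exact / (var_yhat_exact + var_eps_exact)
r2_approx = var_yhat_exact / (var_yhat_exact + var_eps_approx)
print(round(r2_exact, 4), round(r2_approx, 4))  # 0.15 0.154
```

The shortcut's implied population R² of about 0.154 is consistent with the 0.1558 observed in the proof run below; using `t(beta) %*% sigma.1 %*% beta` in place of the sum-plus-correlation term in the R code would hit the target exactly.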
Proof:
sigma.1 <- matrix(c(1, 0.35, 0.35, 0.35,
                    0.35, 1, 0.35, 0.35,
                    0.35, 0.35, 1, 0.35,
                    0.35, 0.35, 0.35, 1), nrow = 4, ncol = 4)
# Population means of the four predictor variables
mu.1 <- rep(0,4)
# Specify sample size, true regression coefficients, and explained variance
n.obs <- 500000 # to avoid sampling error problems
intercept <- 0.5
beta <- c(0.4, 0.3, 0.25, 0.25)
r2 <- 0.15
# Create sample with four predictor variables
library(MASS)
sample1 <- as.data.frame(mvrnorm(n = n.obs, mu.1, sigma.1, empirical=FALSE))
# Add error variable based on desired r2
var.epsilon <- (beta[1]^2+beta[2]^2+beta[3]^2+beta[4]^2+cor(sample1$V1, sample1$V2))*((1 - r2)/r2)
sample1$epsilon <- rnorm(n.obs, sd=sqrt(var.epsilon))
# Add y variable based on true coefficients and desired r2
sample1$y <- intercept + beta[1]*sample1$V1 + beta[2]*sample1$V2 +
beta[3]*sample1$V3 + beta[4]*sample1$V4 + sample1$epsilon
# Inspect model
summary(lm(y~V1+V2+V3+V4, data=sample1))
Call:
lm(formula = y ~ V1 + V2 + V3 + V4, data = sample1)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.7250  -1.3696   0.0017   1.3650   9.0460 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.499554   0.002869  174.14   <2e-16 ***
V1          0.406360   0.003236  125.56   <2e-16 ***
V2          0.298892   0.003233   92.45   <2e-16 ***
V3          0.247581   0.003240   76.42   <2e-16 ***
V4          0.253510   0.003241   78.23   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.028 on 499995 degrees of freedom
Multiple R-squared:  0.1558,  Adjusted R-squared:  0.1557 
F-statistic: 2.306e+04 on 4 and 499995 DF,  p-value: < 2.2e-16
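The same bookkeeping also works when the intercorrelations are not all equal: `var.epsilon = t(beta) %*% sigma.1 %*% beta * (1 - r2)/r2` keeps the population R² at the target for any valid correlation matrix. A sketch with hypothetical, unequal correlations (Python for the arithmetic; the matrix values are made up for illustration):

```python
import numpy as np

# With unequal (hypothetical) intercorrelations, the quadratic form
# beta' Sigma beta still gives Var(y-hat), and the implied error
# variance still yields exactly the desired population R^2.
beta = np.array([0.4, 0.3, 0.25, 0.25])
r2 = 0.15
sigma = np.array([[1.0, 0.3, 0.2, 0.1],
                  [0.3, 1.0, 0.4, 0.2],
                  [0.2, 0.4, 1.0, 0.3],
                  [0.1, 0.2, 0.3, 1.0]])  # hypothetical, positive definite

var_yhat = beta @ sigma @ beta
var_epsilon = var_yhat * (1 - r2) / r2
print(round(var_yhat, 4))                         # 0.6345
print(round(var_yhat / (var_yhat + var_epsilon), 4))  # 0.15
```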