我想编写一个脚本来在整个 data.table 中运行回归模型,其中我的函数适合模型并提取信息以供以后分析。我有大量模型需要拟合,所以我想加快速度。为了做到这一点,我想向量化这个函数,但是由于数据中散布着 NA 值,并且我不想估算缺失值,所以我一直在努力,只需将每个模型与可用数据进行拟合即可。我还想从模型中提取一些统计数据。所以我的问题是,这个函数可以向量化吗?如果可以的话如何向量化?
总而言之,目的是使用“协变量”列作为协变量(最后一列)来拟合“dependent_var”(第一列)与每个预测变量列(单独)的简单线性回归模型。因此,对于每个模型,我想要以下内容:
return(indepent_var = col,
pearson = cor_coef,
pearson_p = p_value,
indepent_var_beta = slope,
indepent_var_beta_sig = slope_sig,
adj_r_squared = adj_r_squared,
indepent_var_beta_se = se_slope,
intercept_se = se_intercept,
n_obs = nrow(dfx_subset)) # dfx subset being the subset used to fit the model in question.
作为示例数据集
library(data.table)
set.seed(123)
num_rows <- 300
num_cols <- 6000
# Random data
data <- matrix(rnorm(num_rows * num_cols), nrow = num_rows)
# Many columns include NAs which need to be handled during fit without imputation
prop_na <- runif(num_cols, min = 0.1, max = 0.5)
for (i in 1:num_cols) {
num_na <- round(prop_na[i] * num_rows)
idx_na <- sample(1:num_rows, num_na)
data[idx_na, i] <- NA
}
# Mock table, dependent variable and binary covariate variable
DT <- as.data.table(cbind(data, covariate = sample(0:1, num_rows, replace = TRUE)))
setnames(DT, old = c("V1"), new = c("dependent_var"))
我将来可能会包含更多协变量,因此希望包含皮尔逊系数以进行一些快速检查/潜在的未来分析。由于我有大量数据需要处理,因此最好在 HPC 上提交一次。到目前为止,这是我的功能:
lm_analysis <- function(col, dfx) {
tryCatch({
dfx_subset <- dfx[complete.cases(dfx[[1]], dfx[[col]], dfx[["covariate"]]), ]
## Compute Pearson correlation
cor_test <- cor.test(dfx_subset[[1]], dfx_subset[[col]])
cor_coef <- cor_test$estimate
p_value <- cor_test$p.value
## Linear regression
lm_result <- lm(dfx_subset[[1]] ~ dfx_subset[[col]] + dfx_subset[["covariate"]], data = dfx_subset)
slope <- lm_result$coefficients[2]
model_summary <- summary(lm_result)
slope_sig <- model_summary$coefficients[2, 4]
adj_r_squared <- model_summary$adj.r.squared
se_slope <- model_summary$coefficients[2, "Std. Error"]
se_intercept <- model_summary$coefficients[1, "Std. Error"]
return(c(indepent_var = col,
pearson = cor_coef,
pearson_p = p_value,
indepent_var_beta = slope,
indepent_var_beta_sig = slope_sig,
adj_r_squared = adj_r_squared,
indepent_var_beta_se = se_slope,
intercept_se = se_intercept,
n_obs = nrow(dfx_subset)))
}, error = function(e){
return(NULL)
})
}
predictor_cols <- setdiff(names(DT), c("dependent_var", "covariate"))
lm_results <- lapply(predictor_cols, lm_analysis, dfx = DT)
results_lmfit <- as.data.table(do.call(rbind, lm_results))
我不想在分析之前过滤任何观察结果。另外,理想情况下,我不想并行化此组件,因为我计划在外循环上执行此操作,我将在外循环上对因变量的 data.table 执行上述操作。
我在如何向量化方面处于一个松散的状态,并且由于不同数据集之间的数据不匹配,我不确定这是否可以轻松实现。任何建议将不胜感激。
您可以融合数据集并应用一个稍微简单的函数,其中返回的对象是一个列表:
lm_analysis_simple <- function(d,i,c) {
cor_test <- cor.test(d,i)
## Linear regression
lm_result <- lm(d ~ i+c, model=FALSE)
model_summary = summary(lm_result)
return(list(
pearson = cor_test$estimate,
pearson_p = cor_test$p.value,
indepent_var_beta = lm_result$coefficients[2],
indepent_var_beta_sig = model_summary$coefficients[2, 4],
adj_r_squared = model_summary$adj.r.squared,
indepent_var_beta_se = model_summary$coefficients[2, "Std. Error"],
intercept_se = model_summary$coefficients[1, "Std. Error"],
n_obs = length(i)
))
}
DT_long = melt(DT,id.vars = c("dependent_var", "covariate"),variable.name = "independ_var")
DT_long[!is.na(value), lm_analysis_simple(dependent_var,value,covariate),independ_var]
输出:
independ_var pearson pearson_p indepent_var_beta indepent_var_beta_sig adj_r_squared indepent_var_beta_se intercept_se n_obs
<fctr> <num> <num> <num> <num> <num> <num> <num> <int>
1: V2 0.01870141 0.83733080 0.006931742 0.94264543 -0.006405378 0.09614583 0.1301829 183
2: V3 -0.04812460 0.55734413 -0.044441539 0.54942591 -0.002966220 0.07406963 0.1113687 229
3: V4 0.08896426 0.27255406 0.094223500 0.23947211 -0.001749838 0.07978495 0.1087159 228
4: V5 -0.14207020 0.07881438 -0.152165312 0.07242134 0.010553751 0.08411006 0.1092392 230
5: V6 -0.01676149 0.85888333 -0.008136059 0.92309369 0.004776748 0.08408907 0.1087452 175
---
5995: V5996 -0.15043800 0.10393970 -0.133171088 0.10896949 0.005849612 0.08243966 0.1128392 184
5996: V5997 0.01159498 0.89068030 0.009851107 0.90131503 -0.011246345 0.07930029 0.1115778 211
5997: V5998 -0.13492708 0.16384292 -0.133010591 0.13078530 0.010763237 0.08733929 0.1177022 181
5998: V5999 -0.10859013 0.17441982 -0.113119773 0.17429042 0.003738834 0.08288311 0.1106717 237
5999: V6000 0.07089022 0.47237635 0.059207306 0.48172533 -0.013942846 0.08384918 0.1323993 162