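I created a logistic regression model using the mlr3 package in R. I extracted the residuals from the model, but I cannot figure out how they are calculated: they do not correspond to any residual calculation I know of.
Say I create a logistic regression model with the mlr3 package: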
library(mlr3)
library(mlr3learners) # provides the "classif.log_reg" learner
library(tidyverse)
#create dummy data
data <- data.frame(
predictor = c(rnorm(50, mean = 0), rnorm(50, mean = 1)),
dependant = as.factor(c(rep(0,50), rep(1, 50)))
)
#define and train a logistic regression model
classifier_log_reg <- mlr_learners$get("classif.log_reg")
task <- mlr3::TaskClassif$new(id = "my_data",
backend = data,
target = "dependant", # target variable
positive = "1")
classifier_log_reg$train(task, row_ids = 1:100)
I can get the residuals from the model as follows:
residuals <- classifier_log_reg$model$residuals
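My question is: how are these residuals calculated? I cannot reproduce them by hand. They do not match the numbers I get when I calculate Pearson residuals or deviance residuals with the following functions: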
pearson_residuals <- function(p, actual) {
# Standard deviation of the predicted binomial distribution
std_dev <- sqrt(p * (1 - p))
# Avoid division by zero in case of p values being 0 or 1
std_dev[std_dev == 0] <- .Machine$double.eps
# Calculate the Pearson residuals
residuals <- (actual - p) / std_dev
return(residuals)
}
deviance_residuals <- function(p, actual) {
# Ensure p is within valid range to avoid log(0) issues
p <- ifelse(p == 0, .Machine$double.eps, ifelse(p == 1, 1 - .Machine$double.eps, p))
# Calculate the deviance residuals
residuals <- sign(actual - p) * sqrt(-2 * (actual * log(p) + (1 - actual) * log(1 - p)))
return(residuals)
}
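As a sanity check on the two formulas above, base R's residuals() method for glm objects accepts type = "pearson" and type = "deviance". The sketch below is not part of the original post: it fits a plain glm on simulated data of the same shape (the seed and the names fit, pearson_manual and deviance_manual are assumptions made for illustration) and compares the hand-computed values with the built-in ones.

```r
# Simulated data with the same structure as in the question
set.seed(1)  # seed chosen arbitrarily so the sketch is reproducible
data <- data.frame(
  predictor = c(rnorm(50, mean = 0), rnorm(50, mean = 1)),
  dependant = as.factor(c(rep(0, 50), rep(1, 50)))
)

# Plain logistic regression; fitted() returns P(dependant == "1")
fit <- glm(dependant ~ predictor, family = "binomial", data = data)
p <- unname(fitted(fit))
actual <- as.numeric(as.character(data$dependant))

# Hand-computed residuals, same formulas as the helper functions above
pearson_manual <- (actual - p) / sqrt(p * (1 - p))
deviance_manual <- sign(actual - p) *
  sqrt(-2 * (actual * log(p) + (1 - actual) * log(1 - p)))

# Compare against the built-in residual types
pearson_ok <- isTRUE(all.equal(
  pearson_manual, unname(residuals(fit, type = "pearson"))
))
deviance_ok <- isTRUE(all.equal(
  deviance_manual, unname(residuals(fit, type = "deviance"))
))
print(c(pearson = pearson_ok, deviance = deviance_ok))
```

If both comparisons hold, the helper functions themselves are fine, and the mismatch comes from the `$residuals` element holding a different type of residual.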
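Strangely, I found that the residuals in classifier_log_reg$model$residuals seem to be systematically related to residuals I can calculate by hand, namely the simple difference between the actual and predicted values of the dependent variable. Note that I have adjusted both the manually calculated residuals and the residuals from the model object to best illustrate the apparent sigmoid relationship: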
#get residuals directly from the model object
residuals <- classifier_log_reg$model$residuals
#####calculate residuals manually
#specify that predictions should be continuous
classifier_log_reg$predict_type <- 'prob'
#get the predictions
predictions <- classifier_log_reg$predict(task, row_ids = 1:100)
#isolate the vector containing the predictions
predictions <- predictions$data$prob %>% as.data.frame() %>% pull(1)
#subtract predictions from actual values of dependant variable
actual <- data$dependant %>% as.character() %>% as.numeric()
my_resid <- actual - predictions
#put the residuals from the model, the manually calculated residuals
#and the actual values into a dataframe.
#I have adjusted them a bit to illustrate the (apparent) sigmoid relationship that
#emerges after these adjustments.
df <- data.frame(
x = residuals - (actual * 2) + 1,
y = (my_resid + 1) / 2,
actual = actual
)
#plot the relationship between the manually calculated residuals (with adjustment)
#and the residuals straight from the model (with adjustment).
#The curve is completely smooth, but I cannot find the function linking x to y
ggplot(df) + geom_point(aes(x = x, y = y))
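As you can see, there appears to be a sigmoid relationship here. However, when I tried to use the nls function to get the parameters of the best-fitting sigmoid curve linking x and y, it did not fit at all! Here is what I tried. I have not pasted the resulting plot, but suffice it to say that it does not show a straight line (which is what I would expect if the relationship between x and y really were sigmoid):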
sigmoid <- function(x, L, k, x0) {
L / (1 + exp(-k * (x - x0)))
}
model <- nls(y ~ sigmoid(x, L, k, x0),
data = df,
start = list(L = 1, k = 1, x0 = 1),
control = nls.control(maxiter = 100))
df$fitted <- predict(model, df)
ggplot(df) + geom_point(aes(x = fitted, y = y))
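So what is the relationship between x and y here? And more importantly, how are the residuals of the mlr3 logistic regression model calculated under the hood?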
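They are calculated the same way the glm command calculates them. The residuals element of a glm fit holds the working residuals from the final IWLS iteration: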
glm1 <- glm(dependant~predictor, family="binomial", data=data)
identical(residuals, glm1$residuals)
# [1] TRUE
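To make that concrete, here is a minimal sketch (not from the original answer) that reproduces the values by hand. It assumes the standard working-residual formula for a binomial model with logit link, (actual - p) / (p * (1 - p)), where p is the fitted probability of class "1"; the seed and the names working_manual and y_from_x are chosen for illustration. Under the same assumption, a little algebra on the question's adjusted x and y suggests the curve in the plot is a piecewise rational function rather than a sigmoid, which would explain why nls could not fit it.

```r
# Simulated data with the same structure as in the question
set.seed(1)  # seed chosen arbitrarily so the sketch is reproducible
data <- data.frame(
  predictor = c(rnorm(50, mean = 0), rnorm(50, mean = 1)),
  dependant = as.factor(c(rep(0, 50), rep(1, 50)))
)
glm1 <- glm(dependant ~ predictor, family = "binomial", data = data)

p <- unname(fitted(glm1))                        # P(dependant == "1")
actual <- as.numeric(as.character(data$dependant))

# Working residuals: response residual divided by d(mu)/d(eta) = p * (1 - p)
working_manual <- (actual - p) / (p * (1 - p))
same_as_slot <- isTRUE(all.equal(working_manual, unname(glm1$residuals)))
same_as_type <- isTRUE(all.equal(
  unname(residuals(glm1, type = "working")), unname(glm1$residuals)
))

# The adjusted quantities from the question
x <- working_manual - (actual * 2) + 1
y <- ((actual - p) + 1) / 2

# Substituting the working-residual formula:
#   actual == 1: x = 1/p - 1        and y = 1 - p/2      =>  y = 1 - 1 / (2 * (1 + x))
#   actual == 0: x = -p / (1 - p)   and y = (1 - p) / 2  =>  y = 1 / (2 * (1 - x))
y_from_x <- ifelse(actual == 1, 1 - 1 / (2 * (1 + x)), 1 / (2 * (1 - x)))
relation_ok <- isTRUE(all.equal(y, y_from_x))

print(c(slot = same_as_slot, type = same_as_type, relation = relation_ok))
```

The two branches meet at x = 0, y = 0.5, which is why the plotted points look like a single smooth S-shaped curve. To get Pearson, deviance or response residuals instead, call residuals() on the model object with the corresponding type argument rather than reading the residuals element directly.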