Here is how I created the data set. I saved it in both rds and csv format in case any internal data attributes changed during saving:
# Save the data frame
saveRDS(train_proc_rds, "train_proc_example1.rds")
fwrite(train_proc_rds, "train_proc_example2.csv")
# reload data frame
train_proc_df1 <- readRDS("train_proc_example1.rds")
train_proc_df2 <- fread("train_proc_example2.csv")
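As a quick sanity check that nothing changed between the two formats, a minimal comparison sketch (assuming both objects were just reloaded as above) could be:

# Compare the two reloaded copies after coercing both to plain data frames
all.equal(as.data.frame(train_proc_df1), as.data.frame(train_proc_df2))
# Column classes can silently change through a csv round trip (e.g. factor -> character)
sapply(train_proc_df1, class)
sapply(train_proc_df2, class)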
Here are the links to the two example data sets (the example data are small - only 23 kb - but they still trigger the error):
https://drive.google.com/drive/folders/1ZUSQxp9uycL8nRJXBLHUMRADk2KM6eYG?usp=sharing
Here is my code for running the Firth model. It always throws non-convergence warnings that suggest increasing the number of iterations, but doing so does not really help. Strangely, when I run the code it hangs forever, even though Activity Monitor shows CPU usage at 99%. The full code is below:
library(caret)
library(logistf)
library(data.table)
# Define training control
train.control <- trainControl(method = "repeatedcv",
                              number = 3, repeats = 3,
                              savePredictions = TRUE,
                              classProbs = TRUE)
# Define the custom model function
firth_model <- list(
  type = "Classification",
  library = "logistf",
  loop = NULL,
  parameters = data.frame(parameter = c("none"), class = c("character"), label = c("none")),
  grid = function(x, y, len = NULL, search = "grid") {
    data.frame(none = "none")
  },
  fit = function(x, y, wts, param, lev, last, classProbs, ...) {
    data <- as.data.frame(x)
    data$group <- y
    logistf(group ~ ., data = data, control = logistf.control(maxit = 100), ...)
  },
  predict = function(modelFit, newdata, submodels = NULL) {
    as.factor(ifelse(predict(modelFit, newdata, type = "response") > 0.5, "AD", "control"))
  },
  prob = function(modelFit, newdata, submodels = NULL) {
    preds <- predict(modelFit, newdata, type = "response")
    data.frame(control = 1 - preds, AD = preds)
  }
)
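Before debugging the caret wrapper, it may help to fit logistf once on the full data with larger iteration limits, to see whether the convergence trouble lives in the model itself or in the resampling. This is only a diagnostic sketch; it assumes train_proc_df1 is loaded as above, that its outcome column is group, and that the extra AD column is dropped exactly as in the train() calls below:

# Diagnostic sketch: one direct logistf fit with raised iteration caps
library(logistf)
dat <- as.data.frame(train_proc_df1[, .SD, .SDcols = !"AD"])
dat$group <- factor(dat$group)
fit_check <- logistf(group ~ ., data = dat,
                     control   = logistf.control(maxit = 1000),   # iterations for the penalized-likelihood fit
                     plcontrol = logistpl.control(maxit = 1000))  # iterations for the profile-likelihood CIs
summary(fit_check)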
Here is how I train the models; running model 2 with the second data set reproduces my problem:
# Training the model
set.seed(123)
model1 <- train(train_proc_df1[, .SD, .SDcols = !c("AD", "group")],
                train_proc_df1$group,
                method = firth_model,
                trControl = train.control)
print(model1)
Here is the most recent error:
model2 <- train(train_proc_df2[, .SD, .SDcols = !c("AD", "group")],
                train_proc_df2$group,
                method = firth_model,
                trControl = train.control)
+ Fold1.Rep1: none=none
Warning in logistf(group ~ ., data = data, control = logistf.control(maxit = 100), :
Nonconverged PL confidence limits: maximum number of iterations for variables: (Intercept), x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16, x17, x18, x19, x20, x21, x22, x23, x24 exceeded. Try to increase the number of iterations by passing 'logistpl.control(maxit=...)' to parameter plcontrol
- Fold1.Rep1: none=none
+ Fold2.Rep1: none=none
Warning in logistf(group ~ ., data = data, control = logistf.control(maxit = 100), :
Nonconverged PL confidence limits: maximum number of iterations for variables: (Intercept), x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16, x17, x18, x19, x20, x21, x22, x23, x24 exceeded. Try to increase the number of iterations by passing 'logistpl.control(maxit=...)' to parameter plcontrol
- Fold2.Rep1: none=none
+ Fold3.Rep1: none=none
Warning in logistf(group ~ ., data = data, control = logistf.control(maxit = 100), :
Nonconverged PL confidence limits: maximum number of iterations for variables: (Intercept), x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16, x17, x18, x19, x20, x21, x22, x23, x24 exceeded. Try to increase the number of iterations by passing 'logistpl.control(maxit=...)' to parameter plcontrol
- Fold3.Rep1: none=none
+ Fold1.Rep2: none=none
Warning in logistf(group ~ ., data = data, control = logistf.control(maxit = 100), :
logistf.fit: Maximum number of iterations for full model exceeded. Try to increase the number of iterations or alter step size by passing 'logistf.control(maxit=..., maxstep=...)' to parameter control
Setting the warnings aside, the model seems to get stuck on the third fold forever (and it does appear to be running, since Activity Monitor shows 99% CPU usage). The same problem occurs when I run model 1.
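One quick check, offered only as a sketch: extra arguments given to caret::train() are forwarded to the custom fit function's ... and from there to logistf(), so passing pl = FALSE (a logistf argument) skips the profile-likelihood confidence limits that the warnings refer to and shows whether that step is what hangs:

# Sketch: pl = FALSE is forwarded through train()'s ... to logistf(),
# disabling the profile-likelihood confidence limits mentioned in the warnings
set.seed(123)
model1_nopl <- train(train_proc_df1[, .SD, .SDcols = !c("AD", "group")],
                     train_proc_df1$group,
                     method = firth_model,
                     trControl = train.control,
                     pl = FALSE)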
R.version
               _
platform       aarch64-apple-darwin20
arch           aarch64
os             darwin20
system         aarch64, darwin20
status
major          4
minor          3.2
year           2023
month          10
day            31
svn rev        85441
language       R
version.string R version 4.3.2 (2023-10-31)
nickname       Eye Holes
And these are my package versions:
> packageVersion("caret")
[1] ‘6.0.94’
> packageVersion("logistf")
[1] ‘1.26.0’
> packageVersion("data.table")
[1] ‘1.14.10’
If you increase the number of iterations as the warning message suggests, you should get the result you are after:
train_proc <- data.table::fread("train_proc.csv")
names(train_proc)[25] <- "group"   # missing from your code
set.seed(123)
firth.logist.model <- train(train_proc[, .SD, .SDcols = !"group"],
                            train_proc$group,
                            method = firth_model,
                            trControl = train.control,
                            plcontrol = logistpl.control(maxit = 1000))
print(firth.logist.model)
53 samples
24 predictors
2 classes: 'AD', 'control'
No pre-processing
Resampling: Cross-Validated (3 fold, repeated 3 times)
Summary of sample sizes: 35, 36, 35, 36, 35, 35, ...
Resampling results:
  Accuracy   Kappa
  0.4270153  -0.1482292
Tuning parameter 'none' was held constant at a value of none
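Your transcript also shows a second warning, from logistf.fit, about the full model exceeding its iteration limit, and that limit is hard-coded inside your custom fit function (logistf.control(maxit = 100)). If that warning persists, one option is to raise it there as well; a sketch, assuming 1000 iterations is enough for this data:

# Sketch: raise the iteration cap for the model fit itself, which the
# 'maximum number of iterations for full model exceeded' warning points at
firth_model$fit <- function(x, y, wts, param, lev, last, classProbs, ...) {
  data <- as.data.frame(x)
  data$group <- y
  logistf(group ~ ., data = data, control = logistf.control(maxit = 1000), ...)
}

plcontrol = logistpl.control(maxit = 1000) can still be passed to train() exactly as above.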