Firth's model gets stuck in R with the logistf package (non-convergence warnings and 99% CPU usage)


This is how I created the dataset. I saved it in both rds and csv format in case any internal characteristics of the data change during saving:

# Save the data frame
saveRDS(train_proc_rds, "train_proc_example1.rds")
fwrite(train_proc_rds, "train_proc_example2.csv")


# reload data frame
train_proc_df1 <- readRDS("train_proc_example1.rds")
train_proc_df2 <- fread("train_proc_example2.csv")
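
One characteristic that can differ between the two copies is the column type of `group`. The sketch below uses a hypothetical toy data frame and base R's `write.csv()`/`read.csv()` rather than `fwrite()`/`fread()`, so it only illustrates the general behaviour: an RDS file keeps factor columns as factors, while a CSV stores plain text and, with default settings in R 4.x, the column comes back as character (`fread()` also defaults to `stringsAsFactors = FALSE`).

```r
# Toy data frame standing in for the real data (hypothetical values).
toy <- data.frame(
  x1 = c(0.1, 0.5, 0.9, 1.3),
  group = factor(c("AD", "control", "AD", "control"))
)

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

saveRDS(toy, rds_path)
write.csv(toy, csv_path, row.names = FALSE)

from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path)

# RDS preserves the factor; the CSV round trip returns a character column.
rds_class <- class(from_rds$group)
csv_class <- class(from_csv$group)
print(c(rds = rds_class, csv = csv_class))

# Restoring the factor with the same levels makes the two copies agree again.
from_csv$group <- factor(from_csv$group, levels = levels(from_rds$group))
```

If the outcome column differs in type between the two reloaded objects, converting it explicitly before calling `train()` removes one possible source of divergent behaviour.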

Here is a link to the two sample data files (they are small, only 23 KB, but they may be enough to trigger the error):

https://drive.google.com/drive/folders/1ZUSQxp9uycL8nRJXBLHUMRADk2KM6eYG?usp=sharing

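Here is the code I use to run the Firth model. It always produces non-convergence warnings that suggest increasing the number of iterations, but doing so does not really help. The strange part is that when I run the code it hangs indefinitely, while Activity Monitor shows 99% CPU usage. The full code is below: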

library(caret)
library(logistf)
library(data.table)


# Define training control
train.control <- trainControl(method = "repeatedcv", 
                              number = 3, repeats = 3,
                              savePredictions = TRUE,
                              classProbs = TRUE)

# Define the custom model function
firth_model <- list(
  type = "Classification",
  library = "logistf",
  loop = NULL,
  parameters = data.frame(parameter = c("none"), class = c("character"), label = c("none")),
  grid = function(x, y, len = NULL, search = "grid") {
    data.frame(none = "none")
  },
  fit = function(x, y, wts, param, lev, last, classProbs, ...) {
    data <- as.data.frame(x)
    data$group <- y
    logistf(group ~ ., data = data, control = logistf.control(maxit = 100), ...)
  },
  predict = function(modelFit, newdata, submodels = NULL) {
    as.factor(ifelse(predict(modelFit, newdata, type = "response") > 0.5, "AD", "control"))
  },
  prob = function(modelFit, newdata, submodels = NULL) {
    preds <- predict(modelFit, newdata, type = "response")
    data.frame(control = 1 - preds, AD = preds)
  }
)

When I run the model on the second dataset (model 2), it reproduces my problem:


# Training the model
set.seed(123)
model1 <- train(train_proc_df1[, .SD, .SDcols = !c("AD", "group")],
                            train_proc_df1$group,
                            method = firth_model,
                            trControl = train.control)
print(model1)

Here is the most recent output:

model2 <- train(train_proc_df2[, .SD, .SDcols = !c("AD", "group")],
+                 train_proc_df2$group,
+                             method = firth_model,
+                             trControl = train.control)
+ Fold1.Rep1: none=none 
Warning in logistf(group ~ ., data = data, control = logistf.control(maxit = 100),  :
  Nonconverged PL confidence limits: maximum number of iterations for variables: (Intercept), x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16, x17, x18, x19, x20, x21, x22, x23, x24 exceeded. Try to increase the number of iterations by passing 'logistpl.control(maxit=...)' to parameter plcontrol
- Fold1.Rep1: none=none 
+ Fold2.Rep1: none=none 
Warning in logistf(group ~ ., data = data, control = logistf.control(maxit = 100),  :
  Nonconverged PL confidence limits: maximum number of iterations for variables: (Intercept), x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16, x17, x18, x19, x20, x21, x22, x23, x24 exceeded. Try to increase the number of iterations by passing 'logistpl.control(maxit=...)' to parameter plcontrol
- Fold2.Rep1: none=none 
+ Fold3.Rep1: none=none 
Warning in logistf(group ~ ., data = data, control = logistf.control(maxit = 100),  :
  Nonconverged PL confidence limits: maximum number of iterations for variables: (Intercept), x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, x14, x15, x16, x17, x18, x19, x20, x21, x22, x23, x24 exceeded. Try to increase the number of iterations by passing 'logistpl.control(maxit=...)' to parameter plcontrol
- Fold3.Rep1: none=none 
+ Fold1.Rep2: none=none 
Warning in logistf(group ~ ., data = data, control = logistf.control(maxit = 100),  :
  logistf.fit: Maximum number of iterations for full model exceeded. Try to increase the number of iterations or alter step size by passing 'logistf.control(maxit=..., maxstep=...)' to parameter control

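Setting the warnings aside, the model appears to stay stuck on the third fold forever (and it does seem to be running, since Activity Monitor shows 99% CPU usage). The same problem seems to occur when I run model 1.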



R.version
               _                           
platform       aarch64-apple-darwin20      
arch           aarch64                     
os             darwin20                    
system         aarch64, darwin20           
status                                     
major          4                           
minor          3.2                         
year           2023                        
month          10                          
day            31                          
svn rev        85441                       
language       R                           
version.string R version 4.3.2 (2023-10-31)
nickname       Eye Holes   

Here are my package versions:

> packageVersion("caret")
[1] ‘6.0.94’
> packageVersion("logistf")
[1] ‘1.26.0’
> packageVersion("data.table")
[1] ‘1.14.10’
r machine-learning logistic-regression sampling convergence
1 Answer

If you increase the number of iterations as the warning message suggests, you should get the desired result.

train_proc <- data.table::fread("train_proc.csv")
names(train_proc)[25] <- "group"  # missing from your code

set.seed(123)
firth.logist.model <- train(train_proc[, .SD, .SDcols = !"group"],
                            train_proc$group,
                            method = firth_model,
                            trControl = train.control,
                            plcontrol=logistpl.control(maxit=1000))

print(firth.logist.model)

53 samples
24 predictors
 2 classes: 'AD', 'control' 

No pre-processing
Resampling: Cross-Validated (3 fold, repeated 3 times) 
Summary of sample sizes: 35, 36, 35, 36, 35, 35, ... 
Resampling results:

  Accuracy   Kappa     
  0.4270153  -0.1482292

Tuning parameter 'none' was held constant at a value of none
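
The last warning in the question comes from the full-model fit (`logistf.control(maxit = ...)`), which `plcontrol` does not affect, and the `fit` element in the question hard-codes `maxit = 100` there. The sketch below is a hypothetical variant of that `fit` function, not something from the original post: it raises the full-model iteration cap and sets `pl = FALSE`, which asks `logistf()` for Wald intervals instead of profile-likelihood ones. Since cross-validated prediction does not use the confidence limits, this should avoid the per-variable profile-likelihood iterations. The name `firth_fit_wald` and the default `maxit = 1000` are assumptions chosen for illustration.

```r
# Hypothetical replacement for the `fit` element of the question's model list.
# The logistf:: calls are resolved only when the function runs, so defining
# the function does not require the package to be attached.
firth_fit_wald <- function(x, y, wts, param, lev, last, classProbs,
                           maxit = 1000, ...) {
  data <- as.data.frame(x)
  data$group <- y
  logistf::logistf(
    group ~ .,
    data = data,
    pl = FALSE,                                          # Wald intervals, no PL iterations
    control = logistf::logistf.control(maxit = maxit),   # full-model iteration cap
    ...
  )
}

# Swap it into a copy of the model list from the question, if that list exists.
if (exists("firth_model")) {
  firth_model_wald <- firth_model
  firth_model_wald$fit <- firth_fit_wald
}
```

With this variant, `method = firth_model_wald` can be passed to `train()` in place of `firth_model`. Whether the full model then converges still depends on the data, so the warnings are worth checking after the run.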