我尝试将贝叶斯优化应用于 tidymodels 框架内的二元分类问题 (XGBOOST)。
此外,当我在示例“mtcars”数据集上运行代码时,会出现以下错误,所以我认为我一定错过了一些东西
❯ Generating a set of 10 initial parameter results
→ A | warning: No control observations were detected in `truth` with control level '1'.
There were issues with some computations A: x1
✓ Initialization complete
── Iteration 1 ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
i Current best: roc_auc=0.5 (@iter 0)
i Gaussian process model
! All of the roc_auc values were identical. The Gaussian process model cannot be fit to the data. Try expanding the range of the tuning parameters.
x Gaussian process model: Error in fit_gp(mean_stats %>% dplyr::select(-.iter), pset = param_info, : argument is missing, with no default
Error in `check_gp_failure()`:
! Gaussian process model was not fit.
Run `rlang::last_trace()` to see where the error occurred.
✖ Optimization stopped prematurely; returning current results.
更一般性的后续问题:
我可以将模型规范中的 nthread 参数与 do parallel 参数结合起来以加快计算速度,例如nthread=3 且 cluster = 10 - 让 3 个核心在 1 个进程上工作?
如果我的目标变量稍微不平衡(比例为 1/3),我应该使用哪个指标来优化模型。我认为 F1 (f_meas) 应该是最相关的?
# Load necessary libraries
library(tidymodels)
library(doParallel)
library(xgboost)
# Load example data
data(mtcars)
# Create a binary target
mtcars$am <- as.factor(mtcars$am)
set.seed(123)
df_split <- initial_split(mtcars, strata = am)
df_train <- training(df_split)
df_test <- testing(df_split)
df_train_folds <- vfold_cv(df_train, strata = am)
# /////////////////////////////////////////////////////////////////////////////
#prep
recipe_df <- recipe(am ~ ., data=df_train) %>%
step_zv(all_predictors()) %>%
step_normalize(all_numeric_predictors())
xgb_prep<- prep(recipe_df,verbose=T,)
# Create model
xgb_spec <- boost_tree(
trees = 1000,
tree_depth = tune(),
min_n = tune(),
learn_rate = tune(),
loss_reduction = tune()
) %>%
set_engine("xgboost") %>%
set_mode("classification")
# set ranges for parameters
params <- parameters(
learn_rate(),
tree_depth(),
min_n(),
loss_reduction()
) %>%
update(
learn_rate = learn_rate(c(0.01, 0.3)), # range for the learning rate
tree_depth = tree_depth(c(3, 10)), # range for the tree depth
min_n = min_n(c(1, 10)), # range for the minimum number of observations
loss_reduction = loss_reduction(c(1, 5)) # range for the loss reduction
) %>%
finalize(df_train)
# Merge into workflow
xgb_wf <- workflow() %>%
add_model(xgb_spec) %>%
add_recipe(xgb_prep)
#Parallel Processing
gc()
numCores = 30 #detectCores() 72
cl = parallel::makeCluster(numCores)
doParallel::registerDoParallel(cl)
options(tidymodels.dark = TRUE)
myxgb_res <- tune_bayes(
xgb_wf,
resamples = df_train_folds,
param_info = params,
initial = 10,
iter = 30,
metrics = metric_set(roc_auc),
control = control_bayes(
no_improve = 5,
save_pred = T,
verbose = T,
seed = 123
),
parallel_over = "everything",
)
stopCluster(cl)
doParallel::stopImplicitCluster()
感谢您的建议!