I am deploying a machine learning model with Python scripts (.py files) as part of an automated server workflow. The core of the training process lives in model_training.py, which contains functions for data preprocessing, model training with Optuna-based hyperparameter optimization, and model evaluation.
The deployment flow is orchestrated through main.py, where I execute the whole pipeline. Everything runs smoothly until the stage where I retrieve best_params for model training. At that stage, however, the script appears to hang indefinitely, similar to what is shown in the attached image (even when I test with n_trials=1 and early_stopping_rounds=1).
Here is model_training.py:
import lightgbm as lgb
from sklearn.metrics import mean_squared_error
import numpy as np
import optuna
from sklearn.model_selection import train_test_split
from optuna.integration import LightGBMPruningCallback
import warnings
warnings.filterwarnings("ignore", message="Found `n_estimators` in params. Will use it instead of argument")
optuna.logging.set_verbosity(optuna.logging.INFO)
seed = 42
np.random.seed(42)
def train_validation_test_split(X, y, test_size=0.2, random_state=seed):
    """
    Split the input data into training and test sets.

    Parameters:
        X (array-like): The input features.
        y (array-like): The target variable.
        test_size (float): The proportion of the dataset to include in the test split.
        random_state (int): Controls the randomness of the training and testing indices.

    Returns:
        X_train (array-like): Training data for input features.
        X_test (array-like): Testing data for input features.
        y_train (array-like): Training data for target variable.
        y_test (array-like): Testing data for target variable.
    """
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    return X_train, X_test, y_train, y_test
def pre_lgb_dataset(X_train, X_test, y_train, y_test, cat_cols):
    """
    Build LightGBM Datasets for the training and testing data.

    Parameters:
        X_train: training data features
        X_test: testing data features
        y_train: training data labels
        y_test: testing data labels
        cat_cols: list of categorical columns

    Returns:
        train_data: LightGBM Dataset for training data
        test_data: LightGBM Dataset for testing data
    """
    train_data = lgb.Dataset(X_train, label=y_train, categorical_feature=cat_cols, free_raw_data=False)
    test_data = lgb.Dataset(X_test, label=y_test, categorical_feature=cat_cols, free_raw_data=False)
    return train_data, test_data
def train_optuna_cv(train_data, n_folds=5, n_trials=1, logging_period=10, early_stopping_rounds=10):
    """
    Trains a LightGBM model using Optuna for hyperparameter optimization with cross-validation.

    Parameters:
        train_data: LightGBM Dataset used for cross-validation.
        n_folds: Number of folds for cross-validation (default is 5).
        n_trials: Number of optimization trials to run (default is 1).
        logging_period: Interval for logging evaluation metrics during training (default is 10).
        early_stopping_rounds: Rounds to trigger early stopping if no improvement (default is 10).

    Returns:
        best_params: Dictionary of the best hyperparameters found by Optuna.
    """
    def objective(trial):
        # Define the hyperparameter search space
        params = {
            'objective': 'regression',
            'metric': 'rmse',
            'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 10.0, log=True),
            'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 10.0, log=True),
            'learning_rate': trial.suggest_float('learning_rate', 1e-3, 5e-1, log=True),
            'num_leaves': trial.suggest_int('num_leaves', 2, 256),
            'feature_fraction': trial.suggest_float('feature_fraction', 0.4, 1.0),
            'bagging_fraction': trial.suggest_float('bagging_fraction', 0.4, 1.0),
            'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
            'num_threads': 4,
            'verbosity': -1  # Suppress internal LightGBM logging
        }
        # Perform cross-validation
        cv_results = lgb.cv(
            params,
            train_data,
            nfold=n_folds,
            stratified=False,  # Stratification is usually not needed for regression
            shuffle=True,      # Shuffle data before splitting
            callbacks=[
                lgb.early_stopping(stopping_rounds=early_stopping_rounds),
                lgb.log_evaluation(period=logging_period),
                LightGBMPruningCallback(trial, 'rmse')
            ],
            seed=42,
        )
        # Get the best score from cross-validation
        best_score = cv_results['valid rmse-mean'][-1]
        return best_score

    # Create an Optuna study and optimize
    study = optuna.create_study(direction='minimize')
    study.optimize(objective, n_trials=n_trials)
    # Return the best found hyperparameters
    best_params = study.best_params
    return best_params
def model_pred(best_params, train_data, val_data):
    """
    Train the LightGBM model with the best hyperparameters,
    using val_data for evaluation during training.

    Args:
        best_params: The best hyperparameters found by Optuna.
        train_data: Training data for the LightGBM model.
        val_data: Validation data for the LightGBM model.

    Returns:
        best_model: The trained LightGBM model.
    """
    # Train the model
    best_model = lgb.train(best_params, train_data, valid_sets=[val_data])
    return best_model
And here is the simplified structure of my workflow in main.py:
from model_training import train_validation_test_split, pre_lgb_dataset, train_optuna_cv, model_pred
import pandas as pd
import numpy as np
import optuna
seed = 42
np.random.seed(42)
def main():
    # Data preparation and feature engineering steps here...

    # Model Training
    X_train, X_test, y_train, y_test = train_validation_test_split(df_features, df_target)
    train_data, test_data = pre_lgb_dataset(X_train, X_test, y_train, y_test, cat_cols)

    # Hyperparameter Optimization
    best_params = train_optuna_cv(train_data, n_trials=1, early_stopping_rounds=1)

    # Model Training with Best Parameters
    best_model = model_pred(best_params, train_data, test_data)

    # Further steps for model evaluation and deployment...

if __name__ == "__main__":
    main()
For debugging, I tried the simplified sample_params below, and it runs without any problems:
sample_params = {'objective': 'regression', 'metric': 'rmse', 'num_leaves': 31, 'learning_rate': 0.05, 'num_threads': 4}
Any insights or suggestions would be greatly appreciated. Thanks!
The problem you are experiencing is likely due to the complexity of the hyperparameter search space and the optimization process. Even with n_trials=1 and early_stopping_rounds=1, Optuna still has to explore the hyperparameter space and run the model at least once, which can be very time-consuming depending on the size of the dataset and the complexity of the model.
Here are some suggestions for resolving or debugging this issue:
Enable detailed logging, so you can see where the process stalls:
import logging
logging.basicConfig(level=logging.INFO)
Simplify the search space: reduce the complexity of the hyperparameter search space. For example, you can cap the number of leaves (num_leaves) or narrow the range of learning_rate.
Use a data subset: try running the optimization on a smaller subset of the data. This can help you determine whether the problem is related to the size of the dataset.
Check system resources: monitor the server's CPU and memory usage during optimization. If the server runs out of resources, the process can appear to hang.
Set a timeout: implement a timeout for the optimization process so it cannot run indefinitely. Optuna supports this via the timeout parameter of the optimize method:
study.optimize(objective, n_trials=n_trials, timeout=600)  # 600 seconds = 10 minutes
Parallelize trials: you can also run trials concurrently via the n_jobs parameter:
study.optimize(objective, n_trials=n_trials, n_jobs=-1)  # Use all available cores
Remember to test these changes in a controlled environment before deploying them to your production server.