Hyperparameter tuning and model run get stuck when running a Python .py file

Problem description — votes: 0, answers: 1

I am deploying a machine learning model with Python scripts (.py files) in an automated server workflow. The core of the model training process lives in model_training.py, which contains functions for data preprocessing, model training with Optuna-based hyperparameter optimization, and model evaluation.

The deployment flow is orchestrated through main.py, where I execute the whole pipeline. Everything runs smoothly until the stage where I retrieve best_params for model training. At the best_params stage, however, the script appears to hang indefinitely, similar to what is shown in the attached image (even when I test with n_trials=1 and early_stopping_rounds=1).

Here is model_training.py:

import lightgbm as lgb
from sklearn.metrics import mean_squared_error
import numpy as np
import optuna
from sklearn.model_selection import train_test_split
from optuna.integration import LightGBMPruningCallback
import warnings
warnings.filterwarnings("ignore", message="Found `n_estimators` in params. Will use it instead of argument")
optuna.logging.set_verbosity(optuna.logging.INFO)

seed = 42
np.random.seed(42)

def train_validation_test_split(X, y, test_size=0.2, random_state=seed):
    """
    A function to split input data into training, validation, and test sets.
    
    Parameters:
        X (array-like): The input features.
        y (array-like): The target variable.
        test_size (float): The proportion of the dataset to include in the test split.
        random_state (int): Controls the randomness of the training and testing indices.
    
    Returns:
        X_train (array-like): Training data for input features.
        X_test (array-like): Testing data for input features.
        y_train (array-like): Training data for target variable.
        y_test (array-like): Testing data for target variable.
    """
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    return X_train, X_test, y_train, y_test

def pre_lgb_dataset(X_train, X_test, y_train, y_test, cat_cols):
    """
    Generate a LightGBM Dataset for training, validation, and testing data.

    Parameters:
        - X_train: training data features
        - X_test: testing data features
        - y_train: training data labels
        - y_test: testing data labels
        - cat_cols: list of categorical columns
        - type: a string indicating the type of dataset

    Returns:
        - train_data: LightGBM Dataset for training data
        - val_data: LightGBM Dataset for validation data
        - test_data: LightGBM Dataset for testing data
    """

    train_data = lgb.Dataset(X_train, label=y_train, categorical_feature=cat_cols, free_raw_data=False)
    test_data = lgb.Dataset(X_test, label=y_test, categorical_feature=cat_cols, free_raw_data=False)
    return train_data, test_data

def train_optuna_cv(train_data, n_folds=5, n_trials=1, logging_period=10, early_stopping_rounds=10):
    """
    Trains a LightGBM model using Optuna for hyperparameter optimization with cross-validation.

    Parameters:
        - train_data: LightGBM Dataset used for cross-validation.
        - n_folds: Number of folds for cross-validation (default is 5).
        - n_trials: Number of optimization trials to run (default is 1).
        - logging_period: Interval for logging evaluation metrics during training (default is 10).
        - early_stopping_rounds: Rounds to trigger early stopping if no improvement (default is 10).

    Returns:
        - best_params: Dictionary of the best hyperparameters found by Optuna.
    """

    def objective(trial):
        # Define the hyperparameter search space
        params = {
            'objective': 'regression',
            'metric': 'rmse',
            'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 10.0, log=True),
            'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 10.0, log=True),
            'learning_rate': trial.suggest_float('learning_rate', 1e-3, 5e-1, log=True),
            'num_leaves': trial.suggest_int('num_leaves', 2, 256),
            'feature_fraction': trial.suggest_float('feature_fraction', 0.4, 1.0),
            'bagging_fraction': trial.suggest_float('bagging_fraction', 0.4, 1.0),
            'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
            'num_threads': 4,
            'verbosity': -1  # Suppress internal LightGBM logging
        }


        # Perform cross-validation
        cv_results = lgb.cv(
            params,
            train_data,
            nfold=n_folds,
            stratified=False,  # Usually, stratification is not needed for regression
            shuffle=True,  # Shuffle data before splitting
            callbacks=[
                lgb.early_stopping(stopping_rounds=early_stopping_rounds),
                lgb.log_evaluation(period=logging_period),
                LightGBMPruningCallback(trial, 'rmse')
            ],
            seed=42,
        )
        # Get the best score from cross-validation
        best_score = cv_results['valid rmse-mean'][-1]

        return best_score
    # Create an Optuna study and optimize
    study = optuna.create_study(direction='minimize')
    study.optimize(objective, n_trials=n_trials)

    # Return the best found hyperparameters
    best_params = study.best_params
    return best_params

def model_pred(best_params, train_data, val_data):
    """
    Train the LightGBM model with the best hyperparameters,
    evaluating on the provided validation set during training.

    Args:
        best_params: The best hyperparameters found by Optuna.
        train_data: Training data for the LightGBM model.
        val_data: Validation data passed to valid_sets during training.

    Returns:
        best_model: The trained LightGBM model.
    """

    # Train the model
    best_model = lgb.train(best_params, train_data, valid_sets=[val_data])

    return best_model


Here is the simplified structure of my workflow in main.py:

from model_training import train_validation_test_split, pre_lgb_dataset, train_optuna_cv, model_pred
import pandas as pd
import numpy as np
import optuna

seed = 42
np.random.seed(42)

def main():
    # Data preparation and feature engineering steps here...

    # Model Training
    X_train, X_test, y_train, y_test = train_validation_test_split(df_features, df_target)
    train_data, test_data = pre_lgb_dataset(X_train, X_test, y_train, y_test, cat_cols)

    # Hyperparameter Optimization
    best_params = train_optuna_cv(train_data, n_trials=1, early_stopping_rounds=1)

    # Model Training with Best Parameters
    best_model = model_pred(best_params, train_data, test_data)

    # Further steps for model evaluation and deployment...

if __name__ == "__main__":
    main()

For debugging, I tried the simplified sample_params below, and it runs without any issues:

sample_params = { 'objective': 'regression', 'metric': 'rmse', 'num_leaves': 31, 'learning_rate': 0.05, 'num_threads': 4 } 

  • What could be causing the script to get stuck at the best_params step, even though the simpler configuration runs fine?
  • Any suggestions on how to further troubleshoot or debug this in an automated deployment environment?

Any insights or suggestions would be greatly appreciated. Thanks!

python machine-learning deployment model optuna
1 Answer (score: 0)

The issue you are encountering is likely due to the complexity of the hyperparameter search space and the optimization process. Even with n_trials=1 and early_stopping_rounds=1, Optuna still needs to explore the hyperparameter space and run the model at least once, which can be time-consuming depending on the size of the dataset and the complexity of the model.

Here are some suggestions on how to troubleshoot or debug the issue:

  1. Logging: Add logging statements to your code to track the progress of the optimization process. This can help you pinpoint where the process gets stuck.
import logging
logging.basicConfig(level=logging.INFO)
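Beyond basicConfig, a small stage-timer helper makes it obvious which pipeline step never returns. This is a minimal sketch (the `timed_stage` name is made up for illustration, not part of any library):

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("pipeline")

@contextmanager
def timed_stage(name):
    """Log entry and exit of a pipeline stage; the last 'started' line
    without a matching 'finished' line shows where the script is stuck."""
    log.info("%s started", name)
    t0 = time.perf_counter()
    try:
        yield
    finally:
        log.info("%s finished in %.1f s", name, time.perf_counter() - t0)

# Possible usage in main.py:
# with timed_stage("train_optuna_cv"):
#     best_params = train_optuna_cv(train_data, n_trials=1)
```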
  2. Simplify the search space: Reduce the complexity of the hyperparameter search space. For example, you could cap the number of leaves (num_leaves) or narrow the range of learning_rate.
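As a concrete starting point, you can bisect the problem by replacing the full search space with a fixed, mid-range parameter set. The values below are assumptions chosen for debugging, not tuned results:

```python
# Fixed debug parameters: if lgb.cv hangs even with these, the cause is
# not an extreme hyperparameter combination but something else (data size,
# threading, or the pruning callback).
debug_params = {
    'objective': 'regression',
    'metric': 'rmse',
    'num_leaves': 31,       # mid-range instead of suggest_int(2, 256)
    'learning_rate': 0.1,   # narrowed from suggest_float(1e-3, 0.5)
    'lambda_l1': 0.0,
    'lambda_l2': 0.0,
    'num_threads': 1,       # a single thread also rules out thread contention
    'verbosity': -1,
}
```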

  3. Use a data subset: Try running the optimization process on a smaller subset of your data. This can help you determine whether the problem is related to the size of the dataset.
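A minimal sketch of row subsampling with plain Python lists; with a pandas DataFrame you would use `df.sample(frac=0.1, random_state=42)` instead:

```python
import random

random.seed(42)

def sample_rows(X, y, frac=0.1):
    """Return a random fraction of (X, y) rows for a quick smoke test."""
    n = len(X)
    k = max(1, int(n * frac))
    idx = random.sample(range(n), k)  # sample row indices without replacement
    return [X[i] for i in idx], [y[i] for i in idx]

# Example: shrink a 100-row dataset to 10 rows before calling train_optuna_cv.
X = [[float(i)] for i in range(100)]
y = [float(i) for i in range(100)]
X_small, y_small = sample_rows(X, y, frac=0.1)
```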

  4. Check system resources: Monitor the server's CPU and memory usage during the optimization process. If your server is running out of resources, the process may hang.
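On a Unix server, a tiny watchdog thread built from the standard library can log peak memory periodically, so a hang still leaves a trace in the logs. This is a sketch, not a monitoring solution; note that `ru_maxrss` is kilobytes on Linux but bytes on macOS:

```python
import sys
import time
import resource
import threading

def start_memory_watchdog(interval_s=5.0):
    """Print the process's peak RSS every interval_s seconds until the
    returned event is set. Runs as a daemon thread, so it never blocks
    interpreter shutdown."""
    stop = threading.Event()

    def loop():
        while not stop.is_set():
            peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            print(f"[watchdog] peak RSS: {peak}", file=sys.stderr)
            stop.wait(interval_s)

    threading.Thread(target=loop, daemon=True).start()
    return stop

# Possible usage: stop = start_memory_watchdog(); run_pipeline(); stop.set()
```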

  5. Timeout: Implement a timeout for the optimization process so it cannot run indefinitely. Optuna supports setting a timeout via the timeout parameter of the optimize method.
study.optimize(objective, n_trials=n_trials, timeout=600)  # 600 seconds = 10 minutes
  6. Parallelization: If your server has multiple cores, you can use Optuna's parallelization feature to speed up the optimization process.
study.optimize(objective, n_trials=n_trials, n_jobs=-1)  # Use all available cores

Remember to test these changes in a controlled environment before deploying them to your production server.
