Parallelizing a randomized grid search over sklearn estimators with joblib

Question · votes: 0 · answers: 2

I am trying to run a randomized grid search over an sklearn estimator, but I don't want to cross-validate because I already have a train/validation/test split for my data. I have built a function that runs the randomized grid search, but I would like to parallelize it. I have been looking at joblib and trying to figure out how to adapt the `Parallel(delayed(func))` pattern, but I can't work out how to apply it to my code.

Here is my function:

from random import sample

import pandas as pd
from sklearn.metrics import recall_score
from sklearn.model_selection import ParameterGrid


def randomized_grid_search(model=None, param_grid=None, percent=0.5,
                           X_train=None, y_train=None,
                           X_val=None, y_val=None):
    # converts the parameter grid into a list
    param_list = list(ParameterGrid(param_grid))
    # the number of combinations to try from the grid
    n = int(len(param_list) * percent)
    # the reduced grid as a list
    reduced_grid = sample(param_list, n)
    best_score = 0
    best_grid = None

    # Loops through each of the possible scenarios and
    # scores each model with predictions on the validation set.
    # The best score is kept along with the best parameters.
    for g in reduced_grid:
        model.set_params(**g)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_val)
        recall = recall_score(y_val, y_pred)
        if recall > best_score:
            best_score = recall
            best_grid = g

    # Combines the training and validation datasets and
    # trains the model with the best parameters from the
    # grid search.
    best_model = model
    best_model.set_params(**best_grid)
    X2 = pd.concat([X_train, X_val])
    y2 = pd.concat([y_train, y_val])
    return best_model.fit(X2, y2)

I think this example from https://joblib.readthedocs.io/en/latest/parallel.html is the direction I need:

from math import sqrt
from joblib import Parallel, delayed

with Parallel(n_jobs=2) as parallel:
    accumulator = 0.
    n_iter = 0
    while accumulator < 1000:
        results = parallel(delayed(sqrt)(accumulator + i ** 2)
                           for i in range(5))
        accumulator += sum(results)  # synchronization barrier
        n_iter += 1

Should I be doing something like this, or am I approaching it completely wrong?

python parallel-processing scikit-learn joblib
2 Answers

1 vote

I found some code on GitHub authored by @skylander86, where the author uses:

param_scores = Parallel(n_jobs=self.n_jobs)(delayed(_fit_classifier)(klass, self.classifier_args, param, self.metric, X_train, Y_train, X_validation, Y_validation) for param in ParameterGrid(self.param_grid))

I hope that helps.
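Adapting that pattern to the function in the question might look like the sketch below. The helper `_fit_and_score` and the synthetic data are my own illustration, not from the linked repository; each worker clones the estimator so the parallel fits don't share state:

```python
from random import sample

from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import ParameterGrid, train_test_split


def _fit_and_score(model, params, X_train, y_train, X_val, y_val):
    # Clone so each worker fits its own estimator instance.
    m = clone(model).set_params(**params)
    m.fit(X_train, y_train)
    return recall_score(y_val, m.predict(X_val)), params


def parallel_random_search(model, param_grid, percent,
                           X_train, y_train, X_val, y_val, n_jobs=2):
    param_list = list(ParameterGrid(param_grid))
    reduced_grid = sample(param_list, int(len(param_list) * percent))
    # One delayed fit per parameter combination, evaluated in parallel.
    results = Parallel(n_jobs=n_jobs)(
        delayed(_fit_and_score)(model, p, X_train, y_train, X_val, y_val)
        for p in reduced_grid
    )
    return max(results, key=lambda r: r[0])  # (best_score, best_params)


X, y = make_classification(n_samples=200, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
score, params = parallel_random_search(
    LogisticRegression(max_iter=500), {"C": [0.1, 1.0, 10.0]}, 1.0,
    X_tr, y_tr, X_val, y_val)
print(score, params)
```

The final refit on the combined train+validation data from the question would then be done once, outside the parallel section, using the returned `params`.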


0 votes

Have you tried using the built-in parallelization via the n_jobs parameter?

grid = sklearn.model_selection.GridSearchCV(..., n_jobs=-1)

The GridSearchCV documentation describes the n_jobs parameter as follows:

n_jobs : int or None, optional (default=None). Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors...

So while this won't distribute the work across threads, it will distribute it across processors, giving you some degree of parallelization.
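Since the question avoids cross-validation because of an existing train/validation split, one way to keep that fixed split while still getting GridSearchCV's n_jobs parallelism is to pass a PredefinedSplit as the cv argument. A sketch with synthetic data (the 150/50 split sizes are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit

X, y = make_classification(n_samples=200, random_state=0)
# First 150 rows: training fold (-1 = never used for validation);
# last 50 rows: the single validation fold (fold index 0).
test_fold = np.r_[np.full(150, -1), np.zeros(50, dtype=int)]

grid = GridSearchCV(
    LogisticRegression(max_iter=500),
    {"C": [0.1, 1.0, 10.0]},
    scoring="recall",
    cv=PredefinedSplit(test_fold),
    n_jobs=-1,  # parallelize over parameter settings
)
grid.fit(X, y)
print(grid.best_params_)
```

A side benefit: with the default `refit=True`, the best parameters are refit on all data passed to `fit`, i.e. training plus validation rows, which is exactly what the function in the question does by hand with `pd.concat`.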
