How does sklearn.model_selection.RandomizedSearchCV work?

Question

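I am building a binary classifier with imbalanced classes (ratio 1:10). I tried KNN, RF and XGB classifiers. I got the best precision-recall tradeoff and F1 score from the XGB classifier (possibly because the dataset is very small: its shape is (1900, 19)).

So, after checking the error plots for XGB, I decided to use RandomizedSearchCV() from sklearn to tune the parameters of my XGB classifier. Based on another answer on Stack Exchange, this is my code: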

import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from xgboost import XGBClassifier

# X, y, X_train and y_train are pandas objects defined elsewhere;
# X_train / y_train are the training split, with a separate test set held out.
score_arr = []
clf_xgb = XGBClassifier(objective='binary:logistic')
param_dist = {'n_estimators': [50, 120, 180, 240, 400],
              'learning_rate': [0.01, 0.03, 0.05],
              'subsample': [0.5, 0.7],
              'max_depth': [3, 4, 5],
              'min_child_weight': [1, 2, 3],
              'scale_pos_weight': [9]
              }
# Randomized search: 25 sampled settings, ranked by precision
clf = RandomizedSearchCV(clf_xgb, param_distributions=param_dist, n_iter=25,
                         scoring='precision', error_score=0, verbose=3, n_jobs=-1)
print(clf)
numFolds = 6
folds = StratifiedKFold(n_splits=numFolds, shuffle=True)

estimators = []
results = np.zeros(len(X))
score = 0.0
for train_index, test_index in folds.split(X_train, y_train):
    print(train_index)
    print(test_index)
    # NOTE: the indices are generated from X_train / y_train but are applied
    # positionally to X / y, so they select the first len(X_train) rows of X.
    _X_train, _X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    _y_train, _y_test = y.iloc[train_index].values.ravel(), y.iloc[test_index].values.ravel()
    # Each fit runs the full randomized search on this fold's training part
    clf.fit(_X_train, _y_train, eval_metric="error", verbose=True)

    estimators.append(clf.best_estimator_)
    results[test_index] = clf.predict(_X_test)
    score_arr.append(f1_score(_y_test, results[test_index]))
    score += f1_score(_y_test, results[test_index])
score /= numFolds  # mean F1 over the outer folds

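So RandomizedSearchCV actually selects the classifier, which is then fitted in each k-fold iteration and used to predict on the validation set. Note that I pass X_train and y_train to the k-fold split, so I have a separate test dataset for testing the final algorithm.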
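The structure of the loop above can be sketched in a smaller, self-contained form. Everything in this sketch is an assumption made for illustration: the data is synthetic (make_classification with roughly 10% positives), DecisionTreeClassifier stands in for XGBClassifier, and the parameter grid, n_iter and fold counts are reduced so it runs quickly. It is meant to show that each call to fit() runs its own inner cross-validation and selects by the search's scoring metric (precision here), while the outer loop reports F1, so the two numbers measure different things.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the data in the question (assumption): ~10% positives, 19 features.
X, y = make_classification(n_samples=600, n_features=19,
                           weights=[0.9, 0.1], random_state=0)

# Small illustrative grid (assumption, not the grid from the question).
param_dist = {"max_depth": [2, 3, 4, 5], "min_samples_leaf": [1, 5, 10]}
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=5,
    scoring="precision",  # the inner search ranks settings by precision
    cv=3,                 # inner cross-validation run inside every fit()
    random_state=0,
)

outer = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
inner_best_precision, outer_f1 = [], []
for train_idx, val_idx in outer.split(X, y):
    # fit() evaluates 5 sampled settings with 3-fold CV on the outer-training part,
    # then refits the best setting on all of that part (refit=True by default).
    search.fit(X[train_idx], y[train_idx])
    inner_best_precision.append(search.best_score_)
    # predict() uses the refitted best estimator on the held-out outer fold.
    outer_f1.append(f1_score(y[val_idx], search.predict(X[val_idx])))

print("inner best precision per outer fold:", inner_best_precision)
print("outer F1 per fold:", outer_f1)
```

After the loop, `search.best_estimator_` holds only the model selected in the last outer fold; the models from earlier folds are overwritten unless they are saved explicitly.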

Now, the problem is that the f1-score in each k-fold iteration looks like this: score_arr = [0.5416666666666667, 0.4, 0.41379310344827586, 0.5, 0.44, 0.43478260869565216].

But when I test clf.best_estimator_ as my model on my test dataset, it gives an f1-score of 0.80, with precision and recall of {'precision': 0.8688524590163934, 'recall': 0.7571428571428571}.
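To put the two sets of numbers side by side, here is a quick arithmetic check that uses only the values quoted above. It averages the per-fold F1 scores and recomputes F1 from the reported test precision and recall as their harmonic mean; the variable names are illustrative.

```python
from statistics import mean

# Per-fold validation F1 scores quoted in the question.
score_arr = [0.5416666666666667, 0.4, 0.41379310344827586,
             0.5, 0.44, 0.43478260869565216]
cv_f1 = mean(score_arr)  # about 0.455

# Test-set precision and recall quoted in the question.
precision = 0.8688524590163934
recall = 0.7571428571428571
# F1 is the harmonic mean of precision and recall.
test_f1 = 2 * precision * recall / (precision + recall)  # about 0.809

print(f"mean validation F1: {cv_f1:.3f}")
print(f"test F1 from precision/recall: {test_f1:.3f}")
print(f"gap: {test_f1 - cv_f1:.3f}")
```

The recomputed test F1 agrees with the reported 0.80, so the gap of roughly 0.35 lies between validation and test performance, not in how F1 was calculated.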

Why are my validation scores so low, and what is happening on the test set? Is my model correct, or am I missing something?

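P.S. Taking the parameters of clf.best_estimator_, I fitted them separately on my training data using xgb.cv, and the f1-score was again close to 0.55. I think this may be due to differences between the training approaches of RandomizedSearchCV and xgb.cv. Please tell me if plots or more information are needed.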

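Update: I am attaching the train and test error plots of aucpr and classification accuracy for the resulting model. The plot was generated by running model.fit() only once (justifying the values of score_arr).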

Answer 1

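Randomized search on hyperparameters.

While using a grid of parameter settings is currently the most widely used method for parameter optimization, other search methods have more favorable properties. RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search: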

A budget can be chosen independently of the number of parameters and possible values.

Adding parameters that do not influence the performance does not decrease efficiency.
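The first benefit can be made concrete with the grid from the question. The sketch below counts how many combinations an exhaustive search would have to fit and compares that with the n_iter=25 budget. The extra 'gamma' entry is a hypothetical addition used only to show that enlarging the search space does not change the budget.

```python
import math

# Parameter lists copied from the question.
param_dist = {
    'n_estimators': [50, 120, 180, 240, 400],
    'learning_rate': [0.01, 0.03, 0.05],
    'subsample': [0.5, 0.7],
    'max_depth': [3, 4, 5],
    'min_child_weight': [1, 2, 3],
    'scale_pos_weight': [9],
}
n_iter = 25  # budget chosen in the question

# An exhaustive grid search would try every combination.
grid_size = math.prod(len(values) for values in param_dist.values())
print(grid_size)  # 270

# Hypothetical extra parameter: the grid triples, the random-search budget does not.
bigger = dict(param_dist, gamma=[0, 1, 5])
bigger_grid_size = math.prod(len(values) for values in bigger.values())
print(bigger_grid_size, n_iter)  # 810 25
```

With these lists, the randomized search fits 25 of the 270 possible settings per cross-validation run, so the cost is fixed by n_iter and not by the size of the grid.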

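If all parameters are given as lists, sampling without replacement is performed. If at least one parameter is given as a distribution, sampling with replacement is used. It is highly recommended to use continuous distributions for continuous parameters.

For more details, see the scikit-learn documentation for RandomizedSearchCV.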
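The difference between the two sampling modes can be seen with scikit-learn's ParameterSampler, which generates candidate settings from the same kind of dictionary that RandomizedSearchCV accepts. This is a minimal sketch: the parameter names mirror the question, but the reduced grid and the uniform range for learning_rate are assumptions chosen for illustration.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# Case 1: every parameter is a list, so settings are drawn without replacement.
list_only = {"max_depth": [3, 4, 5], "subsample": [0.5, 0.7]}
draws = list(ParameterSampler(list_only, n_iter=6, random_state=0))
# The grid has 3 * 2 = 6 combinations; asking for 6 draws yields each one once.
unique_draws = {tuple(sorted(d.items())) for d in draws}
print(len(draws), len(unique_draws))

# Case 2: one parameter is a continuous distribution, so sampling is with replacement.
# uniform(loc, scale) covers [loc, loc + scale], here [0.01, 0.05] (assumed range).
mixed = {"max_depth": [3, 4, 5],
         "learning_rate": uniform(loc=0.01, scale=0.04)}
samples = list(ParameterSampler(mixed, n_iter=6, random_state=0))
for s in samples:
    print(s)
```

In the second case the learning rate is no longer restricted to a handful of fixed values, which is the reason for the recommendation to use continuous distributions for continuous parameters.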
