I'm trying to do a hyperparameter search using scikit-learn's GridSearchCV on XGBoost. During grid search I'd like it to early stop, since that reduces search time drastically and (I expect) gives better results on my prediction/regression task. I am using XGBoost via its Scikit-Learn API.
model = xgb.XGBRegressor()
GridSearchCV(model, paramGrid, verbose=verbose,
             fit_params={'early_stopping_rounds': 42},
             cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]),
             n_jobs=n_jobs, iid=iid).fit(trainX, trainY)
I tried to pass the early stopping parameters using fit_params, but then it throws this error, which is basically because of the missing validation set that early stopping requires:
/opt/anaconda/anaconda3/lib/python3.5/site-packages/xgboost/callback.py in callback(env=XGBoostCallbackEnv(model=<xgboost.core.Booster o...teration=4000, rank=0, evaluation_result_list=[]))
    187         else:
    188             assert env.cvfolds is not None
    189
    190     def callback(env):
    191         """internal function"""
--> 192         score = env.evaluation_result_list[-1][1]
        score = undefined
        env.evaluation_result_list = []
    193         if len(state) == 0:
    194             init(env)
    195         best_score = state['best_score']
    196         best_iteration = state['best_iteration']
How can I apply GridSearchCV on XGBoost with early_stopping_rounds?
NOTE: the model works without grid search, and GridSearchCV works without fit_params={'early_stopping_rounds': 42}.
When using early_stopping_rounds you also have to provide eval_metric and eval_set as input parameters for the fit method. Early stopping is done by calculating the error on an evaluation set. The error has to decrease at least every early_stopping_rounds rounds, otherwise the generation of additional trees is stopped early.
See the documentation of xgboost's fit method for details.
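To see the mechanics in isolation, here is a minimal sketch of early stopping on a bare fit() call, without any grid search. It assumes a pre-2.0 XGBoost Scikit-Learn API where these are fit() parameters (newer versions moved them to the constructor); trainX/trainY and testX/testY are the toy arrays defined in the example below:

import xgboost as xgb

model = xgb.XGBRegressor(n_estimators=1000)
model.fit(trainX, trainY,
          eval_set=[(testX, testY)],      # set the error is computed on
          eval_metric="mae",              # metric that drives early stopping
          early_stopping_rounds=42)       # stop after 42 rounds without improvement
print(model.best_iteration, model.best_score)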
Here you can see a minimal, fully working example:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit

cv = 2

trainX = [[1], [2], [3], [4], [5]]
trainY = [1, 2, 3, 4, 5]

# these are the evaluation sets
testX = trainX
testY = trainY

paramGrid = {"subsample": [0.5, 0.8]}

fit_params = {"early_stopping_rounds": 42,
              "eval_metric": "mae",
              "eval_set": [[testX, testY]]}

model = xgb.XGBRegressor()

gridsearch = GridSearchCV(model, paramGrid, verbose=1,
                          fit_params=fit_params,
                          cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]))

gridsearch.fit(trainX, trainY)
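One caveat about the cv argument above: TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]) simply returns the integer 2, so GridSearchCV falls back to ordinary K-fold splitting rather than time-ordered splits. If time-series splits are actually intended, pass the splitter object itself:

gridsearch = GridSearchCV(model, paramGrid, verbose=1,
                          fit_params=fit_params,
                          cv=TimeSeriesSplit(n_splits=cv))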
An update to @glao's answer and a response to @Vasim's comment/question, as of sklearn 0.21.3 (note that fit_params has been moved out of the instantiation of GridSearchCV and into the fit() method; also, the import specifically pulls in the sklearn wrapper module from xgboost):
import xgboost.sklearn as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit

cv = 2

trainX = [[1], [2], [3], [4], [5]]
trainY = [1, 2, 3, 4, 5]

# these are the evaluation sets
testX = trainX
testY = trainY

paramGrid = {"subsample": [0.5, 0.8]}

fit_params = {"early_stopping_rounds": 42,
              "eval_metric": "mae",
              "eval_set": [[testX, testY]]}

model = xgb.XGBRegressor()

gridsearch = GridSearchCV(model, paramGrid, verbose=1,
                          cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]))

gridsearch.fit(trainX, trainY, **fit_params)
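After the search finishes, the refit best estimator carries the early-stopping state. A small sketch for inspecting it (best_iteration and best_score are set by XGBoost versions where fit() accepts early_stopping_rounds):

best = gridsearch.best_estimator_
print(gridsearch.best_params_)   # winning grid combination
print(best.best_iteration)       # boosting round the refit model stopped at
print(best.best_score)           # its eval_set score at that round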
Here's a solution that works in a Pipeline with GridSearchCV. The challenge occurs when you have a pipeline that is required to pre-process the training data, for example when X is a text document and you need a TfidfVectorizer to vectorize it.
Override the XGBRegressor or XGBClassifier .fit() function
from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import train_test_split

class XGBRegressor_ES(XGBRegressor):

    def fit(self, X, y, *, eval_test_size=None, **kwargs):
        # By default, train on all of X/y (no early stopping).
        X_train, y_train = X, y

        if eval_test_size is not None:
            params = super(XGBRegressor, self).get_xgb_params()

            # Split off an evaluation set *after* the pipeline has
            # transformed X, so early stopping sees transformed data.
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=eval_test_size, random_state=params['random_state'])

            eval_set = [(X_test, y_test)]

            # Could add (X_train, y_train) to eval_set
            # to get .eval_results() for both train and test
            # eval_set = [(X_train, y_train), (X_test, y_test)]

            kwargs['eval_set'] = eval_set

        return super(XGBRegressor_ES, self).fit(X_train, y_train, **kwargs)
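As a quick standalone sanity check before wiring the class into a pipeline (a sketch; X, y and the parameter values are placeholders):

model = XGBRegressor_ES(n_estimators=2000, random_state=7)
model.fit(X, y,
          eval_test_size=0.2,            # fraction split off internally for eval_set
          eval_metric='mae',
          early_stopping_rounds=75)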
Example usage
Below is a multistep pipeline that includes multiple transformations to X. The pipeline's fit() function passes the new evaluation parameter to the XGBRegressor_ES class above as xgbr__eval_test_size=200. In this example:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectPercentile, f_regression

xgbr_pipe = Pipeline(steps=[('tfidf', TfidfVectorizer()),
                            ('vt', VarianceThreshold()),
                            # with_mean=False keeps the sparse TF-IDF output sparse
                            ('scaler', StandardScaler(with_mean=False)),
                            # f_regression, since the target is continuous
                            ('Sp', SelectPercentile(f_regression)),
                            ('xgbr', XGBRegressor_ES(n_estimators=2000,
                                                     objective='reg:squarederror',
                                                     eval_metric='mae',
                                                     learning_rate=0.0001,
                                                     random_state=7))])

X_train = train_idxs['f_text'].values
y_train = train_idxs['Pct_Change_20'].values
Example fitting of the pipeline:
%time xgbr_pipe.fit(X_train, y_train,
                    xgbr__eval_test_size=200,
                    xgbr__eval_metric='mae',
                    xgbr__early_stopping_rounds=75)
Example of fitting GridSearchCV:
learning_rate = [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3]
param_grid = dict(xgbr__learning_rate=learning_rate)

grid_search = GridSearchCV(xgbr_pipe, param_grid, scoring="neg_mean_absolute_error", n_jobs=-1, cv=10)
grid_result = grid_search.fit(X_train, y_train,
                              xgbr__eval_test_size=200,
                              xgbr__eval_metric='mae',
                              xgbr__early_stopping_rounds=75)
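To see how far boosting actually ran for the winning configuration, you can reach into the fitted pipeline (named_steps is standard sklearn; best_iteration assumes an XGBoost version that sets it after early stopping):

best_xgbr = grid_result.best_estimator_.named_steps['xgbr']
print(grid_result.best_params_)
print(best_xgbr.best_iteration)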
I found the suggested solutions a bit clunky, so I implemented my own. See the package xgbsearch (which I created) for my implementation. See https://pypi.org/project/xgbsearch/ for details.
Implementing grid search using this package is straightforward.
from xgbsearch import XgbGridSearch, XgbRandomSearch
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.metrics import roc_auc_score

X, y = make_classification(random_state=42)
X = pd.DataFrame(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# These parameters will be passed to xgb.fit as is.
fit_params = {
    "device": "cuda",
    "objective": "binary:logistic",
    "eval_metric": ["auc"],
}

# The parameters here will be tuned. If the parameter is a single value, it will be passed as is.
# If the parameter is a list, all possible combinations will be searched using grid search.
tune_params_grid = {
    "eta": [0.01, 0.001],
    "max_depth": [5, 11],
    "min_child_weight": 3,
}

grid_search = XgbGridSearch(tune_params_grid, fit_params)
eval_set = [(X_train, y_train, "train"), (X_test, y_test, "test")]
grid_search.fit(X_train, y_train, eval_set, 10000, 100, verbose_eval=25)