Similar to How do I pass parameters to only one part of a pipeline object in scikit-learn?, I want to pass parameters to only one part of a pipeline. Normally this should just work, e.g.:
estimator = XGBClassifier()
pipeline = Pipeline([
    ('clf', estimator)
])
and then fit it like this:
pipeline.fit(X_train, y_train, clf__early_stopping_rounds=20)
But it fails with:
/usr/local/lib/python3.5/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
114 """
115 Xt, yt, fit_params = self._pre_transform(X, y, **fit_params)
--> 116 self.steps[-1][-1].fit(Xt, yt, **fit_params)
117 return self
118
/usr/local/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/sklearn.py in fit(self, X, y, sample_weight, eval_set, eval_metric, early_stopping_rounds, verbose)
443 early_stopping_rounds=early_stopping_rounds,
444 evals_result=evals_result, obj=obj, feval=feval,
--> 445 verbose_eval=verbose)
446
447 self.objective = xgb_options["objective"]
/usr/local/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/training.py in train(params, dtrain, num_boost_round, evals, obj, feval, maximize, early_stopping_rounds, evals_result, verbose_eval, learning_rates, xgb_model, callbacks)
201 evals=evals,
202 obj=obj, feval=feval,
--> 203 xgb_model=xgb_model, callbacks=callbacks)
204
205
/usr/local/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/training.py in _train_internal(params, dtrain, num_boost_round, evals, obj, feval, xgb_model, callbacks)
97 end_iteration=num_boost_round,
98 rank=rank,
---> 99 evaluation_result_list=evaluation_result_list))
100 except EarlyStopException:
101 break
/usr/local/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/callback.py in callback(env)
196 def callback(env):
197 """internal function"""
--> 198 score = env.evaluation_result_list[-1][1]
199 if len(state) == 0:
200 init(env)
IndexError: list index out of range
whereas
estimator.fit(X_train, y_train, early_stopping_rounds=20)
works fine.
For early stopping rounds you must always provide the validation set via the eval_set argument. Here is how the error in your code can be fixed:
pipeline.fit(X_train, y_train, clf__early_stopping_rounds=20, clf__eval_set=[(test_X, test_y)])
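As a rough end-to-end sketch of that fix (the X_tr/X_val names and the 80/20 split are my own, purely for illustration), you can carve the validation set out of the training data and route both fit parameters to the 'clf' step:

from sklearn.model_selection import train_test_split

# Hold out part of the training data as the validation set (illustrative split).
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

# The clf__ prefix routes both fit parameters to the pipeline's 'clf' step.
pipeline.fit(X_tr, y_tr,
             clf__early_stopping_rounds=20,
             clf__eval_set=[(X_val, y_val)])

Note that this only works as-is when the 'clf' step is the first step, as in the question; if transformers come before it, the eval_set data has to be transformed first, as the answers below discuss.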
I recently used the following steps to use the eval_metric and eval_set parameters of XGBoost inside a pipeline.
# pipeline.cost_pipe is an existing Pipeline whose last step is the XGBoost model
# (step name 'xgboost_model'); FEATURES, ERROR_METRIC and save_path come from context.
pipeline_temp = pipeline.Pipeline(pipeline.cost_pipe.steps[:-1])   # pipeline without the model step
X_trans = pipeline_temp.fit_transform(X_train[FEATURES], y_train)  # fit preprocessors, transform train data
eval_set = [(X_trans, y_train), (pipeline_temp.transform(X_test), y_test)]  # eval_set from transformed data
pipeline_temp.steps.append(pipeline.cost_pipe.steps[-1])           # put the model step back
pipeline_temp.fit(X_train[FEATURES], y_train,
                  xgboost_model__eval_metric=ERROR_METRIC,
                  xgboost_model__eval_set=eval_set)
joblib.dump(pipeline_temp, save_path)                              # persist the fitted pipeline
This is the solution: https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/13755/xgboost-early-stopping-and-other-issues. Both early_stopping_rounds and the watchlist/eval_set need to be passed. Unfortunately, that did not work for me, because the variables in the watchlist require a preprocessing step that is only applied inside the pipeline, so I would have had to apply that step manually.
Here is a solution that also works in a GridSearchCV pipeline:
Override the fit() function of XGBRegressor or XGBClassifier.
from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import train_test_split

class XGBRegressor_ES(XGBRegressor):

    def fit(self, X, y, *, eval_test_size=None, **kwargs):
        X_train, y_train = X, y  # used unchanged when no eval split is requested
        if eval_test_size is not None:
            params = super(XGBRegressor, self).get_xgb_params()
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=eval_test_size, random_state=params['random_state'])
            eval_set = [(X_test, y_test)]
            # Could add (X_train, y_train) to eval_set
            # to get .eval_results() for both train and test
            # eval_set = [(X_train, y_train), (X_test, y_test)]
            kwargs['eval_set'] = eval_set
        return super(XGBRegressor_ES, self).fit(X_train, y_train, **kwargs)
Example usage
Below is a multi-step pipeline that includes multiple transformations of X. The pipeline's fit() function passes the new evaluation parameter to the XGBRegressor_ES class above as xgbr__eval_test_size=200. In this example:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectPercentile, f_regression

xgbr_pipe = Pipeline(steps=[('tfidf', TfidfVectorizer()),
                            ('vt', VarianceThreshold()),
                            ('scaler', StandardScaler()),
                            ('Sp', SelectPercentile()),
                            ('xgbr', XGBRegressor_ES(n_estimators=2000,
                                                     objective='reg:squarederror',
                                                     eval_metric='mae',
                                                     learning_rate=0.0001,
                                                     random_state=7))])

X_train = train_idxs['f_text'].values
y_train = train_idxs['Pct_Change_20'].values
Example of fitting the pipeline:
%time xgbr_pipe.fit(X_train, y_train,
                    xgbr__eval_test_size=200,
                    xgbr__eval_metric='mae',
                    xgbr__early_stopping_rounds=75)
Example of fitting in GridSearchCV:
from sklearn.model_selection import GridSearchCV

learning_rate = [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3]
param_grid = dict(xgbr__learning_rate=learning_rate)
grid_search = GridSearchCV(xgbr_pipe, param_grid, scoring="neg_mean_absolute_error", n_jobs=-1, cv=10)
grid_result = grid_search.fit(X_train, y_train,
                              xgbr__eval_test_size=200,
                              xgbr__eval_metric='mae',
                              xgbr__early_stopping_rounds=75)
Suppose you have a pipeline like this:
pipeline = Pipeline([('preprocessor', preprocessor), ('model', xgboost_model)])
You can use the following function to pass eval_set to the model.fit step:
def pipeline_fit_with_eval_set(pipeline, X_train, y_train, X_test, y_test, fit_params=None):
    """
    Fit a scikit-learn pipeline with eval_set support.

    Parameters:
    - pipeline: The scikit-learn pipeline.
    - X_train: Training data.
    - y_train: Training labels.
    - X_test: Test data.
    - y_test: Test labels.
    - fit_params: Additional fit parameters.

    Usage:
    pipeline_fit_with_eval_set(my_pipeline, X_train, y_train, X_test, y_test, fit_params={'eval_metric': 'logloss'})
    """
    fit_params = dict(fit_params or {})

    # Step 1: Extract preprocessors
    pipeline_preprocessors = Pipeline(pipeline.steps[:-1])

    # Step 2: Fit preprocessors and transform training data
    # Make sure not to use any test data for the fit step
    X_train_transformed = pipeline_preprocessors.fit_transform(X_train)

    # Step 3: Transform test data
    X_test_transformed = pipeline_preprocessors.transform(X_test)

    # Step 4: Prepare eval set
    fit_params["eval_set"] = [(X_test_transformed, y_test)]

    # Step 5: Extract model and fit
    model = pipeline.steps[-1][1]
    model.fit(X_train_transformed, y_train, **fit_params)
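For illustration, a minimal way to call it might look like this; the synthetic data, the StandardScaler step and the XGBClassifier settings below are my own assumptions, just to make the sketch self-contained:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# In recent xgboost versions early_stopping_rounds is a constructor argument;
# in older versions pass it via fit_params instead.
pipeline = Pipeline([('preprocessor', StandardScaler()),
                     ('model', XGBClassifier(n_estimators=500, early_stopping_rounds=20))])

pipeline_fit_with_eval_set(pipeline, X_train, y_train, X_test, y_test,
                           fit_params={'verbose': False})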
I was originally inspired by this answer and improved on it.
You can use this function for any other model that takes an eval_set, such as LightGBM and CatBoost.
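As a rough sketch of that claim (the LGBMClassifier settings below are my own assumptions), the same helper works unchanged because LightGBM's scikit-learn wrapper also accepts eval_set in fit():

from lightgbm import LGBMClassifier, early_stopping
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

lgbm_pipeline = Pipeline([('preprocessor', StandardScaler()),
                          ('model', LGBMClassifier(n_estimators=500))])

# In recent LightGBM versions, early stopping is requested via a callback in fit().
pipeline_fit_with_eval_set(lgbm_pipeline, X_train, y_train, X_test, y_test,
                           fit_params={'eval_metric': 'logloss',
                                       'callbacks': [early_stopping(20)]})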