I'm trying to fit an XGBRegressor() on an imbalanced dataset (97% / 3%) and evaluate the results, but I'm having trouble producing the correct evaluation metrics.
I chose SMOTE to oversample the minority class of the target variable.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from xgboost import XGBRegressor

X = multiSdata.filter(['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8', 'col9',
                       'col10', 'col11', 'col12', 'col13', 'col14', 'col15', 'col16', 'col17',
                       'col18', 'col19', 'col20', 'col21', 'col22', 'col23', 'col24'])
y = multiSdata['target']  # target column; its name ('target') appears in the value_counts() output below
# retain the original feature labels
feature_labels = pd.Series(X.columns.values)
X.head(5)
(screenshot of the X.head(5) output)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=27)
print( "Predictor - Training : ", X_train.shape, "Predictor - Testing : ", X_test.shape, "Target - Training : ", y_train.shape, "Target - Testing : ", y_test.shape )
Output: Predictor - Training : (876742, 24) Predictor - Testing : (375747, 24) Target - Training : (876742,) Target - Testing : (375747,)
y_train.value_counts()
Output:
0    824518
1     52224
Name: target, dtype: int64
# sampling_strategy=1.0 resamples the minority class up to a 1:1 ratio
# (older imbalanced-learn releases called this parameter `ratio` and the method `fit_sample`)
sm = SMOTE(random_state=27, sampling_strategy=1.0)
X_train, y_train = sm.fit_resample(X_train.values, y_train.values)
np.bincount(y_train)
Output: array([824518, 824518])
# named xgb_reg so it doesn't collide with the `import xgboost as xgb` alias used below
xgb_reg = XGBRegressor(learning_rate=0.1,
                       n_estimators=1000,
                       max_depth=5,
                       min_child_weight=1,
                       gamma=0.1,
                       subsample=0.8,
                       colsample_bytree=0.8,
                       objective='binary:logistic',
                       nthread=4,
                       scale_pos_weight=1,
                       seed=21,
                       eval_metric=['auc', 'error'])
smote_model = xgb_reg.fit(X_train, y_train)  # fit on the SMOTE-resampled training data
X_test = X_test.to_numpy()                   # DataFrame.as_matrix() was removed from pandas
smote_pred = smote_model.predict(X_test)
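For reference, one way to score these predictions (a minimal sketch; it assumes smote_pred holds the probability-like scores that the binary:logistic objective produces and y_test holds the true 0/1 labels):

from sklearn.metrics import roc_auc_score, classification_report

# threshold the raw scores at 0.5 (an arbitrary, untuned cutoff) to get hard class labels
hard_pred = (smote_pred > 0.5).astype(int)
print("ROC AUC:", roc_auc_score(y_test, smote_pred))  # rank-based, uses the raw scores
print(classification_report(y_test, hard_pred))       # per-class precision/recall/F1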
import xgboost as xgb

# note: 'n_estimators' is a scikit-learn wrapper argument; xgb.cv uses num_boost_round instead
params = {'learning_rate': 0.1, 'n_estimators': 1000, 'max_depth': 5,
          'min_child_weight': 1, 'gamma': 0.1, 'subsample': 0.8,
          'colsample_bytree': 0.8, 'objective': 'binary:logistic',
          'nthread': 4, 'scale_pos_weight': 1, 'seed': 21,
          'eval_metric': ['auc', 'error']}
xg_train = xgb.DMatrix(data=X_train, label=y_train)
cv_results = xgb.cv(params, xg_train, num_boost_round=10, nfold=5, early_stopping_rounds=10)
cv_results
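For reference, xgb.cv returns a pandas DataFrame with per-round mean/std columns for each metric, so the final boosting round can be inspected like this:

print(cv_results[['test-auc-mean', 'test-error-mean']].tail(1))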
I'm trying to use cross-validation, but I couldn't get it working with XGBRegressor, so I used xgboost directly and built a DMatrix from X_train and y_train. I'm not sure whether that is what leads to the 100% accuracy, which is surely wrong.
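For reference, the scikit-learn wrapper can also be cross-validated directly with cross_val_score; a minimal sketch, swapping in XGBClassifier (binary:logistic is a classification objective) and using the resampled training data from above:

from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

clf = XGBClassifier(learning_rate=0.1, n_estimators=1000, max_depth=5,
                    min_child_weight=1, gamma=0.1, subsample=0.8,
                    colsample_bytree=0.8, n_jobs=4, random_state=21)
# 5-fold CV; AUC avoids the accuracy pitfall on imbalanced labels
scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='roc_auc')
print(scores.mean())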
Any advice on how to further troubleshoot why the model isn't producing correct predictions would be much appreciated.
Oversampling can create new cases that are too perfect, effectively copies of the old training cases. Keeping the test set untouched, as you did, may not prevent this. If feasible, it would be better to switch to undersampling (which introduces no new leakage, though some may still be present). If done correctly, neither should help accuracy much anyway.
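A minimal sketch of that undersampling alternative (assuming imbalanced-learn's RandomUnderSampler and the original, pre-SMOTE training split):

import numpy as np
from imblearn.under_sampling import RandomUnderSampler

# downsample the majority class to a 1:1 ratio instead of synthesizing minority cases
rus = RandomUnderSampler(random_state=27, sampling_strategy=1.0)
X_train_u, y_train_u = rus.fit_resample(X_train, y_train)
print(np.bincount(y_train_u))  # expect [52224, 52224] given the counts above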
As for cross-validation, for the same reason, make sure to remove the duplicates first.
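One common way to avoid this kind of leakage altogether (a sketch, assuming imbalanced-learn and a classifier for the binary objective) is to resample inside each training fold with an imblearn Pipeline, so the validation folds stay untouched:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# SMOTE runs only on the training portion of each fold; held-out folds are never resampled
pipe = Pipeline([('smote', SMOTE(random_state=27)),
                 ('model', XGBClassifier(max_depth=5, random_state=21))])
scores = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')
print(scores)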