I don't understand why the metrics differ between xgb.train and xgb.XGBRegressor when the predictions are identical. Any ideas?
Below is a small example on simulated data.
Import libraries
import numpy as np
import pandas as pd
import xgboost as xgb
import plotly.express as px
import json
Simulate data
n = 1000
braking = np.random.normal(10, 2, n)
acceleration = np.random.normal(8, 1.5, n)
phone = np.random.normal(1, 0.5, n)
distance = np.random.normal(50, 50, n)
simdf = pd.DataFrame({
"braking": braking,
"acceleration": acceleration,
"phone": phone,
"distance": distance
})
simdf['distance'] = np.where(simdf['distance'] < 2, 2, simdf['distance'])
simdf['phone'] = np.where(simdf['phone'] < 0, 0, simdf['phone'])
mu_A = np.exp(-1 + 0.02 * simdf['braking'] + 0.001 * simdf['acceleration'] + 0.0008 * simdf['distance'])
y_A = np.random.poisson(mu_A, n)
simdf['response'] = y_A
simdf['margin'] = 0.02 * simdf['braking'] + 0.001 * simdf['acceleration']
# `simdf_train` is used below but was never defined; train on the full frame
simdf_train = simdf
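Incidentally, the reason np.log(margin) is passed as the base margin further down: count:poisson uses a log link, so a base margin is additive on the log scale and therefore acts as a multiplicative offset on the prediction scale. A minimal sketch with made-up values (no xgboost needed):

```python
import numpy as np

# Made-up raw booster scores and per-row offsets (stand-ins for the
# 'margin' column above)
raw_score = np.array([0.1, -0.3, 0.5])
offset = np.array([0.20, 0.25, 0.18])

# With a log link, passing log(offset) as base_margin means the final
# prediction is exp(raw_score + log(offset)) = offset * exp(raw_score)
pred = np.exp(raw_score + np.log(offset))
assert np.allclose(pred, offset * np.exp(raw_score))
```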
Set the parameters for native xgboost
# note: 'n_estimators' belongs to the scikit-learn API; xgb.train ignores it
# and takes num_boost_round instead
model_param = {'objective': 'count:poisson',
               'monotone_constraints': (1, 1, 1, 1),
               'seed': 12345,
               'eval_metric': 'poisson-nloglik',
               }
Fit the native xgboost model
xgbMatrix_A = xgb.DMatrix(simdf_train[["braking","acceleration","phone","distance"]],
label=simdf_train[["response"]])
xgbMatrix_A.set_info(base_margin=np.log(simdf_train[["margin"]]))
bst_A = xgb.train(model_param,
xgbMatrix_A,
num_boost_round=50,
evals = [(xgbMatrix_A,"train")]
)
bst_A
sim_df_pred = xgb.DMatrix(simdf[["braking","acceleration","phone","distance"]])
sim_df_pred.set_info(base_margin=np.log(simdf[["margin"]]))
predictions = bst_A.predict(sim_df_pred)
simdf['pred_python'] = predictions
Retrieve the native xgboost parameters and update them
config = json.loads(bst_A.save_config())
model_param = config['learner']['gradient_booster']['updater']['grow_colmaker']['train_param']
model_param.update({'objective': 'count:poisson',
'n_estimators': 50,
'eval_metric': 'poisson-nloglik'
})
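For context, save_config() returns the booster's full configuration as JSON, and the lookup above drills into the tree updater's training parameters. Roughly (abridged from memory, so exact keys and values vary across xgboost versions; treat this shape as illustrative only):

```json
{
  "learner": {
    "gradient_booster": {
      "updater": {
        "grow_colmaker": {
          "train_param": {
            "eta": "0.300000012",
            "max_depth": "6",
            "min_child_weight": "1"
          }
        }
      }
    }
  }
}
```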
Fit the XGBRegressor model
bst_B = xgb.XGBRegressor(**model_param)
bst_B.fit(simdf_train[["braking","acceleration","phone","distance"]],
          simdf_train[["response"]],
          base_margin=np.log(simdf_train[["margin"]]),
          eval_set=[(simdf_train[["braking","acceleration","phone","distance"]],
                     simdf_train[["response"]])])
predictions = bst_B.predict(simdf[["braking","acceleration","phone","distance"]], base_margin=np.log(simdf[["margin"]]))
simdf['pred_python_sk_log'] = predictions
Compare predictions and metrics
## prediction comparison
np.sum(simdf["pred_python_sk_log"] - simdf["pred_python"])
## metric comparison
print(bst_A.eval(xgbMatrix_A))
print(bst_B.evals_result()["validation_0"]["poisson-nloglik"][-1])
The predictions are identical (the sum of differences is 0), but the metrics differ.
After some digging, I finally found what has to be added to the code to get the same metric value: the base_margin_eval_set parameter.
In the "Fit the XGBRegressor model" step, replace the bst_B.fit(...) call with:
bst_B.fit(simdf_train[["braking","acceleration","phone","distance"]],simdf_train[["response"]],
base_margin_eval_set= [np.log(simdf_train[["margin"]])],
base_margin=np.log(simdf_train[["margin"]]),
eval_set=[(simdf_train[["braking","acceleration","phone","distance"]], simdf_train[["response"]])])
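For what it's worth, my reading of why this matters (not verified against the xgboost source): without base_margin_eval_set, the eval_set predictions used to compute poisson-nloglik are made without the offset, even though predict() called with base_margin agrees with the native booster. A numpy sketch of how the same raw scores yield different metric values depending on whether the offset is applied (all values made up):

```python
import numpy as np

def poisson_nloglik(y, pred):
    # Poisson negative log-likelihood averaged over rows; the lgamma(y + 1)
    # term is dropped since it depends only on y and cancels in comparisons
    return np.mean(pred - y * np.log(pred))

rng = np.random.default_rng(0)
y = rng.poisson(1.0, 100)          # made-up counts
raw = rng.normal(0.0, 0.1, 100)    # made-up raw booster scores
offset = np.full(100, 0.5)         # made-up per-row offset

# Same raw scores, but the metric differs depending on whether the offset
# enters the evaluated predictions
with_offset = poisson_nloglik(y, offset * np.exp(raw))
without_offset = poisson_nloglik(y, np.exp(raw))
print(with_offset, without_offset)
```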