Native xgb and XGBRegressor give the same predictions but different metrics


I don't understand why the metrics reported by xgb.train and xgb.XGBRegressor differ. The predicted values are identical. Do you have any ideas?

Below is a small example with simulated data.

Import libraries

import numpy as np
import pandas as pd
import xgboost as xgb
import plotly.express as px
import json

Simulate data

n = 1000

braking = np.random.normal(10, 2, n)
acceleration = np.random.normal(8, 1.5, n)
phone = np.random.normal(1, 0.5, n)
distance = np.random.normal(50, 50, n)

simdf = pd.DataFrame({
    "braking": braking,
    "acceleration": acceleration,
    "phone": phone,
    "distance": distance
})
simdf['distance'] = np.where(simdf['distance'] < 2, 2, simdf['distance'])
simdf['phone'] = np.where(simdf['phone'] < 0, 0, simdf['phone'])
mu_A = np.exp(-1 + 0.02 * simdf['braking'] + 0.001 * simdf['acceleration'] + 0.0008 * simdf['distance'])
y_A = np.random.poisson(mu_A, n)
simdf['response'] = y_A
simdf['margin'] = 0.02 * simdf['braking'] + 0.001 * simdf['acceleration']
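
The code below refers to simdf_train, which is not defined in the snippet; a minimal sketch of the missing step, assuming the entire simulated frame is used for training (a real train/test split would work the same way):

# assumption: no hold-out split is shown in the question, so train on the full simulated frame
simdf_train = simdf.copy()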

Set the parameters for native xgboost

model_param = {'objective': 'count:poisson', 
               'monotone_constraints': (1,1,1,1), 
               'n_estimators': 50, 
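               # note: 'n_estimators' is ignored by xgb.train; the number of rounds comes from num_boost_round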
               'seed': 12345,
               'eval_metric': 'poisson-nloglik',
               }

Train the native xgboost model

xgbMatrix_A = xgb.DMatrix(simdf_train[["braking","acceleration","phone","distance"]], 
                          label=simdf_train[["response"]])

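# count:poisson uses a log link, so the base_margin must be supplied on the log scale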
xgbMatrix_A.set_info(base_margin=np.log(simdf_train[["margin"]]))
bst_A = xgb.train(model_param,
    xgbMatrix_A,
    num_boost_round=50,
    evals = [(xgbMatrix_A,"train")]
)
bst_A
sim_df_pred = xgb.DMatrix(simdf[["braking","acceleration","phone","distance"]])
sim_df_pred.set_info(base_margin=np.log(simdf[["margin"]]))
predictions = bst_A.predict(sim_df_pred)
simdf['pred_python'] = predictions
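
As a quick sanity check (not part of the original code), the raw margin output should be the log of these predictions, since count:poisson applies exp to the sum of the raw score and the base margin:

# optional check: raw margin (log scale) vs. transformed predictions
raw_score = bst_A.predict(sim_df_pred, output_margin=True)
print(np.allclose(np.exp(raw_score), predictions))  # expected: True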

Get the parameters from the native xgboost model and update them

config = json.loads(bst_A.save_config())
model_param = config['learner']['gradient_booster']['updater']['grow_colmaker']['train_param']
model_param.update({'objective': 'count:poisson',
               'n_estimators': 50,
               'eval_metric': 'poisson-nloglik'
               })

Fit the XGBRegressor model

bst_B = xgb.XGBRegressor(**model_param)
bst_B.fit(simdf_train[["braking","acceleration","phone","distance"]],simdf_train[["response"]], base_margin=np.log(simdf_train[["margin"]]),
          eval_set=[(simdf_train[["braking","acceleration","phone","distance"]], simdf_train[["response"]])])
predictions = bst_B.predict(simdf[["braking","acceleration","phone","distance"]], base_margin=np.log(simdf[["margin"]]))
simdf['pred_python_sk_log'] = predictions

Compare predictions and metrics

## prediction comparison 
np.sum(simdf["pred_python_sk_log"] - simdf["pred_python"])

## metric comparison 
print(bst_A.eval(xgbMatrix_A))
print(bst_B.evals_result()["validation_0"]["poisson-nloglik"][-1])

Identical predictions (the sum of the differences is 0), but different metrics.
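
One way to see which of the two reported numbers is the expected one is to compute the Poisson negative log-likelihood by hand on the training data; a sketch, assuming xgboost's poisson-nloglik is the mean of pred - y*log(pred) + log(y!):

from scipy.special import gammaln

# manual Poisson negative log-likelihood on the training set
y_true = simdf_train["response"].to_numpy()
mu_hat = bst_A.predict(xgbMatrix_A)
print(np.mean(mu_hat - y_true * np.log(mu_hat) + gammaln(y_true + 1)))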

python xgboost
1 Answer

After some digging, I finally found what has to be added to the code to get the same metric value. We need to add the parameter

base_margin_eval_set

in the "Fit the XGBRegressor model" step. Replace the call

bst_B.fit(...)

with:

bst_B.fit(simdf_train[["braking","acceleration","phone","distance"]],simdf_train[["response"]], 
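          # pass the margin for the eval_set as well, so the logged metric matches the one from xgb.train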
          base_margin_eval_set= [np.log(simdf_train[["margin"]])],
          base_margin=np.log(simdf_train[["margin"]]), 
          eval_set=[(simdf_train[["braking","acceleration","phone","distance"]], simdf_train[["response"]])])
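
With that change, the two reported values should coincide; a quick way to confirm, reusing the comparison from the question:

# after refitting bst_B with base_margin_eval_set, both metrics should agree
print(bst_A.eval(xgbMatrix_A))
print(bst_B.evals_result()["validation_0"]["poisson-nloglik"][-1])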