使用分类数据输入为 XGBoost 绘制摘要图

问题描述 投票:0回答:2

XGBoost 支持直接将特征作为类别输入,这在类别变量很多的时候非常有用。这似乎与 Shap 不兼容:

import pandas as pd
import xgboost
import shap

# Test data
test_data = pd.DataFrame({'target':[23,42,58,29,28],
                      'feature_1' : [38, 83, 38, 28, 57],
                      'feature_2' : ['A', 'B', 'A', 'C','A']})
test_data['feature_2'] = test_data['feature_2'].astype('category')

# Fit xgboost
model = xgboost.XGBRegressor(enable_categorical=True,
                                       tree_method='hist')
model.fit(test_data.drop('target', axis=1), test_data['target'] )

# Explain with Shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(test_data)

抛出错误:ValueError: DataFrame.dtypes for data must be int, float, bool or category.

这种情况可以用Shap吗?

python xgboost shap
2个回答
1
投票

我使用了 GradientBoostingRegressor 并将数组重塑为每个元素 2 个特征

from sklearn.ensemble import GradientBoostingRegressor
import shap
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import numpy as np

df = pd.DataFrame({'target':[23,42,58,29,28],
                      'feature_1' : [38, 83, 38, 28, 57],
                      'feature_2' : ['A', 'B', 'A', 'C','A']
                  })

df["feature_1"]=df["feature_1"].astype(int)
df["target"]=df["target"].astype(int)

encoder = preprocessing.LabelEncoder()
df["feature_2"]=encoder.fit_transform(df["feature_2"])

print(df)
SEED=42
    model = GradientBoostingRegressor(n_estimators=300, max_depth=8, random_state=SEED)

scale= StandardScaler()

#X=df[["feature_1","feature_2"]]
columns=["feature_1","feature_2"]
n_features=len(columns)
X=np.array(scale.fit_transform(df[columns])).reshape(-1,n_features)
y=np.array(df["target"])
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=42)
model.fit(X_train,y_train)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(df)

print(shap_values)


y_pred=model.predict(X_test)

x=np.arange(len(X_test))
plt.bar(x,y_test)
plt.bar(x,y_pred,color='green')
plt.show()

输出:

target  feature_1  feature_2
0      23         38          0
1      42         83          1
2      58         38          0
3      29         28          2
4      28         57          0

Shap values

[[-4.65720266 -3.00946401  0.        ]
 [ 2.32860133 -3.00946401  0.        ]
 [ 2.32860133 -3.00946401  0.        ]
 [-4.65720266 -3.00946401  0.        ]
 [-4.65720266 -3.00946401  0.        ]]

    df = pd.DataFrame({'target':[23,42,58,29,28],
                      'feature_1' : [38, 83, 38, 28, 57],
                      'feature_2' : ['A', 'B', 'A', 'C','A']
                  })

df["feature_1"]=df["feature_1"].astype(int)
df["target"]=df["target"].astype(int)

encoder = preprocessing.LabelEncoder()
df["feature_2"]=encoder.fit_transform(df["feature_2"])

SEED=42
#model = xgboost.XGBRegressor(enable_categorical=True,tree_method='hist')
model=xgboost.XGBRegressor(enable_categorical=True,tree_method='hist')
#model = GradientBoostingRegressor(n_estimators=100, max_depth=2, random_state=SEED)

scale= StandardScaler()

#X=df[["feature_1","feature_2"]]
columns=["feature_1","feature_2"]
n_features=len(columns)
X=np.array(scale.fit_transform(df[columns])).reshape(-1,n_features)
y=np.array(df["target"])
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.6,random_state=42)
model.fit(X_train,y_train)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
print(shap_values)


y_pred=model.predict(X_test)

x=np.arange(len(X_test))
plt.bar(x,y_test)
plt.bar(x,y_pred,color='green')
plt.show()

0
投票

不幸的是, 使用分类变量通过 xgboost 生成 shap 值是一个悬而未决的问题。参见,fe,https://github.com/slundberg/shap/issues/2662

鉴于您的具体示例,我使用 Dmatrix 作为 shap 的输入使其运行(Dmatrix 是 xgboost 模型的基本数据类型输入,请参阅 Learning API。您正在使用的 sklearn api 不需要 Dmatrix,位于最少的培训):

import pandas as pd
import xgboost
import shap

# Test data
test_data = pd.DataFrame({'target':[23,42,58,29,28],
                      'feature_1' : [38, 83, 38, 28, 57],
                      'feature_2' : ['A', 'B', 'A', 'C','A']})
test_data['feature_2'] = test_data['feature_2'].astype('category')
print(test_data.info())
# Fit xgboost
model = xgboost.XGBRegressor(enable_categorical=True,
                                       tree_method='hist')
model.fit(test_data.drop('target', axis=1), test_data['target'] )

# Explain with Shap
test_data_dm = xgb.DMatrix(data=test_data.drop('target', axis=1), label=test_data['target'], enable_categorical=True)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(test_data_dm)
print(shap_values)

但是当存在分类变量时生成 shap 值的能力非常不稳定:例如,如果你在 xgboost 中添加其他参数,你会得到错误“Check failed: !HasCategoricalSplit()”,这是我第一个链接中引用的错误

import pandas as pd
import xgboost
import shap

# Test data
test_data = pd.DataFrame({'target':[23,42,58,29,28],
                      'feature_1' : [38, 83, 38, 28, 57],
                      'feature_2' : ['A', 'B', 'A', 'C','A']})
test_data['feature_2'] = test_data['feature_2'].astype('category')
print(test_data.info())
# Fit xgboost
model = xgboost.XGBRegressor(colsample_bylevel= 0.7, 
                             enable_categorical=True,
                             tree_method='hist')
model.fit(test_data.drop('target', axis=1), test_data['target'] )

# Explain with Shap
test_data_dm = xgb.DMatrix(data=test_data.drop('target', axis=1), label=test_data['target'], enable_categorical=True)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(test_data_dm)
shap_values

我已经搜索了几个月的解决方案,但总的来说,就我的理解而言,使用 xgboost 和分类变量生成 shap 值是不可能的(我希望有人可以用一个可重现的例子反驳我)。 我建议你试试 Catboost

© www.soinside.com 2019 - 2024. All rights reserved.