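XGBoost supports passing features directly as categorical, which is very useful when there are many categorical variables. This seems to be incompatible with SHAP: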
import pandas as pd
import xgboost
import shap
# Test data
test_data = pd.DataFrame({'target':[23,42,58,29,28],
'feature_1' : [38, 83, 38, 28, 57],
'feature_2' : ['A', 'B', 'A', 'C','A']})
test_data['feature_2'] = test_data['feature_2'].astype('category')
# Fit xgboost
model = xgboost.XGBRegressor(enable_categorical=True,
tree_method='hist')
model.fit(test_data.drop('target', axis=1), test_data['target'] )
# Explain with Shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(test_data)
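This raises: ValueError: DataFrame.dtypes for data must be int, float, bool or category.
Is it possible to use SHAP in this situation?
I used GradientBoostingRegressor, label-encoded the categorical column, and reshaped the input array so that each row has 2 features: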
from sklearn.ensemble import GradientBoostingRegressor
import shap
import pandas as pd
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({'target':[23,42,58,29,28],
'feature_1' : [38, 83, 38, 28, 57],
'feature_2' : ['A', 'B', 'A', 'C','A']
})
df["feature_1"]=df["feature_1"].astype(int)
df["target"]=df["target"].astype(int)
encoder = preprocessing.LabelEncoder()
df["feature_2"]=encoder.fit_transform(df["feature_2"])
print(df)
SEED=42
model = GradientBoostingRegressor(n_estimators=300, max_depth=8, random_state=SEED)
scale= StandardScaler()
#X=df[["feature_1","feature_2"]]
columns=["feature_1","feature_2"]
n_features=len(columns)
X=np.array(scale.fit_transform(df[columns])).reshape(-1,n_features)
y=np.array(df["target"])
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=42)
model.fit(X_train,y_train)
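explainer = shap.TreeExplainer(model)
# Note: df still contains the 'target' column and is unscaled, which is presumably
# why the output below has three columns; X would match the features used for training.
shap_values = explainer.shap_values(df)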
print(shap_values)
y_pred=model.predict(X_test)
x=np.arange(len(X_test))
plt.bar(x,y_test)
plt.bar(x,y_pred,color='green')
plt.show()
Output:
   target  feature_1  feature_2
0      23         38          0
1      42         83          1
2      58         38          0
3      29         28          2
4      28         57          0
SHAP values:
[[-4.65720266 -3.00946401 0. ]
[ 2.32860133 -3.00946401 0. ]
[ 2.32860133 -3.00946401 0. ]
[-4.65720266 -3.00946401 0. ]
[-4.65720266 -3.00946401 0. ]]
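To make the encoding step above concrete, here is a minimal pure-Python sketch of what label encoding does to feature_2. The helper name `label_encode` is hypothetical and only for illustration; it mimics the behaviour of sklearn's LabelEncoder (sorted unique labels mapped to 0, 1, 2, ...) rather than reproducing its implementation.

```python
def label_encode(values):
    """Map each distinct label to an integer code, in sorted label order.

    Hypothetical helper for illustration; it mirrors the sorted-classes
    convention of sklearn.preprocessing.LabelEncoder.
    """
    classes = sorted(set(values))
    index = {label: code for code, label in enumerate(classes)}
    return [index[v] for v in values], classes


feature_2 = ['A', 'B', 'A', 'C', 'A']
codes, classes = label_encode(feature_2)
print(codes)    # [0, 1, 0, 2, 0]
print(classes)  # ['A', 'B', 'C']
```

The codes match the feature_2 column in the printed frame above. Keep in mind that integer codes impose an order (A < B < C) that the trees will treat as numeric, so this is a workaround and not the same thing as native categorical handling.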
Or, with XGBoost on the label-encoded data (same imports as above, plus xgboost):
import xgboost
df = pd.DataFrame({'target':[23,42,58,29,28],
'feature_1' : [38, 83, 38, 28, 57],
'feature_2' : ['A', 'B', 'A', 'C','A']
})
df["feature_1"]=df["feature_1"].astype(int)
df["target"]=df["target"].astype(int)
encoder = preprocessing.LabelEncoder()
df["feature_2"]=encoder.fit_transform(df["feature_2"])
SEED=42
#model = xgboost.XGBRegressor(enable_categorical=True,tree_method='hist')
model=xgboost.XGBRegressor(enable_categorical=True,tree_method='hist')
#model = GradientBoostingRegressor(n_estimators=100, max_depth=2, random_state=SEED)
scale= StandardScaler()
#X=df[["feature_1","feature_2"]]
columns=["feature_1","feature_2"]
n_features=len(columns)
X=np.array(scale.fit_transform(df[columns])).reshape(-1,n_features)
y=np.array(df["target"])
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.6,random_state=42)
model.fit(X_train,y_train)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
print(shap_values)
y_pred=model.predict(X_test)
x=np.arange(len(X_test))
plt.bar(x,y_test)
plt.bar(x,y_pred,color='green')
plt.show()
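Unfortunately, generating SHAP values with xgboost when categorical variables are used is an open issue. See, for example, https://github.com/slundberg/shap/issues/2662
For your specific example, I got it to run by passing a DMatrix to shap (DMatrix is the basic input data type for xgboost models; see the Learning API. The sklearn API you are using does not need a DMatrix, at least for training):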
import pandas as pd
import xgboost
import shap
# Test data
test_data = pd.DataFrame({'target':[23,42,58,29,28],
'feature_1' : [38, 83, 38, 28, 57],
'feature_2' : ['A', 'B', 'A', 'C','A']})
test_data['feature_2'] = test_data['feature_2'].astype('category')
print(test_data.info())
# Fit xgboost
model = xgboost.XGBRegressor(enable_categorical=True,
tree_method='hist')
model.fit(test_data.drop('target', axis=1), test_data['target'] )
# Explain with Shap
test_data_dm = xgboost.DMatrix(data=test_data.drop('target', axis=1), label=test_data['target'], enable_categorical=True)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(test_data_dm)
print(shap_values)
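But generating SHAP values when categorical variables are present is very unstable: for example, if you add other parameters to the xgboost model, you get the error "Check failed: !HasCategoricalSplit()", which is the error referenced in my first link: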
import pandas as pd
import xgboost
import shap
# Test data
test_data = pd.DataFrame({'target':[23,42,58,29,28],
'feature_1' : [38, 83, 38, 28, 57],
'feature_2' : ['A', 'B', 'A', 'C','A']})
test_data['feature_2'] = test_data['feature_2'].astype('category')
print(test_data.info())
# Fit xgboost
model = xgboost.XGBRegressor(colsample_bylevel= 0.7,
enable_categorical=True,
tree_method='hist')
model.fit(test_data.drop('target', axis=1), test_data['target'] )
# Explain with Shap
test_data_dm = xgboost.DMatrix(data=test_data.drop('target', axis=1), label=test_data['target'], enable_categorical=True)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(test_data_dm)
shap_values
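I have been searching for a solution for months, but as far as I understand, generating SHAP values with xgboost and categorical variables is not possible in general (I hope someone can prove me wrong with a reproducible example). I suggest you try CatBoost.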