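I have a problem: I want to use a pipeline (with OHE as preprocessing and a simple linear regression as the model) together with the SHAP tools.
As for the data, here it is (I am using a modified version of my bike-sharing dataset):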
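import pandas as pd
import shap
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
bike_data=pd.read_csv("bike_outlier_clean.csv")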
bike_data['season']=bike_data.season.astype('category')
bike_data['year']=bike_data.year.astype('category')
bike_data['holiday']=bike_data.holiday.astype('category')
bike_data['workingday']=bike_data.workingday.astype('category')
bike_data['weather_condition']=bike_data.weather_condition.astype('category')
bike_data['season'] = bike_data['season'].map({1:'Spring', 2:'Summer', 3:'Fall', 4: 'Winter'})
bike_data['year'] = bike_data['year'].map({0: 2011, 1: 2012})
bike_data['holiday'] = bike_data['holiday'].map({0: False, 1: True})
bike_data['workingday'] = bike_data['workingday'].map({0: False, 1: True})
bike_data['weather_condition'] = bike_data['weather_condition'].map({1:'Clear', 2:'Mist', 3:'Light Snow/Rain', 4: 'Heavy Snow/Rain'})
bike_data = bike_data[['total_count','season','month','year','weekday','holiday','workingday','weather_condition','humidity','temp','windspeed']]
x = bike_data.drop('total_count', axis=1)
y = bike_data['total_count']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
And here is my pipeline:
category_columns = list(set(bike_data.columns) - set(bike_data._get_numeric_data().columns))
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), category_columns)
    ],
    remainder='passthrough'
)
model = LinearRegression()
pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])
pipeline.fit(x_train,y_train)
Finally, I use the KernelSHAP explainer:
explainer = shap.KernelExplainer(pipeline.predict, shap.sample(x, 5))
However, this is where the error occurs:
123 # Make a copy so that the feature names are not removed from the original model
124 out = copy.deepcopy(out)
--> 125 out.f.__self__.feature_names_in_ = None
126
127 return out
AttributeError: can't set attribute 'feature_names_in_'
I currently have no idea what I should do to fix it.
Shap does not play well with Pipeline objects, so I would suggest the following (note that I switch to numpy arrays instead of a Pandas df):
import numpy as np
import pandas as pd
import shap
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
print(shap.__version__)
bike_data = pd.read_csv("archive/bike_sharing_daily.csv")
bike_data['season']=bike_data.season.astype('category')
bike_data['holiday']=bike_data.holiday.astype('category')
bike_data['workingday']=bike_data.workingday.astype('category')
bike_data['season'] = bike_data['season'].map({1:'Spring', 2:'Summer', 3:'Fall', 4: 'Winter'})
bike_data['holiday'] = bike_data['holiday'].map({0: False, 1: True})
bike_data['workingday'] = bike_data['workingday'].map({0: False, 1: True})
bike_data = bike_data[['season','weekday','holiday','workingday','temp','windspeed']]
x = bike_data
y = np.random.randint(0,10, len(bike_data)) # random placeholder target instead of total_count
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
category_columns = list(set(bike_data.columns) - set(bike_data._get_numeric_data().columns))
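preprocessor = ColumnTransformer(
    transformers=[
        # positional indices of season, holiday and workingday,
        # because the pipeline is fitted on a numpy array below
        ('cat', OneHotEncoder(), [0,2,3])
    ],
    remainder='passthrough'
)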
model = LinearRegression()
pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])
pipeline.fit(x_train.values,y_train) # <-- here: fit on a numpy array, not a DataFrame
explainer = shap.KernelExplainer(pipeline.predict, x_train.values[:10])
sv = explainer.shap_values(x_train.values)
shap.summary_plot(sv, x.columns)
Printed shap version:
0.44.1.dev4
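If you would rather keep fitting the pipeline on a DataFrame with named columns, one possible alternative is to hand shap a plain function instead of the bound `pipeline.predict` method. The following is only a sketch under assumptions: the data is a made-up stand-in for the bike data, and `predict_from_array` is a hypothetical wrapper name, not a shap or scikit-learn API.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up stand-in for the bike data: one categorical and two numeric columns.
rng = np.random.default_rng(0)
n = 40
X = pd.DataFrame({
    "season": rng.choice(["Spring", "Summer", "Fall", "Winter"], size=n),
    "temp": rng.random(n),
    "windspeed": rng.random(n),
})
y = rng.integers(0, 10, size=n).astype(float)  # random placeholder target

preprocessor = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["season"])],
    remainder="passthrough",
)
pipeline = Pipeline(steps=[("preprocessor", preprocessor),
                           ("model", LinearRegression())])
pipeline.fit(X, y)  # fitted on the DataFrame, so column names can be used


def predict_from_array(arr):
    """Hypothetical wrapper: rebuild the training columns and dtypes, then predict."""
    frame = pd.DataFrame(arr, columns=X.columns).astype(X.dtypes.to_dict())
    return pipeline.predict(frame)


# The wrapper should agree with predicting on the DataFrame directly.
preds_from_array = predict_from_array(X.to_numpy())
preds_from_frame = pipeline.predict(X)
print(np.allclose(preds_from_array, preds_from_frame))
```

The wrapper could then be passed as `shap.KernelExplainer(predict_from_array, X.to_numpy()[:10])`. Since a plain function has no owning estimator for shap to modify, this should avoid the `feature_names_in_` assignment, though I have not checked it against every shap release.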