Using SHAP KernelExplainer with a Pipeline


I want to use a pipeline (one-hot encoding as preprocessing, a simple linear regression as the model) with SHAP, and I'm running into a problem.

Here is my data (I'm using a modified version of the bike sharing dataset):

import pandas as pd
from sklearn.model_selection import train_test_split

bike_data = pd.read_csv("bike_outlier_clean.csv")

bike_data['season']=bike_data.season.astype('category')
bike_data['year']=bike_data.year.astype('category')
bike_data['holiday']=bike_data.holiday.astype('category')
bike_data['workingday']=bike_data.workingday.astype('category')
bike_data['weather_condition']=bike_data.weather_condition.astype('category')

bike_data['season'] = bike_data['season'].map({1:'Spring', 2:'Summer', 3:'Fall', 4: 'Winter'})
bike_data['year'] = bike_data['year'].map({0: 2011, 1: 2012})
bike_data['holiday'] = bike_data['holiday'].map({0: False, 1: True})
bike_data['workingday'] = bike_data['workingday'].map({0: False, 1: True})
bike_data['weather_condition'] = bike_data['weather_condition'].map({1:'Clear', 2:'Mist', 3:'Light Snow/Rain', 4: 'Heavy Snow/Rain'})

bike_data = bike_data[['total_count','season','month','year','weekday','holiday','workingday','weather_condition','humidity','temp','windspeed']]

x = bike_data.drop('total_count', axis=1)
y = bike_data['total_count']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

And here is my pipeline:

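from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Every non-numeric column (string, bool, category) gets one-hot encoded
category_columns = list(set(bike_data.columns) - set(bike_data._get_numeric_data().columns))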
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), category_columns)
    ],
    remainder='passthrough'  
)
model = LinearRegression()

pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])

pipeline.fit(x_train,y_train)

Finally, I create the KernelSHAP explainer:

import shap

explainer = shap.KernelExplainer(pipeline.predict, shap.sample(x, 5))

However, this is where the error occurs:

    123             # Make a copy so that the feature names are not removed from the original model
    124             out = copy.deepcopy(out)
--> 125             out.f.__self__.feature_names_in_ = None
    126 
    127     return out

AttributeError: can't set attribute 'feature_names_in_'

I have no idea what I should do to fix it.

python shap
1 Answer

Shap doesn't play well with Pipeline objects, so I'd suggest the following (note where I switch to numpy arrays instead of a Pandas df):

import numpy as np
import pandas as pd
import shap

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

print(shap.__version__)

bike_data = pd.read_csv("archive/bike_sharing_daily.csv")
bike_data['season']=bike_data.season.astype('category')
bike_data['holiday']=bike_data.holiday.astype('category')
bike_data['workingday']=bike_data.workingday.astype('category')
bike_data['season'] = bike_data['season'].map({1:'Spring', 2:'Summer', 3:'Fall', 4: 'Winter'})
bike_data['holiday'] = bike_data['holiday'].map({0: False, 1: True})
bike_data['workingday'] = bike_data['workingday'].map({0: False, 1: True})
bike_data = bike_data[['season','weekday','holiday','workingday','temp','windspeed']]

x = bike_data

y = np.random.randint(0, 10, len(bike_data))  # random placeholder target, for illustration only

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

category_columns = list(set(bike_data.columns) - set(bike_data._get_numeric_data().columns))
preprocessor = ColumnTransformer(
    transformers=[
        # positional indices of season, holiday, workingday (the pipeline is fitted on a NumPy array)
        ('cat', OneHotEncoder(), [0,2,3])
    ],
    remainder='passthrough'  
)
model = LinearRegression()

pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])

pipeline.fit(x_train.values, y_train)  # <-- here: fit on the NumPy array, not the DataFrame

explainer = shap.KernelExplainer(pipeline.predict, x_train.values[:10])
sv = explainer.shap_values(x_train.values)

shap.summary_plot(sv, x.columns)

0.44.1.dev4
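
To illustrate why the original call might fail, here is a minimal standard-library sketch. It assumes that the Pipeline exposes `feature_names_in_` as a read-only property and that shap only reaches the assignment shown in the traceback when the prediction callable is a bound method (as the `__self__` access suggests). `PipelineLike`, `strip_feature_names` and `predict_fn` are hypothetical stand-ins, not shap or scikit-learn code.

```python
class PipelineLike:
    """Stand-in for a fitted Pipeline (hypothetical, for illustration)."""

    @property
    def feature_names_in_(self):
        # Read-only property: there is no setter, so assignment fails.
        return ["season", "temp"]

    def predict(self, rows):
        return [0.0 for _ in rows]


def strip_feature_names(f):
    """Imitates the assignment seen in the traceback (assumed behaviour)."""
    owner = getattr(f, "__self__", None)  # plain functions have no __self__
    if owner is not None and hasattr(owner, "feature_names_in_"):
        owner.feature_names_in_ = None  # raises AttributeError for a read-only property
    return f


pipeline_like = PipelineLike()

# Case 1: passing the bound method reproduces the AttributeError.
try:
    strip_feature_names(pipeline_like.predict)
    bound_method_failed = False
except AttributeError:
    bound_method_failed = True


# Case 2: a plain wrapper function has no __self__, so nothing is assigned.
def predict_fn(rows):
    return pipeline_like.predict(rows)


wrapped = strip_feature_names(predict_fn)
wrapped_result = wrapped([[1, 2], [3, 4]])

print(bound_method_failed)  # True
print(wrapped_result)       # [0.0, 0.0]
```

If these assumptions hold, they would also explain why fitting on NumPy arrays helps: without column names there is no `feature_names_in_` for shap to reset. Passing a plain function that wraps `pipeline.predict` might be another way around the error, although that variant has not been tested against the real libraries here.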
