What is the difference between xgboost.plot_importance() and model.feature_importances_ in XGBClassifier?
To illustrate, I generated some dummy data here:
import numpy as np
import pandas as pd
# generate some random data for demonstration purpose, use your original dataset here
X = np.random.rand(1000,100) # 1000 x 100 data
y = np.random.rand(1000).round() # 0, 1 labels
a = pd.DataFrame(X)
a.columns = ['param'+str(i+1) for i in range(len(a.columns))]
b = pd.DataFrame(y)
import xgboost as xgb
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
model = XGBClassifier()
model.fit(a,b)
# Feature importance
model.feature_importances_
fi = pd.DataFrame({'Feature-names':a.columns,'Importances':model.feature_importances_})
fi.sort_values(by='Importances',ascending=False)
plt.bar(range(len(model.feature_importances_)),model.feature_importances_)
plt.show()
plt.rcParams.update({'figure.figsize':(20.0,180.0)})
plt.rcParams.update({'font.size':20.0})
plt.barh(a.columns,model.feature_importances_)
sorted_idx = model.feature_importances_.argsort()
plt.barh(a.columns[sorted_idx],model.feature_importances_[sorted_idx])
plt.xlabel('XGBoost Classifier Feature Importance')
#plot_importance
xgb.plot_importance(model, ax=plt.gca())
plt.show()
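If you look at the plots, feature_importances_ and plot_importance do not give the same result. I tried reading the documentation, but I could not follow it in layman's terms. Does anyone understand why plot_importance does not give the same result as feature_importances_?
If I do this: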
fi['Importances'].sum()
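I get 1.0, which means the values in feature_importances_ are fractions of the total.
If I want to do dimensionality reduction, which one should I use: the values from feature_importances_ or the ones from plot_importance?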
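The scores shown by plot_importance are raw scores; they are not normalized by their total. plot_importance also defaults to importance_type="weight", so the example below sets the same type on the model to make the two comparable.
Using your example: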
import numpy as np
import pandas as pd
import xgboost as xgb
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot as plt
np.random.seed(99)
X = np.random.rand(1000,100) # 1000 x 100 data
y = np.random.rand(1000).round() # 0, 1 labels
a = pd.DataFrame(X)
a.columns = ['param'+str(i+1) for i in range(len(a.columns))]
b = pd.DataFrame(y)
model = XGBClassifier(importance_type = "weight")
model.fit(a,b)
xgb.plot_importance(model,max_num_features=10,importance_type = "weight")
Here is the plot of the 10 most important features:
To get the scores shown on the plot:
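df = pd.DataFrame(model.get_booster().get_score(importance_type = "weight"),
index = ["raw_importance"]).T
# sort so that the first 10 rows are the 10 largest raw scores
df = df.sort_values("raw_importance", ascending=False)
df[:10]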
         raw_importance
param98              35
param57              30
param17              30
param20              29
param14              28
param45              27
param22              27
param59              27
param13              26
param30              26
To get the scores reported by model.feature_importances_, divide the raw importance scores by their sum:
         raw_importance  normalized
param98              35    0.018747
param57              30    0.016069
param17              30    0.016069
param20              29    0.015533
param14              28    0.014997
param45              27    0.014462
param22              27    0.014462
param59              27    0.014462
param13              26    0.013926
param30              26    0.013926
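The division described above can be sketched in plain Python. This is only an illustration: the helper name normalize_importances and the feature names and split counts are made up, not taken from the model above.

```python
def normalize_importances(raw_scores):
    """Divide each raw score by the total so the results sum to 1."""
    total = sum(raw_scores.values())
    if total == 0:
        # no splits at all: avoid dividing by zero
        return {name: 0.0 for name in raw_scores}
    return {name: score / total for name, score in raw_scores.items()}


# hypothetical split counts, not the real ones from the model above
raw = {"f0": 6, "f1": 3, "f2": 1}
normalized = normalize_importances(raw)
print(normalized)
```

With the df built earlier, the same step would be df["normalized"] = df["raw_importance"] / df["raw_importance"].sum().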
You can see that these match what the model reports:
pd.DataFrame(model.feature_importances_,columns=['score'],index = a.columns)\
.sort_values('score',ascending=False)[:10]
            score
param98  0.018747
param57  0.016069
param17  0.016069
param20  0.015533
param14  0.014997
param45  0.014462
param59  0.014462
param22  0.014462
param12  0.013926
param13  0.013926
So, to answer your question: to rank the features, you can use model.feature_importances_.
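For the dimensionality-reduction part of the question, one possible way to use such a ranking is to keep only the k highest-scoring features. The sketch below assumes that approach; the helper top_k_features and the names and scores are hypothetical and stand in for a.columns and model.feature_importances_.

```python
def top_k_features(names, importances, k):
    """Return the k feature names with the largest importance, highest first."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# hypothetical names and scores, standing in for a.columns and
# model.feature_importances_
names = ["f0", "f1", "f2", "f3"]
importances = [0.1, 0.4, 0.2, 0.3]
selected = top_k_features(names, importances, k=2)
print(selected)  # ['f1', 'f3']
```

With the model above, top_k_features(a.columns, model.feature_importances_, 10) would give a list of column names that can be used to subset the DataFrame, for example a[selected].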