I have been looking into several feature selection methods and found an approach that uses XGBoost in the link below (XGBoost Feature Importance and Selection). I applied the method to my case and got the results shown below.
So my question is: for this situation, how do I pick the case with the highest accuracy at a low feature count [n]? (The code can be found in the link.)
EDIT 1: Thanks to @Mihai Petre, I got it working with the code in his answer. I have another question. Say I ran the code from the link and got the following results:
Feature Importance results = [29.205832 5.0182242 0. 0. 0. 6.7736177 16.704327 18.75632 9.529003 14.012676 0. ]
Features = [ 0 7 6 9 8 5 1 10 4 3 2]
How do I remove the features with zero feature importance and keep only the features that have non-zero importance values?
OK, so what the guy in your link is doing with
from numpy import sort
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))
is to create a sorted array of thresholds and then train an XGBoost classifier for each element of the thresholds array.
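To make that concrete, here is a minimal sketch of what each step of the loop selects. It assumes nothing beyond the (rounded) importance values from your edit: SelectFromModel keeps exactly the features whose importance is greater than or equal to the threshold, so as the threshold walks up the sorted importances, fewer features survive.

import numpy as np

# Rounded importance values from the question, indexed by feature number.
importances = np.array([29.2, 5.0, 0.0, 0.0, 0.0, 6.8, 16.7, 18.8, 9.5, 14.0, 0.0])

# np.unique returns the distinct thresholds in ascending order.
for thresh in np.unique(importances):
    kept = np.flatnonzero(importances >= thresh)  # features at or above the threshold
    print('thresh=%5.1f keeps n=%2d features: %s' % (thresh, kept.size, kept))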
From your question, I think you just want to select the 6th case, the one with the smallest number of features and the highest accuracy. For that case, you'd want to do something like this (thresholds[5] being the sixth element of the sorted thresholds array):
selection = SelectFromModel(model, threshold=thresholds[5], prefit=True)
select_X_train = selection.transform(X_train)
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
select_X_test = selection.transform(X_test)
predictions = selection_model.predict(select_X_test)
accuracy = accuracy_score(y_test, predictions)
print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresholds[5], select_X_train.shape[1], accuracy*100.0))
If you want to automate the whole thing, then you need to work out, inside that for loop, the smallest n for which the accuracy is highest. It would look more or less like this:
n_min = X_train.shape[1] + 1  # start above your maximum number of used features
acc_max = 0
thresholds = sort(model.feature_importances_)
obj_thresh = thresholds[0]
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    # keep the threshold that uses fewer features while being at least as accurate
    if (select_X_train.shape[1] < n_min) and (accuracy >= acc_max):
        n_min = select_X_train.shape[1]
        acc_max = accuracy
        obj_thresh = thresh
# retrain once more with the winning threshold
selection = SelectFromModel(model, threshold=obj_thresh, prefit=True)
select_X_train = selection.transform(X_train)
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
select_X_test = selection.transform(X_test)
predictions = selection_model.predict(select_X_test)
accuracy = accuracy_score(y_test, predictions)
print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (obj_thresh, select_X_train.shape[1], accuracy*100.0))
I found a way to solve it; please see the code below.
To get the minimum number of features with the highest accuracy:
import numpy as np
import xgboost as xgb
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score

# model_FS is the XGBoost model already fitted on the full training set:
f_max = 8  # largest feature count to consider
f_min = 2  # smallest feature count to consider
n_min = f_max  # best (smallest) feature count found so far
acc_max = 0  # best accuracy found so far
thresholds = np.sort(model_FS.feature_importances_)
obj_thresh = thresholds[0]
accuracy_list = []
for thresh in thresholds:
    # select features using threshold:
    selection = SelectFromModel(model_FS, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model:
    selection_model = xgb.XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model:
    select_X_test = selection.transform(X_test)
    selection_model_pred = selection_model.predict(select_X_test)
    selection_predictions = [round(value) for value in selection_model_pred]
    accuracy = accuracy_score(y_true=y_test, y_pred=selection_predictions)
    accuracy = accuracy * 100
    print('Thresh= %.3f, n= %d, Accuracy: %.2f%%' % (thresh, select_X_train.shape[1], accuracy))
    accuracy_list.append(accuracy)
    # keep the smallest feature count within [f_min, f_max) that is at least as accurate:
    if (select_X_train.shape[1] < f_max) and (select_X_train.shape[1] >= f_min) and (accuracy >= acc_max):
        n_min = select_X_train.shape[1]
        acc_max = accuracy
        obj_thresh = thresh
# select features using the winning threshold:
selection = SelectFromModel(model_FS, threshold=obj_thresh, prefit=True)
select_X_train = selection.transform(X_train)
# train model:
selection_model = xgb.XGBClassifier()
selection_model.fit(select_X_train, y_train)
# eval model:
select_X_test = selection.transform(X_test)
selection_model_pred = selection_model.predict(select_X_test)
selection_predictions = [round(value) for value in selection_model_pred]
accuracy = accuracy_score(y_true=y_test, y_pred=selection_predictions)
print("Selected: Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (obj_thresh, select_X_train.shape[1], accuracy*100.0))
key_list = list(range(X_train.shape[1], 0, -1))
accuracy_dict = dict(zip(key_list, accuracy_list))
optimum_num_feat = n_min
print(optimum_num_feat)
# Printing out the features kept by the selected threshold:
selected_idx = selection.get_support(indices=True)  # column positions kept by SelectFromModel
X_train = X_train.iloc[:, selected_idx]
X_test = X_test.iloc[:, selected_idx]
print('X Train FI: ')
print(X_train)
print('X Test FI: ')
print(X_test)
To get the features that have non-zero importance values and drop the ones with zero importance:
# Calculate feature importances:
importances = model_FS.feature_importances_
print((model_FS.feature_importances_) * 100)
# Organising the feature importances in a dictionary:
## The key range runs over your total number of features:
key_list = range(len(importances))
feature_importance_dict = dict(zip(key_list, importances))
sort_feature_importance_dict = dict(sorted(feature_importance_dict.items(), key=lambda x: x[1], reverse=True))
print('Feature Importance Dictionary (Sorted): ', sort_feature_importance_dict)
# Removing the features that have zero feature importance:
filtered_feature_importance_dict = {x: y for x, y in sort_feature_importance_dict.items() if y != 0}
print('Filtered Feature Importance Dictionary: ', filtered_feature_importance_dict)
f_indices = list(filtered_feature_importance_dict.keys())
f_indices = np.asarray(f_indices)
print(f_indices)
# iloc selects columns by position, so this works whatever the column labels are:
X_train = X_train.iloc[:, f_indices]
X_test = X_test.iloc[:, f_indices]
print('X Train FI: ')
print(X_train)
print('X Test FI: ')
print(X_test)
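For what it's worth, the zero-importance filtering can be collapsed into a few lines with numpy. This sketch assumes the same fitted model_FS and pandas DataFrames X_train and X_test as above.

import numpy as np

# Column positions of every feature with a non-zero importance score.
keep = np.flatnonzero(model_FS.feature_importances_)
print('Keeping feature indices:', keep)

X_train = X_train.iloc[:, keep]
X_test = X_test.iloc[:, keep]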