How can I improve the accuracy of a decision tree?


Here is my code:


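import pandas as pd
from sklearn import tree
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Prepare target and features (data is a pandas DataFrame loaded earlier)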
target_column = 'resolution'
X = data.drop(columns=[target_column])
y = data[target_column]

# Convert categorical data to binary/numerical where needed
X_encoded = pd.get_dummies(X)  # One-hot encoding for categorical features
le = LabelEncoder()
y_encoded = le.fit_transform(y)  # Encode target variable

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y_encoded, test_size=0.3, random_state=42)

# Perform Grid Search to tune max_depth, min_samples_split, min_samples_leaf, and class_weight
param_grid = {
    'max_depth': range(1, 11),  # Range of max_depth to test
    'min_samples_split': [2, 5, 10],  # Options for min_samples_split
    'min_samples_leaf': [1, 2, 4],  # Options for min_samples_leaf
    'class_weight': ['balanced'],  # Only the balanced class weights are tried
}
model = DecisionTreeClassifier(random_state=42)

# Initialize GridSearchCV with 5-fold cross-validation
grid = GridSearchCV(model, param_grid, cv=5, n_jobs=-1, verbose=2, scoring='accuracy')  # Using accuracy for classification
grid.fit(X_train, y_train)

# Best parameters from grid search
best_params = grid.best_params_
# print("Best Parameters from Grid Search:", best_params)

# Evaluate the best model on the test set
best_model = grid.best_estimator_
test_accuracy = best_model.score(X_test, y_test) * 100  # Accuracy as percentage
# print(f"Test Accuracy of the Best Model: {test_accuracy:.2f}%")

# Display the structure of the decision tree
# tree_structure = tree.export_text(best_model, feature_names=list(X_encoded.columns))
# print("\nDecision Tree Structure:")
# print(tree_structure)

These are my results:

Classification report:

              precision    recall  f1-score   support

   Outcome 1       0.39      0.88      0.54       388
   Outcome 2       0.95      0.61      0.75      1397

    accuracy                           0.67      1785
   macro avg       0.67      0.75      0.64      1785
weighted avg       0.83      0.67      0.70      1785

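Confusion matrix:
[[343  45]
 [540 857]]

Accuracy for class "Outcome 1": 88.40%
Accuracy for class "Outcome 2": 61.35%

Decision tree classifier accuracy: 0.67

I would like the overall accuracy, and the accuracy for "Outcome 2", to be higher. I think the problem is that Outcome 1 is the minority class here, and for some reason the decision tree's accuracy is low even though it is supposed to be balanced.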

python jupyter-notebook jupyter-lab decision-tree
1 Answer

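Instead of using the balanced class weights, you can use the class_weight parameter of DecisionTreeClassifier() to change the weights associated with the individual classes and build a cost-sensitive decision tree, that is, change the optimization objective by assigning a higher penalty to misclassifying a particular class.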
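To make the effect of a class weight concrete, here is a minimal sketch of how weights change the impurity a tree tries to reduce. It assumes the Gini criterion (the scikit-learn default) and relies on the fact that class weights act as per-sample weights, so impurity is computed from weighted class counts. The function name weighted_gini, the node counts, and the 3x weight are hypothetical and only for illustration; this is not scikit-learn's actual implementation.

```python
def weighted_gini(class_counts, class_weights):
    """Gini impurity of a node, computed from weighted class counts."""
    weighted = [n * w for n, w in zip(class_counts, class_weights)]
    total = sum(weighted)
    return 1.0 - sum((m / total) ** 2 for m in weighted)


# Illustrative node: 388 minority samples and 1397 majority samples
counts = [388, 1397]

# Without reweighting, the majority class dominates the impurity
print("unweighted:", weighted_gini(counts, [1, 1]))

# Hypothetical weights: a minority sample counts three times as much
print("minority x3:", weighted_gini(counts, [3, 1]))
```

With a larger weight on the minority class, a node that looks fairly pure by raw counts looks more mixed to the tree, so splits that isolate the minority class become more attractive.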

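According to the sklearn documentation:

class_weight: dict, list of dict or "balanced", default=None

Weights associated with classes in the form {class_label: weight}. If None, all classes are supposed to have weight one.

The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))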
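The formula quoted above can be worked through by hand. The sketch below is a plain-Python version of it; the helper name balanced_class_weights is hypothetical, and the class counts (388 and 1397) are borrowed from the test-set supports in the question purely for illustration, since the training-set counts are not shown.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} for the labels."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Illustrative labels: 388 samples of class 0 and 1397 samples of class 1
labels = [0] * 388 + [1] * 1397
weights = balanced_class_weights(labels)
print(weights)
```

With these counts the minority class gets a weight of roughly 2.30 and the majority class roughly 0.64, so both classes end up with the same total weight.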

You can also treat the class weights as a grid (hyper)parameter and tune them with GridSearchCV. Here is an example using the sklearn digits dataset:

# your code as a function:
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

def train_test(model, X_encoded, y_encoded, class_weights=None):
    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X_encoded, y_encoded, test_size=0.3, random_state=42)

    # Perform Grid Search to tune max_depth, min_samples_split, min_samples_leaf, and class_weight
    param_grid = {
        'max_depth': range(1, 11),  # Range of max_depth to test
        'min_samples_split': [2, 5, 10],  # Options for min_samples_split
        'min_samples_leaf': [1, 2, 4],  # Options for min_samples_leaf
        # Default to the balanced weights; otherwise search over the given list of class weights
        'class_weight': ['balanced'] if class_weights is None else class_weights
    }
    # Initialize GridSearchCV with 5-fold cross-validation, using the estimator passed in
    grid = GridSearchCV(model, param_grid, cv=5, n_jobs=-1, verbose=2, scoring='accuracy')  # Using accuracy for classification
    grid.fit(X_train, y_train)

    # Best parameters from grid search
    best_params = grid.best_params_
    # print("Best Parameters from Grid Search:", best_params)

    # Evaluate the best model on the test set
    best_model = grid.best_estimator_
    print(best_model)
    test_accuracy = best_model.score(X_test, y_test) * 100  # Accuracy as percentage
    # print(f"Test Accuracy of the Best Model: {test_accuracy:.2f}%")
    y_pred = best_model.predict(X_test)
    print(classification_report(y_test, y_pred))

# test dataset
import sklearn.datasets
X, y = sklearn.datasets.load_digits(return_X_y=True) # load_digits

# with balanced class_weights
model = DecisionTreeClassifier(random_state=42)  # estimator handed to train_test
train_test(model, X, y)
#Fitting 5 folds for each of 90 candidates, totalling 450 fits
#DecisionTreeClassifier(class_weight='balanced', max_depth=10,
#                       min_samples_split=5, random_state=42)
#              precision    recall  f1-score   support
#
#           0       1.00      0.91      0.95        53
#           1       0.89      0.78      0.83        50
#           2       0.88      0.81      0.84        47
#           3       0.78      0.85      0.81        54
#           4       0.85      0.83      0.84        60
#           5       0.92      0.85      0.88        66
#           6       0.88      0.96      0.92        53
#           7       0.82      0.84      0.83        55
#           8       0.78      0.88      0.83        43
#           9       0.76      0.81      0.79        59
#
#    accuracy                           0.85       540
#   macro avg       0.86      0.85      0.85       540
#weighted avg       0.86      0.85      0.85       540

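As you can see, with the balanced class weights the f1-score of class 9 is the lowest (0.79), so you may want to put more emphasis on classifying class 9 correctly. You can do that by giving class 9 a weight 5 or 10 times that of the other classes, as in the following code, and then running the grid search over the different class-weight distributions: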

weights = [{i:1 if i < 9 else 5 for i in range(10)}, {i:1 if i < 9 else 10 for i in range(10)}] # different weight distributions
#print(weights)
train_test(model, X, y, weights)
#Fitting 5 folds for each of 180 candidates, totalling 900 fits
#DecisionTreeClassifier(class_weight={0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1,
#                                     7: 1, 8: 1, 9: 5},
#                       max_depth=9, min_samples_split=5, random_state=42)
#              precision    recall  f1-score   support
#
#           0       0.96      0.94      0.95        53
#           1       0.78      0.78      0.78        50
#           2       0.87      0.85      0.86        47
#           3       0.80      0.72      0.76        54
#           4       0.93      0.85      0.89        60
#           5       0.89      0.89      0.89        66
#           6       0.96      0.85      0.90        53
#           7       0.87      0.87      0.87        55
#           8       0.66      0.86      0.75        43
#           9       0.80      0.86      0.83        59
#
#    accuracy                           0.85       540
#   macro avg       0.85      0.85      0.85       540
#weighted avg       0.86      0.85      0.85       540

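As you can see, the best class-weight setting found by the grid search gives samples of class 9 a weight 5 times that of the other samples, which raises the f1-score of that class to 0.83 (from 0.79).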

You can try the same with your dataset (experiment with the class weights and use your intuition).
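For a two-class problem like the one in the question, the same idea might look like the sketch below. It uses synthetic data from make_classification as a stand-in, because the real dataset is not available here, and the candidate weights and the reduced grid are assumptions chosen only for illustration. Including None in the class_weight grid lets the search fall back to no reweighting if that gives the best cross-validated accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: class 0 is the minority (about 22%)
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.22, 0.78], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Hypothetical candidates: no reweighting, balanced, and 2x / 3x on the minority
class_weight_options = [None, "balanced", {0: 2, 1: 1}, {0: 3, 1: 1}]
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [1, 4],
    "class_weight": class_weight_options,
}

grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
                    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
test_accuracy = grid.best_estimator_.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2f}")
print(classification_report(y_test, grid.best_estimator_.predict(X_test)))
```

Balanced weighting typically raises recall on the minority class at the expense of the majority class, which is consistent with the 0.88 versus 0.61 recall in the question's report. If overall accuracy is the goal, a milder weight or no weight at all may score better, and the grid lets cross-validation decide.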
