Here is my code:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# Prepare target and features
target_column = 'resolution'
X = data.drop(columns=[target_column])
y = data[target_column]
# Convert categorical data to binary/numerical where needed
X_encoded = pd.get_dummies(X) # One-hot encoding for categorical features
le = LabelEncoder()
y_encoded = le.fit_transform(y) # Encode target variable
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y_encoded, test_size=0.3, random_state=42)
# Perform Grid Search to tune max_depth, min_samples_split, min_samples_leaf, and class_weight
param_grid = {
'max_depth': range(1, 11), # Range of max_depth to test
'min_samples_split': [2, 5, 10], # Options for min_samples_split
'min_samples_leaf': [1, 2, 4],  # Options for min_samples_leaf
'class_weight': ['balanced']  # Class-weight setting
}
model = DecisionTreeClassifier(random_state=42)
# Initialize GridSearchCV with 5-fold cross-validation
grid = GridSearchCV(model, param_grid, cv=5, n_jobs=-1, verbose=2, scoring='accuracy') # Using accuracy for classification
grid.fit(X_train, y_train)
# Best parameters from grid search
best_params = grid.best_params_
# print("Best Parameters from Grid Search:", best_params)
# Evaluate the best model on the test set
best_model = grid.best_estimator_
test_accuracy = best_model.score(X_test, y_test) * 100 # Accuracy as percentage
# print(f"Test Accuracy of the Best Model: {test_accuracy:.2f}%")
# Display the structure of the decision tree
# tree_structure = tree.export_text(best_model, feature_names=list(X_encoded.columns))
# print("\nDecision Tree Structure:")
# print(tree_structure)
Here are my results:

Classification report:
              precision    recall  f1-score   support

   Outcome 1       0.39      0.88      0.54       388
   Outcome 2       0.95      0.61      0.75      1397

    accuracy                           0.67      1785
   macro avg       0.67      0.75      0.64      1785
weighted avg       0.83      0.67      0.70      1785

Confusion matrix:
[[343  45]
 [540 857]]

Accuracy for the 'Outcome 1' class: 88.40%
Accuracy for the 'Outcome 2' class: 61.35%
Accuracy of the decision tree classifier: 0.67
I want the overall accuracy, and the accuracy on 'Outcome 2', to be higher. I think the problem is that Outcome 1 is the minority class here, and for some reason the decision tree's accuracy suffers even though it should be balanced.
You can try the class_weight parameter of DecisionTreeClassifier() to change the weights assigned to the different classes and build a cost-sensitive decision tree model (i.e., change the optimization objective by assigning a higher penalty to misclassifications of a particular class), instead of using the 'balanced' class weights.
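For example, a minimal sketch (the weights here are made up for illustration) of a cost-sensitive tree that penalizes errors on class 1 five times more heavily than errors on class 0:

from sklearn.tree import DecisionTreeClassifier

# Made-up weights: misclassifying class 1 costs 5x as much as class 0
clf = DecisionTreeClassifier(class_weight={0: 1, 1: 5}, random_state=42)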
From the sklearn documentation:

    class_weight : dict, list of dict or "balanced", default=None
        Weights associated with classes in the form {class_label: weight}.
        If None, all classes are supposed to have weight one.

        The "balanced" mode uses the values of y to automatically adjust
        weights inversely proportional to class frequencies in the input
        data as n_samples / (n_classes * np.bincount(y)).
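To make that formula concrete, here is a small sketch on a toy label vector (the class counts are made up for illustration):

import numpy as np

y = np.array([0, 0, 0, 1])  # toy labels: class 0 outnumbers class 1 three to one
n_samples, n_classes = len(y), len(np.unique(y))
print(n_samples / (n_classes * np.bincount(y)))  # [0.6667 2.0]: the minority class gets the larger weight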
You can also treat the class weights as a grid (hyper)parameter and tune them with GridSearchCV. Here is an example on sklearn's digits dataset:
# your code as a function:
import sklearn.datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

def train_test(X_encoded, y_encoded, class_weights=None):
    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X_encoded, y_encoded, test_size=0.3, random_state=42)
    # Perform Grid Search to tune max_depth, min_samples_split, min_samples_leaf, and class_weight
    param_grid = {
        'max_depth': range(1, 11),        # Range of max_depth values to test
        'min_samples_split': [2, 5, 10],  # Options for min_samples_split
        'min_samples_leaf': [1, 2, 4],    # Options for min_samples_leaf
        'class_weight': ['balanced'] if class_weights is None else class_weights  # Class-weight distributions to search
    }
    # Initialize GridSearchCV with 5-fold cross-validation
    model = DecisionTreeClassifier(random_state=42)
    grid = GridSearchCV(model, param_grid, cv=5, n_jobs=-1, verbose=2, scoring='accuracy')  # Using accuracy for classification
    grid.fit(X_train, y_train)
    # Best parameters from grid search
    best_params = grid.best_params_
    # print("Best Parameters from Grid Search:", best_params)
    # Evaluate the best model on the test set
    best_model = grid.best_estimator_
    print(best_model)
    test_accuracy = best_model.score(X_test, y_test) * 100  # Accuracy as percentage
    # print(f"Test Accuracy of the Best Model: {test_accuracy:.2f}%")
    y_pred = best_model.predict(X_test)
    print(classification_report(y_test, y_pred))
# test dataset
X, y = sklearn.datasets.load_digits(return_X_y=True)  # load_digits

# with balanced class_weights
train_test(X, y)
#Fitting 5 folds for each of 90 candidates, totalling 450 fits
#DecisionTreeClassifier(class_weight='balanced', max_depth=10,
# min_samples_split=5, random_state=42)
# precision recall f1-score support
#
# 0 1.00 0.91 0.95 53
# 1 0.89 0.78 0.83 50
# 2 0.88 0.81 0.84 47
# 3 0.78 0.85 0.81 54
# 4 0.85 0.83 0.84 60
# 5 0.92 0.85 0.88 66
# 6 0.88 0.96 0.92 53
# 7 0.82 0.84 0.83 55
# 8 0.78 0.88 0.83 43
# 9 0.76 0.81 0.79 59
#
# accuracy 0.85 540
# macro avg 0.86 0.85 0.85 540
#weighted avg 0.86 0.85 0.85 540
As you can see, with the balanced class weights the f1-score of class 9 is the lowest (0.79). So you may want to put more emphasis on the correct classification of class 9. You can do this by assigning class 9 a weight 5 or 10 times larger than the other classes, and then running the grid search over the different class-weight distributions:
weights = [{i:1 if i < 9 else 5 for i in range(10)}, {i:1 if i < 9 else 10 for i in range(10)}] # different weight distributions
#print(weights)
train_test(X, y, weights)
#Fitting 5 folds for each of 180 candidates, totalling 900 fits
#DecisionTreeClassifier(class_weight={0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1,
# 7: 1, 8: 1, 9: 5},
# max_depth=9, min_samples_split=5, random_state=42)
# precision recall f1-score support
#
# 0 0.96 0.94 0.95 53
# 1 0.78 0.78 0.78 50
# 2 0.87 0.85 0.86 47
# 3 0.80 0.72 0.76 54
# 4 0.93 0.85 0.89 60
# 5 0.89 0.89 0.89 66
# 6 0.96 0.85 0.90 53
# 7 0.87 0.87 0.87 55
# 8 0.66 0.86 0.75 43
# 9 0.80 0.86 0.83 59
#
# accuracy 0.85 540
# macro avg 0.85 0.85 0.85 540
#weighted avg 0.86 0.85 0.85 540
As you can see, grid search selected as the best class-weight hyperparameter the distribution that assigns class 9 samples a weight 5 times that of the other classes, raising the f1-score for class 9 to 0.83 (from 0.79).
You can try the same on your own dataset (experiment with the class weights and use your intuition).
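For instance, a hypothetical starting point for your binary case (assuming your LabelEncoder mapped 'Outcome 1' to 0 and 'Outcome 2' to 1; check le.classes_ to confirm) would be to search over a few distributions that weight 'Outcome 2' progressively higher:

# Hypothetical weight grid: progressively upweight 'Outcome 2' (assumed label 1)
weights = [{0: 1, 1: w} for w in (1, 2, 5)]
train_test(X_encoded, y_encoded, weights)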