Is it normal for one class in a multiclass SVM to have almost all data points as support vectors (scikit-learn)?


I am using scikit-learn's SVC for multiclass classification on the iris dataset, and one class ends up with almost all of its data points as support vectors. Is this expected, or could there be a problem with my implementation or parameter settings?

import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

# Read the Excel file
df = pd.read_excel("Classification iris.xlsx", index_col="instance_id")

# Split the data into training and test data
train_data = pd.concat([
    df.loc[1:35],
    df.loc[51:85],
    df.loc[101:135]
])

test_data = pd.concat([
    df.loc[36:50],
    df.loc[86:100],
    df.loc[136:150]
])

# Separate the features and target variables
train_features = train_data.drop(columns=['class'])
train_target = train_data['class']
test_features = test_data.drop(columns=['class'])
test_target = test_data['class']

# Build the Support Vector Machine with One-vs-Rest strategy
svm = OneVsRestClassifier(SVC(kernel='linear', C=1e5))
svm.fit(train_features, train_target)

# Make predictions
targets_train_pred = svm.predict(train_features)
targets_test_pred = svm.predict(test_features)

# Calculate total errors
total_training_error = sum(targets_train_pred != train_target) / len(train_data)
total_testing_error = sum(targets_test_pred != test_target) / len(test_data)

# Print the total training and testing errors
print(f"Q2.2.2 Calculation using Standard SVM Model:\ntotal training error: {total_training_error:.10f}, total testing error: {total_testing_error:.10f}")

# Initialize lists to store results
classes = ['setosa', 'versicolor', 'virginica']
linear_separable = []

for i, class_name in enumerate(classes):
    # Get the binary classifier for the current class
    classifier = svm.estimators_[i]
    
    # Calculate the errors for each class
    train_class_error = sum((targets_train_pred == class_name) != (train_target == class_name)) / len(train_data)
    test_class_error = sum((targets_test_pred == class_name) != (test_target == class_name)) / len(test_data)
    
    # Get the weight vector w and bias b
    w = classifier.coef_[0]
    b = classifier.intercept_[0]
    support_vector_indices = classifier.support_

    # Print the results for each class in the proper format
    print(f"\nclass {class_name}:")
    print(f"training error: {train_class_error:.10f}, testing error: {test_class_error:.10f}")
    print(f"w: [{', '.join([f'{x:.10f}' for x in w])}]")
    print(f"b: {b:.10f}")
    print(f"support vector indices: {support_vector_indices.tolist()}")
    
    # Check if the class is linearly separable
    if train_class_error == 0:
        linear_separable.append(class_name)

print(f"\nLinear separable classes: {', '.join(linear_separable)}")

Here is the output:

Q2.2.2 Calculation using Standard SVM Model:
total training error: 0.0571428571, total testing error: 0.0888888889

class setosa:
training error: 0.0000000000, testing error: 0.0000000000
w: [0.0097327108, 0.5377790363, -0.8273513712, -0.3820427629]
b: 0.7734548984
support vector indices: [42, 23, 24]

class versicolor:
training error: 0.0000000000, testing error: 0.0000000000
w: [1.8485997715, -4.5023738999, -1.1043393026, 0.3212849949]
b: 5.6773042297
support vector indices: [1, 2, 8, 9, 13, 25, 27, 34, 71, 72, 73, 77, 78, 80, 81, 86, 88, 89, 90, 91, 92, 93, 96, 99, 100, 102, 103, 104, 35, 36, 37, 39, 40, 41, 42, 43, 44, 46, 47, 48, 49, 50, 52, 55, 58, 59, 60, 61, 62, 63, 65, 66, 67, 68, 69]

class virginica:
training error: 0.0000000000, testing error: 0.0000000000
w: [-3.6465034341, -5.1763639697, 7.4285254512, 11.0024158268]
b: -17.5703922240
support vector indices: [55, 57, 62, 68, 96, 97, 99, 103]

Linear separable classes: setosa, versicolor, virginica

Since versicolor only has 35 training data points, based on my theoretical understanding of SVMs I did not expect all of them to end up as support vectors. The per-class training errors also look a bit strange to me now.
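For reference, here is a quick check I could run (just a sketch, reusing svm, classes, and train_target from the code above, and assuming the 'class' column really contains the labels 'setosa', 'versicolor', 'virginica' and that classes is in the same alphabetical order as svm.classes_, as the loop above already assumes). It breaks down how many of each binary estimator's support vectors come from the positive class:

# For each one-vs-rest estimator, count how many of its support vectors
# belong to the positive class versus the "rest" classes.
for class_name, clf in zip(classes, svm.estimators_):
    sv_labels = train_target.iloc[clf.support_]   # labels of this estimator's support vectors
    n_pos = int((sv_labels == class_name).sum())
    print(f"{class_name}: {len(clf.support_)} support vectors in total, "
          f"{n_pos} from the positive class, n_support_ = {clf.n_support_}")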

Thank you very much for your help!

python machine-learning scikit-learn classification svm
1 Answer

Have you tried lowering the value of C in

svm = OneVsRestClassifier(SVC(kernel='linear', C=1e5))

Perhaps you are overfitting the data because of this (rather high?) value.
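As a rough illustration (a sketch only, reusing the train_features / train_target split from the question), you could compare how many support vectors each one-vs-rest estimator keeps for a few values of C:

from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

# Compare support-vector counts for a few values of C
# (train_features / train_target are the training split from the question).
for C in (1e5, 1.0, 0.01):
    model = OneVsRestClassifier(SVC(kernel='linear', C=C))
    model.fit(train_features, train_target)
    counts = [int(est.n_support_.sum()) for est in model.estimators_]
    print(f"C={C:g}: support vectors per binary estimator = {counts}")

Printing the counts side by side makes it easy to see how sensitive the fitted model is to the choice of C.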
