When experimenting with machine learning on a dataset, I get identical metric results, such as accuracy and F1 score, from different machine learning algorithms.
I have a dataset that I use to train the algorithms I chose. I found it on Kaggle: source.
Here are the code snippets from the Jupyter notebook and the results of running them:
In:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from sklearn.metrics import accuracy_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
import joblib
import tensorflow as tf
import numpy as np
from tensorflow.keras import models, layers
import warnings
warnings.filterwarnings('ignore')
In:
df = pd.read_csv("payload_mini.csv",encoding='utf-16')
df.head(10)
In:
df = pd.read_csv("payload_mini.csv",encoding='utf-16')
df = df[(df['attack_type'] == 'sqli') | (df['attack_type'] == 'norm')]
X = df['payload']
y = df['label']
vectorizer = CountVectorizer(min_df = 2, max_df = 0.8, stop_words = stopwords.words('english'))
X = vectorizer.fit_transform(X.values.astype('U')).toarray()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
Out:
(8040, 1585)
(8040,)
(2011, 1585)
(2011,)
In:
nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)
y_pred = nb_clf.predict(X_test)
print(f"Accuracy of Naive Bayes on test set : {accuracy_score(y_pred, y_test)}")
print(f"F1 Score of Naive Bayes on test set : {f1_score(y_pred, y_test, pos_label='anom')}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Out:
Accuracy of Naive Bayes on test set : 0.9806066633515664
F1 Score of Naive Bayes on test set : 0.9735234215885948
Classification Report:
              precision    recall  f1-score   support

        anom       0.97      0.98      0.97       732
        norm       0.99      0.98      0.98      1279

    accuracy                           0.98      2011
   macro avg       0.98      0.98      0.98      2011
weighted avg       0.98      0.98      0.98      2011
In:
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)
print(f"Accuracy of Random Forest on test set : {accuracy_score(y_pred, y_test)}")
print(f"F1 Score of Random Forest on test set : {f1_score(y_pred, y_test, pos_label='anom')}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf))
Out:
Accuracy of Random Forest on test set : 0.9806066633515664
F1 Score of Random Forest on test set : 0.9735234215885948
Classification Report:
              precision    recall  f1-score   support

        anom       1.00      0.96      0.98       732
        norm       0.98      1.00      0.99      1279

    accuracy                           0.99      2011
   macro avg       0.99      0.98      0.99      2011
weighted avg       0.99      0.99      0.99      2011
In:
svm_clf = SVC(gamma = 'auto')
svm_clf.fit(X_train, y_train)
y_pred = svm_clf.predict(X_test)
print(f"Accuracy of SVM on test set : {accuracy_score(y_pred, y_test)}")
print(f"F1 Score of SVM on test set: {f1_score(y_pred, y_test, pos_label='anom')}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Out:
Accuracy of SVM on test set : 0.9189457981103928
F1 Score of SVM on test set: 0.8658436213991769
Classification Report:
              precision    recall  f1-score   support

        anom       1.00      0.76      0.87       689
        norm       0.89      1.00      0.94      1322

    accuracy                           0.92      2011
   macro avg       0.95      0.88      0.90      2011
weighted avg       0.93      0.92      0.92      2011
As you can see, training with different machine learning algorithms gives the same results for the Random Forest and the Naive Bayes classifier. I hope you can help me fix whatever mistake there may be in the code, or improve it in some way.
In the Random Forest code you store the predictions as y_pred_rf, but you call the metrics on y_pred, which still holds the Naive Bayes predictions, so the Random Forest cell just prints the Naive Bayes numbers again.
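A minimal fix, assuming the X_train/X_test/y_train/y_test variables from your train_test_split cell are still in scope, is to score y_pred_rf instead and to pass the arguments in scikit-learn's documented (y_true, y_pred) order:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report

rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)

# Score the Random Forest predictions (y_pred_rf), not the earlier Naive Bayes
# ones (y_pred), and use the (y_true, y_pred) order scikit-learn's metrics expect.
print(f"Accuracy of Random Forest on test set : {accuracy_score(y_test, y_pred_rf)}")
print(f"F1 Score of Random Forest on test set : {f1_score(y_test, y_pred_rf, pos_label='anom')}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf))

Note that the classification_report call in your Random Forest cell already uses y_pred_rf, which is why its per-class numbers differ from the accuracy and F1 lines printed just above it.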