import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
whitewine_data = pd.read_csv('winequality-white.csv',
delimiter=';')
variables = ['alcohol_cat', 'alcohol', 'sulphates', 'density',
'total sulfur dioxide', 'citric acid', 'volatile acidity',
'chlorides']
X = whitewine_data[variables]
y = whitewine_data['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')
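# Values must follow the column order of `variables` above:
# alcohol_cat, alcohol, sulphates, density, total sulfur dioxide,
# citric acid, volatile acidity, chlorides
predictions = model.predict([[0, 8.9, 0.45, 1.001, 170,
0.36, 0.27, 0.045]])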
print(f'Predicted Output: {predictions}')
print(f'Accuracy: {accuracy * 100}%')
print(f'F1 Score: {f1 * 100}%')
This initial model gets an accuracy score of 57%.
===================================================================
import joblib  # needed for joblib.dump below; other imports as in the first script
whitewine_data = pd.read_csv('winequality-white.csv',
delimiter=';')
# Variables to be dropped from the data set - NOT THE INPUT VARIABLES
variables = ['fixed acidity', 'residual sugar', 'free sulfur dioxide',
'pH', 'quality', 'isSweet']
X = whitewine_data.drop(variables, axis=1)
y = whitewine_data['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
joblib.dump(model, 'WhiteWine_Quality_Predictor.joblib')
This creates the saved model.
===================================================================
import joblib  # needed for joblib.load below; other imports as in the first script
whitewine_data = pd.read_csv('winequality-white.csv',
delimiter=';')
variables = ['volatile acidity', 'citric acid', 'chlorides',
'total sulfur dioxide', 'density', 'sulphates', 'alcohol',
'alcohol_cat']
X_test = whitewine_data[variables]
y_test = whitewine_data['quality']
model = joblib.load('WhiteWine_Quality_Predictor.joblib')
y_pred = model.predict(X_test)
f1 = f1_score(y_test, y_pred, average='weighted')
accuracy = accuracy_score(y_test, y_pred)
predictions = model.predict([[0.27, 0.36, 0.045, 170, 1.001,
0.45, 10.9, 3]])
print(f'F1 Score: {f1 * 100}%')
print(f'Model Accuracy: {accuracy * 100}%')
print(f'Predicted Output: {predictions}')
Calling the saved model, the accuracy is now 92%.
Question: why does calling the saved model lead to the increase in accuracy that I am seeing?
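This is a very common mistake when you first start working with ML algorithms.
In the second script, you train the algorithm on the winequality-white.csv dataset and then save it. That is perfectly fine.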
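The problem is that, in the third script, you use the algorithm on exactly the same dataset that was used for training. You are basically predicting observations that the model was trained on, so it is no surprise that it predicts them with close to 100% accuracy. More precisely, the second script fits the model on a random 80% of the rows, so about 80% of the rows scored in the third script were seen during training and about 20% were not. If the seen rows are predicted almost perfectly and the unseen ones at roughly the 57% from your first script, you get about 0.8 × 100% + 0.2 × 57% ≈ 91%, which is close to the 92% you observe.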
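The effect described above can be reproduced without the wine file. The following is a minimal sketch, assuming a synthetic dataset from `make_classification` as a stand-in for winequality-white.csv (the dataset, its parameters, and the `random_state` values are illustrative assumptions, not part of your scripts). It scores the same tree on the training rows, on the held-out rows, and on all rows together, which is what the third script does.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (an assumption): 8 features, 3 noisy classes.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           n_classes=3, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

acc_train = accuracy_score(y_train, model.predict(X_train))  # rows seen in training
acc_test = accuracy_score(y_test, model.predict(X_test))     # rows never seen
acc_full = accuracy_score(y, model.predict(X))               # all rows, as in the third script

print(f'Training rows: {acc_train * 100:.1f}%')
print(f'Held-out rows: {acc_test * 100:.1f}%')
print(f'All rows:      {acc_full * 100:.1f}%')
print(f'0.8 * train + 0.2 * test: {(0.8 * acc_train + 0.2 * acc_test) * 100:.1f}%')
```

The score on all rows is just the weighted average of the other two, so it says more about how much of the evaluation data leaked from training than about how well the model generalizes.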
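Your way of storing the algorithm is correct, but to actually make predictions with it you have to use a different dataset, not the one used for training.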
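One way to make sure the saved model is only scored on unseen rows is to store the held-out split next to it. This is a sketch under assumptions rather than the only way to do it: it again uses synthetic data in place of the wine file, and the dictionary keys and the file name `model_and_holdout.joblib` are hypothetical names chosen for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (an assumption, so the sketch runs without the CSV).
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           n_classes=3, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
acc_before_saving = accuracy_score(y_test, model.predict(X_test))

# Training script: store the model together with the rows it never saw.
path = os.path.join(tempfile.mkdtemp(), 'model_and_holdout.joblib')
joblib.dump({'model': model, 'X_test': X_test, 'y_test': y_test}, path)

# Evaluation script: load everything back and score only the held-out rows.
saved = joblib.load(path)
acc_after_loading = accuracy_score(saved['y_test'],
                                   saved['model'].predict(saved['X_test']))
print(f'Held-out accuracy after loading: {acc_after_loading * 100:.1f}%')
```

An alternative is to pass the same fixed `random_state` to `train_test_split` in both scripts, so the evaluation script can recreate the identical split from the CSV and score only `X_test`.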