My understanding of validation_split is that 80% of my training data will be used for training and 20% for testing. What I am confused about is how this data is dealt with on the back end: is it like the top 80% of the samples are taken for training and the bottom 20% for testing, or are the samples picked randomly from in between? If I want to provide separate training and testing data, how would I do that with model.fit()?

Moreover, my second question is how to check whether the data is fitting the model well. From the results I can see that the training accuracy is around 90% while the validation accuracy is around 55%. Does this mean it is a case of over-fitting or under-fitting?

My last question is: what does evaluate return? The documentation says it returns the loss, but I am already getting loss and accuracy during each epoch (as a return of fit(), in history). What do the accuracy and score returned by evaluate show? If the accuracy returned by evaluate is 90%, can I say my data is fitting well, regardless of what the individual accuracy and loss were for each epoch?

Below is my code:
import numpy
import pandas
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
seed = 7
numpy.random.seed(seed)
dataframe = pandas.read_csv("INPUTFILE.csv", skiprows=range(0, 0))
dataset = dataframe.values
X = dataset[:,0:50].astype(float) # first 50 columns are the features (number of cols - 1)
Y = dataset[:,50] # last column is the label
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
encoded_Y = to_categorical(encoded_Y)
print("encoded_Y=", encoded_Y)
# baseline model
def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(5, input_dim=50, kernel_initializer='normal', activation='relu')) # input_dim must match the 50 input features
    model.add(Dense(5, kernel_initializer='normal', activation='relu'))
    #model.add(Dense(2, kernel_initializer='normal', activation='sigmoid'))
    model.add(Dense(2, kernel_initializer='normal', activation='softmax'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) # for binary classification
    #model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) # for multi-class
    return model
model = create_baseline()
history = model.fit(X, encoded_Y, batch_size=50, epochs=500, validation_split=0.2, verbose=1)
# list all data in history
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
pre_cls = numpy.argmax(model.predict(X), axis=1) # predict_classes was removed in recent Keras
cm1 = confusion_matrix(encoder.transform(Y),pre_cls)
print('Confusion Matrix : \n')
print(cm1)
score, acc = model.evaluate(X,encoded_Y)
print('Test score:', score)
print('Test accuracy:', acc)
The Keras documentation says: "The validation data is selected from the last samples in the x and y data provided, before shuffling.", which means that the shuffle occurs after the split. There is also a boolean parameter called "shuffle" which is set to true by default, so if you don't want your data to be shuffled you could set it to false.
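As a minimal sketch against the fit() call from the question, that behaviour looks like this:

# the last 20% of X / encoded_Y always becomes the validation set;
# shuffle=False only stops the remaining 80% of training samples from
# being reshuffled between epochs - it does not change which samples
# end up in the validation set
history = model.fit(X, encoded_Y, validation_split=0.2, shuffle=False)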
Getting good results on your training data and then getting bad or not-so-good results on your evaluation data usually means that your model is overfitting. Overfitting is when your model learns in a very specific scenario and can't achieve good results on new data.
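A common way to react to that gap between training and validation accuracy is to stop training as soon as the validation metric stops improving. A minimal sketch using Keras' built-in EarlyStopping callback against the fit() call from the question (the patience value of 20 is just an illustrative choice):

from keras.callbacks import EarlyStopping

# stop once val_loss has not improved for 20 consecutive epochs,
# and roll back to the weights of the best epoch seen so far
early_stop = EarlyStopping(monitor='val_loss', patience=20, restore_best_weights=True)

history = model.fit(X, encoded_Y, batch_size=50, epochs=500,
                    validation_split=0.2, callbacks=[early_stop], verbose=1)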
Evaluation is to test your model on new data that it has "never seen before". Usually you divide your data into training and test sets, but sometimes you might also want to create a third group of data: if you just keep adjusting your model to obtain better and better results on your test data, that is in some way like cheating, because you are telling your model what the data you are going to use for evaluation looks like, and this could cause overfitting.
train_test_split() is easy to use and it looks like this:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
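To also answer the question about providing separate training and testing data: instead of validation_split you can hand the explicit split to model.fit() through its validation_data argument. A minimal sketch reusing the variables from the train_test_split() call above (it assumes y has already been one-hot encoded, as in the question's code):

# train on the explicit training set; validate on the held-out set each epoch
history = model.fit(X_train, y_train, batch_size=50, epochs=500,
                    validation_data=(X_test, y_test), verbose=1)

# then evaluate on the held-out data once training has finished
score, acc = model.evaluate(X_test, y_test)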
Model.fit trains the model and returns a History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).
Shared setup for the two examples below:

import numpy as np
import tensorflow as tf
import keras

# Sets all random seeds (Python, NumPy, and backend framework)
seed = 123456
keras.utils.set_random_seed(seed)

# shared variables
dataset = tf.data.Dataset.range(12, output_type=tf.int8)
TRAIN_SIZE: float = 0.25
DO_SHUFFLE: bool = False

Explicit split of the data with Keras, similar to scikit-learn's train_test_split():

x_train, x_test = keras.utils.split_dataset(
    dataset,
    left_size=TRAIN_SIZE,
    shuffle=DO_SHUFFLE,  # default False
    seed=seed
)
print(list(map(int, x_train.as_numpy_iterator())),
      list(map(int, x_test.as_numpy_iterator())))
# [0, 1, 2] [3, 4, 5, 6, 7, 8, 9, 10, 11]

Implicit split of the data at training time with Keras:
model = keras.Sequential([keras.layers.Identity()])
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy']) # not important for the example!
history = model.fit(
    *[np.array(list(dataset.as_numpy_iterator()))]*2,  # same array used as both x and y
    shuffle=DO_SHUFFLE,  # default True
    validation_split=1. - TRAIN_SIZE  # size of validation is the complement of the training
)
print(history.history)
Output:
{
'accuracy': [1.0], 'loss': [1.909542441368103], # results of the training
'val_accuracy': [1.0], 'val_loss': [133.94410705566406] # results of the validation
}
Model.evaluate returns feedback on how the trained model behaves with the parameters it learned from the training data:

Scalar test loss (if the model has a single output and no metrics) or list of scalars (if the model has multiple outputs and/or metrics). The attribute model.metrics_names will give you the display labels for the scalar outputs.
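A minimal sketch of reading those values back, assuming x_test and y_test are a held-out split like the one produced by train_test_split() earlier:

# evaluate returns one scalar per name in model.metrics_names,
# e.g. ['loss', 'accuracy'] for a model compiled with metrics=['accuracy']
results = model.evaluate(x_test, y_test, verbose=0)
for name, value in zip(model.metrics_names, results):
    print(name, '=', value)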