How do I split training and test data?

Problem description · votes: 0 · answers: 2
Right now I am using validation_split = 20%. What I understand is that my training data will be 80% and my test data will be 20%. I am confused about how this data is handled on the back end: is it that the top 80% of the samples are used for training and the bottom 20% for testing, or are the samples picked randomly from in between? If I want to provide separate training and test data, how would I do that with model.fit()?

Moreover, my second question is how to check whether the data is fitting the model well. From the results I can see that the training accuracy is around 90% while the validation accuracy is around 55%. Does this mean over-fitting or under-fitting?

My last question is: what does evaluate return? The documentation says it returns the loss, but I already get the loss and accuracy in every epoch (as the return of fit(), in history). What do the accuracy and score returned by evaluate show? If the accuracy returned by evaluate is 90%, can I say my data fits well, regardless of what the individual accuracy and loss of each epoch were?
Below is my code:

import numpy
import pandas
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from keras.utils import np_utils
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
import itertools

seed = 7
numpy.random.seed(seed)

dataframe = pandas.read_csv("INPUTFILE.csv", skiprows=range(0, 0))
dataset = dataframe.values
X = dataset[:, 0:50].astype(float)  # number of cols - 1
Y = dataset[:, 50]

encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
encoded_Y = np_utils.to_categorical(encoded_Y)
print("encoded_Y=", encoded_Y)

# baseline model
def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(5, input_dim=50, kernel_initializer='normal', activation='relu'))  # input_dim must match the 50 features in X
    model.add(Dense(5, kernel_initializer='normal', activation='relu'))
    #model.add(Dense(2, kernel_initializer='normal', activation='sigmoid'))
    model.add(Dense(2, kernel_initializer='normal', activation='softmax'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])  # for binary classification
    #model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])  # for multi class
    return model

model = create_baseline()
history = model.fit(X, encoded_Y, batch_size=50, epochs=500, validation_split=0.2, verbose=1)  # 'epochs' replaces the deprecated 'nb_epoch'

# list all data in history
print(history.history.keys())

# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

pre_cls = model.predict_classes(X)
cm1 = confusion_matrix(encoder.transform(Y), pre_cls)
print('Confusion Matrix : \n')
print(cm1)

score, acc = model.evaluate(X, encoded_Y)
print('Test score:', score)
print('Test accuracy:', acc)

The Keras documentation says: "The validation data is selected from the last samples in the x and y data provided, before shuffling." This means that the shuffle occurs after the split. There is also a boolean parameter called shuffle, which is set to True by default, so if you don't want your data to be shuffled you can set it to False.
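A minimal sketch of how those two knobs interact, assuming an already-compiled Keras model named model and NumPy arrays X and y (hypothetical names):

history = model.fit(
    X, y,
    validation_split=0.2,  # the LAST 20% of X/y is held out for validation
    shuffle=False,         # only affects the order of the training batches
    epochs=10,
)
# Regardless of `shuffle`, validation_split always takes the last samples,
# so the validation set is identical across runs.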

Getting good results on your training data and then bad, or not so good, results on your evaluation data usually means that your model is overfitting. Overfitting is when your model learns in a very specific scenario and can't achieve good results on new data.
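As a sketch of two standard countermeasures (dropout and early stopping), built around the question's 50-feature setup; the exact layer sizes and patience value are assumptions, not a prescription:

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import EarlyStopping

model = Sequential([
    Dense(5, input_dim=50, activation='relu'),
    Dropout(0.5),  # randomly silences half the units each step to reduce co-adaptation
    Dense(2, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', patience=10,
                           restore_best_weights=True)  # stop once validation loss stalls
model.fit(X, encoded_Y, validation_split=0.2, epochs=500,
          callbacks=[early_stop], verbose=0)

A gap such as 90% training accuracy versus 55% validation accuracy, as in the question, is exactly the pattern these tools are meant to shrink.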

Evaluation is to test your model on new data that it has "never seen before". Usually you divide your data into training and test sets, but sometimes you might also want to create a third group of data, because if you just keep adjusting your model to obtain better and better results on your test data, that is in some way like cheating: you are telling your model what the data you are going to use for evaluation looks like, and this could cause overfitting.
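A sketch of such a three-way split using two passes of scikit-learn's train_test_split; the 60/20/20 proportions are just an example:

from sklearn.model_selection import train_test_split

# First carve off a final test set that is never used while tuning.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, encoded_Y, test_size=0.2, random_state=7)

# Then split the remainder into training and validation data.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=7)  # 0.25 * 0.8 = 0.2 overall

model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=500)
score, acc = model.evaluate(X_test, y_test)  # touched exactly once, at the very end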


tensorflow machine-learning keras neural-network
2 Answers
60 votes
scikit-learn provides the train_test_split() function for this. It is easy to use and looks like this:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

No, you cannot use Model.fit() to split the dataset. Model.fit trains the model and returns a History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).
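If you already have a separate test set, Model.fit accepts it directly through the validation_data argument instead of validation_split; a short sketch, with the variable names assumed from the split above:

history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),  # evaluated at the end of every epoch
    batch_size=50, epochs=500,
)
print(history.history['val_acc'])  # per-epoch metrics on the supplied test data
                                   # (the key is 'val_accuracy' in newer Keras)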
0 votes

Keras provides the keras.utils.split_dataset function for this. A comparison of ways to split a dataset into train and test data:
HEADER:

import numpy as np  # used by the Model.fit example below
import sklearn.model_selection
import tensorflow as tf
import keras

# Sets all random seeds (Python, NumPy, and backend framework)
seed = 123456
keras.utils.set_random_seed(seed)

# shared variables
dataset = tf.data.Dataset.range(12, output_type=tf.int8)
TRAIN_SIZE: float = 0.25
DO_SHUFFLE: bool = False

Splitting the data with Keras:
x_train, x_test = keras.utils.split_dataset(
    dataset,
    left_size=TRAIN_SIZE,
    shuffle=DO_SHUFFLE, # default False
    seed=seed
)

print(list(map(int, x_train.as_numpy_iterator())), list(map(int, x_test.as_numpy_iterator())))
#[0, 1, 2] [3, 4, 5, 6, 7, 8, 9, 10, 11]

Splitting the data with scikit-learn:

keras.utils.set_random_seed(seed)
x_train, x_test = sklearn.model_selection.train_test_split(
    list(map(int, dataset.as_numpy_iterator())),  # only Python or NumPy data
    train_size=TRAIN_SIZE,
    shuffle=DO_SHUFFLE,  # default True
    random_state=seed
)

print(x_train, x_test)
#[0, 1, 2] [3, 4, 5, 6, 7, 8, 9, 10, 11]

Implicitly splitting the data at training time with Keras:

model = keras.Sequential([keras.layers.Identity()])
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])  # not important for the example!
history = model.fit(
    *[np.array(list(dataset.as_numpy_iterator()))]*2,  # same array as both x and y
    shuffle=DO_SHUFFLE,  # default True
    validation_split=1. - TRAIN_SIZE  # size of validation is the complement of the training
)
print(history.history)

Output:
{
    'accuracy': [1.0], 'loss': [1.909542441368103], #         results of the training
    'val_accuracy': [1.0], 'val_loss': [133.94410705566406] # results of the validation
}

Model.evaluate returns feedback about how the trained model behaves with its learned parameters, usually measured on data kept apart from the training data:

Scalar test loss (if the model has a single output and no metrics) or list of scalars (if the model has multiple outputs and/or metrics). The attribute model.metrics_names will give you the display labels for the scalar outputs.
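A small sketch of pairing those scalars with their labels via metrics_names, assuming a compiled model and a held-out x_test/y_test (hypothetical names):

results = model.evaluate(x_test, y_test, verbose=0)
for name, value in zip(model.metrics_names, results):
    print(name, '=', value)  # e.g. loss = 0.42, accuracy = 0.88 (illustrative values)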
    
