How do I split training and test data?

Problem description · votes: 0 · answers: 2
Right now I am using validation_split = 20%. What I understand is that my training data will be 80% and my test data will be 20%. I am confused about how this data is handled on the back end: is it that the top 80% of the samples are used for training and the bottom 20% for testing, or are the samples picked randomly from in between? If I want to provide separate training and test data, how would I do that with model.fit()?

Moreover, my second question is how to check whether the data is fitting the model well. From the results I can see that the training accuracy is around 90% while the validation accuracy is around 55%. Does this mean over-fitting or under-fitting?

My last question is: what does evaluate return? The documentation says it returns the loss, but I already get the loss and accuracy in every epoch (as the return of fit(), in history). What do the accuracy and score returned by evaluate show? If the accuracy returned by evaluate is 90%, can I say my data fits well, regardless of what the individual accuracy and loss of each epoch were?
Below is my code:

import numpy
import pandas
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from keras.utils import np_utils
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
import itertools

seed = 7
numpy.random.seed(seed)

dataframe = pandas.read_csv("INPUTFILE.csv", skiprows=range(0, 0))
dataset = dataframe.values
X = dataset[:, 0:50].astype(float)  # number of cols - 1
Y = dataset[:, 50]

encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
encoded_Y = np_utils.to_categorical(encoded_Y)
print("encoded_Y=", encoded_Y)

# baseline model
def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(5, input_dim=50, kernel_initializer='normal', activation='relu'))  # input_dim must match the 50 features in X
    model.add(Dense(5, kernel_initializer='normal', activation='relu'))
    #model.add(Dense(2, kernel_initializer='normal', activation='sigmoid'))
    model.add(Dense(2, kernel_initializer='normal', activation='softmax'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])  # for binary classification
    #model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])  # for multi class
    return model

model = create_baseline()
history = model.fit(X, encoded_Y, batch_size=50, epochs=500, validation_split=0.2, verbose=1)  # 'epochs' replaces the deprecated 'nb_epoch'

# list all data in history
print(history.history.keys())

# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

pre_cls = model.predict_classes(X)
cm1 = confusion_matrix(encoder.transform(Y), pre_cls)
print('Confusion Matrix : \n')
print(cm1)

score, acc = model.evaluate(X, encoded_Y)
print('Test score:', score)
print('Test accuracy:', acc)

The Keras documentation says: "The validation data is selected from the last samples in the x and y data provided, before shuffling." This means that the shuffle occurs after the split. There is also a boolean parameter called shuffle, which is set to True by default, so if you don't want your data to be shuffled you can set it to False.
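A minimal sketch of how those two knobs interact, assuming an already-compiled Keras model named model and NumPy arrays X and y (hypothetical names):

history = model.fit(
    X, y,
    validation_split=0.2,  # the LAST 20% of X/y is held out for validation
    shuffle=False,         # only affects the order of the training batches
    epochs=10,
)
# Regardless of `shuffle`, validation_split always takes the last samples,
# so the validation set is identical across runs.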

Getting good results on your training data and then bad, or not so good, results on your evaluation data usually means that your model is overfitting. Overfitting is when your model learns in a very specific scenario and can't achieve good results on new data.
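As a sketch of two standard countermeasures (dropout and early stopping), built around the question's 50-feature setup; the exact layer sizes and patience value are assumptions, not a prescription:

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.callbacks import EarlyStopping

model = Sequential([
    Dense(5, input_dim=50, activation='relu'),
    Dropout(0.5),  # randomly silences half the units each step to reduce co-adaptation
    Dense(2, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

early_stop = EarlyStopping(monitor='val_loss', patience=10,
                           restore_best_weights=True)  # stop once validation loss stalls
model.fit(X, encoded_Y, validation_split=0.2, epochs=500,
          callbacks=[early_stop], verbose=0)

A gap such as 90% training accuracy versus 55% validation accuracy, as in the question, is exactly the pattern these tools are meant to shrink.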

Evaluation is to test your model on new data that it has "never seen before". Usually you divide your data into training and test sets, but sometimes you might also want to create a third group of data, because if you just keep adjusting your model to obtain better and better results on your test data, that is in some way like cheating: you are telling your model what the data you are going to use for evaluation looks like, and this could cause overfitting.
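A sketch of such a three-way split using two passes of scikit-learn's train_test_split; the 60/20/20 proportions are just an example:

from sklearn.model_selection import train_test_split

# First carve off a final test set that is never used while tuning.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, encoded_Y, test_size=0.2, random_state=7)

# Then split the remainder into training and validation data.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=7)  # 0.25 * 0.8 = 0.2 overall

model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=500)
score, acc = model.evaluate(X_test, y_test)  # touched exactly once, at the very end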


tensorflow machine-learning keras neural-network
2 Answers
60 votes
scikit-learn provides the train_test_split() function for this. It is easy to use and looks like this:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

No, you cannot use Model.fit() to split the dataset. Model.fit trains the model and returns a History object. Its History.history attribute is a record of training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).
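If you already have a separate test set, Model.fit accepts it directly through the validation_data argument instead of validation_split; a short sketch, with the variable names assumed from the split above:

history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),  # evaluated at the end of every epoch
    batch_size=50, epochs=500,
)
print(history.history['val_acc'])  # per-epoch metrics on the supplied test data
                                   # (the key is 'val_accuracy' in newer Keras)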
0 votes

Keras provides the keras.utils.split_dataset function for this. A comparison of ways to split a dataset into train and test data:
HEADER:

import numpy as np  # used by the Model.fit example below
import sklearn.model_selection
import tensorflow as tf
import keras

# Sets all random seeds (Python, NumPy, and backend framework)
seed = 123456
keras.utils.set_random_seed(seed)

# shared variables
dataset = tf.data.Dataset.range(12, output_type=tf.int8)
TRAIN_SIZE: float = 0.25
DO_SHUFFLE: bool = False

Splitting the data with Keras:
x_train, x_test = keras.utils.split_dataset(
    dataset,
    left_size=TRAIN_SIZE,
    shuffle=DO_SHUFFLE, # default False
    seed=seed
)

print(list(map(int, x_train.as_numpy_iterator())), list(map(int, x_test.as_numpy_iterator())))
#[0, 1, 2] [3, 4, 5, 6, 7, 8, 9, 10, 11]

Splitting the data with scikit-learn:

keras.utils.set_random_seed(seed)
x_train, x_test = sklearn.model_selection.train_test_split(
    list(map(int, dataset.as_numpy_iterator())),  # only Python or NumPy data
    train_size=TRAIN_SIZE,
    shuffle=DO_SHUFFLE,  # default True
    random_state=seed
)

print(x_train, x_test)
#[0, 1, 2] [3, 4, 5, 6, 7, 8, 9, 10, 11]

Implicitly splitting the data at training time with Keras:

model = keras.Sequential([keras.layers.Identity()])
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])  # not important for the example!
history = model.fit(
    *[np.array(list(dataset.as_numpy_iterator()))]*2,  # same array as both x and y
    shuffle=DO_SHUFFLE,  # default True
    validation_split=1. - TRAIN_SIZE  # size of validation is the complement of the training
)
print(history.history)

Output:
{
    'accuracy': [1.0], 'loss': [1.909542441368103], #         results of the training
    'val_accuracy': [1.0], 'val_loss': [133.94410705566406] # results of the validation
}

Model.evaluate returns feedback about how the trained model behaves with its learned parameters, usually measured on data kept apart from the training data:

Scalar test loss (if the model has a single output and no metrics) or list of scalars (if the model has multiple outputs and/or metrics). The attribute model.metrics_names will give you the display labels for the scalar outputs.
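A small sketch of pairing those scalars with their labels via metrics_names, assuming a compiled model and a held-out x_test/y_test (hypothetical names):

results = model.evaluate(x_test, y_test, verbose=0)
for name, value in zip(model.metrics_names, results):
    print(name, '=', value)  # e.g. loss = 0.42, accuracy = 0.88 (illustrative values)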
    
