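My dataset consists of a time series (10080 values) and other descriptive-statistics features (85), concatenated into a single row. The data frame is 921 x 10166.
The data looks like this, with the last 2 columns being Y (the labels).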
id x0 x1 x2 x3 x4 x5 ... x10079 mean var ... Y0 Y1
1 40 31.05 25.5 25.5 25.5 25 ... 33 24 1 1 0
2 35 35.75 36.5 26.5 36.5 36.5 ... 29 31 2 0 1
3 35 35.70 36.5 36.5 36.5 36.5 ... 29 25 1 1 0
4 40 31.50 23.5 24.5 26.5 25 ... 33 29 3 0 1
...
921 40 31.05 25.5 25.5 25.5 25 ... 23 33 2 0 1
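I went through several blogs and tutorials, which were helpful, but I'm not sure how to handle my input data, which I split into inputs_1 and inputs_2, as shown in the model below: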
# Imports assumed (tf.keras); adjust if you use standalone Keras.
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import Conv1D, Dense
from sklearn.model_selection import train_test_split

# Branch 1: convolutional stack over the time series
inputs_1 = keras.Input(shape=(10081,1))
layer1 = Conv1D(64,14)(inputs_1)
layer2 = layers.MaxPool1D(5)(layer1)
layer3 = Conv1D(64, 14)(layer2)
layer4 = layers.GlobalMaxPooling1D()(layer3)
# Branch 2: descriptive statistics, joined with the pooled conv features
inputs_2 = keras.Input(shape=(85,))
layer5 = layers.concatenate([layer4, inputs_2])
layer6 = Dense(128, activation='relu')(layer5)
layer7 = Dense(2, activation='softmax')(layer6)
# The keyword is `outputs`, not `output`
model_2 = keras.models.Model(inputs = [inputs_1, inputs_2], outputs = [layer7])

X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,0:10166], merge[['Result_cat','Result_cat1']].values, test_size=0.2)
X_train = X_train.to_numpy()
X_train = X_train.reshape([X_train.shape[0], X_train.shape[1], 1])
X_train_1 = X_train[:,0:10081,:]
# reshape(-1, 85) avoids hard-coding the number of rows
X_train_2 = X_train[:,10081:10166,:].reshape(-1, 85)
X_test = X_test.to_numpy()
X_test = X_test.reshape([X_test.shape[0], X_test.shape[1], 1])
X_test_1 = X_test[:,0:10081,:]
X_test_2 = X_test[:,10081:10166,:].reshape(-1, 85)

# `learning_rate` replaces the deprecated `lr` argument
adam = keras.optimizers.Adam(learning_rate = 0.0005)
model_2.compile(loss = 'categorical_crossentropy', optimizer = adam, metrics = ['acc'])
history = model_2.fit([X_train_1,X_train_2], y_train, epochs = 120, batch_size = 256, validation_split = 0.2, callbacks = [keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)])
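The reason I split the features into two parts is that inputs_1 is mainly time-series data, while inputs_2 is descriptive statistics. Given the different nature of the data, I thought it best to keep them separate. Please correct me if I'm wrong.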
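For reference, the split described above can be written without reshaping the whole array or hard-coding row counts. This is only a minimal sketch: `split_inputs`, `N_SERIES`, and `N_STATS` are hypothetical names, and the 10081 boundary is taken from the slicing in the code above as-is (it appears to include the id column along with the 10080 series values).

```python
import numpy as np

N_SERIES = 10081  # boundary used in the question's slicing (assumption: id + 10080 values)
N_STATS = 85      # number of descriptive-statistics columns


def split_inputs(features):
    """Split a (n_samples, N_SERIES + N_STATS) array into the two model inputs."""
    features = np.asarray(features, dtype="float32")
    # Add a trailing channel axis only to the series part, as Conv1D expects 3-D input.
    series = features[:, :N_SERIES, np.newaxis]          # (n_samples, 10081, 1)
    # The statistics stay 2-D for the dense branch.
    stats = features[:, N_SERIES:N_SERIES + N_STATS]     # (n_samples, 85)
    return series, stats
```

The same function can then be applied to both the training and the test rows, so the two inputs always stay aligned row by row.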
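My question is: since my feature data is split and processed separately in the original model, should I do the same in cross-validation (handle inputs_1 and inputs_2 separately)? In particular, take Jason's model as an example: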
# MLP for Pima Indians Dataset with 10-fold cross validation
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import StratifiedKFold
import numpy
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load pima indians dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# define 10-fold cross validation test harness
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
cvscores = []
for train, test in kfold.split(X, Y):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # Fit the model
    model.fit(X[train], Y[train], epochs=150, batch_size=10, verbose=0)
    # evaluate the model
    scores = model.evaluate(X[test], Y[test], verbose=0)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
print("%.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores), numpy.std(cvscores)))
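The evaluation is done with the line
scores = model.evaluate(X[test], Y[test], verbose=0)
which uses X[test], Y[test]. In my case, since I have inputs_1 and inputs_2 instead of X (as in the example model), should I use something like [inputs_1,inputs_2][test]?
Any suggestions would be appreciated. Thanks.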
Update:
I tried concatenating inputs_1 and inputs_2 with
con_x = np.concatenate((X_train_1,X_train_2), axis = 1)
and changed the first line of the loop to
for train, test in kfold.split(con_x, Y):
but it raised:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-d53a7058d157> in <module>()
55 cvscores = []
---> 56 for train, test in kfold.split(con_x, Y):
57
58 inputs_1 = keras.Input(shape=(10080,1))
1 frames
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
537 if not allow_nd and array.ndim >= 3:
538 raise ValueError("Found array with dim %d. %s expected <= 2."
--> 539 % (array.ndim, estimator_name))
540 if force_all_finite:
541 _assert_all_finite(array,
ValueError: Found array with dim 3. Estimator expected <= 2.
However, I'm still not sure whether concatenating inputs_1 and inputs_2 like this is valid.
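One way to avoid concatenating at all: scikit-learn's documentation notes that StratifiedKFold builds its folds from y alone, so a placeholder such as np.zeros(n_samples) can stand in for X. The sketch below uses small toy arrays (hypothetical shapes and names, not the real data) and assumes the labels are one-hot rows like Y0 and Y1, which have to be collapsed to 1-D class labels before stratifying.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

n_samples = 40
# Toy stand-ins (hypothetical, shrunk) for the two input arrays and the labels.
X_series = np.zeros((n_samples, 100, 1))          # 3-D, like X_train_1
X_stats = np.zeros((n_samples, 5))                # 2-D, like X_train_2
y_onehot = np.eye(2)[np.arange(n_samples) % 2]    # one-hot rows, like Y0 and Y1

# StratifiedKFold expects 1-D class labels, so collapse the one-hot columns.
y_class = y_onehot.argmax(axis=1)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
placeholder = np.zeros(n_samples)  # only its length matters for the split

folds = []
for train_idx, test_idx in kfold.split(placeholder, y_class):
    # Apply the same row indices to each input array separately.
    train_inputs = [X_series[train_idx], X_stats[train_idx]]
    test_inputs = [X_series[test_idx], X_stats[test_idx]]
    folds.append((train_inputs, test_inputs, y_onehot[train_idx], y_onehot[test_idx]))
```

Because the indices refer to rows, the 3-D series array never has to be passed to split(), and both inputs stay aligned with the labels.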
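This may be outdated by now, but the Keras Sequential API is not recommended for multi-input models. You need to either subclass keras.Model or build the model with the functional API; avoid the Sequential API in this scenario.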
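As a rough sketch of that advice applied to the cross-validation loop, a functional-API model can be rebuilt for every fold and fed the two inputs as a list. `build_model` and `cross_validate` are hypothetical helper names, the layer sizes are copied from the question, and TensorFlow's bundled Keras is assumed.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def build_model(series_len, n_stats):
    """Two-input functional model mirroring the architecture in the question."""
    # Imported here so the CV harness below has no hard TensorFlow dependency.
    from tensorflow import keras
    from tensorflow.keras import layers

    series_in = keras.Input(shape=(series_len, 1))
    x = layers.Conv1D(64, 14)(series_in)
    x = layers.MaxPooling1D(5)(x)
    x = layers.Conv1D(64, 14)(x)
    x = layers.GlobalMaxPooling1D()(x)

    stats_in = keras.Input(shape=(n_stats,))
    merged = layers.concatenate([x, stats_in])
    hidden = layers.Dense(128, activation="relu")(merged)
    out = layers.Dense(2, activation="softmax")(hidden)

    model = keras.Model(inputs=[series_in, stats_in], outputs=out)
    model.compile(loss="categorical_crossentropy",
                  optimizer=keras.optimizers.Adam(learning_rate=0.0005),
                  metrics=["accuracy"])
    return model


def cross_validate(build_fn, X_series, X_stats, y_onehot, n_splits=10, seed=7, **fit_kwargs):
    """Fit a fresh model per fold and return the per-fold accuracy."""
    y_class = np.argmax(y_onehot, axis=1)  # stratification needs 1-D labels
    kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train, test in kfold.split(np.zeros(len(y_class)), y_class):
        model = build_fn()  # new weights for every fold
        model.fit([X_series[train], X_stats[train]], y_onehot[train], **fit_kwargs)
        _, acc = model.evaluate([X_series[test], X_stats[test]], y_onehot[test], verbose=0)
        scores.append(acc)
    return np.array(scores)


# Possible usage (array names are placeholders for the asker's data):
# scores = cross_validate(lambda: build_model(10081, 85),
#                         X_series, X_stats, y_onehot,
#                         epochs=120, batch_size=256, verbose=0)
# print("%.2f%% (+/- %.2f%%)" % (scores.mean() * 100, scores.std() * 100))
```

Rebuilding the model inside the loop matters: reusing one model across folds would carry trained weights from one fold into the next and inflate the scores.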