Consider these two architectures:
prev_layer -> dropout 1.0 -> next_layer (output layer)
prev_layer -> stop_gradient -> next_layer (output layer)
When gradients flow from the output layer back toward the input, both should behave the same way: the weights of prev_layer do not get updated. So what is the difference?
I verified this with the following code:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Dropout, Lambda
from tensorflow.keras.models import Model

input_layer = Input(shape=(1,))
prev_layer = Dense(32, activation='relu')(input_layer)
dropout_layer = Dropout(.99999)(prev_layer)  # a rate this close to 1 effectively disables the layer during training
output_layer = Dense(1, activation='linear')(dropout_layer)
model_dropout = Model(inputs=input_layer, outputs=output_layer)
model_dropout.compile(optimizer='adam', loss='mse')
input_layer = Input(shape=(1,))
prev_layer = Dense(32, activation='relu')(input_layer)
stop_gradient_layer = Lambda(lambda x: tf.stop_gradient(x))(prev_layer)
output_layer = Dense(1, activation='linear')(stop_gradient_layer)
model_stopgradient = Model(inputs=input_layer, outputs=output_layer)
model_stopgradient.compile(optimizer='adam', loss='mse')
Train them:
before_train_dropout = model_dropout.layers[1].get_weights()
before_train_stopgradient = model_stopgradient.layers[1].get_weights()
X_dummy = np.random.rand(5, 1)
y_dummy = np.random.rand(5, 1)
model_dropout.fit(X_dummy, y_dummy, epochs=50, verbose=0)
model_stopgradient.fit(X_dummy, y_dummy, epochs=50, verbose=0)
after_train_dropout = model_dropout.layers[1].get_weights()
after_train_stopgradient = model_stopgradient.layers[1].get_weights()
# Check whether the prev_layer weights are unchanged after training
print('weight')
print(np.array_equal(before_train_dropout[0], after_train_dropout[0]))
print(np.array_equal(before_train_stopgradient[0], after_train_stopgradient[0]))
print('bias')
print(np.array_equal(before_train_dropout[1], after_train_dropout[1]))
print(np.array_equal(before_train_stopgradient[1], after_train_stopgradient[1]))
It returns:
weight
True
True
bias
True
True
So, when should I use Dropout(1.0) and when should I use stop_gradient?
Although they do the same thing on the backward pass, they behave very differently on the forward pass.
The stop_gradient layer passes its input tensor through to the output unchanged.
The Dropout layer zeroes elements with the given probability; with rate=1, every output is zero. Also note that Dropout is only active during training, so at inference time it too passes the input through unchanged.
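A minimal sketch of the forward-pass difference (assuming TensorFlow 2.x eager execution; the tensor x and the 0.99999 rate are just illustrative values):

import numpy as np
import tensorflow as tf

x = tf.constant(np.random.rand(1, 4), dtype=tf.float32)
dropout = tf.keras.layers.Dropout(0.99999)

print(dropout(x, training=True).numpy())   # almost surely all zeros: practically every unit is dropped
print(dropout(x, training=False).numpy())  # equal to x: Dropout is a no-op at inference
print(tf.stop_gradient(x).numpy())         # equal to x in both modes; only the gradient is cut

So the two are interchangeable only with respect to the gradient reaching prev_layer; during training, the input that the output layer actually sees is very different.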