I'm building a model that converts a string to another string using recurrent layers (GRUs). I have tried both a Dense and a TimeDistributed(Dense) layer as the last-but-one layer, but I don't understand the difference between the two when using return_sequences=True, especially as they seem to have the same number of parameters.
My simplified model is the following:
InputSize = 15
MaxLen = 64
HiddenSize = 16
inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.recurrent.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.TimeDistributed(keras.layers.Dense(InputSize))(x)
predictions = keras.layers.Activation('softmax')(x)
The summary of the network is:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 64, 15) 0
_________________________________________________________________
gru_1 (GRU) (None, 64, 16) 1536
_________________________________________________________________
time_distributed_1 (TimeDist (None, 64, 15) 255
_________________________________________________________________
activation_1 (Activation) (None, 64, 15) 0
=================================================================
This makes sense to me as my understanding of TimeDistributed is that it applies the same layer at all timepoints, and so the Dense layer has 16*15+15=255 parameters (weights+biases).
However, if I switch to a simple Dense layer:
inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.recurrent.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.Dense(InputSize)(x)
predictions = keras.layers.Activation('softmax')(x)
I still only have 255 parameters:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 64, 15) 0
_________________________________________________________________
gru_1 (GRU) (None, 64, 16) 1536
_________________________________________________________________
dense_1 (Dense) (None, 64, 15) 255
_________________________________________________________________
activation_1 (Activation) (None, 64, 15) 0
=================================================================
I wonder if this is because Dense() will only use the last dimension in the shape, and effectively treat everything else as a batch-like dimension. But then I'm no longer sure what the difference is between Dense and TimeDistributed(Dense).
Update Looking at https://github.com/fchollet/keras/blob/master/keras/layers/core.py it does seem that Dense uses the last dimension only to size itself:
def build(self, input_shape):
assert len(input_shape) >= 2
input_dim = input_shape[-1]
self.kernel = self.add_weight(shape=(input_dim, self.units),
It also uses keras.dot to apply the weights:
def call(self, inputs):
output = K.dot(inputs, self.kernel)
The docs of keras.dot imply that it works fine on n-dimensional tensors. I wonder if its exact behavior means that Dense() will in effect be called at every time step. If so, the question still remains what TimeDistributed() achieves in this case.
See Question&Answers more detail:
os