In the context of autoencoders the input and output of the model is the same. So, if the input values are in the range [0,1] then it is acceptable to use sigmoid
as the activation function of last layer. Otherwise, you need to use an appropriate activation function for the last layer (e.g. linear
which is the default one).
As for the loss function, it comes back to the values of input data again. If the input data are only between zeros and ones (and not the values between them), then binary_crossentropy
is acceptable as the loss function. Otherwise, you need to use other loss functions such as 'mse'
(i.e. mean squared error) or 'mae'
(i.e. mean absolute error). Note that in the case of input values in range [0,1]
you can use binary_crossentropy
, as it is usually used (e.g. Keras autoencoder tutorial and this paper). However, don't expect that the loss value becomes zero since binary_crossentropy
does not return zero when both prediction and label are not either zero or one (no matter they are equal or not). Here is a video from Hugo Larochelle where he explains the loss functions used in autoencoders (the part about using binary_crossentropy
with inputs in range [0,1] starts at 5:30)
Concretely, in your example, you are using the MNIST dataset. So by default the values of MNIST are integers in the range [0, 255]. Usually you need to normalize them first:
trainX = trainX.astype('float32')
trainX /= 255.
Now the values would be in range [0,1]. So sigmoid
can be used as the activation function and either of binary_crossentropy
or mse
as the loss function.
Why binary_crossentropy
can be used even when the true label values (i.e. ground-truth) are in the range [0,1]?
Note that we are trying to minimize the loss function in training. So if the loss function we have used reaches its minimum value (which may not be necessarily equal to zero) when prediction is equal to true label, then it is an acceptable choice. Let's verify this is the case for binray cross-entropy which is defined as follows:
bce_loss = -y*log(p) - (1-y)*log(1-p)
where y
is the true label and p
is the predicted value. Let's consider y
as fixed and see what value of p
minimizes this function: we need to take the derivative with respect to p
(I have assumed the log
is the natural logarithm function for simplicity of calculations):
bce_loss_derivative = -y*(1/p) - (1-y)*(-1/(1-p)) = 0 =>
-y/p + (1-y)/(1-p) = 0 =>
-y*(1-p) + (1-y)*p = 0 =>
-y + y*p + p - y*p = 0 =>
p - y = 0 => y = p
As you can see binary cross-entropy have the minimum value when y=p
, i.e. when the true label is equal to predicted label and this is exactly what we are looking for.