At first glance you can see this log section divided into two: [Forward] and [Backward]. Recall that neural network training is done via forward-backward propagation:
A training example (batch) is fed to the net and a forward pass outputs the current prediction.
Based on this prediction a loss is computed.
The loss is then differentiated, and the resulting gradient estimate is propagated backward using the chain rule.
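The three steps above can be sketched on a toy one-weight model. This is only an illustration: the model (prediction = w * x), the squared-error loss, and all names are assumptions made for the example and have nothing to do with Caffe's own code.

```python
# Toy model (assumed for illustration): prediction = w * x,
# loss = 0.5 * (prediction - y) ** 2

def forward(w, x):
    # Forward pass: compute the current prediction.
    return w * x

def loss_fn(pred, y):
    # Loss computed from the prediction.
    return 0.5 * (pred - y) ** 2

def backward(w, x, y):
    # Backward pass: chain rule, dloss/dw = dloss/dpred * dpred/dw.
    pred = forward(w, x)
    dloss_dpred = pred - y
    dpred_dw = x
    return dloss_dpred * dpred_dw

w, x, y = 0.5, 2.0, 3.0
pred = forward(w, x)
loss = loss_fn(pred, y)
grad = backward(w, x, y)
print(pred, loss, grad)
```

The gradient is negative here, so a gradient-descent step would increase w and move the prediction toward the target.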
Caffe Blob data structure
Just a quick recap. Caffe uses the Blob data structure to store data/weights/parameters etc. For this discussion it is important to note that a Blob has two "parts": data and diff. The values of the Blob are stored in the data part. The diff part is used to store element-wise gradients for the backpropagation step.
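A minimal sketch of the data/diff split, to make the two "parts" concrete. ToyBlob is a hypothetical stand-in written for this example, not Caffe's Blob class, and the plain gradient-descent update with a fixed learning rate is an assumption for illustration.

```python
class ToyBlob:
    """Hypothetical stand-in for a Blob: values in `data`, gradients in `diff`."""

    def __init__(self, values):
        self.data = list(values)
        # One gradient entry per value, initially zero.
        self.diff = [0.0] * len(self.data)

    def sgd_step(self, lr):
        # Plain gradient descent (assumed): data <- data - lr * diff.
        self.data = [d - lr * g for d, g in zip(self.data, self.diff)]


blob = ToyBlob([1.0, -2.0])
blob.diff = [0.5, -1.0]  # pretend the backward pass filled these in
blob.sgd_step(lr=0.5)
print(blob.data)
```

The point is only that the forward pass reads and writes data, the backward pass fills diff, and an update combines the two.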
Forward pass
You will see all the layers from bottom to top listed in this part of the log. For each layer you'll see:
I1109 ...] [Forward] Layer conv1, top blob conv1 data: 0.0645037
I1109 ...] [Forward] Layer conv1, param blob 0 data: 0.00899114
I1109 ...] [Forward] Layer conv1, param blob 1 data: 0
Layer "conv1" is a convolution layer that has 2 param blobs: the filters and the bias. Consequently, the log has three lines. The filter blob (param blob 0) has data
I1109 ...] [Forward] Layer conv1, param blob 0 data: 0.00899114
That is, the mean absolute value (the sum of absolute values divided by the number of elements) of the convolution filter weights is currently 0.00899.
The current bias (param blob 1):
I1109 ...] [Forward] Layer conv1, param blob 1 data: 0
meaning that currently the bias is set to 0.
Last but not least, the "conv1" layer has an output, "top", named "conv1" (how original...). The mean absolute value of the output is
I1109 ...] [Forward] Layer conv1, top blob conv1 data: 0.0645037
Note that all values for the [Forward] pass are reported on the data part of the Blobs in question.
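If you want to track these numbers over many iterations, it can help to pull them out of the log programmatically. The sketch below is one possible way to do it; the regular expression is inferred only from the sample lines shown above (not from Caffe's source), and parse_debug_line is a hypothetical helper name.

```python
import re

# Pattern inferred from the sample log lines above (an assumption).
LINE_RE = re.compile(
    r"\[(?P<phase>Forward|Backward)\] Layer (?P<layer>\S+), "
    r"(?P<kind>top blob|bottom blob|param blob) (?P<name>\S+) "
    r"(?P<part>data|diff): (?P<value>\S+)"
)

def parse_debug_line(line):
    """Return a dict describing one per-blob debug line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["value"] = float(rec["value"])  # float() also accepts "nan" and "inf"
    return rec

rec = parse_debug_line(
    "I1109 ...] [Forward] Layer conv1, param blob 0 data: 0.00899114"
)
print(rec)
```

Each matching line becomes a record with the phase, layer, blob kind, blob name, part (data or diff), and the reported value; lines in any other format are skipped by returning None.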
Loss and gradient
At the end of the [Forward] pass comes the loss layer:
I1109 ...] [Forward] Layer loss, top blob loss data: 2031.85
I1109 ...] [Backward] Layer loss, bottom blob fc1 diff: 0.124506
In this example the batch loss is 2031.85. The gradient of the loss w.r.t. fc1 is computed and passed to the diff part of the fc1 Blob. The mean absolute value of the gradient is 0.1245.
Backward pass
All the rest of the layers are listed in this part, top to bottom. You can see that the mean absolute values reported now are of the diff part of the Blobs (params and layers' inputs).
Finally
The last log line of this iteration:
[Backward] All net params (data, diff): L1 norm = (2711.42, 7086.66); L2 norm = (6.11659, 4085.07)
reports the total L1 and L2 magnitudes of both data and gradients.
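For reference, here is a small sketch of how such totals can be computed, assuming the usual definitions: the L1 norm is the sum of absolute values and the L2 norm is the square root of the sum of squares, both taken over all parameters together. The function name net_norms and the toy values are made up for the example.

```python
import math

def net_norms(param_blobs):
    """Return (L1, L2) over all values in a list of flat parameter lists."""
    flat = [v for blob in param_blobs for v in blob]
    l1 = sum(abs(v) for v in flat)
    l2 = math.sqrt(sum(v * v for v in flat))
    return l1, l2

# Two toy "param blobs": weights and a bias.
l1, l2 = net_norms([[3.0, -4.0], [0.0]])
print(l1, l2)
```

Applying the same function once to the data values and once to the diff values gives the two pairs that appear in the log line.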
What should I look for?
If you have nans in your loss, see at what point your data or diff turns into nan: at which layer? at which iteration?
Look at the gradient magnitudes; they should be reasonable. If you are starting to see values with e+8, your data/gradients are starting to blow up. Decrease your learning rate!
See that the diffs are not zero. Zero diffs mean no gradients = no updates = no learning. If you started from random weights, consider generating random weights with higher variance.
Look for activations (rather than gradients) going to zero. If you are using "ReLU", this means your inputs/weights lead you to regions where the ReLU gates are "not active", leading to "dead neurons". Consider normalizing your inputs to have zero mean, adding "BatchNorm" layers, or setting negative_slope in ReLU.
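The checks in this list can be turned into a rough automated filter over the reported values. This is a sketch under assumptions: check_value is a hypothetical helper, the 1e8 threshold simply mirrors the "e+8" rule of thumb above, and zero data is flagged only for top blobs, since a param blob such as a bias can legitimately be 0.

```python
import math

def check_value(kind, part, value, blowup=1e8):
    """Return a list of warnings for one magnitude reported in the debug log."""
    if math.isnan(value) or math.isinf(value):
        return ["non-finite value: note the layer and iteration where it first appears"]
    if abs(value) >= blowup:
        return ["magnitude is blowing up: consider decreasing the learning rate"]
    if value == 0.0 and part == "diff":
        return ["zero diff: no gradient reaches this blob, so nothing is learned"]
    if value == 0.0 and part == "data" and kind == "top blob":
        return ["zero activations: possibly dead ReLU units"]
    return []

print(check_value("top blob", "data", float("nan")))
print(check_value("param blob", "diff", 3e9))
print(check_value("param blob", "data", 0.0))
```

The thresholds are deliberately crude; what counts as a "reasonable" magnitude depends on the network and the loss scale, so treat the warnings as prompts to look at the log rather than as a verdict.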