Let's start with the terms. Remember that the output of the convolutional layer is a rank-4 tensor [B, H, W, C], where B is the batch size, (H, W) is the feature map size, and C is the number of channels. An index (x, y), where 0 <= x < H and 0 <= y < W, is a spatial location.
Usual batchnorm
Now, here's how batchnorm is applied in the usual way (in pseudo-code):
# t is the incoming tensor of shape [B, H, W, C]
# mean and stddev are computed along 0 axis and have shape [H, W, C]
mean = mean(t, axis=0)
stddev = stddev(t, axis=0)
for i in 0..B-1:
    out[i,:,:,:] = norm(t[i,:,:,:], mean, stddev)
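To make this concrete, here is a minimal runnable NumPy sketch of per-location normalization (the function name and the eps parameter are mine; a real batchnorm layer also has learnable scale and shift parameters and tracks running statistics for inference, all omitted here):

import numpy as np

def per_location_batchnorm(t, eps=1e-5):
    # t has shape [B, H, W, C]; every (h, w, c) position is normalized
    # across the batch dimension only
    mean = t.mean(axis=0)               # shape [H, W, C]
    stddev = t.std(axis=0)              # shape [H, W, C]
    return (t - mean) / (stddev + eps)  # eps guards against division by zero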
Basically, it computes H*W*C means and H*W*C standard deviations, each across B elements. Notice that every spatial location has its own mean and variance, gathered from only B values.
Batchnorm in conv layer
This way is totally possible. But the convolutional layer has a special property: filter weights are shared across the input image (you can read about it in detail in this post). That's why it's reasonable to normalize the output in the same way, so that each output value is normalized with the mean and variance of B*H*W values, taken across all spatial locations.
Here's how the code looks in this case (again pseudo-code):
# t is still the incoming tensor of shape [B, H, W, C]
# but mean and stddev are computed along (0, 1, 2) axes and have just [C] shape
mean = mean(t, axis=(0, 1, 2))
stddev = stddev(t, axis=(0, 1, 2))
for i in 0..B-1, x in 0..H-1, y in 0..W-1:
    out[i,x,y,:] = norm(t[i,x,y,:], mean, stddev)
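Again, a minimal runnable NumPy sketch (same caveats as above: the name, eps, and the absence of learnable parameters are simplifications of mine). Note that NumPy broadcasting replaces the explicit loop:

def conv_batchnorm(t, eps=1e-5):
    # t is a NumPy array of shape [B, H, W, C]; one mean/stddev per channel,
    # each computed over all B*H*W values of that channel
    mean = t.mean(axis=(0, 1, 2))       # shape [C]
    stddev = t.std(axis=(0, 1, 2))      # shape [C]
    return (t - mean) / (stddev + eps)  # broadcasts the per-channel stats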
In total, there are only C means and standard deviations, each computed over B*H*W values. That's what they mean when they say "effective mini-batch": the difference between the two variants is only in axis selection (or, equivalently, "mini-batch selection").
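A quick way to see the axis-selection difference is to compare the shapes of the statistics (the tensor dimensions here are illustrative):

import numpy as np

t = np.random.randn(16, 32, 32, 64)   # [B, H, W, C]

# Per-location statistics: one mean per (h, w, c) position
print(t.mean(axis=0).shape)           # (32, 32, 64), i.e. H*W*C means
# Per-channel statistics: one mean per channel
print(t.mean(axis=(0, 1, 2)).shape)   # (64,), i.e. C means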