Batch normalization is for layers that can suffer from deleterious drift: as training updates the weights, the distribution of values feeding a layer shifts. The math is simple: find the mean and variance of each component across the mini-batch, then convert every value to its Z-score by subtracting the mean and dividing by the standard deviation. This gives the components very similar ranges, so each one has a chance to affect the training deltas (in back-prop).
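Here's a minimal sketch of that Z-score step in plain Python, just to make the arithmetic concrete. The function name `batch_zscores`, the small constant `eps` (there to avoid dividing by zero), and the sample data are my own assumptions for illustration, not anything taken from Caffe.

```python
import math

def batch_zscores(batch, eps=1e-5):
    """Convert each component (column) of a mini-batch to Z-scores.

    `batch` is a list of equal-length rows, one row per example.
    `eps` is an assumed small constant that guards against division
    by zero when a component has zero variance.
    """
    n = len(batch)
    dims = len(batch[0])
    # Per-component mean and (population) variance across the batch.
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n
        for j in range(dims)
    ]
    # Subtract the mean, divide by the standard deviation.
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
        for row in batch
    ]

# Two components on very different scales (made-up numbers).
batch = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
normalized = batch_zscores(batch)
```

After normalization the two columns land on nearly the same range even though the raw values differ by a factor of 100, which is the point of the transformation.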
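If you're using the network purely for inference (no further training), these layers no longer need to compute batch statistics; that part of their job is done. Freeze them to the mean and variance accumulated during training (in Caffe, `use_global_stats: true` in `batch_norm_param`), or fold that fixed scale-and-shift into the adjacent layer's weights. Don't simply delete them without folding, because the downstream weights were trained on normalized inputs. If you're training while testing / predicting / classifying, leave them in place; the operations won't harm your results, and they barely slow down the forward computations.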
As for Caffe specifics, there's really nothing particular to Caffe. The computation is a basic stats process, and is the same algebra for any framework. Granted, there will be some optimizations for hardware that supports vector and matrix math, but those consist of simply taking advantage of the chip's built-in operations.
RESPONSE TO COMMENT
If you can afford a little extra training time, yes, you'd want to normalize at every layer. In practice, inserting the normalization layers less frequently (say, once every one to three inception modules) will work just fine.
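You can stop computing batch statistics in deployment because that work is already done: when there's no back-propagation, there's no drift of weights. Per-batch statistics also stop being meaningful there. When the model handles only one instance in each batch, the batch-computed Z-score is always 0, because every input is exactly the mean of the batch (being the entire batch). That's why deployment relies on the statistics accumulated during training instead.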
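The single-instance case is easy to check by hand or in a few lines of Python. This is only a sketch: the function `normalize`, the `eps` constant, and the "stored" mean and variance values are hypothetical, standing in for whatever statistics were accumulated during training.

```python
import math

def normalize(batch, means=None, variances=None, eps=1e-5):
    """Z-score a batch, using the given statistics if provided,
    otherwise the batch's own per-component mean and variance."""
    n = len(batch)
    dims = len(batch[0])
    if means is None or variances is None:
        means = [sum(row[j] for row in batch) / n for j in range(dims)]
        variances = [
            sum((row[j] - means[j]) ** 2 for row in batch) / n
            for j in range(dims)
        ]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
        for row in batch
    ]

single = [[5.0, -3.0]]  # a batch holding exactly one instance

# Batch statistics: the instance is its own mean, so every Z-score is 0.
from_batch = normalize(single)

# Fixed statistics (hypothetical values, as if saved from training):
# the same instance now keeps its information.
stored_means = [4.0, -1.0]
stored_variances = [1.0, 4.0]
from_stored = normalize(single, stored_means, stored_variances)
```

With batch statistics the output is all zeros regardless of the input, while the fixed statistics give roughly 1 and -1 for this instance.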