http://francky.me/faqai.php#otherFAQs :
Subject: What learning rate should be used for
backprop?
In standard backprop, too low a learning rate makes the network learn very slowly. Too high a learning rate
makes the weights and objective function diverge, so there is no learning at all. If the objective function is
quadratic, as in linear models, good learning rates can be computed from the Hessian matrix (Bertsekas and
Tsitsiklis, 1996). If the objective function has many local and global optima, as in typical feedforward NNs
with hidden units, the optimal learning rate often changes dramatically during the training process, since
the Hessian also changes dramatically. Trying to train a NN using a constant learning rate is usually a
tedious process requiring much trial and error. For some examples of how the choice of learning rate and
momentum interact with numerical condition in some very simple networks, see
ftp://ftp.sas.com/pub/neural/illcond/illcond.html
With batch training, there is no need to use a constant learning rate. In fact, there is no reason to use
standard backprop at all, since vastly more efficient, reliable, and convenient batch training algorithms exist
(see Quickprop and RPROP under "What is backprop?" and the numerous training algorithms mentioned
under "What are conjugate gradients, Levenberg-Marquardt, etc.?").
Many other variants of backprop have been invented. Most suffer from the same theoretical flaw as
standard backprop: the magnitude of the change in the weights (the step size) should NOT be a function of
the magnitude of the gradient. In some regions of the weight space, the gradient is small and you need a
large step size; this happens when you initialize a network with small random weights. In other regions of
the weight space, the gradient is small and you need a small step size; this happens when you are close to a
local minimum. Likewise, a large gradient may call for either a small step or a large step. Many algorithms
try to adapt the learning rate, but any algorithm that multiplies the learning rate by the gradient to compute
the change in the weights is likely to produce erratic behavior when the gradient changes abruptly. The
great advantage of Quickprop and RPROP is that they do not have this excessive dependence on the
magnitude of the gradient. Conventional optimization algorithms use not only the gradient but also secondorder derivatives or a line search (or some combination thereof) to obtain a good step size.
With incremental training, it is much more difficult to concoct an algorithm that automatically adjusts the
learning rate during training. Various proposals have appeared in the NN literature, but most of them don't
work. Problems with some of these proposals are illustrated by Darken and Moody (1992), who
unfortunately do not offer a solution. Some promising results are provided by by LeCun, Simard, and
Pearlmutter (1993), and by Orr and Leen (1997), who adapt the momentum rather than the learning rate.
There is also a variant of stochastic approximation called "iterate averaging" or "Polyak averaging"
(Kushner and Yin 1997), which theoretically provides optimal convergence rates by keeping a running
average of the weight values. I have no personal experience with these methods; if you have any solid
evidence that these or other methods of automatically setting the learning rate and/or momentum in
incremental training actually work in a wide variety of NN applications, please inform the FAQ maintainer
([email protected]).
References:
- Bertsekas, D. P. and Tsitsiklis, J. N. (1996), Neuro-Dynamic
Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-10-8.
- Darken, C. and Moody, J. (1992), "Towards faster stochastic gradient
search," in Moody, J.E., Hanson, S.J., and Lippmann, R.P., eds.
- Advances in Neural Information Processing Systems 4, San Mateo, CA:
Morgan Kaufmann Publishers, pp. 1009-1016. Kushner, H.J., and Yin,
G. (1997), Stochastic Approximation Algorithms and Applications, NY:
Springer-Verlag. LeCun, Y., Simard, P.Y., and Pearlmetter, B.
(1993), "Automatic learning rate maximization by online estimation of
the Hessian's eigenvectors," in Hanson, S.J., Cowan, J.D., and Giles,
- C.L. (eds.), Advances in Neural Information Processing Systems 5, San
Mateo, CA: Morgan Kaufmann, pp. 156-163. Orr, G.B. and Leen, T.K.
(1997), "Using curvature information for fast stochastic search," in
- Mozer, M.C., Jordan, M.I., and Petsche, T., (eds.) Advances in Neural
Information Processing Systems 9,Cambridge, MA: The MIT Press, pp.
606-612.
Credits:
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…