Part of Advances in Neural Information Processing Systems 5 (NIPS 1992)
In this paper we discuss the asymptotic properties of the most commonly used variant of the backpropagation algorithm, in which network weights are trained by means of a local gradient descent on examples drawn randomly from a fixed training set, and the learning rate η of the gradient updates is held constant (simple backpropagation). Using stochastic approximation results, we show that for η → 0 this training process approaches batch training, and we provide results on the rate of convergence. Further, we show that for small η one can approximate simple backpropagation by the sum of a batch training process and a Gaussian diffusion which is the unique solution to a linear stochastic differential equation. Using this approximation we indicate the reasons why simple backpropagation is less likely to get stuck in local minima than the batch training process and demonstrate this empirically on a number of examples.
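The relationship between constant-rate training on randomly drawn examples and batch training can be illustrated with a small sketch. This is not the paper's experimental setup: the one-parameter linear model, the synthetic data, and the names `make_data`, `online_run`, `batch_run`, and `mean_gap` are all assumptions made here for illustration. Both processes are run over the same time horizon (number of steps times the learning rate), and the average distance between the final weights is measured for a larger and a smaller learning rate.

```python
import random


def make_data(n=20, seed=0):
    # Hypothetical fixed training set: y = 2x plus Gaussian noise.
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys


def grad(w, x, y):
    # Gradient of the per-example loss 0.5 * (w*x - y)**2 with respect to w.
    return (w * x - y) * x


def online_run(xs, ys, eta, horizon, rng, w0=0.0):
    # Constant learning rate, one randomly drawn example per update.
    w = w0
    for _ in range(int(round(horizon / eta))):
        i = rng.randrange(len(xs))
        w -= eta * grad(w, xs[i], ys[i])
    return w


def batch_run(xs, ys, eta, horizon, w0=0.0):
    # Batch training: each update uses the gradient averaged over the whole set.
    w = w0
    n = len(xs)
    for _ in range(int(round(horizon / eta))):
        g = sum(grad(w, x, y) for x, y in zip(xs, ys)) / n
        w -= eta * g
    return w


def mean_gap(xs, ys, eta, horizon=5.0, trials=200, seed=1):
    # Average |w_online - w_batch| at the end of the horizon, over many random runs.
    rng = random.Random(seed)
    w_batch = batch_run(xs, ys, eta, horizon)
    total = 0.0
    for _ in range(trials):
        total += abs(online_run(xs, ys, eta, horizon, rng) - w_batch)
    return total / trials


if __name__ == "__main__":
    xs, ys = make_data()
    for eta in (0.2, 0.05, 0.01):
        print(f"eta={eta}: mean gap = {mean_gap(xs, ys, eta):.4f}")
```

In this toy setting the gap shrinks as η decreases, which is consistent with the statement that the process approaches batch training for η → 0. The sketch says nothing about the local-minima behaviour discussed in the abstract, since a quadratic loss has a single minimum.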