{"title": "Second Order Properties of Error Surfaces: Learning Time and Generalization", "book": "Advances in Neural Information Processing Systems", "page_first": 918, "page_last": 924, "abstract": null, "full_text": "Second Order Properties of Error Surfaces : \n\nLearning Time and Generalization \n\nYann Le Cun \nAT &T Bell Laboratories \nCrawfords Corner Rd. \nHolmdel, NJ 07733, USA \n\nIdo Kanter \n\nRamat Gan, 52100 Israel \n\nDepartment of Physics \n\nBar Ilan University \n\nSara A. Sona \nAT&T Bell Laboratories \nCrawfords Corner Rd. \nHolmdel, NJ 07733, USA \n\nAbstract \n\nThe learning time of a simple neural network model is obtained through an \nanalytic computation of the eigenvalue spectrum for the Hessian matrix, \nwhich describes the second order properties of the cost function in the \nspace of coupling coefficients. The form of the eigenvalue distribution \nsuggests new techniques for accelerating the learning process, and provides \na theoretical justification for the choice of centered versus biased state \nvariables. \n\n1 \n\nINTRODUCTION \n\nConsider the class of learning algorithms which explore a space {W} of possible \ncouplings looking for optimal values W\u00b7 for which a cost function E(W) is minimal. \nThe dynamical properties of searches based on gradient descent are controlled by \nthe second order properties of the E(W) surface. An analytic investigation of such \nproperties provides a characterization of the time scales involved in the relaxation \nto the solution W\u00b7. \n\nThe discussion focuses on layered networks with no feedback, a class of architectures \nremarkably successful at perceptual tasks such as speech and image recognition. \nWe derive rigorous results for the learning time of a single linear unit, and discuss \ntheir generalization to multi-layer nonlinear networks. 
Causes for the slowest time constants are identified, and specific prescriptions to eliminate their effect result in practical methods to accelerate convergence.

2 LEARNING BY GRADIENT DESCENT

Multi-layer networks are composed of model neurons interconnected through a feedforward graph. The state x_i of the i-th neuron is computed from the states {x_j} of the set S_i of neurons that feed into it through the total input (or induced local field) a_i = Σ_{j∈S_i} w_ij x_j. The coefficient w_ij of the linear combination is the coupling from neuron j to neuron i. The local field a_i determines the state x_i through a nonlinear differentiable function f called the activation function: x_i = f(a_i). The activation function is often chosen to be the hyperbolic tangent or a similar sigmoid function.

The connection graph of multi-layer networks has no feedback loops, and the stable state is computed by propagating state information from the input units (which receive no input from other units) to the output units (which propagate no information to other units). The initialization of the state of the input units through an input vector X results in an output vector O describing the state of the output units. The network thus implements an input-output map, O = O(X, W), which depends on the values assigned to the vector W of synaptic couplings.

The learning process is formulated as a search in the space {W}, so as to find an optimal configuration W* which minimizes a function E(W). Given a training set of p input vectors X^μ and their desired outputs D^μ, 1 ≤ μ ≤ p, the cost function

E(W) = (1/2p) Σ_{μ=1}^{p} ||D^μ − O(X^μ, W)||²     (2.1)

measures the discrepancy between the actual behavior of the system and the desired behavior.
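The forward map O(X, W) and the cost (2.1) can be sketched in a few lines of NumPy. The single tanh hidden layer, the array shapes, and all names here are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def forward(X, W1, W2):
    # a_i = sum over j in S_i of w_ij x_j is the induced local field;
    # x_i = f(a_i) with f = tanh as the activation function.
    h = np.tanh(W1 @ X)      # hidden-layer states
    return np.tanh(W2 @ h)   # output vector O(X, W)

def cost(W1, W2, inputs, targets):
    # E(W) = (1/2p) sum_mu ||D_mu - O(X_mu, W)||^2, as in Eq. (2.1)
    p = len(inputs)
    return sum(np.linalg.norm(D - forward(X, W1, W2)) ** 2
               for X, D in zip(inputs, targets)) / (2 * p)
```

The cost vanishes exactly when the network reproduces every desired output, and is positive otherwise.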
The minimization of E with respect to W is usually performed through iterative updates using some form of gradient descent:

W(k+1) = W(k) − η ∇E,     (2.2)

where η is used to adjust the size of the updating step, and ∇E is an estimate of the gradient of E with respect to W. The commonly used Back-Propagation algorithm, popularized by (Rumelhart, Hinton, and Williams, 1986), provides an efficient way of estimating ∇E for multi-layer networks.

The dynamical behavior of learning algorithms based on the minimization of E(W) through gradient descent is controlled by the properties of the E(W) surface. The goal of this work is to gain a better understanding of the structure of this surface through an investigation of its second derivatives, as contained in the Hessian matrix H.

3 SECOND ORDER PROPERTIES

We now consider a simple model which can be investigated analytically: an N-dimensional input vector feeding onto a single output unit with a linear activation function f(a) = a. The output corresponding to input X^μ is given by

O^μ = Σ_{i=1}^{N} w_i x_i^μ = W^T X^μ,     (3.1)

where x_i^μ is the i-th component of the μ-th input vector, and w_i is the coupling from the i-th input unit to the output.

The rule for weight updates

W(k+1) = W(k) − (η/p) Σ_{μ=1}^{p} (O^μ − d^μ) X^μ     (3.2)

follows from the gradient of the cost function

E(W) = (1/2p) Σ_{μ=1}^{p} (d^μ − O^μ)² = (1/2p) Σ_{μ=1}^{p} (d^μ − W^T X^μ)².     (3.3)

Note that the cost function of Eq. (3.3) is quadratic in W, and can be rewritten as

E(W) = (1/2) W^T R W − Q^T W + (1/2) c,     (3.4)

where R is the covariance matrix of the input, R_ij = (1/p) Σ_{μ=1}^{p} x_i^μ x_j^μ, a symmetric and nonnegative N × N matrix; the N-dimensional vector Q has components q_i = (1/p) Σ_{μ=1}^{p} d^μ x_i^μ, and the constant c = (1/p) Σ_{μ=1}^{p} (d^μ)². The gradient is given by ∇E = RW − Q, while the Hessian matrix of second derivatives is H = R.
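Because E(W) is quadratic, the gradient descent of Eqs. (2.2) and (3.2) can be simulated directly from R and Q. A minimal sketch, assuming Gaussian inputs and realizable targets (every name here is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 5, 50
X = rng.normal(size=(p, N))       # rows are the input vectors X_mu
W_star = rng.normal(size=N)
d = X @ W_star                    # realizable targets d_mu

R = X.T @ X / p                   # R_ij = (1/p) sum_mu x_i x_j
Q = X.T @ d / p                   # q_i  = (1/p) sum_mu d_mu x_i

eta = 1.0 / np.linalg.eigvalsh(R).max()   # step size, safely below 2/lambda_max
W = np.zeros(N)
for _ in range(5000):
    W -= eta * (R @ W - Q)        # gradient step: grad E = R W - Q
```

With p > N the matrix R is full rank, so the iteration converges to the unique solution of RW = Q.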
\nThe solution space of vectors W* which minimize E(W) is the subspace of solutions \nof the linear equation RW = Q, resulting from \\1 E = O. This subspace reduces to \na point if R is full rank. The diagonalization of R provides a diagonal matrix A \nformed by its eigenvalues, and a matrix U formed by its eigenvectors. Since R is \nnonnegative, all eigenvalues satisfy A > O. \nConsider now a two-step coordinate transformation: a translation V' = W - W* \nprovides new coordinates centered at the solution point; it is followed by a rotation \nV = UV' = U(W - W*) onto the principal axes of the error surface. In the new \ncoordinate system \n\n(3.5) \nwith A = UTRU and Eo = E(W*). Then 8E/8vj = AjVj, and 82E/8vj8vk = \nAj 6jk . The eigenvalues of the input covariance matrix give the second derivatives \nof the error surface with respect to its principal axes. \n\nIn the new coordinate system the Hessian matrix is the diagonal matrix A, and the \nrule for weight updates becomes a set of N decoupled equations: \n\nV(k + 1) = V(k) -17AV(k), \n\n(3.6) \n\nThe evolution of each component along a principal direction is given by \n\nvj(k) = (1 -17Aj)kVj(O), \n\n(3.7) \nso that Vj will converge to zero (and thus Wj to the solution wj) provided that \no < 17 < 2/ Aj. In this regime Vj decays to zero exponentially, with characteristic \ntime Tj = (17Aj )-1. The range l/Aj < 17 < 2/Aj corresponds to underdamped \ndynamics: the step size is large and convergence to the solution occurs through \n\n\fSecond Order Properties of Error Surfaces \n\n921 \n\noscillatory behavior. The range 0 < TJ < 1/ Aj corresponds to overdamped dynamics: \nthe step size is small and convergence requires many iterations. Critical damping \noccurs for TJ = 1/ Aj; if such choice is possible, the solution is reached in one iteration \n(Newton's method). \nIf all eigenvalues are equal, Aj = A for all 1 < j < N, the Hessian matrix is \ndiagonal: H = A. 
Convergence can be obtained in one iteration, with optimal step size η = 1/λ, and learning time T = 1. This highly symmetric case occurs when cross-sections of E(W) are hyperspheres in the N-dimensional space {W}. Such a high degree of symmetry is rarely encountered: correlated inputs result in nondiagonal elements for H, and the principal directions are rotated with respect to the original coordinates. The cross-sections of E(W) are elliptical, with different eigenvalues along different principal directions. Convergence requires 0 < η < 2/λ_j for all 1 ≤ j ≤ N, thus η must be chosen in the range 0 < η < 2/λ_max, where λ_max is the largest eigenvalue. The slowest time constant in the system is τ_max = (η λ_min)^{-1}, where λ_min is the lowest nonzero eigenvalue. The optimal step size η = 1/λ_max thus leads to τ_max = λ_max/λ_min for the decay along the principal direction of smallest nonzero curvature. A distribution of eigenvalues in the range λ_min ≤ λ ≤ λ_max results in a distribution of learning times, with average ⟨τ⟩ = λ_max ⟨1/λ⟩.

This analysis demonstrates that learning dynamics in quadratic surfaces are fully controlled by the eigenvalue distribution of the Hessian matrix. It is thus of interest to investigate such eigenvalue distribution.

4 EIGENVALUE SPECTRUM

The simple linear unit of Eq. (3.1) leads to the error function (3.4), for which the Hessian is given by the covariance matrix

R_ij = (1/p) Σ_{μ=1}^{p} x_i^μ x_j^μ.     (4.1)

It is assumed that the input components {x_i^μ} are independent, and drawn from a distribution with mean m and variance v. The size of the training set is quantified by the ratio α = p/N between the number of training examples and the dimensionality of the input vector. The eigenvalue spectrum has been computed (Le Cun, Kanter, and Solla, 1990), and it exhibits three dominant features:

(a) If p < N, the rank of the matrix R is p.
The existence of (N − p) zero eigenvalues out of N results in a delta function contribution of weight (1 − α) at λ = 0 for α < 1.

(b) A continuous part of the spectrum,

ρ(λ) = (α / 2πvλ) √((λ − λ_−)(λ_+ − λ)),     (4.2)

within the bounded interval λ_− ≤ λ ≤ λ_+, with λ_± = (1 ± √α)² v/α (Krogh and Hertz, 1991). Note that ρ(λ) is controlled only by the variance v of the distribution from which the inputs are drawn. The bounds λ_± are well defined, and of order one. For all α < 1, λ_− > 0, indicating a gap at the lower end of the spectrum.

(c) An isolated eigenvalue of order N, λ_N, present in the case of biased inputs (m ≠ 0).

True correlations between pairs (x_i, x_j) of input components might lead to a quite different spectrum from the one described above.

The continuous part (4.2) of the eigenvalue spectrum has been computed in the N → ∞ limit, while keeping α constant and finite. The magnitude of finite size effects has been investigated numerically for N ≤ 200 and various values of α. Results for N = 200, shown in Fig. 1, indicate that finite size effects are negligible: the distribution ρ(λ) is bounded within the interval [λ_−, λ_+], in good agreement with the theoretical prediction, even for such small systems. The result (4.2) is thus applicable in the finite p = αN case, an important regime given the limited availability of training data in most learning problems.

Figure 1: Spectral density ρ(λ) predicted by Eq. (4.2) for m = 0, v = 1, and α = 0.6, 1.2, 4, and 16. Experimental histograms for α = 0.6 (full squares) and α = 4 (open squares) are averages over 100 trials with N = 200 and x_i^μ = ±1 with probability 1/2 each.
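The bounds λ_± of the continuous spectrum are easy to verify numerically with the same ±1 inputs used in Fig. 1. A sketch for α = 4 (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
N, alpha, v = 200, 4.0, 1.0
p = int(alpha * N)
X = rng.choice([-1.0, 1.0], size=(p, N))   # centered inputs: m = 0, v = 1
lam = np.linalg.eigvalsh(X.T @ X / p)      # spectrum of the covariance matrix R

lam_lo = (1 - np.sqrt(alpha)) ** 2 * v / alpha   # lambda_- = 0.25 for alpha = 4
lam_hi = (1 + np.sqrt(alpha)) ** 2 * v / alpha   # lambda_+ = 2.25 for alpha = 4
```

At N = 200 the empirical eigenvalues fall essentially within [λ_−, λ_+], and their mean equals tr R / N = v.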
\n\n1 \n\n5 \n\n2 \n\n3 \n\n0.6 \n\n/ \n\n4 \n\nThe existence of a large eigenvalue AN is easily understood by considering the \nstructure of the covariance matrix R in the p ---4 00 limit, a regime for which a \ndetailed analysis is available in the adaptive filters literature (Widrow and Stearns, \n1985). In this limit, all off-diagonal elements of R are equal to m2 , and all diagonal \nelements are equal to v + m 2\u2022 The eigenvector UN = (1...1) thus corresponds to \nthe eigenvalue AN = N m2 + v. The remaining (N - 1) eigenvalues are all equal \nto v (note that the continuous part of the spectrum collapses onto a delta function \nat A_ = A+ = v as p ---4 00 ), thus satisfying trR = N(m 2 + v). The large part \nof AN is eliminated for centered distributions with m = 0, such as xr = \u00b11 with \nprobability 1/2, or xr = 3, -1, -2 with probability 1/3. Note that although m is \n\n\fSecond Order Properties of Error Surfaces \n\n923 \n\ncrucial in controlling the existence of an isolated eigenvalue of order N, it plays no \nrole in the spectral density of Eq. (4.2). \n\n5 LEARNING TIME \n\nConsider the learning time T = a(Amax/Amin). The eigenvalue ratio (Amax/Amin) \nmeasures the maximum number of iterations, and the factor of a accounts for the \ntime needed for each presentation of the full training set. \nFor m = 0, Amax = A+, and Amin = A_. The learning time T = a(A+/A_) can be \neasily computed using Eq. (4.2): T = a(l + ~2 /(1- .jQ)2. As a function of a, T \ndiverges at a = 1, and, surprisingly, goes through a minimum at a = (1 + J2)2 = \n5.83 before diverging linearly for a ...... 00. Numerical simulations were performed to \nestimate T by counting the number T of presentations of training examples needed \nto reach an allowed error level jj; through gradient descent. If the prescribed error jj; \nis sufficiently close to the minimum error Eo, T is controlled by the slowest mode, \nand it provides a good estimate for T. 
Numerical results for T as a function of α, shown in Fig. 2, were obtained by training a single linear neuron on randomly generated vectors. As predicted, the curve exhibits a clear maximum at α = 1, as well as a minimum between α = 4 and α = 5. The existence of such an optimal training set size for fast learning is a surprising result.

Figure 2: Number of iterations T (averaged over 20 trials) needed to train a linear neuron with N = 100 inputs. The x_j^μ are uniformly distributed between −1 and +1. Initial and target couplings W are chosen randomly from a uniform distribution within the [−1, +1]^N hypercube. Gradient descent is considered complete when the error reaches the prescribed value Ē = 0.001 above the E_0 = 0 minimum value.

Biased inputs m ≠ 0 produce a large eigenvalue λ_max = λ_N, proportional to N and responsible for slow convergence. A simple approach to reducing the learning time is to center each input variable x_i by subtracting its mean. An obvious source of systematic bias m is the use of activation functions which restrict the state variables to the [0, 1] interval. Symmetric activation functions such as the hyperbolic tangent are empirically known to yield faster convergence than their nonsymmetric counterparts such as the logistic function. Our results provide an explanation for this observation.

The extension of these results to multi-layer networks rests on the observation that each neuron i receives state information {x_j} from the j ∈ S_i neurons that feed into it, and can be viewed as minimizing a local objective function E_i whose Hessian involves the covariance matrix of such inputs. If all input variables are uncorrelated and have zero mean, no large eigenvalues will appear.
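The appearance of the isolated eigenvalue λ_N ≈ Nm² + v for biased inputs, and its removal by centering, can be checked directly. A minimal sketch with illustrative sizes (m = 0.5, N = 100; all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, m, v = 100, 1000, 0.5, 1.0
X = m + rng.normal(scale=np.sqrt(v), size=(p, N))   # biased inputs, mean m

# top eigenvalue of R sits near N m^2 + v = 26, far above the O(1) bulk
top_biased = np.linalg.eigvalsh(X.T @ X / p)[-1]

Xc = X - X.mean(axis=0)                             # centered state variables
top_centered = np.linalg.eigvalsh(Xc.T @ Xc / p)[-1]  # back inside the O(1) bulk
```

Centering collapses the top eigenvalue from order N down to order one, shrinking the ratio λ_max/λ_min that sets the learning time.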
But states with x_j = m ≠ 0 produce eigenvalues proportional to the number of input neurons N_i in the set S_i, resulting in slow convergence if the connectivity is large. An empirically known solution to this problem, justified by our theoretical analysis, is to use individual learning rates η_i inversely proportional to the number of inputs N_i to the i-th neuron. Yet another approach is to keep a running estimate of the average x̄_i and use centered state variables x̃_i = x_i − x̄_i. Such an algorithm results in considerable reductions in learning time.

6 CONCLUSIONS

Our results are based on a rigorous calculation of the eigenvalue spectrum for a symmetric matrix constructed from the outer product of random vectors. The spectral density provides a full description of the relaxation of a single adaptive linear unit, and yields a surprising result for the optimal size of the training set in batch learning. Various aspects of the dynamics of learning in multi-layer networks composed of nonlinear units are clarified: the theory justifies known empirical methods and suggests novel approaches to reduce learning times.

References

A. Krogh and J. A. Hertz (1991), 'Dynamics of generalization in linear perceptrons', in Advances in Neural Information Processing Systems 3, ed. by D. S. Touretzky and R. Lippmann, Morgan Kaufmann (California).

Y. Le Cun, I. Kanter, and S. A. Solla (1990), 'Eigenvalues of covariance matrices: application to neural-network learning', Phys. Rev., to be published.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1986), 'Learning representations by back-propagating errors', Nature 323, 533-536.

B. Widrow and S. D. Stearns (1985), Adaptive Signal Processing, Prentice-Hall (New Jersey).
\n\n\f", "award": [], "sourceid": 314, "authors": [{"given_name": "Yann", "family_name": "LeCun", "institution": null}, {"given_name": "Ido", "family_name": "Kanter", "institution": null}, {"given_name": "Sara", "family_name": "Solla", "institution": null}]}