{"title": "Learning in large linear perceptrons and why the thermodynamic limit is relevant to the real world", "book": "Advances in Neural Information Processing Systems", "page_first": 207, "page_last": 214, "abstract": null, "full_text": "Learning in large linear perceptrons and \nwhy the thermodynamic limit is relevant \n\nto the real world \n\nDepartment of Physics, University of Edinburgh \n\nPeter Sollich \n\nEdinburgh EH9 3JZ, U.K. \n\nP.Sollich~ed.ac.uk \n\nAbstract \n\nWe present a new method for obtaining the response function 9 \nand its average G from which most of the properties of learning \nand generalization in linear perceptrons can be derived. We first \nrederive the known results for the 'thermodynamic limit' of infinite \nperceptron size N and show explicitly that 9 is self-averaging in \nthis limit. We then discuss extensions of our method to more gen(cid:173)\neral learning scenarios with anisotropic teacher space priors, input \ndistributions, and weight decay terms. Finally, we use our method \nto calculate the finite N corrections of order 1/ N to G and discuss \nthe corresponding finite size effects on generalization and learning \ndynamics. An important spin-off is the observation that results \nobtained in the thermodynamic limit are often directly relevant to \nsystems of fairly modest, 'real-world' sizes. \n\n1 \n\nINTRODUCTION \n\nOne of the main areas of research within the Neural Networks community is the issue \nof learning and generalization. Starting from a set of training examples (normally \nassumed to be input-output pairs) generated by some unknown 'teacher' rule V, one \nwants to find, using a suitable learning or training algorithm, a student N (read \n'Neural Network') which generalizes from the training set, i.e., predicts the outputs \ncorresponding to inputs not contained in the training set as accurately as possible. 
\n\n\f208 \n\nPeter Sollich \n\nIf the inputs are N-dimensional vectors x E nN and the outputs are scalars yEn, \nthen one of the simplest functional forms that can be assumed for the student .N is \nthe linear perceptron, which is parametrized in terms of a weight vector W N E nN \nand implements the linear input-output mapping \n\nYN(X) = 7Nw~x. \n\n(1) \n\nA commonly used learning algorithm for the linear perceptron is gradient descent \non the training error, i.e., the error that the student .N makes on the training set. \nUsing the standard squared output deviation error measure, the training error for a \ngiven set ofp training examples {(xlJ , ylJ),J-l = 1 . . . p} is E t = L:IJ ~(ylJ-YN(XIJ))2 = \n~ L:IJ (ylJ - w~xlJ / VN)2. To prevent the student from fitting noise in the training \ndata, a quadratic weight decay term ~AW~ is normally added to the training error, \nwith the value of the weight decay parameter A determining how strongly large \nweight vectors are penalized. Gradient descent is thus performed on the function \nE = E t + ~AW~, and the corresponding learning dynamics is, in a continuous time \napproximation , dw N / dt = -\"V wE. As discussed in detail by Krogh and Hertz \n(1992), this results in an exponential approach of W N to its asymptotic value , with \ndecay constants given by the eigenvalues of the matrix M N , defined by (1 denotes \nthe N x N identity matrix) \n\nMN = Al + A, A = ~ L:IJ xlJ(xlJ)T. \n\nTo examine what generalization performance is achieved by the above learning \nalgorithm, one has to make an assumption about the functional form of the teacher. \nThe simplest such assumption is that the problem is learnable, i. e., that the teacher , \nlike the student, is a linear perceptron. A teacher V is then specified by a weight \nvector Wv and maps a given input x to the output Yv(x) = w~ x/VN. 
We assume that the test inputs for which the student is asked to predict the corresponding outputs are drawn from an isotropic Gaussian distribution, P(x) ∝ exp(-½x²), and that the training outputs are corrupted by additive noise of variance σ². The asymptotic value of the average generalization error for t → ∞ is then (Krogh and Hertz, 1992)\n\nε_g(t → ∞) = ½ [σ²G + λ(σ² - λ) ∂G/∂λ],    (3)\n\nwhere G is the average of the so-called response function over the training inputs:\n\nG = ⟨𝒢⟩_{P({x^μ})},  𝒢 = (1/N) tr M_N^{-1}.    (4)\n\nThe time dependence of the average generalization error for finite but large t is an exponential approach to the asymptotic value (3) with decay constant λ + a_min, where a_min is the lowest eigenvalue occurring in the average eigenvalue spectrum of the input correlation matrix A (Krogh and Hertz, 1992). This average eigenvalue spectrum, which we denote by ρ(a), can be calculated from the average response function according to (Krogh, 1992)\n\nρ(a) = (1/π) lim_{ε→0⁺} Im G|_{λ=-a-iε},    (5)\n\nwhere we have assumed ρ(a) to be normalized, ∫da ρ(a) = 1.\n\nEqs. (3,5) show that the key quantity determining learning and generalization in the linear perceptron is the average response function G defined in (4). This function has previously been calculated in the 'thermodynamic limit', N → ∞ at α = p/N = const., using a diagrammatic expansion (Hertz et al., 1989) and the replica method (Opper, 1989, Kinzel and Opper, 1991). In Section 2, we present what we believe to be a much simpler method for calculating G, based only on simple matrix identities. We also show explicitly that 𝒢 is self-averaging in the thermodynamic limit, which means that the fluctuations of 𝒢 around its average G become vanishingly small as N → ∞. This implies, for example, that the generalization error is also self-averaging. 
In Section 3 we extend the method to more general cases such as anisotropic teacher space priors and input distributions, and general quadratic penalty terms. Finite size effects are considered in Section 4, where we calculate the O(1/N) corrections to G, ε_g(t → ∞) and ρ(a). We discuss the resulting effects on generalization and learning dynamics and derive explicit conditions on the perceptron size N for results obtained in the thermodynamic limit to be valid. We conclude in Section 5 with a brief summary and discussion of our results.\n\n2 THE BASIC METHOD\n\nOur method for calculating the average response function G is based on a recursion relation relating the values of the (unaveraged) response function 𝒢 for p and p+1 training examples. Assume that we are given a set of p training examples with corresponding matrix M_N. By adding a new training example with input x, we obtain the matrix M_N^+ = M_N + (1/N) xx^T. It is straightforward to show that the inverse of M_N^+ can be expressed as\n\n(M_N^+)^{-1} = M_N^{-1} - (1/N) M_N^{-1} xx^T M_N^{-1} / [1 + (1/N) x^T M_N^{-1} x].\n\n(One way of proving this identity is to multiply both sides by M_N^+ and exploit the fact that M_N^+ M_N^{-1} = 1 + (1/N) xx^T M_N^{-1}.) Taking the trace, we obtain the following recursion relation for 𝒢:\n\n𝒢(p+1) = 𝒢(p) - (1/N) [(1/N) x^T M_N^{-2} x] / [1 + (1/N) x^T M_N^{-1} x].    (6)\n\nNow denote z_i = (1/N) x^T M_N^{-i} x (i = 1, 2). 
With x drawn randomly from the assumed input distribution P(x) ∝ exp(-½x²), the z_i can readily be shown to be random variables with means and (co-)variances\n\nz̄_i = (1/N) tr M_N^{-i},  ⟨Δz_i Δz_j⟩ = (2/N²) tr M_N^{-i-j}.\n\nCombining this with the fact that tr M_N^{-k} ≤ N λ^{-k} = O(N), we have that the fluctuations Δz_i of the z_i around their average values are O(1/√N); inserting this into (6), we obtain\n\n𝒢(p+1) = 𝒢(p) - (1/N) [(1/N) tr M_N^{-2}] / [1 + (1/N) tr M_N^{-1}] + O(N^{-3/2})\n        = 𝒢(p) + (1/N) [∂𝒢(p)/∂λ] / [1 + 𝒢(p)] + O(N^{-3/2}).    (7)\n\nStarting from 𝒢(0) = 1/λ, we can apply this recursion p = αN times to obtain 𝒢(p) up to terms which add up to at most O(pN^{-3/2}) = O(1/√N). This shows that 𝒢 is self-averaging in the thermodynamic limit: whatever the training set, the value of 𝒢 will always be the same up to fluctuations of O(1/√N). In fact, we shall show in Section 4 that the fluctuations of 𝒢 are only O(1/N). This means that the O(N^{-3/2}) fluctuations from each iteration of (7) are only weakly correlated, so that they add up like independent random variables to give a total fluctuation for 𝒢(p) of O((p/N³)^{1/2}) = O(1/N).\n\nWe have seen that, in the thermodynamic limit, 𝒢 is identical to its average G because its fluctuations are vanishingly small. To calculate the value of G in the thermodynamic limit as a function of α and λ, we insert the relation 𝒢(p+1) - 𝒢(p) = (1/N) ∂𝒢(α)/∂α + O(1/N²) into eq. (7) (with 𝒢 replaced by G) and neglect all finite N corrections. This yields the partial differential equation\n\n∂G/∂α - [1/(1+G)] ∂G/∂λ = 0,    (8)\n\nwhich can readily be solved using the method of characteristic curves (see, e.g., John, 1978). Using the initial condition G|_{α=0} = 1/λ gives α/(1+G) = 1/G - λ, which leads to the well-known result (see, e.g., Hertz et al., 1989)\n\nG = (1/(2λ)) (1 - α - λ + √((1 - α - λ)² + 4λ)). 
\n\n(9) \nIn the complex >. plane, G has a pole at >. = 0 and a branch cut arising from \nthe root; according to eq. (5), these singularities determine the average eigenvalue \nspectrum pea) of A, with the result (Krogh, 1992) \n\npea) = (1- a)8(1 - a)6(a) + -21 J(a+ - a)(a - a_), \n\n(10) \nwhere 8(x) is the Heaviside step function, 8(x) = 1 for x > 0 and 0 otherwise. \nThe root in eq. (10) only contributes when its argument is non-negative, i.e., for a \nbetween the 'spectral limits' a_ and a+, which have the values a\u00b1 = (1 \u00b1 fo)2. \n\n1ra \n\n3 EXTENSIONS TO MORE GENERAL LEARNING \n\nSCENARIOS \n\nWe now discuss some extensions of our method to more general learning sce(cid:173)\nnarios. First, consider the case of an anisotropic teacher space prior, P(w v ) ex: \n\n\fLearning in Large Linear Perceptrons \n\n211 \n\nexp(-!w~:E~lWV)' with symmetric positive definite :Ev. This leaves the defini(cid:173)\ntion of the response function unchanged; eq. (3), however, has to be replaced by \nfg(t -- 00) = 1/2{q2G + A[q2 - A(~tr :Ev)]oGloA}. \nAs a second extension, assume that the inputs are drawn from an anisotropic distri(cid:173)\nbution, P(x) oc exp(-~xT:E-1x). It can then be shown that the asymptotic value \nof the average generalization error is still given by eq. (3) if the response function is \nredefined to be 9 = ~ tr :EM~l. This modified response function can be calculated \nas follows: First we rewrite 9 as ~tr (A:E- 1 + A)-I, where A = ~ L:JJ(xJJ)TxJJ is \nthe correlation matrix of the transformed input examples xl' = :E- 1/ 2xJJ . Since the \nxJJ are distributed according to P(xJJ ) oc exp( _~(xJJ)2), the problem is thus reduced \nto finding the response function 9 = ~ tr (L + A)-1 for isotropically distributed \ninputs and L = A:E- 1. 
The recursion relations between 𝒢(p+1) and 𝒢(p) derived in the previous section remain valid, and result, in the thermodynamic limit, in a differential equation for the average response function G analogous to eq. (8). The initial condition is now G|_{α=0} = (1/N) tr Λ^{-1}, and one obtains an implicit equation for G,\n\nG = (1/N) tr (Λ + [α/(1+G)] 1)^{-1},    (11)\n\nwhere in the case of an anisotropic input distribution considered here, Λ = λΣ^{-1}. If Σ has a particularly simple form, then the dependence of G on α and λ can be obtained analytically, but in general eq. (11) has to be solved numerically.\n\nFinally, one can also investigate the effect of a general quadratic weight decay term, ½ w_N^T Λ w_N, in the energy function E. The expression for the average generalization error becomes more cumbersome than eq. (3) in this case, but the result can still be expressed in terms of the average response function G = ⟨𝒢⟩ = ⟨(1/N) tr (Λ + A)^{-1}⟩, which can be obtained as the solution of eq. (11), with Λ now the weight decay matrix.\n\n4 FINITE N CORRECTIONS\n\nSo far, we have focussed attention on the thermodynamic limit of perceptrons of infinite size N. The results are clearly only approximately valid for real, finite systems, and it is therefore interesting to investigate corrections for finite N. This we do in the present section by calculating the O(1/N) corrections to G and ρ(a). For details of the calculations and results of computer simulations which support our theoretical analysis, we refer the reader to (Sollich, 1994).\n\nFirst note that, for λ = 0, the exact result for the average response function is G|_{λ=0} = (α - 1 - 1/N)^{-1} for α > 1 + 1/N (see, e.g., Eaton, 1983), which clearly admits a series expansion in powers of 1/N. We assume that a similar expansion also exists for nonzero λ, and write\n\nG = G_0 + G_1/N + O(1/N²).    (12)\n\nG_0 is the value of G in the thermodynamic limit as given by eq. (9). 
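For Λ = λ1, the implicit equation (11) must reproduce the closed form (9), which gives a convenient numerical sanity check. The sketch below (our fixed-point iteration; direct iteration converges here because the map is a contraction for these parameter values) solves eq. (11) by repeated substitution.

```python
import numpy as np

# Fixed-point iteration of eq. (11) for the isotropic case Lambda = lambda*1,
# which must reduce to the closed-form result of eq. (9). (Any symmetric
# positive definite Lambda could be substituted; this choice is just a check.)
N, alpha, lam = 100, 1.5, 0.3
Lambda = lam * np.eye(N)

G = 1.0
for _ in range(200):
    # G <- (1/N) tr (Lambda + alpha/(1+G) * 1)^{-1}
    G = np.trace(np.linalg.inv(Lambda + alpha / (1 + G) * np.eye(N))) / N

G_closed = (1 - alpha - lam + np.sqrt((1 - alpha - lam)**2 + 4*lam)) / (2*lam)
print(G, G_closed)
```

For a genuinely anisotropic Λ the same loop applies unchanged; only the construction of `Lambda` differs.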
For finite N, the fluctuations Δ𝒢 = 𝒢 - G of 𝒢 around its average value G become relevant; for λ = 0, the variance of these fluctuations is known to have a power series expansion in 1/N, and again we assume a similar expansion for finite λ, ⟨(Δ𝒢)²⟩ = Δ_2/N + O(1/N²), where the first term is O(1/N) and not O(1) because, as discussed in Section 2, the fluctuations of 𝒢 for large N are no greater than O(1/√N). To calculate G_1 and Δ_2, one starts again from the recursion relation (6), now expanding everything up to second order in the fluctuation quantities Δz_i and Δ𝒢. Averaging over the training inputs and collecting orders of 1/N yields after some straightforward algebra the known eq. (8) for G_0 and two linear partial differential equations for G_1 and Δ_2, the latter obtained by squaring both sides of eq. (6). Solving these, one obtains\n\nG_1 = G_0² (1 - λG_0) / (1 + λG_0²)²,  Δ_2 = 0.    (13)\n\nIn the limit λ → 0, G_1 = 1/(α - 1)², consistent with the exact result for G quoted above; likewise, the result Δ_2 ≡ 0 agrees with the exact series expansion of the variance of the fluctuations of 𝒢 for λ = 0, which begins with an O(1/N²) term (see, e.g., Barber et al., 1994).\n\nFigure 1: Average generalization error: result for N → ∞, ε_{g,0}, and coefficient of the O(1/N) correction, ε_{g,1}. (a) Noise free teacher, σ² = 0. (b) Noisy teacher, σ² = 0.5. Curves are labeled by the value of the weight decay parameter λ.\n\nFrom the 1/N expansion (12) of G we obtain, using eq. (3), a corresponding expansion of the asymptotic value of the average generalization error, which we write as ε_g(t → ∞) = ε_{g,0} + ε_{g,1}/N + O(1/N²). 
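The two limits quoted after eq. (13) can be verified directly. The sketch below (our check of the expansion coefficients; tolerances are ours) evaluates G_0 from eq. (9) and G_1 from eq. (13), confirming that G_1 approaches 1/(α-1)² as λ → 0 for α > 1, and that G_1 = 0 at α = 0, where G = 1/λ exactly and there are no fluctuations.

```python
import numpy as np

# Limit checks on the 1/N expansion G = G_0 + G_1/N of eqs. (9), (12), (13).
def G0(alpha, lam):
    """Thermodynamic-limit response function, eq. (9)."""
    return (1 - alpha - lam + np.sqrt((1 - alpha - lam)**2 + 4*lam)) / (2*lam)

def G1(alpha, lam):
    """Coefficient of the 1/N correction, eq. (13)."""
    g = G0(alpha, lam)
    return g**2 * (1 - lam * g) / (1 + lam * g**2)**2

lam = 1e-6                      # probe the lambda -> 0 limit numerically
print(G1(2.0, lam), 1.0)        # should approach 1/(alpha-1)^2 = 1
print(G1(0.0, lam))             # alpha = 0: G = 1/lambda exactly, so G_1 = 0
```

The factor (1 - λG_0) in G_1 is what makes the correction vanish identically at α = 0, since there G_0 = 1/λ.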
It follows that the thermodynamic limit result for the average generalization error, ε_{g,0}, is a good approximation to the true result for finite N as long as N ≫ N_c = |ε_{g,1}/ε_{g,0}|. In Figure 1, we plot ε_{g,0} and ε_{g,1} for several values of λ and σ². It can be seen that the relative size of the first order correction, |ε_{g,1}/ε_{g,0}|, and hence the critical system size N_c for validity of the thermodynamic limit result, is largest when λ is small. Exploiting this fact, N_c can be bounded by 1/(1 - α) for α < 1 and (3α + 1)/[α(α - 1)] for α > 1. It follows, for example, that the critical system size N_c is smaller than 5 as long as α < 0.8 or α > 1.72, for all λ and σ². This bound on N_c can be tightened for non-zero λ; for λ > 2, for example, one has N_c < (2λ - 1)/(λ + 1)² < 1/3. We have thus shown explicitly that thermodynamic limit calculations of learning and generalization behaviour can be relevant for fairly small, 'real-world' systems of size N of the order of a few tens or hundreds. This is in contrast to the widespread suspicion among non-physicists that the methods of statistical physics give valid results only for huge system sizes of the order of N ≈ 10²³.\n\nFigure 2: Schematic plot of the average eigenvalue spectrum ρ(a) of the input correlation matrix A. (a) Result for N → ∞, ρ_0(a). (b) O(1/N) correction, ρ_1(a). Arrows indicate δ-peaks and are labeled by the corresponding heights.\n\nWe now consider the O(1/N) correction to the average eigenvalue spectrum of the input correlation matrix A. Setting ρ(a) = ρ_0(a) + ρ_1(a)/N + O(1/N²), ρ_0(a) is the N → ∞ result given by eq. (10), and from eq. (13) one derives the correction ρ_1(a). Figure 2 shows sketches of ρ_0(a) and ρ_1(a). 
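The N → ∞ spectrum ρ_0(a) and the closed form (9) it encodes can both be probed on a single realization of A. The sketch below (our illustration; tolerances are loose) diagonalizes one sample at moderate N and compares the bulk of its spectrum with the spectral limits a_± = (1 ± √α)² from eq. (10), and the resulting response function with eq. (9).

```python
import numpy as np

# One-sample check of eqs. (9) and (10) (tolerances are loose, chosen by us).
rng = np.random.default_rng(2)
N, alpha, lam = 500, 2.0, 0.5
p = int(alpha * N)

X = rng.standard_normal((p, N))
eigs = np.linalg.eigvalsh(X.T @ X / N)          # spectrum of A = (1/N) X^T X
a_minus = (1 - np.sqrt(alpha))**2               # spectral limits a_-, a_+
a_plus = (1 + np.sqrt(alpha))**2
print(eigs.min(), a_minus, eigs.max(), a_plus)  # bulk edges near a_-, a_+

# The same eigenvalues give the response function directly:
# G = (1/N) tr (lambda*1 + A)^{-1} = mean of 1/(lambda + a) over eigenvalues.
G_closed = (1 - alpha - lam + np.sqrt((1 - alpha - lam)**2 + 4*lam)) / (2*lam)
G_sample = np.mean(1.0 / (lam + eigs))
print(G_closed, G_sample)
```

At N of a few hundred the agreement is already at the percent level, in line with the smallness of the 1/N corrections discussed above.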
Note that ∫da ρ_1(a) = 0, as expected, since the normalization of ρ(a) is independent of N. Furthermore, there is no O(1/N) correction to the δ-peak in ρ_0(a) at a = 0, since this peak arises from the N - p zero eigenvalues of A for α = p/N < 1 and therefore has a height of 1 - α for any finite N. The δ-peaks in ρ_1(a) at the spectral limits a_+ and a_- are an artefact of the truncated 1/N expansion: ρ(a) is determined by the singularities of G as a function of λ, and the location of these singularities is only obtained correctly by resumming the full 1/N expansion. The δ-peaks in ρ_1(a) can be interpreted as 'precursors' of a broadening of the eigenvalue spectrum of A to values which, when the whole 1/N series is resummed, will lie outside the N → ∞ spectral range [a_-, a_+]. The negative term in ρ_1(a) represents the corresponding 'flattening' of the eigenvalue spectrum between a_- and a_+. We can thus conclude that the average eigenvalue spectrum of A for finite N will be broader than for N → ∞, which means in particular that the learning dynamics will be slowed down, since the smallest eigenvalue a_min of A will be smaller than a_-. From our result for ρ_1(a) we can also deduce when the N → ∞ result ρ_0(a) is valid for finite N; the condition turns out to be N ≫ a/[(a_+ - a)(a - a_-)]. Consistent with our discussion of the broadening of the eigenvalue spectrum of A, N has to be larger for a near the spectral limits a_-, a_+ if ρ_0(a) is to be a good approximation to the finite N average eigenvalue spectrum of A.\n\n5 SUMMARY AND DISCUSSION\n\nWe have presented a new method, based on simple matrix identities, for calculating the response function 𝒢 and its average G, which determine most of the properties of learning and generalization in linear perceptrons. In the thermodynamic limit, N → ∞, we have recovered the known result for G and have shown explicitly that 𝒢 is self-averaging. 
Extensions of our method to more general learning scenarios have also been discussed. Finally, we have obtained the O(1/N) corrections to G and the corresponding corrections to the average generalization error, and shown explicitly that the results obtained in the thermodynamic limit can be valid for fairly small, 'real-world' system sizes N. We have also calculated the O(1/N) correction to the average eigenvalue spectrum of the input correlation matrix A and interpreted it in terms of a broadening of the spectrum for finite N, which will cause a slowing down of the learning dynamics.\n\nWe remark that the O(1/N) corrections that we have obtained can also be used in different contexts, for example for calculations of test error fluctuations and optimal test set size (Barber et al., 1994). Another application is in an analysis of the evidence procedure in Bayesian inference for finite N, where optimal values of 'hyperparameters' like the weight decay parameter λ are determined on the basis of the training data (G Marion, in preparation). We hope, therefore, that our results will pave the way for a systematic investigation of finite size effects in learning and generalization.\n\nReferences\n\nD Barber, D Saad, and P Sollich (1994). Finite size effects and optimal test set size in linear perceptrons. Submitted to J. Phys. A.\n\nM L Eaton (1983). Multivariate Statistics - A Vector Space Approach. Wiley, New York.\n\nJ A Hertz, A Krogh, and G I Thorbergsson (1989). Phase transitions in simple learning. J. Phys. A, 22:2133-2150.\n\nF John (1978). Partial Differential Equations. Springer, New York, 3rd ed.\n\nW Kinzel and M Opper (1991). Dynamics of learning. In E Domany, J L van Hemmen, and K Schulten, editors, Models of Neural Networks, pages 149-171. Springer, Berlin.\n\nA Krogh (1992). Learning with noise in a linear perceptron. J. Phys. A, 25:1119-1133.\n\nA Krogh and J A Hertz (1992). 
Generalization in a linear perceptron in the presence of noise. J. Phys. A, 25:1135-1147.\n\nM Opper (1989). Learning in neural networks: Solvable dynamics. Europhysics Letters, 8:389-392.\n\nP Sollich (1994). Finite-size effects in learning and generalization in linear perceptrons. J. Phys. A, 27:7771-7784.\n", "award": [], "sourceid": 979, "authors": [{"given_name": "Peter", "family_name": "Sollich", "institution": null}]}