{"title": "Dynamics of Generalization in Linear Perceptrons", "book": "Advances in Neural Information Processing Systems", "page_first": 897, "page_last": 903, "abstract": null, "full_text": "Dynamics of Generalization in Linear Perceptrons\n\nAnders Krogh\nNiels Bohr Institute\nBlegdamsvej 17\nDK-2100 Copenhagen, Denmark\n\nJohn A. Hertz\nNORDITA\nBlegdamsvej 17\nDK-2100 Copenhagen, Denmark\n\nAbstract\n\nWe study the evolution of the generalization ability of a simple linear perceptron with $N$ inputs which learns to imitate a \"teacher perceptron\". The system is trained on $p = \\alpha N$ binary example inputs, and the generalization ability is measured by testing for agreement with the teacher on all $2^N$ possible binary input patterns. The dynamics may be solved analytically and exhibits a phase transition from imperfect to perfect generalization at $\\alpha = 1$. Except at this point the generalization ability approaches its asymptotic value exponentially, with critical slowing down near the transition; the relaxation time is $\\propto (1 - \\sqrt{\\alpha})^{-2}$. Right at the critical point, the approach to perfect generalization follows a power law $\\propto t^{-1/2}$. In the presence of noise, the generalization ability is degraded by an amount $\\propto (\\sqrt{\\alpha} - 1)^{-1}$ just above $\\alpha = 1$.\n\n1 INTRODUCTION\n\nIt is very important in practical situations to know how well a neural network will generalize from the examples it is trained on to the entire set of possible inputs. This problem is the focus of a lot of recent and current work [1-11]. All this work, however, deals with the asymptotic state of the network after training. Here we study a very simple model which allows us to follow the evolution of the generalization ability in time under training. 
It has a single linear output unit, and the weights obey adaline learning. Despite its simplicity, it exhibits nontrivial behaviour: a dynamical phase transition at a critical number of training examples, with power-law decay right at the transition point and critical slowing down as one approaches it from either side.\n\n2 THE MODEL\n\nOur simple linear neuron has an output $V = N^{-1/2} \\sum_i W_i \\xi_i$, where $\\xi_i$ is the $i$th input. It learns to imitate a teacher [1] whose weights are $U_i$ by training on $p$ examples of input-output pairs $(\\xi_i^\\mu, \\zeta^\\mu)$ with\n\n$$\\zeta^\\mu = N^{-1/2} \\sum_i U_i \\xi_i^\\mu \\tag{1}$$\n\ngenerated by the teacher. The adaline learning equation [11] is then\n\n$$\\dot{W}_i = \\frac{1}{\\sqrt{N}} \\sum_{\\mu=1}^{p} \\Big( \\zeta^\\mu - \\frac{1}{\\sqrt{N}} \\sum_j W_j \\xi_j^\\mu \\Big) \\xi_i^\\mu = \\frac{1}{N} \\sum_{\\mu j} (U_j - W_j) \\xi_j^\\mu \\xi_i^\\mu. \\tag{2}$$\n\nBy introducing the difference between the teacher and the pupil,\n\n$$v_i \\equiv U_i - W_i \\tag{3}$$\n\nand the training input correlation matrix\n\n$$A_{ij} = \\frac{1}{N} \\sum_{\\mu=1}^{p} \\xi_i^\\mu \\xi_j^\\mu, \\tag{4}$$\n\nthe learning equation becomes\n\n$$\\dot{v}_i = -\\sum_j A_{ij} v_j. \\tag{5}$$\n\nWe let the example inputs $\\xi_i^\\mu$ take the values $\\pm 1$, randomly and independently, but it is straightforward to generalize to any distribution of inputs with $\\langle \\xi_i^\\mu \\xi_j^\\nu \\rangle_\\xi \\propto \\delta_{ij} \\delta_{\\mu\\nu}$. For a large number of examples ($p = O(N) \\gg 1$), the resulting generalization ability will be independent of just which $p$ of the $2^N$ possible binary input patterns we choose. All our results will then depend only on the fact that we can calculate the spectrum of the matrix $A$.\n\n3 GENERALIZATION ABILITY\n\nTo measure the generalization ability, we test whether the output of our perceptron with weights $W_i$ agrees with that of the teacher with weights $U_i$ on all possible binary inputs. 
Our objective function, which we call the generalization error, is just the square of the error, averaged over all these inputs:\n\n$$F = \\frac{1}{2^N} \\sum_{\\{\\sigma\\}} \\Big( N^{-1/2} \\sum_i (U_i - W_i) \\sigma_i \\Big)^2 = \\frac{1}{N} \\sum_i v_i^2. \\tag{6}$$\n\n(We used that $2^{-N} \\sum_{\\{\\sigma\\}} \\sigma_i \\sigma_j$ is zero unless $i = j$.) That is, $F$ is just proportional to the square of the difference between the teacher and pupil weight vectors. With the $N^{-1}$ normalization factor, $F$ will then vary between 1 (tabula rasa) and 0 (perfect generalization) if we normalize the teacher weight vector $\\vec{u}$ to length $\\sqrt{N}$. During learning, $W_i$ and thus $v_i$ depend on time, so $F$ is a function of $t$. The complementary quantity $1 - F(t)$ could be called the generalization ability.\n\nIn the basis where $A$ is diagonal, the learning equation (5) is simply\n\n$$\\dot{v}_r = -A_r v_r \\tag{7}$$\n\nwhere $A_r$ are the eigenvalues of $A$. This has the solution\n\n$$v_r(t) = v_r(0) e^{-A_r t} = u_r e^{-A_r t}, \\tag{8}$$\n\nwhere $u_r$ denotes the teacher weights in this basis and it is assumed that the pupil weights are zero at time $t = 0$ (we will come back to the more general case later). Thus we find\n\n$$F(t) = \\frac{1}{N} \\sum_r v_r^2(t) = \\frac{1}{N} \\sum_r u_r^2 e^{-2 A_r t}. \\tag{9}$$\n\nAveraging over all possible training sets of size $p$, this can be expressed in terms of the density of eigenvalues of $A$, $\\rho(\\epsilon)$:\n\n$$F(t) = \\frac{|\\vec{u}|^2}{N} \\int d\\epsilon \\, \\rho(\\epsilon) e^{-2 \\epsilon t}. \\tag{10}$$\n\nIn the following it will be assumed that the length of $\\vec{u}$ is normalized to $\\sqrt{N}$, so the prefactor disappears.\n\nFor large $N$, the eigenvalue density is (see, e.g. [11], where it can be obtained simply from the imaginary part of the Green's function in eq. (57))\n\n$$\\rho(\\epsilon) = \\frac{1}{2 \\pi \\epsilon} \\sqrt{(\\epsilon_+ - \\epsilon)(\\epsilon - \\epsilon_-)} + (1 - \\alpha) \\theta(1 - \\alpha) \\delta(\\epsilon), \\tag{11}$$\n\nwhere\n\n$$\\epsilon_\\pm = (1 \\pm \\sqrt{\\alpha})^2 \\tag{12}$$\n\nand $\\theta()$ is the unit step function. 
The density has two terms: a 'deformed semicircle' between the roots $\\epsilon_-$ and $\\epsilon_+$, and for $\\alpha < 1$ a delta function at $\\epsilon = 0$ with weight $1 - \\alpha$. The delta-function term appears because no learning takes place in the subspace orthogonal to that spanned by the training patterns. For $\\alpha > 1$ the patterns span the whole space, and therefore the delta function is absent.\n\nThe results at infinite time are immediately evident. For $\\alpha < 1$ there is a nonzero limit, $F(\\infty) = 1 - \\alpha$, while $F(\\infty)$ vanishes for $\\alpha > 1$, indicating perfect generalization (the solid line in Figure 1). While on the one hand it may seem remarkable that perfect generalization can be obtained from a training set which forms an infinitesimal fraction of the entire set of possible examples, the meaning of the result is just that $N$ points are sufficient to determine an $(N-1)$-dimensional hyperplane in $N$ dimensions.\n\nFigure 1: The asymptotic generalization error as a function of $\\alpha$. The full line corresponds to $\\lambda = 0$, the dashed line to $\\lambda = 0.2$, and the dotted line to $w_0 = 1$ and $\\lambda = 0$.\n\nFigure 2 shows $F(t)$ as obtained numerically from (10) and (11). The qualitative form of the approach to $F(\\infty)$ can be obtained analytically by inspection. For $\\alpha \\neq 1$, the asymptotic approach is governed by the smallest nonzero eigenvalue $\\epsilon_-$. Thus we have critical slowing down, with a divergent relaxation time\n\n$$\\tau = \\frac{1}{\\epsilon_-} = \\frac{1}{|\\sqrt{\\alpha} - 1|^2} \\tag{13}$$\n\nas the transition at $\\alpha = 1$ is approached. 
Right at the critical point, the eigenvalue density diverges for small $\\epsilon$ like $\\epsilon^{-1/2}$, which leads to the power law\n\n$$F(t) \\propto \\frac{1}{\\sqrt{t}} \\tag{14}$$\n\nat long times. Thus, while exactly $N$ examples are sufficient to produce perfect generalization, the approach to this desirable state is rather slow. A little bit above $\\alpha = 1$, $F(t)$ will also follow this power law for times $t \\ll \\tau$, going over to (slow) exponential decay at very long times ($t > \\tau$). By increasing the training set size well above $N$ (say, to a fixed multiple of $N$), one can achieve exponentially fast generalization. Below $\\alpha = 1$, where perfect generalization is never achieved, there is at least the consolation that the approach to the generalization level the network does reach is exponential (though with the same problem of a long relaxation time just below the transition as just above it).\n\nFigure 2: The generalization error as a function of time for a couple of different $\\alpha$.\n\n4 EXTENSIONS\n\nIn this section we briefly discuss some extensions of the foregoing calculation. We will see what happens if the weights are non-zero at $t = 0$, discuss weight decay, and finally consider noise in the learning process.\n\nWeight decay is a simple and frequently used way to limit the growth of the weights, which might be desirable for several reasons. It is also possible to approximate the problem with binary weights using a weight decay term (the so-called spherical model, see [11]). We consider the simplest kind of weight decay, which comes in as an additive term, $-\\lambda W_i = -\\lambda (U_i - v_i)$, in the learning equation (2), so the equation (5) for the difference between teacher and pupil is now\n\n$$\\dot{v}_i = -\\sum_j A_{ij} v_j + \\lambda (U_i - v_i) = -\\sum_j (A_{ij} + \\lambda \\delta_{ij}) v_j + \\lambda U_i. \\tag{15}$$\n\nApart from the last term this just shifts the eigenvalue spectrum by $\\lambda$. In the basis where $A$ is diagonal we can again write down the general solution to this equation:\n\n$$v_r = \\frac{\\lambda u_r \\big(1 - e^{-(A_r + \\lambda) t}\\big)}{A_r + \\lambda} + v_r(0) e^{-(A_r + \\lambda) t}. \\tag{16}$$\n\nThe square of this is\n\n$$v_r^2 = u_r^2 \\left[ \\frac{\\lambda \\big(1 - e^{-(A_r + \\lambda) t}\\big)}{A_r + \\lambda} + e^{-(A_r + \\lambda) t} - \\frac{w_r(0)}{u_r} e^{-(A_r + \\lambda) t} \\right]^2, \\tag{17}$$\n\nwhere $v_r(0) = u_r - w_r(0)$ has been used. As in (10) this has to be integrated over the eigenvalue spectrum to find the averaged generalization error. Assuming that the initial weights are random, so that $\\langle w_r(0) \\rangle = 0$, and that they have a relative variance given by\n\n$$\\frac{\\langle w_r(0)^2 \\rangle}{u_r^2} = w_0^2, \\tag{18}$$\n\nthe average of $F(t)$ over the distribution of initial conditions now becomes\n\n$$F(t) = \\int d\\epsilon \\, \\rho(\\epsilon) \\left[ \\left( \\frac{\\lambda \\big(1 - e^{-(\\epsilon + \\lambda) t}\\big)}{\\epsilon + \\lambda} + e^{-(\\epsilon + \\lambda) t} \\right)^2 + w_0^2 e^{-2 (\\epsilon + \\lambda) t} \\right]. \\tag{19}$$\n\n(Again it is assumed that the length of $\\vec{u}$ is $\\sqrt{N}$.) For $\\lambda = 0$ we see the result is the same as before except for a factor $1 + w_0^2$ in front of the integral. This means that the asymptotic generalization error is now\n\n$$F(\\infty) = \\begin{cases} (1 + w_0^2)(1 - \\alpha) & \\text{for } \\alpha < 1 \\\\\\\\ 0 & \\text{for } \\alpha > 1, \\end{cases} \\tag{20}$$\n\nwhich is shown as a dotted line in Figure 1 for $w_0 = 1$. The excess error can easily be understood as a contribution to the error from the non-relaxing part of the initial weight vector in the subspace orthogonal to the space spanned by the patterns. The relaxation times are unchanged for $\\lambda = 0$.
\n\nFor $\\lambda > 0$ the relaxation times become finite even at $\\alpha = 1$, because the smallest eigenvalue is shifted by $\\lambda$, so (13) is now\n\n$$\\tau = \\frac{1}{\\epsilon_- + \\lambda} = \\frac{1}{|\\sqrt{\\alpha} - 1|^2 + \\lambda}. \\tag{21}$$\n\nIn this case the asymptotic error can easily be obtained numerically from (19), and is shown by the dashed line in Figure 1. It is smaller than for $\\lambda = 0$ for $w_0^2 > 1$ at sufficiently small $\\alpha$. This is simply because the weight decay makes the part of $w(0)$ orthogonal to the pattern space decay away exponentially, thereby eliminating the excess error due to large initial weight components in this subspace.\n\nThis phase transition is very sensitive to noise. Consider adding a noise term $\\eta_i(t)$ to the right-hand side of (2), with\n\n$$\\langle \\eta_i(t) \\eta_j(t') \\rangle = 2 T \\delta_{ij} \\delta(t - t'). \\tag{22}$$\n\nHere we restrict our attention to the case $\\lambda = 0$. Carrying the extra term through the succeeding manipulations leads, in place of (7), to\n\n$$\\dot{v}_r = -A_r v_r + \\eta_r(t). \\tag{23}$$\n\nThe additional term leads to a correction (after Fourier transforming)\n\n$$\\delta v_r(\\omega) = \\frac{\\eta_r(\\omega)}{-i \\omega + A_r} \\tag{24}$$\n\nand thus to an extra (time-independent) piece of the generalization error $F(t)$:\n\n$$\\delta F = \\frac{1}{N} \\sum_r \\int \\frac{d\\omega}{2 \\pi} \\frac{\\langle |\\eta_r(\\omega)|^2 \\rangle}{|-i \\omega + A_r|^2} = \\frac{T}{N} \\sum_r \\frac{1}{A_r}. \\tag{25}$$\n\nFor $\\alpha > 1$, where there are no zero eigenvalues, we have\n\n$$\\delta F = T \\int_{\\epsilon_-}^{\\epsilon_+} d\\epsilon \\, \\frac{\\rho(\\epsilon)}{\\epsilon}, \\tag{26}$$\n\nwhich has the large-$\\alpha$ limit $T/\\alpha$, as found in equilibrium analyses (also for threshold perceptrons [2,3,5,6,7,8,9]). Equation (26) gives a generalization error which diverges as one approaches the transition at $\\alpha = 1$:\n\n$$\\delta F \\propto T \\epsilon_-^{-1/2} = \\frac{T}{\\sqrt{\\alpha} - 1}. \\tag{27}$$\n\nEquation (25) blows up for $\\alpha < 1$, where some of the $A_r$ are zero. 
This divergence just reflects the fact that in the subspace orthogonal to the training patterns, $v$ feels only the noise and so exhibits a random walk whose variance diverges as $t \\to \\infty$. Keeping more careful track of the dynamics in this subspace leads to\n\n$$\\delta F = 2 T (1 - \\alpha) t + T \\int_{\\epsilon_-}^{\\epsilon_+} d\\epsilon \\, \\frac{\\rho(\\epsilon)}{\\epsilon} \\simeq 2 T \\left[ (1 - \\alpha) t + \\tfrac{1}{4} (1 - \\sqrt{\\alpha})^{-1} \\right] \\quad (\\alpha \\to 1^-). \\tag{28}$$\n\n5 CONCLUSION\n\nGeneralization in the linear perceptron can be understood in the following picture. To get perfect generalization the training pattern vectors have to span the whole input space: $N$ points (in general position) are enough to specify any hyperplane. This means that perfect generalization appears only for $\\alpha > 1$. As $\\alpha$ approaches 1 the relaxation time (i.e. the learning time) diverges, signaling a phase transition, as is common in physical systems. Noise has a severe effect on this transition. It leads to a degradation of the generalization ability which diverges as one reduces the number of training examples toward the critical number.\n\nThis model is of course much simpler than most real-life training problems. However, it does allow us to examine in detail the dynamical phase transition separating perfect from imperfect generalization. Further extensions of the model can also be solved and will be reported elsewhere.\n\nReferences\n\n[1] Gardner, E. and B. Derrida: Three Unfinished Works on the Optimal Storage Capacity of Networks. Journal of Physics A 22, 1983-1994 (1989).\n\n[2] Schwartz, D.B., V.K. Samalam, S.A. Solla, and J.S. Denker: Exhaustive Learning. Neural Computation 2, 371-382 (1990).\n\n[3] Tishby, N., E. Levin, and S.A. Solla: Consistent Inference of Probabilities in Layered Networks: Predictions and Generalization. Proc. IJCNN Washington 1989, vol. 
2, 403-410, Hillsdale: Erlbaum (1989).\n\n[4] Baum, E.B. and D. Haussler: What Size Net Gives Valid Generalization. Neural Computation 1, 151-160 (1989).\n\n[5] Gyorgyi, G. and N. Tishby: Statistical Theory of Learning a Rule. In Neural Networks and Spin Glasses, eds W.K. Theumann and R. Koeberle. Singapore: World Scientific (1990).\n\n[6] Hansel, D. and H. Sompolinsky: Learning from Examples in a Single-Layer Neural Network. Europhysics Letters 11, 687-692 (1990).\n\n[7] Vallet, F., J. Cailton and P. Refregier: Linear and Nonlinear Extension of the Pseudo-Inverse Solution for Learning Boolean Functions. Europhysics Letters 9, 315-320 (1989).\n\n[8] Opper, M., W. Kinzel, J. Kleinz, and R. Nehl: On the Ability of the Optimal Perceptron to Generalize. Journal of Physics A 23, L581-L586 (1990).\n\n[9] Levin, E., N. Tishby, and S.A. Solla: A Statistical Approach to Learning and Generalization in Layered Neural Networks. AT&T Bell Labs, preprint (1990).\n\n[10] Gyorgyi, G.: Inference of a Rule by a Neural Network with Thermal Noise. Physical Review Letters 64, 2957-2960 (1990).\n\n[11] Hertz, J.A., A. Krogh, and G.I. Thorbergsson: Phase Transitions in Simple Learning. Journal of Physics A 22, 2133-2150 (1989).\n", "award": [], "sourceid": 349, "authors": [{"given_name": "Anders", "family_name": "Krogh", "institution": null}, {"given_name": "John", "family_name": "Hertz", "institution": null}]}