{"title": "Asymptotic Convergence of Backpropagation: Numerical Experiments", "book": "Advances in Neural Information Processing Systems", "page_first": 606, "page_last": 613, "abstract": null, "full_text": "Asymptotic Convergence of Backpropagation: Numerical Experiments\n\nSubutai Ahmad\nICSI\n1947 Center St.\nBerkeley, CA 94704\n\nGerald Tesauro\nIBM Watson Labs.\nP. O. Box 704\nYorktown Heights, NY 10598\n\nYu He\nDept. of Physics\nOhio State Univ.\nColumbus, OH 43212\n\nABSTRACT\n\nWe have calculated, both analytically and in simulations, the rate of convergence at long times in the backpropagation learning algorithm for networks with and without hidden units. Our basic finding for units using the standard sigmoid transfer function is $1/t$ convergence of the error for large $t$, with at most logarithmic corrections for networks with hidden units. Other transfer functions may lead to a slower polynomial rate of convergence. Our analytic calculations were presented in (Tesauro, He & Ahmad, 1989). Here we focus in more detail on our empirical measurements of the convergence rate in numerical simulations, which confirm our analytic results.\n\n1  INTRODUCTION\n\nBackpropagation is a popular learning algorithm for multilayer neural networks which minimizes a global error function by gradient descent (Werbos, 1974; Parker, 1985; LeCun, 1985; Rumelhart, Hinton & Williams, 1986). In this paper, we examine the rate of convergence of backpropagation late in learning when all of the errors are small. In this limit, the learning equations become more amenable to analytic study. 
By expanding in the small differences between the desired and actual output states, and retaining only the dominant terms, one can explicitly solve for the leading-order behavior of the weights as a function of time. This is true both for single-layer networks, and for multilayer networks containing hidden units. We confirm our analysis by empirical measurements of the convergence rate in numerical simulations.\n\nIn gradient-descent learning, one minimizes an error function $E$ according to:\n\n$$\\Delta \\vec{w} = -\\epsilon \\frac{\\partial E}{\\partial \\vec{w}} \\qquad (1)$$\n\nwhere $\\Delta \\vec{w}$ is the change in the weight vector at each time step, and the learning rate $\\epsilon$ is a small numerical constant. The convergence of equation 1 for single-layer networks with general error functions and transfer functions is studied in section 2. In section 3, we examine two standard modifications of gradient-descent: the use of a \"margin\" variable for turning off the error backpropagation, and the inclusion of a \"momentum\" term in the learning equation. In section 4 we consider networks with hidden units, and in the final section we summarize our results and discuss possible extensions in future work.\n\n2  CONVERGENCE IN SINGLE-LAYER NETWORKS\n\nThe input-output relationship for single-layer networks takes the form:\n\n$$y_p = g(\\vec{w} \\cdot \\vec{x}_p) \\qquad (2)$$\n\nwhere $\\vec{x}_p$ represents the state of the input units for pattern $p$, $\\vec{w}$ is the real-valued weight vector of the network, $g$ is the input-output transfer function (for the moment unspecified), and $y_p$ is the output state for pattern $p$. We assume that the transfer function approaches 0 for large negative inputs and 1 for large positive inputs. 
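
As a concrete point of reference for equations 1 and 2, the following is a minimal sketch of a single-layer sigmoid network trained by batch gradient descent on the quadratic error. It is an illustration under stated assumptions, not the simulation code used for the experiments: the +1/-1 input coding, the choice of 9 inputs, the learning rate, the initial weight scale, the run length and all function names are hypothetical. Only the 200-pattern majority-function data set and the sigmoid transfer function are taken from the text.

```python
import numpy as np


def sigmoid(h):
    # Standard sigmoid transfer function g(h) = 1 / (1 + exp(-h)).
    return 1.0 / (1.0 + np.exp(-h))


def majority_patterns(n_inputs, n_patterns, rng):
    # Assumed coding: inputs are +1 or -1; the target is 1 when most inputs are +1.
    x = rng.choice([-1.0, 1.0], size=(n_patterns, n_inputs))
    d = (x.sum(axis=1) > 0).astype(float)
    return x, d


def train(x, d, epochs, lr, rng):
    # Batch gradient descent (equation 1) on the quadratic error sum_p (y_p - d_p)^2.
    w = rng.normal(scale=0.1, size=x.shape[1])
    errors = np.empty(epochs)
    for t in range(epochs):
        y = sigmoid(x @ w)                        # equation 2
        errors[t] = np.sum((y - d) ** 2)
        # dE/dw = sum_p 2 (y_p - d_p) g'(h_p) x_p, with g' = y (1 - y) for the sigmoid
        grad = x.T @ (2.0 * (y - d) * y * (1.0 - y))
        w = w - lr * grad
    return w, errors


def loglog_slope(errors):
    # Least-squares slope of ln(error) vs. ln(epoch) over the second half of the run.
    t = np.arange(1, len(errors) + 1)
    half = len(errors) // 2
    return np.polyfit(np.log(t[half:]), np.log(errors[half:]), 1)[0]


rng = np.random.default_rng(0)
x, d = majority_patterns(n_inputs=9, n_patterns=200, rng=rng)
w, errors = train(x, d, epochs=2000, lr=0.01, rng=rng)
slope = loglog_slope(errors)
print(errors[0], errors[-1], slope)
```

The printed slope is the quantity the analysis in this section is concerned with. How close it comes to the asymptotic value depends on the run length and on the assumed parameters, so a short run such as this one should be read as a qualitative check only.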
\n\nFor convenience of analysis, we rewrite equation 1 for continuous time as:\n\n$$\\dot{\\vec{w}} = -\\epsilon \\sum_p \\frac{\\partial E_p}{\\partial \\vec{w}} = -\\epsilon \\sum_p \\frac{\\partial E_p}{\\partial y_p} \\frac{\\partial y_p}{\\partial \\vec{w}} = -\\epsilon \\sum_p \\frac{\\partial E_p}{\\partial y_p} g'(h_p) \\vec{x}_p \\qquad (3)$$\n\nwhere $E_p$ is the individual error for pattern $p$, $h_p = \\vec{w} \\cdot \\vec{x}_p$ is the total input activation of the output unit for pattern $p$, and the summation over $p$ is for an arbitrary subset of the possible training patterns. $E_p$ is a function of the difference between the actual output $y_p$ and the desired output $d_p$ for pattern $p$. Examples of common error functions are the quadratic error $E_p = (y_p - d_p)^2$ and the \"cross-entropy\" error (Hinton, 1987) $E_p = d_p \\log y_p + (1 - d_p) \\log(1 - y_p)$.\n\nInstead of solving equation 3 for the weights directly, it is more convenient to work with the outputs $y_p$. The outputs evolve according to:\n\n$$\\dot{y}_p = -\\epsilon g'(h_p) \\sum_q \\frac{\\partial E_q}{\\partial y_q} g'(h_q) \\, \\vec{x}_q \\cdot \\vec{x}_p \\qquad (4)$$\n\nFigure 1: Plots of ln(error) vs. ln(epochs) for single-layer networks learning the majority function using standard backpropagation without momentum. Four different learning runs starting from different random initial weights are shown. In each case, the asymptotic behavior is approximately $E \\sim 1/t$, as seen by comparison with a reference line of slope $-1$.\n\nLet us now consider the situation late in learning when the output states are approaching the desired values. We define new variables $\\eta_p = y_p - d_p$, and assume that $\\eta_p$ is small for all $p$. 
For reasonable error functions, the individual errors $E_p$ will go to zero as some power of $\\eta_p$, i.e., $E_p \\sim \\eta_p^\\gamma$. (For the quadratic error, $\\gamma = 2$, and for the cross-entropy error, $\\gamma = 1$.) Similarly, the slope of the transfer function should approach zero as the output state approaches 1 or 0, and for reasonable transfer functions, this will again follow a power law, i.e., $g'(h_p) \\sim |\\eta_p|^\\beta$. Using the definitions of $\\eta$, $\\gamma$ and $\\beta$, equation 4 becomes:\n\n$$\\dot{\\eta}_p \\sim -|\\eta_p|^\\beta \\sum_q \\eta_q^{\\gamma-1} |\\eta_q|^\\beta \\, \\vec{x}_q \\cdot \\vec{x}_p + \\text{higher order} \\qquad (5)$$\n\nThe absolute value appears because $g$ is a non-decreasing function. Let $\\eta_r$ be the slowest to approach zero among all the $\\eta_p$'s. We then have for $\\eta_r$:\n\n$$\\dot{\\eta}_r \\sim -\\eta_r^{2\\beta+\\gamma-1} \\qquad (6)$$\n\nUpon integrating we obtain\n\n$$\\eta_r \\sim t^{-1/(2\\beta+\\gamma-2)}; \\quad E \\sim \\eta_r^\\gamma \\sim t^{-\\gamma/(2\\beta+\\gamma-2)} \\qquad (7)$$\n\nWhen $\\beta = 1$, i.e., $g' \\sim \\eta$, the error function approaches zero like $1/t$, independent of $\\gamma$. Since $\\beta = 1$ for the standard sigmoid function $g(x) = (1 + e^{-x})^{-1}$, one expects to see $1/t$ behavior in the error function in this case. This behavior was in fact first seen in the numerical experiments of (Ahmad, 1988; Ahmad & Tesauro, 1988). The behavior was obtained at relatively small $t$, about 20 cycles through the training set. Figure 1 illustrates this behavior for single-layer networks learning a data set containing 200 randomly chosen instances of the majority function. In each case, the behavior at long times in this plot is approximately a straight line, indicating power-law decrease of the error. The slopes are in each case within a few percent of the theoretically predicted value of $-1$.\n\nIt turns out that $\\beta = 1$ gives the fastest possible convergence of the error function. 
This is because $\\beta < 1$ yields transfer functions which do not saturate at finite values, and thus are not allowed, while $\\beta > 1$ yields slower convergence. For example, if we take the transfer function to be $g(x) = 0.5[1 + (2/\\pi) \\tan^{-1} x]$, then $\\beta = 2$. In this case, the error function will go to zero as $E \\sim t^{-\\gamma/(\\gamma+2)}$. In particular, when $\\gamma = 2$, $E \\sim 1/\\sqrt{t}$.\n\n3  MODIFICATIONS OF GRADIENT DESCENT\n\nOne common modification to strict gradient-descent is the use of a \"margin\" variable $\\mu$ such that, if the difference between network output and teacher signal is smaller than $\\mu$, no error is backpropagated. This is meant to prevent the network from devoting resources to making its output arbitrarily close to the teacher signal, which is usually unnecessary. It is clear from the structure of equations 5, 6 that the margin will not affect the basic $1/t$ error convergence, except in a rather trivial way. When a margin is employed, certain driving terms on the right-hand side of equation 5 will be set to zero as soon as they become small enough. However, as long as some non-zero driving terms are present, the basic polynomial solution of equation 7 will be unaltered. Of course, when all the driving terms disappear because they are all smaller than the margin, the network will stop learning, and the error will remain constant at some positive value. Thus the predicted behavior is $1/t$ decrease in the error followed eventually by a rapid transition to constant non-zero error. This agrees with what is seen numerically in Figure 2.\n\nFigure 2: Plot of ln(error) vs. ln(epochs) for various values of margin variable $\\mu$ as indicated. In each case there is a $1/t$ decrease in the error followed by a sudden transition to constant error. This transition occurs earlier for larger values of $\\mu$.\n\nAnother popular generalization of equation 1 includes a \"momentum\" term:\n\n$$\\Delta \\vec{w}(t) = -\\epsilon \\frac{\\partial E}{\\partial \\vec{w}}(t) + \\alpha \\Delta \\vec{w}(t - 1) \\qquad (8)$$\n\nIn continuous time, this takes the form:\n\n$$\\alpha \\ddot{\\vec{w}} + (1 - \\alpha) \\dot{\\vec{w}} = -\\epsilon \\frac{\\partial E}{\\partial \\vec{w}} \\qquad (9)$$\n\nTurning this into an equation for the evolution of outputs gives:\n\n$$\\alpha \\ddot{y}_p - \\alpha g''(h_p) \\left[ \\frac{\\dot{y}_p}{g'(h_p)} \\right]^2 + (1 - \\alpha) \\dot{y}_p = -\\epsilon g'(h_p) \\sum_q \\frac{\\partial E_q}{\\partial y_q} g'(h_q) \\, \\vec{x}_q \\cdot \\vec{x}_p \\qquad (10)$$\n\nOnce again, expanding $y_p$, $E_p$ and $g'$ in small $\\eta_p$ yields a second-order differential equation for $\\eta_p$ in terms of a sum over other $\\eta_q$. As in equation 6, the sum will be controlled by some dominant term $r$, and the equation for this term is:\n\n$$C_1 \\ddot{\\eta}_r + C_2 \\frac{\\dot{\\eta}_r^2}{\\eta_r} + C_3 \\dot{\\eta}_r \\sim -\\eta_r^{2\\beta+\\gamma-1} \\qquad (11)$$\n\nwhere $C_1$, $C_2$ and $C_3$ are numerical constants. For polynomial solutions $\\eta_r \\sim t^z$, the first two terms are of order $t^{z-2}$, and can be neglected relative to the third term which is of order $t^{z-1}$. The resulting equation thus has exactly the same form as in the zero momentum case of section 2, and therefore the rate of convergence is the same as in equation 7. This is demonstrated numerically in Figure 3. We can see that the error behaves as $1/t$ for large $t$ regardless of the value of momentum constant $\\alpha$. Furthermore, although it is not required by the analytic theory, the numerical prefactor appears to be the same in each case.\n\nFinally, we have also considered the effect on convergence of schemes for adaptively altering the learning rate constant $\\epsilon$. 
It was shown analytically in (Tesauro, He & Ahmad, 1989) that for the scheme proposed by Jacobs (1988), in which the learning rate could in principle increase linearly with time, the error would decrease as $1/t^2$ for sigmoid units, instead of the $1/t$ result for fixed $\\epsilon$.\n\nFigure 3: Plot of ln(error) vs. ln(epochs) for single-layer networks learning the majority function, with momentum constant $\\alpha = 0, 0.25, 0.5, 0.75, 0.99$. Each run starts from the same random initial weights. Asymptotic $1/t$ behavior is obtained in each case, with the same numerical prefactor.\n\n4  CONVERGENCE IN NETWORKS WITH HIDDEN UNITS\n\nWe now consider networks with a single hidden layer. In (Tesauro, He & Ahmad, 1989), it was shown that if the hidden units saturate late in learning, then the convergence rate is no different from the single-layer rate. This should be typical of what usually happens. However, assuming for purposes of argument that the hidden units do not saturate, when one goes through a small $\\eta$ expansion of the learning equation, one obtains a coupled system of equations of the following form:\n\n$$\\dot{\\eta} \\sim -\\eta^{2\\beta+\\gamma-1} [1 + \\Omega^2] \\qquad (12)$$\n\n$$\\dot{\\Omega} \\sim \\eta^{\\gamma+\\beta-1} \\qquad (13)$$\n\nwhere $\\Omega$ represents the magnitude of the second layer weights, and for convenience all indices have been suppressed and all terms of order 1 have been written simply as 1.\n\nFor $\\beta > 1$, this system has polynomial solutions of the form $\\eta \\sim t^z$, $\\Omega \\sim t^\\lambda$, with $z = -3/(3\\gamma + 4\\beta - 4)$ and $\\lambda = z(\\gamma + \\beta - 1) + 1$. It is interesting to note that these solutions converge slightly faster than 
in the single-layer case. For example, with $\\gamma = 2$ and $\\beta = 2$, $\\eta \\sim t^{-3/10}$ in the multilayer case, but as shown previously, $\\eta$ goes to zero only as $t^{-1/4}$ in the single-layer case. We emphasize that this slight speed-up will only be obtained when the hidden unit states do not saturate. To the extent that the hidden units saturate and their slopes become small, the convergence rate will return to the single-layer rate.\n\nWhen $\\beta = 1$ the above polynomial solution is not possible. Instead, one can verify that the following is a self-consistent leading order solution to equations 12, 13:\n\n(14)\n\n(15)\n\nRecall that in the single-layer case, $\\eta \\sim t^{-1/\\gamma}$. Therefore, the effect of multiple layers could provide at most only a logarithmic speed-up of convergence when the hidden units do not saturate. For practical purposes, then, we expect the convergence of networks with hidden units to be no different empirically from networks without hidden units. This is in fact what our simulations find, as illustrated in Figure 4.\n\nFigure 4: Plot of ln(error) vs. ln(epochs) for networks with varying numbers of hidden units (0, 3, 10 and 50, as indicated) learning the majority function data set. Approximate $1/t$ behavior is obtained in each case.\n\n5  DISCUSSION\n\nWe have obtained results for the asymptotic convergence of gradient-descent learning which are valid for a wide variety of error functions and transfer functions. We typically expect the same rate of convergence to be obtained regardless of whether or not the network has hidden units. 
However, it may be possible to obtain a slight polynomial speed-up when $\\beta > 1$ or a logarithmic speed-up when $\\beta = 1$. We point out that in all cases, the sigmoid provides the maximum possible convergence rate, and is therefore a \"good\" transfer function to use in that sense.\n\nWe have not attempted analysis of networks with multiple layers of hidden units; however, the analysis of (Tesauro, He & Ahmad, 1989) suggests that, to the extent that the hidden unit states saturate and the $g'$ factors vanish, the rate of convergence would be no different even in networks with arbitrary numbers of hidden layers.\n\nAnother important finding is that the expected rate of convergence does not depend on the use of all $2^n$ input patterns in the training set. The same behavior should be seen for general subsets of training data. This is also in agreement with our numerical results, and with the results of (Ahmad, 1988; Ahmad & Tesauro, 1988).\n\nIn conclusion, a combination of analysis and numerical simulations has led to insight into the late stages of gradient-descent learning. It might also be possible to extend our approach to times earlier in the learning process, when not all of the errors are small. One might also be able to analyze the numbers, sizes and shapes of the basins of attraction for gradient-descent learning in feed-forward networks. Another important issue is the behavior of the generalization performance, i.e., the error on a set of test patterns not used in training, which was not addressed in this paper. Finally, our analysis might provide insight into the development of new algorithms which might scale more favorably than backpropagation.\n\nReferences\n\nS. Ahmad. (1988) A study of scaling and generalization in neural networks. 
Master's Thesis, Univ. of Illinois at Urbana-Champaign, Dept. of Computer Science.\n\nS. Ahmad & G. Tesauro. (1988) Scaling and generalization in neural networks: a case study. In D. S. Touretzky et al. (eds.), Proceedings of the 1988 Connectionist Models Summer School, 3-10. San Mateo, CA: Morgan Kaufmann.\n\nG. E. Hinton. (1987) Connectionist learning procedures. Technical Report No. CMU-CS-87-115, Dept. of Computer Science, Carnegie-Mellon University.\n\nR. A. Jacobs. (1988) Increased rates of convergence through learning rate adaptation. Neural Networks 1:295-307.\n\nY. Le Cun. (1985) A learning procedure for asymmetric network. Proceedings of Cognitiva (Paris) 85:599-604.\n\nD. B. Parker. (1985) Learning-logic. Technical Report No. TR-47, MIT Center for Computational Research in Economics and Management Science.\n\nD. E. Rumelhart, G. E. Hinton, & R. J. Williams. (1986) Learning representations by back-propagating errors. Nature 323:533-536.\n\nG. Tesauro, Y. He & S. Ahmad. (1989) Asymptotic convergence of backpropagation. Neural Computation 1:382-391.\n\nP. Werbos. (1974) Ph.D. Thesis, Harvard University.", "award": [], "sourceid": 238, "authors": [{"given_name": "Subutai", "family_name": "Ahmad", "institution": null}, {"given_name": "Gerald", "family_name": "Tesauro", "institution": null}, {"given_name": "Yu", "family_name": "He", "institution": null}]}