{"title": "A Study of Parallel Perturbative Gradient Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 803, "page_last": 810, "abstract": null, "full_text": "A  Study of Parallel Perturbative \n\nGradient  Descent \n\nD. Lippe\u00b7  J. Alspector \n\nBellcore \n\nMorristown,  NJ  07960 \n\nAbstract \n\nWe  have  continued  our  study  of a  parallel  perturbative  learning \nmethod  [Alspector  et al.,  1993]  and implications for  its implemen(cid:173)\ntation in analog VLSI. Our new results indicate that, in most cases, \na single parallel perturbation (per pattern presentation) of the func(cid:173)\ntion  parameters  (weights in a  neural network)  is  theoretically  the \nbest  course.  This  is  not  true,  however,  for  certain  problems  and \nmay  not  generally  be  true  when  faced  with  issues  of implemen(cid:173)\ntation  such  as limited  precision.  In  these  cases,  multiple  parallel \nperturbations may be best as indicated in our previous results. \n\n1 \n\nINTRODUCTION \n\nMotivated  by  difficulties  in  analog  VLSI  implementation  of  back-propagation \n[Rumelhart et al.,  1986]  and  related  algorithms  that  calculate  gradients  based  on \ndetailed  knowledge  of  the  neural  network  model,  there  were  several  similar  re(cid:173)\ncent papers proposing to use a  parallel [Alspector et al.,  1993, Cauwenberghs,  1993, \nKirk et al.,  1993] or a semi-parallel [Flower and Jabri,  1993]  perturbative technique \nwhich  has the property that it  measures  (with the physical neural network)  rather \nthan calculates the gradient.  This technique is closely related to methods of stochas(cid:173)\ntic approximation [Kushner and Clark,  1978]  which have been investigated recently \nby workers in fields other than neural networks.  [Spall,  1992] showed that averaging \nmultiple parallel perturbations for  each pattern presentation may be asymptotically \npreferable in the presence of noise.  Our own results [Alspector et al.,  1993] indicated \n\n\u00b7Present address:  Dept.  of EECSj MITj  Cambridge,  MA 02139;  dalippe@mit.edu \n\n\f804 \n\nD.  Lippe,  1.  Alspector \n\nthat multiple parallel perturbations are also preferable when only limited precision \nis  available in the learning rate  which is  realistic  for  a  physical implementation.  In \nthis work we  have investigated whether multiple parallel perturbations for each pat(cid:173)\ntern are  non-asymptotically preferable  theoretically  (without  noise).  We  have  also \nstudied this  empirically,  to the limited degree  that simulations allow,  by  removing \nthe precision  constraints of our  previous work. \n\n2  GRADIENT ESTIMATION BY  PARALLEL  WEIGHT \n\nPERTURBATION \n\nFollowing our previous work,  one can estimate the gradient of the error, E( w),  with \nrespect  to any  weight,  Wi,  by  perturbing  Wi  by  6w1  and measuring  the  change  in \nthe  output  error,  6E,  as  the  entire  weight  vector,  W,  except  for  component  Wi  is \nheld constant. \n\n6E \n6w1 \n\nE(w + 6;1) - E(w) \n\n6Wi \n\nWe  now consider  perturbing all  weights  simultaneously.  However,  we  wish  to have \nthe  perturbation  vector,  6w,  chosen  uniformly  on  a  hypercube.  Note  that  this \nrequires  only  a  random  sign  multiplying  a  fixed  perturbation  and  is  natural  for \nVLSI  using a  parallel noise  generator  [Alspector et al.,  1991J. 
We now consider perturbing all weights simultaneously. However, we wish to have the perturbation vector, \delta\vec{w}, chosen uniformly on a hypercube. Note that this requires only a random sign multiplying a fixed perturbation and is natural for VLSI using a parallel noise generator [Alspector et al., 1991].

This leads to the approximation (ignoring higher order terms)

    \frac{\delta E}{\delta w_i} = \frac{E(\vec{w} + \delta\vec{w}) - E(\vec{w})}{\delta w_i} \approx \frac{\partial E}{\partial w_i} + \sum_{j \neq i} \frac{\partial E}{\partial w_j} \frac{\delta w_j}{\delta w_i}.    (1)

The last term has expectation value zero for random and independently distributed \delta w_i. The weight change rule

    \Delta w_i = -\eta \frac{\delta E}{\delta w_i},

where \eta is a learning rate, will therefore follow the gradient on average, but with considerable noise.

For each pattern, one can reduce the variance of the noise term in (1) by repeating the random parallel perturbation many times to improve the statistical estimate. If we average over P perturbations, we have

    \frac{\delta E}{\delta w_i} = \frac{1}{P} \sum_{p=1}^{P} \frac{\delta E^{(p)}}{\delta w_i^{(p)}},

where p indexes the perturbation number.
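As an illustrative sketch of the averaged parallel rule (again not the code used in our simulations; E stands for whatever routine measures the error of a weight vector, and eta, sigma and P are placeholder values):

import numpy as np

rng = np.random.default_rng(0)

def parallel_perturbative_update(E, w, eta=0.01, sigma=1e-3, P=1):
    # One weight update: all weights are perturbed at once by +/- sigma
    # (a random sign times a fixed step, i.e. a point on a hypercube),
    # the measured change in error is divided by each weight's own
    # perturbation, and the estimates are averaged over P perturbations.
    base = E(w)                          # E(w) for the current pattern
    grad_est = np.zeros_like(w)
    for _ in range(P):
        delta = sigma * rng.choice([-1.0, 1.0], size=w.shape)
        dE = E(w + delta) - base         # one extra error measurement
        grad_est += dE / delta           # delta E / delta w_i for every i
    grad_est /= P                        # average over the P perturbations
    return w - eta * grad_est            # Delta w_i = -eta * estimate

For P = 1 this costs only two error measurements per pattern regardless of the number of weights; raising P trades extra measurements for a lower-variance gradient estimate, which is exactly the trade-off analyzed in the next section.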
3 THEORETICAL RELATIVE EFFICIENCY

3.1 BACKGROUND

Spall [Spall, 1992] shows in an asymptotic sense that multiple perturbations may be faster if only a noisy measurement of E(\vec{w}) is available, and that one perturbation is superior otherwise. His results are asymptotic in that they compare the rate of convergence to the local minimum if the algorithms run for infinite time. Thus, his results may only indicate that 1 perturbation is superior close to a local minimum. Furthermore, his result implicitly assumes that P perturbations per weight update takes P times as long as 1 perturbation per weight update. Experience shows that the time required to present patterns to the hardware is often the bottleneck in VLSI implementations of neural networks [Brown et al., 1992]. In a hardware implementation of a perturbative learning algorithm, a few perturbations might be performed with no time penalty while waiting for the next pattern presentation.

The remainder of this section sketches an argument that multiple perturbations may be desirable for some problems in a non-asymptotic sense, even in a noise-free environment and under the assumption of a multiplicative time penalty for performing multiple perturbations. On the other hand, the argument also shows that there is little reason to believe in practice that any given problem will be learned more quickly by multiple perturbations. Space limitations prevent us from reproducing the full argument and the discussion of its relevance, which can be found in [Lippe, 1994].

The argument fixes a point in weight space and considers the expectation value of the change in the error induced by one weight update, under both the 1 perturbation case and the multiple perturbation case. [Cauwenberghs, 1994] contains a somewhat related analysis of the relative speed of one parallel perturbation and weight perturbation as described in [Jabri and Flower, 1991]. The analysis is only truly relevant far from a local minimum, because close to a local minimum the variance of the change of the error is as important as the mean of the change of the error.

3.2 Calculations

If P is the number of perturbations, then our learning rule is

    \Delta w_i = \frac{-\eta}{P} \sum_{p=1}^{P} \frac{\delta E^{(p)}}{\delta w_i^{(p)}}.    (2)

If W is the number of weights, then \Delta E, calculated to second order in \eta, is

    \Delta E = \sum_{i=1}^{W} \frac{\partial E}{\partial w_i} \Delta w_i + \frac{1}{2} \sum_{i=1}^{W} \sum_{j=1}^{W} \frac{\partial^2 E}{\partial w_i \partial w_j} \Delta w_i \Delta w_j.    (3)

Expanding \delta E^{(p)} to second order in \sigma (where \delta w_i = \pm\sigma), we obtain

    \delta E^{(p)} = \sum_{j=1}^{W} \frac{\partial E}{\partial w_j} \delta w_j^{(p)} + \frac{1}{2} \sum_{j=1}^{W} \sum_{k=1}^{W} \frac{\partial^2 E}{\partial w_j \partial w_k} \delta w_j^{(p)} \delta w_k^{(p)}.    (4)

[Lippe, 1994] shows that combining (2)-(4), retaining only first and second order terms, and taking expectation values gives

    \langle \Delta E \rangle = -\eta X + \frac{\eta^2}{2P} \left( Y + P Z \right),    (5)

where

    X = \sum_{i=1}^{W} \left( \frac{\partial E}{\partial w_i} \right)^2

and Y and Z are second-order coefficients built from the Hessian of E, whose explicit forms are given in [Lippe, 1994].

Note that the first term in (5) is less than or equal to 0, since X is a sum of squares.¹ The second term, on the other hand, can be either positive or negative. Clearly, then, a sufficient condition for learning is that the first term dominates the second term. By making \eta small enough, we can guarantee that learning occurs. Strictly speaking, this is not a necessary condition for learning. However, it is important to keep in mind that we are only focusing on one point in weight space. If, at this point in weight space, \langle \Delta E \rangle is negative but the second term's magnitude is close to the first term's magnitude, it is not unlikely that at some other point in weight space \langle \Delta E \rangle will be positive. Thus, we will assume that for efficient learning to occur, it is necessary that \eta be small enough to make the first term dominate the second term.

¹If we are at a stationary point then the first term in (5) is 0.

Assume that some problem can be successfully learned with one perturbation, at learning rate \eta(1). Then the first order term in (5) dominates the second order terms. Specifically, at any point in weight space we have, for some large constant \mu,

    \eta(1) X \geq \mu \, \eta(1)^2 \, |Y + Z|.

In order to learn with P perturbations, we apparently need

    \eta(P) X \geq \frac{\mu \, \eta(P)^2}{P} \, |Y + P Z|.    (6)

The assumption that the first order term of (5) dominates the second order terms implies that convergence time is proportional to 1/\eta(P). Thus, learning is more efficient in the multiple perturbation case if

    \frac{\eta(P)}{P} > \eta(1).    (7)

It turns out, as shown in [Lippe, 1994], that the conditions (6) and (7) can be met simultaneously with multiple perturbations if -Y/Z \geq 2.
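Roughly, this ratio condition can be motivated as follows (a sketch under the assumption that the usable learning rates saturate the domination conditions; the precise argument is in [Lippe, 1994]). The largest allowed rates are \eta(1) \approx X / (\mu |Y + Z|) and, from (6), \eta(P) \approx P X / (\mu |Y + P Z|). Substituting these into (7) reduces it to

    |Y + P Z| < |Y + Z|,

which can hold for some integer P \geq 2 only if Y and Z have opposite signs and the ratio -Y/Z is sufficiently large (roughly 2 or more); one then chooses P close to -Y/Z so that the term P Z nearly cancels Y.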
It is shown in [Lippe, 1994], using the fact that the Hessian of a quadratic function with a minimum is positive semi-definite, that if E is quadratic and has a minimum, then Y and Z have the same sign (and hence -Y/Z < 2). Any well behaved function acts quadratically sufficiently close to a stationary point. Thus, we cannot get \langle \Delta E \rangle more than a factor of P larger by using P perturbations near local minima of well behaved functions. Although, as mentioned earlier, we are entirely ignoring the issue of the variance of \Delta E, this may be some indication of the asymptotic superiority of 1 perturbation.

3.3 Discussion of Results

The result that multiple perturbations are superior when -Y/Z \geq 2 may seem somewhat mysterious. It sheds some light on this result to rewrite (5) as

    \langle \Delta E \rangle = -\eta X + \frac{\eta^2}{2} \left( \frac{Y}{P} + Z \right).

For strict gradient descent, the corresponding equation is

    \langle \Delta E \rangle = \Delta E = -\eta X + \frac{\eta^2}{2} Z.

The difference between strict gradient descent and perturbative gradient descent, on average, is the second order term \frac{\eta^2}{2} \frac{Y}{P}. This is the term which results from not following the gradient exactly, and it obviously goes down as P goes up and the gradient measurement becomes more accurate. Thus, if Z and Y have different signs, P can be used to make the second order term disappear. There is no way to know whether this situation will occur frequently. Furthermore, it is important to keep in mind that if Y is negative and Z is positive, then raising P may make the magnitude of the second order term smaller, but it makes the term itself larger. Thus, in general, there is little reason to believe that multiple perturbations will help with a randomly chosen problem.

An example where multiple perturbations help is when we are at a point where the error surface is convex along the gradient direction, and concave in most other directions. Curvature due to second derivative terms in Y and Z helps when the gradient direction is followed, but can hurt when we stray from the gradient. In this case, Z < 0 and possibly Y > 0, so multiple perturbations might be preferable in order to follow the gradient direction very closely.

4 SIMULATIONS OF SINGLE AND MULTIPLE PARALLEL PERTURBATION

4.1 CONSTANT LEARNING RATES

The second order terms in (5) can be reduced either by using a small learning rate, or by using more perturbations, as discussed briefly in [Cauwenberghs, 1993]. Thus, if \eta is kept constant, we expect a minimum necessary number of perturbations in order to learn. This in itself might be of importance in a limited precision implementation. If there is a non-trivial lower bound on \eta, then it might be necessary to use multiple perturbations in order to learn. This is the effect that was noticed in [Alspector et al., 1993]. At that time we thought that we had found empirically that multiple perturbations were necessary for learning. The problem was that we failed to decrease the learning rate with the number of perturbations.

Table 1: Running times for the first initial weight vector

    P    \eta      Time for < .5    Time for < .1
    1    .0005     32,179           1,121,459
    1    .001      18,534           831,684
    1    .002      11,008           784,768
    1    .003      9,933            494,029
    1    .004      9,728            1,695,974
    7    .00625    23,834           707,840
    7    .008      16,845           583,654
    7    .0125     13,261           922,880
    7    .025      12,006           1,010,355
    7    .035      17,024           Not tested

4.2 EMPIRICAL RELATIVE EFFICIENCY OF SINGLE AND MULTIPLE PERTURBATION ALGORITHMS

Section 3 showed that, in theory, multiple perturbations might be faster than 1 perturbation. We investigated whether or not this is the case for the 7 input Hamming error correction problem as described in [Biggs, 1989]. This is basically a nearest neighbor problem. There exist 16 distinct 7 bit binary code words. When presented with an arbitrary 7 bit binary word, the network is to output the code word with the least Hamming distance from the input.
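As an illustrative sketch (assuming the 16 code words are those of the standard (7,4) Hamming code, an assumption on our part), the 128 input patterns and their nearest-code-word targets can be constructed as follows:

import numpy as np
from itertools import product

# Generator matrix of the (7,4) Hamming code; this particular code is an
# assumption here, chosen because it has exactly 16 distinct 7-bit code words.
G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])

codewords = np.array([(np.array(m) @ G) % 2 for m in product([0, 1], repeat=4)])

def nearest_codeword(word):
    # Target output: the code word with the smallest Hamming distance.
    dists = np.sum(codewords != word, axis=1)
    return codewords[np.argmin(dists)]

# The 128 possible 7-bit input patterns, presented repeatedly in order.
inputs = np.array(list(product([0, 1], repeat=7)))
targets = np.array([nearest_codeword(x) for x in inputs])

With this particular code, every 7-bit word lies within Hamming distance 1 of exactly one code word, so the nearest-neighbor target is unambiguous.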
After preliminary tests with 50, 25, 7, and 1 perturbations, it seemed that 7 perturbations provided the fastest learning, so we concentrated on running simulations for both the 1 perturbation and the 7 perturbation case. Specifically, we chose two different (randomly generated) initial weight vectors, and five different seeds for the pseudo-random function used to generate the \delta w_i. For each of these ten cases, we tested both 1 perturbation and 7 perturbations with various learning rates in order to obtain the fastest possible learning.

The 128 possible input patterns were repeatedly presented in order. We investigated how many pattern presentations were necessary to drive the MSE below .1 and how many presentations were necessary to drive it below .5. Recalling the theory developed in section 3, we know that multiple perturbations can be helpful only far away from a stationary point. Thus, we expected that 7 perturbations might be quicker reaching .5 but would be slower reaching .1.

The results are summarized in tables 1 and 2. Each table summarizes information for a different initial weight vector. All of the data presented are averaged over 5 runs, one with each of the different random seeds. The two columns labeled "Time for < .5" and "Time for < .1" are adjusted according to the assumption that one weight update at 7 perturbations takes 7 times as long as one weight update at 1 perturbation. In each table, the following four numbers appear in italics: the shortest time to reach .1 with 1 perturbation, the shortest time to reach .1 with 7 perturbations, the shortest time to reach .5 with 1 perturbation, and the shortest time to reach .5 with 7 perturbations.

Table 2: Running times for the second initial weight vector

    P    \eta      Time for < .5    Time for < .1
    1    .001      22,133           928,236
    1    .002      12,817           719,078
    1    .003      10,675           154,139
    1    .004      11,150           1,603,354
    7    .00625    21,059           629,530
    7    .008      19,112           611,610
    7    .0125     15,949           912,333
    7    .025      14,515           1,580,442
    7    .035      11,141           Not tested

7 perturbations were a loss in three out of four of the experiments. Surprisingly, the one time that multiple perturbations helped was in reaching .1 from the second initial weight vector. There are several possible explanations for this. To begin with, these learning times are averages over only five simulations each, which makes their statistical significance somewhat dubious. Unfortunately, it was impractical to perform too many experiments, as the data obtained required 180 computer simulations, each of which sometimes took more than a day to complete.

Another possible explanation is that .1 may not be "asymptotic enough." The numbers .5 and .1 were chosen somewhat arbitrarily to represent non-asymptotic and asymptotic results. However, there is no way of predicting from the theory how close the error must be to its minimum before asymptotic results become relevant.
The fact that 1 perturbation outperformed 7 perturbations in three out of four cases is not surprising. As explained in section 3, there is in general no reason to believe that multiple perturbations will help on a randomly chosen problem.

5 CONCLUSION

Our results show that, under ideal computational conditions, where the learning rate can be adjusted to the proper size, a single parallel perturbation is, except for unusual problems, superior to multiple parallel perturbations. However, under the precision constraints imposed by analog VLSI implementation, where learning rates may not be adjustable and presenting a pattern takes longer than performing a perturbation, multiple parallel perturbations are likely to be the best choice.

Acknowledgment

We thank Gert Cauwenberghs and James Spall for valuable and insightful discussions.

References

[Alspector et al., 1991] Alspector, J., Gannett, J. W., Haber, S., Parker, M. B., and Chu, R. (1991). A VLSI-efficient technique for generating multiple uncorrelated noise sources and its application to stochastic neural networks. IEEE Transactions on Circuits and Systems, 38:109-123.

[Alspector et al., 1993] Alspector, J., Meir, R., Yuhas, B., Jayakumar, A., and Lippe, D. (1993). A parallel gradient descent method for learning in analog VLSI neural networks. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 836-844, San Mateo, California. Morgan Kaufmann Publishers.

[Biggs, 1989] Biggs, N. L. (1989). Discrete Mathematics. Oxford University Press.

[Brown et al., 1992] Brown, T. X., Tran, M. D., Duong, T., and Thakoor, A. P. (1992). Cascaded VLSI neural network chips: Hardware learning for pattern recognition and classification. Simulation, 58(5):340-347.

[Cauwenberghs, 1993] Cauwenberghs, G. (1993). A fast stochastic error-descent algorithm for supervised learning and optimization. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 244-251, San Mateo, California. Morgan Kaufmann Publishers.

[Cauwenberghs, 1994] Cauwenberghs, G. (1994). Analog VLSI Autonomous Systems for Learning and Optimization. PhD thesis, California Institute of Technology.

[Flower and Jabri, 1993] Flower, B. and Jabri, M. (1993). Summed weight neuron perturbation: An O(n) improvement over weight perturbation. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 212-219, San Mateo, California. Morgan Kaufmann Publishers.

[Jabri and Flower, 1991] Jabri, M. and Flower, B. (1991). Weight perturbation: An optimal architecture and learning technique for analog VLSI feedforward and recurrent multilayer networks. Neural Computation, 3:546-565.

[Kirk et al., 1993] Kirk, D., Kerns, D., Fleischer, K., and Barr, A. (1993). Analog VLSI implementation of gradient descent. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 789-796, San Mateo, California. Morgan Kaufmann Publishers.

[Kushner and Clark, 1978] Kushner, H. and Clark, D. (1978). Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, New York.

[Lippe, 1994] Lippe, D. A. (1994). Parallel, perturbative gradient descent methods for learning in analog VLSI neural networks. Master's thesis, Massachusetts Institute of Technology.

[Rumelhart et al., 1986] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning internal representations by error propagation. In Rumelhart, D. E. and McClelland, J. L., editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, page 318. MIT Press, Cambridge, MA.

[Spall, 1992] Spall, J. C. (1992). Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332-341.
", "award": [], "sourceid": 911, "authors": [{"given_name": "D.", "family_name": "Lippe", "institution": null}, {"given_name": "Joshua", "family_name": "Alspector", "institution": null}]}