{"title": "Weight Space Probability Densities in Stochastic Learning: II. Transients and Basin Hopping Times", "book": "Advances in Neural Information Processing Systems", "page_first": 507, "page_last": 514, "abstract": null, "full_text": "Weight Space Probability Densities in Stochastic Learning: II. Transients and Basin Hopping Times \n\nGenevieve B. Orr and Todd K. Leen \n\nDepartment of Computer Science and Engineering \nOregon Graduate Institute of Science & Technology \n19600 N.W. von Neumann Drive \nBeaverton, OR 97006-1999 \n\nAbstract \n\nIn stochastic learning, weights are random variables whose time evolution is governed by a Markov process. At each time-step, n, the weights can be described by a probability density function P(w, n). We summarize the theory of the time evolution of P, and give graphical examples of the time evolution that contrast the behavior of stochastic learning with true gradient descent (batch learning). Finally, we use the formalism to obtain predictions of the time required for noise-induced hopping between basins of different optima. We compare the theoretical predictions with simulations of large ensembles of networks for simple problems in supervised and unsupervised learning. \n\n1 Weight-Space Probability Densities \n\nDespite the recent application of convergence theorems from stochastic approximation theory to neural network learning (Oja 1982, White 1989), there remain outstanding questions about the search dynamics in stochastic learning. 
For example, the convergence theorems do not tell us to which of several optima the algorithm is likely to converge¹. Also, while it is widely recognized that the intrinsic noise in the weight update can move the system out of sub-optimal local minima (for a graphical example, see Darken and Moody 1991), there have been no theoretical predictions of the time required to escape from local optima, or of its dependence on learning rates. \n\nIn order to more fully understand the dynamics of stochastic search, we study the weight-space probability density and its time evolution. In this paper we summarize a theoretical framework that describes this time evolution. We graphically portray the motion of the density for examples that contrast stochastic and batch learning. Finally we use the theory to predict the statistical distribution of times required for escape from local optima. We compare the theoretical results with simulations for simple examples in supervised and unsupervised learning. \n\n2 Stochastic Learning and Noisy Maps \n\n2.1 Motion of the Probability Density \n\nWe consider stochastic learning algorithms of the form \n\nw(n+1) = w(n) + μ H[w(n), x(n)]   (1) \n\nwhere w(n) ∈ R^m is the weight, x(n) is the data exemplar input to the algorithm at time-step n, μ is the learning rate, and H[...] ∈ R^m is the weight update function. The exemplars x(n) can be either inputs or, in the case of supervised learning, input/target pairs. We assume that the x(n) are i.i.d. with density p(x). Angled brackets ⟨...⟩_x denote averaging over this density. In what follows, the learning rate will be held constant. \n\nThe learning algorithm (1) is a noisy map on w. The weights are thus random variables described by the probability density function P(w, n). 
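The noisy map (1) is easy to visualize with a small ensemble simulation. The sketch below uses an illustrative scalar model (quadratic per-example cost, Gaussian exemplar density, and hypothetical parameter values, none of which come from the paper) and estimates P(w, n) from a histogram over the ensemble:

```python
import numpy as np

# Ensemble simulation of the noisy map (1): each network sees its own i.i.d.
# exemplar stream; a histogram over the ensemble estimates P(w, n).
# The update H (gradient of the per-example cost (w - x)^2 / 2) and the
# Gaussian exemplar density are illustrative assumptions.
rng = np.random.default_rng(0)

def H(w, x):
    return -(w - x)        # averaging over x gives the batch gradient -(w - <x>)

mu = 0.05                  # constant learning rate
n_nets, n_steps = 10000, 200
w = np.full(n_nets, 2.0)   # all networks initialized at w0 = 2

for n in range(n_steps):
    x = rng.normal(0.0, 1.0, size=n_nets)   # i.i.d. exemplars, one per network
    w = w + mu * H(w, x)                    # eq. (1)

# Empirical estimate of P(w, n) at n = n_steps.
density, edges = np.histogram(w, bins=50, density=True)
print(w.mean(), w.std())
```

Under this model the ensemble mean decays like (1 − μ)^n, while the equilibrium spread scales like sqrt(μ σ²/(2 − μ)): even at the optimum the density retains a finite width, in contrast to the point mass that batch gradient descent would produce.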
The time evolution of this density is given by the Kolmogorov equation \n\nP(w, n+1) = ∫ dw' P(w', n) W(w' → w)   (2) \n\nwhere the single time-step transition probability is given by (Leen and Orr 1992, Leen and Moody 1993) \n\nW(w' → w) = ⟨ δ( w − w' − μ H[w', x] ) ⟩_x   (3) \n\nand δ(...) is the Dirac delta function. \n\nThe Kolmogorov equation can be recast as a differential-difference equation by expanding the transition probability (3) as a power series in μ. This gives a Kramers-Moyal expansion (Leen and Orr 1992, Leen and Moody 1993) \n\nP(w, n+1) − P(w, n) = Σ_{i=1}^∞ [(−μ)^i / i!] Σ_{j1,...,ji=1}^m ∂^i/∂w_{j1}...∂w_{ji} [ ⟨ H_{j1}[w, x] ... H_{ji}[w, x] ⟩_x P(w, n) ]   (4) \n\nwhere w_j and H_j are the j-th components of the weight and the weight update, respectively. \n\nTruncating (4) to second order in μ leaves a Fokker-Planck equation² that is valid for small |μH|. The drift coefficient ⟨H⟩_x is simply the average update. It is important to note that the diffusion coefficients ⟨H_i H_j⟩_x can be strongly dependent on location in the weight-space. This spatial dependence influences both equilibria and transient phenomena. In section 3.1 we will use both the Kolmogorov equation (2) and the Fokker-Planck equation to track the time evolution of network ensemble densities. \n\n¹However, Kushner (1987) has proved convergence to global optima for stochastic approximation algorithms with added Gaussian noise subject to logarithmic annealing schedules. \n\n2.2 First Passage Times \n\nOur discussion of basin hopping will use the notion of the first passage time (Gardiner, 1990): the time required for a network initialized at w0 to first pass into an ε-neighborhood D of a global or local optimum w* (see Figure 1). The first passage time is a random variable. 
Its distribution function P(n; w0) is the probability that a network initialized at w0 makes its first passage into D at the n-th iteration of the learning rule. \n\nFigure 1: Sample search path. \n\nTo arrive at an expression for P(n; w0), we first examine the probability of passing from the initial weight w0 to the weight w after n iterations. This probability can be expressed as \n\nP(w, n | w0, 0) = ∫ dw' P(w, n | w', 1) W(w0 → w').   (5) \n\nSubstituting the single time-step transition probability (3) into the above expression, integrating over w', and making use of the time-shift invariance of the system³, we find \n\nP(w, n | w0, 0) = ⟨ P(w, n−1 | w0 + μ H[w0, x], 0) ⟩_x .   (6) \n\nNext, let G(n; w0) denote the probability that a network initialized at w0 has not passed into the region D by the n-th iteration. We obtain G(n; w0) by integrating P(w, n | w0, 0) over weights w not in D: \n\nG(n; w0) = ∫_{D^c} dw P(w, n | w0, 0)   (7) \n\nwhere D^c is the complement of D. Substituting equation (6) into (7) and integrating over w, we obtain the recursion \n\nG(n; w0) = ⟨ G(n−1; w0 + μ H[w0, x]) ⟩_x .   (8) \n\nBefore any learning takes place, none of the networks in the ensemble have entered D. Thus the initial condition for G is \n\nG(0; w0) = 1,  w0 ∈ D^c.   (9) \n\nNetworks that have entered D are removed from the ensemble (i.e. ∂D is an absorbing boundary). Thus G satisfies the boundary condition \n\nG(n; w0) = 0,  w0 ∈ D.   (10) \n\nFinally, the probability that the network has not passed into the region D on or before iteration n−1, minus the probability that the network has not passed into D on or before iteration n, is simply the probability that the network passes into D exactly at iteration n. This is just the probability for first passage into D at time-step n. Thus \n\nP(n; w0) = G(n−1; w0) − G(n; w0).   (11) \n\nFinally, the recursion (8) for G can be expanded in a power series in μ to obtain the backward Kramers-Moyal equation \n\nG(n; w) − G(n−1; w) = Σ_{i=1}^∞ (μ^i / i!) Σ_{j1,...,ji=1}^m ⟨ H_{j1}[w, x] ... H_{ji}[w, x] ⟩_x ∂^i G(n−1; w) / ∂w_{j1}...∂w_{ji}.   (12) \n\nTruncation to second order in μ results in the backward Fokker-Planck equation. In section 3.2 we will use both the full recursion (8) and the Fokker-Planck approximation to (12) to predict basin hopping times in stochastic learning. \n\n²See (Ritter and Schulten 1988) and (Radons et al. 1990) for independent derivations. \n³With our assumptions of a constant learning rate μ and stationary sample density p(x), the system is time-shift invariant. Mathematically stated, P(w, n | w', m) = P(w, n−1 | w', m−1). \n\n3 Backpropagation and Competitive Nets \n\nWe apply the above formalism to study the time evolution of the probability density for simple backpropagation and competitive learning problems. We give graphical examples of the time evolution of the weight space density, and calculate times for passage from local to global optima. \n\n3.1 Densities for the XOR Problem \n\nFeed-forward networks trained to solve the XOR problem provide an example of supervised learning with well-characterized local optima (Lisboa and Perantonis, 1991). We use a 2-input, 2-hidden, 1-output network (9 weights) trained by stochastic gradient descent on the cross-entropy error function in Lisboa and Perantonis (1991). For computational tractability, we reduce the state space dimension by constraining the search to one- or two-dimensional subspaces of the weight space. 
To provide global optima at finite weight values, the output targets are set to δ and 1 − δ, with δ << 1. \n\nFigure 2a shows the cost function evaluated along a line in the weight space. This line, parameterized by v, is chosen to pass through a global optimum at v = 0 and a local optimum at v = 1.0. In this one-dimensional slice, another local optimum occurs at v = 1.24. Figure 2b shows the evolution of P(v, n) obtained by numerical integration of the Fokker-Planck equation. Figure 2c shows the evolution of P(v, n) estimated by simulation of 10,000 networks, each receiving a different random sequence of the four input/target patterns. Initially the density is peaked about the local optimum at v = 1.24. At intermediate times, there is a spike of density at the local optimum at v = 1.0. This spike is narrow since the diffusion coefficient is small there. At late times the density collects at the global optimum. We note that for the learning rate used here, the local optimum at v = 1.24 is asymptotically stable under true gradient descent, and no escape would occur. \n\nFigure 2: a) XOR cost function. b) Predicted density. c) Simulated density. \n\nFigure 3 shows a series of snapshots of the density superimposed on the cost function for a 2-D slice through the XOR weight space. The first frame shows the weight evolution under true gradient descent. The weights are initialized at the upper right-hand corner of the frame, travel down the gradient, and settle into a local optimum. The remaining frames show the evolution of the density calculated by direct integration of the Kolmogorov equation (2). Here one sees an early spreading of the initial density and the ultimate concentration at the global optimum. 
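The direct integration of the Kolmogorov equation (2) used for Figure 3 can be sketched in one dimension: hold P(w, n) on a grid and push its mass forward through the sampled maps w → w + μH[w, x], which averages the transition kernel (3) over a finite exemplar sample. The cost, the two-component exemplar mixture, and all parameters below are assumptions for illustration, not the XOR problem:

```python
import numpy as np

# Grid-based integration of the Kolmogorov equation (2): the density at step
# n+1 is the average over exemplars x of the push-forward of P(., n) under the
# map w -> w + mu*H[w, x]. Model and parameters are illustrative assumptions.
rng = np.random.default_rng(3)

mu = 0.05
grid = np.linspace(-2.0, 2.0, 401)
dw = grid[1] - grid[0]

def H(w, x):
    return -(w - x)          # assumed update: gradient of (w - x)^2 / 2

# Initial density: sharply peaked about w = 1.5.
P = np.exp(-0.5 * ((grid - 1.5) / 0.05) ** 2)
P /= P.sum() * dw

# Finite sample standing in for the exemplar density p(x): mixture at x = +/-1.
xs = rng.choice([-1.0, 1.0], size=200) + rng.normal(0.0, 0.1, size=200)

for n in range(200):
    P_new = np.zeros_like(P)
    for x in xs:
        w_mapped = grid + mu * H(grid, x)
        # Deposit each cell's mass at its image cell (nearest-cell assignment).
        idx = np.clip(np.round((w_mapped - grid[0]) / dw).astype(int), 0, len(grid) - 1)
        np.add.at(P_new, idx, P)
    P = P_new / len(xs)

P /= P.sum() * dw            # guard against accumulated round-off
mean = float((grid * P).sum() * dw)
std = float(np.sqrt(((grid - mean) ** 2 * P).sum() * dw))
print(mean, std)
```

Because the diffusion coefficient in this toy model is constant, the density simply drifts toward the mean of the exemplar sample and settles into a finite-width equilibrium; it is the spatially varying ⟨H_i H_j⟩ of the XOR example that produces features such as the narrow spike of Figure 2b.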
\n\n3.2 Basin Hopping Times \n\nThe above examples graphically illustrate the intuitive notion that the noise inherent in stochastic learning can move the system out of local optima⁴. In this section we calculate the statistical distribution of times required to pass between basins. \n\n⁴The reader should not infer from these examples that stochastic update necessarily converges to global optima. It is straightforward to construct examples for which stochastic learning converges to local optima with probability one. \n\nFigure 3: Weight evolution for 2-D XOR (snapshots at time-steps 0, 10, 28, 34, and 100). The density is superimposed on top of the cost function. The first frame shows the density using true gradient descent for all 100 timesteps. The remaining frames show the density for selected timesteps using stochastic descent. \n\n3.2.1 Basin Hopping in Back-propagation \n\nFor the search direction used in the example of Figure 2, we calculated the distribution of times required for networks initialized at v = 1.2 to first pass within ε = 0.1 of the global optimum at v = 0.0. For this example we numerically integrated the backward Fokker-Planck equation. We verified the theoretical predictions by obtaining first passage times from an ensemble of 10,000 networks initialized at v = 1.2. See Figure 4. For this example the agreement is good at the small learning rate (μ = 0.025) used, but degrades for larger μ as higher order terms in the expansion (12) become significant. \n\nFigure 4: XOR problem. Simulated (histogram) and theoretical (solid line) distributions of first passage times for the cost function of Figure 2a. 
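The ensemble experiment behind histograms like Figure 4 can be sketched directly. The code below substitutes an illustrative one-dimensional double-well cost f(w) = (w² − 1)² + 0.3 w for the XOR slice; the additive-noise model for H, the clipping safeguard, and all parameter values are assumptions of the sketch:

```python
import numpy as np

# Monte Carlo first passage times: an ensemble starts near the shallow (local)
# optimum of f(w) = (w^2 - 1)^2 + 0.3*w, and we record when each network first
# enters the eps-neighborhood D of the deeper optimum near w = -1.04.
# H[w, x] = -f'(w) + x stands in for a stochastic gradient; all parameters
# are illustrative assumptions.
rng = np.random.default_rng(1)

def f_prime(w):
    return 4.0 * w * (w * w - 1.0) + 0.3

mu, sigma = 0.1, 2.0            # learning rate and data-noise scale (assumed)
n_nets, n_max = 2000, 3000      # ensemble size and simulation horizon
w_star, eps = -1.0, 0.15        # absorbing neighborhood D about the deeper optimum
w = np.full(n_nets, 1.0)        # initialize near the shallow optimum (w ~ +0.96)
fpt = np.full(n_nets, -1)       # first passage times; -1 = not yet absorbed

for n in range(1, n_max + 1):
    active = fpt < 0
    x = rng.normal(0.0, sigma, size=int(active.sum()))
    w[active] += mu * (-f_prime(w[active]) + x)       # eq. (1) with assumed H
    np.clip(w, -2.0, 2.0, out=w)                      # numerical safeguard
    hit = active & (np.abs(w - w_star) < eps)
    fpt[hit] = n                                      # absorbed at iteration n

escaped = fpt > 0
# Empirical survival probability G(n; w0) on a coarse (10-step) time grid; by
# (11) the first-passage distribution is the decrement of G.
G = np.array([((fpt < 0) | (fpt > n)).mean() for n in range(0, n_max + 1, 10)])
P_first = -np.diff(G)
print(escaped.mean(), int(np.median(fpt[escaped])))
```

For this choice of μ the shallow optimum is asymptotically stable under the averaged (batch) dynamics, so every recorded passage is noise-induced, mirroring the basin hopping measured in the simulated histograms.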
\n\nWhen the Fokker-Planck approximation fails, results obtained from the exact expression (8) are in excellent agreement with experimental results. One such example is shown in Figure 5. Similar to Figure 2a, we have chosen a one-dimensional subspace of the XOR weight space (but in a different direction). Here, the Fokker-Planck solution is quite poor because the steepness of the cost function results in large contributions from higher order terms in (12). As one would expect, the exact solution obtained using (8) agrees well with the simulations. \n\nFigure 5: Second 1-D XOR example. a) Cost function. b) Simulated (histogram) and theoretical (lines) distributions of first passage times. \n\n3.2.2 Basin Hopping in Competitive Learning \n\nAs a final example, we consider competitive learning with two 2-D weight vectors symmetrically placed about the center of a rectangle. Inputs are uniformly distributed in a rectangle of width 1.1 and height 1. This configuration has both global and local optima. \n\nFigure 6a shows a sample path with weights started near the local optimum (crosses) and switching to hover around the global optimum. The measured and predicted (from numerical integration of (8)) distributions of times required to first pass within a distance ε = 0.1 of the global optimum are shown in Figure 6b. \n\nFigure 6: Competitive learning. a) Data (small dots) and sample weight path (large dots). b) First passage times. \n\n4 Discussion \n\nThe dynamics of the time evolution of the weight space probability density provides a direct handle on the performance of learning algorithms. This paper has focused on transient phenomena in stochastic learning with constant learning rate. The same theoretical framework can be used to analyze the asymptotic properties of stochastic search with decreasing learning rates, and to analyze equilibrium densities. For a discussion of the latter, see the companion paper in this volume (Leen and Moody 1993). \n\nAcknowledgements \n\nThis work was supported under grants N00014-90-J-1349 and N00014-91-J-1482 from the Office of Naval Research. \n\nReferences \n\nE. Oja (1982), A simplified neuron model as a principal component analyzer. J. Math. Biology, 15:267-273. \n\nHalbert White (1989), Learning in artificial neural networks: A statistical perspective. Neural Computation, 1:425-464. \n\nH.J. Kushner (1987), Asymptotic global behavior for stochastic approximation and diffusions with slowly decreasing noise effects: Global minimization via Monte Carlo. SIAM J. Appl. Math., 47:169-185. \n\nChristian Darken and John Moody (1991), Note on learning rate schedules for stochastic optimization. In Advances in Neural Information Processing Systems 3, San Mateo, CA, Morgan Kaufmann. \n\nTodd K. Leen and Genevieve B. Orr (1992), Weight-space probability densities and convergence times for stochastic learning. 
In International Joint Conference on Neural Networks, pages IV-158-164. IEEE. \n\nTodd K. Leen and John Moody (1993), Probability Densities in Stochastic Learning: Dynamics and Equilibria. In Giles, C.L., Hanson, S.J., and Cowan, J.D. (eds.), Advances in Neural Information Processing Systems 5. San Mateo, CA: Morgan Kaufmann Publishers. \n\nH. Ritter and K. Schulten (1988), Convergence properties of Kohonen's topology conserving maps: Fluctuations, stability and dimension selection. Biol. Cybern., 60:59-71. \n\nG. Radons, H.G. Schuster and D. Werner (1990), Fokker-Planck description of learning in backpropagation networks. International Neural Network Conference, Paris, II 993-996, Kluwer Academic Publishers. \n\nC.W. Gardiner (1990), Handbook of Stochastic Methods, 2nd Ed. Springer-Verlag, Berlin. \n\nP. Lisboa and S. Perantonis (1991), Complete solution of the local minima in the XOR problem. Network: Computation in Neural Systems, 2:119. \n", "award": [], "sourceid": 637, "authors": [{"given_name": "Genevieve", "family_name": "Orr", "institution": null}, {"given_name": "Todd", "family_name": "Leen", "institution": null}]}