{"title": "Manifold Stochastic Dynamics for Bayesian Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 694, "page_last": 700, "abstract": "", "full_text": "Manifold Stochastic Dynamics for Bayesian Learning\n\nMark Zlochin\nDepartment of Computer Science\nTechnion - Israel Institute of Technology\nTechnion City, Haifa 32000, Israel\nzmark@cs.technion.ac.il\n\nYoram Baram\nDepartment of Computer Science\nTechnion - Israel Institute of Technology\nTechnion City, Haifa 32000, Israel\nbaram@cs.technion.ac.il\n\nAbstract\n\nWe propose a new Markov Chain Monte Carlo algorithm which is a generalization of the stochastic dynamics method. The algorithm performs exploration of the state space using its intrinsic geometric structure, facilitating efficient sampling of complex distributions. Applied to Bayesian learning in neural networks, our algorithm was found to perform at least as well as the best state-of-the-art method while consuming considerably less time.\n\n1 Introduction\n\nIn the Bayesian framework predictions are made by integrating the function of interest over the posterior parameter distribution, the latter being the normalized product of the prior distribution and the likelihood. Since in most problems the integrals are too complex to be calculated analytically, approximations are needed.\n\nEarly works in Bayesian learning for nonlinear models [Buntine and Weigend 1991, MacKay 1992] used Gaussian approximations to the posterior parameter distribution. However, the Gaussian approximation may be poor, especially for complex models, because of the multi-modal character of the posterior distribution.\n\nHybrid Monte Carlo (HMC) [Duane et al. 
1987], introduced to the neural network community by [Neal 1996], deals more successfully with multi-modal distributions but is very time consuming. One of the main causes of HMC inefficiency is the anisotropic character of the posterior distribution: the density changes rapidly in some directions while remaining almost constant in others.\n\nWe present a novel algorithm which overcomes the above problem by using the intrinsic geometrical structure of the model space.\n\n2 Hybrid Monte Carlo\n\nMarkov Chain Monte Carlo (MCMC) [Gilks et al. 1996] approximates the value\n\n$$E[a] = \\int a(\\theta) Q(\\theta) \\, d\\theta$$\n\nby the mean\n\n$$\\bar{a} = \\frac{1}{N} \\sum_{t=1}^{N} a(\\theta^{(t)})$$\n\nwhere $\\theta^{(1)}, \\ldots, \\theta^{(N)}$ are successive states of the ergodic Markov chain with invariant distribution $Q(\\theta)$.\n\nIn addition to ergodicity and invariance of $Q(\\theta)$, another quality we would like the Markov chain to have is rapid exploration of the state space. While the first two qualities are rather easily attained, achieving rapid exploration of the state space is often nontrivial.\n\nA state-of-the-art MCMC method, capable of sampling from complex distributions, is Hybrid Monte Carlo [Duane et al. 1987].\n\nThe algorithm is expressed in terms of sampling from the canonical distribution for the state, $q$, of a \"physical\" system, defined in terms of the energy function $E(q)$ (see footnote 1):\n\n$$P(q) \\propto \\exp(-E(q)) \\qquad (1)$$\n\nFootnote 1: Note that any probability density that is nowhere zero can be put in this form, by simply defining $E(q) = -\\log P(q) - \\log Z$, for any convenient $Z$.\n\nTo allow the use of dynamical methods, a \"momentum\" variable, $p$, is introduced, with the same dimensionality as $q$. The canonical distribution over the \"phase space\" is defined to be:\n\n$$P(q, p) \\propto \\exp(-H(q, p)) \\qquad (2)$$\n\nwhere $H(q, p) = E(q) + K(p)$ is the \"Hamiltonian\", which represents the total energy. $K(p)$ is the \"kinetic energy\" due to momentum, defined as\n\n$$K(p) = \\sum_{i=1}^{n} \\frac{p_i^2}{2 m_i} \\qquad (3)$$\n\nwhere $p_i$, $i = 1, \\ldots, n$, are the momentum components and $m_i$ is the \"mass\" associated with the $i$-th component, so that different components can be given different weight.\n\nSampling from the canonical distribution can be done using the stochastic dynamics method [Andersen 1980], in which the task is split into two subtasks: sampling uniformly from values of $q$ and $p$ with a fixed total energy, $H(q, p)$, and sampling states with different values of $H$. The first task is done by simulating the Hamiltonian dynamics of the system:\n\n$$\\frac{dq_i}{d\\tau} = +\\frac{\\partial H}{\\partial p_i} = \\frac{p_i}{m_i}, \\qquad \\frac{dp_i}{d\\tau} = -\\frac{\\partial H}{\\partial q_i} = -\\frac{\\partial E}{\\partial q_i}$$\n\nDifferent energy levels are obtained by occasional stochastic Gibbs sampling [Geman and Geman 1984] of the momentum. Since $q$ and $p$ are independent, $p$ may be updated without reference to $q$ by drawing a value with probability density proportional to $\\exp(-K(p))$, which, in the case of (3), can be easily done, since the $p_i$'s have independent Gaussian distributions.\n\nIn practice, Hamiltonian dynamics cannot be simulated exactly, but can be approximated by some discretization using finite time steps. One common approximation is the leapfrog discretization [Neal 1996].\n\nIn the hybrid Monte Carlo method stochastic dynamics transitions are used to generate candidate states for the Metropolis algorithm [Metropolis et al. 1953]. This eliminates certain drawbacks of the stochastic dynamics, such as systematic errors due to leapfrog discretization, since the Metropolis algorithm ensures that every transition keeps the canonical distribution invariant. However, the empirical comparison between the uncorrected stochastic dynamics and HMC in application to Bayesian learning in neural networks [Neal 1996] showed that with an appropriate discretization stepsize there is no notable difference between the two methods.\n\nA modification proposed in [Horowitz 1991], instead of Gibbs sampling of the momentum, is to replace $p$ each time by $p \\cos(\\theta) + \\zeta \\sin(\\theta)$, where $\\theta$ is a small angle and $\\zeta$ is distributed according to $N(0, I)$. While keeping the canonical distribution invariant, this scheme, called momentum persistence, improves the rate of exploration.\n\n3 Riemannian geometry\n\nA Riemannian manifold [Amari 1997] is a set $\\Theta \\subseteq R^n$ equipped with a metric tensor $G$, which is a positive semidefinite matrix defining the inner product between infinitesimal increments as:\n\n$$\\langle d\\theta_1, d\\theta_2 \\rangle = d\\theta_1^T \\cdot G \\cdot d\\theta_2$$\n\nLet us denote entries of $G$ by $G_{i,j}$ and entries of $G^{-1}$ by $G^{i,j}$. This inner product naturally gives us the norm\n\n$$\\| d\\theta \\|_G^2 = \\langle d\\theta, d\\theta \\rangle = d\\theta^T \\cdot G \\cdot d\\theta.$$\n\nThe Jeffrey prior over $\\Theta$ is defined by the density function:\n\n$$\\pi(\\theta) \\propto \\sqrt{|G(\\theta)|}$$\n\nwhere $|\\cdot|$ denotes the determinant.\n\n3.1 Hamiltonian dynamics over a manifold\n\nFor a Riemannian manifold the dynamics take a more general form than the one described in Section 2.\n\nIf the metric tensor is $G$ and all masses are set to one, then the Hamiltonian is given by:\n\n$$H(q, p) = E(q) + \\frac{1}{2} p^T \\cdot G^{-1} \\cdot p \\qquad (4)$$\n\nThe dynamics are governed by the following set of differential equations [Chavel 1993]:\n\n$$\\frac{d^2 q_i}{d\\tau^2} = -\\sum_{j,k} \\Gamma^i_{j,k} \\dot{q}_j \\dot{q}_k - \\sum_j G^{i,j} \\frac{\\partial E}{\\partial q_j}$$\n\nwhere $\\Gamma^i_{j,k}$ are the Christoffel symbols given by:\n\n$$\\Gamma^i_{j,k} = \\frac{1}{2} \\sum_m G^{i,m} \\left( \\frac{\\partial G_{m,k}}{\\partial q_j} + \\frac{\\partial G_{m,j}}{\\partial q_k} - \\frac{\\partial G_{j,k}}{\\partial q_m} \\right)$$\n\nand $\\dot{q} = \\frac{dq}{d\\tau}$ is related to $p$ by $\\dot{q} = G^{-1} p$. 
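As a minimal sketch of the leapfrog scheme mentioned in Section 2, consider the special case of a constant diagonal metric G = diag(m_1, ..., m_n). For a constant metric the Christoffel symbols vanish, and the Hamiltonian (4) reduces to E(q) plus the kinetic energy (3) with masses m_i. The function names, the quadratic test energy and the step size are illustrative assumptions and are not taken from the paper.

```python
# Illustrative sketch only: leapfrog integration for
# H(q, p) = E(q) + sum_i p_i^2 / (2 m_i), i.e. a constant diagonal metric.
# All names and numerical values here are hypothetical.

def hamiltonian(q, p, energy, masses):
    kinetic = sum(pi * pi / (2.0 * mi) for pi, mi in zip(p, masses))
    return energy(q) + kinetic


def leapfrog(q, p, grad_energy, masses, eps, n_steps=1):
    q, p = list(q), list(p)
    g = grad_energy(q)
    for _ in range(n_steps):
        # Half step in momentum, full step in position, half step in momentum.
        p = [pi - 0.5 * eps * gi for pi, gi in zip(p, g)]
        q = [qi + eps * pi / mi for qi, pi, mi in zip(q, p, masses)]
        g = grad_energy(q)
        p = [pi - 0.5 * eps * gi for pi, gi in zip(p, g)]
    return q, p


def max_energy_error(masses, eps=0.1, n_steps=50):
    # Anisotropic quadratic energy: curvature 1 along q[0], 100 along q[1].
    energy = lambda q: 0.5 * (q[0] ** 2 + 100.0 * q[1] ** 2)
    grad_energy = lambda q: [q[0], 100.0 * q[1]]
    q, p = [1.0, 1.0], [0.0, 0.0]
    h0 = hamiltonian(q, p, energy, masses)
    worst = 0.0
    for _ in range(n_steps):
        q, p = leapfrog(q, p, grad_energy, masses, eps)
        worst = max(worst, abs(hamiltonian(q, p, energy, masses) - h0))
    return worst


err_unit = max_energy_error([1.0, 1.0])       # Euclidean kinetic energy
err_matched = max_energy_error([1.0, 100.0])  # masses matched to the curvature
```

In this toy setting, with the same step size, the masses matched to the curvature give a much smaller energy error than unit masses. A position-dependent metric would additionally require the Christoffel terms, which this sketch omits.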
\n\n3.2 Riemannian geometry of functions\n\nIn regression the log-likelihood is proportional to the empirical error, which is simply the Euclidean distance between the target point, $t$, and the candidate function evaluated over the sample. Therefore, the most natural distance measure between the models is the Euclidean seminorm:\n\n$$d(\\theta^1, \\theta^2)^2 = \\| f_{\\theta^1} - f_{\\theta^2} \\|^2 = \\sum_{i=1}^{l} (f(x_i, \\theta^1) - f(x_i, \\theta^2))^2 \\qquad (5)$$\n\nThe resulting metric tensor is:\n\n$$G = \\sum_{i=1}^{l} \\nabla_\\theta f(x_i, \\theta) \\cdot \\nabla_\\theta f(x_i, \\theta)^T = J^T \\cdot J \\qquad (6)$$\n\nwhere $\\nabla_\\theta$ denotes the gradient and $J = [\\partial f(x_i, \\theta) / \\partial \\theta_j]$ is the Jacobian matrix.\n\n3.3 Bayesian geometry\n\nA Bayesian approach would suggest the inclusion of prior assumptions about the parameters in the manifold geometry.\n\nIf, for example, a priori $\\theta \\sim N(0, 1/\\alpha)$, then the negative log-posterior can be written, up to an additive constant and a constant factor, as:\n\n$$\\beta \\sum_{i=1}^{l} (f(x_i, \\theta) - t_i)^2 + \\alpha \\sum_{k=1}^{n} (\\theta_k - 0)^2$$\n\nwhere $\\beta$ is the inverse noise variance. Therefore, the natural metric in the model space is\n\n$$d(\\theta^1, \\theta^2)^2 = \\beta \\sum_{i=1}^{l} (f(x_i, \\theta^1) - f(x_i, \\theta^2))^2 + \\alpha \\sum_{k=1}^{n} (\\theta^1_k - \\theta^2_k)^2$$\n\nwith the metric tensor:\n\n$$G_B = \\beta \\cdot G + \\alpha \\cdot I = \\hat{J}^T \\cdot \\hat{J} \\qquad (7)$$\n\nwhere $\\hat{J}$ is the \"extended Jacobian\":\n\n$$\\hat{J}_{i,j} = \\begin{cases} \\sqrt{\\beta} \\, \\partial f(x_i, \\theta) / \\partial \\theta_j, & i \\le l \\\\ \\sqrt{\\alpha} \\, \\delta_{i-l,j}, & i > l \\end{cases} \\qquad (8)$$\n\nwhere $\\delta_{i,j}$ is the Kronecker delta.\n\nNote that as $\\alpha \\to 0$, $G_B \\to \\beta G$, hence as the prior becomes vaguer we approach a non-Bayesian paradigm. If, on the other hand, $\\alpha \\to \\infty$ or $\\beta \\cdot G \\to 0$, the Bayesian geometry approaches the Euclidean geometry of the parameter space. These are the qualities that we would like the Bayesian geometry to have: if the prior is \"strong\" in comparison to the likelihood, the exact form of $G$ should be of little importance. 
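The identity in (7) can be checked numerically. The sketch below assumes a hypothetical two-parameter model f(x, theta) = theta[0] * tanh(theta[1] * x) and builds the extended Jacobian by stacking sqrt(beta) * J on top of sqrt(alpha) * I. This stacking is one construction consistent with (7), and the data points and hyperparameter values are made up for illustration.

```python
import math

# Illustrative sketch only: hypothetical model f(x, theta) = theta[0] * tanh(theta[1] * x).

def jacobian(xs, theta):
    # Rows are data points, columns are parameters: J[i][j] = d f(x_i, theta) / d theta_j.
    rows = []
    for x in xs:
        t = math.tanh(theta[1] * x)
        rows.append([t, theta[0] * x * (1.0 - t * t)])
    return rows


def gram(a):
    # Returns a^T a for a matrix stored as a list of rows.
    n = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(n)] for i in range(n)]


def bayesian_metric(xs, theta, alpha, beta):
    # G_B = beta * J^T J + alpha * I, as in equation (7).
    g = gram(jacobian(xs, theta))
    n = len(theta)
    return [[beta * g[i][j] + (alpha if i == j else 0.0) for j in range(n)]
            for i in range(n)]


def extended_jacobian(xs, theta, alpha, beta):
    # Stack sqrt(beta) * J on top of sqrt(alpha) * I (an assumed construction).
    n = len(theta)
    top = [[math.sqrt(beta) * v for v in row] for row in jacobian(xs, theta)]
    bottom = [[math.sqrt(alpha) if i == j else 0.0 for j in range(n)] for i in range(n)]
    return top + bottom


xs = [-1.0, -0.3, 0.4, 1.2]
theta = [0.7, -1.1]
g_b = bayesian_metric(xs, theta, alpha=1.0, beta=400.0)
g_b_from_ext = gram(extended_jacobian(xs, theta, alpha=1.0, beta=400.0))
```

Both routes give the same matrix up to rounding, and dividing G_B by a very large alpha leaves something close to the identity, which matches the strong-prior limit described above.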
\n\nThe definitions above can be applied to any log-concave prior distribution, with the inverse Hessian of the log-prior, $(\\nabla\\nabla \\log p(\\theta))^{-1}$, replacing $\\alpha I$ in (7). The framework is not restricted to regression. For a general distribution class it is natural to use the Fisher information matrix, $I$, as a metric tensor [Amari 1997]. The Bayesian metric tensor then becomes:\n\n$$G_B = I + (\\nabla\\nabla \\log p(\\theta))^{-1} \\qquad (9)$$\n\n4 Manifold Stochastic Dynamics\n\nAs mentioned before, the energy landscape in many regression problems is anisotropic. This degrades the performance of HMC in two aspects:\n\n\u2022 The dynamics may not be optimal for efficient exploration of the posterior distribution, as suggested by the studies of Gaussian diffusions [Hwang et al. 1993].\n\n\u2022 The resulting differential equations are stiff [Gear 1971], leading to large discretization errors, which in turn necessitate small time steps, implying that the computational burden is high.\n\nBoth of these problems disappear if, instead of the Euclidean Hamiltonian dynamics used in HMC, we simulate dynamics over the manifold equipped with the metric tensor $G_B$ proposed in the previous section.\n\nIn the context of regression, from the definition $G_B = \\hat{J}^T \\cdot \\hat{J}$ we obtain an alternative equation for $\\frac{d^2 q}{d\\tau^2}$, in matrix form:\n\n$$\\frac{d^2 q}{d\\tau^2} = -G_B^{-1} \\left( \\nabla E + \\hat{J}^T \\frac{\\partial \\hat{J}}{\\partial \\tau} \\dot{q} \\right) \\qquad (10)$$\n\nIn the canonical distribution $P(q, p) \\propto \\exp(-H(q, p))$ the conditional distribution of $p$ given $q$ is a zero-mean Gaussian with the covariance matrix $G_B(q)$, and the marginal distribution over $q$ is proportional to $\\exp(-E(q)) \\pi(q)$. This is equivalent to multiplying the prior by the Jeffrey prior (see footnote 2). 
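The paper does not say how a zero-mean Gaussian with covariance G_B is drawn in practice. One possible approach, sketched here as an assumption, uses the factorization of G_B into the extended Jacobian transposed times itself: if z is a standard normal vector with one entry per row of the extended Jacobian, then the transposed extended Jacobian applied to z has covariance G_B. The matrix below is a made-up example and all names are hypothetical.

```python
import random

# Illustrative sketch only: draw p ~ N(0, G_B) as p = J_ext^T z with z ~ N(0, I),
# which has covariance J_ext^T J_ext = G_B. The matrix below is a made-up example.

def sample_momentum(j_ext, rng):
    z = [rng.gauss(0.0, 1.0) for _ in j_ext]  # one standard normal per row
    n = len(j_ext[0])
    return [sum(row[k] * zi for row, zi in zip(j_ext, z)) for k in range(n)]


def gram(a):
    # Returns a^T a for a matrix stored as a list of rows.
    n = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(n)] for i in range(n)]


def empirical_covariance(samples):
    # Second-moment estimate; the target distribution has mean zero.
    n = len(samples[0])
    m = float(len(samples))
    return [[sum(s[i] * s[j] for s in samples) / m for j in range(n)] for i in range(n)]


j_ext = [[2.0, 0.5], [0.0, 1.5], [1.0, 0.0], [0.0, 1.0]]  # hypothetical 4 x 2 matrix
rng = random.Random(0)
samples = [sample_momentum(j_ext, rng) for _ in range(20000)]
g_b = gram(j_ext)
cov = empirical_covariance(samples)
```

This avoids an explicit matrix square root of G_B, at the cost of drawing more random numbers than there are parameters.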
\n\nThe sampling from the canonical distribution is two-fold:\n\n\u2022 Simulate the Hamiltonian dynamics of Section 3.1 for one time step using the leapfrog discretization.\n\n\u2022 Replace $p$ using momentum persistence. Unlike the HMC case, the momentum perturbation $\\zeta$ is distributed according to $N(0, G_B)$.\n\nThe actual weights multiplying the matrices $I$ and $G$ in (7) may be chosen to be different from the specified $\\alpha$ and $\\beta$, so as to improve numerical stability.\n\n5 Empirical comparison\n\n5.1 Robot arm problem\n\nWe compared the performance of the Manifold Stochastic Dynamics (MSD) algorithm with the standard HMC. The comparison was carried out using MacKay's robot arm problem, which is a common benchmark for Bayesian methods in neural networks [MacKay 1992, Neal 1996].\n\nThe robot arm problem is concerned with the mapping:\n\n$$y_1 = 2.0 \\cos x_1 + 1.3 \\cos(x_1 + x_2) + e_1, \\qquad y_2 = 2.0 \\sin x_1 + 1.3 \\sin(x_1 + x_2) + e_2$$\n\nwhere $e_1, e_2$ are independent Gaussian noise variables of standard deviation 0.05. The dataset used by Neal and MacKay contained 200 examples in the training set and 400 in the test set.\n\nFootnote 2: In fact, since the actual prior over the weights is unknown, a truly Bayesian approach would be to use a non-informative prior such as $\\pi(q)$. In this paper we kept the modified prior which is the product of $\\pi(q)$ and a zero-mean Gaussian. 
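The robot arm mapping of Section 5.1 can be simulated with a few lines of code. In the sketch below only the mapping, the noise standard deviation and the dataset sizes come from the text. The input ranges are placeholders chosen for illustration and are not the ones used in the benchmark, and the function names are hypothetical.

```python
import math
import random

# Illustrative sketch only: the input ranges are placeholders, not the benchmark's.
# Only the mapping, the noise level and the dataset sizes come from the text.

def robot_arm(x1, x2):
    y1 = 2.0 * math.cos(x1) + 1.3 * math.cos(x1 + x2)
    y2 = 2.0 * math.sin(x1) + 1.3 * math.sin(x1 + x2)
    return y1, y2


def make_dataset(n, rng, noise_sd=0.05, x_range=(0.0, math.pi)):
    data = []
    for _ in range(n):
        x1 = rng.uniform(*x_range)
        x2 = rng.uniform(*x_range)
        y1, y2 = robot_arm(x1, x2)
        noisy = (y1 + rng.gauss(0.0, noise_sd), y2 + rng.gauss(0.0, noise_sd))
        data.append(((x1, x2), noisy))
    return data


rng = random.Random(0)
train = make_dataset(200, rng)
test = make_dataset(400, rng)
beta = 1.0 / 0.05 ** 2  # inverse noise variance implied by the noise level
```

The last line only restates the definition of the inverse noise variance from Section 3.3 for a noise standard deviation of 0.05.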
\n\nFigure 1: Average (over the 10 runs) autocorrelation of input-to-hidden (left) and hidden-to-output (right) weights for HMC with 100 and 30 leapfrog steps per iteration and MSD with a single leapfrog step per iteration. The horizontal axis gives the lags, measured in number of iterations.\n\nWe used a neural network with two input units, one hidden layer containing 8 tanh units and two linear output units. The hyperparameter $\\beta$ was set to its correct value of 400 and $\\alpha$ was chosen to be 1.\n\n5.2 Algorithms\n\nWe compared MSD with two versions of HMC, with 30 and with 100 leapfrog steps per iteration, henceforth referred to as HMC30 and HMC100. MSD was run with a single leapfrog step per iteration. In all three algorithms momentum was resampled using persistence with $\\cos(\\theta) = 0.95$.\n\nA single iteration of HMC100 required about $4.8 \\cdot 10^6$ floating point operations (flops), HMC30 required $1.4 \\cdot 10^6$ flops and MSD required $0.5 \\cdot 10^6$ flops. Hence the computational load of MSD was about one third of that of HMC30 and about 10 times lower than that of HMC100.\n\nThe discretization stepsize for HMC was chosen so as to keep the rejection rate below 5%. An equivalent criterion of average error in the Hamiltonian around 0.05 was used for the MSD.\n\nAll three sampling algorithms were run 10 times, each time for 3000 iterations, with the first 1000 samples discarded in order to allow the algorithms to reach the regions of high probability. 
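The autocorrelation plotted in Figure 1 can be estimated from a recorded trace of a single weight. The following is a minimal sketch, assuming one scalar value per iteration and the burn-in of 1000 iterations described above. The two toy traces are made up to contrast slow and fast mixing and are not results from the paper.

```python
import random

# Illustrative sketch only: sample autocorrelation of a scalar trace (for example one
# network weight recorded once per iteration) after discarding a burn-in period.

def autocorrelation(trace, max_lag, burn_in=0):
    x = trace[burn_in:]
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    var = sum(c * c for c in centered) / n
    result = []
    for lag in range(max_lag + 1):
        cov = sum(centered[t] * centered[t + lag] for t in range(n - lag)) / n
        result.append(cov / var)
    return result


# Toy traces: one that changes sign only every 20 iterations, one that is independent noise.
rng = random.Random(0)
slow = [1.0 if (t // 20) % 2 == 0 else -1.0 for t in range(3000)]
fast = [rng.gauss(0.0, 1.0) for _ in range(3000)]
rho_slow = autocorrelation(slow, 50, burn_in=1000)
rho_fast = autocorrelation(fast, 50, burn_in=1000)
```

A sampler that explores the state space quickly should give a curve that drops towards zero within a few lags, as the independent trace does here.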
\n\n5.3 Results\n\nOne appropriate measure for the rate of state space exploration is the autocorrelation of the weights [Neal 1996]. As shown in Figure 1, the behavior of MSD was clearly superior to that of HMC.\n\nAnother value of interest is the total squared error over the test set. The predictions for the test set were made as follows. A subsample of 100 parameter vectors was generated by taking every twentieth sample vector starting from 1001 and on. The predicted value was the average over the empirical function distribution of this subsample.\n\nThe total squared errors, normalized with respect to the variance on the test cases, have the following statistics (over the 10 runs):\n\nAlgorithm    Average    Standard deviation\nHMC30        1.314      0.074\nHMC100       1.167      0.044\nMSD          1.161      0.023\n\nThe average error of HMC30 is high, indicating that the algorithm failed to reach the region of high probability. The errors of HMC100 and MSD are comparable, but the standard deviation for MSD is about half of that for HMC100, meaning that the estimate obtained using MSD is more reliable.\n\n6 Conclusion\n\nWe have described a new algorithm for efficient sampling from complex distributions such as those appearing in Bayesian learning with non-linear models. The empirical comparison shows that our algorithm achieves results superior to the best achieved by existing algorithms in considerably smaller computation time.\n\nReferences\n\n[Amari 1997] Amari S., \"Natural Gradient Works Efficiently in Learning\", Neural Computation, vol. 10, pp. 251-276.\n\n[Andersen 1980] Andersen H.C., \"Molecular dynamics simulations at constant pressure and/or temperature\", Journal of Chemical Physics, vol. 3, pp. 589-603.\n\n[Buntine and Weigend 1991] \"Bayesian back-propagation\", Complex Systems, vol. 5, pp. 603-643.\n\n[Chavel 1993] Chavel I., Riemannian Geometry: A Modern Introduction, University Press, Cambridge.\n\n[Duane et al. 1987] \"Hybrid Monte Carlo\", Physics Letters B, vol. 195, pp. 216-222.\n\n[Gear 1971] Gear C.W., Numerical initial value problems in ordinary differential equations, Prentice Hall.\n\n[Geman and Geman 1984] Geman S., Geman D., \"Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images\", IEEE Trans., PAMI-6, 721-741.\n\n[Gilks et al. 1996] Gilks W.R., Richardson S. and Spiegelhalter D.J., Markov Chain Monte Carlo in Practice, Chapman & Hall.\n\n[Hwang et al. 1993] Hwang C.-R., Hwang-Ma S.-Y. and Shen S.-J., \"Accelerating Gaussian diffusions\", Ann. Appl. Prob., vol. 3, 897-913.\n\n[Horowitz 1991] Horowitz A.M., \"A generalized guided Monte Carlo algorithm\", Physics Letters B, vol. 268, pp. 247-252.\n\n[MacKay 1992] MacKay D.J.C., Bayesian Methods for Adaptive Models, Ph.D. thesis, California Institute of Technology.\n\n[Metropolis et al. 1953] Metropolis N., Rosenbluth A.W., Rosenbluth M.N., Teller A.H. and Teller E., \"Equation of State Calculations by Fast Computing Machines\", Journal of Chemical Physics, vol. 21, pp. 1087-1092.\n\n[Neal 1996] Neal R.M., Bayesian Learning for Neural Networks, Springer 1996.", "award": [], "sourceid": 1757, "authors": [{"given_name": "Mark", "family_name": "Zlochin", "institution": null}, {"given_name": "Yoram", "family_name": "Baram", "institution": null}]}