{"title": "Global Optimisation of Neural Network Models via Sequential Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 410, "page_last": 416, "abstract": null, "full_text": "Global Optimisation of Neural Network Models via Sequential Sampling \n\nJo\u00e3o FG de Freitas \nCambridge University \nEngineering Department \nCambridge CB2 1PZ England \njfgf@eng.cam.ac.uk \n[Corresponding author] \n\nMahesan Niranjan \nCambridge University \nEngineering Department \nCambridge CB2 1PZ England \nniranjan@eng.cam.ac.uk \n\nArnaud Doucet \nCambridge University \nEngineering Department \nCambridge CB2 1PZ England \nad2@eng.cam.ac.uk \n\nAndrew H Gee \nCambridge University \nEngineering Department \nCambridge CB2 1PZ England \nahg@eng.cam.ac.uk \n\nAbstract \n\nWe propose a novel strategy for training neural networks using sequential sampling-importance resampling algorithms. This global optimisation strategy allows us to learn the probability distribution of the network weights in a sequential framework. It is well suited to applications involving on-line, nonlinear, non-Gaussian or non-stationary signal processing. \n\n1  INTRODUCTION \n\nThis paper addresses sequential training of neural networks using powerful sampling techniques. Sequential techniques are important in many applications of neural networks involving real-time signal processing, where data arrival is inherently sequential. Furthermore, one might wish to adopt a sequential training strategy to deal with non-stationarity in signals, so that information from the recent past is lent more credence than information from the distant past. One way to sequentially estimate neural network models is to use a state space formulation and the extended Kalman filter (EKF) (Singhal and Wu 1988, de Freitas, Niranjan and Gee 1998).  
This involves local linearisation of the output equation, which can be easily performed, since we only need the derivatives of the output with respect to the unknown parameters. This approach has been employed by several authors, including ourselves. \n\nHowever, local linearisation leading to the EKF algorithm is a gross simplification of the probability densities involved. Nonlinearity of the output model induces multimodality of the resulting distributions. Gaussian approximation of these densities will lose important details. The approach we adopt in this paper is one of sampling. In particular, we discuss the use of 'sampling-importance resampling' and 'sequential importance sampling' algorithms, also known as particle filters (Gordon, Salmond and Smith 1993, Pitt and Shephard 1997), to train multi-layer neural networks. \n\n2  STATE SPACE NEURAL NETWORK MODELLING \n\nWe start from a state space representation to model the neural network's evolution in time. A transition equation describes the evolution of the network weights, while a measurements equation describes the nonlinear relation between the inputs and outputs of a particular physical process, as follows: \n\n$w_{k+1} = w_k + d_k$  (1) \n\n$y_k = g(w_k, x_k) + v_k$  (2) \n\nwhere ($y_k \\in \\mathbb{R}^o$) denotes the output measurements, ($x_k \\in \\mathbb{R}^d$) the input measurements and ($w_k \\in \\mathbb{R}^m$) the neural network weights. The measurements nonlinear mapping g(.) is approximated by a multi-layer perceptron (MLP). The measurements are assumed to be corrupted by noise $v_k$. In the sequential Monte Carlo framework, the probability distribution of the noise is specified by the user. In our examples we shall choose a zero mean Gaussian distribution with covariance R.  
The measurement noise is assumed to be uncorrelated with the network weights and initial conditions. \n\nWe model the evolution of the network weights by assuming that they depend on the previous value $w_k$ and a stochastic component $d_k$. The process noise $d_k$ may represent our uncertainty in how the parameters evolve, modelling errors or unknown inputs. We assume the process noise to be a zero mean Gaussian process with covariance Q; however, other distributions can also be adopted. This choice of distributions for the network weights requires further research. The process noise is also assumed to be uncorrelated with the network weights. \n\nThe posterior density $p(W_k | Y_k)$, where $Y_k = \\{y_1, y_2, \\ldots, y_k\\}$ and $W_k = \\{w_1, w_2, \\ldots, w_k\\}$, constitutes the complete solution to the sequential estimation problem. In many applications, such as tracking, it is of interest to estimate one of its marginals, namely the filtering density $p(w_k | Y_k)$. By computing the filtering density recursively, we do not need to keep track of the complete history of the weights. Thus, from a storage point of view, the filtering density turns out to be more parsimonious than the full posterior density function. If we know the filtering density of the network weights, we can easily derive various estimates of the network weights, including centroids, modes, medians and confidence intervals. \n\n3  SEQUENTIAL IMPORTANCE SAMPLING \n\nIn the sequential importance sampling optimisation framework, a set of representative samples is used to describe the posterior density function of the network parameters. Each sample consists of a complete set of network parameters. More specifically, we make use of the following Monte Carlo approximation: \n\n$\\hat{p}(W_k | Y_k) = \\frac{1}{N} \\sum_{i=1}^{N} \\delta(W_k - W_k^{(i)})$ \n\n\f412 \n\nJ. F. G. de Freitas, M.  
Niranjan, A. Doucet and A. H. Gee \n\nwhere $W_k^{(i)}$ represents the samples used to describe the posterior density and $\\delta(.)$ denotes the Dirac delta function. Consequently, any expectations of the form: \n\n$E[f_k(W_k)] = \\int f_k(W_k) p(W_k | Y_k) dW_k$ \n\nmay be approximated by the following estimate: \n\n$E[f_k(W_k)] \\approx \\frac{1}{N} \\sum_{i=1}^{N} f_k(W_k^{(i)})$ \n\nwhere the samples $W_k^{(i)}$ are drawn from the posterior density function. Typically, one cannot draw samples directly from the posterior density. Yet, if we can draw samples from a proposal density function $\\pi(W_k | Y_k)$, we can transform the expectation under $p(W_k | Y_k)$ to an expectation under $\\pi(W_k | Y_k)$ as follows: \n\n$E[f_k(W_k)] = \\int f_k(W_k) \\frac{p(W_k | Y_k)}{\\pi(W_k | Y_k)} \\pi(W_k | Y_k) dW_k = \\frac{\\int f_k(W_k) q_k(W_k) \\pi(W_k | Y_k) dW_k}{\\int q_k(W_k) \\pi(W_k | Y_k) dW_k} = \\frac{E_\\pi[q_k(W_k) f_k(W_k)]}{E_\\pi[q_k(W_k)]}$ \n\nwhere the variables $q_k(W_k)$ are known as the unnormalised importance ratios: \n\n$q_k = \\frac{p(Y_k | W_k) p(W_k)}{\\pi(W_k | Y_k)}$  (3) \n\nHence, by drawing samples from the proposal function $\\pi(.)$, we can approximate the expectations of interest by the following estimate: \n\n$E[f_k(W_k)] \\approx \\frac{\\frac{1}{N} \\sum_{i=1}^{N} f_k(W_k^{(i)}) q_k(W_k^{(i)})}{\\frac{1}{N} \\sum_{i=1}^{N} q_k(W_k^{(i)})} = \\sum_{i=1}^{N} f_k(W_k^{(i)}) \\tilde{q}_k(W_k^{(i)})$  (4) \n\nwhere the normalised importance ratios $\\tilde{q}_k^{(i)}$ are given by: \n\n$\\tilde{q}_k^{(i)} = \\frac{q_k^{(i)}}{\\sum_{j=1}^{N} q_k^{(j)}}$ \n\nIt is not difficult to show (de Freitas, Niranjan, Gee and Doucet 1998) that, if we assume w to be a hidden Markov process with initial density $p(w_0)$ and transition density $p(w_k | w_{k-1})$, various recursive algorithms can be derived. One of these algorithms (HySIR), which we derive in (de Freitas, Niranjan, Gee and Doucet 1998), has been shown to perform well in neural network training. Here we extend the algorithm to deal with multiple noise levels.  
The pseudo-code for the HySIR algorithm with EKF updating is as follows\u00b9: \n\n1. INITIALISE NETWORK WEIGHTS (k=0): \n2. For $k = 1, \\ldots, L$ \n   (a) SAMPLING STAGE: \n   For $i = 1, \\ldots, N$ \n   \u2022 Predict via the dynamics equation: \n     $\\hat{w}_{k+1}^{(i)} = w_k^{(i)} + d_k^{(i)}$ \n     where $d_k^{(i)}$ is a sample from $p(d_k)$ ($N(0, Q_k)$ in our case). \n   \u2022 Update samples with the EKF equations. \n   \u2022 Evaluate the importance ratios: \n     $q_{k+1}^{(i)} = q_k^{(i)} p(y_{k+1} | \\hat{w}_{k+1}^{(i)}) = q_k^{(i)} N(g(x_{k+1}, \\hat{w}_{k+1}^{(i)}), R_k)$ \n   \u2022 Normalise the importance ratios: \n     $\\tilde{q}_{k+1}^{(i)} = \\frac{q_{k+1}^{(i)}}{\\sum_{j=1}^{N} q_{k+1}^{(j)}}$ \n   (b) RESAMPLING STAGE: \n   For $i = 1, \\ldots, N$ \n   If $N_{eff} \\geq$ Threshold: \n   \u2022 $w_{k+1}^{(i)} = \\hat{w}_{k+1}^{(i)}$ \n   \u2022 $P_{k+1}^{(i)} = \\hat{P}_{k+1}^{(i)}$ \n   \u2022 $Q_{k+1}^{*(i)} = \\hat{Q}_{k+1}^{*(i)}$ \n   Else \n   \u2022 Resample new index j \n   \u2022 $w_{k+1}^{(i)} = \\hat{w}_{k+1}^{(j)}$, \n   \u2022 $q_{k+1}^{(i)} = \\frac{1}{N}$ \n\nIn the EKF equations, $K_{k+1}$ is known as the Kalman gain matrix, $I_{mm}$ denotes the identity matrix of size $m \\times m$, and R* and Q* are two tuning parameters, whose roles are explained in detail in (de Freitas, Niranjan and Gee 1997). G represents the Jacobian matrix and, strictly speaking, $P_k$ is an approximation to the covariance matrix of the network weights. The resampling stage is used to eliminate samples with low probability and multiply samples with high probability. Various authors have described efficient algorithms for accomplishing this task in O(N) operations (Pitt and Shephard 1997, Carpenter, Clifford and Fearnhead 1997, Doucet 1998). \n\n\u00b9 We have made available the software for the implementation of the HySIR algorithm at the following web-site: http://svr-www.eng.cam.ac.uk/~jfgf/software.html. \n\n4  EXPERIMENT \n\nTo assess the ability of the hybrid algorithm to estimate time-varying hidden parameters, we generated input-output data from a logistic function followed by a linear scaling and a displacement as shown in Figure 1. This simple model is equivalent to an MLP with one hidden neuron and an output linear neuron. We applied two Gaussian ($N(0, 10)$) input sequences to the model and corrupted the weights and output values with Gaussian noise ($N(0, 1 \\times 10^{-3})$ and $N(0, 1 \\times 10^{-4})$ respectively). We then trained a second model with the same structure using the input-output data generated by the first model. \n\nFigure 1: Logistic function with linear scaling and displacement used in the experiment. The weights were chosen as follows: $w_1(k) = 1 + k/100$, $w_2(k) = \\sin(0.06k) - 2$, $w_3(k) = 0.1$, $w_4(k) = 1$, $w_5(k) = -0.5$. \n\nIn so doing, we chose 100 sampling trajectories and set R to 10, Q to $1 \\times 10^{-3} I_{55}$, the initial weights variance to 5, $P_0$ to $100 I_{55}$ and R* to $1 \\times 10^{-5}$. The process noise parameter Q* was set to three levels: $5 \\times 10^{-3}$, $1 \\times 10^{-3}$ and $1 \\times 10^{-10}$, as shown in the plot of Figure 2 at time zero. \n\nFigure 2: Noise level estimation with the HySIR algorithm. \n\nIn the training phase, of 200 time steps, we allowed the model weights to vary with time. During this phase, the HySIR algorithm was used to track the input-output training data and estimate the latent model weights. In addition, we assumed three possible noise variance levels at the beginning of the training session. After the 200th time step, we fixed the values of the weights and generated another 200 input-output data test sets from the original model. The input test data was then fed to the trained model, using the weights values estimated at the 200th time step.  
Subsequently, the output prediction of the trained model was compared to the output data from the original model to assess the generalisation performance of the training process. As shown in Figure 2, the noise level of the trajectories converged to the true value ($1 \\times 10^{-3}$). In addition, it was possible to track the network weights and obtain accurate output predictions as shown in Figures 3 and 4. \n\nFigure 3: One step ahead predictions during the training phase (left) and stationary predictions in the test phase (right). \n\nFigure 4: Weights tracking performance with the HySIR algorithm. As indicated by the histograms of $w_2$, the algorithm performs a global search in parameter space. \n\n5  CONCLUSIONS \n\nIn this paper, we have presented a sequential Monte Carlo approach for training neural networks in a Bayesian setting. In particular, we proposed an algorithm (HySIR) that makes use of both gradient and sampling information. HySIR can be interpreted as a Gaussian mixture filter, in that only a few sampling trajectories need to be employed. 
Yet, as the number of trajectories increases, the computational requirements increase only linearly. Therefore, the method is also suitable as a sampling strategy for approximating multi-modal distributions. Further avenues of research include the design of algorithms for adapting the noise covariances R and Q, studying the effect of different noise models for the network weights and improving the computational efficiency of the algorithms. \n\nACKNOWLEDGEMENTS \n\nJo\u00e3o FG de Freitas is financially supported by two University of the Witwatersrand Merit Scholarships, a Foundation for Research Development Scholarship (South Africa), an ORS award and a Trinity College External Studentship (Cambridge). \n\nReferences \n\nCarpenter, J., Clifford, P. and Fearnhead, P. (1997). An improved particle filter for non-linear problems, Technical report, Department of Statistics, Oxford University, England. Available at http://www.stats.ox.ac.uk/~clifford/index.htm. \n\nde Freitas, J. F. G., Niranjan, M. and Gee, A. H. (1997). Hierarchical Bayesian-Kalman models for regularisation and ARD in sequential learning, Technical Report CUED/F-INFENG/TR 307, Cambridge University, http://svr-www.eng.cam.ac.uk/~jfgf. \n\nde Freitas, J. F. G., Niranjan, M. and Gee, A. H. (1998). Regularisation in sequential learning algorithms, in M. I. Jordan, M. J. Kearns and S. A. Solla (eds), Advances in Neural Information Processing Systems, Vol. 10, MIT Press. \n\nde Freitas, J. F. G., Niranjan, M., Gee, A. H. and Doucet, A. (1998). Sequential Monte Carlo methods for optimisation of neural network models, Technical Report CUED/F-INFENG/TR 328, Cambridge University, http://svr-www.eng.cam.ac.uk/~jfgf. \n\nDoucet, A. (1998).  
On sequential simulation-based methods for Bayesian filtering, Technical Report CUED/F-INFENG/TR 310, Cambridge University. Available at http://www.stats.bris.ac.uk:81/MCMC/pages/list.html. \n\nGordon, N. J., Salmond, D. J. and Smith, A. F. M. (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation, IEE Proceedings-F 140(2): 107-113. \n\nPitt, M. K. and Shephard, N. (1997). Filtering via simulation: Auxiliary particle filters, Technical report, Department of Statistics, Imperial College of London, England. Available at http://www.nuff.ox.ac.uk/economics/papers. \n\nSinghal, S. and Wu, L. (1988). Training multilayer perceptrons with the extended Kalman algorithm, in D. S. Touretzky (ed.), Advances in Neural Information Processing Systems, Vol. 1, San Mateo, CA, pp. 133-140. \n", "award": [], "sourceid": 1598, "authors": [{"given_name": "Jo\u00e3o", "family_name": "de Freitas", "institution": null}, {"given_name": "Mahesan", "family_name": "Niranjan", "institution": null}, {"given_name": "Arnaud", "family_name": "Doucet", "institution": null}, {"given_name": "Andrew", "family_name": "Gee", "institution": null}]}