{"title": "Dual Estimation and the Unscented Transformation", "book": "Advances in Neural Information Processing Systems", "page_first": 666, "page_last": 672, "abstract": null, "full_text": "Dual Estimation and the Unscented \n\nTransformation \n\nEricA. Wan \n\nericwan@ece.ogi.edu \n\nRudolph van der Merwe \nrudmerwe@ece.ogi.edu \n\nAlex T.  Nelson \n\natneison@ece.ogi.edu \n\nOregon Graduate Institute of Science & Technology \nDepartment of Electrical and Computer Engineering \n20000 N.W.  Walker Rd., Beaverton, Oregon 97006 \n\nAbstract \n\nDual estimation  refers  to  the problem of simultaneously  estimating the \nstate of a dynamic system and the model which gives rise  to the dynam(cid:173)\nics.  Algorithms  include  expectation-maximization (EM), dual  Kalman \nfiltering,  and joint Kalman methods.  These methods have recently  been \nexplored in  the context of nonlinear modeling,  where a  neural network \nis  used  as  the functional form of the  unknown model.  Typically, an  ex(cid:173)\ntended  Kalman  filter  (EKF)  or smoother  is  used  for  the  part of the  al(cid:173)\ngorithm that estimates the clean state given the current estimated model. \nAn  EKF may  also be used to estimate the weights of the  network.  This \npaper points  out the flaws  in  using  the  EKF,  and  proposes an  improve(cid:173)\nment based on a new approach called the unscented transformation (UT) \n[3].  A  substantial performance gain is  achieved  with  the same order of \ncomputational complexity as  that of the standard EKF.  The approach is \nillustrated on several dual estimation methods. \n\n1 \n\nIntroduction \n\nWe  consider  the  problem  of learning  both  the  hidden  states  Xk  and  parameters  w  of a \ndiscrete-time nonlinear dynamic system, \n\nF(Xk , Vk, w) \nH(xk, nk, w), \n\n(1) \n(2) \n\nwhere Yk  is the only observed signal. The process noise Vk  drives the dynamic system, and \nthe observation noise is given by nk. Note that we are not assuming additivity of the noise \nsources. \n\nA  number of approaches have  been  proposed for  this  problem.  The dual EKF algorithm \nuses two separate EKFs:  one for signal estimation, and one for model estimation. The states \nare  estimated  given  the  current weights  and  the  weights  are  estimated  given  the  current \nstates.  In the joint EKF, the state and model parameters are concatenated within a combined \nstate  vector,  and  a  single  EKF is  used  to  estimate  both  quantities  simultaneously.  The \nEM algorithm  uses  an  extended  Kalman  smoother for  the  E-step,  in  which  forward  and \n\n\fDual Estimation and the Unscented Transformation \n\n667 \n\nbackward passes  are  made through the data to  estimate the  signal.  The model is  updated \nduring a separate M-step. \n\nFor a more thorough treatment and a theoretical basis on  how these algorithms relate, see \nNelson  [6].  Rather than  provide a comprehensive comparison between the different algo(cid:173)\nrithms,  the  goal  of this  paper is  to  point out the  assumptions  and  flaws  in  the  EKF (Sec(cid:173)\ntion  2), and offer a improvement based on  the unscented transformation/filter (Section 3). \nThe  unscented filter  has  recently  been  proposed  as  a  substitute  for  the EKF in  nonlinear \ncontrol problems (known dynamic model) [3]. This paper presents new research on the use \nof the  UF within  the dual estimation  framework for  both state and  weight estimation.  
2 Flaws in the EKF

Assume for now that we know the model (weight parameters) for the dynamic system in Equations 1 and 2. Given the noisy observation y_k, a recursive estimate for x_k can be expressed in the form

\hat{x}_k = (optimal prediction of x_k) + G_k [y_k - (optimal prediction of y_k)].    (3)

This recursion provides the optimal MMSE estimate for x_k assuming the prior estimate \hat{x}_k^- and current observation y_k are Gaussian. We need not assume linearity of the model. The optimal terms in this recursion are given by

\hat{x}_k^- = E[F(\hat{x}_{k-1}, v_{k-1})],  \quad  \hat{y}_k^- = E[H(\hat{x}_k^-, n_k)],  \quad  G_k = P_{x_k \tilde{y}_k} P_{\tilde{y}_k \tilde{y}_k}^{-1},    (4)

where the optimal prediction \hat{x}_k^- is the expectation of a nonlinear function of the random variables \hat{x}_{k-1} and v_{k-1} (with a similar interpretation for the optimal prediction \hat{y}_k^-). The optimal gain term G_k is expressed as a function of posterior covariance matrices (with \tilde{y}_k = y_k - \hat{y}_k^-). Note these terms also require taking expectations of a nonlinear function of the prior state estimates.

The Kalman filter calculates these quantities exactly in the linear case. For nonlinear models, however, the extended KF approximates them as

\hat{x}_k^- \approx F(\hat{x}_{k-1}, \bar{v}),  \quad  \hat{y}_k^- \approx H(\hat{x}_k^-, \bar{n}),    (5)

where predictions are approximated simply as the function of the prior mean values of the estimates (no expectation taken). The covariances are determined by linearizing the dynamic equations (x_{k+1} \approx A x_k + B v_k, y_k \approx C x_k + D n_k), and then determining the posterior covariance matrices analytically for the linear system. As such, the EKF can be viewed as providing \"first-order\" approximations to the optimal terms (in the sense that expressions are approximated using a first-order Taylor series expansion of the nonlinear terms around the mean values). While \"second-order\" versions of the EKF exist, their increased implementation and computational complexity tend to prohibit their use.

3 The Unscented Transformation/Filter

The unscented transformation (UT) is a method for calculating the statistics of a random variable which undergoes a nonlinear transformation [3]. Consider propagating a random variable \alpha (dimension L) through a nonlinear function, \beta = g(\alpha). Assume \alpha has mean \bar{\alpha} and covariance P_\alpha. To calculate the statistics of \beta, we form a matrix X of 2L+1 sigma vectors X_i, where the first vector (X_0) corresponds to \bar{\alpha}, and the rest are computed from the mean plus and minus each column of the matrix square root of (L+\kappa) P_\alpha. These sigma vectors are propagated through the nonlinear function, and the mean and covariance for \beta are approximated using a weighted sample mean and covariance,

\bar{\beta} \approx \frac{1}{L+\kappa} \{ \kappa g(X_0) + \frac{1}{2} \sum_{i=1}^{2L} g(X_i) \},    (6)

P_\beta \approx \frac{1}{L+\kappa} \{ \kappa [g(X_0)-\bar{\beta}][g(X_0)-\bar{\beta}]^T + \frac{1}{2} \sum_{i=1}^{2L} [g(X_i)-\bar{\beta}][g(X_i)-\bar{\beta}]^T \},    (7)

where \kappa is a scaling factor. Note that this method differs substantially from general \"sampling\" methods (e.g., Monte-Carlo methods and particle filters [1]) which require orders of magnitude more sample points in an attempt to propagate an accurate (possibly non-Gaussian) distribution of the state. The UT approximations are accurate to the third order for Gaussian inputs for all nonlinearities. For non-Gaussian inputs, approximations are accurate to at least the second order, with the accuracy determined by the choice of \kappa [3].
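As a concrete illustration, here is a minimal Python sketch of the UT as given by Equations 6 and 7; the example function g and the value of kappa are our own illustrative choices, not settings from the paper.

```python
import numpy as np

def unscented_transform(mean, cov, g, kappa=1.0):
    # Form the 2L+1 sigma points: the mean itself, plus and minus each
    # column of the matrix square root of (L + kappa) * cov.
    L = mean.size
    S = np.linalg.cholesky((L + kappa) * cov)
    sigma = np.vstack([mean, mean + S.T, mean - S.T])
    # Weights: W_0 = kappa/(L+kappa), W_i = 1/(2(L+kappa)) otherwise.
    W = np.full(2 * L + 1, 0.5 / (L + kappa))
    W[0] = kappa / (L + kappa)
    # Propagate through the nonlinearity and form weighted statistics.
    Y = np.array([g(s) for s in sigma])
    beta_mean = W @ Y                      # Equation 6
    D = Y - beta_mean
    beta_cov = (W[:, None] * D).T @ D      # Equation 7
    return beta_mean, beta_cov

# Example: a polar-to-Cartesian style nonlinearity in two dimensions.
g = lambda a: np.array([a[0] * np.cos(a[1]), a[0] * np.sin(a[1])])
m, P = unscented_transform(np.array([1.0, 0.5]), 0.01 * np.eye(2), g)
```

Only five sigma points are needed in this two-dimensional example, in contrast to the thousands of samples a Monte-Carlo estimate of the same statistics would typically require.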
A simple example is shown in Figure 1 for a 2-dimensional system: the left plots show the true mean and covariance propagation using Monte-Carlo sampling; the center plots show the performance of the UT (note only 5 sigma points are required); the right plots show the results using a linearization approach as would be done in the EKF. The superior performance of the UT is clear.

Figure 1: Example of the UT for mean and covariance propagation. a) actual, b) UT, c) first-order linear (EKF).

The unscented filter (UF) [3] is a straightforward extension of the UT to the recursive estimation in Equation 3, where we set \alpha = x_k, and denote the corresponding sigma matrix as X(k|k). The UF equations are given below. It is interesting to note that no explicit calculation of Jacobians or Hessians is necessary to implement this algorithm. The total number of computations is only order L^2 as compared to L^3 for the EKF.¹

UF Equations

W_0 = \kappa/(L+\kappa),  W_1 ... W_{2L} = 1/(2(L+\kappa))
X(k|k-1) = F[X(k-1|k-1), P_v^{1/2}]
\hat{x}_k^- = \sum_{i=0}^{2L} W_i X_i(k|k-1)
P_k^- = \sum_{i=0}^{2L} W_i [X_i(k|k-1) - \hat{x}_k^-][X_i(k|k-1) - \hat{x}_k^-]^T
Y(k|k-1) = H[X(k|k-1), P_n^{1/2}]
\hat{y}_k^- = \sum_{i=0}^{2L} W_i Y_i(k|k-1)
P_{\tilde{y}_k \tilde{y}_k} = \sum_{i=0}^{2L} W_i [Y_i(k|k-1) - \hat{y}_k^-][Y_i(k|k-1) - \hat{y}_k^-]^T
P_{x_k y_k} = \sum_{i=0}^{2L} W_i [X_i(k|k-1) - \hat{x}_k^-][Y_i(k|k-1) - \hat{y}_k^-]^T
\hat{x}_k = \hat{x}_k^- + P_{x_k y_k} P_{\tilde{y}_k \tilde{y}_k}^{-1} (y_k - \hat{y}_k^-)
P_k = P_k^- - P_{x_k y_k} P_{\tilde{y}_k \tilde{y}_k}^{-1} P_{x_k y_k}^T

¹Note that a matrix square root using the Cholesky factorization is of order L^3/6. However, the covariance matrices are expressed recursively, and thus the square root can be computed in only order L^2 by performing a recursive update to the Cholesky factorization.
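The following Python sketch implements one UF recursion. For brevity it assumes additive process and measurement noise (the equations above instead pass the noise square roots P_v^{1/2}, P_n^{1/2} through F and H via the sigma points), and it reuses the propagated sigma points for the measurement update, a common simplification; `F`, `H`, `Q`, and `R` are placeholders for the user's model and noise covariances.

```python
import numpy as np

def sigma_points(mean, cov, kappa=1.0):
    L = mean.size
    S = np.linalg.cholesky((L + kappa) * cov)
    pts = np.vstack([mean, mean + S.T, mean - S.T])
    W = np.full(2 * L + 1, 0.5 / (L + kappa))
    W[0] = kappa / (L + kappa)
    return pts, W

def uf_step(x, P, y, F, H, Q, R, kappa=1.0):
    # Time update: propagate sigma points through the dynamics.
    X, W = sigma_points(x, P, kappa)
    Xp = np.array([F(s) for s in X])
    x_pred = W @ Xp
    Dx = Xp - x_pred
    P_pred = (W[:, None] * Dx).T @ Dx + Q
    # Measurement update: predicted observation statistics.
    Yp = np.array([H(s) for s in Xp])
    y_pred = W @ Yp
    Dy = Yp - y_pred
    P_yy = (W[:, None] * Dy).T @ Dy + R
    P_xy = (W[:, None] * Dx).T @ Dy
    K = P_xy @ np.linalg.inv(P_yy)          # gain: P_xy P_yy^{-1}
    x_new = x_pred + K @ (y - y_pred)
    P_new = P_pred - K @ P_yy @ K.T
    return x_new, P_new
```

Note that no Jacobians of F or H appear anywhere in the recursion.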
4 Application to Dual Estimation

This section shows the use of the UF within several dual estimation approaches. As an application domain for comparison, we consider modeling a noisy time-series as a nonlinear autoregression:

x_k = f(x_{k-1}, ..., x_{k-M}, w) + v_k,
y_k = x_k + n_k,    k \in \{1, ..., N\}.    (8)

The underlying clean signal x_k is a nonlinear function of its past M values, driven by white Gaussian process noise v_k with variance \sigma_v^2. The observed data point y_k includes the additive noise n_k, which is assumed to be Gaussian with variance \sigma_n^2. The corresponding state-space representation for the signal x_k is given by:

\mathbf{x}_k = F(\mathbf{x}_{k-1}, w) + B v_{k-1},    (9)
y_k = [1 \ 0 \ \cdots \ 0] \mathbf{x}_k + n_k,    (10)

where \mathbf{x}_k = [x_k, x_{k-1}, ..., x_{k-M+1}]^T is the vector of lagged signal values, the first element of F(\cdot) is f(x_{k-1}, ..., x_{k-M}, w) while the remaining elements simply shift the previous state down by one position, and B = [1 \ 0 \ \cdots \ 0]^T. In this context, the dual estimation problem consists of simultaneously estimating the clean signal x_k and the model parameters w from the noisy data y_k.

4.1 Dual EKF / Dual UF

One dual estimation approach is the dual extended Kalman filter developed in [8, 6]. The dual EKF requires a separate state-space representation for the signal and the weights. A state-space representation for the weights is generated by considering them to be a stationary process with an identity state transition matrix, driven by process noise u_k:

w_k = w_{k-1} + u_k,    (11)
y_k = f(x_{k-1}, w_k) + v_k + n_k.    (12)

The noisy measurement y_k has been rewritten as an observation on w. This allows the use of an EKF for weight estimation (representing a \"second-order\" optimization procedure) [7]. Two EKFs can now be run simultaneously for signal and weight estimation. At every time-step, the current estimate of the weights is used in the signal-filter, and the current estimate of the signal-state is used in the weight-filter.

The dual UF/EKF algorithm is formed by simply replacing the EKF for state-estimation with the UF while still using an EKF for weight-estimation. In the dual UF algorithm both state- and weight-estimation are done with the UF. Note that the state-transition is linear in the weight filter, so the nonlinearity is restricted to the measurement equation. Here, the UF gives a more exact measurement-update phase of estimation. The use of the UF for weight estimation in general is discussed in further detail in Section 5.

4.2 Joint EKF / Joint UF

An alternative approach to dual estimation is provided by the joint extended Kalman filter [4, 5]. In this framework the signal-state and weight vector are concatenated into a single joint state vector: z_k = [\mathbf{x}_k^T \ w_k^T]^T. The estimation of z_k can be done recursively by writing the state-space equations for the joint state as:

z_k = [F(\mathbf{x}_{k-1}, w_{k-1})^T \ \ w_{k-1}^T]^T + [(B v_{k-1})^T \ \ u_k^T]^T,  \quad  y_k = [1 \ 0 \ \cdots \ 0] z_k + n_k,    (13)

and running an EKF on the joint state-space to produce simultaneous estimates of the states x_k and w. As discussed in [6], the joint EKF provides approximate MAP estimates by maximizing the joint density of the signal and weights given the noisy data. Again, our approach in this paper is to use the UF instead of the EKF to provide more accurate estimation of the state, resulting in the joint UF algorithm.

4.3 EM - Unscented Smoothing

A somewhat different iterative approach to dual estimation is given by the expectation-maximization (EM) algorithm applied to nonlinear dynamic systems [2]. In each iteration, the conditional expectation of the signal is computed, given the data and the current estimate of the model (E-step). Then the model is found that maximizes a function of this conditional mean (M-step). For linear models, the M-step can be solved in closed form. The E-step is computed with a Kalman smoother, which combines the forward-time estimated mean and covariance (\hat{x}_k^f, P_k^f) of the signal given past data, with the backward-time predicted mean and covariance (\hat{x}_k^b, P_k^b) given the future data, producing the following smoothed statistics given all the data:

P_k^s = [(P_k^f)^{-1} + (P_k^b)^{-1}]^{-1},    (14)
\hat{x}_k^s = P_k^s [(P_k^f)^{-1} \hat{x}_k^f + (P_k^b)^{-1} \hat{x}_k^b].    (15)

When an MLP neural network model is used, the M-step can no longer be computed in closed form, and a gradient-based approach is used instead. The resulting algorithm is usually referred to as generalized EM (GEM).² The E-step is typically approximated by an extended Kalman smoother, wherein a linearization of the model is used for backward propagation of the state estimates.

²An exact M-step is possible using RBF networks [2].
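The combination step of Equations 14 and 15 reduces to a few lines of code; this sketch (ours) assumes the forward-filtered and backward-predicted statistics have already been produced by the forward and backward passes, whichever filter computes them.

```python
import numpy as np

def combine_smoothed(x_f, P_f, x_b, P_b):
    # Fuse forward-filtered (x_f, P_f) and backward-predicted (x_b, P_b)
    # Gaussian statistics in information form.
    I_f, I_b = np.linalg.inv(P_f), np.linalg.inv(P_b)
    P_s = np.linalg.inv(I_f + I_b)           # Equation 14
    x_s = P_s @ (I_f @ x_f + I_b @ x_b)      # Equation 15
    return x_s, P_s
```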
\n\nWe  propose  improving the  E-step of the  EM  algorithm  for  nonlinear models  by  using  a \nUP instead  of an  EKF to  compute both the  forward  and  backward passes  in  the  Kalman \nsmoother. Rather than linearize the model for the backward pass, as in [2], a neural network \nis  trained on  the  backward dynamics (as  well  as  the  forward  dynamics).  This  allows for \na more exact backward estimation phase using the UF, and enables the development of an \nunscented smoother (US). \n\n2 An exact M-step is  possible using RBF networks [2]. \n\n\fDual Estimation and the Unscented Transformation \n\n671 \n\n4.4  Experiments \n\nWe  present results  on  two  simple time-series  to  provide a clear illustration  of the use of \nthe  UP  over  the  EKE  The  first  series  is  the  Mackey-Glass  chaotic  series  with  additive \nWGN (SNR ~ 3dB).  The second time series (also chaotic) comes from an autoregressive \nneural network with random weights driven by Gaussian process noise and also corrupted \nby  additive WGN  (SNR  ~ 3dB).  A  standard  5-3-1  MLP  with  tanh  hidden  activation \nfunctions and a linear output layer was used in all the filters.  The process and measurement \nnoise variances were assumed to be known. \n\nResults on training and testing data, as well as training curves for the different dual estima(cid:173)\ntion methods are shown below.  The quoted numbers are normalized (clean signal variance) \nmean-square estimation and prediction errors.  The superior performance of the UT based \nalgorithms (especially  the dual  UF)  are clear.  Note also  the  more stable  learning curves \nusing the  UF approaches.  These improvements have been  found to  be consistent and sta(cid:173)\ntistically significant on  a number of additional experiments. \n\nTrain \n\nTest \n\nEst.  Pred.  Est.  Pred. \n0.54 \n0.20 \n0.53 \n0.19 \n0.48 \n0.15 \n0.56 \n0.22 \n0.19 \n0.53 \n\n0.50 \n0.50 \n0.45 \n0.53 \n0.50 \n\n0.21 \n0.19 \n0.14 \n0.22 \n0.18 \n\nChaotic  AR-NN \nAlgorithm \n. Dual  EKF \nDual  UF/EKF \nDual  UF \nJoint EKF \nJoint UF \n\nTrain \n\nTest \n\nEst.  Pred.  Est.  Pred. \n0.32 \n0.69 \n0.69 \n0.26 \n0.63 \n0.23 \n0.72 \n0.29 \n0.25 \n0.67 \n\n0.36 \n0.28 \n0.27 \n0.34 \n0.30 \n\n0.62 \n0.58 \n0.55 \n0.58 \n0.55 \n\nMackey-Glass \n\nChaotic AR-NN \n\n.  Dual EKF \n.  Dual UFIEKF \n\nDual UF \nJointEKF \n...... UF \n\n0 \n0 \n\n\" \n\nMackey-Glass \nAlgorithm \nDual EKF \nDual UF/EKF \nDual UF \nJoint EKF \nJoint UF \n\no. \n\n0.' \n\n01 \n\n~O8 \n:::;; \n\n.. \n\n]05 \nE \ngo\" \n\n03 \n\n02 \n\n0.1 \n0 \n\n\u2022 \n5 \niteration \n\n10 \n\n11 \n\nW en \n:::;;  o. \n\" \" ~ \n'\" \nE035 \n5 \nC \n\n0 3 \n\n0 25 \n\n0 2 \n0 \n\n10 \n\n15 \n\n~eration \n\n20 \n\n2S \n\n'\" \n\nThe final  table below compares smoother performance used for the E-step in the EM algo(cid:173)\nrithm. In this case, the network models are trained on the clean time-series, and then tested \non  the  noisy  data  using  either  the  standard  Kalman  smoother with  linearized  backward \nmodel (EKS 1), a Kalman smoother with a second nonlinear backward model (EKS2), and \nthe unscented smoother (US). The forward (F), backward (B), and smoothed (S) estimation \nerrors are reported.  Again the performance benefits of the unscented approach is clear. \n\nMackey-Glass \nAlgorithm \nEKSI \nEKS2 \nUS \n\nNorm. MSE \n\nF \n0.20 \n0.20 \n0.]0 \n\nB \n0.70 \n0.3] \n0.24 \n\nS \n0.27 \n0.19 \n0.08 \n\nChaotic AR-NN \nAlgorithm \nEKSI \nEKS2 \nUS \n\nNorm. 
5 UF Neural Network Training

As part of the dual UF algorithm, we introduced the use of the UF for weight estimation. The approach can also be seen as a new method for the general problem of training neural networks (i.e., for regression or classification problems where the input x is observed and no state-estimation is required). The advantage of the UF over the EKF in this case is not as obvious, as the state-transition function is linear (see Equation 11). However, as pointed out earlier, the observation is nonlinear. Effectively, the EKF builds up an approximation to the expected Hessian by taking outer products of the gradient. The UF, however, may provide a more accurate estimate through direct approximation of the expectation of the Hessian. We have performed a number of preliminary experiments on standard benchmark data. The figure below shows the mean and std. of learning curves (computed over 100 experiments with different initial weights) for the Mackay Robot Arm mapping dataset. Note the faster convergence, lower variance, and lower final MSE performance of the UF weight training. While these results are encouraging, further study is still necessary to fully contrast differences between UF and EKF weight training.

[Figure: Mean and std. of learning curves (MSE vs. epoch, over 100 runs) for UF versus EKF weight training on the Mackay Robot Arm dataset.]
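To make the weight-filter concrete, here is a hedged sketch of a single UF weight update in the spirit of Equations 11 and 12, again under an additive-noise simplification; `net` (a function evaluating the network with weight vector w on input x), `Q`, `R`, and `kappa` are illustrative placeholders rather than settings used in the paper's experiments.

```python
import numpy as np

def uf_weight_update(w, Pw, x, y, net, Q, R, kappa=1.0):
    # Time update is trivial (Equation 11): identity transition on w,
    # so the predicted covariance is simply Pw + Q.
    L = w.size
    P_pred = Pw + Q
    S = np.linalg.cholesky((L + kappa) * P_pred)
    sigma = np.vstack([w, w + S.T, w - S.T])
    W = np.full(2 * L + 1, 0.5 / (L + kappa))
    W[0] = kappa / (L + kappa)
    # Measurement update (Equation 12): the target y is a nonlinear
    # observation on the weights through the network.
    preds = np.array([net(s, x) for s in sigma])
    y_hat = W @ preds
    Dy = preds - y_hat
    Dw = sigma - w
    P_yy = (W[:, None] * Dy).T @ Dy + R
    P_wy = (W[:, None] * Dw).T @ Dy
    K = P_wy @ np.linalg.inv(P_yy)
    w_new = w + K @ (y - y_hat)
    Pw_new = P_pred - K @ P_yy @ K.T
    return w_new, Pw_new
```

No backpropagated gradients are required here; the network is only evaluated at the sigma points.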
6 Conclusions

The EKF has been widely accepted as a standard tool in the machine learning community. In this paper we have presented an alternative to the EKF using the unscented filter. The UF consistently achieves a better level of accuracy than the EKF at a comparable level of complexity. We demonstrated this performance gain on a number of dual estimation methods as well as on standard regression modeling.

Acknowledgements

This work was sponsored in part by the NSF under grant IRI-9712346.

References

[1] J. F. G. de Freitas, M. Niranjan, A. H. Gee, and A. Doucet. Sequential Monte Carlo methods for optimisation of neural network models. Technical Report TR-328, Cambridge University Engineering Department, Cambridge, England, November 1998.

[2] Z. Ghahramani and S. T. Roweis. Learning nonlinear dynamical systems using an EM algorithm. In M. J. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11: Proceedings of the 1998 Conference. MIT Press, 1999.

[3] S. J. Julier and J. K. Uhlmann. A new extension of the Kalman filter to nonlinear systems. In Proc. of AeroSense: The 11th International Symposium on Aerospace/Defence Sensing, Simulation and Controls, Orlando, Florida, 1997.

[4] R. E. Kopp and R. J. Orford. Linear regression applied to system identification for adaptive control systems. AIAA Journal, 1:2300-2306, October 1963.

[5] M. B. Matthews and G. S. Moschytz. Neural-network nonlinear adaptive filtering using the extended Kalman filter algorithm. In INNC, pages 115-118, 1990.

[6] A. T. Nelson. Nonlinear Estimation and Modeling of Noisy Time-Series by Dual Kalman Filtering Methods. PhD thesis, Oregon Graduate Institute, 1999. In preparation.

[7] S. Singhal and L. Wu. Training multilayer perceptrons with the extended Kalman filter. In Advances in Neural Information Processing Systems 1, pages 133-140, San Mateo, CA, 1989. Morgan Kaufmann.

[8] E. A. Wan and A. T. Nelson. Dual Kalman filtering methods for nonlinear prediction, estimation, and smoothing. In Advances in Neural Information Processing Systems 9, 1997.
", "award": [], "sourceid": 1681, "authors": [{"given_name": "Eric", "family_name": "Wan", "institution": null}, {"given_name": "Rudolph", "family_name": "van der Merwe", "institution": null}, {"given_name": "Alex", "family_name": "Nelson", "institution": null}]}