{"title": "Learning Multi-Class Dynamics", "book": "Advances in Neural Information Processing Systems", "page_first": 389, "page_last": 395, "abstract": null, "full_text": "Learning multi-class dynamics \n\nA. Blake, B. North and M. Isard \n\nDepartment of Engineering Science, University of Oxford, Oxford OX1 3PJ, UK. \n\nWeb: http://www.robots.ox.ac.uk/ ... vdg/ \n\nAbstract \n\nStandard techniques (e.g. Yule-Walker) are available for learning Auto-Regressive process models of simple, directly observable, dynamical processes. When sensor noise means that dynamics are observed only approximately, learning can still be achieved via Expectation-Maximisation (EM) together with Kalman filtering. However, this does not handle more complex dynamics, involving multiple classes of motion. For that problem, we show here how EM can be combined with the CONDENSATION algorithm, which is based on propagation of random sample-sets. Experiments have been performed with visually observed juggling, and plausible dynamical models are found to emerge from the learning process. \n\n1 Introduction \n\nThe paper presents a probabilistic framework for estimation (perception) and classification of complex time-varying signals, represented as temporal streams of states. Automated learning of dynamics is of crucial importance, as practical models may be too complex for their parameters to be set by hand. The framework is particularly general, in several respects, as follows. \n\n1. Mixed states: each state comprises a continuous and a discrete component. The continuous component can be thought of as representing the instantaneous position of some object in a continuum. The discrete state represents the current class of the motion, and acts as a label, selecting the current member from a set of dynamical models. 
\n\n2. Multi-dimensionality: the continuous component of a state is, in general, allowed to be multi-dimensional. This could represent motion in a higher-dimensional continuum, for example two-dimensional translation as in figure 1. Other examples include multi-spectral acoustic or image signals, or multi-channel sensors such as an electro-encephalograph. \n\nFigure 1: Learning the dynamics of juggling. Three motion classes, emerging from dynamical learning, turn out to correspond accurately to ballistic motion (mid grey), catch/throw (light grey) and carry (dark grey). \n\n3. Arbitrary order: each dynamical system is modelled as an Auto-Regressive Process (ARP) and allowed to have arbitrary order (the number of time-steps of \"memory\" that it carries). \n\n4. Stochastic observations: the sequence of mixed states is \"hidden\" - not observable directly, but only via observations, which may be multi-dimensional, and are stochastically related to the continuous component of states. This aspect is essential to represent the inherent variability of response of any real signal-sensing system. \n\nEstimation for processes with properties 2, 3, 4 has been widely discussed both in the control-theory literature as \"estimation\" and \"Kalman filtering\" (Gelb, 1974) and in statistics as \"forecasting\" (Brockwell and Davis, 1996). Learning of models with properties 2, 3 is well understood (Gelb, 1974) and, once learned, such models can be used to drive pattern classification procedures, as in Linear Predictive Coding (LPC) in speech analysis (Rabiner and Bing-Hwang, 1993), or in classification of EEG signals (Pardey et al., 1995). When property 4 is added, the learning problem becomes harder (Ljung, 1987) because the training sets are no longer observed directly. 
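To make the directly-observed case concrete (properties 2, 3 without property 4), an ARP can be fitted by ordinary least squares on lagged states. The sketch below is a minimal illustration with invented scalar AR(2) coefficients, not the paper's exact Yule-Walker formulation.

```python
import numpy as np

# Simulate a stable second-order scalar ARP
#   x_t = a1*x_{t-1} + a2*x_{t-2} + b*w_t,  w_t ~ N(0,1),
# then recover its parameters from the directly observed trajectory.
rng = np.random.default_rng(0)
a1_true, a2_true, b_true = 1.5, -0.7, 0.1   # invented, stable AR(2) values

x = [0.0, 0.0]
for t in range(2, 2000):
    x.append(a1_true * x[-1] + a2_true * x[-2] + b_true * rng.standard_normal())
x = np.array(x)

# Regressors (x_{t-1}, x_{t-2}) against target x_t, solved by least squares.
X = np.column_stack([x[1:-1], x[:-2]])
target = x[2:]
coeffs, *_ = np.linalg.lstsq(X, target, rcond=None)
a1_hat, a2_hat = coeffs

# The residual standard deviation estimates the stochastic parameter b.
b_hat = np.std(target - X @ coeffs)
```

With sensor noise added to the observations (property 4), this direct regression is no longer available, which is what motivates the EM machinery of sections 3 and 4.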
\n\nMixed states (property 1) allow for combining perception with classification. Allowing properties 2, 4, but restricted to a 0th order ARP (in breach of property 3), gives Hidden Markov Models (HMMs) (Rabiner and Bing-Hwang, 1993), which have been used effectively for visual classification (Bregler, 1997). Learning HMMs is accomplished by the \"Baum-Welch\" algorithm, a form of Expectation-Maximisation (EM) (Dempster et al., 1977). Baum-Welch learning has been extended to \"graphical models\" of quite general topology (Lauritzen, 1996). In this paper, graph topology is a simple chain-pair as in standard HMMs, and the complexity of the problem lies elsewhere - in the generality of the dynamical model. \n\nGenerally then, restoring non-zero order to the ARPs (property 3), there is no exact algorithm for estimation. However, the estimation problem can be solved by random sampling algorithms, known variously as bootstrap filters (Gordon et al., 1993), particle filters (Kitagawa, 1996), and CONDENSATION (Blake and Isard, 1997). Here we show how such algorithms can be used, with EM, in dynamical learning theory and experiments (figure 1). \n\n2 Multi-class dynamics \n\nContinuous dynamical systems can be specified in terms of a continuous state vector x_t in R^{N_x}. In machine vision, for example, x_t represents the parameters of a time-varying shape at time t. Multi-class dynamics are represented by appending to the continuous state vector x_t a discrete state component y_t, to make a \"mixed\" state \n\nX_t = (x_t, y_t), \n\nwhere y_t in Y = {1, ..., N_y} is the discrete component of the state, drawn from a finite set of integer labels. 
Each  discrete  state  represents  a  class  of motion , for \nexample  \"stroke\",  \"rest\"  and  \"shade\"  for  a  hand engaged in drawing. \n\nCorresponding  to  each  state  Yt  =  Y  there  is  a  dynamical  model,  taken  to  be  a \nMarkov model of order KY  that specifies Pi (Xt IXt-l, . .. Xt-KY ) .  A linear-Gaussian \nMarkov model of order K  is  an Auto-Regressive Process  (ARP)  defined  by \n\nK \n\nXt  =  LAkxt-k + d + BWt \n\nk=1 \n\nin  which  each  Wt  is  a  vector of N x  independent  random N(O, 1)  variables  and  Wi, \nW t'  are independent for  t  \u00a5 t'.  The dynamical parameters of the model  are \n\n\u2022  deterministic parameters AI, A 2 , ... , AK \n\n\u2022  stochastic parameters B, which are multipliers for the stochastic process Wt, \nand determine the \"coupling\" of noise Wt  into the vector valued process  Xt. \n\nFor convenience of notation, let \n\nEach  state Y  E Y has a  set  {AY, BY, dY} of dynamical  parameters,  and  the goal  is \nto  learn  these  from  example  trajectories.  Note  that  the  stochastic  parameter  BY \nis  a  first-class  part  of a  dynamical  model,  representing  the  degree  and  the  shape \nof  uncertainty  in  motion,  allowing  the  representation  of  an  entire  distribution  of \npossible motions  for  each state y.  In addition,  and independently,  state transitions \nare governed by the transition matrix for  a  1st order Markov chain: \n\nP(Yt  =  y'IYt-1  =  y)  =  My,y\" \n\n\f392 \n\nA.  Blake.  B.  North and M.  Isard. \n\nObservations Zt  are assumed  to be conditioned  purely on the continuous  part x  of \nthe mixed state, independent of Yt,  and this maintains a healthy separation between \nthe modelling  of dynamics  and  of observations.  Observations  are also  assumed  to \nbe  independent,  both  mutually  and  with  respect  to  the  dynamical  process.  
The \nobservation process is  defined  by specifying,  at each time t,  the conditional density \np(ZtIXt)  which  is  taken to be Gaussian in  experiments here. \n\n3  Maximum Likelihood learning \n\nWhen  observations  are  exact,  maximum  likelihood  estimates  (MLE)  for  dynami(cid:173)\ncal parameters can be obtained from a training sequence Xi ... XT of mixed states. \nThe well known Yule-Walker formula approximates MLE (Gelb, 1974; Ljung, 1987), \nbut generalisations are needed to allow  for  short training sets  (small T), to include \nstochastic  parameters  B,  to  allow  a  non-zero  offset  d  (this  proves essential  in  ex(cid:173)\nperiments later)  and to encompass multiple dynamical classes. \n\nThe resulting MLE learning rule is  as follows. \n\nAY RY  = BY \n0' \n\nd Y = \n\n1 \n\nTY  _  KY \n\n0 \n\n(RY _  AYRY)  CY  = \n\n, \n\n1 \n\nTY  _  KY  \"\"0,0 \n\n(iW  _  AY('QY)T) \n\n.L\"O' \n\nwhere  (omitting the Y  superscripts for  clarity)  C =  BBT  and \n\nand the first-order moments Ri and (offset-invariant) auto correlations Ri,j, for each \nclass  y,  are given by \n\nRf = L x;_i  and  RL = RL - T  ~ KRfRrT, \n\ny;=y \n\nY \n\nwhere \n\nRL  =  L X;_iX;_j T; \n\nYt=Y \n\nTy  = H t  : Y;  = y}  ==  L  1. \n\nt:Yt=Y \n\nThe MLE for  the transition matrix M  is  constructed from  relative frequencies  as: \n\nM \n\nY,Y'  =  \"\"  T, were  y,y'  =  II \n\nh \n\nT \n\nTy,y' \n\n6y'EY  Y,Y \n\nll{t\u00b7  * \n\n.  Yt-l  = y, Yt  = Y \n\n* \n\n'} \n. \n\n4  Learning  with stochastic observations \n\nTo  allow  for  stochastic observations,  direct  MLE is  no longer possible,  but  an  EM \nlearning  algorithm  can  be  formulated.  Its  M-step  is  simply  the  MLE  estimate  of \nthe  previous section.  It might  be thought that the E-step should  consist  simply of \ncomputing expectations, for  instance [[xtIZ[J,  (where Zi =  (Zl,\"\"  Zt)  denotes  a \nsequence of observations)  and treating them  as  training values  x;.  
This  would  be \nincorrect however because the log-likelihood function  I:- for the problem is  not linear \nin  the x;  but quadratic.  Instead, we  need expectations \n\n\fLearning Multi-Class Dynamics \n\n393 \n\nconditioned  on  the  entire  training  set  Z'[  of observations,  given  that  \u00a3  is  linear \nin  the  R i ,  Ri,j  etc.  (Shumway  and Stoffer,  1982).  These expected  values  of auto(cid:173)\ncorrelations and  frequencies  are to be used  in  place of actual  auto correlations and \nfrequencies  in  the learning formulae of section  3.  The question  is,  how  to compute \nthem.  In the special case y = {I} of single-class dynamics, and assuming a Gaussian \nobservation density,  exact methods are available for  computing expected  moments, \nusing  Kalman  and  smoothing filters  (Gelb,  1974),  in  an  \"augmented  state\"  filter \n(North  and  Blake,  1998).  For  multi-class  dynamics,  exact  computation  is  infeasi(cid:173)\nble,  but good approximations can be achieved based on propagation of sample sets, \nusing  CONDENSATION. \n\nForward sampling with backward  chaining \n\nFor  the  purposes  of learning,  an  extended  and  generalised  form  of the  CONDEN(cid:173)\nSATION  algorithm  is  required.  The  generalisations  allow  for  mixed  states,  arbi(cid:173)\ntrary order for  the ARP, and backward-chaining of samples.  In backward chaining, \nsample-sets for  successive times  are  built  up  and  stored  together  with  a  complete \nstate history back to time t = O.  The extended  CONDENSATION  algorithm is  given \nin  figure  2.  Note that the algorithm needs  to be initialised.  This  requires  that the \nYo  and  (X~~lo'  k  = 0, ... ,KYO  - 1)  be drawn  from  a  suitable  (joint)  prior for  the \nmulti-class process.  One way to do this is  to ensure that the training set starts in a \nknown  state and  to fix  the initial sample-values  accordingly.  
Normally,  the choice \nof prior is  not too important as it is  dominated  by  data. \nAt  time  t  =  T,  when  the  entire  training  sequence  has  been  processed,  the  final \nsample set is \n\n{ (X(n) \n\nTIT'\u00b7 .. ,  OIT' 7rT \n\nX(n\u00bb) \n\n(n)} \n\n- 1 \n\n,n -\n\n, ... , \n\nN} \n\nrepresents  fairly  (in  the  limit,  weakly,  as  N  -+  00)  the  posterior  distribution  for \nthe  entire  state  sequence  X O,  .\u2022\u2022 ,XT,  conditioned  on  the  entire  training  set  Z'[ \nof observations.  The  expectations  of the  autocorrelation  and  frequency  measures \nrequired  for  learning can be estimated from  the sample set, for  example: \n\nAn  alternative algorithm is  a  sample-set  version  of forward-backward  propagation \n(Kitagawa, 1996).  Experiments have suggested that probability densities generated \nby  this  form  of smoothing  converge  far  more  quickly  with  respect  to  sample  set \nsize  N,  but  at the expense  of computational  complexity  - O(N2)  as  opposed  to \nO(N log N) for  the algorithm above. \n\n5  Practical applications \n\nExperiments are reported briefly here on learning the dynami(:s of juggling using the \nEM-Condensation  algorithm,  as  in  figure  1.  An  offset  d Y  is  learned for  each  class \nin Y  =  {I, 2, 3};  other dynamical  parameters are fixed  such  that that  learning  d Y \namounts to learning mean  accelerations a Y  for  each  class.  The transition matrix is \nalso learned.  From a  more or \u00b7less  neutral starting point, learned structure emerges \nas in figure 3.  Around 60 iterations of EM suffice,  with N  = 2048, to learn dynamics \nin  this  case.  It is  clear  from  the figure  that the learned structure is  an  altogether \nplausible model for  the juggling process. \n\n\f394 \n\nA.  Blake,  B. North and M.  Isard \n\nIterate for  t  =  1, ... , T. \n\nonstruct t  e samp e-set \n\nh \n\nC \nt. 
\n\nI \n\n{(X (n) \n\nlit\"'\"  X tit  ,7rt \n\n(n\u00bb) \n\n(n)} \n\n,n =  1, ... , N  for  time \n\n. \n\nFor each n: \n\n1.  Choose (with  replacement)  mE {I, .. . , N}  with  prob.  7ri~{' \n\n2.  Predict by sampling from \n(x  I  vt-l -\n\n1\"\\.1 \n\nt \n\n-\n\nP \n\n(X(m) \n\nllt-l\"'\" \n\nX(m\u00bb)) \n\nt-llt-1 \n\nto choose X~~).  For multi-class ARPs this is  done in  two steps. \n\nDiscrete:  Choose y~n) =  y'  E Y with probability  My,y\"  where \n\ny  =  y~~i. \n\nContinuous:  Compute \n\nK \n\n(n)  _  ~AY (m) \n\nx tit  - ~ kXt-klt-l \n\n+d + Bw~n), \n\nk=l \n\nwhere y  =  y~n) and w~n) is  a vector of standard normal r.v. \n\n3.  Observation weights 7r~n)  are computed from the observation \n\ndensity,  evaluated for  the current observations  Zt: \n\n(n) \n\n7rt  = P  Zt  Xt = x tit \n\n(n\u00bb) \n' \n\n(I \n\nthen normalised multiplicatively so that  En 7ri n )  =  1. \n\n4.  Update sample history: \n\nX (n)  - x(m) \n\nti lt  -\n\ntilt-I'  t  =  1, ... , t - 1. \n\nI \n\nFigure 2:  The CONDENSATION  algorithm for forward propagation with back(cid:173)\nward  chaining. \n\nAcknowledgements \n\nWe  are  grateful  for  the  support  of  the  EPSRC  (AB,BN)  and  Magdalen  College \nOxford  (MI). \n\nReferences \n\nBlake,  A. and Isard,  M.  (1997) .  The Condensation algorithm -\n\nconditional density prop(cid:173)\nagation  and applications to visual tracking.  In Advances  in Neural  Information  Pro(cid:173)\ncessing  Systems  9,  pages  361-368.  MIT Press. \n\n\fLearning Multi-Class Dynamics \n\n395 \n\n(:0 \n\n0.01 \n\na  =  (  0.0  ) \n\n-9.7 \n\n0.04 \n\nBallistic \n\npat:::) \n~ Cony \n\na=(-;:)  ~ \nCatchlthrow  J \n\nFigure  3:  Learned  dynamical  model  for  juggling.  The  three  motion  classes \nallowed  in  this  experiment  organise  themselves  into:  ballistic  motion  (acceleration \na  ~ -g),- catch/throw,- carry.  
As expected,  life-time  in the  ballistic  state  is  longest, \nthe  transition  probability  of 0.95  corresponding  to  20  time-steps  or  about  0.7  sec(cid:173)\nonds.  Transitions  tend  to  be  directed,  as  expected,- for  example  ballistic  motion  is \nmore  likely  to  be  followed  by  a  catch/throw  (p  = 0.04)  than  by  a  carry  (p  = 0.01). \n(Acceleration a shown here  in units of m/ S2 .) \n\nBregler,  C.  (1997).  Learning and recognising human dynamics in video sequences.  In Proc. \n\nConf.  Computer  Vision  and Pattern  Recognition. \n\nBrockwell,  P.  and Davis,  R.  (1996).  Introduction  to  time-series  and forecasting.  Springer(cid:173)\n\nVerlag. \n\nDempster,  A.,  Laird,  M.,  and  Rubin,  D.  (1977) .  Maximum  likelihood  from  incomplete \n\ndata via the EM  algorithm.  J.  Roy.  Stat.  Soc.  B ., 39:1-38. \n\nGelb,  A.,  editor  (1974).  Applied Optimal Estimation.  MIT Press,  Cambridge,  MA. \n\nGordon,  N.,  Salmond,  D.,  and  Smith,  A.  (1993).  Novel  approach  to  nonlinear/non(cid:173)\n\nGaussian  Bayesian state estimation.  lEE Proc.  F,  140(2):107- 113. \n\nKitagawa,  G.  (1996).  Monte  Carlo  filter  and smoother for  non-Gaussian  nonlinear  state \n\nspace models.  Journal  of Computational  and  Graphical  Statistics,  5(1) :1- 25 . \n\nLauritzen,  S.  (1996).  Graphical  models.  Oxford. \n\nLjung,  L.  (1987).  System  identification:  theory  for  the  user.  Prentice-Hall. \n\nNorth,  B.  and  Blake,  A.  (1998). \n\nLearning  dynamical  models  using  expectation(cid:173)\n\nmaximisation.  In  Proc.  6th  Int.  Conf.  on  Computer  Vision,  pages 384-389. \n\nPar dey,  J.,  Roberts,  S.,  and  Tarassenko,  L.  (1995).  A  review  of parametric  modelling \n\ntechniques for  EEG  analysis.  Medical  Engineering  Physics,  18(1):2- 1l. \n\nRabiner, L.  and Bing-Hwang, J.  (1993) .  Fundamentals  of speech recognition. Prentice-Hall. \n\nShumway, R.  and Stoffer, D.  (1982) . 
An approach to time series smoothing and forecasting using the EM algorithm. J. Time Series Analysis, 3(4):253-264. \n", "award": [], "sourceid": 1511, "authors": [{"given_name": "Andrew", "family_name": "Blake", "institution": null}, {"given_name": "Ben", "family_name": "North", "institution": null}, {"given_name": "Michael", "family_name": "Isard", "institution": null}]}