{"title": "Time Series Prediction using Mixtures of Experts", "book": "Advances in Neural Information Processing Systems", "page_first": 309, "page_last": 318, "abstract": null, "full_text": "Time Series Prediction Using Mixtures of Experts\n\nAssaf J. Zeevi\nInformation Systems Lab\nDepartment of Electrical Engineering\nStanford University\nStanford, CA 94305\nazeevi@isl.stanford.edu\n\nRon Meir\nDepartment of Electrical Engineering\nTechnion\nHaifa 32000, Israel\nrmeir@ee.technion.ac.il\n\nRobert J. Adler\nDepartment of Statistics\nUniversity of North Carolina\nChapel Hill, NC 27599\nadler@stat.unc.edu\n\nAbstract\n\nWe consider the problem of prediction of stationary time series, using the architecture known as mixtures of experts (MEM). Here we suggest a mixture which blends several autoregressive models. This study focuses on some theoretical foundations of the prediction problem in this context. More precisely, it is demonstrated that this model is a universal approximator, with respect to learning the unknown prediction function. This statement is strengthened as upper bounds on the mean squared error are established. Based on these results it is possible to compare the MEM to other families of models (e.g., neural networks and state dependent models). It is shown that a degenerate version of the MEM is in fact equivalent to a neural network, and the number of experts in the architecture plays a similar role to the number of hidden units in the latter model.\n\n1 Introduction\n\nIn this work we pursue a new family of models for time series, substantially extending, but strongly related to and based on the classic linear autoregressive moving average (ARMA) family. 
We wish to exploit the linear autoregressive technique in a manner that will enable a substantial increase in modeling power, in a framework which is non-linear and yet mathematically tractable.\n\nThe novel model, whose main building blocks are linear AR models, deviates from linearity in the integration process, that is, the way these blocks are combined. This model was first formulated in the context of a regression problem, and an extension to a hierarchical structure was also given [2]. It was termed the mixture of experts model (MEM).\n\nVariants of this model have recently been used in prediction problems both in economics and engineering. Recently, some theoretical aspects of the MEM, in the context of non-linear regression, were studied by Zeevi et al. [8], and an equivalence to a class of neural network models has been noted.\n\nThe purpose of this paper is to extend the previous work regarding the MEM in the context of regression, to the problem of prediction of time series. We shall demonstrate that the MEM is a universal approximator, and establish upper bounds on the approximation error, as well as the mean squared error, in the setting of estimation of the predictor function.\n\nIt is shown that the MEM is intimately related to several existing, state of the art, statistical non-linear models encompassing Tong's TAR (threshold autoregressive) model [7], and a certain version of Priestley's [6] state dependent models (SDM). In addition, it is demonstrated that the MEM is equivalent (in a sense that will be made precise) to the class of feedforward, sigmoidal, neural networks.\n\n2 Model Description\n\nThe MEM [2] is an architecture composed of $n$ expert networks, each being an AR(d) linear model. The experts are combined via a gating network, which partitions the input space accordingly. 
Considering a scalar time series $\\{x_t\\}$, we associate with each expert a probabilistic model (density function) relating input vectors $x_{t-d}^{t-1} \\equiv [x_{t-1}, x_{t-2}, \\ldots, x_{t-d}]$ to an output scalar $x_t \\in \\mathbb{R}$, and denote these probabilistic models by $p(x_t \\mid x_{t-d}^{t-1}; \\theta_j, \\sigma_j)$, $j = 1, 2, \\ldots, n$, where $(\\theta_j, \\sigma_j)$ is the expert parameter vector, taking values in a compact subset of $\\mathbb{R}^{d+1}$. In what follows we will use upper case $X_t$ to denote random variables, and lower case $x_t$ to denote values taken by those r.v.'s.\n\nLetting the parameters of each expert network be denoted by $(\\theta_j, \\sigma_j)$, $j = 1, 2, \\ldots, n$, those of the gating network by $\\theta_g$, and letting $\\Theta = (\\{\\theta_j, \\sigma_j\\}_{j=1}^{n}, \\theta_g)$ represent the complete set of parameters specifying the model, we may express the conditional distribution of the model, $p(x_t \\mid x_{t-d}^{t-1}; \\Theta)$, as\n\n$$p(x_t \\mid x_{t-d}^{t-1}; \\Theta) = \\sum_{j=1}^{n} g_j(x_{t-d}^{t-1}; \\theta_g)\\, p(x_t \\mid x_{t-d}^{t-1}; \\theta_j, \\sigma_j), \\tag{1}$$\n\nwhere $\\sum_{j=1}^{n} g_j(x_{t-d}^{t-1}; \\theta_g) = 1$ and $g_j(x_{t-d}^{t-1}; \\theta_g) \\geq 0$ $\\forall x_{t-d}^{t-1}$. We assume that the parameter vector $\\Theta \\in \\Omega$, a compact subset of $\\mathbb{R}^{2n(d+1)}$.\n\nFollowing the work of Jordan and Jacobs [2] we take the probability density functions to be Gaussian with mean $\\theta_j^T x_{t-d}^{t-1} + \\theta_{j,0}$ and variance $\\sigma_j$ (representative of the underlying, local AR(d) model). The gating function is $g_j(x; \\theta_g) \\equiv \\exp\\{\\theta_{g_j}^T x + \\theta_{g_j,0}\\} / \\sum_{i=1}^{n} \\exp\\{\\theta_{g_i}^T x + \\theta_{g_i,0}\\}$, thus implementing a multiple output logistic regression function.\n\nThe underlying non-linear mapping (i.e., the conditional expectation, or $L_2$ prediction function) characterizing the MEM is described by using (1) to obtain the conditional expectation of $X_t$,\n\n$$f_n^{\\Theta} = E[X_t \\mid X_{t-d}^{t-1}; M_n] = \\sum_{j=1}^{n} g_j(X_{t-d}^{t-1}; \\theta_g) [\\theta_j^T X_{t-d}^{t-1} + \\theta_{j,0}], \\tag{2}$$\n\nwhere $M_n$ denotes the MEM. Here the subscript $n$ stands for the number of experts. 
Thus, we have $\\hat{X}_t = f_n^{\\Theta} = f_n(X_{t-d}^{t-1}; \\Theta)$ where $f_n : \\mathbb{R}^d \\times \\Omega \\to \\mathbb{R}$, and $\\hat{X}_t$ denotes the projection of $X_t$ on the 'relevant past', given the model, thus defining the model predictor function.\n\nWe will use the notation MEM(n; d) where $n$ is the number of experts in the model (proportional to the complexity, or number of parameters in the model), and $d$ the lag size. In this work we assume that $d$ is known and given.\n\n3 Main results\n\n3.1 Background\n\nWe consider a stationary time series, more precisely a discrete time stochastic process $\\{X_t\\}$ which is assumed to be strictly stationary. We define the $L_2$ predictor function\n\n$$f = E[X_t \\mid X_{-\\infty}^{t-1}] = E[X_t \\mid X_{t-d}^{t-1}] \\quad \\text{a.s.}$$\n\nfor some fixed lag size $d$. Markov chains are perhaps the most widely encountered class of probability models exhibiting this dependence. The NAR(d), that is non-linear AR(d), model is another example, widely studied in the context of time series (see [4] for details). Assuming additive noise, the NAR(d) model may be expressed as\n\n$$X_t = f(X_{t-1}, X_{t-2}, \\ldots, X_{t-d}) + \\varepsilon_t. \\tag{3}$$\n\nWe note that in this formulation $\\{\\varepsilon_t\\}$ plays the role of the innovation process for $X_t$, and the function $f(\\cdot)$ describes the information on $X_t$ contained within its past history.\n\nIn what follows, we restrict the discussion to stochastic processes satisfying certain constraints on the memory decay; more precisely, we assume that $\\{X_t\\}$ is an exponentially $\\alpha$-mixing process. Loosely stated, this assumption enables the process to have a law of large numbers associated with it, as well as a certain version of the central limit theorem. 
These results are the basis for analyzing the asymptotic behavior of certain parameter estimators (see [9] for further details), but other than that this assumption is merely stated here for the sake of completeness. We note in passing that this assumption may be substantially weakened and still allow similar results to hold, but this requires more background and notation to be introduced, and is therefore not pursued in what follows (the reader is referred to [1] for further details).\n\n3.2 Objectives\n\nKnowing the $L_2$ predictor function, $f$, allows optimal prediction of future samples, where optimal is meant in the sense that the predicted value is the closest to the true value of the next sample point, in the mean squared error sense. It therefore seems a reasonable strategy to try to learn the optimal predictor function, based on some finite realization of the stochastic process, which we will denote $\\mathcal{D}_N = \\{X_t\\}_{t=1}^{N}$. Note that for $N \\gg d$, the number of sample points is approximately $N$.\n\nWe therefore define our objective as follows. Based on the data $\\mathcal{D}_N$, we seek the 'best' approximation to $f$, the $L_2$ predictor function, using the MEM(n; d) predictor $f_n^{\\Theta} \\in M_n$ as the approximator model.\n\nMore precisely, define the least squares (LS) parameter estimator for the MEM(n; d) as\n\n$$\\hat{\\Theta}_{n,N} = \\arg\\min_{\\Theta \\in \\Omega} \\sum_{t=d+1}^{N} \\left[ X_t - f_n(X_{t-d}^{t-1}; \\Theta) \\right]^2$$\n\nwhere $f_n(X_{t-d}^{t-1}; \\Theta)$ is $f_n^{\\Theta}$ evaluated at the point $X_{t-d}^{t-1}$, and define the LS functional estimator as\n\n$$\\hat{f}_{n,N} = f_n^{\\Theta} \\big|_{\\Theta = \\hat{\\Theta}_{n,N}}$$\n\nwhere $\\hat{\\Theta}_{n,N}$ is the LS parameter estimator.\n\nNow, define the functional estimator risk as\n\n$$\\mathrm{MSE}[f, \\hat{f}_{n,N}] \\equiv E_{\\mathcal{D}} \\left[ \\int |f - \\hat{f}_{n,N}|^2 \\, d\\nu \\right]$$\n\nwhere $\\nu$ is the $d$-fold probability measure of the process $\\{X_t\\}$. 
In this work we maintain that the integration is over some compact domain $I^d \\subset \\mathbb{R}^d$, though recent work [3] has shown that the results can be extended to $\\mathbb{R}^d$, at the price of slightly slower convergence rates.\n\nIt is reasonable, and quite customary, to expect a 'good' estimator to be one that is asymptotically unbiased. However, growth of the sample size itself need not, and in general does not, mean that the estimator is 'becoming' unbiased. Consequently, as a figure of merit, we may restrict attention to the approximation capacity of the model. That is, we ask what the error is in approximating a given class of predictor functions, using the MEM(n; d) (i.e., $\\{M_n\\}$) as the approximator class.\n\nTo measure this figure, we define the optimal risk as\n\n$$\\mathrm{MSE}[f, f_n^*] \\equiv \\int |f - f_n^*|^2 \\, d\\nu,$$\n\nwhere $f_n^* \\equiv f_n^{\\Theta} \\big|_{\\Theta = \\Theta_n^*}$ and\n\n$$\\Theta_n^* = \\arg\\min_{\\Theta \\in \\Omega} \\int |f - f_n^{\\Theta}|^2 \\, d\\nu,$$\n\nthat is, $\\Theta_n^*$ is the parameter minimizing the expected $L_2$ loss function. One may think of $f_n^*$ as the 'best' predictor function in the class of approximators, i.e., the closest approximation to the optimal predictor, given the finite complexity, $n$ (i.e., finite number of parameters), of the model. Here $n$ is simply the number of experts (AR models) in the architecture.\n\n3.3 Upper Bounds on the Mean Squared Error and Universal Approximation Results\n\nConsider first the case where we are simply interested in approximating the function $f$, assuming it belongs to some class of functions. The question then arises as to how well one may approximate $f$ by a MEM architecture comprising $n$ experts. The answer to this question is given in the following proposition, the proof of which can be found in [8]. 
\n\nProposition 3.1 (Optimal risk bounds) Consider the class of functions $M_n$ defined in (2) and assume that the optimal predictor $f$ belongs to a Sobolev class containing $r$ continuous derivatives in $L_2$. Then the following bound holds:\n\n$$\\mathrm{MSE}[f, f_n^*] \\leq \\frac{c}{n^{2r/d}} \\tag{4}$$\n\nwhere $c$ is a constant independent of $n$.\n\nPROOF SKETCH: The proof proceeds by first approximating the normalized gating function $g_j(\\cdot)$ by polynomials of finite degree, and then using the fact that polynomials can approximate functions in Sobolev space to within a known degree of approximation.\n\nThe following main theorem, establishing upper bounds on the functional estimator risk, constitutes the main result of this paper. The proof is given in [9].\n\nTheorem 3.1 (Upper bounds on the estimator risk) Suppose the stochastic process obeys the conditions set forth in the previous section. Assume also that the optimal predictor function, $f$, possesses $r$ smooth derivatives in $L_2$. Then for $N$ sufficiently large we have\n\n$$\\mathrm{MSE}[f, \\hat{f}_{n,N}] \\leq \\frac{c}{n^{2r/d}} + \\frac{m_n^*}{2N} + o\\left(\\frac{1}{N}\\right), \\tag{5}$$\n\nwhere $r$ is the number of continuous derivatives in $L_2$ that $f$ is assumed to possess, $d$ is the lag size, and $N$ is the size of the data set $\\mathcal{D}_N$.\n\nPROOF SKETCH: The proof proceeds by a standard stochastic Taylor expansion of the loss around the point $\\Theta_n^*$. Making common regularity assumptions [1] and using the assumption on the $\\alpha$-mixing nature of the process allows one to establish the usual asymptotic normality results, from which the result follows.\n\nWe use the notation $m_n^*$ to denote the effective number of parameters. 
More precisely, $m_n^* = \\mathrm{Tr}\\{B_n^* (A_n^*)^{-1}\\}$ and the matrices $A_n^*$ and $B_n^*$ are related to the Fisher information matrix in the case of misspecified estimation (see [1] for further discussion). The upper bound presented in Theorem 3.1 is related to the classic bias-variance decomposition in statistics and the obvious tradeoffs are evident by inspection.\n\n3.4 Comments\n\nIt follows from Proposition 3.1 that the class of mixtures of experts is a universal approximator, w.r.t. the class of target functions defined for the optimal predictor. Moreover, Proposition 3.1 establishes the rate of convergence of the approximator, and therefore relates the approximation error to the number of experts used in the architecture ($n$).\n\nTheorem 3.1 enhances this result, as it relates the sample complexity and model complexity for this class of models. The upper bounds may be used in defining model selection criteria, based on upper bound minimization. In this setting, we may use an estimator of the stochastic error bound (i.e., the estimation error) to penalize the complexity of the model, in the spirit of AIC, MDL, etc. (see [8] for further discussion).\n\nAt first glance it may seem surprising to find that a combination of linear models is a universal function approximator. However, one must keep in mind that the global model is nonlinear, due to the gating network. Nevertheless, this result does imply, at least on theoretical grounds, that one may restrict the MEM(n; d) to be locally linear, without loss of generality. Thus, taking a simple local model, enabling efficient and tractable learning algorithms (see [2]), still results in a rich global model. 
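\n\nAs a rough illustration of the upper bound minimization mentioned above (a sketch under stated assumptions, not a result established in this paper), suppose that the effective number of parameters grows linearly with the number of experts, $m_n^* \\approx \\kappa n$ for some hypothetical constant $\\kappa > 0$, and neglect the $o(1/N)$ term in (5). Treating $n$ as continuous and setting the derivative of the remaining bound to zero gives\n\n$$\\frac{d}{dn}\\left[\\frac{c}{n^{2r/d}} + \\frac{\\kappa n}{2N}\\right] = -\\frac{2rc}{d}\\, n^{-(2r+d)/d} + \\frac{\\kappa}{2N} = 0 \\quad \\Longrightarrow \\quad n = \\left(\\frac{4rcN}{d\\kappa}\\right)^{d/(2r+d)}.$$\n\nUnder this assumption both terms of the bound are of order $N^{-2r/(2r+d)}$, so the number of experts that balances approximation error against estimation error grows only sublinearly with the sample size.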
\n\n3.5 Comparison\n\nRecently, Mhaskar [5] proved upper bounds on the approximation error of feedforward sigmoidal neural networks, for target functions in the same class as we consider herein, i.e., the Sobolev class. The bound we have obtained in Proposition 3.1, and its extension in [8], demonstrate that w.r.t. this particular target class, neural networks and mixtures of experts are equivalent. That is, both models attain optimal precision in the degree of approximation results (see [5] for details of this argument). Keeping in mind the advantages of the MEM with respect to learning and generalization [2], we believe that our results lend further credence to the emerging view as to the superiority of modular architectures over the more standard feedforward neural networks.\n\nMoreover, the detailed proof of Proposition 3.1 (see [8]) actually takes the MEM(n; d) to be made up of local constants. That is, the linear experts are degenerated to constant functions. Thus, one may conjecture that mixtures of experts are in fact a more general class than feedforward neural networks, though we have no proof of this as of yet.\n\nTwo nonlinear alternatives, generalizing standard statistical linear models, have been pointed out in the introductory section. These are Tong's TAR (threshold autoregressive) model [7], and the more general SDM (state dependent models) introduced by Priestley. The latter models can be reduced to a TAR model by imposing a more restrictive structure (for further details see [6]). We have shown, based on the results described above (see [9]), that the MEM may be viewed as a generalization of the SDM (and consequently of the TAR model). 
The relation to the state dependent models is of particular interest, as the mixture of experts is structured on state dependence as well. Exact statements and proofs of these facts can be found in [9].\n\nWe should also note that we have conducted several numerical experiments, comparing the performance of the MEM with other approaches. We tested the model on both synthetic and real-world data. Without any fine-tuning of parameters we found the performance of the MEM, with linear expert functions, to compare very favorably with other approaches (such as TAR, ARMA and neural networks). Details of the numerical results may be found in [9]. Moreover, the model also provided a very natural and intuitive segmentation of the process.\n\n4 Discussion\n\nIn this work we have pursued a novel non-linear model for prediction in stationary time series. The mixture of experts model (MEM) has been demonstrated to be a rich model, endowed with a sound theoretical basis, and compares favorably with other, state of the art, nonlinear models.\n\nWe hope that the results of this study will aid in establishing the MEM as yet another powerful tool for the study of time series, applicable to the fields of statistics, economics, and signal processing.\n\nReferences\n\n[1] Domowitz, I. and White, H. \"Misspecified Models with Dependent Observations\", Journal of Econometrics, vol. 20, pp. 35-58, 1982.\n\n[2] Jordan, M. and Jacobs, R. \"Hierarchical Mixtures of Experts and the EM Algorithm\", Neural Computation, vol. 6, pp. 181-214, 1994.\n\n[3] Maiorov, V. and Meir, R. \"Approximation Bounds for Smooth Functions in $C(\\mathbb{R}^d)$ by Neural and Mixture Networks\", submitted for publication, December 1996.\n\n[4] Meyn, S.P. 
and Tweedie, R.L. Markov Chains and Stochastic Stability, Springer-Verlag, London, 1993.\n\n[5] Mhaskar, H. \"Neural Networks for Optimal Approximation of Smooth and Analytic Functions\", Neural Computation, vol. 8(1), pp. 164-177, 1996.\n\n[6] Priestley, M.B. Non-linear and Non-stationary Time Series Analysis, Academic Press, New York, 1988.\n\n[7] Tong, H. Threshold Models in Non-linear Time Series Analysis, Springer-Verlag, New York, 1983.\n\n[8] Zeevi, A.J., Meir, R. and Maiorov, V. \"Error Bounds for Functional Approximation and Estimation Using Mixtures of Experts\", EE Pub. CC-132, Electrical Engineering Department, Technion, 1995.\n\n[9] Zeevi, A.J., Meir, R. and Adler, R.J. \"Non-linear Models for Time Series Using Mixtures of Experts\", EE Pub. CC-150, Electrical Engineering Department, Technion, 1996.", "award": [], "sourceid": 1203, "authors": [{"given_name": "Assaf", "family_name": "Zeevi", "institution": null}, {"given_name": "Ron", "family_name": "Meir", "institution": null}, {"given_name": "Robert", "family_name": "Adler", "institution": null}]}