{"title": "A Dynamic HMM for On-line Segmentation of Sequential Data", "book": "Advances in Neural Information Processing Systems", "page_first": 793, "page_last": 800, "abstract": null, "full_text": "A Dynamic HMM for On-line Segmentation of Sequential Data\n\nJens Kohlmorgen*\nFraunhofer FIRST.IDA\nKekulestr. 7\n12489 Berlin, Germany\njek@first.fraunhofer.de\n\nSteven Lemm\nFraunhofer FIRST.IDA\nKekulestr. 7\n12489 Berlin, Germany\nlemm@first.fraunhofer.de\n\nAbstract\n\nWe propose a novel method for the analysis of sequential data that exhibits an inherent mode switching. In particular, the data might be a non-stationary time series from a dynamical system that switches between multiple operating modes. Unlike other approaches, our method processes the data incrementally and without any training of internal parameters. We use an HMM with a dynamically changing number of states and an on-line variant of the Viterbi algorithm that performs an unsupervised segmentation and classification of the data on-the-fly, i.e., the method is able to process incoming data in real time. The main idea of the approach is to track and segment changes of the probability density of the data in a sliding window on the incoming data stream. The usefulness of the algorithm is demonstrated by an application to a switching dynamical system.\n\n1 Introduction\n\nAbrupt changes can occur in many different real-world systems, for example in speech, in climatological or industrial processes, in financial markets, and in physiological signals (EEG/MEG). Methods for the analysis of time-varying dynamical systems are therefore an important issue in many application areas. 
In [12], we introduced the annealed competition of experts method for time series from nonlinear switching dynamics; related approaches were presented, e.g., in [2, 6, 9, 14]. For a brief review of some of these models see [5]; a good introduction is given in [3].\n\nWe here present a different approach in two respects. First, the segmentation does not depend on the predictability of the system. Instead, we merely estimate the density distribution of the data and track its changes. This is an improvement in particular for systems where data is hard to predict, for example EEG recordings [7] or financial data. Second, it is an on-line method. An incoming data stream is processed incrementally while keeping the computational effort limited by a fixed upper bound, i.e., the algorithm is able to perpetually segment and classify data streams with a fixed amount of memory and CPU resources. It is even possible to continuously monitor measured data in real time, as long as the sampling rate is not too high (see footnote 1). The main reason for achieving a high on-line processing speed is the fact that the method, in contrast to the approaches above, does not involve any training, i.e., iterative adaptation of parameters. Instead, it optimizes the segmentation on-the-fly by means of dynamic programming [1], which thereby results in an automatic correction or fine-tuning of previously estimated segmentation bounds.\n\n* http://www.first.fraunhofer.de/~jek\n\n2 The segmentation algorithm\n\nWe consider the problem of continuously segmenting a data stream on-line and simultaneously labeling the segments. 
The data stream is supposed to have a sequential or temporal structure as follows: it is supposed to consist of consecutive blocks of data in such a way that the data points in each block originate from the same underlying distribution. The segmentation task is to be performed in an unsupervised fashion, i.e., without any a-priori given labels or segmentation bounds.\n\n2.1 Using pdfs as features for segmentation\n\nConsider $y_1, y_2, y_3, \\ldots$, with $y_t \\in \\mathbb{R}^n$, an incoming data stream to be analyzed. The sequence might have already passed a pre-processing step like filtering or sub-sampling, as long as this can be done on-the-fly in case of an on-line scenario. As a first step of further processing, it might then be useful to exploit an idea from dynamical systems theory and embed the data into a higher-dimensional space, which aims to reconstruct the state space of the underlying system,\n\n$$\\vec{x}_t = (y_t, y_{t-\\tau}, \\ldots, y_{t-(m-1)\\tau}). \\qquad (1)$$\n\nThe parameter m is called the embedding dimension and $\\tau$ is called the delay parameter of the embedding. The dimension of the vectors $\\vec{x}_t$ thus is d = m n. The idea behind embedding is that the measured data might be a potentially non-linear projection of the system's state or phase space. In any case, an embedding in a higher-dimensional space might help to resolve structure in the data, a property which is exploited, e.g., in scatter plots. After the embedding step one might perform a sub-sampling of the embedded data in order to reduce the amount of data for real-time processing (see footnote 2). Next, we want to track the density distribution of the embedded data and therefore estimate the probability density function (pdf) in a sliding window of length W. 
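As a rough illustration of the embedding in eq. (1), the following Python sketch builds the delay vectors for a scalar series (the case n = 1). The function name delay_embed and the list-based data layout are assumptions made for this example; they are not taken from the original implementation, which was written in MATLAB/C.

```python
def delay_embed(y, m, tau):
    '''Return the vectors x_t = (y_t, y_{t-tau}, ..., y_{t-(m-1)tau}).

    y is a list of scalars (n = 1). The first vector is built at index
    t = (m - 1) * tau, the earliest time with a complete history.
    '''
    start = (m - 1) * tau
    return [[y[t - j * tau] for j in range(m)] for t in range(start, len(y))]


# Toy series: embedding dimension m = 3, delay tau = 2.
series = [0, 1, 2, 3, 4, 5]
embedded = delay_embed(series, m=3, tau=2)
print(embedded)  # [[4, 2, 0], [5, 3, 1]]
```

Each embedded vector has dimension d = m n, so a series of length L yields L - (m - 1) tau vectors.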
We use a standard density estimator with multivariate Gaussian kernels [4] for this purpose, centered on the data points (see footnote 3) in the window $\\{\\vec{x}_{t-w}\\}_{w=0}^{W-1}$,\n\n$$p_t(x) = \\frac{1}{W} \\sum_{w=0}^{W-1} \\frac{1}{(2\\pi\\sigma^2)^{d/2}} \\exp\\left( -\\frac{(x - \\vec{x}_{t-w})^2}{2\\sigma^2} \\right). \\qquad (2)$$\n\nThe kernel width $\\sigma$ is a smoothing parameter and its value is important to obtain a good representation of the underlying distribution. We propose to choose $\\sigma$ proportional to the mean distance of each $\\vec{x}_t$ to its first d nearest neighbors, averaged over a sample set $\\{\\vec{x}_t\\}$.\n\nFootnote 1: In our reported application we can process data at 1000 Hz (450 Hz including display) on a 1.33 GHz PC in MATLAB/C under Linux, which we expect is sufficient for a large number of applications.\n\nFootnote 2: In that case, our further notation of time indices would refer to the subsampled data.\n\nFootnote 3: We use $\\vec{x}$ to denote a specific vector-valued point and $x$ to denote a vector-valued variable.\n\n2.2 Similarity of two pdfs\n\nOnce we have sampled enough data points to compute the first pdf according to eq. (2), we can compute a new pdf with each new incoming data point. In order to quantify the difference between two such functions, f and g, we use the squared L2-norm, also called integrated squared error (ISE), $d(f, g) = \\int (f - g)^2 \\, dx$, which can be calculated analytically if f and g are mixtures of Gaussians as in our case of pdfs estimated from data windows,\n\n$$d(p_t, p_{t'}) = \\frac{1}{W^2 (4\\pi\\sigma^2)^{d/2}} \\sum_{w,v=0}^{W-1} \\left[ \\exp\\left( -\\frac{(\\vec{x}_{t-w} - \\vec{x}_{t-v})^2}{4\\sigma^2} \\right) - 2 \\exp\\left( -\\frac{(\\vec{x}_{t-w} - \\vec{x}_{t'-v})^2}{4\\sigma^2} \\right) + \\exp\\left( -\\frac{(\\vec{x}_{t'-w} - \\vec{x}_{t'-v})^2}{4\\sigma^2} \\right) \\right]. \\qquad (3)$$\n\n2.3 The HMM in the off-line case\n\nBefore we can discuss the on-line variant, it is necessary to introduce the HMM and the respective off-line algorithm first. For a given data sequence, $\\{\\vec{x}_t\\}_{t=1}^{T}$, we can obtain the corresponding sequence of pdfs $\\{p_t(x)\\}_{t \\in S}$, $S = \\{W, \\ldots, T\\}$, according to eq. (2). 
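The distance of Section 2.2 can be sketched in a few lines of Python. The closed form used here is an assumption derived from eq. (2) via the standard integral of a product of two isotropic Gaussians, which yields a Gaussian of variance 2 sigma^2 evaluated at the difference of the two centers. The helper names sq_dist, kernel_sum and ise are hypothetical.

```python
import math


def sq_dist(a, b):
    '''Squared Euclidean distance between two points.'''
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kernel_sum(win_a, win_b, sigma):
    '''Sum over all point pairs of exp(-|a - b|^2 / (4 sigma^2)).'''
    return sum(math.exp(-sq_dist(a, b) / (4.0 * sigma ** 2))
               for a in win_a for b in win_b)


def ise(win_a, win_b, sigma):
    '''Integrated squared error between two Gaussian-kernel pdfs.

    win_a and win_b are equally long lists of d-dimensional points
    (the two data windows); sigma is the shared kernel width.
    '''
    w = len(win_a)
    d = len(win_a[0])
    norm = 1.0 / (w ** 2 * (4.0 * math.pi * sigma ** 2) ** (d / 2.0))
    return norm * (kernel_sum(win_a, win_a, sigma)
                   - 2.0 * kernel_sum(win_a, win_b, sigma)
                   + kernel_sum(win_b, win_b, sigma))


# One-point windows in one dimension, kernel width 1.
a = [[0.0]]
b = [[1.0]]
d_ab = ise(a, b, 1.0)
d_aa = ise(a, a, 1.0)
```

The cost of one evaluation grows with the square of the window length, so for larger windows it may pay off to reuse the pairwise terms shared by consecutive windows.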
We now construct a hidden Markov model (HMM) where each of these pdfs is represented by a state $s \\in S$, with S being the set of states in the HMM. For each state s, we define a continuous observation probability distribution,\n\n$$p(p_t(x) \\mid s) = \\frac{1}{\\sqrt{2\\pi}\\,\\varsigma} \\exp\\left( -\\frac{d(p_s(x), p_t(x))}{2\\varsigma^2} \\right), \\qquad (4)$$\n\nfor observing a pdf $p_t(x)$ in state s. Next, the initial state distribution $\\{\\pi_s\\}_{s \\in S}$ of the HMM is given by the uniform distribution, $\\pi_s = 1/N$, with $N = |S|$ being the number of states. Thus, each state is a-priori equally probable. The HMM transition matrix, $A = (p_{ij})_{i,j \\in S}$, determines each probability to switch from a state $s_i$ to a state $s_j$. Our aim is to find a representation of the given sequence of pdfs in terms of a sequence of a small number of representative pdfs, which we call prototypes, and which moreover exhibits only a small number of prototype changes. We therefore define A in such a way that transitions to the same state are k times more likely than transitions to any of the other states,\n\n$$p_{ij} = \\begin{cases} \\frac{k}{k+N-1}, & \\text{if } i = j \\\\ \\frac{1}{k+N-1}, & \\text{if } i \\neq j \\end{cases} \\qquad (5)$$\n\nThis completes the definition of our HMM. Note that this HMM has only two free parameters, k and $\\varsigma$. The well-known Viterbi algorithm [13] can now be applied to the above HMM in order to compute the optimal (i.e., the most likely) state sequence of prototype pdfs that might have generated the given sequence of pdfs. This state sequence represents the segmentation we are aiming at. We can compute the most likely state sequence more efficiently if we compute it in terms of costs, c = -log(p), instead of probabilities p, i.e., instead of computing the maximum of the likelihood function L, we compute the minimum of the cost function, -log(L), which yields the optimal state sequence as well. 
In this way we can replace products by sums and avoid numerical problems [13]. In addition to that, we can further simplify the computation for the special case of our particular HMM architecture, which finally results in the following algorithm:\n\nFor each time step, $t = W, \\ldots, T$, we compute for all $s \\in S$ the cost $c_s(t)$ of the optimal state sequence from W to t, subject to the constraint that it ends in state s at time t. We call these constrained optimal sequences c-paths and the unconstrained optimum the o*-path. The iteration can be formulated as follows, with $d_{s,t}$ being a shorthand for $d(p_s(x), p_t(x))$ and $\\delta_{s,\\bar{s}}$ denoting the Kronecker delta function:\n\nInitialization, $\\forall s \\in S$:\n\n$$c_s(W) := d_{s,W}, \\qquad (6)$$\n\nInduction, $\\forall s \\in S$:\n\n$$c_s(t) := d_{s,t} + \\min_{\\bar{s} \\in S} \\{ c_{\\bar{s}}(t-1) + C \\, (1 - \\delta_{s,\\bar{s}}) \\}, \\quad \\text{for } t = W+1, \\ldots, T, \\qquad (7)$$\n\nTermination:\n\n$$o^* := \\min_{s \\in S} \\{ c_s(T) \\}. \\qquad (8)$$\n\nThe regularization constant C, which is given by $C = 2\\varsigma^2 \\log(k)$ and thus subsumes our two free HMM parameters, can be interpreted as the transition cost for switching to a new state in the path (see footnote 4). The optimal prototype sequence with minimal cost $o^*$ for the complete series of pdfs, which is determined in the last step, is obtained by logging and updating the c-paths for all states s during the iteration and finally choosing the one with minimal cost according to eq. (8).\n\n2.4 The on-line algorithm\n\nIn order to turn the above segmentation algorithm into an on-line algorithm, we must restrict the incremental update in eq. (7), such that it only uses pdfs (and therewith states) from past data points. We neglect at this stage that memory and CPU resources are limited.\n\nSuppose that we have already processed data up to T - 1. 
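To make the off-line recursion of eqs. (6)-(8) concrete before it is restricted to past data, here is a minimal Python sketch. It assumes a precomputed distance matrix dist with 0-based indices and a non-negative transition cost C; the function name segment_offline and the back-pointer bookkeeping are illustrative choices, not the authors' code.

```python
def segment_offline(dist, C):
    '''Minimum-cost prototype sequence following eqs. (6)-(8).

    dist[s][t] is the distance between the pdfs at positions s and t,
    C >= 0 is the transition cost. Returns (cost, path), where path
    holds one state index per time step.
    '''
    n = len(dist)
    cost = [dist[s][0] for s in range(n)]            # eq. (6)
    back = []                                        # predecessors per step
    for t in range(1, n):
        best = min(range(n), key=lambda s: cost[s])
        new_cost, pred = [], []
        for s in range(n):
            # Either stay in s, or switch from the cheapest state and pay C.
            if cost[s] <= cost[best] + C:
                new_cost.append(dist[s][t] + cost[s])
                pred.append(s)
            else:
                new_cost.append(dist[s][t] + cost[best] + C)
                pred.append(best)
        cost = new_cost                              # eq. (7)
        back.append(pred)
    state = min(range(n), key=lambda s: cost[s])     # eq. (8)
    total = cost[state]
    path = [state]
    for pred in reversed(back):                      # trace the path backwards
        state = pred[state]
        path.append(state)
    path.reverse()
    return total, path


# Four pdfs from two modes: distance 0 within a mode, 1 across modes.
dist = [[0, 0, 1, 1],
        [0, 0, 1, 1],
        [1, 1, 0, 0],
        [1, 1, 0, 0]]
cheap_switch = segment_offline(dist, 0.5)
costly_switch = segment_offline(dist, 10)
```

In this toy case a small C yields one switch between the two modes, whereas a large C keeps a single prototype for the whole sequence, which illustrates the regularizing role of the transition cost.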
When a new data point $y_T$ arrives at time T, we can generate a new embedded vector $\\vec{x}_T$ (once we have sampled enough initial data points for the embedding), we have a new pdf $p_T(x)$ (once we have sampled enough embedded vectors $\\vec{x}_t$ for the first pdf window), and thus we are given a new HMM state. We can also readily compute the distances between the new pdf and all the previous pdfs, $d_{T,t}$, $t < T$, according to eq. (3). A similarly simple and straightforward update of the costs, the c-paths and the optimal state sequence is only possible, however, if we neglect potential c-paths that would have contained the new pdf as a prototype in previous segments. In that case we can simply reuse the c-paths from T - 1. The on-line update at time T for these restricted paths, which we henceforth denote with a tilde, can be performed as follows:\n\nFor T = W, we have $\\tilde{c}_W(W) := \\tilde{o}^*(W) := d_{W,W} = 0$. For T > W:\n\n1. Compute the cost $\\tilde{c}_T(T-1)$ for the new state s = T at time T - 1: For $t = W, \\ldots, T-1$, compute\n\n$$\\tilde{c}_T(t) := d_{T,t} + \\begin{cases} 0, & \\text{if } t = W \\\\ \\min \\{ \\tilde{c}_T(t-1); \\; \\tilde{o}^*(t-1) + C \\}, & \\text{else} \\end{cases} \\qquad (9)$$\n\nand update\n\n$$\\tilde{o}^*(t) := \\tilde{c}_T(t), \\quad \\text{if } \\tilde{c}_T(t) < \\tilde{o}^*(t). \\qquad (10)$$\n\nHere we use all previous optimal segmentations $\\tilde{o}^*(t)$, so we don't need to keep the complete matrix $(\\tilde{c}_s(t))_{s,t \\in S}$ and repeatedly compute the minimum over all states. However, we must store and update the history of optimal segmentations $\\tilde{o}^*(t)$.\n\nFootnote 4: We developed an algorithm that computes an appropriate value for the hyperparameter C from a sample set $\\{\\vec{x}_t\\}$. Due to the limited space we will present that algorithm in a forthcoming publication [8].\n\n2. Update from T - 1 to T and compute $\\tilde{c}_s(T)$ for all states $s \\in S$ obtained so far, and also get $\\tilde{o}^*(T)$: For s = W, ... 
, T, compute\n\n$$\\tilde{c}_s(T) := d_{s,T} + \\min \\{ \\tilde{c}_s(T-1); \\; \\tilde{o}^*(T-1) + C \\} \\qquad (11)$$\n\nand finally get the cost of the optimal path\n\n$$\\tilde{o}^*(T) := \\min_{s \\in S} \\{ \\tilde{c}_s(T) \\}. \\qquad (12)$$\n\nAs for the off-line case, the above algorithm only shows the update equations for the costs of the c- and o*-paths. The associated state sequences must be logged simultaneously during the computation. Note that this can be done by just storing the sequence of switching points for each path. Moreover, we do not need to keep the full matrix $(\\tilde{c}_s(t))_{s,t \\in S}$ for the update; the most recent column is sufficient.\n\nSo far we have presented the incremental version of the segmentation algorithm. This algorithm still needs an amount of memory and CPU time that increases with each new data point. In order to limit both resources to a fixed amount, we must remove old pdfs, i.e., old HMM states, at some point. We propose to do this by discarding all states with time indices smaller than or equal to s each time the path associated with $\\tilde{c}_s(T)$ in eq. (11) exhibits a switch back from a more recent state/pdf to the currently considered state s as a result of the min-operation in eq. (11). In the above algorithm this can simply be done by setting W := s + 1 in that case, which also allows us to discard the corresponding old $\\tilde{c}_{\\bar{s}}(T)$- and $\\tilde{o}^*(t)$-paths, for all $\\bar{s} \\le s$ and $t < s$. In addition, the \"if t = W\" initialization clause in eq. (9) must be ignored after the first such cut, and the $\\tilde{o}^*(W-1)$-path must therefore still be kept to compute the else-part also for t = W now. Moreover, we do not have $\\tilde{c}_T(W-1)$ and we therefore assume $\\min \\{ \\tilde{c}_T(W-1); \\; \\tilde{o}^*(W-1) + C \\} = \\tilde{o}^*(W-1) + C$ (in eq. (9)).\n\nThe explanation for this is as follows: A switch back in eq. 
(11) indicates that a new data distribution is established, such that the c-path that ends in a pdf state s from an old distribution routes its path through one of the more recent states that represent the new distribution, which means that this has lower costs despite the incurred additional transition. Vice versa, a newly obtained pdf is unlikely to properly represent the previous mode then, which justifies our above assumption about $\\tilde{c}_T(W-1)$. The effect of the proposed cut-off strategy is that we discard paths that end in pdfs from old modes but still allow the optimal pdf prototype to be found within the current segment.\n\nCut-off conditions occur shortly after mode changes in the data and cause the removal of HMM states with pdfs from old modes. However, if no mode change takes place in the incoming data sequence, no states will be discarded. We therefore still need to set a fixed upper limit $\\kappa$ for the number of candidate paths/pdfs that are simultaneously under consideration if we only have limited resources available. When this limit is reached because no switches are detected, we must successively discard the oldest path/pdf stored, which might finally result in choosing a sub-optimal prototype for that segment. Ultimately, a continuous discarding even enforces a change of prototypes after $2\\kappa$ time steps if no switching is induced by the data until then. The buffer size $\\kappa$ should therefore be as large as possible. In any case, the buffer overflow condition can be recorded along with the segmentation, which allows us to identify such artificial switchings.\n\n2.5 The labeling algorithm\n\nA labeling algorithm is required to identify segments that represent the same underlying distribution and thus have similar pdf prototypes. 
The labeling algorithm generates labels for the segments and assigns identical labels to segments that are similar in this respect. To this end, we propose a relatively simple on-line clustering scheme for the prototypes, since we expect the prototypes obtained from the same underlying distribution to be already well-separated from the other prototypes as a result of the segmentation algorithm. We assign a new label to a segment if the distance of its associated prototype to all preceding prototypes exceeds a certain threshold $\\theta$, and we assign the existing label of the closest preceding prototype otherwise. This can be written as\n\n$$l(R) = \\begin{cases} \\text{newlabel}, & \\text{if } \\min_{1 \\le r < R} \\{ d(p_{t(r)}(x), p_{t(R)}(x)) \\} > \\theta \\\\ l\\left( \\operatorname{indexmin}_{1 \\le r < R} \\{ d(p_{t(r)}(x), p_{t(R)}(x)) \\} \\right), & \\text{else} \\end{cases} \\qquad (13)$$\n\nwith the initialization l(1) = newlabel. Here, $r = 1, \\ldots, R$ denotes the enumeration of the segments obtained so far, and $t(\\cdot)$ denotes the mapping to the index of the corresponding pdf prototype. Note that the segmentation algorithm might replace a number of recent pdf prototypes (and also recent segmentation bounds) during the on-line processing in order to optimize the segmentation each time new data is presented. Therefore, a relabeling of all segments that have changed is necessary in each update step of the labeler.\n\nAs for the hyperparameters $\\sigma$ and C, we developed an algorithm that computes a suitable value for $\\theta$ from a sample set $\\{\\vec{x}_t\\}$. We refer to our forthcoming publication [8].\n\n3 Application\n\nWe illustrate our approach by an application to a time series from switching dynamics based on the Mackey-Glass delay differential equation,\n\n$$\\frac{dx(t)}{dt} = -0.1 \\, x(t) + \\frac{0.2 \\, x(t - t_d)}{1 + x(t - t_d)^{10}}. \\qquad (14)$$\n\nEq. 
(14) describes a high-dimensional chaotic system that was originally introduced as a model of blood cell regulation [10]. In our example, four stationary operating modes, A, B, C, and D, are established by using different delays, $t_d$ = 17, 23, 30, and 35, respectively. The dynamics operates stationarily in one mode for a certain number of time steps, which is chosen at random between 200 and 300 (referring to sub-sampled data with a step size $\\Delta = 6$). It then randomly switches to one of the other modes with uniform probability. This procedure is repeated 15 times; it thus generates a switching chaotic time series with 15 stationary segments. We then added a relatively large amount of \"measurement\" noise to the series: zero-mean Gaussian noise with a standard deviation of 30% of the standard deviation of the original series.\n\nThe on-line segmentation algorithm was then applied to the noisy data, i.e., processing was performed on-line although the full data set was already available in this case. The scalar time series was embedded on-the-fly by using m = 6 and $\\tau = 1$ (on the sub-sampled data) and we used a pdf window of size W = 50. The algorithm processed 457 data points per second on a 1.33 GHz PC in MATLAB/C under Linux, including the display of the ongoing segmentation, where one can observe the re-adaptation of past segmentation bounds and labels when new data becomes available.\n\n[Figure 1 panels, top to bottom: actual modes (mode A to mode D); on-line segmentation (labels, bounds); x(t)]\n\nFigure 1: Segmentation of a switching Mackey-Glass time series with noise (bottom) that operates in four different modes (top). 
The on-line segmentation algorithm (middle), which receives the data points one by one, but not the mode information, yields correct segmentation bounds almost everywhere. The on-line labeler, however, assigns more labels (6) than desired (4), presumably due to the fact that the segments are very short and noisy.\n\nThe final segmentation is shown in Fig. 1. Surprisingly, the bounds of the segments are almost perfectly recovered from the very noisy data set. The only two exceptions are the third segment from the right, which is noticeably shorter than the original mode, and the segment in the middle, which is split in two by the algorithm. This split actually makes sense if one compares it with the data: there is a visible change in the signal characteristics at that point ($t \\approx 1500$) even though the delay parameter was not modified there. This change is recorded by the algorithm since it operates in an unsupervised way.\n\nThe on-line labeling algorithm correctly assigns single labels to modes A, B, and C, but it assigns three labels (4, 5, and 6) to the segments of mode D, the most chaotic one. This is probably due to the small sample sizes (of the segments), in combination with the large amount of noise in the data.\n\n4 Discussion\n\nWe presented an on-line method for the unsupervised segmentation and identification of sequential data, in particular from non-stationary switching dynamics. It is based on an HMM where the number of states varies dynamically as an effect of the way the incoming data is processed. In contrast to other approaches, it processes the data on-line and potentially even in real time without training of any parameters. The method provides and updates a segmentation each time a new data point arrives. 
In effect, past segmentation bounds and labels are automatically re-adapted when new incoming data points are processed. The number of prototypes and labels that identify the segments is not fixed but determined by the algorithm. We expect useful applications of this method in fields where complex non-stationary dynamics plays an important role, e.g., in physiology (EEG, MEG), climatology, industrial applications, or finance.\n\nReferences\n\n[1] Bellman, R. E. (1957). Dynamic Programming, Princeton University Press, Princeton, NJ.\n\n[2] Bengio, Y., Frasconi, P. (1995). An Input Output HMM Architecture. In: Advances in Neural Information Processing Systems 7 (eds. Tesauro, Touretzky, Leen), Morgan Kaufmann, 427-434.\n\n[3] Bengio, Y. (1999). Markovian Models for Sequential Data. Neural Computing Surveys, http://www.icsi.berkeley.edu/~jagota/NCS, 2:129-162.\n\n[4] Bishop, C. M. (1995). Neural Networks for Pattern Recognition, Oxford Univ. Press, NY.\n\n[5] Husmeier, D. (2000). Learning Non-Stationary Conditional Probability Distributions. Neural Networks 13, 287-290.\n\n[6] Kehagias, A., Petridis, V. (1997). Time Series Segmentation using Predictive Modular Neural Networks. Neural Computation 9, 1691-1710.\n\n[7] Kohlmorgen, J., M\u00fcller, K.-R., Rittweger, J., Pawelzik, K. (2000). Identification of Nonstationary Dynamics in Physiological Recordings, Biol Cybern 83(1), 73-84.\n\n[8] Kohlmorgen, J., Lemm, S., to appear.\n\n[9] Liehr, S., Pawelzik, K., Kohlmorgen, J., M\u00fcller, K.-R. (1999). Hidden Markov Mixtures of Experts with an Application to EEG Recordings from Sleep. Theo Biosci 118, 246-260.\n\n[10] Mackey, M., Glass, L. (1977). Oscillation and Chaos in a Physiological Control System. Science 197, 287. 
\n\n[11] Packard, N. H., Crutchfield, J. P., Farmer, J. D., Shaw, R. S. (1980). Geometry from a Time Series. Phys Rev Letters 45, 712-716.\n\n[12] Pawelzik, K., Kohlmorgen, J., M\u00fcller, K.-R. (1996). Annealed Competition of Experts for a Segmentation and Classification of Switching Dynamics. Neural Computation 8(2), 340-356.\n\n[13] Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE 77(2), 257-286.\n\n[14] Ramamurti, V., Ghosh, J. (1999). Structurally Adaptive Modular Networks for Non-Stationary Environments. IEEE Tr. Neural Networks 10(1), 152-160.\n", "award": [], "sourceid": 2039, "authors": [{"given_name": "Jens", "family_name": "Kohlmorgen", "institution": null}, {"given_name": "Steven", "family_name": "Lemm", "institution": null}]}