{"title": "Phase-Space Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 481, "page_last": 488, "abstract": null, "full_text": "Phase-Space Learning \n\nFu-Sheng Tsung \nChung Tai Ch'an Temple \n56, Yuon-fon Road, Yi-hsin Li, Pu-li \nNan-tou County, Taiwan 545 \nRepublic of China \n\nGarrison W. Cottrell* \nInstitute for Neural Computation \nComputer Science & Engineering \nUniversity of California, San Diego \nLa Jolla, California 92093 \n\nAbstract \n\nExisting recurrent net learning algorithms are inadequate. We introduce the conceptual framework of viewing recurrent training as matching vector fields of dynamical systems in phase space. Phase-space reconstruction techniques make the hidden states explicit, reducing temporal learning to a feed-forward problem. In short, we propose viewing iterated prediction [LF88] as the best way of training recurrent networks on deterministic signals. Using this framework, we can train multiple trajectories, insure their stability, and design arbitrary dynamical systems. \n\n1 INTRODUCTION \n\nExisting general-purpose recurrent algorithms are capable of rich dynamical behavior. Unfortunately, straightforward applications of these algorithms to training fully-recurrent networks on complex temporal tasks have had much less success than their feedforward counterparts. For example, to train a recurrent network to oscillate like a sine wave (the "hydrogen atom" of recurrent learning), existing techniques such as Real Time Recurrent Learning (RTRL) [WZ89] perform suboptimally. Williams & Zipser trained a two-unit network with RTRL, with one teacher signal. One unit of the resulting network showed a distorted waveform, the other only half the desired amplitude. 
[Pea89] needed four hidden units. However, our work demonstrates that a two-unit recurrent network with no hidden units can learn the sine wave very well [Tsu94]. Existing methods also have several other limitations. For example, networks often fail to converge even though a solution is known to exist; teacher forcing is usually necessary to learn periodic signals; and it is not clear how to train multiple trajectories at once, or how to insure that the trained trajectory is stable (an attractor). \n\n*Correspondence should be addressed to the second author: gary@cs.ucsd.edu \n\nIn this paper, we briefly analyze these algorithms to discover why they have such difficulties, and propose a general solution to the problem. Our solution is based on the simple idea of using the techniques of time series prediction as a methodology for recurrent network training. \n\nFirst, by way of introducing the appropriate concepts, consider a system of coupled autonomous¹ first-order network equations: \n\ndx1/dt = F1(x1(t), x2(t), ..., xn(t)) \ndx2/dt = F2(x1(t), x2(t), ..., xn(t)) \n... \ndxn/dt = Fn(x1(t), x2(t), ..., xn(t)) \n\nor, in vector notation, \n\ndX/dt = F(X), where X(t) = (x1(t), x2(t), ..., xn(t)). \n\nThe phase space is the space of the dependent variables (X); it does not include t, while the state space incorporates t. The evolution of a trajectory X(t) traces out a phase curve, or orbit, in the n-dimensional phase space of X. For low-dimensional systems (2- or 3-D), it is easy to visualize the limit sets in the phase space: a fixed point and a limit cycle become a single point and a closed orbit (closed curve), respectively. In the state space they become an infinite straight line and a spiral.
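To make the phase-space versus state-space distinction concrete, here is a minimal numerical sketch (a hypothetical two-variable system chosen for illustration, not one from this paper): the autonomous system dx/dt = y, dy/dt = -x has a circular phase curve, which in the state space appears as an unending sinusoid.

```python
import numpy as np

# Toy autonomous system dx/dt = y, dy/dt = -x. In the (x, y) phase space its
# limit set is a closed orbit (a circle); in the (t, x) state space the same
# motion traces an unending sinusoid.
steps = 628
dt = 2 * np.pi / steps                       # one full period in `steps` steps
flow = np.array([[np.cos(dt), np.sin(dt)],   # exact one-step flow map
                 [-np.sin(dt), np.cos(dt)]])

X = np.empty((steps + 1, 2))
X[0] = (1.0, 0.0)
for k in range(steps):
    X[k + 1] = flow @ X[k]

# The phase curve closes on itself after exactly one period,
# while x as a function of t keeps oscillating forever.
print(np.linalg.norm(X[-1] - X[0]))
```

The one-step map here is the exact rotation solving the system, so the orbit returns to its starting point up to floating-point error.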
\n\nF(X) defines the vector field of X: with each point in the phase space of X it associates a vector whose direction and magnitude determine the movement of that point in the next instant of time (by definition, the tangent vector). \n\n2 ANALYSIS OF CURRENT APPROACHES \n\nTo get a better understanding of why recurrent algorithms have not been very effective, we look at what happens during training with two popular recurrent learning techniques: RTRL and backpropagation through time (BPTT). With each, we illustrate a different problem, although the problems apply equally to both techniques. \n\nRTRL is a forward-gradient algorithm that keeps a matrix of partial derivatives of the network activation values with respect to every weight. To train a periodic trajectory, it is necessary to teacher-force the visible units [WZ89], i.e., on every iteration, after the gradient has been calculated, the activations of the visible units are replaced by the teacher. To see why, consider learning a pair of sine waves offset by 90°. In phase space, this becomes a circle (Figure 1a). Initially the network (thick arrows) is at position x0 and has arbitrary dynamics. After a few iterations, it wanders far away from where the teacher (dashed arrows) assumes it to be. The teacher then provides an incorrect next target from the network's current position. Teacher forcing (Figure 1b) resets the network back on the circle, where the teacher again provides useful information. \n\n¹Autonomous means the right-hand side of a differential equation does not explicitly reference t; e.g., dx/dt = 2x is autonomous, even though x is a function of t, but dx/dt = 2x + t is not. Continuous neural networks without inputs are autonomous. A nonautonomous system can always be turned into an autonomous system in a higher dimension. \n\nFigure 1: Learning a pair of sine waves with RTRL learning. (a) Without teacher forcing, the network dynamics (solid arrows) take it far from where the teacher (dotted arrows) assumes it is, so the gradient is incorrect. (b) With teacher forcing, the network's visible units are returned to the trajectory. \n\nHowever, if the network has hidden units, then the phase space of the visible units is just a projection of the actual phase space of the network, and the teaching signal gives no information as to where the hidden units should be in this higher-dimensional phase space. Hence the hidden unit states, unaltered by teacher forcing, may be entirely unrelated to what they should be. This leads to the moving targets problem: during training, every time the visible units re-visit a point, the hidden unit activations will differ, so the mapping changes during learning. (See [Pin88, WZ89] for other discussions of teacher forcing.) \n\nWith BPTT, the network is unrolled in time (Figure 2). This unrolling reveals another problem: suppose that in the teaching signal, the visible units' next state is a non-linearly separable function of their current state. Then hidden units are needed between the visible unit layers, but there are no intermediate hidden units in the unrolled network. The network must thus take two time steps to get to the hidden units and back. One can deal with this by giving the teaching signal every other iteration, but clearly this is not optimal (consider that the hidden units must "bide their time" on the alternate steps).² \n\nThe trajectories trained by RTRL and BPTT tend to be stable in simulations of simple tasks [Pea89, TCS90], but this stability is paradoxical. 
Using teacher forcing, the networks are trained to go from a point on the trajectory to a point within the ball defined by the error criterion ε (see Figure 4(a)). However, after learning, the networks behave such that from a place near the trajectory, they head for the trajectory (Figure 4(b)). Hence the paradox. Possible reasons are: 1) the hidden unit moving targets provide training off the desired trajectory, so that if the training is successful, the desired trajectory is stable; 2) we would never consider the training successful if the network "learns" an unstable trajectory; 3) the stable dynamics in typical situations have simpler equations than the unstable dynamics [Nak93]. To create an unstable periodic trajectory would imply the existence of stable regions both inside and outside the unstable trajectory; dynamically this is more complicated than a simple periodic attractor. In dynamically complex tasks, a stable trajectory may no longer be the simplest solution, and stability could be a problem. \n\n²At NIPS, 0-delay connections to the hidden units were suggested, which is essentially part of the solution given here. \n\nFigure 2: A nonlinearly separable mapping must be computed by the hidden units (the leftmost unit here) every other time step. \n\nFigure 3: The network used for iterated prediction training. Dashed connections are used after learning. \n\nFigure 4: The paradox of attractor learning with teacher forcing. (a) During learning, the network learns to move from the trajectory to a point near the trajectory. (b) After learning, the network moves from nearby points towards the trajectory. 
\n\nIn summary, we have pointed out several problems in the RTRL (forward-gradient) and BPTT (backward-gradient) classes of training algorithms: \n\n1. Teacher forcing with hidden units is at best an approximation, and leads to the moving targets problem. \n2. Hidden units are not placed properly for some tasks. \n3. Stability is paradoxical. \n\n3 PHASE-SPACE LEARNING \n\nThe inspiration for our approach is prediction training [LF88], which at first appears similar to BPTT but is subtly different. In the standard scheme, a feedforward network is given a time window, a set of previous points on the trajectory to be learned, as inputs. The output is the next point on the trajectory. Then the inputs are shifted left and the network is trained on the next point (see Figure 3). Once the network has learned, it can be treated as recurrent by iterating on its own predictions. \n\nFigure 5: Phase-space learning. (a) The training set is a sample of the vector field. (b) Phase-space learning network. Dashed connections are used after learning. \n\nThe prediction network differs from BPTT in two important ways. First, the visible units encode a selected temporal history of the trajectory (the time window). The point of this delay space embedding is to reconstruct the phase space of the underlying system. [Tak81] has shown that this can always be done for a deterministic system. Note that in the reconstructed phase space, the mapping from one point to the next (based on the vector field) is deterministic. 
Hence what originally appeared to be a recurrent network problem can be converted into an entirely feedforward problem. Essentially, the delay-space reconstruction makes hidden states visible, and recurrent hidden units unnecessary. Crucially, dynamicists have developed excellent reconstruction algorithms that not only automate the choices of delay and embedding dimension but also filter out noise, or get a good reconstruction despite noise [FS91, Dav92, KBA92]. On the other hand, we clearly cannot deal with non-deterministic systems by this method. \n\nThe second difference from BPTT is that the hidden units are between the visible units, allowing the network to produce nonlinearly separable transformations of the visible units in a single iteration. In the recurrent network produced by iterated prediction, the sandwiched hidden units can be considered "fast" units, with delays on the input/output links summing to 1. \n\nSince we are now learning a mapping in phase space, stability is easily ensured by adding additional training examples that converge towards the desired orbit.³ We can also explicitly control convergence speed by the size and direction of the vectors. \n\nThus, phase-space learning (Figure 5) consists of: (1) embedding the temporal signal to recover its phase space structure, (2) generating local approximations of the vector field near the desired trajectory, and (3) functional approximation of the vector field with a feedforward network. Existing methods developed for these three problems can be directly and independently applied to solve the problem. Since feedforward networks are universal approximators [HSW89], we are assured that at least locally, the trajectory can be represented. 
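Steps (1) and (3) can be sketched in a few lines. The example below is illustrative only and makes simplifying assumptions: the signal is a sampled sine wave, and a linear least-squares predictor stands in for the feedforward network (a sampled sine satisfies an exact linear recurrence, so a linear one-step map suffices for this signal). It delay-embeds the signal, fits the one-step map, and then iterates the fitted map on its own predictions.

```python
import numpy as np

# A deterministic signal: a sampled sine wave.
t = np.arange(400)
x = np.sin(0.3 * t)

# (1) Delay-space embedding: each input is a time window (x[t-1], x[t]),
#     and the target is the next point x[t+1].
X = np.column_stack([x[:-2], x[1:-1]])   # inputs: two past values
y = x[2:]                                # target: next value

# (3) Approximate the one-step map in the reconstructed phase space.
#     (A linear least-squares fit stands in for the feedforward network.)
W, *_ = np.linalg.lstsq(X, y, rcond=None)

# After learning, iterate the fitted map on its own predictions,
# shifting the time window left at each step.
window = [x[-2], x[-1]]
preds = []
for _ in range(200):
    nxt = np.dot(W, window)
    preds.append(nxt)
    window = [window[1], nxt]

preds = np.array(preds)
true = np.sin(0.3 * (t[-1] + 1 + np.arange(200)))
print(np.max(np.abs(preds - true)))      # iterated orbit stays on the sine wave
```

Because the reconstructed one-step map is deterministic, iterating it reproduces the continuation of the signal; a feedforward network in place of the least-squares fit plays the same role for nonlinear vector fields.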
The trajectory is recovered from the iterated output of the pre-embedded portion of the visible units. Additionally, we may extend the phase-space learning framework to include inputs that affect the output of the system (see [Tsu94] for details).⁴ \n\nIn this framework, training multiple attractors is just training orbits in different parts of the phase space, so they simply add more patterns to the training set. In fact, we can now create designer dynamical systems possessing the properties we want, e.g., with combinations of fixed-point, periodic, or chaotic attractors. \n\n³The common practice of adding noise to the input in prediction training is just a simple-minded way of adding convergence information. \n\n⁴Principe & Kuo (this volume) show that for chaotic attractors, it is better to treat this as a recurrent net and train using the predictions. \n\nFigure 6: Learning the van der Pol oscillator. (a) The training set. (b) Phase space plot of network (solid curve) and teacher (dots). (c) State space plot. \n\nAs an example, to store any number of arbitrary periodic attractors zi(t) with periods Ti in a single recurrent network, create two new coordinates for each zi(t), (xi(t), yi(t)) = (sin(2πt/Ti), cos(2πt/Ti)), where (xi, yi) and (xj, yj) are disjoint circles for i ≠ j. Then (xi, yi, zi) is a valid embedding of all the periodic attractors in phase space, and the network can be trained. In essence, the first two dimensions form "clocks" for their associated trajectories. 
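The "clock" construction above can be sketched as follows. This is a hypothetical instantiation: the paper requires only that the clock circles be disjoint, so here their centers are simply offset, and the signals, periods, and the helper `clock_embedding` are illustrative choices.

```python
import numpy as np

# Hypothetical clock construction: give each periodic signal z_i(t) its own
# clock (x_i, y_i) = center_i + (sin(2*pi*t/T_i), cos(2*pi*t/T_i)), with
# centers chosen so the clock circles are disjoint. (Offsetting centers is
# one way to satisfy the disjointness requirement; it is an assumption here.)
def clock_embedding(z, T, center):
    t = np.arange(len(z))
    x = center[0] + np.sin(2 * np.pi * t / T)
    y = center[1] + np.cos(2 * np.pi * t / T)
    return np.column_stack([x, y, z])

T1, T2 = 50, 70
t = np.arange(350)
z1 = np.sin(2 * np.pi * t / T1)              # first periodic signal
z2 = np.sign(np.sin(2 * np.pi * t / T2))     # second signal, a square wave
orbit1 = clock_embedding(z1, T1, center=(-2.0, 0.0))
orbit2 = clock_embedding(z2, T2, center=(+2.0, 0.0))

# The two clock circles never intersect, so the combined training set
# defines a single-valued vector field in the embedded phase space.
d = np.linalg.norm(orbit1[:, None, :2] - orbit2[None, :, :2], axis=-1)
print(d.min())  # > 0: the orbits are separated in the (x, y) plane
```

Both orbits can then be added to one training set; the disjoint clocks guarantee the one-step mapping remains deterministic even where the z values of the two signals coincide.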
\n\n4 SIMULATION RESULTS \n\nIn this section we illustrate the method by learning the van der Pol oscillator (a much more difficult problem than learning sine waves), learning four separate periodic attractors, and learning an attractor inside the basin of another attractor. \n\n4.1 LEARNING THE VAN DER POL OSCILLATOR \n\nThe van der Pol equation is defined by: \n\nd²x/dt² + a(x² - b)(dx/dt) + w²x = 0 \n\nWe used the values 0.7, 1, 1 for the parameters a, b, and w, for which there is a global periodic attractor (Figure 6). We used a step size of 0.1, which discretizes the trajectory into 70 points. The network therefore has two visible units. We used two hidden layers with 20 units each, so that the unrolled, feedforward network has a 2-20-20-2 architecture. We generated 1500 training pairs using the vector field near the attractor. The learning rate was 0.01, scaled by the fan-in; momentum was 0.75; we trained for 15000 epochs. The order of the training pairs was randomized. The attractor learned by the network is shown in Figure 6(b). Comparison of the temporal trajectories is shown in Figure 6(c); there is a slight frequency difference. The average MSE is 0.000136. Results from a network with two layers of 5 hidden units each, trained with 500 data pairs, were similar (MSE = 0.00034). \n\n4.2 LEARNING MULTIPLE PERIODIC ATTRACTORS \n\n[Hop82] showed how to store multiple fixed-point attractors in a recurrent net. [Bai91] can store periodic and chaotic attractors by inverting the normal forms of these attractors into higher-order recurrent networks. However, traditional recurrent training offers no obvious method of training multiple attractors. [DY89] were able to store two limit cycles by starting with fixed points stored in a Hopfield net, and training each fixed point locally to become a periodic attractor. Our approach has no difficulty with multiple attractors. Figure 7 (A-D) shows the result of training four coexisting periodic attractors, one in each quadrant of the two-dimensional phase space. The network will remain in one of the attractor basins until an external force pushes it into another attractor basin. Figure 7 (E-H) shows a network with two periodic attractors, this time one inside the other. This vector field possesses an unstable limit cycle between the two stable limit cycles. This is a more difficult task, requiring 40 hidden units, whereas 20 suffice for the previous task (not shown). \n\nFigure 7: Learning multiple attractors. In each case, a 2-20-20-2 network trained with conjugate gradient is used. Learning 4 attractors: (A) Training set. (B) Eight trajectories of the trained network. (C) Induced vector field of the network. There are five unstable fixed points. (D) State space behavior as the network is "bumped" between attractors. Learning 2 attractors, one inside the other: (E) Training set. (F) Four trajectories of the trained network. (G) Induced vector field of the network. There is an unstable limit cycle between the two stable ones. (H) State space behavior with a "bump". \n\n5 SUMMARY \n\nWe have presented a phase space view of the learning process in recurrent nets. This perspective has helped us to understand and overcome some of the problems of using traditional recurrent methods for learning periodic and chaotic attractors. 
\nOur method can learn multiple trajectories, explicitly insure their stability, and avoid overfitting; in short, we provide a practical approach to learning complicated temporal behaviors. The phase-space framework essentially breaks the problem into three sub-problems: (1) embedding a temporal signal to recover its phase space structure, (2) generating local approximations of the vector field near the desired trajectory, and (3) functional approximation in feedforward networks. We have demonstrated that using this method, networks can learn complex oscillations and multiple periodic attractors. \n\nAcknowledgements \n\nThis work was supported by NIH grant R01 MH46899-01A3. Thanks for comments from Steve Biafore, Kenji Doya, Peter Rowat, Bill Hart, and especially Dave DeMers for his timely assistance with simulations. \n\nReferences \n\n[Bai91] W. Baird and F. Eeckman. CAM storage of analog patterns and continuous sequences with 3n² weights. In R.P. Lippmann, J.E. Moody, and D.S. Touretzky, editors, Advances in Neural Information Processing Systems, volume 3, pages 91-97. Morgan Kaufmann, San Mateo, 1991. \n\n[Dav92] M. Davies. Noise reduction by gradient descent. International Journal of Bifurcation and Chaos, 3:113-118, 1992. \n\n[DY89] K. Doya and S. Yoshizawa. Memorizing oscillatory patterns in the analog neuron network. In IJCNN, Washington D.C., 1989. IEEE. \n\n[FS91] J.D. Farmer and J.J. Sidorowich. Optimal shadowing and noise reduction. Physica D, 47:373-392, 1991. \n\n[Hop82] J.J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, USA, 79, 1982. \n\n[HSW89] K. Hornik, M. Stinchcombe, and H. White. 
Multilayer feedforward networks are universal approximators. Neural Networks, 2:359-366, 1989. \n\n[KBA92] M.B. Kennel, R. Brown, and H. Abarbanel. Determining embedding dimension for phase-space reconstruction using a geometrical construction. Physical Review A, 45:3403-3411, 1992. \n\n[LF88] A. Lapedes and R. Farber. How neural nets work. In D.Z. Anderson, editor, Neural Information Processing Systems, pages 442-456, Denver 1987. American Institute of Physics, New York, 1988. \n\n[Nak93] Hiroyuki Nakajima. A paradox in learning trajectories in neural networks. Working paper, Dept. of EE II, Kyoto University, Kyoto, Japan, 1993. \n\n[Pea89] B.A. Pearlmutter. Learning state space trajectories in recurrent neural networks. Neural Computation, 1:263-269, 1989. \n\n[Pin88] F.J. Pineda. Dynamics and architecture for neural computation. Journal of Complexity, 4:216-245, 1988. \n\n[Tak81] F. Takens. Detecting strange attractors in turbulence. In D.A. Rand and L.-S. Young, editors, Dynamical Systems and Turbulence, volume 898 of Lecture Notes in Mathematics, pages 366-381, Warwick 1980. Springer-Verlag, Berlin, 1981. \n\n[TCS90] F-S. Tsung, G.W. Cottrell, and A.I. Selverston. Some experiments on learning stable network oscillations. In IJCNN, San Diego, 1990. IEEE. \n\n[Tsu94] F-S. Tsung. Modelling Dynamical Systems with Recurrent Neural Networks. PhD thesis, University of California, San Diego, 1994. \n\n[WZ89] R.J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1:270-280, 1989.", "award": [], "sourceid": 943, "authors": [{"given_name": "Fu-Sheng", "family_name": "Tsung", "institution": null}, {"given_name": "Garrison", "family_name": "Cottrell", "institution": null}]}