{"title": "The \"Moving Targets\" Training Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 558, "page_last": 565, "abstract": null, "full_text": "558 \n\nRohwer \n\nThe  'Moving  Targets'  Training  Algorithm \n\nRichard Rohwer \n\nCentre for  Speech Technology Research \n\nEdinburgh University \n\n80,  South Bridge \n\nEdinburgh EH1  1HN  SCOTLAND \n\nABSTRACT \n\nA  simple  method  for  training  the  dynamical  behavior  of  a  neu(cid:173)\nral  network  is  derived.  It  is  applicable  to  any  training  problem \nin  discrete-time networks with  arbitrary feedback.  The algorithm \nresembles back-propagation in  that an error function  is  minimized \nusing a  gradient-based method,  but the optimization is carried out \nin the hidden part of state space either instead of,  or in addition to \nweight space.  Computational results are presented for some simple \ndynamical  training  problems,  one  of which  requires response  to  a \nsignal  100 time steps in the past. \n\nINTRODUCTION \n\n1 \nThis paper presents a  minimization-based algorithm for training the dynamical be(cid:173)\nhavior of a  discrete-time neural network model.  The central idea is  to treat hidden \nnodes  as  target  nodes  with  variable  training  data.  These  \"moving  targets\"  are \nvaried  during  the  minimization  process.  Werbos  (Werbos,  1983)  used  the  term \n\"moving  targets\"  to  describe  the qualitative  idea  that  a  network should  set  itself \nintermediate objectives, and vary these objectives as information is  accumulated on \ntheir  attainability and  their  usefulness  for  achieving overall objectives.  The  (coin(cid:173)\ncidentally)  like-named  algorithm presented here can  be regarded  as  a  quantitative \nrealization of this qualitative idea. \n\nThe literature contains several temporal training algorithms based on minimization \nof  an  error  measure  with  respect  to  the  weights.  
This type of method includes the straightforward extension of the back-propagation method to back-propagation through time (Rumelhart, 1986), the methods of Rohwer and Forrest (Rohwer, 1987), Pearlmutter (Pearlmutter, 1989), and the forward propagation of derivatives (Robinson, 1988, Williams 1989a, Williams 1989b, Kuhn, 1990). A careful comparison of moving targets with back-propagation in time and teacher forcing appears in (Rohwer, 1989b). Although applicable only to fixed-point training, the algorithms of Almeida (Almeida, 1989) and Pineda (Pineda, 1988) have much in common with these dynamical training algorithms. The formal relationship between these and the method of Rohwer and Forrest is spelled out in (Rohwer 1989a). \n\n2 NOTATION AND STATEMENT OF THE TRAINING PROBLEM \n\nConsider a neural network model with arbitrary feedback as a dynamical system in which the dynamical variables x_it change with time according to a dynamical law given by the mapping \n\nx_it = sum_j w_ij f(x_j,t-1), x_0t = bias constant, (1) \n\nunless specified otherwise. The weights w_ij are arbitrary parameters representing the connection strength from node j to node i. f is an arbitrary differentiable function. Let us call any given variable x_it the \"activation\" on node i at time t. It represents the total input into node i at time t. Let the \"output\" of each node be denoted by y_it = f(x_it). Let node 0 be a \"bias node\", assigned a positive constant activation so that the weights w_i0 can be interpreted as activation thresholds. \n\nIn normal back-propagation, a network architecture is defined which divides the network into input, hidden, and target nodes. 
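As a minimal illustration (not part of the paper), the dynamical law (1) with a clamped bias node can be simulated directly; the logistic choice of f, the network size, and the random weights are all assumptions of this sketch:

```python
import numpy as np

def f(x):
    # Logistic squashing function; the model allows any differentiable f.
    return 1.0 / (1.0 + np.exp(-x))

def step(W, x_prev):
    # One application of the dynamical law (1): x_it = sum_j w_ij f(x_j,t-1).
    x = W @ f(x_prev)
    # Node 0 is the bias node, held at a positive constant activation so
    # that the weights w_i0 act as activation thresholds.
    x[0] = 1.0
    return x

rng = np.random.default_rng(0)
n = 5                                  # total node count, including bias node 0 (assumed)
W = rng.normal(0.0, 0.5, size=(n, n))  # arbitrary connection strengths w_ij
x = np.zeros(n)
x[0] = 1.0
trajectory = [x.copy()]
for t in range(10):
    x = step(W, x)
    trajectory.append(x.copy())
```

Clamping selected components of x at selected times would overrule this law for input events, which the training-problem definition formalizes.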
The moving targets algorithm makes itself applicable to arbitrary training problems by defining analogous concepts in a manner dependent upon the training data, but independent of the network architecture. Let us call a node-time pair an \"event\". To define a training problem, the set of all events must be divided into three disjoint sets: the input events I, target events T, and hidden events H. A node may participate in different types of event at different times. For every input event (it) in I, we require training data X_it with which to overrule the dynamical law (1) using \n\nx_it = X_it, (it) in I. (2) \n\n(The bias events (0t) can be regarded as a special case of input events.) For each target event (it) in T, we require training data X_it to specify a desired activation value for event (it). No notational ambiguity arises from referring to input and target data with the same symbol X because I and T are required to be disjoint sets. The training data says nothing about the hidden events in H. There is no restriction on how the initial events (i0) are classified. \n\n3 THE \"MOVING TARGETS\" METHOD \n\nLike back-propagation, the moving targets training method uses (arbitrary) gradient-based minimization techniques to minimize an \"error\" function such as the \"output deficit\" \n\nE_od = (1/2) sum_{(it) in T} (y_it - Y_it)^2, (3) \n\nwhere y_it = f(x_it) and Y_it = f(X_it). A modification of the output deficit error gave the best results in numerical experiments. However, the most elegant formalism follows from an \"activation deficit\" error function: \n\nE_ad = (1/2) sum_{(it) in T} (x_it - X_it)^2, (4) \n\nso this is what we shall use to present the formalism. \n\nThe basic idea is to treat the hidden node activations as variable target activations. 
\nTherefore let us denote these variables as X_it, just as the (fixed) targets and inputs are denoted. Let us write the computed activation values x_it of the hidden and target events in terms of the inputs and (fixed and moving) targets of the previous time step. Then let us extend the sum in (4) to include the hidden events, so the error becomes \n\nE = (1/2) sum_{(it) in T∪H} ( sum_j w_ij f(X_j,t-1) - X_it )^2. (5) \n\nThis is a function of the weights w_ij, and because there are no x's present, the full dependence on w_ij is explicitly displayed. We do not actually have desired values for the X_it with (it) in H. But any values for which weights can be found which make (5) vanish would be suitable, because this would imply not only that the desired targets are attained, but also that the dynamical law is followed on both the hidden and target nodes. Therefore let us regard E as a function of both the weights and the \"moving targets\" X_it, (it) in H. This is the essence of the method. The derivatives with respect to all of the independent variables can be computed and plugged into a standard minimization algorithm. \n\nThe reason for preferring the activation deficit form of the error (4) to the output deficit form (3) is that the activation deficit form makes (5) purely quadratic in the weights. Therefore the equations for the minimum, \n\ndE/dw_ij = 0, (6) \n\nform a linear system, the solution of which provides the optimal weights for any given set of moving targets. Therefore these equations might as well be used to define the weights as functions of the moving targets, thereby making the error (5) a function of the moving targets alone. 
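Because (5) is quadratic in the weights, the linear system (6) can be solved directly, e.g. with an SVD-based least-squares routine. The sketch below is only an illustration under simplifying assumptions (every non-initial event is treated as a target or hidden event, so a single shared correlation matrix suffices; tanh squashing; all names hypothetical):

```python
import numpy as np

def f(x):
    # Any differentiable squashing function works; tanh is an assumption here.
    return np.tanh(x)

def optimal_weights(X):
    # X: (T+1, n) array of activations -- inputs, fixed targets, and current
    # moving targets stacked into one state trajectory (hypothetical layout).
    # Since the error (5) is quadratic in the weights, dE/dw_ij = 0 is linear:
    # each row w_i is a least-squares fit of X[t, i] onto the previous
    # outputs Y[t-1, :] = f(X[t-1, :]).
    Y_prev = f(X[:-1])          # outputs at t-1, shape (T, n)
    targets = X[1:]             # activations to reproduce at t, shape (T, n)
    # lstsq solves via an SVD-based pseudo-inverse, giving a unique solution
    # even when the output correlation matrix is singular.
    W_T, *_ = np.linalg.lstsq(Y_prev, targets, rcond=None)
    return W_T.T                # W[i, j] multiplies f(x_j,t-1)

def activation_deficit(W, X):
    # Error (5): E = (1/2) sum_it (sum_j w_ij f(X_j,t-1) - X_it)^2
    resid = f(X[:-1]) @ W.T - X[1:]
    return 0.5 * np.sum(resid ** 2)

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))    # a random trajectory of moving targets
W = optimal_weights(X)
E_opt = activation_deficit(W, X)
# W is the exact minimizer over weights, so any perturbation of it
# should give an error at least as large.
E_pert = activation_deficit(W + 0.01 * rng.normal(size=W.shape), X)
```

Eliminating the weights this way leaves the error a function of the moving targets alone, which can then be handed to any gradient-based minimizer.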
\n\nThe derivation of the derivatives with respect to the moving targets is spelled out in (Rohwer, 1989b). The result is: \n\ndE/dX_it = -e_it + f'_it sum_k w_ki e_k,t+1, (7) \n\nwhere \n\ne_it = sum_j w_ij f(X_j,t-1) - X_it, (it) in T∪H, (8) \n\ne_it = 0, (it) not in T∪H, (9) \n\nf'_it = df(x)/dx evaluated at x = X_it, (10) \n\nand \n\nw_ij = sum_k ( sum_t X_it Y_k,t-1 ) M(i)^-1_kj, (11) \n\nwhere M(a)^-1 is the inverse of M(a), the correlation matrix of the node outputs defined by \n\nM(a)_ij = sum_t Y_i,t-1 Y_j,t-1. (12) \n\nIn the event that any of the matrices M are singular, a pseudo-inversion method such as singular value decomposition (Press, 1988) can be used to define a unique solution among the infinite number available. \n\nNote also that (11) calls for a separate matrix inversion for each node. However if the set of input nodes remains fixed for all time, then all these matrices are equal. \n\n3.1 FEEDFORWARD VERSION \n\nThe basic ideas used in the moving targets algorithm can be applied to feedforward networks to provide an alternative method to back-propagation. The hidden node activations for each training example become the moving target variables. Further details appear in (Rohwer, 1989b). The moving targets method for feedforward nets is analogous to the method of Grossman, Meir, and Domany (Grossman, 1990a, 1990b) for networks with discrete node values. Birmiwal, Sarwal, and Sinha (Birmiwal, 1989) have developed an algorithm for feedforward networks which incorporates the use of hidden node values as fundamental variables and a linear system of equations for obtaining the weight matrix. 
Their algorithm differs from the feedforward version of moving targets mainly in the (inessential) use of a specific minimization algorithm which discards most of the gradient information except for the signs of the various derivatives. Heileman, Georgiopoulos, and Brown (Heileman, 1989) also have an algorithm which bears some resemblance to the feedforward version of moving targets. Another similar algorithm has been developed by Krogh, Hertz, and Thorbergsson (Krogh, 1989, 1990). \n\n4 COMPUTATIONAL RESULTS \n\nA set of numerical experiments performed with the activation deficit form of the algorithm (4) is reported in (Rohwer, 1989b). Some success was attained, but greater progress was made after changing to a quartic output deficit error function with temporal weighting of errors: \n\nE_quartic = (1/4) sum_{(it) in T} (1.0 + at) (y_it - Y_it)^4. (13) \n\nHere a is a small positive constant. The quartic function is dominated by the terms with the greatest error. This combats a tendency to fail on a few infrequently seen state transitions in order to gain unneeded accuracy on a large number of similar, low-error state transitions. The temporal weighting encourages the algorithm to focus first on late-time errors, and then work back in time. In some cases this helped with local minimum difficulties. A difficulty with convergence to chaotic attractors reported in (Rohwer, 1989b) appears to have mysteriously disappeared with the adoption of this error measure. \n\n4.1 MINIMIZATION ALGORITHM \n\nFurther progress was made by altering the minimization algorithm. Originally the conjugate gradient algorithm (Press, 1988) was used, with a linesearch algorithm from Fletcher (Fletcher, 1980). The new algorithm might be called \"curvature avoidance\". 
The change in the gradient with each linesearch is used to update a moving average estimate of the absolute value of the diagonal components of the Hessian. The linesearch direction is taken to be the component-by-component quotient of the gradient with these curvature averages. Were it not for the absolute values, this would be an unusual way of estimating the conjugate gradient. The absolute values are used to discourage exploration of directions which show any hint of being highly curved. The philosophy is that by exploring low-curvature directions first, narrow canyons are entered only when necessary. \n\n4.2 SIMULATIONS \n\nSeveral simulations have been done using fully connected networks. Figure 1 plots the node outputs of a network trained to switch between different limit cycles under input control. There are two input nodes, one target node, and 2 hidden nodes, as indicated in the left margin. Time proceeds from left to right. The oscillation period of the target node increases with the binary number represented by the two input nodes. The network was trained on one period of each of the four frequencies. \n\nFigure 1: Controlled switching between limit cycles \n\nFigure 2 shows the operation of a network trained to detect whether an even or odd number of pulses have been presented to the input; a temporal version of parity detection. The network was trained on the data preceding the third input pulse. 
\n\nFigure 2: Parity detection \n\nFigure 3 shows the behavior of a network trained to respond to the second of two input pulses separated by 100 time steps. This demonstrates a unique (in the author's knowledge) capability of this method, an ability to utilize very distant temporal correlations when there is no other way to solve the problem. This network was trained and tested on the same data, the point being merely to show that training is possible in this type of problem. More complex problems of this type frequently get stuck in local minima. \n\nFigure 3: Responding to temporally distant input (final error 2.232800e-11; 4414 linesearches, 9751 gradient evaluations, 9043 error evaluations, 3942 CPU seconds) \n\n5 CONCLUDING REMARKS \n\nThe simulations show that this method works, and show in particular that distant temporal correlations can be discovered. Some practical difficulties have emerged, however, which are currently limiting the application of this technique to 'toy' problems. The most serious are local minima and long training times. Problems involving large amounts of training data may present the minimization problem with an impractically large number of variables. Variations of the algorithm are being studied in hopes of overcoming these difficulties. \n\nAcknowledgements \n\nThis work was supported by ESPRIT Basic Research Action 3207 ACTS. \n\nReferences \n\nL. 
Almeida, (1989), \"Backpropagation in Non-Feedforward Networks\", in Neural Computing Architectures, I. Aleksander, ed., North Oxford Academic. \n\nK. Birmiwal, P. Sarwal, and S. Sinha, (1989), \"A New Gradient-Free Learning Algorithm\", Tech. report, Dept. of EE, Southern Illinois U., Carbondale. \n\nR. Fletcher, (1980), Practical Methods of Optimization, v1, Wiley. \n\nT. Grossman, (1990a), \"The CHIR Algorithm: A Generalization for Multiple Output and Multilayered Networks\", to appear in Complex Systems. \n\nT. Grossman, (1990b), this volume. \n\nG. L. Heileman, M. Georgiopoulos, and A. K. Brown, (1989), \"The Minimal Disturbance Back Propagation Algorithm\", Tech. report, Dept. of EE, U. of Central Florida, Orlando. \n\nA. Krogh, J. A. Hertz, and G. I. Thorbergsson, (1989), \"A Cost Function for Internal Representations\", NORDITA preprint 89/37 S. \n\nA. Krogh, J. A. Hertz, and G. I. Thorbergsson, (1990), this volume. \n\nG. Kuhn, (1990), \"Connected Recognition with a Recurrent Network\", to appear in Proc. NEUROSPEECH, 18 May 1989, as special issue of Speech Communication, 9, no. 2. \n\nB. Pearlmutter, (1989), \"Learning State Space Trajectories in Recurrent Neural Networks\", Proc. IEEE IJCNN 89, Washington D. C., II-365. \n\nF. Pineda, (1988), \"Dynamics and Architecture for Neural Computation\", J. Complexity 4, 216. \n\nW. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, (1988), Numerical Recipes in C, The Art of Scientific Computing, Cambridge. \n\nA. J. Robinson and F. Fallside, (1988), \"Static and Dynamic Error Propagation Networks with Applications to Speech Coding\", Neural Information Processing Systems, D. Z. Anderson, Ed., AIP, New York. \n\nR. Rohwer and B. 
Forrest, (1987),  \"Training Time Dependence in Neural Networks\" \nProc.  IEEE ICNN,  San Diego,  II-701. \n\nR.  Rohwer and  S.  Renals,  (1989a),  \"Training Recurrent  Networks\",  in  Neural Net(cid:173)\nworks  from Models  to  Applications,  L.  Personnaz and  G.  Dreyfus, eds.,  I.D.S.E.T., \nParis,  207. \n\nR. Rohwer,  (1989b),  \"The 'Moving Targets' Training Algorithm\", to appear in  Proc. \nDANIP, G MD  Bonn,  J.  Kinderman and  A.  Linden,  Eds. \n\nD.  Rumelhart, G.  Hinton and  R.  Williams,  (1986),  \"Learning Internal Representa(cid:173)\ntions  by Error Propagation\"  in  Parallel  Distributed Processing,  v.  1,  MIT. \n\nP.  Werbos,  (1983)  Energy Models  and Studies,  B.  Lev,  Ed.,  North  Holland. \n\nR.  Williams  and  D.  Zipser,  (1989a),  \"A  Learning  Algorithm for  Continually  Run(cid:173)\nning Fully  Recurrent  Neural Networks\" ,  Neural  Computation 1, 270. \n\nR.  Williams  and  D.  Zipser,  (1989bL  \"Experimental Analysis of the Real-time  Re(cid:173)\ncurrent  Learning  Algorithm\",  Connection  Science 1, 87. \n\n\f", "award": [], "sourceid": 233, "authors": [{"given_name": "Richard", "family_name": "Rohwer", "institution": null}]}