{"title": "Learning long-term dependencies is not as difficult with NARX networks", "book": "Advances in Neural Information Processing Systems", "page_first": 577, "page_last": 583, "abstract": null, "full_text": "Learning long-term dependencies \n\nis  not  as difficult  with NARX networks \n\nTsungnan Lin* \n\nBill G.  Horne \n\nDepartment of Electrical Engineering \n\nNEC Research Institute \n\nPrinceton University \nPrinceton,  N J  08540 \n\n4 Independence Way \nPrinceton, NJ 08540 \n\nPeter Tiiio \n\nDept.  of Computer Science and Engineering \n\nSlovak Technical University \n\nIlkovicova 3,  812  19  Bratislava, Slovakia \n\nc. Lee  Gilest \n\nNEC  Research Institute \n\n4 Independence Way \nPrinceton, N J  08540 \n\nAbstract \n\nIt  has  recently  been  shown  that  gradient  descent  learning  algo(cid:173)\nrithms for  recurrent  neural  networks can  perform poorly  on  tasks \nthat  involve  long-term  dependencies. \nIn  this  paper  we  explore \nthis  problem  for  a  class  of architectures  called  NARX  networks, \nwhich  have  powerful  representational capabilities.  Previous  work \nreported that gradient descent learning is  more effective in NARX \nnetworks  than  in  recurrent  networks  with  \"hidden  states\".  We \nshow  that although  NARX  networks  do  not  circumvent the  prob(cid:173)\nlem  of long-term dependencies,  they  can  greatly  improve  perfor(cid:173)\nmance  on  such  problems.  We  present  some  experimental 'results \nthat  show  that  NARX  networks  can  often  retain  information  for \ntwo to three times as long as conventional recurrent  networks. \n\n1 \n\nIntroduction \n\nRecurrent  Neural  Networks  (RNNs)  are  capable of representing arbitrary  nonlin(cid:173)\near  dynamical  systems  [19,  20].  However,  learning simple  behavior  can  be  quite \n\n\"Also with NEC  Research Institute. \ntAlso with UMIACS,  University of Maryland,  College  Park,  MD  20742 \n\n\f578 \n\nT.  LIN, B.  G.  HORNE, P. TINO, C. L. GILES \n\ndifficult  using gradient descent.  For  example,  even  though  these systems  are 'lUr(cid:173)\ning  equivalent,  it  has  been  difficult  to  get  them  to successfully  learn  small  finite \nstate machines  from  example strings  encoded  as  temporal  sequences.  Recently,  it \nhas  been  demonstrated  that  at  least  part  of this  difficulty  can  be  attributed  to \nlong-term  dependencies,  i.e.  when the desired output at time T  depends  on inputs \npresented at times t \u00ab T.  In [13]  it was  reported that RNNs were able to learn short \nterm musical structure using gradient based methods,  but  had difficulty  capturing \nglobal  behavior.  These ideas  were  recently  formalized  in  [2],  which  showed that if \na  system is  to robustly  latch information,  then the fraction  of the gradient  due to \ninformation n  time steps  in the past approaches zero as n  becomes large. \n\nSeveral  approaches  have  been  suggested  to  circumvent  this  problem.  For  exam(cid:173)\nple,  gradient-based methods can be abandoned in favor of alternative optimization \nmethods  [2,  15].  However,  the algorithms  investigated  so  far  either  perform  just \nas  poorly on problems involving long-term dependencies,  or,  when they are better, \nrequire far  more computational resources  [2].  
Another possibility is to modify conventional gradient descent by weighting the fraction of the gradient due to information far in the past more heavily, but there is no guarantee that such a modified algorithm would converge to a minimum of the error surface being searched [2]. Another suggestion has been to alter the input data so that it represents a reduced description that makes global features more explicit and more readily detectable [7, 13, 16, 17]. However, this approach may fail if short-term dependencies are equally important. Finally, it has been suggested that a network architecture that operates on multiple time scales might be useful [5, 6].\n\nIn this paper, we also propose an architectural approach to dealing with long-term dependencies [11]. We focus on a class of architectures based upon Nonlinear AutoRegressive models with eXogenous inputs (NARX models), which are therefore called NARX networks [3, 14]. This is a powerful class of models that has recently been shown to be computationally equivalent to Turing machines [18]. Furthermore, previous work has shown that gradient descent learning is more effective in NARX networks than in recurrent network architectures with \"hidden states\" when applied to problems including grammatical inference and nonlinear system identification [8]. Typically, these networks converge much faster and generalize better than other networks. The results in this paper give an explanation of this phenomenon.\n\n2 Vanishing gradients and long-term dependencies\n\nBengio et al. [2] have analytically explained why problems with long-term dependencies are difficult to learn. They argue that for many practical applications the goal of the network must be to robustly latch information, i.e., the network must be able to store information for a long period of time in the presence of noise. More specifically, they argue that information is latched when the states of the network stay within the vicinity of a hyperbolic attractor, and robustness to noise is accomplished if the states of the network are contained in the reduced attracting set of that attractor, i.e., the set of points at which the eigenvalues of the Jacobian are contained within the unit circle.\n\nIn algorithms such as Backpropagation Through Time (BPTT), the gradient of the cost function C is written by assuming that the weights at different time indices are independent and computing the partial gradient with respect to these weights. The total gradient is then equal to the sum of these partial gradients. It can easily be shown that the weight updates are proportional to\n\nΔW ∝ Σ_p (y_p(T) − d_p) Σ_{τ=1}^{T} (∂y_p(T)/∂x(T)) Jx(T, T − τ) (∂x(τ)/∂W),\n\nwhere y_p(T) and d_p are the actual and desired (target) outputs for the p-th pattern(1), x(t) is the state vector of the network at time t, and Jx(T, T − τ) = ∇_{x(τ)} x(T) denotes the Jacobian of the network expanded over T − τ time steps.\n\n(1) We deal only with problems in which the target output is presented at the end of the sequence.\n\nIn [2], it was shown that if the network robustly latches information, then Jx(T, n) is an exponentially decreasing function of n, so that lim_{n→∞} Jx(T, n) = 0. This implies that the portion of ∇_W C due to information at times τ ≪ T is insignificant compared to the portion at times near T. This vanishing gradient is the essential reason why gradient descent methods are not sufficiently powerful to discover a relationship between target outputs and inputs that occur at a much earlier time.\n
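\nAs a minimal numerical illustration of this effect, the following sketch (ours, not part of the original paper; it assumes NumPy and anticipates the one-node network of Section 3) puts the one-unit recurrent network x(t) = tanh(w x(t − 1)), w = 1.25, at its positive stable fixed point. There the n-step Jacobian is just the n-th power of the one-step derivative, which is smaller than one at a stable fixed point:\n\n    import numpy as np\n\n    # One-unit recurrent network x(t) = tanh(w * x(t-1)) with w = 1.25.\n    w = 1.25\n    x = 0.7\n    for _ in range(100):              # iterate the map to the fixed point x* ~ 0.710\n        x = np.tanh(w * x)\n\n    # tanh(w * x) = x at the fixed point, so the one-step derivative is w * (1 - x**2).\n    one_step = w * (1.0 - x ** 2)     # ~0.62, inside the unit circle\n    for n in (1, 5, 10, 20, 40):\n        print(n, one_step ** n)       # Jx(t, n) shrinks exponentially with n\n\nEach printed value is the fraction of the gradient that survives n steps; by n = 40 it is below 10^-8, which is the vanishing gradient in its simplest form.\n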
\n3 NARX networks\n\nAn important class of discrete-time nonlinear systems is the Nonlinear AutoRegressive with eXogenous inputs (NARX) model [3, 10, 12, 21]:\n\ny(t) = f(u(t − Du), ..., u(t − 1), u(t), y(t − Dy), ..., y(t − 1)),\n\nwhere u(t) and y(t) represent the input and output of the network at time t, Du and Dy are the input and output order, and f is a nonlinear function. When the function f can be approximated by a Multilayer Perceptron, the resulting system is called a NARX network [3, 14].\n\nFigure 1: NARX network (the network is fed by the inputs u(k), u(k−1), u(k−2) and the delayed outputs y(k−1), y(k−2), y(k−3)).\n\nIn this paper we shall consider NARX networks with zero input order and a one-dimensional output. However, there is no reason why our results could not be extended to networks with higher input orders. Since the states of a discrete-time dynamical system can always be associated with the unit-delay elements in the realization of the system, we can describe such a network in the state space form\n\nx_1(t + 1) = f(u(t), x_1(t), ..., x_D(t)),\nx_i(t + 1) = x_{i−1}(t),  i = 2, ..., D,     (1)\n\nwith y(t) = x_1(t + 1).\n\nIf the Jacobian of this system has all of its eigenvalues inside the unit circle at each time step, then the states of the network will be in the reduced attracting set of some hyperbolic attractor, and thus the system will be robustly latched at that time. As with any other RNN, this implies that lim_{n→∞} Jx(t, n) = 0. Thus, NARX networks will also suffer from vanishing gradients and the long-term dependencies problem. However, we find in the simulation results that follow that NARX networks are often much better at discovering long-term dependencies than conventional RNNs.\n\nAn intuitive reason why output delays can help with long-term dependencies can be found by considering how gradients are calculated using the Backpropagation Through Time algorithm. BPTT involves two phases: unfolding the network in time and backpropagating the error through the unfolded network. When a NARX network is unfolded in time, the output delays appear as jump-ahead connections in the unfolded network. Intuitively, these jump-ahead connections provide a shorter path for propagating gradient information, thus reducing the sensitivity of the network to long-term dependencies. However, this intuitive reasoning is only valid if the total gradient through these jump-ahead pathways is greater than the gradient through the layer-to-layer pathways.\n
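\nTo make the state-space form (1) concrete, here is a small simulation sketch (ours; it assumes NumPy, and a randomly weighted one-hidden-layer MLP stands in for f, so the sizes and weights are illustrative, not the networks trained in Section 4). The first state component carries the newest output while the remaining components form a tapped delay line:\n\n    import numpy as np\n\n    rng = np.random.default_rng(0)\n\n    D, H = 3, 4                                  # output order and hidden-layer size (arbitrary)\n    W1 = rng.normal(scale=0.5, size=(H, D + 1))  # hidden weights on [u(t), x_1(t), ..., x_D(t)]\n    b1 = np.zeros(H)\n    W2 = rng.normal(scale=0.5, size=H)           # output weights\n\n    def f(u, x):\n        # One-hidden-layer MLP standing in for the NARX map f.\n        h = np.tanh(W1 @ np.concatenate(([u], x)) + b1)\n        return np.tanh(W2 @ h)\n\n    x = np.zeros(D)                              # states = the D most recent outputs\n    for t in range(10):\n        u = rng.uniform(-1, 1)                   # an arbitrary exogenous input\n        y = f(u, x)                              # x_1(t+1) = f(u(t), x(t)), and y(t) = x_1(t+1)\n        x = np.concatenate(([y], x[:-1]))        # x_i(t+1) = x_{i-1}(t): shift the delay line\n        print(t, float(y))\n\nWhen such a network is unfolded in time, each state component x_i reaches back i steps, which is exactly where the jump-ahead connections described above come from.\n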
\nIt is possible to derive analytical results for some simple toy problems showing that NARX networks are indeed less sensitive to long-term dependencies. Here we give one such example, which is based upon the latching problem described in [2]. Consider the one-node autonomous recurrent network described by x(t) = tanh(w x(t − 1)), where w = 1.25, which has two stable fixed points at ±0.710 and one unstable fixed point at zero. The one-node autonomous NARX network x(t) = tanh(Σ_{r=1}^{D} w_r x(t − r)) has the same fixed points as long as Σ_{i=1}^{D} w_i = w.\n\nAssume the state of the network has reached equilibrium at the positive stable fixed point and there are no external inputs. For simplicity, we consider only the Jacobian J(t, n) = ∂x(t)/∂x(t − n), which will be a component of the gradient ∇_w C. Figure 2a shows plots of J(t, n) with respect to n for D = 1, D = 3, and D = 6, with w_i = w/D. These plots show that the effect of output delays is to flatten out the curves and place more emphasis on the gradient due to terms farther in the past. Note that the gradient contribution due to short-term dependencies is correspondingly deemphasized. In Figure 2b we show plots of the ratio J(t, n) / Σ_{τ=1}^{t} J(t, τ), which illustrates the percentage of the total gradient that can be attributed to information n time steps in the past. These plots show that this percentage is larger for the networks with output delays, and thus one would expect that these networks would be able to deal with long-term dependencies more effectively.\n\nFigure 2: Results for the latching problem. (a) Plots of J(t, n) as a function of n. (b) Plots of the ratio J(t, n) / Σ_{τ=1}^{t} J(t, τ) as a function of n.\n
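\nThe curves in Figure 2a can be reproduced qualitatively in a few lines (our sketch, assuming NumPy; the linearization is our reconstruction of the calculation). Linearizing x(t) = tanh(Σ_{r=1}^{D} w_r x(t − r)) at the positive fixed point x* gives the recursion J(n) = Σ_{r=1}^{D} a J(n − r) with a = (1 − x*²)(w/D) and J(0) = 1:\n\n    import numpy as np\n\n    w = 1.25\n    xstar = 0.7\n    for _ in range(200):               # converge to the positive stable fixed point (~0.710)\n        xstar = np.tanh(w * xstar)\n\n    slope = 1.0 - xstar ** 2           # tanh'(w * xstar), using tanh(w * xstar) = xstar\n\n    for D in (1, 3, 6):\n        a = slope * (w / D)            # linearized weight on each of the D delayed states\n        J = [1.0]                      # J(0) = 1; J(n) = sum over r of a * J(n - r)\n        for n in range(1, 61):\n            J.append(a * sum(J[n - r] for r in range(1, D + 1) if n - r >= 0))\n        print(D, J[10], J[30], J[60])  # larger D gives a flatter (slower) decay\n\nFor D = 1 the values decay like 0.62^n, while for D = 6 the dominant root of the recursion is roughly 0.88, so a far larger share of the total gradient survives far into the past.\n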
\n4 Experimental results\n\n4.1 The latching problem\n\nWe explored a slight modification of the latching problem described in [2], a minimal task designed as a test that must necessarily be passed in order for a network to robustly latch information. In this task there are three inputs u_1(t), u_2(t), and a noise input e(t), and a single output y(t). Both u_1(t) and u_2(t) are zero for all times t > 1. At time t = 1, u_1(1) = 1 and u_2(1) = 0 for samples from class 1, and u_1(1) = 0 and u_2(1) = 1 for samples from class 2. The noise input e(t) is drawn uniformly from [−b, b] for L < t ≤ T, and e(t) = 0 for t ≤ L. The network used to solve this problem is a NARX network consisting of a single neuron,\n\nx(t) = tanh(Σ_{r=1}^{D} w_r x(t − r) + h_1 u_1(t) + h_2 u_2(t) + e(t)),\n\nwhere the parameters h_j are adjustable and the recurrent weights w_r are fixed(2). We fixed the recurrent feedback weights to w_r = 1.25/D, which gives the autonomous network two stable fixed points at ±0.710, as described in Section 3. It can be shown [4] that the network is robust to perturbations in the range [−0.155, 0.155]. Thus, the uniform noise e(t) was restricted to this range.\n\n(2) Although this description may appear different from the one in [2], it can be shown that they are actually identical experiments for D = 1.\n\nFor each simulation, we generated 30 strings from each class, each with a different e(t). The initial values of h_j for each simulation were also chosen from the same distribution that defines e(t). For strings from class one a target value of 0.8 was chosen; for class two, −0.8. The network was trained using a simple BPTT algorithm with a learning rate of 0.1 for a maximum of 100 epochs. (We found that the network consistently converged to some solution within a few dozen epochs.) If the simulation exceeded 100 epochs and did not correctly classify all strings, then the simulation was ruled a failure. We varied T from 10 to 200 in increments of 2. For each value of T, we ran 50 simulations. Figure 3a shows a plot of the percentage of those runs that were successful for each case. It is clear from these plots that the NARX networks become increasingly less sensitive to long-term dependencies as the output order is increased.\n\nFigure 3: (a) Plots of the percentage of successful simulations as a function of T, the length of the input strings. (b) Plots of the final classification rate with respect to the length of the input noise.\n
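\nBefore training enters the picture, the latching mechanism itself is easy to check numerically. The sketch below (ours, not the original experiment; it assumes NumPy and reuses the fixed weights w_r = 1.25/D from above) drives the autonomous single-neuron NARX network with noise bounded by b = 0.155 and verifies that the sign of the state stays latched:\n\n    import numpy as np\n\n    rng = np.random.default_rng(1)\n\n    def run_latch(D, T=200, b=0.155, init=0.710):\n        # Single neuron with fixed recurrent weights w_r = 1.25 / D, no class inputs,\n        # driven only by bounded noise e(t) ~ U[-b, b] for T steps.\n        x = np.full(D, init)                 # start at the positive stable fixed point\n        for _ in range(T):\n            e = rng.uniform(-b, b)\n            x_new = np.tanh((1.25 / D) * x.sum() + e)\n            x = np.concatenate(([x_new], x[:-1]))\n        return x[0]\n\n    for D in (1, 3, 6):\n        finals = [run_latch(D) for _ in range(20)]\n        print(D, min(finals))                # every run should end near +0.710\n\nIn every run the state remains in the basin of the positive fixed point, which is the robustness property [4] that restricting e(t) to [−0.155, 0.155] is meant to guarantee.\n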
\n4.2 The parity problem\n\nIn the parity problem, the task is to classify sequences depending on whether or not the number of 1s in the input string is odd. We generated 20 strings of lengths varying from 3 to 5 and appended uniformly distributed noise in the range [−0.2, 0.2] to the end of each string. The length of the appended noise varied from 0 to 50. We arbitrarily chose 0.7 and −0.7 to represent the symbols \"1\" and \"0\". The target is given only at the end of each string. Three networks with different numbers of output delays were run on this problem in order to evaluate the capability of the networks to learn long-term dependencies. In order to make the networks comparable, we chose networks in which the number of weights was roughly equal. For networks with one, two, and three delays, 5, 4, and 3 hidden neurons were chosen respectively, giving 21, 21, and 19 trainable weights. Initial weight values were randomly generated between −0.5 and 0.5 for 10 trials.\n\nFigure 3b shows the average classification rate with respect to the length of the input noise. When the length of the noise is less than 5, all three networks can learn all the sequences, with classification rates near 100%. When the length increases to between 10 and 35, the classification rate of the networks with one feedback delay drops quickly to about 60%, while the rate of the networks with two or three feedback delays still remains about 80%.\n
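\nFor concreteness, a data-generation sketch for this task might look as follows (ours; it assumes NumPy, the helper name is hypothetical, and the ±1 target encoding is our choice, since the paper does not state the numerical target values):\n\n    import numpy as np\n\n    rng = np.random.default_rng(2)\n\n    def make_parity_string(length, noise_len, b=0.2):\n        # Bits encoded as +0.7 / -0.7, followed by noise_len samples of U[-b, b].\n        # The target, given only at the end, is +1 for an odd number of 1s, else -1.\n        bits = rng.integers(0, 2, size=length)\n        seq = np.where(bits == 1, 0.7, -0.7)\n        noise = rng.uniform(-b, b, size=noise_len)\n        target = 1.0 if bits.sum() % 2 == 1 else -1.0\n        return np.concatenate((seq, noise)), target\n\n    seq, target = make_parity_string(length=4, noise_len=10)\n    print(seq.round(2), target)\n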
\n5 Conclusion\n\nIn this paper we considered an architectural approach to dealing with the problem of learning long-term dependencies, exploring the ability of a class of architectures called NARX networks to solve such problems. This ability has been observed previously, in the sense that gradient descent learning appeared to be more effective in NARX networks than in RNNs [8]. We presented an analytical example showing that, when the network is operating at a fixed point, the gradients do not vanish as quickly in NARX networks as they do in networks without multiple delays. We also presented two experimental problems which show that NARX networks can outperform networks with single delays on some simple problems involving long-term dependencies.\n\nWe speculate that similar results could be obtained for other networks. In particular, we hypothesize that any network that uses tapped delay feedback [1, 9] would demonstrate improved performance on problems involving long-term dependencies.\n\nAcknowledgements\n\nWe would like to thank A. Back and Y. Bengio for many useful suggestions.\n\nReferences\n\n[1] A.D. Back and A.C. Tsoi. FIR and IIR synapses, a new neural network architecture for time series modeling. Neural Computation, 3(3):375-385, 1991.\n\n[2] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. on Neural Networks, 5(2):157-166, 1994.\n\n[3] S. Chen, S.A. Billings, and P.M. Grant. Non-linear system identification using neural networks. International Journal of Control, 51(6):1191-1214, 1990.\n\n[4] P. Frasconi, M. Gori, M. Maggini, and G. Soda. Unified integration of explicit knowledge and learning by example in recurrent networks. IEEE Trans. on Knowledge and Data Engineering, 7(2):340-346, 1995.\n\n[5] M. Gori, M. Maggini, and G. Soda. Scheduling of modular architectures for inductive inference of regular grammars. In ECAI'94 Workshop on Combining Symbolic and Connectionist Processing, pages 78-87.\n\n[6] S. El Hihi and Y. Bengio. Hierarchical recurrent neural networks for long-term dependencies. In NIPS 8, 1996. (In this Proceedings.)\n\n[7] S. Hochreiter and J. Schmidhuber. Long short term memory. Technical Report FKI-207-95, Technische Universität München, 1995.\n\n[8] B.G. Horne and C.L. Giles. An experimental comparison of recurrent neural networks. In NIPS 7, pages 697-704, 1995.\n\n[9] R.R. Leighton and B.C. Conrath. The autoregressive backpropagation algorithm. In Proceedings of the International Joint Conference on Neural Networks, volume 2, pages 369-377, July 1991.\n\n[10] I.J. Leontaritis and S.A. Billings. Input-output parametric models for non-linear systems: Part I: deterministic non-linear systems. International Journal of Control, 41(2):303-328, 1985.\n\n[11] T.N. Lin, B.G. Horne, P. Tiño, and C.L. Giles. Learning long-term dependencies is not as difficult with NARX recurrent neural networks. Technical Report UMIACS-TR-95-78 and CS-TR-3500, University of Maryland, 1995.\n\n[12] L. Ljung. System Identification: Theory for the User. Prentice-Hall, 1987.\n\n[13] M.C. Mozer. Induction of multiscale temporal structure. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, NIPS 4, pages 275-282, 1992.\n\n[14] K.S. Narendra and K. Parthasarathy. Identification and control of dynamical systems using neural networks. IEEE Trans. on Neural Networks, 1:4-27, March 1990.\n\n[15] G.V. Puskorius and L.A. Feldkamp. Recurrent network training with the decoupled extended Kalman filter. In Proc. 1992 SPIE Conf. on the Science of Artificial Neural Networks, Orlando, Florida, April 1992.\n\n[16] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992.\n\n[17] J. Schmidhuber. Learning unambiguous reduced sequence descriptions. In NIPS 4, pages 291-298, 1992.\n\n[18] H.T. Siegelmann, B.G. Horne, and C.L. Giles. Computational capabilities of NARX neural networks. IEEE Trans. on Systems, Man and Cybernetics, 1996. Accepted.\n\n[19] H.T. Siegelmann and E.D. Sontag. On the computational power of neural networks. Journal of Computer and System Science, 50(1):132-150, 1995.\n\n[20] E.D. Sontag. Systems combining linearity and saturations and relations to neural networks. Technical Report SYCON-92-01, Rutgers Center for Systems and Control, 1992.\n\n[21] H. Su, T. McAvoy, and P. Werbos. Long-term predictions of chemical processes using recurrent neural networks: a parallel training approach. Ind. Eng. Chem. Res., 31:1338, 1992.\n", "award": [], "sourceid": 1151, "authors": [{"given_name": "Tsungnan", "family_name": "Lin", "institution": null}, {"given_name": "Bill", "family_name": "Horne", "institution": null}, {"given_name": "Peter", "family_name": "Ti\u00f1o", "institution": null}, {"given_name": "C.", "family_name": "Giles", "institution": null}]}