{"title": "An experimental comparison of recurrent neural networks", "book": "Advances in Neural Information Processing Systems", "page_first": 697, "page_last": 704, "abstract": null, "full_text": "An experimental comparison \nof recurrent  neural networks \n\nBill G.  Horne and C.  Lee Giles\u00b7 \n\nNEe Research  Institute \n\n4  Independence  Way \nPrinceton,  NJ  08540 \n\n{horne.giles}~research.nj.nec.com \n\nAbstract \n\nMany  different  discrete-time  recurrent  neural  network  architec(cid:173)\ntures  have  been  proposed.  However,  there  has  been  virtually  no \neffort  to compare these arch:tectures experimentally.  In  this paper \nwe  review  and categorize many of these architectures and compare \nhow  they  perform on various classes  of simple problems including \ngrammatical inference  and nonlinear system identification. \n\n1 \n\nIntroduction \n\nIn  the past few  years several  recurrent  neural  network  architectures  have emerged. \nIn this paper we  categorize various discrete-time recurrent neural network architec(cid:173)\ntures,  and  perform a  quantitative  comparison of these  architectures  on  two  prob(cid:173)\nlems:  grammatical inference  and nonlinear system identification. \n\n2  RNN Architectures \n\nWe broadly divide these  networks into two groups depending on whether or not the \nstates of the network  are  guaranteed to be observable.  A  network with observable \nstates has the property that the states of the system can always be determined from \nobservations  of the  input  and  output  alone.  The  archetypical  model  in  this  class \n\n.. Also  with  UMIACS,  University  of Maryland,  College  Park,  MD  20742 \n\n\f698 \n\nBill G.  Horne,  C.  Lee Giles \n\nTable 1:  Terms that are weighted  in various single layer network  architectures.  Ui \nrepresents  the  ith  input at the current  time step,  Zi  represents  the value of the lh \nnode  at the previous time step. \n\nArchitecture  bias  Ui \nFirst order \nx \nHigh order \n\nx \n\nBilinear \nQuadratic \n\nx \nx \n\nx \n\nZi \nx \n\nx \nx \n\nUiUj \n\nZiUj  ZiZj \n\nx \nx \nx \n\nx \n\nx \n\nwas  proposed by N arendra and Parthasarathy [9].  In their most general model, the \noutput of the network is computed by a multilayer perceptron  (MLP) whose inputs \nare  a  window of past inputs  and outputs, as shown  in  Figure  la.  A special case of \nthis network is  the Time Delay Neural  Network (TDNN), which is simply a tapped \ndelay line (TDL) followed by an MLP [7].  This network is not recurrent since there \nis  no  feedback;  however,  the  TDL  does  provide  a  simple  form  of dynamics  that \ngives  the  network  the  ability model  a  limited class of nonlinear  dynamic systems. \nA variation on the TDNN, called the Gamma network,  has been proposed in which \nthe  TDL  is  replaced  by  a  set  of cascaded  filters  [2].  Specifically,  if the  output of \none of the filters  is  denoted  xj(k),  and the output of filter  i  connects  to the input \nof filter  j, the output of filter  j  is  given  by, \n\nxj(k + 1) = I-'xi(k) + (l-I-')xj(k). \n\nIn this paper we only consider the case where I-'  is fixed,  although better results can \nbe obtained if it is adaptive. \n\nNetworks  that have  hidden  dynamics have states which  are not directly  accessible \nto observation.  In fact,  it  may be  impossible to determine  the states of a  system \nfrom  observations of it's inputs  and outputs alone.  
Networks that have hidden dynamics have states that are not directly accessible to observation. In fact, it may be impossible to determine the states of such a system from observations of its inputs and outputs alone. We divide networks with hidden dynamics into three classes: single layer networks, multilayer networks, and networks with local feedback.

Single layer networks are perhaps the most popular of the recurrent neural network models. In a single layer network, every node depends on the previous output of all of the other nodes. The function performed by each node distinguishes the types of recurrent networks in this class. In each of these networks, a node can be characterized as a nonlinear function of a weighted sum of inputs, previous node outputs, or products of these values. A bias term may also be included. In this paper we consider first-order networks, high-order networks [5], bilinear networks, and quadratic networks [12]. The terms that are weighted in each of these networks are summarized in Table 1.

Table 1: Terms that are weighted in various single layer network architectures. ui represents the ith input at the current time step; zi represents the value of the ith node at the previous time step.

    Architecture   bias   ui   zi   ui uj   zi uj   zi zj
    First order     x     x    x
    High order                                x
    Bilinear              x    x              x
    Quadratic       x     x    x      x       x       x

Multilayer networks consist of a feedforward network coupled with a finite set of delays, as shown in Figure 1b. One network in this class is an architecture proposed by Robinson and Fallside [11], in which the feedforward network is an MLP. Another popular network that fits into this class is Elman's Simple Recurrent Network (SRN) [3]. An Elman network can be thought of as a single layer network with an extra layer of nodes that compute the output function, as shown in Figure 1c.

Figure 1: Network architectures: (a) Narendra and Parthasarathy's recurrent neural network, (b) a multilayer network, and (c) an Elman network.

In locally recurrent networks the feedback is provided locally within each individual node, but the nodes are connected together in a feedforward architecture. Specifically, we consider nodes that have local output feedback, in which each node weights a window of its own past outputs and windows of node outputs from previous layers. Networks with local recurrence have been proposed in [1, 4, 10].
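To make the single-layer variants of Table 1 concrete, here is a minimal sketch of one synchronous update step for each of the four node types; the term sets follow the table as reconstructed above, while the tanh nonlinearity and all names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def term_vector(z, u, variant):
    """Collect the terms that a single-layer node weights (cf. Table 1).

    z : previous node outputs,  u : current inputs.
      'first'     -> bias, u_i, z_i
      'high'      -> z_i * u_j products only
      'bilinear'  -> u_i, z_i, z_i * u_j
      'quadratic' -> bias, u_i, z_i and all pairwise products.
    """
    zu = np.outer(z, u).ravel()
    if variant == "first":
        return np.concatenate(([1.0], u, z))
    if variant == "high":
        return zu
    if variant == "bilinear":
        return np.concatenate((u, z, zu))
    if variant == "quadratic":
        uu = np.outer(u, u)[np.triu_indices(len(u))]
        zz = np.outer(z, z)[np.triu_indices(len(z))]
        return np.concatenate(([1.0], u, z, uu, zu, zz))
    raise ValueError(variant)

def step(z, u, W, variant):
    """One synchronous update: every node weights the same term vector."""
    return np.tanh(W @ term_vector(z, u, variant))

# e.g. a quadratic network with 3 state nodes and a single input, as in the
# fixed-weight FSM experiment below: 15 terms per node, 45 weights in all.
z, u = np.zeros(3), np.array([1.0])
W = np.random.uniform(-0.1, 0.1, (3, len(term_vector(z, u, "quadratic"))))
print(step(z, u, W, "quadratic"))
```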
3 Experimental Results

3.1 Experimental methodology

In order to make the comparison as fair as possible we have adopted the following methodology.

• Resources. We perform two fundamental comparisons: one in which the number of weights is roughly the same for all networks, and another in which the number of states is equivalent. In either case, we make these numbers large enough that most of the networks can achieve interesting performance levels.

Number of weights. For static networks it is well known that generalization performance is related to the number of weights in the network. Although this theory has never been extended to recurrent neural networks, it seems reasonable that a similar result might apply. Therefore, in some experiments we keep the number of weights approximately equal across all networks.

Number of states. It can be argued that for dynamic problems the size of the state space is a more relevant measure for comparison than the number of weights. Therefore, in some experiments we keep the number of states equal across all networks.

• Vanilla learning. Several heuristics have been proposed to help speed learning and improve generalization of gradient descent learning algorithms. However, such heuristics may favor certain architectures. In order to avoid these issues, we have chosen simple gradient descent learning algorithms.

• Number of simulations. Due to random initial conditions, the recurrent neural network solutions can vary widely. Thus, to achieve a statistically significant estimate of the generalization of these networks, a large number of experiments were run.

3.2 Finite state machines

We chose two finite state machine (FSM) problems to compare the ability of the various recurrent networks to perform grammatical inference. The first problem is to learn the minimal, randomly generated six-state machine shown in Figure 2. The second problem is to infer a sixty-four state finite memory machine [6] described by the logic function

    y(k) = u(k-3)u(k) + u(k-3)y(k-3) + u(k)u'(k-3)y'(k-3),

where u(k) and y(k) represent the input and output respectively at time k, and x' represents the complement of x.

Figure 2: A randomly generated six-state finite state machine.

Two experiments were run. In the first experiment all of the networks were designed such that the number of weights was less than, but as close to, 60 as possible. In the second experiment, each network was restricted to six state variables and, if possible, designed to have approximately 75 weights. Where an architecture could be configured in more than one way with the same number of weights, several alternatives were tried and the configuration that gave the best results was used.

A complete set of 254 strings, consisting of all strings of length one through seven, is sufficient to uniquely identify both of these FSMs. For each simulation, we randomly partitioned the data into a training and a testing set consisting of 127 strings each. The strings were ordered lexicographically in the training set.

For each architecture 100 runs were performed on each problem. The on-line Back Propagation Through Time (BPTT) algorithm was used to train the networks. Vanilla learning was used with a learning rate of 0.5. Training was stopped at 1000 epochs. The weights of all networks were initialized to random values uniformly distributed in the range [-0.1, 0.1]. All states were initialized to zero at the beginning of each string, except for the High Order net, in which one state was arbitrarily initialized to a value of 1.
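For the finite memory machine, target labels can be generated directly from the logic function above, so the 254-string data set can be rebuilt along the following lines. The all-zero initial history, the reading of the complement bars, and the use of the final output as the class label are assumptions, since the paper does not spell them out (the random six-state machine of Figure 2 is not reproducible from the text).

```python
from itertools import product
import random

def fmm_output(bits):
    """Label a binary string with the finite memory machine
    y(k) = u(k-3)u(k) + u(k-3)y(k-3) + u(k)u'(k-3)y'(k-3),
    assuming the input/output history before time 0 is all zeros."""
    u_hist = [0, 0, 0] + list(bits)
    y_hist = [0, 0, 0]
    for k in range(len(bits)):
        u, u3, y3 = u_hist[k + 3], u_hist[k], y_hist[k]
        y_hist.append((u3 & u) | (u3 & y3) | (u & (1 - u3) & (1 - y3)))
    return y_hist[-1]                 # final output taken as the string's label

# All 254 binary strings of length 1..7, labelled and split 127/127 at random;
# the training half is then sorted lexicographically, as in the paper.
strings = [bits for n in range(1, 8) for bits in product((0, 1), repeat=n)]
data = [(s, fmm_output(s)) for s in strings]
random.shuffle(data)
train, test = sorted(data[:127]), data[127:]
print(len(strings), train[:3])
```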
Table 2: Percentage classification error on the FSM experiments for (a) networks with approximately the same number of weights and (b) networks with the same number of state variables. RND denotes the randomly generated six-state machine and FMM the finite memory machine. %P = the percentage of trials in which the training set was learned perfectly, #W = the number of weights, and #S = the number of states. Standard deviations are given in parentheses; a question mark denotes an illegible entry.

(a) Approximately equal number of weights†

    RND           Train err (std)   Test err (std)   %P   #W   #S
    N&P             2.8   (?)        16.9  (8.6)     22   56    8
    TDNN           12.5  (2.1)       33.8   (?)       0   56    8
    Gamma          19.6   (?)        24.8  (3.2)      0   56    8
    First Order    12.9  (6.9)       26.5  (9.0)      0   48    6
    High Order      0.8  (1.5)        6.2  (6.1)     60   50    5
    Bilinear        1.3  (2.7)        5.7  (6.1)     46   55    5
    Quadratic      12.9 (13.4)       17.7 (14.1)     12   45    3
    Multilayer     19.4 (13.6)       23.4 (13.5)      6   54    4
    Elman           3.5  (5.?)       12.7  (9.?)     27   55    6
    Local           2.8  (1.5)       26.7  (7.6)      4   60   20

    FMM           Train err (std)   Test err (std)   %P   #W   #S
    N&P             0.0  (0.2)        0.1  (1.?)     99   56    8
    TDNN            6.9  (2.1)       15.8  (3.2)      0   56    8
    Gamma           7.7  (2.2)       15.7  (3.3)      0   56    8
    First Order     4.8  (3.0)       16.0  (6.5)      1   48    6
    High Order      5.3  (4.0)       26.0  (5.1)      1   50    5
    Bilinear        9.5 (10.4)       25.8  (7.0)      0   55    5
    Quadratic      32.5 (10.8)       40.5  (7.3)      0   45    3
    Multilayer     36.7 (11.9)       43.5  (8.5)      0   54    4
    Elman          12.0 (12.5)       24.9  (7.9)      5   55    6
    Local           0.1  (0.3)        1.0  (3.0)     97   60   20

(b) Six state variables for every network††

    RND           Train err (std)   Test err (std)   %P   #W   #S
    N&P             4.6  (8.?)       14.1 (11.3)     38   73    6
    TDNN           11.7  (2.0)       34.3  (3.9)      0   73    6
    Gamma          19.0   (?)        25.2  (3.1)      0   73    6
    First Order    12.9  (6.9)       26.5  (9.0)      0   48    6
    High Order      0.3  (0.5)        4.6  (5.1)     79   72    6
    Bilinear        0.6  (0.9)        4.4   (?)      55   78    6
    Quadratic       0.2  (0.5)        3.2  (2.6)     83  216    6
    Multilayer     15.4 (14.1)       19.9   (?)      16   76    6
    Elman           3.5  (5.5)       12.7  (9.1)     27   55    6
    Local          13.9  (4.5)       20.2  (5.7)      0   26    6

    FMM           Train err (std)   Test err (std)   %P   #W   #S
    N&P             0.1  (0.8)        0.3  (1.4)     97   73    6
    TDNN            6.8  (1.7)       16.2  (2.9)      0   73    6
    Gamma           9.0  (2.9)       14.9  (2.8)      0   73    6
    First Order     4.8  (3.0)       16.0  (6.5)      1   48    6
    High Order      1.2  (1.7)       25.1  (5.1)     31   72    6
    Bilinear        2.6  (4.2)       20.3  (7.2)     21   78    6
    Quadratic      12.6 (17.3)       26.1 (12.8)     13  216    6
    Multilayer     38.1 (12.6)       42.8  (9.2)      0   76    6
    Elman          12.8   (?)        27.6 (10.7)      8   55    6
    Local          15.3  (3.8)       22.2  (4.9)      0   26    6

†The TDNN and Gamma network both had 8 input taps and 4 hidden layer nodes. For the Gamma network, μ = 0.3 (RND) and μ = 0.7 (FMM). Narendra and Parthasarathy's network had 4 input and output taps and 5 hidden layer nodes. The High-order network used a "one-hot" encoding of the input values [5]. The multilayer network had 4 hidden and output layer nodes. The locally recurrent net had 4 hidden layer nodes with 5 input and 3 output taps, and one output node with 3 input and output taps.

††The TDNN, Gamma network, and Narendra and Parthasarathy's network all had 8 hidden layer nodes. For the Gamma network, μ = 0.3 (RND) and μ = 0.7 (FMM). The High-order network again used a "one-hot" encoding of the input values. The multilayer network had 5 hidden and 6 output layer nodes. The locally recurrent net had 3 hidden layer nodes and one output layer node, all with only one input and output tap.
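Before turning to the results in Table 2, here is a rough sketch of the vanilla on-line BPTT training loop described in Section 3.2, specialised to the first-order architecture. The squared-error loss on the final output and the convention of reading the accept/reject decision from the first state component are our own assumptions, not details given in the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_first_order(strings, labels, n_states=6, epochs=1000, lr=0.5, seed=0):
    """On-line BPTT for a first-order single-layer network:
    z(k) = sigmoid(W z(k-1) + w_u u(k) + b), with the first state component
    read out as the accept/reject output after the last symbol of a string."""
    rng = np.random.default_rng(seed)
    W   = rng.uniform(-0.1, 0.1, (n_states, n_states))
    w_u = rng.uniform(-0.1, 0.1, n_states)
    b   = rng.uniform(-0.1, 0.1, n_states)
    for _ in range(epochs):
        for u_seq, target in zip(strings, labels):
            # forward pass, remembering the state trajectory for BPTT
            zs = [np.zeros(n_states)]
            for u in u_seq:
                zs.append(sigmoid(W @ zs[-1] + w_u * u + b))
            err = zs[-1][0] - target                    # d(0.5*err^2)/d(output)
            dz = np.zeros(n_states)
            dz[0] = err                                 # dL/dz(K)
            gW, gu, gb = np.zeros_like(W), np.zeros_like(w_u), np.zeros_like(b)
            for k in range(len(u_seq), 0, -1):          # backward through time
                da = dz * zs[k] * (1.0 - zs[k])         # through the sigmoid
                gW += np.outer(da, zs[k - 1])
                gu += da * u_seq[k - 1]
                gb += da
                dz = W.T @ da                           # dL/dz(k-1)
            W -= lr * gW                                # vanilla gradient step,
            w_u -= lr * gu                              # one update per string
            b -= lr * gb
    return W, w_u, b
```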
Table 2 summarizes the statistics for each experiment. From these results we draw the following conclusions.

• The bilinear and high-order networks do best on the small randomly generated machine, but poorly on the finite memory machine. Thus, it would appear that there is benefit to having second order terms in the network, at least for small finite state machine problems.

• Narendra and Parthasarathy's model and the network with local recurrence do far better than the other networks on the problem of inferring the finite memory machine when the number of states is not constrained. It is not surprising that the former network did so well, since the sequential machine implementation of a finite memory machine is similar to this architecture [6]. However, the result for the locally recurrent network was unexpected.

• All of the recurrent networks do better than the TDNN on the small random machine. However, on the finite memory machine the TDNN does surprisingly well, perhaps because its structure is similar to Narendra and Parthasarathy's network, which was well suited for this problem.

• Gradient-based learning algorithms are not adequate for many of these architectures. In many cases a network is capable of representing a solution to a problem that the algorithm was not able to find. This seems particularly true for the Multilayer network.

• Not surprisingly, an increase in the number of weights typically leads to overtraining. The quadratic network, however, which has 216 weights, consistently finds solutions for the random machine that generalize well even though there are only 127 training samples.

• Although performance on the training set is not always a good indicator of generalization performance on the testing set, we find that if a network is able to frequently find perfect solutions for the training data, then it also does well on the testing data.

3.3 Nonlinear system identification

In this problem, we train the network to learn the dynamics of the following set of equations, proposed in [8],

    x1(k+1) = [x1(k) + 2 x2(k)] / [1 + x2^2(k)] + u(k)
    x2(k+1) = [x1(k) x2(k)] / [1 + x2^2(k)] + u(k)
    y(k)    = x1(k) + x2(k)

based on observations of u(k) and y(k) alone.
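Below is a minimal sketch of the plant as reconstructed above; it could be used to generate the identification data. The excitation and the sampling of the sinusoidal test signal shown in the example follow the description later in this section and are illustrative assumptions only.

```python
import numpy as np

def simulate_plant(u_seq, x1=0.0, x2=0.0):
    """Iterate the two-state plant above and return the observed outputs y(k).
    Only u(k) and y(k) would be visible to the identification network."""
    ys = []
    for u in u_seq:
        ys.append(x1 + x2)                                    # y(k) = x1(k) + x2(k)
        x1, x2 = (x1 + 2.0 * x2) / (1.0 + x2 ** 2) + u, \
                 (x1 * x2) / (1.0 + x2 ** 2) + u              # state update
    return np.array(ys)

# A training-style excitation (uniform noise in [-2, 2], length 50, zero initial
# state) and a sampled version of the sinusoidal test input.
rng = np.random.default_rng(0)
y_train = simulate_plant(rng.uniform(-2.0, 2.0, size=50))
y_test = simulate_plant(np.sin(0.04 * np.arange(500)))
```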
The same networks that were used for the finite state machine problems were used here, except that the output node was changed to be linear instead of sigmoidal to allow the network to have an appropriate dynamic range. We found that this caused some stability problems in the quadratic and locally recurrent networks. For the fixed number of weights comparison, we added an extra node to the quadratic network and dropped any second order terms involving the fed-back output, giving a network with 64 weights and 4 states. For the fixed state comparison, dropping the second order terms gave a network with 174 weights. The locally recurrent network presented stability problems only for the fixed number of weights comparison. Here, we used a network that had 6 hidden layer nodes and one output node with 2 taps on the inputs and outputs each, giving a network with 57 weights and 16 states. In the Gamma network a value of μ = 0.8 gave the best results.

The networks were trained with 100 uniform random noise sequences of length 50. Each experiment used a different randomly generated training set. The noise was uniformly distributed in the range [-2.0, 2.0], and each sequence started from the initial state x1(0) = x2(0) = 0. The networks were tested on the response to a sine wave of frequency 0.04 radians/second. This is an interesting test signal because it is fundamentally different from the training data.

Fifty runs were performed for each network. BPTT was used for 500 epochs with a learning rate of 0.002. The weights of all networks were initialized to random values uniformly distributed in the range [-0.1, 0.1].

Table 3: Normalized mean squared error on a sinusoidal test signal for the nonlinear system identification experiment.

    Architecture   Fixed # weights   Fixed # states
    N&P                0.101             0.067
    TDNN               0.160             0.165
    Gamma              0.157             0.151
    First Order        0.105             0.105
    High Order         1.034             1.050
    Bilinear           0.118             0.111
    Quadratic          0.108             0.096
    Multilayer         0.096             0.084
    Elman              0.115             0.115
    Local              0.117             0.123

Table 3 shows the normalized mean squared error averaged over the 50 runs on the testing set. From these results we draw the following conclusions.

• The high order network seemed unable to match the dynamic range of its output to the target, and as a result it performed much worse than the other networks. It is clear that there is benefit to adding first order terms, since the bilinear network performed so much better.

• Aside from the high order network, all of the other recurrent networks performed better than the TDNN, although in most cases not significantly better.

• The multilayer network performed exceptionally well on this problem, unlike the finite state machine experiments. We speculate that the existence of a target output at every point along the sequence (unlike the finite state machine problems) is important for the multilayer network to be successful.

• Narendra and Parthasarathy's architecture did exceptionally well, even though it is not clear that its structure is well matched to the problem.

4 Conclusions

We have reviewed many discrete-time recurrent neural network architectures and compared them on two different problem domains, although we make no claim that any of these results will necessarily extend to other problems.

Narendra and Parthasarathy's model performed exceptionally well on the problems we explored. In general, single layer networks did fairly well; however, it is important to include terms besides simple state/input products for nonlinear system identification. All of the recurrent networks usually did better than the TDNN except on the finite memory machine problem. In these experiments, the use of averaging filters as a substitute for taps in the TDNN did not seem to offer any distinct advantage in performance, although better results might be obtained if the value of μ is adapted.

We found that the relative comparison of the networks did not change significantly whether the number of weights or the number of states was held constant. In fact, holding one of these values constant meant that in some networks the other value varied wildly, yet there appeared to be little correlation with generalization.
Finally, it is interesting to note that, though some are much better than others, many of these networks are capable of providing adequate solutions to two seemingly disparate problems.

Acknowledgements

We would like to thank Leon Personnaz and Isabelle Rivals for suggesting we perform the experiments with a fixed number of states.

References

[1] A.D. Back and A.C. Tsoi. FIR and IIR synapses, a new neural network architecture for time series modeling. Neural Computation, 3(3):375-385, 1991.
[2] B. de Vries and J.C. Principe. The gamma model: A new neural model for temporal processing. Neural Networks, 5:565-576, 1992.
[3] J.L. Elman. Finding structure in time. Cognitive Science, 14:179-211, 1990.
[4] P. Frasconi, M. Gori, and G. Soda. Local feedback multilayered networks. Neural Computation, 4:120-130, 1992.
[5] C.L. Giles, C.B. Miller, et al. Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4:393-405, 1992.
[6] Z. Kohavi. Switching and Finite Automata Theory. McGraw-Hill, NY, 1978.
[7] K.J. Lang, A.H. Waibel, and G.E. Hinton. A time-delay neural network architecture for isolated word recognition. Neural Networks, 3:23-44, 1990.
[8] K.S. Narendra. Adaptive control of dynamical systems using neural networks. In Handbook of Intelligent Control, pages 141-183. Van Nostrand Reinhold, NY, 1992.
[9] K.S. Narendra and K. Parthasarathy. Identification and control of dynamical systems using neural networks. IEEE Trans. on Neural Networks, 1:4-27, 1990.
[10] P. Poddar and K.P. Unnikrishnan. Non-linear prediction of speech signals using memory neuron networks. In Proc. 1991 IEEE Workshop on Neural Networks for Signal Processing, pages 1-10. IEEE Press, 1991.
[11] A.J. Robinson and F. Fallside. Static and dynamic error propagation networks with application to speech coding. In NIPS, pages 632-641. AIP, NY, 1988.
[12] R.L. Watrous and G.M. Kuhn. Induction of finite-state automata using second-order recurrent networks. In NIPS 4, pages 309-316, 1992.
", "award": [], "sourceid": 1009, "authors": [{"given_name": "Bill", "family_name": "Horne", "institution": null}, {"given_name": "C.", "family_name": "Giles", "institution": null}]}