{"title": "A Unified Gradient-Descent/Clustering Architecture for Finite State Machine Induction", "book": "Advances in Neural Information Processing Systems", "page_first": 19, "page_last": 26, "abstract": null, "full_text": "A  Unified  Gradient-Descent/Clustering \n\nArchitecture for \n\nFinite  State Machine  Induction \n\nSreerupa Das  and Michael C.  Mozer \n\nDepartment of Computer Science \n\nUniversity  of Colorado \nBoulder,  CO  80309-0430 \n\nAbstract \n\nAlthough  recurrent  neural  nets  have  been  moderately  successful \nin  learning  to  emulate  finite-state  machines  (FSMs),  the  continu(cid:173)\nous  internal  state  dynamics  of a  neural  net  are  not  well  matched \nto  the  discrete  behavior  of an  FSM.  We  describe  an  architecture, \ncalled  DOLCE, that allows discrete  states to evolve in a net as learn(cid:173)\ning  progresses.  DOLCE  consists  of a  standard  recurrent  neural  net \ntrained  by  gradient  descent  and  an  adaptive  clustering  technique \nthat quantizes  the state space.  DOLCE  is  based  on the assumption \nthat a  finite  set  of discrete  internal states  is  required  for  the  task, \nand that the actual network state belongs  to this  set  but has been \ncorrupted by noise  due  to inaccuracy in the weights.  DOLCE  learns \nto  recover  the discrete  state with  maximum a  posteriori  probabil(cid:173)\nity  from  the  noisy  state.  Simulations  show  that  DOLCE  leads  to  a \nsignificant  improvement in  generalization  performance  over  earlier \nneural  net  approaches  to FSM  induction. \n\n1 \n\nINTRODUCTION \n\nResearchers  often try to understand-post hoc-representations that emerge in the \nhidden  layers  of a  neural  net  following  training.  Interpretation  is  difficult  because \nthese  representations  are  typically  highly  distributed  and  continuous.  By  \"contin(cid:173)\nuous,\"  we  mean  that if one  constructed  a  scatterplot  over  the hidden  unit activity \nspace  of patterns obtained  in  response  to various  inputs,  examination at any scale \nwould  reveal  the  patterns  to be broadly distributed  over  the space. \n\nContinuous representations  aren't always appropriate.  Many task domains seem  to \nrequire  discrete  representations-representations  selected  from  a  finite  set  of alter(cid:173)\nnatives.  If a neural net learned a discrete representation,  the scatterplot over hidden \nactivity space would show points to be superimposed at fine  scales of analysis.  Some \n\n19 \n\n\f20 \n\nDas and Mozer \n\nexamples  of domains in  which  discrete  representations  might  be desirable  include: \nfinite-state  machine  emulation,  data  compression,  language  and  higher  cognition \n(involving discrete  symbol processing),  and categorization in the context of decision \nmaking.  In  such  domains,  standard  neural  net  learning  procedures,  which  have \na  propensity  to produce  continuous  representations,  may not  be  appropriate.  The \nwork we report here involves designing an inductive bias into the learning procedure \nin order  to encourage  the formation of discrete  internal  representations. \n\nIn  the  recent  years,  various  approaches  have  been  explored  for  learning  discrete \nrepresentations using neural networks (McMillan,  Mozer, & Smolensky, 1992; Mozer \n&  Bachrach,  1990;  Mozer  &  Das,  1993;  Schiitze,  1993;  Towell  &  Shavlik,  1992). 
However, these approaches are domain specific, making strong assumptions about the nature of the task. In our work, we describe a general methodology that makes no assumption about the domain to which it is applied, beyond the fact that discrete representations are desirable.

2 FINITE STATE MACHINE INDUCTION

We illustrate the methodology using the domain of finite-state machine (FSM) induction. An FSM defines a class of symbol strings. For example, the class (10)+ consists of all strings with one or more repetitions of 10; 101010 is a positive example of the class, 111 is a negative example. An FSM consists principally of a finite set of states and a function that maps the current state and the current symbol of the string into a new state. Certain states of the FSM are designated \"accept\" states, meaning that if the FSM ends up in these states, the string is a member of the class. The induction problem is to infer an FSM that parsimoniously characterizes the positive and negative exemplars, and hence characterizes the underlying class.

A generic recurrent net architecture that could be used for FSM emulation and induction is shown on the left side of Figure 1. A string is presented to the input layer of the net, one symbol at a time. Following the end of the string, the net should output whether or not the string is a member of the class. The hidden unit activity pattern at any point during presentation of a string corresponds to the internal state of an FSM.

Figure 1: On the left is a generic recurrent architecture that could be used for FSM induction. Each box corresponds to a layer of units, and arrows depict complete connectivity between layers. At each time step, a new symbol is presented on the input and the input and hidden representations are integrated to form a new hidden representation. On the right is the general architecture of DOLCE.

Such a net, trained by a gradient descent procedure, is able to learn to perform this or related tasks (Elman, 1990; Giles et al., 1992; Pollack, 1991; Servan-Schreiber, Cleeremans, & McClelland, 1991; Watrous & Kuhn, 1992). Although these models have been relatively successful in learning to emulate FSMs, the continuous internal state dynamics of a neural net are not well matched to the discrete behavior of FSMs. Roughly, regions of hidden unit activity space can be identified with states in an FSM, but because the activities are continuous, one often observes the network drifting from one state to another. This occurs especially with input strings longer than those on which the network was trained.
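As a concrete illustration of this generic architecture, the following Python sketch (ours, not from the original paper) presents a string one symbol at a time and reads out class membership after the final symbol. The layer sizes, the use of simple first-order weights, and all names are illustrative assumptions; the networks used in this paper have second-order connections into the hidden layer.

```python
# Minimal sketch of the generic recurrent architecture on the left of Figure 1 (our
# illustration, with assumed names and sizes): one symbol per time step, the hidden
# state is updated, and a single output unit signals class membership at the end.
import numpy as np

rng = np.random.default_rng(0)
H = 5                       # number of hidden units (assumed)
SYMBOLS = {"0": 0, "1": 1}  # the two-symbol alphabet used by the Tomita languages

W_in = rng.uniform(-0.25, 0.25, (H, len(SYMBOLS)))   # input-to-hidden weights
W_rec = rng.uniform(-0.25, 0.25, (H, H))             # hidden-to-hidden weights
w_out = rng.uniform(-0.25, 0.25, H)                  # hidden-to-output weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_net(string):
    """Present one symbol per time step; return P(string is in the class)."""
    h = np.zeros(H)                        # initial hidden state
    for ch in string:
        x = np.zeros(len(SYMBOLS))
        x[SYMBOLS[ch]] = 1.0               # one-hot code for the current symbol
        h = sigmoid(W_in @ x + W_rec @ h)  # integrate input and previous hidden state
    return sigmoid(w_out @ h)              # class membership read out after the last symbol

print(run_net("101010"))   # untrained net: an arbitrary value in (0, 1)
```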
To achieve more robust dynamics, one might consider quantizing the hidden state. Two approaches to quantization have been explored previously. In the first, a net is trained in the manner described above. After training, the hidden state space is partitioned into disjoint regions and each hidden activity pattern is then discretized by mapping it to the center of its corresponding region (Das & Das, 1991; Giles et al., 1992). In a second approach, quantization is enforced during training by mapping the hidden state at each time step to the nearest corner of a [0,1]^n hypercube (Zeng, Goodman, & Smyth, 1993).

Each of these approaches has its limitations. In the first approach, because learning does not take the subsequent quantization into account, the hidden activity patterns that result from learning may not lie in natural clusters. Consequently, the quantization step may not group together activity patterns that correspond to the same state. In the second approach, the quantization process causes the error surface to have discontinuities and to be flat in local neighborhoods of the weight space. Hence, gradient descent learning algorithms cannot be used; instead, even more heuristic approaches are required. To overcome the limitations of these approaches, we have pursued an approach in which quantization is an integral part of the learning process.

3 DOLCE

Our approach incorporates a clustering module into the recurrent net architecture, as shown on the right side of Figure 1. The hidden layer activities are processed by the clustering module before being passed on to other layers. The clustering module maps regions in hidden state space to a single point in the same space, effectively partitioning or clustering the hidden state space. Each cluster corresponds to a discrete internal state. The clusters are adaptive and dynamic, changing over the course of learning. We call this architecture DOLCE, for dynamic on-line clustering and state extraction.

The DOLCE architecture may be explored along two dimensions: (1) the clustering algorithm used (e.g., a Gaussian mixture model, ISODATA, the Forgy algorithm, vector quantization schemes), and (2) whether supervised or unsupervised training is used to identify the clusters. In unsupervised mode, the performance error on the FSM induction task has no effect on the operation of the clustering algorithm; instead, an internal criterion characterizes the goodness of clusters. In supervised mode, the primary measure that affects the goodness of a cluster is the performance error. Regardless of the training mode, all clustering algorithms incorporate a pressure to produce a small number of clusters. Additionally, as we elaborate more specifically below, the algorithms must allow for a soft or continuous clustering during training, in order to be integrated into a gradient-based learning procedure.

We have explored two possibilities for the clustering module. The first involves the use of Forgy's algorithm in an unsupervised mode. Forgy's (1965) algorithm determines both the number of clusters and the partitioning of the space. The second uses a Gaussian mixture model in a supervised mode, where the mixture model parameters are adjusted so as to minimize the performance error. Both approaches were successful, but as the latter approach obtained better results, we describe it in the next section.
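To make the data flow concrete, here is a brief continuation of the earlier sketch (again our own illustration, reusing the definitions of run_net above): the only change is that each new hidden state is passed through a clustering module before it is used. The cluster argument is a placeholder; the rounding example shown corresponds to rigid hypercube-corner quantization, whereas Section 4 replaces it with a soft MAP estimate under a Gaussian mixture model.

```python
# Sketch of DOLCE's forward pass (our illustration; reuses np, H, SYMBOLS, sigmoid, and
# the weight matrices defined in the previous fragment). Each new hidden state is mapped
# by a clustering module before being passed on for recurrence and for the output.
def run_dolce(string, cluster):
    h = np.zeros(H)
    for ch in string:
        x = np.zeros(len(SYMBOLS))
        x[SYMBOLS[ch]] = 1.0
        h = sigmoid(W_in @ x + W_rec @ h)
        h = cluster(h)            # replace the noisy hidden state by a discrete state
    return sigmoid(w_out @ h)

# Placeholder clustering: round to the nearest hypercube corner (the rigid quantization
# of Zeng et al., 1993). DOLCE instead uses the adaptive mixture model of Section 4.
print(run_dolce("1010", cluster=np.round))
```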
4 CLUSTERING USING A MIXTURE MODEL

Here we motivate the incorporation of a Gaussian mixture model into DOLCE, using an argument that gives the approach a solid theoretical foundation. Several assumptions underlie the approach. First, we assume that the task faced by DOLCE is such that it requires a finite set of internal or true states, C = {c_1, c_2, ..., c_T}. This is simply the premise that motivates this line of work. Second, we assume that any observed hidden state (i.e., a hidden activity pattern that results from presentation of a symbol sequence) belongs to C but has been corrupted by noise due to inaccuracy in the network weights. Third, we assume that this noise is Gaussian and decreases as learning progresses (i.e., as the weights are adjusted to better perform the task). These assumptions are depicted in Figure 2.

Figure 2: Two dimensions of a typical state space. The true states needed to perform the task are c_1, c_2, and c_3, while the observed hidden states, assumed to be corrupted by noise, are distributed about the c_i.

Based on these assumptions, we construct a Gaussian mixture distribution that models the observed hidden states:

    p(h \mid C, \sigma, q) = \sum_{i=1}^{T} \frac{q_i}{(2\pi\sigma_i^2)^{H/2}} e^{-|h - c_i|^2 / 2\sigma_i^2}

where h denotes an observed hidden state, σ_i^2 the variance of the noise that corrupts state c_i, q_i the prior probability that the true state is c_i, and H the dimensionality of the hidden state space. For pedagogical purposes, assume for the time being that the parameters of the mixture distribution (T, C, σ, and q) are all known; in a later section we discuss how these parameters are determined.

Given a noisy observed hidden state, h, DOLCE computes the maximum a posteriori (MAP) estimator of h in C. This estimator then replaces the noisy state and is used in all subsequent computation. The MAP estimator, \hat{h}, is computed as follows. The probability of an observed state h being generated by a given true state i is

    p(h \mid \text{true state } i) = (2\pi\sigma_i^2)^{-H/2} e^{-|h - c_i|^2 / 2\sigma_i^2}.

Using Bayes' rule, one can compute the posterior probability of true state i, given that h has been observed:

    p(\text{true state } i \mid h) = \frac{p(h \mid \text{true state } i)\, q_i}{\sum_j p(h \mid \text{true state } j)\, q_j}.

Finally, the MAP estimator is given by \hat{h} = c_{\arg\max_i p(\text{true state } i \mid h)}. However, because learning requires that DOLCE's dynamics be differentiable, we use a soft version of MAP, which involves using \tilde{h} = \sum_i c_i\, p(\text{true state } i \mid h) instead of \hat{h} and incorporating a \"temperature\" parameter into σ_i as described below.
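The following Python fragment (our own sketch, not the authors' code) implements this soft MAP computation directly from the formulas above; the argument names c (cluster centers), var (noise variances σ_i^2), and q (priors) are ours.

```python
# Soft MAP clustering step of Section 4 (our illustration): given a noisy hidden state h,
# compute the posterior over the T true states and return the posterior-weighted average
# of the cluster centers.
import numpy as np

def soft_map(h, c, var, q):
    """Soft MAP estimate of the true state underlying noisy hidden state h."""
    H = h.shape[0]
    # log p(h | true state i) under an isotropic Gaussian centered at c_i
    log_lik = -np.sum((h - c) ** 2, axis=1) / (2 * var) - (H / 2) * np.log(2 * np.pi * var)
    log_post = log_lik + np.log(q)
    post = np.exp(log_post - log_post.max())   # subtract the max for numerical stability
    post /= post.sum()                         # p(true state i | h), via Bayes' rule
    return post @ c                            # soft MAP: sum_i c_i p(true state i | h)

# Toy usage: three well-separated centers; as the variances shrink, the soft MAP estimate
# approaches the hard MAP estimate, i.e., the single nearest center.
c = np.array([[0.1, 0.1], [0.9, 0.1], [0.5, 0.9]])
q = np.ones(3) / 3
h = np.array([0.15, 0.2])
print(soft_map(h, c, var=np.full(3, 0.5), q=q))    # high noise: estimate pulled toward all centers
print(soft_map(h, c, var=np.full(3, 0.01), q=q))   # low noise: estimate is essentially c_1
```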
An important parameter in the mixture model is T, the number of true states (Gaussian bumps). Because T directly corresponds to the number of states in the target FSM, if T is chosen too small, DOLCE cannot emulate the FSM. Consequently, we set T to a large value, and the training procedure includes a technique for eliminating unnecessary true states. (If the initially selected T is not large enough, the training procedure will not converge to zero error on the training set, and the procedure can be restarted with a larger value of T.)

At the start of training, each Gaussian center, c_i, is initialized to a random location in the hidden state space. The standard deviation of each Gaussian, σ_i, is initially set to a large value. The priors, q_i, are set to 1/T. The weights are set to initial values chosen from the uniform distribution in [-.25, .25]. All connection weights feeding into the hidden layer are second order.

The network weights and mixture model parameters (C, σ, and q) are adjusted by gradient descent in a cost measure C. This cost includes two components: (a) the performance error, E, which is a squared difference between the actual and target network output following presentation of a training string, and (b) a complexity cost, which is the entropy of the prior distribution, q:

    C = E - \lambda \sum_{i=1}^{T} q_i \log q_i,

where λ is a regularization parameter. The complexity cost is minimal when only one Gaussian has a nonzero prior, and maximal when all priors are equal. Hence, the cost encourages unnecessary Gaussians to drop out of the mixture model.

The particular gradient descent procedure used is a generalization of back propagation through time (Rumelhart, Hinton, & Williams, 1986) that incorporates the mixture model. To better condition the search space and to avoid a constrained search, optimization is performed not over σ and q directly but rather over hyperparameters a and b, where

    \sigma_i^2 = e^{a_i} / \beta \quad \text{and} \quad q_i = \frac{e^{-b_i}}{\sum_j e^{-b_j}}.
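The fragment below (again our own sketch, with assumed names) shows how these quantities might be computed: the mixture parameters are derived from the unconstrained hyperparameters a and b, the cost adds the entropy penalty on the priors to the performance error, and the toy loop anticipates the annealing of β described next.

```python
# Reparameterization and cost used for training (our illustration, not the authors' code):
# sigma_i^2 = exp(a_i)/beta, q = softmax(-b), and cost = performance error + lambda * entropy(q).
import numpy as np

def mixture_params(a, b, beta):
    """Return the noise variances and priors implied by hyperparameters a and b."""
    var = np.exp(a) / beta
    q = np.exp(-b - np.max(-b))      # shift the exponent for numerical stability
    q = q / q.sum()
    return var, q

def total_cost(perf_error, q, lam):
    """Performance error plus lambda times the entropy of the prior distribution q."""
    entropy = -np.sum(q * np.log(q + 1e-12))   # minimal when a single prior dominates
    return perf_error + lam * entropy

# Toy usage: as the training error falls, beta (tied here to 1/error, one of many possible
# annealing schedules) grows and the Gaussians sharpen.
a = np.zeros(4)
b = np.array([0.0, 0.1, 3.0, 3.0])
for err in [1.0, 0.1, 0.01]:
    beta = 1.0 / err
    var, q = mixture_params(a, b, beta)
    print(err, var.round(4), q.round(3), total_cost(err, q, lam=0.1).round(3))
```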
The global parameter β scales the overall spread of the Gaussians, which corresponds to the level of noise in the model. As performance on the training set improves, we assume that the network weights are coming to better approximate the target weights, and hence that the level of noise is decreasing. Thus, we tie β to the performance error E. We have used various annealing schedules and DOLCE appears robust to this variation; we currently use β ∝ 1/E. Note that as E → 0, β → ∞ and the probability density under one Gaussian at h will become infinitely greater than the density under any other; consequently, the soft MAP estimator, \tilde{h}, becomes equivalent to the MAP estimator, \hat{h}, and the transformed hidden state becomes discrete. A schematic depiction of the probability landscape both before and after training is shown in Figure 3.

Figure 3: A schematic depiction of the hidden state space before and after training. The horizontal plane represents the state space. The bumps indicate the probability density under the mixture model. Observed hidden states are represented by small open circles.

5 SIMULATION STUDIES

The network was trained on a set of regular languages first studied by Tomita (1982). The languages, which use only the symbols 0 and 1, are: (1) 1*; (2) (10)*; (3) no odd number of consecutive 1's is directly followed by an odd number of consecutive 0's; (4) any string not containing the substring \"000\"; (5) [(01|10)(01|10)]*; (6) the difference between the number of ones and the number of zeros in the string is a multiple of three; and (7) 0*1*0*1*.

A fixed training corpus of strings was generated for each language, with an equal number of positive and negative examples. The maximum string length varied from 5 to 10 symbols and the total number of examples varied from 50 to 150, depending on the difficulty of the induction task.

Each string was presented one symbol at a time, after which DOLCE was given a target output that specified whether the string was a positive or negative example of the language. Training continued until DOLCE converged on a set of weights and mixture model parameters. Because we assume that the training examples are correctly classified, the error E on the training set should go to zero when DOLCE has learned. If this did not happen on a given training run, we restarted the simulation with different initial random weights.

For each language, ten replications of DOLCE (with the supervised mixture model) were trained, each with different random initial weights. The learning rate and regularization parameter λ were chosen for each language by quick experimentation with the aim of maximizing the likelihood of convergence on the training set. We also trained a version of DOLCE that clustered using the unsupervised Forgy algorithm, as well as several alternative neural net approaches: a generic recurrent net, as shown on the left side of Figure 1, which used no clustering [NC]; a version with rigid quantization during training [RQ], comparable to the earlier work of Zeng, Goodman, and Smyth (1993); and a version in which the unsupervised Forgy algorithm was used to quantize the hidden state following training [LQ], comparable to the earlier work of Das and Das (1991). In these alternative approaches, we used the same architecture as DOLCE except for the clustering procedure. We selected learning parameters to optimize performance on the training set, ran ten replications for each language, replaced runs that did not converge, and used the same training sets.
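As an illustration of how such a corpus can be constructed (the paper does not give its generation procedure; the function names and the choice of language 4 are ours), the following sketch draws random strings and keeps a balanced set of positive and negative examples.

```python
# Sketch of balanced training-corpus generation for Tomita language 4, "any string not
# containing the substring 000" (our illustration; sizes and lengths are assumptions).
import random

def in_language_4(s):
    """Membership test for Tomita language 4: accept iff '000' does not occur in s."""
    return "000" not in s

def make_corpus(n_per_class=25, max_len=10, seed=0):
    rng = random.Random(seed)
    pos, neg = set(), set()
    while len(pos) < n_per_class or len(neg) < n_per_class:
        s = "".join(rng.choice("01") for _ in range(rng.randint(1, max_len)))
        (pos if in_language_4(s) else neg).add(s)
    return sorted(pos)[:n_per_class], sorted(neg)[:n_per_class]

positives, negatives = make_corpus()
print(positives[:5], negatives[:5])
```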
6 RESULTS AND CONCLUSION

In Figure 4, we compare the generalization performance of DOLCE, in both the unsupervised Forgy [DF] and supervised mixture model [DG] versions, to the NC, RQ, and LQ approaches. Generalization performance was tested using 3000 strings not in the training set, half positive examples and half negative. The two versions of DOLCE outperformed the alternative neural net approaches, and the DG version of DOLCE consistently outperformed the DF version.

Figure 4: Each graph depicts generalization performance on one of the Tomita languages for five alternative neural net approaches: no clustering [NC], rigid quantization [RQ], learn then quantize [LQ], DOLCE in unsupervised mode using Forgy's algorithm [DF], and DOLCE in supervised mode using the mixture model [DG]. The vertical axis shows the number of misclassifications of 3000 test strings. Each bar is the average result across 10 replications with different initial weights.

To summarize, we have described an approach that incorporates an inductive bias into a learning algorithm in order to encourage the evolution of discrete representations during training. This approach is quite general and can be applied to domains other than grammaticality judgement where discrete representations might be desirable. Also, this approach is not specific to recurrent networks and may be applied to feedforward networks. We are now in the process of applying DOLCE to a much larger, real-world problem that involves predicting the next symbol in a string. The data base comes from a case study in software engineering, where each symbol represents an operation in the software development process. This data is quite noisy and it is unlikely that the data can be parsimoniously described by an FSM. Nonetheless, our initial results are encouraging: DOLCE produces predictions at least three times more accurate than a standard recurrent net without clustering.

Acknowledgements

This research was supported by NSF Presidential Young Investigator award IRI-9058450 and grant 90-21 from the James S. McDonnell Foundation.

References

S. Das & R. Das. (1991) Induction of discrete state-machine by stabilizing a continuous recurrent network using clustering. Computer Science and Informatics 21(2):35-40. Special Issue on Neural Computing.

J.L. Elman. (1990) Finding structure in time. Cognitive Science 14:179-212.

E. Forgy. (1965) Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21:768-780.

M.C. Mozer & J.D. Bachrach. (1990) Discovering the structure of a reactive environment by exploration. Neural Computation 2(4):447-457.

C. McMillan, M.C. Mozer, & P. Smolensky. (1992) Rule induction through integrated symbolic and subsymbolic processing. In J.E. Moody, S.J. Hanson, & R.P. Lippmann (eds.), Advances in Neural Information Processing Systems 4, 969-976. San Mateo, CA: Morgan Kaufmann.

C.L. Giles, D. Chen, C.B. Miller, H.H. Chen, G.Z. Sun, & Y.C. Lee. (1992) Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation 4(3):393-405.

H. Schütze. (1993) Word space. In S.J. Hanson, J.D. Cowan, & C.L. Giles (eds.), Advances in Neural Information Processing Systems 5, 895-902. San Mateo, CA: Morgan Kaufmann.

M. Tomita. (1982) Dynamic construction of finite-state automata from examples using hill-climbing. Proceedings of the Fourth Annual Conference of the Cognitive Science Society, 105-108.

G. Towell & J. Shavlik. (1992) Interpretation of artificial neural networks: mapping knowledge-based neural networks into rules. In J.E. Moody, S.J. Hanson, & R.P. Lippmann (eds.), Advances in Neural Information Processing Systems 4, 977-984. San Mateo, CA: Morgan Kaufmann.
R.L. Watrous & G.M. Kuhn. (1992) Induction of finite state languages using second-order recurrent networks. In J.E. Moody, S.J. Hanson, & R.P. Lippmann (eds.), Advances in Neural Information Processing Systems 4, 969-976. San Mateo, CA: Morgan Kaufmann.

Z. Zeng, R. Goodman, & P. Smyth. (1993) Learning finite state machines with self-clustering recurrent networks. Neural Computation 5(6):976-990.
", "award": [], "sourceid": 846, "authors": [{"given_name": "Sreerupa", "family_name": "Das", "institution": null}, {"given_name": "Michael", "family_name": "Mozer", "institution": null}]}