{"title": "Language Induction by Phase Transition in Dynamical Recognizers", "book": "Advances in Neural Information Processing Systems", "page_first": 619, "page_last": 626, "abstract": null, "full_text": "Language Induction by Phase Transition in Dynamical Recognizers\n\nJordan B. Pollack\n\nLaboratory for AI Research\nThe Ohio State University\nColumbus, OH 43210\n\npollack@cis.ohio-state.edu\n\nAbstract\n\nA higher order recurrent neural network architecture learns to recognize and generate languages after being \"trained\" on categorized exemplars. Studying these networks from the perspective of dynamical systems yields two interesting discoveries: First, a longitudinal examination of the learning process illustrates a new form of mechanical inference: Induction by phase transition. A small weight adjustment causes a \"bifurcation\" in the limit behavior of the network. This phase transition corresponds to the onset of the network's capacity for generalizing to arbitrary-length strings. Second, a study of the automata resulting from the acquisition of previously published languages indicates that while the architecture is NOT guaranteed to find a minimal finite automaton consistent with the given exemplars, which is an NP-Hard problem, the architecture does appear capable of generating non-regular languages by exploiting fractal and chaotic dynamics. I end the paper with a hypothesis relating linguistic generative capacity to the behavioral regimes of non-linear dynamical systems. 
\n\n1  Introduction\n\nI expose a recurrent high-order back-propagation network to both positive and negative examples of boolean strings, and report that although the network does not find the minimal-description finite state automata for the languages (which is NP-Hard (Angluin, 1978)), it does induction in a novel and interesting fashion, and searches through a hypothesis space which, theoretically, is not constrained to machines of finite state. These results are of import to many related neural models currently under development, e.g. (Elman, 1990; Giles et al., 1990; Servan-Schreiber et al., 1989), and relate ultimately to the question of how linguistic capacity can arise in nature.\n\nAlthough the transitions among states in a finite-state automaton are usually thought of as being fully specified by a table, a transition function can also be specified as a mathematical function of the current state and the input. It is known from (McCulloch & Pitts, 1943) that even the most elementary modeling assumptions yield finite-state control, and it is worth reiterating that any network with the capacity to compute arbitrary boolean functions (say, as logical sums of products) (Lapedes & Farber; White & Hornik) can be used recurrently to implement arbitrary finite state machines.\n\nFrom a different point of view, a recurrent network with a state evolving across k units can be considered a k-dimensional discrete-time continuous-space dynamical system, with a precise initial condition, $z_k(0)$, and a state space in $Z$, a subspace of $R^k$. The governing function, $F$, is parameterized by a set of weights, $W$, and merely computes the next state from the current state and input, $y_j(t)$, a finite sequence of patterns representing tokens from some alphabet $\\Sigma$:\n\n$$z_k(t+1) = F_W(z_k(t), y_j(t))$$\n\nIf we view one of the dimensions of this system, say $z_a$, as an \"acceptance\" dimension, we can define the language accepted by such a Dynamical Recognizer as all strings of input tokens evolved from the precise initial state for which the accepting dimension of the state is above a certain threshold. In network terms, one output unit would be subjected to a threshold test after processing a sequence of input patterns.\n\nThe first question to ask is how can such a dynamical system be constructed, or taught, to accept a particular language? The weights in the network, individually, do not correspond directly to graph transitions or to phrase structure rules. The second question to ask is what sort of generative power can be achieved by such systems?\n\n2  The Model\n\nTo begin to answer the question of learning, I now present and elaborate upon my earlier work on Cascaded Networks (Pollack, 1987), which were used in a recurrent fashion to learn parity, depth-limited parenthesis balancing, and to map between word sequences and proposition representations (Pollack, 1990a). A Cascaded Network is a well-controlled higher-order connectionist architecture to which the back-propagation technique of weight adjustment (Rumelhart et al., 1986) can be applied. Basically, it consists of two subnetworks: The function network is a standard feed-forward network, with or without hidden layers. However, the weights are dynamically computed by the linear context network, whose outputs are mapped in a 1:1 fashion to the weights of the function net. Thus the input pattern to the context network is used to \"multiplex\" the function computed, which can result in simpler learning tasks. 
\n\nWhen the outputs of the function network are used as inputs to the context network, a system can be built which learns to produce specific outputs for variable-length sequences of inputs. Because of the multiplicative connections, each input is, in effect, processed by a different function. Given an initial context, $z_k(0)$, and a sequence of inputs, $y_j(t)$, $t = 1 \\ldots n$, the network computes a sequence of state vectors, $z_i(t)$, $t = 1 \\ldots n$, by dynamically changing the set of weights, $w_{ij}(t)$. Without hidden units the forward pass computation is:\n\n$$w_{ij}(t) = \\sum_k w_{ijk} \\, z_k(t-1)$$\n\n$$z_i(t) = g\\left(\\sum_j w_{ij}(t) \\, y_j(t)\\right)$$\n\nwhere $g$ is the usual sigmoid function used in back-propagation systems.\n\nIn previous work, I assumed that a teacher could supply a consistent and generalizable desired-state for each member of a large set of strings, which was a significant overconstraint. In learning a two-state machine like parity, this did not matter, as the 1-bit state fully determines the output. However, for the case of a higher-dimensional system, we know what the final output of a system should be, but we don't care what its state should be along the way.\n\nJordan (1986) showed how recurrent back-propagation networks could be trained with \"don't care\" conditions. If there is no specific preference for the value of an output unit for a particular training example, simply consider the error term for that unit to be 0. This will work, as long as that same unit receives feedback from other examples. When the don't-cares line up, the weights to those units will never change. 
My solution to this problem involves a backspace, unrolling the loop only once: After propagating the errors determined on only a subset of the weights from the \"acceptance\" unit $z_a$:\n\n$$\\frac{\\partial E}{\\partial w_{aj}(n)} = (z_a(n) - d_a) \\, z_a(n) \\, (1 - z_a(n)) \\, y_j(n)$$\n\n$$\\frac{\\partial E}{\\partial w_{ajk}} = \\frac{\\partial E}{\\partial w_{aj}(n)} \\, z_k(n-1)$$\n\nThe error on the remainder of the weights ($\\partial E / \\partial w_{ijk}$, $i \\neq a$) is calculated using values from the penultimate time step:\n\n$$\\frac{\\partial E}{\\partial z_k(n-1)} = \\sum_a \\sum_j w_{ajk} \\, \\frac{\\partial E}{\\partial w_{aj}(n)}$$\n\n$$\\frac{\\partial E}{\\partial w_{ij}(n-1)} = \\frac{\\partial E}{\\partial z_i(n-1)} \\, y_j(n-1)$$\n\n$$\\frac{\\partial E}{\\partial w_{ijk}} = \\frac{\\partial E}{\\partial w_{ij}(n-1)} \\, z_k(n-2)$$\n\nThis is done, in batch (epoch) style, for a set of examples of varying lengths.\n\n3  Induction as Phase Transition\n\nIn initial studies of learning the simple regular language of odd parity, I expected the recognizer to merely implement \"exclusive or\" with a feedback link. It turns out that this is not quite enough. Because termination of back-propagation is usually defined as a 20% error (e.g. logical \"1\" is above 0.8), recurrent use of this logic tends to a limit point. In other words, mere separation of the exemplars is no guarantee that the network can recognize parity in the limit. Nevertheless, this is indeed possible, as illustrated below. In order to test the limit behavior of a recognizer, we can observe its response to a very long \"characteristic string\". For odd parity, the string 1* requires an alternation of responses.\n\nA small cascaded network composed of a 1-2 function net and a 2-6 context net (requiring 18 weights) was trained on odd parity of a small set of strings up to length 5. At each epoch, the weights in the network were saved in a file. Subsequently, each configuration was tested in its response to the first 25 characteristic strings. 
In Figure 1, each vertical column, corresponding to an epoch, contains 25 points between 0 and 1. Initially, all strings longer than length 1 are not distinguished. From cycle 60-80, the network is improving at separating finite strings. At cycle 85, the network undergoes a \"bifurcation,\" where the small change in weights of a single epoch leads to a phase transition from a limit point to a limit cycle.\u00b9 This phase transition is so \"adaptive\" to the classification task that the network rapidly exploits it.\n\n[Figure 1 plot: vertical axis from 0 to 1 (ticks at 0.2, 0.4, 0.6, 0.8), horizontal axis in epochs (ticks at 50, 100, 150, 200).]\n\nFigure 1: A bifurcation diagram showing the response of the parity-learner to the first 25 characteristic strings over 200 epochs of training.\n\nI wish to stress that this is a new and very interesting form of mechanical induction, and reveals that with the proper perspective, non-linear connectionist networks are capable of much more complex behavior than hill-climbing. Before the phase transition, the machine is in principle not capable of performing the serial parity task; after the phase transition it is. 
The similarity of learning through a \"flash of insight\" to biological change through a \"punctuated\" evolution is much more than coincidence.\n\n\u00b9 For the simple low dimensional dynamical systems usually studied, the \"knob\" or control parameter for such a bifurcation diagram is a scalar variable; here the control parameter is the entire 32-D vector of weights in the network, and back-propagation turns the knob!\n\n4  Benchmarking Results\n\nTomita (1982) performed elegant experiments in inducing finite automata from positive and negative evidence using hill-climbing in the space of 9-state automata. Each case was defined by two sets of boolean strings, accepted by and rejected by the regular languages listed below.\n\n1  1*\n2  (10)*\n3  no odd zero strings after odd 1 strings\n4  no triples of zeros\n5  pairwise, an even sum of 01's and 10's\n6  number of 1's - number of 0's = 3n\n7  0*1*0*1*\n\nRather than inventing my own training data, or sampling these languages for a well-formed training set, I ran all 7 Tomita training environments as given, on a sequential cascaded network of a 1-input 4-output function network (with bias, 8 weights to set) and a 3-input 8-output context network with bias, using a learning rate of 0.3 and a momentum of 0.7. Termination was when all accepted strings returned output bits above 0.8 and rejected strings below 0.2.\n\nOf Tomita's 7 cases, all but cases #2 and #6 converged without a problem in several hundred epochs. Case 2 would not converge, and kept treating a negative case as correct because of the difficulty for my architecture to induce a \"trap\" state; I had to modify the training set (by adding reject strings 110 and 11010) in order to overcome this problem.\u00b2 
\nCase 6 took several restarts and thousands of cycles to converge, cause unknown. The complete experimental data is available in a longer report (Pollack, 1990b).\n\nBecause the states are \"in a box\" of low dimension,\u00b3 we can view these machines graphically to gain some understanding of how the state space is being arranged. Based upon some initial studies of parity, my initial hypothesis was that a set of clusters would be found, organized in some geometric fashion: i.e. an embedding of a finite state machine into a finite dimensional geometry such that each token's transitions would correspond to a simple transformation of space. Graphs of the states visited by all possible inputs up to length 10, for the 7 Tomita test cases, are shown in Figure 2. Each figure contains 2048 points, but often they overlap.\n\nThe images (a) and (d) are what were expected, clumps of points which closely map to states of equivalent FSAs. Images (b) and (e) have limit \"ravines\" which can each be considered states as well.\n\n[Figure 2 images: panels A through G.]\n\nFigure 2: Images of the state-spaces for the 7 benchmark cases. Each image contains 2048 points corresponding to the states of all boolean strings up to length 10.\n\n\u00b2 It can be argued that other FSA inducing methods get around this problem by presupposing rather than learning trap states.\n\n\u00b3 One reason I have succeeded in such low dimensional induction is because my architecture is a Mealy, rather than Moore Machine (Lee Giles, Personal Communication).\n\n5  Discussion\n\nHowever, the state spaces, (c), (f), and (g) of the dynamical recognizers for Tomita cases 3, 6, and 7, are interesting, because, theoretically, they are infinite state machines, where the states are not arbitrary or random, requiring an infinite table of transitions, but are constrained in a powerful way by mathematical principle. In other words, the complexity is in the dynamics, not in the specifications (weights).\n\nIn thinking about such a principle, consider other systems in which extreme observed complexity emerges from algorithmic simplicity plus computational power. It is interesting to note that by eliminating the sigmoid and commuting the $y_j$ and $z_k$ terms, the forward equation for higher order recurrent networks is identical to the generator of an Iterated Function System (IFS) (Barnsley et al., 1985). Thus, my figures of state-spaces, which emerge from the projection of $\\Sigma^*$ into $Z$, are of the same class of mathematical object as Barnsley's fractal attractors (e.g. the widely reproduced fern). Using the method of (Grassberger & Procaccia, 1983), the correlation dimension of the attractor in Figure 2(g) was found to be about 1.4.\n\nThe link between work in complex dynamical systems and neural networks is well-established both on the neurobiological level (Skarda & Freeman, 1987) and on the mathematical level (Derrida & Meir, 1988; Huberman & Hogg, 1987; Kurten, 1987; Smolensky, 1986). This paper expands a theme from an earlier proposal to link them at the \"cognitive\" level (Pollack, 1989). 
\n\nThere is an interesting formal question, which has been brought out in the work of (Wolfram, 1984) and others on the universality of cellular automata, and more recently in the work of (Crutchfield & Young, 1989) on the descriptive complexity of bifurcating systems: What is the relationship between complex dynamics (of neural systems) and traditional measures of computational complexity? From this work and other supporting evidence, I venture the following hypothesis:\n\nThe state-space limit of a dynamical recognizer, as $\\Sigma^* \\to \\Sigma^\\infty$, is an Attractor, which is cut by a threshold (or similar decision) function. The complexity of the language recognized is regular if the cut falls between disjoint limit points or cycles, context-free if it cuts a \"self-similar\" (recursive) region, and context-sensitive if it cuts a \"chaotic\" (pseudo-random) region.\n\nAcknowledgements\n\nThis research has been partially supported by the Office of Naval Research under grant N00014-89-J-1200.\n\nReferences\n\nAngluin, D. (1978). On the complexity of minimum inference of regular sets. Information and Control, 39, 337-350.\n\nBarnsley, M. F., Ervin, V., Hardin, D. & Lancaster, J. (1985). Solution of an inverse problem for fractals and other sets. Proceedings of the National Academy of Science, 83.\n\nCrutchfield, J. P. & Young, K. (1989). Computation at the Onset of Chaos. In W. Zurek, (Ed.), Complexity, Entropy and the Physics of Information. Reading, MA: Addison-Wesley.\n\nDerrida, B. & Meir, R. (1988). Chaotic behavior of a layered neural network. Phys. Rev. A, 38.\n\nElman, J. L. (1990). Finding Structure in Time. Cognitive Science, 14, 179-212.\n\nGiles, C. L., Sun, G. Z., Chen, H. H., Lee, Y. C. & Chen, D. (1990). Higher Order Recurrent Networks and Grammatical Inference. In D. Touretzky, (Ed.), Advances in Neural Information Processing Systems. Los Gatos, CA: Morgan Kaufman.\n\nGrassberger, P. & Procaccia, I. (1983). Measuring the Strangeness of Strange Attractors. Physica, 9D, 189-208.\n\nHuberman, B. A. & Hogg, T. (1987). Phase Transitions in Artificial Intelligence Systems. Artificial Intelligence, 33, 155-172.\n\nJordan, M. I. (1986). Serial Order: A Parallel Distributed Processing Approach. ICS report 8608. La Jolla: Institute for Cognitive Science, UCSD.\n\nKurten, K. E. (1987). Phase transitions in quasirandom neural networks. In Institute of Electrical and Electronics Engineers First International Conference on Neural Networks. San Diego, 11-197-20.\n\nMcCulloch, W. S. & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-133.\n\nPollack, J. B. (1987). Cascaded Back Propagation on Dynamic Connectionist Networks. In Proceedings of the Ninth Conference of the Cognitive Science Society. Seattle, 391-404.\n\nPollack, J. B. (1989). Implications of Recursive Distributed Representations. In D. Touretzky, (Ed.), Advances in Neural Information Processing Systems. Los Gatos, CA: Morgan Kaufman.\n\nPollack, J. B. (1990a). Recursive Distributed Representation. Artificial Intelligence, 46, 77-105.\n\nPollack, J. B. (1990b). The Induction of Dynamical Recognizers. Tech Report 90-JP-Automata. Columbus, OH 43210: LAIR, Ohio State University.\n\nRumelhart, D. E., Hinton, G. & Williams, R. (1986). Learning Internal Representations through Error Propagation. In D. E. Rumelhart, J. L. McClelland & the PDP research Group, (Eds.), Parallel Distributed Processing: Experiments in the Microstructure of Cognition, Vol. 1. Cambridge: MIT Press.\n\nServan-Schreiber, D., Cleeremans, A. & McClelland, J. L. (1989). Encoding Sequential Structure in Simple Recurrent Networks. In D. Touretzky, (Ed.), Advances in Neural Information Processing Systems. Los Gatos, CA: Morgan Kaufman.\n\nSkarda, C. A. & Freeman, W. J. (1987). How brains make chaos. Brain & Behavioral Science, 10.\n\nSmolensky, P. (1986). Information Processing in Dynamical Systems: Foundations of Harmony Theory. In D. E. Rumelhart, J. L. McClelland & the PDP research Group, (Eds.), Parallel Distributed Processing: Experiments in the Microstructure of Cognition, Vol. 1. Cambridge: MIT Press.\n\nTomita, M. (1982). Dynamic construction of finite-state automata from examples using hill-climbing. In Proceedings of the Fourth Annual Cognitive Science Conference. Ann Arbor, MI, 105-108.\n\nWolfram, S. (1984). Universality and Complexity in Cellular Automata. Physica, 10D, 1-35.\n", "award": [], "sourceid": 298, "authors": [{"given_name": "Jordan", "family_name": "Pollack", "institution": null}]}