{"title": "Language Induction by Phase Transition in Dynamical Recognizers", "book": "Advances in Neural Information Processing Systems", "page_first": 619, "page_last": 626, "abstract": null, "full_text": "Language Induction by Phase Transition in Dynamical Recognizers\n\nJordan B. Pollack\n\nLaboratory for AI Research\nThe Ohio State University\nColumbus, OH 43210\n\npollack@cis.ohio-state.edu\n\nAbstract\n\nA higher-order recurrent neural network architecture learns to recognize and generate languages after being \"trained\" on categorized exemplars. Studying these networks from the perspective of dynamical systems yields two interesting discoveries. First, a longitudinal examination of the learning process illustrates a new form of mechanical inference: induction by phase transition. A small weight adjustment causes a \"bifurcation\" in the limit behavior of the network. This phase transition corresponds to the onset of the network's capacity for generalizing to arbitrary-length strings. Second, a study of the automata resulting from the acquisition of previously published languages indicates that while the architecture is NOT guaranteed to find a minimal finite automaton consistent with the given exemplars, which is an NP-hard problem, the architecture does appear capable of generating non-regular languages by exploiting fractal and chaotic dynamics. I end the paper with a hypothesis relating linguistic generative capacity to the behavioral regimes of non-linear dynamical systems. 
\n\n1 Introduction\n\nI expose a recurrent high-order back-propagation network to both positive and negative examples of boolean strings, and report that although the network does not find the minimal-description finite-state automaton for each language (a problem which is NP-hard (Angluin, 1978)), it does induction in a novel and interesting fashion, and searches through a hypothesis space which, theoretically, is not constrained to machines of finite state. These results are of import to many related neural models currently under development, e.g. (Elman, 1990; Giles et al., 1990; Servan-Schreiber et al., 1989), and relate ultimately to the question of how linguistic capacity can arise in nature.\n\nAlthough the transitions among states in a finite-state automaton are usually thought of as being fully specified by a table, a transition function can also be specified as a mathematical function of the current state and the input. It is known from (McCulloch & Pitts, 1943) that even the most elementary modeling assumptions yield finite-state control, and it is worth reiterating that any network with the capacity to compute arbitrary boolean functions (say, as logical sums of products) (Lapedes & Farber; White & Hornik) can be used recurrently to implement arbitrary finite-state machines.\n\nFrom a different point of view, a recurrent network with a state evolving across $k$ units can be considered a $k$-dimensional discrete-time continuous-space dynamical system, with a precise initial condition, $z_k(0)$, and a state space in $Z$, a subspace of $R^k$. The governing function, $F$, is parameterized by a set of weights, $W$, and merely computes the next state from the current state and input, $y_j(t)$, a finite sequence of patterns representing tokens from some alphabet $\\Sigma$:\n\n$$z_k(t+1) = F_W(z_k(t), y_j(t))$$\n\nIf we view one of the dimensions of this system, say $z_a$, as an \"acceptance\" dimension, 
we can define the language accepted by such a Dynamical Recognizer as all strings of input tokens evolved from the precise initial state for which the accepting dimension of the state is above a certain threshold. In network terms, one output unit would be subjected to a threshold test after processing a sequence of input patterns.\n\nThe first question to ask is how such a dynamical system can be constructed, or taught, to accept a particular language. The weights in the network, individually, do not correspond directly to graph transitions or to phrase structure rules. The second question to ask is what sort of generative power can be achieved by such systems.\n\n2 The Model\n\nTo begin to answer the question of learning, I now present and elaborate upon my earlier work on Cascaded Networks (Pollack, 1987), which were used in a recurrent fashion to learn parity and depth-limited parenthesis balancing, and to map between word sequences and proposition representations (Pollack, 1990a). A Cascaded Network is a well-controlled higher-order connectionist architecture to which the back-propagation technique of weight adjustment (Rumelhart et al., 1986) can be applied. Basically, it consists of two subnetworks. The function network is a standard feed-forward network, with or without hidden layers. However, its weights are dynamically computed by the linear context network, whose outputs are mapped in a 1:1 fashion to the weights of the function net. Thus the input pattern to the context network is used to \"multiplex\" the function computed, which can result in simpler learning tasks.\n\nWhen the outputs of the function network are used as inputs to the context network, a system can be built which learns to produce specific outputs for variable-length sequences of inputs. Because of the multiplicative connections, each input is, in effect, processed by a different function. Given an initial context, $z_k(0)$, 
and a sequence of inputs, $y_j(t)$, $t = 1 \\ldots n$, the network computes a sequence of state vectors, $z_i(t)$, $t = 1 \\ldots n$, by dynamically changing the set of weights, $w_{ij}(t)$. Without hidden units the forward pass computation is:\n\n$$w_{ij}(t) = \\sum_k w_{ijk} \\, z_k(t-1)$$\n\n$$z_i(t) = g\\Big(\\sum_j w_{ij}(t) \\, y_j(t)\\Big)$$\n\nwhere $g$ is the usual sigmoid function used in back-propagation systems.\n\nIn previous work, I assumed that a teacher could supply a consistent and generalizable desired state for each member of a large set of strings, which was a significant overconstraint. In learning a two-state machine like parity, this did not matter, as the 1-bit state fully determines the output. However, for the case of a higher-dimensional system, we know what the final output of a system should be, but we don't care what its state should be along the way.\n\nJordan (1986) showed how recurrent back-propagation networks could be trained with \"don't care\" conditions. If there is no specific preference for the value of an output unit for a particular training example, simply consider the error term for that unit to be 0. This will work, as long as that same unit receives feedback from other examples. When the don't-cares line up, the weights to those units will never change. My solution to this problem involves a backspace, unrolling the loop only once. First, the error at the end of the string is propagated to only a subset of the weights, those feeding the \"acceptance\" unit $z_a$, where $d_a$ is the desired final value of that unit:\n\n$$\\frac{\\partial E}{\\partial w_{aj}(n)} = (z_a(n) - d_a) \\, z_a(n) \\, (1 - z_a(n)) \\, y_j(n)$$\n\n$$\\frac{\\partial E}{\\partial w_{ajk}} = \\frac{\\partial E}{\\partial w_{aj}(n)} \\, z_k(n-1)$$\n\nThe error on the remainder of the weights ($\\partial E / \\partial w_{ijk}$, $i \\neq a$) is calculated using values from the penultimate time step:\n\n$$\\frac{\\partial E}{\\partial z_k(n-1)} = \\sum_a \\sum_j \\frac{\\partial E}{\\partial w_{ajk}} \\, \\frac{\\partial E}{\\partial w_{aj}(n)}$$\n\n$$\\frac{\\partial E}{\\partial w_{ij}(n-1)} = \\frac{\\partial E}{\\partial z_i(n-1)} \\, y_j(n-1)$$\n\n$$\\frac{\\partial E}{\\partial w_{ijk}} = \\frac{\\partial E}{\\partial w_{ij}(n-1)} \\, z_k(n-2)$$\n\n
This is done, in batch (epoch) style, for a set of examples of varying lengths.\n\n3 Induction as Phase Transition\n\nIn initial studies of learning the simple regular language of odd parity, I expected the recognizer to merely implement \"exclusive or\" with a feedback link. It turns out that this is not quite enough. Because termination of back-propagation is usually defined as a 20% error (e.g. logical \"1\" is above 0.8), recurrent use of this logic tends to a limit point. In other words, mere separation of the exemplars is no guarantee that the network can recognize parity in the limit. Nevertheless, this is indeed possible, as illustrated below. In order to test the limit behavior of a recognizer, we can observe its response to a very long \"characteristic string\". For odd parity, the string 1* requires an alternation of responses.\n\nA small cascaded network composed of a 1-2 function net and a 2-6 context net (requiring 18 weights) was trained on odd parity of a small set of strings up to length 5. At each epoch, the weights in the network were saved in a file. Subsequently, each configuration was tested in its response to the first 25 characteristic strings. In Figure 1, each vertical column, corresponding to an epoch, contains 25 points between 0 and 1. Initially, all strings longer than length 1 are not distinguished. From cycle 60 to 80, the network is improving at separating finite strings. At cycle 85, the network undergoes a \"bifurcation,\" where the small change in weights of a single epoch leads to a phase transition from a limit point to a limit cycle.[1] This phase transition is so \"adaptive\" to the classification task that the network rapidly exploits it.\n\n[Figure 1 plot: network response (vertical axis, 0 to 1) against training epoch (horizontal axis, 0 to 200).]\n\n
Figure 1: A bifurcation diagram showing the response of the parity-learner to the first 25 characteristic strings over 200 epochs of training.\n\nI wish to stress that this is a new and very interesting form of mechanical induction, and reveals that with the proper perspective, non-linear connectionist networks are capable of much more complex behavior than hill-climbing. Before the phase transition, the machine is in principle not capable of performing the serial parity task; after the phase transition it is. The similarity of learning through a \"flash of insight\" to biological change through a \"punctuated\" evolution is much more than coincidence.\n\n4 Benchmarking Results\n\nTomita (1982) performed elegant experiments in inducing finite automata from positive and negative evidence using hill-climbing in the space of 9-state automata. 
Each case was defined by two sets of boolean strings, accepted by and rejected by the regular languages listed below.\n\nCase  Language\n1     1*\n2     (10)*\n3     no odd zero strings after odd 1 strings\n4     no triples of zeros\n5     pairwise, an even sum of 01's and 10's\n6     number of 1's - number of 0's = 3n\n7     0*1*0*1*\n\n[1] For the simple low-dimensional dynamical systems usually studied, the \"knob\" or control parameter for such a bifurcation diagram is a scalar variable; here the control parameter is the entire 32-D vector of weights in the network, and back-propagation turns the knob!\n\nRather than inventing my own training data, or sampling these languages for a well-formed training set, I ran all 7 Tomita training environments as given, on a sequential cascaded network of a 1-input 4-output function network (with bias, 8 weights to set) and a 3-input 8-output context network with bias, using a learning rate of 0.3 and a momentum of 0.7. Termination was when all accepted strings returned output bits above 0.8 and rejected strings below 0.2.\n\nOf Tomita's 7 cases, all but cases #2 and #6 converged without a problem in several hundred epochs. Case 2 would not converge, and kept treating a negative case as correct because of the difficulty for my architecture to induce a \"trap\" state; I had to modify the training set (by adding reject strings 110 and 11010) in order to overcome this problem.[2] Case 6 took several restarts and thousands of cycles to converge, cause unknown. The complete experimental data is available in a longer report (Pollack, 1990b).\n\nBecause the states are \"in a box\" of low dimension,[3] we can view these machines graphically to gain some understanding of how the state space is being arranged. 
Based upon some initial studies of parity, my initial hypothesis was that a set of clusters would be found, organized in some geometric fashion: i.e. an embedding of a finite-state machine into a finite-dimensional geometry such that each token's transitions would correspond to a simple transformation of space. Graphs of the states visited by all possible inputs up to length 10, for the 7 Tomita test cases, are shown in Figure 2. Each figure contains 2048 points, but often they overlap.\n\nThe images (a) and (d) are what were expected, clumps of points which closely map to states of equivalent FSAs. Images (b) and (e) have limit \"ravines\" which can each be considered states as well.\n\nFigure 2: Images of the state-spaces for the 7 benchmark cases (panels A through G). Each image contains 2048 points corresponding to the states of all boolean strings up to length 10.\n\n[2] It can be argued that other FSA-inducing methods get around this problem by presupposing rather than learning trap states.\n\n[3] One reason I have succeeded in such low-dimensional induction is because my architecture is a Mealy, rather than Moore, machine (Lee Giles, personal communication).\n\n5 Discussion\n\nHowever, the state spaces (c), (f), and (g) of the dynamical recognizers for Tomita cases 3, 6, and 7 are interesting because, theoretically, they are infinite-state machines, where the states are not arbitrary or random, requiring an infinite table of transitions, but are constrained in a powerful way by mathematical principle. In other words, the complexity is in the dynamics, not in the specifications (weights).\n\n
In thinking about such a principle, consider other systems in which extreme observed complexity emerges from algorithmic simplicity plus computational power. It is interesting to note that by eliminating the sigmoid and commuting the $y_j$ and $z_k$ terms, the forward equation for higher-order recurrent networks is identical to the generator of an Iterated Function System (IFS) (Barnsley et al., 1985). Thus, my figures of state-spaces, which emerge from the projection of $\\Sigma^*$ into $Z$, are of the same class of mathematical object as Barnsley's fractal attractors (e.g. the widely reproduced fern). Using the method of (Grassberger & Procaccia, 1983), the correlation dimension of the attractor in Figure 2(g) was found to be about 1.4.\n\nThe link between work in complex dynamical systems and neural networks is well established both on the neurobiological level (Skarda & Freeman, 1987) and on the mathematical level (Derrida & Meir, 1988; Huberman & Hogg, 1987; Kurten, 1987; Smolensky, 1986). This paper expands a theme from an earlier proposal to link them at the \"cognitive\" level (Pollack, 1989).\n\nThere is an interesting formal question, which has been brought out in the work of (Wolfram, 1984) and others on the universality of cellular automata, and more recently in the work of (Crutchfield & Young, 1989) on the descriptive complexity of bifurcating systems: What is the relationship between complex dynamics (of neural systems) and traditional measures of computational complexity? From this work and other supporting evidence, I venture the following hypothesis:\n\nThe state-space limit of a dynamical recognizer, as $\\Sigma^* \\to \\Sigma^\\infty$, is an Attractor, which is cut by a threshold (or similar decision) function. The complexity of the language recognized is regular if the cut falls between disjoint limit points or cycles, context-free if it cuts a \"self-similar\" (recursive) region, and context-sensitive if it cuts a \"chaotic\" (pseudo-random) region. 
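A minimal sketch of how the state-space images and the thresholded acceptance test discussed above could be computed, assuming the forward equations of Section 2 with no hidden units. It is not the code used for the experiments: the function names, the constant bias appended to both the input and the context, the choice to feed every state unit back, and the 0.5 threshold are all assumptions made for illustration.

```python
import itertools
import math


def sigmoid(x):
    # The usual logistic squashing function g.
    return 1.0 / (1.0 + math.exp(-x))


def step(w, z_prev, y):
    # One time step. w[i][j][k] are the context-network weights.
    # A constant 1.0 is appended to the context as a bias (assumption).
    context = list(z_prev) + [1.0]
    z_next = []
    for i in range(len(w)):
        total = 0.0
        for j in range(len(y)):
            # Dynamically computed function-network weight w_ij(t).
            w_ij = sum(w[i][j][k] * context[k] for k in range(len(context)))
            total += w_ij * y[j]
        z_next.append(sigmoid(total))
    return z_next


def run(w, z0, string):
    # Returns the list of states visited while reading a string of '0'/'1'.
    z = list(z0)
    states = [z]
    for ch in string:
        y = [float(ch), 1.0]  # input bit plus a bias input (assumption)
        z = step(w, z, y)
        states.append(z)
    return states


def accepts(w, z0, string, accept_dim=0, threshold=0.5):
    # Threshold test on the acceptance dimension of the final state.
    return run(w, z0, string)[-1][accept_dim] > threshold


def state_cloud(w, z0, max_len):
    # Final states for every boolean string of length 0..max_len,
    # including the empty string.
    cloud = []
    for n in range(max_len + 1):
        for bits in itertools.product('01', repeat=n):
            cloud.append(run(w, z0, ''.join(bits))[-1])
    return cloud
```

Plotting two coordinates of the points returned by state_cloud for a trained weight tensor would give a picture of the kind shown in Figure 2; with untrained weights the cloud carries no particular meaning.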
\n\nAcknowledgements\n\nThis research has been partially supported by the Office of Naval Research under grant N00014-89-J-1200.\n\nReferences\n\nAngluin, D. (1978). On the complexity of minimum inference of regular sets. Information and Control, 39, 337-350.\nBarnsley, M. F., Ervin, V., Hardin, D. & Lancaster, J. (1985). Solution of an inverse problem for fractals and other sets. Proceedings of the National Academy of Science, 83.\nCrutchfield, J. P. & Young, K. (1989). Computation at the Onset of Chaos. In W. Zurek (Ed.), Complexity, Entropy and the Physics of Information. Reading, MA: Addison-Wesley.\nDerrida, B. & Meir, R. (1988). Chaotic behavior of a layered neural network. Phys. Rev. A, 38.\nElman, J. L. (1990). Finding Structure in Time. Cognitive Science, 14, 179-212.\nGiles, C. L., Sun, G. Z., Chen, H. H., Lee, Y. C. & Chen, D. (1990). Higher Order Recurrent Networks and Grammatical Inference. In D. Touretzky (Ed.), Advances in Neural Information Processing Systems. Los Gatos, CA: Morgan Kaufman.\nGrassberger, P. & Procaccia, I. (1983). Measuring the Strangeness of Strange Attractors. Physica, 9D, 189-208.\nHuberman, B. A. & Hogg, T. (1987). Phase Transitions in Artificial Intelligence Systems. Artificial Intelligence, 33, 155-172.\nJordan, M. I. (1986). Serial Order: A Parallel Distributed Processing Approach. ICS report 8608. La Jolla: Institute for Cognitive Science, UCSD.\nKurten, K. E. (1987). Phase transitions in quasirandom neural networks. In Institute of Electrical and Electronics Engineers First International Conference on Neural Networks. San Diego, II-197-20.\nMcCulloch, W. S. & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-133.\nPollack, J. B. (1987). Cascaded Back Propagation on Dynamic Connectionist Networks. 
In Proceedings of the Ninth Conference of the Cognitive Science Society. Seattle, 391-404.\nPollack, J. B. (1989). Implications of Recursive Distributed Representations. In D. Touretzky (Ed.), Advances in Neural Information Processing Systems. Los Gatos, CA: Morgan Kaufman.\nPollack, J. B. (1990a). Recursive Distributed Representation. Artificial Intelligence, 46, 77-105.\nPollack, J. B. (1990b). The Induction of Dynamical Recognizers. Tech Report 90-JP-Automata. Columbus, OH 43210: LAIR, Ohio State University.\nRumelhart, D. E., Hinton, G. & Williams, R. (1986). Learning Internal Representations through Error Propagation. In D. E. Rumelhart, J. L. McClelland & the PDP Research Group (Eds.), Parallel Distributed Processing: Experiments in the Microstructure of Cognition, Vol. 1. Cambridge: MIT Press.\nServan-Schreiber, D., Cleeremans, A. & McClelland, J. L. (1989). Encoding Sequential Structure in Simple Recurrent Networks. In D. Touretzky (Ed.), Advances in Neural Information Processing Systems. Los Gatos, CA: Morgan Kaufman.\nSkarda, C. A. & Freeman, W. J. (1987). How brains make chaos. Brain & Behavioral Science, 10.\nSmolensky, P. (1986). Information Processing in Dynamical Systems: Foundations of Harmony Theory. In D. E. Rumelhart, J. L. McClelland & the PDP Research Group (Eds.), Parallel Distributed Processing: Experiments in the Microstructure of Cognition, Vol. 1. Cambridge: MIT Press.\nTomita, M. (1982). Dynamic construction of finite-state automata from examples using hill-climbing. In Proceedings of the Fourth Annual Cognitive Science Conference. Ann Arbor, MI, 105-108.\nWolfram, S. (1984). Universality and Complexity in Cellular Automata. Physica, 10D, 1-35.\n", "award": [], "sourceid": 298, "authors": [{"given_name": "Jordan", "family_name": "Pollack", "institution": null}]}