{"title": "A Connectionist Symbol Manipulator That Discovers the Structure of Context-Free Languages", "book": "Advances in Neural Information Processing Systems", "page_first": 863, "page_last": 870, "abstract": null, "full_text": "A Connectionist Symbol Manipulator \n\nThat Discovers the Structure of \n\nContext-Free Languages \n\nMichael C. Mozer and Sreerupa Das \n\nDepartment of Computer Science & \n\nInstitute of Cognitive Science \n\nUniversity of Colorado \nBoulder, CO 80309-0430 \n\nAbstract \n\nWe present a neural net architecture that can discover hierarchical and \nrecursive structure in symbol strings. To detect structure at multiple levels, \nthe architecture has the capability of reducing symbol substrings to single \nsymbols, and makes use of an external stack memory. In terms of formal \nlanguages, the architecture can learn to parse strings in an LR(0) \ncontext-free grammar. Given training sets of positive and negative exemplars, \nthe architecture has been trained to recognize many different grammars. \nThe architecture has only one layer of modifiable weights, allowing for a \nstraightforward interpretation of its behavior. \n\nMany cognitive domains involve complex sequences that contain hierarchical or \nrecursive structure, e.g., music, natural language parsing, event perception. To \nillustrate, \"the spider that ate the hairy fly\" is a noun phrase containing the \nembedded noun phrase \"the hairy fly.\" Understanding such multilevel structures requires \nforming reduced descriptions (Hinton, 1988) in which a string of symbols or states \n(\"the hairy fly\") is reduced to a single symbolic entity (a noun phrase). We present \na neural net architecture that learns to encode the structure of symbol strings via \nsuch reduction transformations. 
\n\nThe difficult problem of extracting multilevel structure from complex, extended \nsequences has been studied by Mozer (1992), Ring (1993), Rohwer (1990), and \nSchmidhuber (1992), among others. While these previous efforts have made some \nprogress, no one has claimed victory over the problem. Our approach is based on a \nnew perspective, one of symbolic reduction transformations, which affords a fresh \nattack on the problem. \n\n[Figure 1: The demon model. Labeled components: demon units, input queue, and stack, with pop and push operations.] \n\n1 A BLACKBOARD ARCHITECTURE \n\nOur inspiration is a blackboard style architecture that works as follows. The input, \na sequence of symbols, is copied onto a blackboard (a scratch pad memory) one \nsymbol at a time. A set of demons watch over the blackboard, each looking for a \nspecific pattern of symbols. When a demon observes its pattern, it fires, causing \nthe pattern to be replaced by a symbol associated with that demon, which we'll call \nits identity. This process continues until the entire input string has been read or no \ndemon can fire. The sequence of demon firings and the final blackboard contents \nspecify the structure of the input. \n\nThe model we present is a simplified version of this blackboard architecture. The \nblackboard is implemented as a stack. Consequently, the demons have no control \nover where they write or read a symbol; they simply push and pop symbols from \nthe stack. The other simplification is that the demon firing is based on template \nmatching, rather than a more sophisticated form of pattern matching. \n\nThe demon model is sketched in Figure 1. An input queue holds the input string \nto be parsed, which is gradually transferred to the stack. The top k stack symbols \nare encoded in a set of stack units; in the current implementation, k = 2. Each \ndemon is embodied by a special processing unit which receives input from the stack \nunits. 
The weights of each demon unit specify a pair of symbols, which the demon \nunit matches against the two stack symbols. If there is a match, the demon unit \npops the top two stack symbols and pushes its identity. If no demon unit matches, \nan additional unit, called the default unit, becomes active. The default unit is \nresponsible for transferring a symbol from the input queue onto the stack. \n\nS -> a b \nS -> a X \nX -> S b \n\n      S \n     / \\ \n    a   X \n       / \\ \n      S   b \n     / \\ \n    a   b \n\nFigure 2: The rewrite rules defining a grammar that generates strings of the form \n$a^n b^n$ and a parse tree for the string aabb. \n\n2 PARSING CONTEXT-FREE LANGUAGES \n\nEach demon unit reduces a pair of symbols to a single symbol. We can express \nthe operation of a demon as a rewrite rule of the form X -> a b, where the lower \ncase letters denote symbols in the input string and upper case letters denote the \ndemon identities, also symbols in their own right. The above rule specifies that \nwhen the symbols a and b appear on the top of the stack, in that order, the X \ndemon unit should fire, erasing those two symbols and replacing them with an X. \nDemon units can respond to internal symbols (demon identities) instead of input \nsymbols, allowing internal symbols on the right hand side of the rule. Demon units \ncan also respond to individual input symbols, achieving rules of the form X -> a. \n\nMultiple demon units can have the same identity, leading to rewrite rules of a \nmore general form, e.g., X -> a b | Y c | d Z | a. This class of rewrite rules can \nexpress a subset of context-free grammars. Figure 2 shows a sample grammar that \ngenerates strings of the form $a^n b^n$ and a parse tree for the input string aabb. The \ndemon model essentially constructs such parse trees via the sequence of reduction \noperations. 
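The discrete behavior described above can be sketched in a few lines of Python. This is only an illustrative, non-learning sketch under stated assumptions: the names RULES, parse, and is_grammatical are hypothetical and do not come from the paper, the rule table is hand-coded from Figure 2 rather than learned, and demon firing is all-or-none. Each entry of the rule table plays the role of one demon unit (a pair template plus an identity), and the shift branch plays the role of the default unit.

```python
# Hypothetical, hand-coded rule table taken from the grammar of Figure 2.
# Key: the pair of symbols a demon matches; value: the demon's identity.
RULES = {('a', 'b'): 'S', ('a', 'X'): 'S', ('S', 'b'): 'X'}


def parse(string, rules=RULES):
    '''Discrete shift-reduce sketch of the demon model with k = 2.'''
    queue = list(string)   # input queue
    stack = []             # stack; the top of the stack is the end of the list
    trace = []             # sequence of demon firings and shifts
    while True:
        top_two = tuple(stack[-2:])
        if len(top_two) == 2 and top_two in rules:
            # A demon fires: pop the top two symbols, push its identity.
            identity = rules[top_two]
            del stack[-2:]
            stack.append(identity)
            trace.append(f'{identity} -> {top_two[0]} {top_two[1]}')
        elif queue:
            # No demon matches: the default unit shifts one input symbol.
            stack.append(queue.pop(0))
            trace.append('shift')
        else:
            # Input exhausted and no demon can fire.
            break
    return stack, trace


def is_grammatical(string):
    '''Accept only if the stack holds exactly the root symbol S.'''
    stack, _ = parse(string)
    return stack == ['S']


if __name__ == '__main__':
    for s in ['ab', 'aabb', 'aaabbb', 'abab', 'aab']:
        print(s, is_grammatical(s))
```

For the string aabb, the firing sequence recorded in the trace is S -> a b, then X -> S b, then S -> a X, which corresponds to building the parse tree of Figure 2 from the bottom up.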
\n\nThat each rule has only one or two symbols on the right hand side imposes no \nlimitation on the class of grammars that can be recognized. However, the demon \nmodel does require certain knowledge about the grammars to be identified. First, \nthe maximum number of rewrite rules and the maximum number of rules having the \nsame left-hand side must be specified in advance. This is because the units have \nto be allocated prior to learning. Second, the LR-class of the grammar must be \ngiven. To explain, any context-free grammar can be characterized as LR(n), which \nindicates that the strings of the grammar can be parsed from left to right with n \nsymbols of look ahead on the input queue. The demon model requires that n be \nspecified in advance. In the present work, we examine only LR(0) grammars, but \nthe architecture can readily be generalized to arbitrary n. \n\nGiles et al. (1990), Sun et al. (1990), and Das, Giles, and Sun (1992) have previously \nexplored the learning of context-free grammars in a neural net. Their approach was \nbased on the automaton perspective of a recognizer, where the primary interest was \nto learn the dynamics of a pushdown automaton. There has also been significant \nwork in context-free grammar inference using symbolic approaches. In general, these \napproaches require a significant amount of prior information about the grammar \nand, although theoretically sound, have not proven terribly useful in practice. A \npromising exception is the recent proposal of Stolcke (1993). \n\n3 CONTINUOUS DYNAMICS \n\nSo far, we have described the model in a discrete way: demon firing is all-or-none \nand mutually exclusive, corresponding to the demon units achieving a unary \nrepresentation. 
This may be the desired behavior following learning, but neural net \nlearning algorithms like back propagation require exploration in continuous state \nand weight spaces and therefore need to allow partial activity of demon units. The \ncontinuous activation dynamics follow. \n\nDemon unit i computes the distance between its weights, $w_i$, and the input, $x$: \n$\\mathrm{dist}_i = b_i |w_i - x|^2$, where $b_i$ is an adjustable bias associated with the unit. The \nactivity of unit i, denoted $s_i$, is computed via a normalized exponential transform \n(Bridle, 1990; Rumelhart, in press), \n\n$$s_i = \\frac{e^{-\\mathrm{dist}_i}}{\\sum_j e^{-\\mathrm{dist}_j}},$$ \n\nwhich enforces a competition among the units. A special unit, called the default \nunit, is designed to respond when none of the demons fires strongly. Its activity, \n$s_{\\mathrm{def}}$, is computed like that of any demon unit with $\\mathrm{dist}_{\\mathrm{def}} = b_{\\mathrm{def}}$. \n\n4 CONTINUOUS STACK \n\nBecause demon units can be partially active, stack operations need to be performed \npartially. This can be accomplished with a continuous stack (Giles et al., 1990). \nUnlike a discrete stack where an item is either present or absent, items can be \npresent to varying degrees. Each item on the stack has an associated thickness, a \nscalar in the interval [0,1] indicating what fraction of the item is present (Figure 3). \n\nTo understand how the thickness plays a role in processing, we digress briefly and \nexplain the encoding of symbols. Both on the stack and in the network, symbols \nare represented by numerical vectors that have one component per symbol. The \nvector representation of some symbol X, denoted $r_X$, has value 1 for the component \ncorresponding to X and 0 for all other components. If the symbol has thickness t, \nthe vector representation is $t r_X$. \n\nAlthough items on the stack have different thicknesses, the network is presented \nwith composite symbols having thickness 1.0. Composite symbols are formed by \ncombining stack items. 
For example, in Figure 3, composite symbol 1 is defined as \nthe vector $0.2 r_X + 0.5 r_Z + 0.3 r_V$. The input to the demon network consists of the top \ntwo composite symbols on the stack. \n\nThe advantages of a continuous stack are twofold. First, it is required for network \nlearning; if a discrete stack were used, a small change in weights could result in a big \n(discrete) change in the stack. This was the motivation underlying the continuous \nstack used by Giles et al. Second, the continuous stack is differentiable and hence \nallows us to back propagate error through the stack during learning. While we have \nsummarized this point in one sentence, the reader must appreciate the fact that it \nis no small feat! Giles et al. did not consider back propagation through the stack. \n\nStack item (top of stack first)   Thickness \nX   0.2 \nZ   0.5 \nV   0.4 \nX   0.7 \nY   0.4 \n\nFigure 3: A continuous stack. The symbols indicate the contents; the height of \na stack entry indicates its thickness, also given by the number to the right. The \ntop composite symbol on the stack is a combination of the items forming a total \nthickness of 1.0; the next composite symbol is a combination of the items making \nup the next 1.0 units of thickness. \n\nEach time step, the network performs two operations on the stack: \n\nPop. If a demon unit fires, the top two composite symbols should be popped from \nthe stack (to be replaced by the demon's identity). If no demon unit fires, in which \ncase the default unit becomes active, the stack should remain unchanged. These \nbehaviors, as well as interpolated behaviors, are achieved by multiplying by $s_{\\mathrm{def}}$, \nthe activity of the default unit, the thickness of any portion of a stack item \ncontributing to the top two composite symbols. 
Remember that $s_{\\mathrm{def}}$, the activity of the default unit, is 0 when one or more demon units are strongly \nactive, and is 1 when the default unit is fully active. \n\nPush. The symbol written onto the stack is the composite symbol formed by summing \nthe identity vectors of the demon units, weighted by their activities: $\\sum_i s_i r_i$, \nwhere $r_i$ is the vector representing demon i's identity. Included in this summation \nis the default unit, where $r_{\\mathrm{def}}$ is defined to be the composite symbol over thickness \n$s_{\\mathrm{def}}$ of the input queue. (After a thickness of $s_{\\mathrm{def}}$ is read from the input queue, it \nis removed from the queue.) \n\n5 TRAINING METHODOLOGY \n\nThe system is trained on positive and negative examples of a context-free grammar. \nIts task is to classify each input string as grammatical or not. Because the grammars \ncan always be written such that the root of the parse tree is the symbol S (e.g., \nFigure 2), the stack should contain just S upon completion of processing of a positive \nexample. For a negative example, the stack should contain anything but S. \n\nThese criteria can be translated into an objective function as follows. If one assumes \na Gaussian noise distribution over outputs, the probability that the top of the stack \ncontains the symbol S following presentation of example i is \n\npi oot