Part of Advances in Neural Information Processing Systems 2 (NIPS 1989)
C. Giles, Guo-Zheng Sun, Hsing-Hen Chen, Yee-Chun Lee, Dong Chen
A higher order single layer recursive network easily learns to simulate a deterministic finite state machine and recognize regular grammars. When an enhanced version of this neural net state machine is connected through a common error term to an external analog stack memory, the combination can be interpreted as a neural net pushdown automaton. The neural net finite state machine is given the primitives push and pop, and is able to read the top of the stack. Through a gradient descent learning rule derived from the common error function, the hybrid network learns to effectively use the stack actions to manipulate the stack memory and to learn simple context-free grammars.
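The external analog stack referred to above can be made concrete with a small sketch. The following is an illustrative continuous stack in Python, written only to convey the flavor of push, pop, and read primitives driven by analog strengths; the class name, the (value, thickness) representation of stack entries, and the unit-depth read window are assumptions made for this example, not the construction used in the paper.

    # Illustrative continuous stack (NOT the paper's exact construction):
    # each entry is a (value, thickness) pair, with the top of the stack at the end.
    class AnalogStack:
        def __init__(self):
            self.entries = []

        def push(self, value, strength):
            # push the current input value with an analog strength in (0, 1]
            if strength > 0.0:
                self.entries.append((value, strength))

        def pop(self, strength):
            # remove a total analog thickness of `strength` from the top of the stack
            remaining = strength
            while remaining > 0.0 and self.entries:
                value, thickness = self.entries[-1]
                if thickness <= remaining:
                    self.entries.pop()
                    remaining -= thickness
                else:
                    self.entries[-1] = (value, thickness - remaining)
                    remaining = 0.0

        def read(self, depth=1.0):
            # thickness-weighted average of the values within `depth` of the top
            total, seen = 0.0, 0.0
            for value, thickness in reversed(self.entries):
                weight = min(thickness, depth - seen)
                total += weight * value
                seen += weight
                if seen >= depth:
                    break
            return total / seen if seen > 0.0 else 0.0

A network output a in [-1, 1] could, for instance, be interpreted as push(I(t), a) when a > 0 and pop(-a) when a < 0, so that the stack action varies smoothly with the network's output and remains compatible with gradient descent training.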
INTRODUCTION

Biological networks readily and easily process temporal information; artificial neural networks should do the same. Recurrent neural network models permit the encoding and learning of temporal sequences. There are many recurrent neural net models; for example see [Jordan 1986, Pineda 1987, Williams & Zipser 1988]. Nearly all encode the current state representation of the model in the activity of the neurons, and the next state is determined by the current state and input. From an automata perspective, this dynamical structure is a state machine. One formal model of sequences, and of the machines that generate and recognize them, is formal grammars and their respective automata. These models formalize some of the foundations of computer science. In the Chomsky hierarchy of formal grammars [Hopcroft & Ullman 1979] the simplest level of complexity is defined by the finite state machine and its regular grammars. (All machines
and grammars described here are deterministic.) The next level of complexity is described by pushdown automata and their associated context-free grammars. The pushdown automaton is a finite state machine with the added power to use a stack memory. Neural networks should be able to perform the same type of computation and thus solve such learning problems as grammatical inference [Fu 1982]. Simple grammatical inference is defined as the problem of finding (learning) a grammar from a finite set of strings, often called the teaching sample. Recall that a (phrase-structured) grammar is defined as a 4-tuple (N, V, P, S), where N and V are the nonterminal and terminal vocabularies, P is a finite set of production rules and S is the start symbol. Here grammatical inference is also defined as the learning of the machine that recognizes the teaching and testing samples. Potential applications of grammatical inference include such varied areas as pattern recognition, information retrieval, programming language design, translation and compiling, and graphics languages [Fu 1982]. There has been a great deal of interest in teaching neural nets to recognize grammars and simulate automata [Allen 1989, Jordan 1986, Pollack 1989, Servan-Schreiber et al. 1989, Williams & Zipser 1988]. Some important extensions of that work are discussed here. In particular we construct recurrent higher order neural net state machines which have no hidden layers and seem to be at least as powerful as any neural net multilayer state machine discussed so far. For example, the learning time and training sample size are significantly reduced. In addition, we integrate this neural net finite state machine with an external stack memory and inform the network, through a common objective function, that it has at its disposal the symbol at the top of the stack and the operation primitives of push and pop. By devising a common error function which integrates the stack and the neural net state machine, this hybrid structure learns to effectively use the stack to recognize context-free grammars. In the interesting work of [Williams & Zipser 1988] a recurrent net learns only the state machine part of a Turing Machine, since the associated move, read, and write operations for each input string are known and are given as part of the training set. However, the model we present learns how to manipulate the push, pop, and read primitives of an external stack memory, plus learns the additional necessary state operations and structure.

HIGHER ORDER RECURRENT NETWORK

The recurrent neural network utilized can be considered as a higher order modification of the network model developed by [Williams & Zipser 1988]. Recall that in a recurrent net the activation state S of the neurons at time (t+1) is defined as in a state machine automaton:
S(t+1) = F( S(t), I(t); W )                                           (1)

where F maps the state S and the input I at time t to the next state. The weight matrix W forms the mapping and is usually learned. We use a higher order form for this mapping:
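Before giving that form, a minimal numerical sketch may help fix ideas. One common second order choice, built from products of state and input activities, is S_i(t+1) = g( sum_{j,k} W_ijk S_j(t) I_k(t) ); the tensor shapes, the one-hot input coding, and the sigmoid g in the sketch below are assumptions made for the example, not a restatement of the paper's exact equation.

    import numpy as np

    def second_order_step(W, S, I):
        # W: third order weight tensor, shape (N, N, M)
        # S: current state activations, shape (N,)
        # I: current input symbol as a one-hot vector, shape (M,)
        net = np.einsum('ijk,j,k->i', W, S, I)   # sum_{j,k} W[i,j,k] * S[j] * I[k]
        return 1.0 / (1.0 + np.exp(-net))        # sigmoid activation g

    # Example: run a small random second order net over a short input string.
    rng = np.random.default_rng(0)
    N, M = 4, 3                                  # 4 state neurons, 3 input symbols
    W = rng.normal(scale=0.5, size=(N, N, M))
    S = np.zeros(N)
    S[0] = 1.0                                   # designated start state
    for symbol in (0, 2, 1):
        I = np.zeros(M)
        I[symbol] = 1.0
        S = second_order_step(W, S, I)

One reading of such a product form is that the weight W_ijk directly parameterizes the transition taken from state j on input symbol k into state i, which is what makes the correspondence with a finite state machine so direct.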