{"title": "Dynamics of Learning in Recurrent Feature-Discovery Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 70, "page_last": 76, "abstract": null, "full_text": "Dynamics of Learning in Recurrent \n\nFeature-Discovery Networks \n\nTodd K. Leen \n\nDepartment of Computer Science and Engineering \nOregon Graduate Institute of Science & Technology \n\nBeaverton, OR 97006-1999 \n\nAbstract \n\nThe self-organization of recurrent feature-discovery networks is studied \nfrom the perspective of dynamical systems. Bifurcation theory reveals pa(cid:173)\nrameter regimes in which multiple equilibria or limit cycles coexist with the \nequilibrium at which the networks perform principal component analysis. \n\n1 \n\nIntroduction \n\nOja (1982) made the remarkable observation that a simple model neuron with an \nHebbian adaptation rule develops into a filter for the first principal component of \nthe input distribution. Several researchers have extended Oja's work, developing \nnetworks that perform a complete principal component analysis (PCA). Sanger \n(1989) proposed an algorithm that uses a single layer of weights with a set of \ncascaded feedback projections to force nodes to filter for the principal components. \nThis architecture singles out a particular node for each principal component. Oja \n(1989) and Oja and Karhunen (1985) give a related algorithm that projects inputs \nonto an orthogonal basis spanning the principal subspace, but does not necessarily \nfilter for the principal components themselves. \nIn another class of models, nodes are forced to learn different statistical features \nby a set of lateral connections. Rubner and Schulten (1990) use cascaded lateral \nconnections; the ith node receives signals from the input and all nodes j with j < i. \nThe lateral connections are modified by an anti-Hebbian learning rule that tends \nto de-correlate the node responses . 
Like Sanger's scheme, this architecture singles \nout a particular node for each principal component. Kung and Diamantaras (1990) \npropose a different learning rule on the same network topology. Foldiak (1989) \nsimulates a network with full lateral connectivity, but does not discuss convergence. \n\nThe goal of this paper is to help form a more complete picture of feature-discovery \nmodels that use lateral signal flow. We discuss two models with particular \nemphasis on their learning dynamics. The models incorporate Hebbian and anti-Hebbian \nadaptation, and recurrent lateral connections. We give stability analyses and derive \nbifurcation diagrams for the models. Stability analysis gives a lower bound on the \nrate of adaptation of the lateral connections, below which the equilibrium \ncorresponding to PCA is unstable. Bifurcation theory provides a description of the behavior \nnear loss of stability. The bifurcation analyses reveal stable equilibria in which the \nweight vectors from the input are combinations of the eigenvectors of the input \ncorrelation. Limit cycles are also found. \n\n2 The Single-Neuron Model \n\nIn Oja's model the input, x ∈ R^N, is a random vector assumed to be drawn from \na stationary probability distribution. The vector of synaptic weights is denoted w \nand the post-synaptic response is linear; y = x · w. The continuous-time, ensemble-averaged \nform of the learning rule is \n\ndw/dt = <x y> - <y²> w = R w - (w · R w) w,    (1) \n\nwhere < ... > denotes the average over the ensemble of inputs, and R = <x x^T> \nis the correlation matrix. The unit-magnitude eigenvectors of R are denoted \ne_i, i = 1 ... N, and are assumed to be ordered in decreasing magnitude of the \nassociated eigenvalues λ_1 > λ_2 > ... > λ_N > 0. 
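For illustration, a discrete, pattern-by-pattern version of rule (1), Δw = γ y (x - y w), can be simulated as follows. This sketch is not from the paper; the correlation matrix, learning rate γ, and iteration count are arbitrary choices made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Diagonal correlation matrix R = <x x^T> with eigenvalues 5 > 2 > 1 > 0.5,
# so the first principal component direction is e1 = (1, 0, 0, 0).
lams = np.array([5.0, 2.0, 1.0, 0.5])

# Discrete Oja rule: w <- w + gamma * y * (x - y * w), with y = x . w.
w = rng.normal(size=4)
w /= np.linalg.norm(w)            # start on the unit sphere
gamma = 0.005                     # learning rate (arbitrary)
for _ in range(6000):
    x = np.sqrt(lams) * rng.normal(size=4)   # samples with <x x^T> = diag(lams)
    y = x @ w
    w += gamma * y * (x - y * w)

# w approaches +/- e1 with near-unit magnitude, maximizing response variance.
alignment = abs(w[0]) / np.linalg.norm(w)
print(round(alignment, 2))        # close to 1.0
```

Because the rule both normalizes w and rotates it toward the leading eigenvector, no explicit weight normalization step is needed.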
Oja shows that the weight vector \nasymptotically approaches ±e_1. The variance of the node's response is thus \nmaximized and the node acts as a filter for the first principal component of the input \ndistribution. \n\n3 Extending the Single-Neuron Model \n\nTo extend the model to a system of M ≤ N nodes we consider a set of linear neurons \nwith weight vectors (called the forward weights) w_1 ... w_M connecting each to the \nN-dimensional input. Without interactions between the nodes in the array, all M \nweight vectors would converge to ±e_1. \nWe consider two approaches to building interactions that force nodes to filter for \ndifferent statistical features. In the first approach an internode potential is \nconstructed. This formulation results in a non-local model. The model is made local \nby introducing lateral connections that naturally acquire anti-Hebbian adaptation. \nFor reasons that will become clear, the resulting model is referred to as a \nminimal coupling scheme. In the second approach, we write equations of motion for \nthe forward weights based directly on (1). The evolution of the lateral connection \nstrengths will follow a simple anti-Hebbian rule. \n\n3.1 Minimal Coupling \n\nThe response of the ith node in the array is taken to be linear in the input, \n\ny_i = x · w_i.    (2) \n\nThe adaptation of the forward weights is derived from the potential \n\nU = -(1/2) Σ_i <y_i²> + (C/2) Σ_{i,k; i≠k} <y_i y_k>² \n  = -(1/2) Σ_j (w_j · R w_j) + (C/2) Σ_{j,k; j≠k} (w_j · R w_k)²,    (3) \n\nwhere C is a coupling constant. The first term of U generates the Hebb law, \nwhile the second term penalizes correlated node activity (Yuille et al. 1989). The \nequations of motion are constructed to perform gradient descent on U with a term \nadded to bound the weight vectors, \n\ndw_i/dt = -∇_{w_i} U - <y_i²> w_i \n        = <x y_i> - C Σ_{j≠i} <y_i y_j> <x y_j> - <y_i²> w_i \n        = R w_i - C Σ_{j≠i} (w_i · R w_j) R w_j - (w_i · R w_i) w_i.    (4) \n\nNote that w_i refers to the weight vector from the input to the ith node, not the ith \ncomponent of the weight vector. \nEquation (4) is non-local as it involves correlations, <y_i y_j>, between nodes. In \norder to provide a purely local adaptation, we introduce a symmetric matrix of \nlateral connections \n\nη_ij,   i, j = 1, ..., M,   η_ii = 0. \n\nThese evolve according to \n\ndη_ij/dt = -d (η_ij + C <y_i y_j>) = -d (η_ij + C w_i · R w_j),    (5) \n\nwhere d is a rate constant. In the limit of fast adaptation (large d) \n\nη_ij → -C <y_i y_j>. \n\nWith this limiting behavior in mind, we replace (4) with \n\ndw_i/dt = <x y_i> + Σ_{j≠i} η_ij <x y_j> - <y_i²> w_i \n        = R w_i + Σ_{j≠i} η_ij R w_j - (w_i · R w_i) w_i.    (6) \n\nEquations (5) and (6) specify the adaptation of the network. \nNotice that the response of the ith node is given by (2) and is thus independent of \nthe signals carried on the lateral connections. In this sense the lateral signals affect \nnode plasticity but not node response. This minimal coupling can also be derived \nas a low-order approximation to the model in §3.2 below. \n\n3.1.1 Stability and Bifurcation \n\nBy inspection the weight dynamics given by (5) and (6) have an equilibrium at \n\nX0:   w_i = e_i,   η_ij = 0,   i, j = 1, ..., M.    (7) \n\nAt this equilibrium the outputs are the first M principal components of the input \nvectors. In suitable coordinates the linear part of the equations of motion breaks into \nblock-diagonal form, with any possible instabilities constrained to 3 × 3 sub-blocks. \nDetails of the stability and bifurcation analysis are given in Leen (1991). The \nprincipal component subspace is always asymptotically stable. 
However, the equilibrium \nX0 is linearly stable if and only if \n\nd > d0 ≡ (λ_i - λ_j)² (λ_i + λ_j) / (λ_i² + λ_j²)    (8) \n\nand \n\nC > C0 ≡ 1 / (λ_i + λ_j),   for all pairs 1 ≤ i < j ≤ M.    (9) \n\nAt C0 or d0 there is a qualitative change (a bifurcation) in the learning dynamics. If \nthe condition on d is violated, then there is a Hopf bifurcation to oscillating weights. \nAt the critical value C0 there is a bifurcation to multiple equilibria. The bifurcation \nnormal form was found by Liapunov-Schmidt reduction (see e.g. Golubitsky and \nSchaeffer 1984) performed at the bifurcation point (X0, C0). To deal effectively with \nthe high-dimensional phase space of the network, the calculations were performed \nwith a symbolic algebra program. \n\nAt the critical point (X0, C0) there is a supercritical pitchfork bifurcation. Two \nunstable equilibria appear near X0 for C > C0. At these equilibria the forward \nweights are mixtures of e_M and e_{M-1}, and the lateral connection strengths are \nnon-zero. Generically one expects a saddle-node bifurcation. However, X0 is an \nequilibrium for all values of C, and the system has an inversion symmetry. These \nconditions preclude the saddle-node and transcritical bifurcations, and we are left \nwith the pitchfork. \nThe position of stable equilibria away from (X0, C0) can be found by examining \nterms of order five and higher in the bifurcation expansion. Alternatively we \nexamine the bifurcation from the homogeneous solution, Xh, in which all weight vectors \nare proportional to e_1. For a system of two nodes this equilibrium is asymptotically \nstable provided \n\nC < Ch.    (10) \n\nIf λ_1 < 3λ_2, then there is a supercritical pitchfork bifurcation at Ch. Two stable \nequilibria emerge from Xh for C > Ch. At these stable equilibria, the forward \nweight vectors are mixtures of the first two correlation eigenvectors and the lateral \nconnection strengths are nonzero. 
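These stability boundaries can be checked numerically. The sketch below is not from the paper: it linearizes (5) and (6) about X0 for two nodes, using the perturbation w_1 = e_1 + a e_2, w_2 = e_2 + b e_1, η_12 = η. The resulting 3 × 3 block (derived here for the sketch; the paper only states that instabilities are confined to such sub-blocks) loses stability exactly at C0 and d0.

```python
import numpy as np

def jacobian_block(lam1, lam2, C, d):
    """3x3 linearization of rules (5)-(6) about X0 for two nodes, in the
    modes (a, b, eta): w1 = e1 + a*e2, w2 = e2 + b*e1, eta12 = eta."""
    return np.array([
        [lam2 - lam1, 0.0,           lam2],
        [0.0,         lam1 - lam2,   lam1],
        [-d * C * lam2, -d * C * lam1, -d],
    ])

def pca_stable(lam1, lam2, C, d):
    """True when every eigenvalue of the block has negative real part."""
    return np.linalg.eigvals(jacobian_block(lam1, lam2, C, d)).real.max() < 0.0

lam1, lam2 = 2.0, 1.0
C0 = 1.0 / (lam1 + lam2)                                      # = 1/3
d0 = (lam1 - lam2)**2 * (lam1 + lam2) / (lam1**2 + lam2**2)   # = 0.6

print(pca_stable(lam1, lam2, C=0.5, d=1.0))   # True:  C > C0 and d > d0
print(pca_stable(lam1, lam2, C=0.2, d=1.0))   # False: C < C0, pitchfork side
print(pca_stable(lam1, lam2, C=0.5, d=0.3))   # False: d < d0, Hopf side
```

Scanning C and d against pca_stable traces out the two stability boundaries; on the d side the crossing eigenvalues are a complex pair, consistent with the Hopf bifurcation described above.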
\n\nThe complete bifurcation diagram for a system of two nodes is shown in Fig. 1. The \nupper portion of the figure shows the bifurcation at (X0, C0). The horizontal line \ncorresponds to the PCA equilibrium X0. This equilibrium is stable (heavy line) for \nC > C0, and unstable (light line) for C < C0. The subsidiary, unstable equilibria \nthat emerge from (X0, C0) lie on the light, parabolic branches of the top diagram. \nCalculations indicate that the form of this bifurcation is independent of the number \nof nodes and of the input dimension. Of course the value of C0 increases with \nan increasing number of nodes, cf. (9). \nThe lower portion of Fig. 1 shows the bifurcation from (Xh, Ch) for a system of two \nnodes. The horizontal line corresponds to the homogeneous equilibrium Xh. This \nis stable for C < Ch and unstable for C > Ch. The stable equilibria consisting of \nmixtures of the correlation eigenvectors lie on the heavy parabolic branches of the \ndiagram. For networks with more nodes, there are presumably further bifurcations \nalong the supercritical stable branches emerging from (Xh, Ch); equilibria with \nqualitatively different eigenvector mixtures are observed in simulations. \n\nEach inset in the figure shows equilibrium forward weight vectors for both nodes in \na two-node network. These configurations were generated by numerical integration \nof the equations of motion (5) and (6). The correlation matrix corresponds to an \nensemble of noise vectors with short-range correlations between the components. \nSimulations of the corresponding discrete, pattern-by-pattern learning rule confirm \nthe form of the weight vectors shown here. \n\nFig 2: Regions in the (λ_1, λ_2) plane corresponding to supercritical (shaded) and \nsubcritical (unshaded) Hopf bifurcation. 
\n\nFigure 1: Bifurcation diagram for \nthe minimal model. \n\n3.2 Full Coupling \n\nIn a more conventional coupling scheme, the signals carried on the lateral \nconnections affect the node activities directly. For linear node response, the vector of \nactivities is given by \n\ny = (1 - η)^(-1) w x,    (11) \n\nwhere y ∈ R^M, η is the M × M matrix of lateral connection strengths, and w is an \nM × N matrix whose ith row is the forward weight vector to the ith node. The \nadaptation rule is \n\ndw/dt = <y x^T> - Diag(<y y^T>) w,    (12) \n\ndη/dt = -D η - C <y y^T>,   η_ii = 0,    (13) \n\nwhere D and C are constants and Diag sets the off-diagonal elements of its argument \nequal to zero. This system also has the PCA equilibrium X0. This is linearly stable \nif \n\nD > 0    (14) \n\nand \n\nC > C0 D.    (15) \n\nEquation (14) tells us that the PCA equilibrium is structurally unstable without the \nDη term in (13). Without this term, the model reduces to that given by Foldiak \n(1989). That the latter generally does not converge to the PCA equilibrium is \nconsistent with the condition in (14). \nIf, on the other hand, the condition on C is violated, then the network undergoes a \nHopf bifurcation leading to oscillations. Depending on the eigenvalue spectrum of \nthe input correlation, this bifurcation may be subcritical (with stable limit cycles \nnear X0 for C < C0), or supercritical (with unstable limit cycles near X0 for \nC > C0). Figure 2 shows the corresponding regions in the (λ_1, λ_2) plane for a \nnetwork of two nodes with D = 1. Simulations show that even in the supercritical \nregime, stable limit cycles are found for C < C0, and for C > C0 sufficiently \nclose to C0. This suggests that the complete bifurcation diagram in the \nsupercritical regime is shaped like the bottom of a wine bottle, with only the indentation \nshown in figure 2. 
Under the approximation (1 - η)^(-1) ≈ 1 + η, the supercritical regime is \nsignificantly narrowed. \n\n4 Discussion \n\nThe primary goal of this study has been to give a theoretical description of learning \nin feature-discovery models; in particular, models that use lateral interactions to \nensure that nodes tune to different statistical features. The models presented here \nhave several different limit sets (equilibria and cycles) whose stability and location \nin the weight space depend on the relative learning rates in the network, and \non the eigenvalue spectrum of the input correlation. We have applied tools from \nbifurcation theory to qualitatively describe the location and determine the stability of \nthese different limiting solutions. This theoretical approach provides a unifying \nframework within which similar algorithms can be studied. \nBoth models have equilibria at which the network performs PCA. In addition, the \nminimal model has stable equilibria for which the forward weight vectors are \nmixtures of the correlation eigenvectors. Both models have regimes in which the weight \nvectors oscillate. The model given by Rubner and Schulten (1990) also loses stability \nthrough Hopf bifurcation for small values of the lateral learning rate. \nThe minimal values of C in (9) and (15) for the stability of the PCA equilibrium \ncan become quite large for small correlation eigenvalues. These stringent conditions \ncan be ameliorated in both models by the replacement \n\nd η_ij → ( <y_i²> + <y_j²> ) η_ij. \n\nHowever, in the minimal model this leads to degenerate bifurcations which have not \nbeen thoroughly examined. \n\nFinally, it remains to be seen whether the techniques employed here extend to similar \nsystems with non-linear node activation (e.g. Carlson 1991) or to the problem of \nlocating multiple minima in cost functions for supervised learning models. 
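As a small consistency check on the full-coupling model of §3.2, the averaged rules (12) and (13) can be coded directly and evaluated at X0. This is an illustration, not code from the paper; it assumes the linear response (11) solves y = w x + η y, i.e. y = (1 - η)^(-1) w x, and the correlation matrix and rate constants are arbitrary choices.

```python
import numpy as np

R = np.diag([3.0, 1.5, 0.5])   # input correlation (eigenvalues chosen arbitrarily)
M = 2                          # two nodes, N = 3 inputs
C, D = 1.0, 1.0                # rate constants (arbitrary)

def averaged_rhs(w, eta):
    """Right-hand sides of rules (12) and (13), assuming the recurrent
    response solves y = w x + eta y, so <y x^T> = T w R with T = (I - eta)^-1."""
    T = np.linalg.inv(np.eye(M) - eta)
    yx = T @ w @ R                        # <y x^T>
    yy = T @ w @ R @ w.T @ T.T            # <y y^T>
    dw = yx - np.diag(np.diag(yy)) @ w    # rule (12)
    deta = -D * eta - C * yy              # rule (13) ...
    np.fill_diagonal(deta, 0.0)           # ... with eta_ii held at zero
    return dw, deta

# X0: forward weights equal the two leading eigenvectors, no lateral coupling.
w0 = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]])
dw, deta = averaged_rhs(w0, np.zeros((M, M)))
print(abs(dw).max(), abs(deta).max())     # 0.0 0.0 -- X0 is an equilibrium
```

Away from X0 the same functions give the averaged learning flow, so they can be Euler-integrated to explore the stable and oscillatory regimes discussed above.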
\n\nAcknowledgments \n\nThis work was supported by the Office of Naval Research under contract N00014-90-1349 \nand by DARPA grant MDA 972-88-J-1004 to the Department of Computer \nScience and Engineering. The author thanks Bill Baird for stimulating e-mail \ndiscussion. \n\nReferences \n\nCarlson, A. (1991) Anti-Hebbian learning in a non-linear neural network. Biol. Cybern., \n64:171-176. \nFoldiak, P. (1989) Adaptive network for optimal linear feature extraction. In Proceedings \nof the IJCNN, volume I, pages 401-405. \nGolubitsky, Martin and Schaeffer, David (1984) Singularities and Groups in Bifurcation \nTheory, Vol. I. Springer-Verlag, New York. \nKung, S. and Diamantaras, K. (1990) A neural network learning algorithm for adaptive \nprincipal component extraction (APEX). In Proceedings of the IEEE International \nConference on Acoustics, Speech and Signal Processing, pages 861-864. \nLeen, T. K. (1991) Dynamics of learning in linear feature-discovery networks. Network: \nComputation in Neural Systems, to appear. \nOja, E. (1982) A simplified neuron model as a principal component analyzer. J. Math. \nBiology, 15:267-273. \nOja, E. (1989) Neural networks, principal components, and subspaces. International \nJournal of Neural Systems, 1:61-68. \nOja, E. and Karhunen, J. (1985) On stochastic approximation of the eigenvectors and \neigenvalues of the expectation of a random matrix. J. of Math. Anal. and Appl., \n106:69-84. \nRubner, J. and Schulten, K. (1990) Development of feature detectors by self-organization: \nA network model. Biol. Cybern., 62:193-199. \nSanger, T. (1989) An optimality principle for unsupervised learning. In D.S. Touretzky, \neditor, Advances in Neural Information Processing Systems 1. Morgan Kaufmann. \nYuille, A.L., Kammen, D.M. and Cohen, D.S. (1989) Quadrature and the development of \norientation-selective cortical cells by Hebb rules. Biol. Cybern., 61:183-194. 
\n\n\f", "award": [], "sourceid": 319, "authors": [{"given_name": "Todd", "family_name": "Leen", "institution": null}]}