{"title": "Adjoint Operator Algorithms for Faster Learning in Dynamical Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 498, "page_last": 508, "abstract": null, "full_text": "498 \n\nBarhen, Toomarian and Gulati \n\nAdjoint Operator Algorithms for Faster Learning in Dynamical Neural Networks \n\nJacob Barhen \n\nNikzad Toomarian \n\nSandeep Gulati \n\nCenter for Space Microelectronics Technology \nJet Propulsion Laboratory \nCalifornia Institute of Technology \nPasadena, CA 91109 \n\nABSTRACT \n\nA methodology for faster supervised learning in dynamical nonlinear neural networks is presented. It exploits the concept of adjoint operators to enable computation of changes in the network's response due to perturbations in all system parameters, using the solution of a single set of appropriately constructed linear equations. The lower bound on speedup per learning iteration over conventional methods for calculating the neuromorphic energy gradient is O(N^2), where N is the number of neurons in the network. \n\n1 INTRODUCTION \n\nThe biggest promise of artificial neural networks as computational tools lies in the hope that they will enable fast processing and synthesis of complex information patterns. In particular, considerable efforts have recently been devoted to the formulation of efficient methodologies for learning (e.g., Rumelhart et al., 1986; Pineda, 1988; Pearlmutter, 1989; Williams and Zipser, 1989; Barhen, Gulati and Zak, 1989). The development of learning algorithms is generally based upon the minimization of a neuromorphic energy function. The fundamental requirement of such an approach is the computation of the gradient of this objective function with respect to the various parameters of the neural architecture, e.g., synaptic weights, neural gains, etc. 
The paramount contribution to the often excessive cost of learning using dynamical neural networks arises from the necessity to solve, at each learning iteration, one set of equations for each parameter of the neural system, since those parameters affect the network's energy both directly and indirectly. \n\nIn this paper we show that the concept of adjoint operators, when applied to dynamical neural networks, not only yields a considerable algorithmic speedup, but also puts on a firm mathematical basis prior results for \"recurrent\" networks, the derivations of which sometimes involved much heuristic reasoning. We have already used adjoint operators in some of our earlier work in the fields of energy-economy modeling (Alsmiller and Barhen, 1984) and nuclear reactor thermal hydraulics (Barhen et al., 1982; Toomarian et al., 1987) at the Oak Ridge National Laboratory, where the concept flourished during the past decade (Oblow, 1977; Cacuci et al., 1980). \n\nIn the sequel we first motivate and construct, in the most elementary fashion, a computational framework based on adjoint operators. We then apply our results to the Cohen-Grossberg-Hopfield (CGH) additive model, enhanced with terminal attractor (Barhen, Gulati and Zak, 1989) capabilities. We conclude by presenting the results of a few typical simulations. \n\n2 ADJOINT OPERATORS \n\nConsider, for the sake of simplicity, that a problem of interest is represented by the following system of N coupled nonlinear equations \n\nφ(u, p) = 0  (2.1) \n\nwhere φ denotes a nonlinear operator^1. Let u and p represent the N-vector of dependent state variables and the M-vector of system parameters, respectively. We will assume that generally M \u00bb N and that elements of p are, in principle, independent. Furthermore, we will also assume that, for a specific choice of parameters, a unique solution of Eq. (2.1) exists. 
Hence, u is an implicit function of p. A system \"response\", R, represents any result of the calculations that is of interest. Specifically \n\nR = R(u, p)  (2.2) \n\ni.e., R is a known nonlinear function of p and u, and may be calculated from Eq. (2.2) when the solution u of Eq. (2.1) has been obtained for a given p. The problem of interest is to compute the \"sensitivities\" of R, i.e., the derivatives of R with respect to the parameters p_μ, μ = 1, ..., M. By definition \n\ndR/dp_μ = ∂R/∂p_μ + (∂R/∂u) · (∂u/∂p_μ)  (2.3) \n\n1 If differential operators appear in Eq. (2.1), then a corresponding set of boundary and/or initial conditions to specify the domain of φ must also be provided. In general an inhomogeneous \"source\" term can also be present. The learning model discussed in this paper focuses on the adiabatic approximation only. Nonadiabatic learning algorithms, wherein the response is defined as a functional, will be discussed in a forthcoming article. \n\nSince the response R is known analytically, the computation of ∂R/∂p_μ and ∂R/∂u is straightforward. The quantity that needs to be determined is the vector ∂u/∂p_μ. Differentiating the state equations (2.1), we obtain a set of equations to be referred to as \"forward\" sensitivity equations \n\n(∂φ/∂u) (∂u/∂p_μ) = - ∂φ/∂p_μ  (2.4) \n\nTo simplify the notation, we omit the \"transposed\" sign and denote the N-by-N forward sensitivity matrix ∂φ/∂u by A, the N-vector ∂u/∂p_μ by ũ_μ, and the \"source\" N-vector -∂φ/∂p_μ by s_μ. Thus \n\nA ũ_μ = s_μ  (2.5) \n\nSince the source term in Eq. (2.5) explicitly depends on μ, computing dR/dp_μ requires solving the above system of N algebraic equations for each parameter p_μ. This difficulty is circumvented by introducing adjoint operators. Let A* denote the formal adjoint^2 of the operator A. The adjoint sensitivity equations can then be expressed as \n\nA* ũ* = s*  (2.6) 
\n\nBy definition, for algebraic operators \n\n< ũ*, A ũ_μ > = < A* ũ*, ũ_μ >  (2.7) \n\nSince Eq. (2.3) can be rewritten as \n\ndR/dp_μ = ∂R/∂p_μ + < ∂R/∂u, ũ_μ >  (2.8) \n\nif we identify \n\ns* = ∂R/∂u  (2.9) \n\nwe observe that the source term for the adjoint equations is independent of the specific parameter p_μ. Hence, the solution of a single set of adjoint equations will provide all the information required to compute the gradient of R with respect to all parameters. To underscore that fact we shall denote ũ* as v. Thus \n\ndR/dp_μ = ∂R/∂p_μ + < v, s_μ >  (2.10) \n\n2 Adjoint operators can only be considered for densely defined linear operators on Banach spaces (see e.g., Cacuci, 1980). For the neural application under consideration we will limit ourselves to real Hilbert spaces. Such spaces are self-dual. Furthermore, the domain of an adjoint operator is determined by selecting appropriate adjoint boundary conditions^1. The associated bilinear form evaluated on the domain boundary must thus also generally be included. \n\nWe will now apply this computational framework to a CGH network enhanced with terminal attractor dynamics. The model developed in the sequel differs from our earlier formulations (Barhen, Gulati and Zak, 1989; Barhen, Zak and Gulati, 1989) in avoiding the use of constraints in the neuromorphic energy function, thereby eliminating the need for differential equations to evolve the concomitant Lagrange multipliers. Also, the usual activation dynamics is transformed into a set of equivalent equations which exhibit more \"congenial\" numerical properties, such as \"contraction\". 
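To make Eqs. (2.5)-(2.10) concrete, here is a minimal numerical sketch (not from the original paper; the model system φ(u, p) = K u - tanh(W u) - b, and all names and sizes, are illustrative assumptions). It checks that a single adjoint solve reproduces the gradients obtained from one forward solve per parameter:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
K = 2.0                      # decay constant, chosen so the fixed-point map contracts (assumption)
W = 0.3 * rng.standard_normal((N, N))
b = rng.standard_normal(N)
a = rng.standard_normal(N)   # target vector entering the response (assumption)

def solve_state(W, b, iters=300):
    # Fixed point of phi(u, p) = K*u - tanh(W @ u) - b = 0, by contraction iteration
    u = np.zeros(N)
    for _ in range(iters):
        u = (np.tanh(W @ u) + b) / K
    return u

def response(u):
    # R(u) = 1/2 ||u - a||^2, playing the role of Eq. (2.2)
    return 0.5 * np.sum((u - a) ** 2)

u = solve_state(W, b)
D = 1.0 - np.tanh(W @ u) ** 2            # diagonal of g'(W u)
A = K * np.eye(N) - D[:, None] * W       # forward sensitivity matrix A = dphi/du

# Adjoint route, Eqs. (2.6)/(2.9): ONE linear solve A^T v = dR/du
v = np.linalg.solve(A.T, u - a)
grad_b_adjoint = v.copy()                # since s_mu = -dphi/db_n = e_n, dR/db = v

# Forward route, Eq. (2.5): one linear solve PER parameter
grad_b_forward = np.array([np.linalg.solve(A, e) @ (u - a) for e in np.eye(N)])

# Finite-difference spot check on one bias
eps = 1e-6
bp = b.copy(); bp[0] += eps
fd = (response(solve_state(W, bp)) - response(u)) / eps
print(grad_b_adjoint[0], fd)   # the two numbers should nearly agree
```

The point of the sketch is the cost asymmetry: the forward route performs N (more generally M) linear solves, the adjoint route exactly one, which is the source of the O(N^2) speedup claimed in the abstract.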
\n\n3 APPLICATIONS TO NEURAL LEARNING \n\nWe formalize a neural network as an adaptive dynamical system whose temporal evolution is governed by the following set of coupled nonlinear differential equations \n\nż_n + κ_n z_n = Σ_m ω_nm T_nm g_γ(z_m) + ^k I_n  (3.1) \n\nwhere z_n represents the mean soma potential of the n-th neuron and T_nm denotes the synaptic coupling from the m-th to the n-th neuron. The weighting factor ω_nm enforces topological considerations. The constant κ_n characterizes the decay of neuron activity. The sigmoidal function g_γ(·) modulates the neural response, with gain given by γ_n; typically, g_γ(z) = tanh(γz). The \"source\" term ^k I_n, which includes dimensional considerations, encodes the contribution, in terms of attractor coordinates of the k-th training sample, via the following expression \n\n^k I_n ∝ [ ^k a_n - ^k u_n ]^β  if n ∈ S_X ;  ^k I_n = 0  if n ∈ S_H ∪ S_Y  (3.2) \n\nThe topographic input, output and hidden network partitions S_X, S_Y and S_H are architectural requirements related to the encoding of mapping-type problems, for which a number of possibilities exist (Barhen, Gulati and Zak, 1989; Barhen, Zak and Gulati, 1989). In previous articles (ibid; Zak, 1989) we have demonstrated that in general, for β = (2i + 1)^{-1} and i a strictly positive integer, such attractors have infinite local stability and provide opportunity for learning in real time. Typically, β can be set to 1/3. Assuming an adiabatic framework, the fixed point equations at equilibrium, i.e., as ż_n → 0, yield \n\nκ_n g_γ^{-1}( ^k ū_n ) = Σ_m ω_nm T_nm ^k ū_m + ^k Ī_n  (3.3) \n\nwhere ū_n = g_γ(z̄_n) represents the neural response. The superscript bar denotes quantities evaluated at steady state. 
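A fixed point of the form (3.3) can be computed by direct iteration of the equivalent map u ← g((Σ_m ω_nm T_nm u_m + I_n)/κ_n) when that map is contractive. A small hypothetical sketch (network size, weights, decay constants, and the frozen source term are all invented for illustration; g = tanh):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 6
kappa = np.full(N, 2.0)                   # decay constants kappa_n (assumed)
T = 0.3 * rng.standard_normal((N, N))     # synaptic strengths T_nm (assumed scale keeps the map contractive)
omega = np.ones((N, N))                   # topology weights omega_nm: fully connected
I_ext = rng.standard_normal(N)            # source term ^k I_n, held fixed here

def fixed_point(T, iters=200):
    # Iterate u <- tanh(((omega*T) @ u + I) / kappa), whose fixed point satisfies Eq. (3.3)
    u = np.zeros(N)
    for _ in range(iters):
        u = np.tanh(((omega * T) @ u + I_ext) / kappa)
    return u

u_bar = fixed_point(T)
# Residual of Eq. (3.3): kappa_n * atanh(u_n) - (sum_m omega_nm T_nm u_m + I_n)
res = kappa * np.arctanh(u_bar) - ((omega * T) @ u_bar + I_ext)
print(np.max(np.abs(res)))   # near zero once the iteration has converged
```

With κ_n larger than the effective coupling strength, the iteration converges geometrically, which is the "contraction" property alluded to at the end of Section 2.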
Operational network dynamics is then given by \n\nu̇_n + u_n = g_γ[ (1/κ_n) Σ_m ω_nm T_nm u_m + (1/κ_n) ^k I_n ]  (3.4) \n\nTo proceed formally with the development of a supervised learning algorithm, we consider an approach based upon the minimization of a constrained \"neuromorphic\" energy function E given by the following expression \n\nE(u, p) = (1/2) Σ_k Σ_n [ ^k u_n - ^k a_n ]^2 ,  ∀ n ∈ S_X ∪ S_Y  (3.5) \n\nWe relate adjoint theory to neural learning by identifying the neuromorphic energy function E in Eq. (3.5) with the system response R. Also, let p denote the following system parameters: the synaptic strengths T_nm, the decay constants κ_n and the gains γ_n. \n\nThe proposed objective function enforces convergence of every neuron in S_X and S_Y to attractor coordinates corresponding to the components in the input-output training patterns, thereby prompting the network to learn the embedded invariances. Lyapunov stability requires an energy-like function to be monotonically decreasing in time. Since in our model the internal dynamical parameters of interest are the synaptic strengths T_nm of the interconnection topology, the characteristic decay constants κ_n and the gain parameters γ_n, this implies that \n\nĖ = Σ_n Σ_m (dE/dT_nm) Ṫ_nm + Σ_n (dE/dκ_n) κ̇_n + Σ_n (dE/dγ_n) γ̇_n < 0  (3.6) \n\nFor each adaptive system parameter p_μ, Lyapunov stability will be satisfied by the following choice of equations of motion \n\nṗ_μ = -τ_p dE/dp_μ  (3.7) \n\nExamples include \n\nṪ_nm = -τ_T dE/dT_nm ,  κ̇_n = -τ_κ dE/dκ_n ,  γ̇_n = -τ_γ dE/dγ_n \n\nwhere the time-scale parameters τ_T, τ_κ and τ_γ > 0. Since E depends on p_μ both directly and indirectly, previous methods required solution of a system of N equations for each parameter p_μ to obtain dE/dp_μ from du/dp_μ. Our methodology, based on adjoint operators, yields all derivatives dE/dp_μ, ∀μ, by solving a single set of N linear equations. \n\nThe nonlinear neural operator for each training pattern k, k = 1, ..., K, at equilibrium is given by \n\n^k φ_n( ^k ū, p ) = ^k ū_n - g[ (1/κ_n) Σ_m ω_nm T_nm ^k ū_m + (1/κ_n) ^k Ī_n ] = 0  (3.8) \n\nwhere, without loss of generality, we have set γ_n to unity. So, in principle, ^k ū_n = ^k ū_n[T, κ, γ, ^k a, ...]. Using Eqs. (3.8), the forward sensitivity matrix can be computed and compactly expressed as \n\n∂ ^k φ_n / ∂ ^k ū_m = δ_nm - (1/κ_n) ^k ĝ_n ω_nm T_nm - ^k β_n δ_nm  (3.9) \n\nwhere \n\n^k β_n = (1/κ_n) ^k ĝ_n ∂ ^k Ī_n / ∂ ^k ū_n  if n ∈ S_X ;  ^k β_n = 0  if n ∈ S_H ∪ S_Y  (3.10) \n\nAbove, ^k ĝ_n represents the derivative of g with respect to ^k u_n, i.e., if g = tanh, then \n\n^k ĝ_n = 1 - [ ^k g̃_n ]^2  where  ^k g̃_n = g[ (1/κ_n) ( Σ_m ω_nm T_nm ^k ū_m + ^k Ī_n ) ]  (3.11) \n\nRecall that the formal adjoint equation is given as A* v = s*; here \n\n^k A*_nm = δ_nm - (1/κ_m) ^k ĝ_m ω_mn T_mn - ^k β_m δ_mn  (3.12) \n\nUsing Eqs. (2.9) and (3.5), we can compute the formal adjoint source \n\n^k s̄_n = ∂E/∂ ^k u_n = ^k u_n - ^k a_n  if n ∈ S_X ∪ S_Y ;  ^k s̄_n = 0  if n ∈ S_H  (3.13) \n\nThe system of adjoint fixed-point equations can then be constructed using Eqs. (3.12) and (3.13), to yield: \n\n^k v_n - Σ_m (1/κ_m) ^k ĝ_m ω_mn T_mn ^k v_m - ^k β_n ^k v_n = ^k s̄_n  (3.14) \n\nNotice that the above coupled system, (3.14), is linear in ^k v. Furthermore, it has the same mathematical characteristics as the operational dynamics (3.4). Its components can be obtained as the equilibrium points (i.e., as v̇ → 0) of the adjoint neural dynamics \n\nv̇_n + ^k v_n = Σ_m (1/κ_m) ^k ĝ_m ω_mn T_mn ^k v_m + ^k β_n ^k v_n + ^k s̄_n  (3.15) 
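Because the adjoint system (3.14) is linear in ^k v, it can be solved directly; the observation above is that it is also the equilibrium of relaxation dynamics of the form (3.15). A toy sketch with assumed stand-in quantities (B_nm playing the role of (1/κ_m) ^k ĝ_m ω_mn T_mn; sizes, entries, and names are invented) confirming the two routes agree:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 6
# Stand-ins for the quantities entering Eq. (3.12) (illustrative values):
B = 0.15 * rng.standard_normal((N, N))   # (1/kappa_m) ghat_m omega_mn T_mn
beta = np.zeros(N); beta[:2] = 0.1       # beta_n nonzero only on the input partition S_X
s_bar = rng.standard_normal(N)           # adjoint source of Eq. (3.13)

# Adjoint operator: A*_nm = delta_nm - B_nm - beta_n delta_nm
A_star = np.eye(N) - B - np.diag(beta)

v_direct = np.linalg.solve(A_star, s_bar)   # direct solution of the linear system (3.14)

# Relaxation route: v' = -A* v + s_bar, integrated with forward Euler
v = np.zeros(N)
dt = 0.1
for _ in range(2000):
    v += dt * (-A_star @ v + s_bar)

print(np.max(np.abs(v - v_direct)))   # near zero: the dynamics relax to the adjoint solution
```

The relaxation route mirrors how the operational dynamics (3.4) themselves would be run on neuromorphic hardware; with B chosen small enough, A* has eigenvalues with positive real part and the flow is stable.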
\n\nAs an implementation example, let us conclude by deriving the learning equations for the synaptic strengths T_ij. Recall that \n\ndE/dT_ij = ∂E/∂T_ij + Σ_k < ^k v, ^k s_μ > ,  μ = (i, j)  (3.16) \n\nWe differentiate the steady state equations (3.8) with respect to T_ij, to obtain the forward source term. [...] Integrating the resulting gradient-descent flow d|∇E|/dt = -λ |∇E|^{1-β} gives a relaxation time \n\nt = |∇E(0)|^β / (λβ)  (4.3) \n\nThus, for β ≤ 0 the relaxation time is infinite, while for β > 0 it is finite. The dynamical system (3.19) suffers a qualitative change for β > 0: it loses uniqueness of solution. The equilibrium point |∇E| = 0 becomes a singular solution, being intersected by all the transients, and the Lipschitz condition is violated, as one can see from \n\nd( d|∇E|/dt ) / d|∇E| = -λ(1-β) |∇E|^{-β} → -∞  (4.4) \n\nas |∇E| tends to zero, while β is strictly positive. Such infinitely stable points are \"terminal attractors\". By analogy with our previous results we choose β = 2/3, which yields \n\nτ = [ Σ_n Σ_m [∇_T E]^2_nm + Σ_n [∇_γ E]^2_n + Σ_n [∇_κ E]^2_n ]^{-1/3}  (4.5) \n\nThe introduction of these adaptive time-scales dramatically improves the convergence of the corresponding learning dynamical systems. \n\n5 SIMULATIONS \n\nThe computational framework developed in the preceding section has been applied to a number of problems that involve learning nonlinear mappings, including Exclusive-OR, the hyperbolic tangent and trigonometric functions, e.g., sin. Some of these mappings (e.g., XOR) have been extensively benchmarked in the literature, and provide an adequate basis for illustrating the computational efficacy of our proposed formulation. Figures 1(a)-1(d) demonstrate the temporal profile of various network elements during learning of the XOR function. A six-neuron feedforward network was used, which included self-feedback on the output unit and bias. Fig. 1(a) shows the LMS error during the training phase. The worst-case convergence of the output state neuron to the presented attractor is displayed in Fig. 1(b). Notice the rapid convergence of the input state due to the terminal attractor effect. The behavior of the adaptive time-scale parameter τ is depicted in Fig. 1(c). Finally, Fig. 1(d) shows the evolution of the energy gradient components. \n\nThe test setup for signal processing applications, i.e., learning the sin function and the tanh sigmoidal nonlinearity, included an 8-neuron fully connected network with no bias. In each case the network was trained using as few as 4 randomly sampled training points. Efficacy of recall was determined by presenting 100 random samples. Figs. 2 and 3(b) illustrate that we were able to approximate the sin and the hyperbolic tangent functions using 16 and 4 pairs respectively. Fig. 3(a) demonstrates the network performance when 4 pairs were used to learn the hyperbolic tangent. \n\nWe would like to mention that since our learning methodology involves terminal attractors, extreme caution must be exercised when simulating the algorithms in a digital computing environment. Our discussion of the sensitivity of results to the integration schemes (Barhen, Zak and Gulati, 1989) emphasizes that explicit methods such as Euler or Runge-Kutta should not be used, since the presence of terminal attractors induces extreme stiffness. Practically, this would require an integration time-step of infinitesimal size, resulting in numerical round-off errors of unacceptable magnitude. Implicit integration techniques such as the Kaps-Rentrop scheme should therefore be used. \n\n6 CONCLUSIONS \n\nIn this paper we have presented a theoretical framework for faster learning in dynamical neural networks. 
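The stiffness warning above can be illustrated on the scalar terminal-attractor prototype x' = -sign(x)|x|^(1/3), whose exact solution from x(0) = 1 reaches zero at the finite time t = 3/2. Forward Euler, by contrast, stalls in a bounded oscillation of amplitude roughly (h/2)^(3/2): shrinking the step h shrinks the plateau but never removes it. A small illustrative sketch (step sizes and run lengths are arbitrary choices):

```python
import math

def euler_terminal(x0, h, steps, lam=1.0):
    # Forward Euler on the terminal-attractor prototype x' = -lam * sign(x) * |x|**(1/3)
    x = x0
    for _ in range(steps):
        x -= h * lam * math.copysign(abs(x) ** (1.0 / 3.0), x)
    return x

# Sample 10 consecutive iterates late in each run to expose the residual oscillation
coarse = [euler_terminal(1.0, 0.2,   500   + k) for k in range(10)]
fine   = [euler_terminal(1.0, 0.002, 20000 + k) for k in range(10)]

# Coarse run keeps oscillating near amplitude (0.1)**1.5 ~ 3e-2; the fine run's
# plateau is (0.001)**1.5 ~ 3e-5 -- smaller, but still never exactly zero.
print(max(abs(x) for x in coarse), max(abs(x) for x in fine))
```

This is the sense in which explicit schemes "require an integration time-step of infinitesimal size" near a terminal attractor, motivating the implicit (e.g., Kaps-Rentrop) schemes recommended above.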
Central to our approach is the concept of adjoint operators, which enables computation of network neuromorphic energy gradients with respect to all system parameters using the solution of a single set of linear equations. If C_F and C_A denote the computational costs associated with solving the forward and adjoint sensitivity equations (Eqs. 2.5 and 2.6), and if M denotes the number of parameters of interest in the network, the speedup achieved is \n\nSpeedup = M C_F / C_A \n\nIf we assume that C_F ≈ C_A and that M = N^2 + 2N + ..., we see that the lower bound on speedup per learning iteration is O(N^2). Finally, particular care must be exercised when integrating the dynamical systems of interest, due to the extreme stiffness introduced by the terminal attractor constructs. \n\nAcknowledgements \n\nThe research described in this paper was performed by the Center for Space Microelectronics Technology, Jet Propulsion Laboratory, California Institute of Technology, and was sponsored by agencies of the U.S. Department of Defense, and by the Office of Basic Energy Sciences of the U.S. Department of Energy, through interagency agreements with NASA. \n\nReferences \n\nR.G. Alsmiller, J. Barhen and J. Horwedel. (1984) \"The Application of Adjoint Sensitivity Theory to a Liquid Fuels Supply Model\", Energy, 9(3), 239-253. \n\nJ. Barhen, D.G. Cacuci and J.J. Wagschal. (1982) \"Uncertainty Analysis of Time-Dependent Nonlinear Systems\", Nucl. Sci. Eng., 81, 23-44. \n\nJ. Barhen, S. Gulati and M. Zak. (1989) \"Neural Learning of Constrained Nonlinear Transformations\", IEEE Computer, 22(6), 67-76. \n\nJ. Barhen, M. Zak and S. Gulati. (1989) \"Fast Neural Learning Algorithms Using Networks with Non-Lipschitzian Dynamics\", in Proc. Neuro-Nimes '89, 55-68, EC2, Nanterre, France. \n\nD.G. Cacuci, C.F. Weber, E.M. Oblow and J.H. Marable. 
(1980) \"Sensitivity Theory for General Systems of Nonlinear Equations\", Nucl. Sci. Eng., 75, 88-110. \n\nE.M. Oblow. (1977) \"Sensitivity Theory for General Non-Linear Algebraic Equations with Constraints\", ORNL/TM-5815, Oak Ridge National Laboratory. \n\nB.A. Pearlmutter. (1989) \"Learning State Space Trajectories in Recurrent Neural Networks\", Neural Computation, 1(3), 263-269. \n\nF.J. Pineda. (1988) \"Dynamics and Architecture in Neural Computation\", Journal of Complexity, 4, 216-245. \n\nD.E. Rumelhart and J.L. McClelland. (1986) Parallel Distributed Processing, MIT Press, Cambridge, MA. \n\nN. Toomarian, E. Wacholder and S. Kaizerman. (1987) \"Sensitivity Analysis of Two-Phase Flow Problems\", Nucl. Sci. Eng., 99(1), 53-81. \n\nR.J. Williams and D. Zipser. (1989) \"A Learning Algorithm for Continually Running Fully Recurrent Neural Networks\", Neural Computation, 1(3), 270-280. \n\nM. Zak. (1989) \"Terminal Attractors\", Neural Networks, 2(4), 259-274. \n\nFigure 1(a)-(d). Learning the Exclusive-OR function using a 6-neuron (including bias) feedforward dynamical network with self-feedback on the output unit. \n\nFigure 2. Learning the Sin function using a fully connected, 8-neuron network with no bias. The training set comprised 4 points that were randomly selected. 
\n\nFigure 3. Learning the Hyperbolic Tangent function using a fully connected, 8-neuron network with no bias. (a) using 4 randomly selected training samples; (b) using 16 randomly selected training samples. \n", "award": [], "sourceid": 262, "authors": [{"given_name": "Jacob", "family_name": "Barhen", "institution": null}, {"given_name": "Nikzad", "family_name": "Toomarian", "institution": null}, {"given_name": "Sandeep", "family_name": "Gulati", "institution": null}]}