{"title": "Generalization of Back propagation to Recurrent and Higher Order Neural Networks", "book": "Neural Information Processing Systems", "page_first": 602, "page_last": 611, "abstract": null, "full_text": "602 \n\nGENERALIZATION OF BACKPROPAGATION \n\nTO \n\nRECURRENT AND HIGHER ORDER NEURAL NETWORKS \n\nFernando J.  Pineda \n\nApplied Physics Laboratory, Johns Hopkins University \n\nJohns Hopkins Rd., Laurel MD 20707 \n\nAbstract \n\nA general method for deriving backpropagation algorithms for networks \n\nwith recurrent and higher order networks is introduced.  The propagation of activation \nin these networks is determined by dissipative differential equations.  The error signal \nis backpropagated by integrating an associated differential equation.  The method is \nintroduced by applying it to the recurrent generalization of the feedforward \nbackpropagation network.  The method is extended to the case of higher order \nnetworks and to a constrained dynamical system for training a content addressable \nmemory.  The essential feature of the adaptive algorithms is that adaptive equation has \na simple outer product form. \n\nPreliminary experiments suggest that learning can occur very rapidly in \n\nnetworks with recurrent connections.  The continuous formalism makes the new \napproach more suitable for implementation in VLSI. \n\nIntroduction \n\nOne interesting class of neural networks, typified by the Hopfield neural \n\nnetworks (1,2)  or the networks studied by Amari(3,4) are dynamical systems with three \nsalient properties.  First, they posses very many degrees of freedom, second their \ndynamics are nonlinear and third, their dynamics are dissipative.  Systems with these \nproperties can have complicated attractor structures and can exhibit computational \nabilities. \n\nThe identification of attractors with computational objects, e.g.  memories at d \nrules,  is one of the foundations of the neural network paradigm.  
In this paradigm, programming becomes an exercise in manipulating attractors. A learning algorithm is a rule or dynamical equation which changes the locations of fixed points to encode information. One way of doing this is to minimize, by gradient descent, some function of the system parameters. This general approach is reviewed by Amari(4) and forms the basis of many learning algorithms. The formalism described here is a specific case of this general approach.

The purpose of this paper is to introduce a formalism for obtaining adaptive dynamical systems which are based on backpropagation(5,6,7). These dynamical systems are expressed as systems of coupled first order differential equations. The formalism will be illustrated by deriving adaptive equations for a recurrent network with first order neurons, a recurrent network with higher order neurons and finally a recurrent first order associative memory.

Example 1: Recurrent backpropagation with first order units

Consider a dynamical system whose state vector x evolves according to the following set of coupled differential equations

© American Institute of Physics 1988

dx_i/dt = -x_i + g_i(Σ_j w_ij x_j) + I_i    (1)

where i = 1,...,N. The functions g_i are assumed to be differentiable and may have different forms for various populations of neurons. In this paper we shall make no other requirements on g_i. In the neural network literature it is common to take these functions to be sigmoid shaped. A commonly used form is the logistic function,

g(ξ) = (1 + e^{-ξ})^{-1} .    (2)

This form is biologically motivated since it attempts to account for the refractory phase of real neurons. However, it is important to stress that there is nothing in the mathematical content of this paper which requires this form -- any differentiable function will suffice in the formalism presented in this paper.
For example, a choice which may be of use in signal processing is sin(ξ).

A necessary condition for the learning algorithms discussed here to exist is that the system possesses stable isolated attractors, i.e. fixed points. The attractor structure of (1) is the same as that of the more commonly used equation

du_i/dt = -u_i + Σ_j w_ij g(u_j) + K_i    (3)

because (1) and (3) are related by a simple linear transformation. Therefore results concerning the stability of (3) are applicable to (1). Amari(3) studied the dynamics of equation (3) in networks with random connections. He found that collective variables corresponding to the mean activation and its second moment must exhibit either stable or bistable behaviour. More recently, Hopfield(2) has shown how to construct content addressable memories from symmetrically connected networks with this same dynamical equation. The symmetric connections in the network guarantee global stability. The solution of equation (1) is also globally asymptotically stable if w can be transformed into a lower triangular matrix by row and column exchange operations, because in such a case the network is simply a feedforward network and the output can be expressed as an explicit function of the input. No Liapunov function exists for arbitrary weights, as can be demonstrated by constructing a set of weights which leads to oscillation. In practice, it is found that oscillations are not a problem and that the system converges to fixed points unless special weights are chosen. Therefore it shall be assumed, for the purposes of deriving the backpropagation equations, that the system ultimately settles down to a fixed point.

Consider a system of N neurons, or units, whose dynamics is determined by equation (1). Of all the units in the network we will arbitrarily define some subset of them (A) as input units and some other subset of them (Ω) as output units.
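The settling of equation (1) to a fixed point is easy to sketch numerically. The following is a minimal illustration, not from the paper: forward-Euler integration with the logistic nonlinearity; the network size, weight scale and random seed are arbitrary choices.

```python
import numpy as np

def g(u):
    # logistic nonlinearity, one common choice for the paper's g_i
    return 1.0 / (1.0 + np.exp(-u))

def relax(w, I, x0, dt=0.1, steps=4000):
    """Forward-Euler integration of eq. (1): dx/dt = -x + g(w @ x) + I."""
    x = x0.copy()
    for _ in range(steps):
        x = x + dt * (-x + g(w @ x) + I)
    return x

rng = np.random.default_rng(0)
N = 5
w = 0.5 * rng.standard_normal((N, N))   # small random weights -> stable settling
I = 0.5 * rng.standard_normal(N)
x_inf = relax(w, I, np.full(N, 0.5))

# at a fixed point, x_inf = g(w @ x_inf) + I must hold
residual = np.max(np.abs(x_inf - (g(w @ x_inf) + I)))
```

With small weights the -x term dominates the linearized dynamics, so the trajectory contracts onto the fixed point and the residual is driven to machine precision.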
Units which are neither members of A nor Ω are denoted hidden units. A unit may be simultaneously an input unit and an output unit. The external environment influences the system through the source term, I. If a unit is an input unit, the corresponding component of I is nonzero. To make this more precise it is useful to introduce a notational convention. Suppose that Φ represents some subset of units in the network; then the function Θ_iΦ is defined by

Θ_iΦ = 1 if the i-th unit is a member of Φ, 0 otherwise.    (4)

In terms of this function, the components of the I vector are given by

I_i = ξ_i Θ_iA    (5)

where ξ_i is determined by the external environment.

Our goal will be to find a local algorithm which adjusts the weight matrix w so that a given initial state x⁰ = x(t₀) and a given input I result in a fixed point, x∞ = x(t∞), whose components have a desired set of values T_i along the output units. This will be accomplished by minimizing a function E which measures the distance between the desired fixed point and the actual fixed point, i.e.,

E = (1/2) Σ_{i=1}^N J_i²    (6)

where

J_i = (T_i - x∞_i) Θ_iΩ .    (7)

E depends on the weight matrix w through the fixed point x∞(w). A learning algorithm drives the fixed points towards the manifolds which satisfy x∞_i = T_i on the output units. One way of accomplishing this with dynamics is to let the system evolve in the weight space along trajectories which are antiparallel to the gradient of E. In other words,

dw_ij/dt = -η ∂E/∂w_ij    (8)

where η is a numerical constant which defines the (slow) time scale on which w changes. η must be small so that x is always essentially at steady state, i.e. x(t) ≈ x∞.
It is important to stress that the choice of gradient descent for the learning dynamics is by no means unique, nor is it necessarily the best choice. Other learning dynamics which employ second order time derivatives (e.g. the momentum method(5)) or which employ second order space derivatives (e.g. second order backpropagation(8)) may be more useful in particular applications. However, equation (8) does have the virtue of being the simplest dynamics which minimizes E.

On performing the differentiations in equation (8), one immediately obtains

dw_rs/dt = η Σ_k J_k ∂x∞_k/∂w_rs .    (9)

The derivative of x∞_k with respect to w_rs is obtained by first noting that the fixed points of equation (1) satisfy the nonlinear algebraic equation

x∞_i = g_i(Σ_j w_ij x∞_j) + I_i ,    (10)

differentiating both sides of this equation with respect to w_rs and finally solving for ∂x∞_k/∂w_rs. The result is

∂x∞_k/∂w_rs = (L⁻¹)_kr g_r′(u_r) x∞_s    (11)

where g_r′ is the derivative of g_r and where the matrix L is given by

L_ij = δ_ij − g_i′(u_i) w_ij .    (12)

δ_ij is the Kronecker δ function (δ_ij = 1 if i = j, otherwise δ_ij = 0). On substituting (11) into (9) one obtains the remarkably simple form

dw_rs/dt = η y_r x∞_s    (13)

where

y_r = g_r′(u_r) Σ_k J_k (L⁻¹)_kr .    (14)

Equations (13) and (14) specify a formal learning rule. Unfortunately, equation (14) requires a matrix inversion to calculate the error signals y_k. Direct matrix inversions are necessarily nonlocal calculations and therefore this learning algorithm is not suitable for implementation as a neural network. Fortunately, a local method for calculating y_r can be obtained by the introduction of an associated dynamical system. To obtain this dynamical system, first rewrite equation (14) as

Σ_r L_rk (y_r / g_r′(u_r)) = J_k .
    (15)

Then multiply both sides by g_k′(u_k), substitute the explicit form for L and finally sum over r. The result is

0 = − y_k + g_k′(u_k){ Σ_r w_rk y_r + J_k } .    (16)

One now makes the observation that the solutions of this linear equation are the fixed points of the dynamical system given by

dy_k/dt = − y_k + g_k′(u_k){ Σ_r w_rk y_r + J_k } .    (17)

This last step is not unique; equation (16) could be transformed in various ways, leading to related differential equations, cf. Pineda(9). It is not difficult to show that the first order finite difference approximation (with a time step Δt = 1) of equations (1), (13) and (17) has the same form as the conventional backpropagation algorithm.

Equations (1), (13) and (17) completely specify the dynamics for an adaptive neural network, provided that (1) and (17) converge to stable fixed points and provided that both quantities on the right hand side of equation (13) are the steady state solutions of (1) and (17).

It was pointed out by Almeida(10) that the local stability of (1) is a sufficient condition for the local stability of (17). To prove this it suffices to linearize equation (1) about a stable fixed point. The resulting linearized equation depends on the same matrix L whose transpose appears in the derivation of equation (17), cf. equation (15). But L and L^T have the same eigenvalues; hence it follows that the fixed points of (17) must also be locally stable if the fixed points of (1) are locally stable.

Learning multiple associations

It is important to stress that up to this point the entire discussion has assumed that I and T are constant in time; thus no mechanism has been obtained for learning multiple input/output associations. Two methods for training the network to learn multiple associations are now discussed. These methods lead to qualitatively different learning behaviour.
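The equivalence between the nonlocal matrix-inversion form of the error signal, eq. (14), and the fixed point of the local dynamics, eq. (17), can be checked numerically. The sketch below is illustrative only: the error vector J, the weight scale and the logistic g are arbitrary stand-ins, not values from the paper.

```python
import numpy as np

def g(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(1)
N = 5
w = 0.4 * rng.standard_normal((N, N))
I = 0.5 * rng.standard_normal(N)
J = rng.standard_normal(N)                 # stand-in for the error vector of eq. (7)

# relax eq. (1) to its fixed point x_inf
x = np.full(N, 0.5)
for _ in range(4000):
    x = x + 0.1 * (-x + g(w @ x) + I)
u = w @ x
gp = g(u) * (1.0 - g(u))                   # g'(u) for the logistic

# nonlocal route, eq. (14): y = diag(g') L^{-T} J with L_ij = delta_ij - g_i'(u_i) w_ij
L = np.eye(N) - gp[:, None] * w
y_direct = gp * np.linalg.solve(L.T, J)

# local route: integrate the associated system, eq. (17)
y = np.zeros(N)
for _ in range(4000):
    y = y + 0.1 * (-y + gp * (w.T @ y + J))
```

The two routes agree at steady state, which is exactly the content of eqs. (15)-(17): the adjoint dynamics computes L^{-T} without ever forming a matrix inverse.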
Suppose that each input/output pair is labeled by a pattern label α, i.e. {I^α, T^α}. Then the energy function which is minimized in the above discussion must also depend on this label, since it is an implicit function of the I^α, T^α pairs. In order to learn multiple input/output associations it is necessary to minimize all the E[α] simultaneously. In other words, the function to minimize is

E_total = Σ_α E[α]    (18)

where the sum is over all input/output associations. From (18) it follows that the gradient of E_total is simply the sum of the gradients for each association; hence the corresponding gradient descent equation has the form

dw_ij/dt = η Σ_α y∞_i[α] x∞_j[α] .    (19)

In numerical simulations, each time step of (19) requires relaxing (1) and (17) for each pattern and accumulating the gradient over all the patterns. This form of the algorithm is deterministic and is guaranteed to converge because, by construction, E_total is a Liapunov function for equation (19). However, the system may get stuck in a local minimum. This method is similar to the master/slave approach of Lapedes and Farber(11). Their adaptive equation, which plays the same role as equation (19), also has a gradient form, although it is not strictly descent along the gradient. For a randomly or fully connected network it can be shown that the number of operations required per weight update in the master/slave formalism is proportional to N, where N is the number of units. This is because there are O(N²) update equations and each equation requires O(N) operations (assuming some precomputation). On the other hand, in the backpropagation formalism each update equation requires only O(1) operations because of its trivial outer product form. Also, O(N²) operations are required to precompute x∞ and y∞. The result is that a full weight update requires only O(N²) operations.
It is not possible to conclude from this argument that one or the other approach will be more efficient in a particular application, because there are other factors to consider, such as the number of patterns and the number of time steps required for x and y to converge. A detailed comparison of the two methods is in preparation.

A second approach to learning multiple patterns is to use (13) and to change the patterns randomly on each time step. The system therefore receives a sequence of random impulses, each of which attempts to minimize E[α] for a single pattern. One can then define L(w) to be the mean E[α] (averaged over the distribution of patterns),

L(w) = <E[w, I^α, T^α]> .    (20)

Amari(4) has pointed out that if the sequence of random patterns is stationary and if L(w) has a unique minimum, then the theory of stochastic approximation guarantees that the solution w(t) of (13) will converge to the minimum point w_min of L(w) to within a small fluctuating term which vanishes as η tends to zero. Evidently η is analogous to the temperature parameter in simulated annealing. This second approach generally converges more slowly than the first, but it will ultimately converge (in a statistical sense) to the global minimum.

In principle the fixed points, to which the solutions of (1) and (17) eventually converge, depend on the initial states. Indeed, Amari's(3) results imply that equation (1) is bistable for certain choices of weights. Therefore the presentation of multiple patterns might seem problematical, since in both approaches the final state of the previous pattern becomes the initial state of the new pattern. The safest approach is to reinitialize the network to the same initial state each time a new pattern is presented, e.g. x_i(t₀) = 0.5 for all i. In practice the system learns robustly even if the initial conditions are chosen randomly.
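The deterministic method of eq. (19) can be sketched as a short training loop: relax the forward and error dynamics for every pattern, accumulate the outer-product gradients, and take one step in weight space. Everything below is illustrative, not from the paper: the network size, masks, patterns and learning rate are arbitrary, and the logistic g is assumed.

```python
import numpy as np

def g(u):
    return 1.0 / (1.0 + np.exp(-u))

def settle(step, v, dt=0.1, n=1500):
    # crude Euler relaxation to the steady state of dv/dt = step(v)
    for _ in range(n):
        v = v + dt * step(v)
    return v

def epoch(w, patterns, out_mask, eta):
    """One time step of eq. (19): sum the outer-product gradients over patterns."""
    dw = np.zeros_like(w)
    err = 0.0
    for I, T in patterns:
        x = settle(lambda x: -x + g(w @ x) + I, np.full(w.shape[0], 0.5))
        J = (T - x) * out_mask                      # eq. (7)
        gp = g(w @ x) * (1.0 - g(w @ x))            # g'(u) at the fixed point
        y = settle(lambda y: -y + gp * (w.T @ y + J), np.zeros_like(x))  # eq. (17)
        dw += np.outer(y, x)                        # eq. (13)
        err += 0.5 * np.sum(J * J)                  # eq. (6)
    return w + eta * dw, err

rng = np.random.default_rng(3)
N = 6
in_mask = np.array([1, 1, 1, 1, 0, 0], float)       # units 0-3 are inputs
out_mask = np.array([0, 0, 0, 0, 1, 1], float)      # units 4-5 are outputs
patterns = [(rng.uniform(-1, 1, N) * in_mask,       # I nonzero only on inputs, eq. (5)
             rng.uniform(0.1, 0.9, N) * out_mask) for _ in range(3)]
w = 0.1 * rng.standard_normal((N, N))

errors = []
for _ in range(40):
    w, e = epoch(w, patterns, out_mask, eta=1.0)
    errors.append(e)
```

Each pattern is relaxed from the same initial state x_i(t₀) = 0.5, following the reinitialization advice above; E_total decreases over epochs since the update is (discretized) descent along the summed gradient.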
Example 2: Recurrent higher order networks

It is straightforward to apply the technique of the previous section to a dynamical system with higher order units. Higher order systems have been studied by Sejnowski(12) and Lee et al.(13). Higher order networks may have definite advantages over networks with first order units alone. A detailed discussion of the backpropagation formalism applied to higher order networks is beyond the scope of this paper. Instead, the adaptive equations for a network with purely n-th order units will be presented as an example of the formalism. To this end consider a dynamical system of the form

dx_i/dt = -x_i + g_i(u_i) + I_i    (21)

where

u_i = Σ_j ··· Σ_l w^(n)_{ij···l} f(x_j) ··· f(x_l)    (22)

and where there are n+1 indices and the summations are over all indices except i. The superscript on the weight tensor indicates the order of the correlation. Note that an additional nonlinear function f has been added to illustrate a further generalization. Both f and g must be differentiable and may be chosen to be sigmoids. It is not difficult, although somewhat tedious, to repeat the steps of the previous example to derive the adaptive equations for this system. The objective function in this case is the same as was used in the first example, i.e. equation (6). The n-th order gradient descent equation has the form

dw^(n)_{rs···t}/dt = η y^(n)∞_r f(x∞_s) ··· f(x∞_t) .    (23)

Equation (23) illustrates the major feature of backpropagation which distinguishes it from other gradient descent algorithms or similar algorithms which make use of a gradient, namely, that the gradient of the objective function has a very trivial outer product form. y^(n)∞ is the steady state solution of

dy^(n)_k/dt = − y^(n)_k + g_k′(u_k){ f_k′(x_k) Σ_r v^(n)_rk y^(n)_r + J_k } .
    (24)

The matrix v^(n) plays the role of w in the previous example; however, v^(n) now depends on the state of the network according to

v^(n)_ij = Σ_k ··· Σ_l s^(n)_{ijk···l} f(x_k) ··· f(x_l)    (25)

where s^(n) is a tensor which is symmetric with respect to the exchange of the second index and all the indices to the right, i.e.

s^(n)_{ijk···l} = w^(n)_{ijk···l} + w^(n)_{ikj···l} + ··· + w^(n)_{ijl···k} .    (26)

Finally, it should be noted that: 1) if the polynomial u_i is not homogeneous, the adaptive equations are more complicated and involve cross terms between the various orders, and that: 2) the local stability of the n-th order backpropagation equations now depends on the eigenvalues of the matrix

L_ij = δ_ij − g_i′(u_i) f_j′(x_j) v^(n)_ij .    (27)

As before, if the forward propagation converges, so will the backward propagation.

Example 3: Adaptive content addressable memory

In this section the adaptive equations for a content addressable memory (CAM) are derived as a final illustration of the generality of the formalism. Perhaps the best known (and best studied) examples of dynamical systems which exhibit CAM behaviour are the systems discussed by Hopfield(1). Hopfield used a nonadaptive method for programming the symmetric weight matrix. More recently, Lapedes and Farber(11) have demonstrated how to construct a master dynamical system which can be used to train the weights of a slave system which has the Hopfield form. This slave system then performs the CAM operation. The resulting weights are not symmetric.

The learning procedure presented in this section is most closely related to the method of Lapedes and Farber in that a master network is used to adjust the weights of a slave network.
In contrast to the aforementioned formalism, which requires a very large associated weight matrix for the master network, both the master and slave networks of the following approach make use of the same weight matrix. The CAM under consideration is based on equation (1). However, the interpretation of the dynamics will be somewhat different from the first section. The main difference is that the dynamics in the learning phase is constrained. The constrained dynamical system is denoted the master network. The unconstrained system is denoted the slave network. The units in the network are divided into only two sets: the set of visible units (V) and the set of internal or hidden units (H). There will be no distinction made between input and output units. Thus, I will generally be zero unless an input bias is needed in some application.

The dynamical system will be used as an autoassociative memory; thus the memory recall is performed by starting the network at a particular initial state which represents partial information about a stored memory. More precisely, suppose that there exists a subset K of the visible units whose states are known to have values T_i. Then the initial state of the network is

x_i(t₀) = T_i Θ_iK + b_i (1 − Θ_iK)    (28)

where the b_i are arbitrary. The CAM relaxes to the previously stored memory whose basin of attraction contains this partial state.

Memories are stored by a master network whose topology is exactly the same as the slave network, but whose dynamics is somewhat modified. The state vector z of the master network evolves according to the equation

dz_i/dt = −z_i + g_i(Σ_{k=1}^N w_ik Z_k) + I_i    (29)

where Z is defined by

Z_i = T_i Θ_iV + z_i Θ_iH .    (30)

The components of Z along the visible units are just the target values specified by T.
This equation is useful as a master equation because, if the weights can be chosen so that the z_i of the visible units relax to the target values T_i, then a fixed point of (29) is also a fixed point of (1). It can be concluded, therefore, that by training the weights of the master network one is also training the weights of the slave network. Note that the form of Z implies that equation (29) can be rewritten as

dz_i/dt = −z_i + g_i(Σ_{k∈H} w_ik z_k − θ_i) + I_i    (31)

where

θ_i = − Σ_{k∈V} w_ik T_k .    (32)

From equations (31) and (32) it is clear that the dynamics of the master system is driven by the thresholds, which depend on the targets.

To derive the adaptive equations consider the objective function

E_master = (1/2) Σ_{i=1}^N J_i²    (33)

where

J_i = (T_i − z∞_i) Θ_iV .    (34)

It is straightforward to apply the steps discussed in previous sections to E_master. This results in adaptive equations for the weights. The mathematical details will be omitted since they are essentially the same as before; the gradient descent equation is

dw_ij/dt = η y∞_i Z∞_j    (35)

where y∞ is the steady state solution of

dy_k/dt = − y_k + g_k′(v_k){ Θ_kH Σ_r w_rk y_r + J_k }    (36)

where

v_i = Σ_k w_ik Z∞_k .    (37)

Equations (31) and (35)-(37) define the dynamics of the master network. To train the slave network to be an autoassociative memory it is necessary to use the stored memories as the initial states of the master network, i.e.

z_i(t₀) = T_i Θ_iV + b_i Θ_iH    (39)

where b_i is an arbitrary value as before. The previous discussions concerning the stability of the three equations (1), (13) and (17) apply to equations (31), (35) and (36) as well. It is also possible to derive the adaptive equations for a higher order associative network, but this will not be done here.
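The master dynamics of eqs. (29)-(37) amounts to relaxing the network with the visible components of the coupling clamped to the targets, and propagating the error only through the hidden units. The sketch below is a minimal, hypothetical illustration of one training pass; the network sizes, memories and learning rate are invented, and the logistic g is assumed.

```python
import numpy as np

def g(u):
    return 1.0 / (1.0 + np.exp(-u))

def settle(step, v, dt=0.1, n=1500):
    for _ in range(n):
        v = v + dt * step(v)
    return v

n_vis, n_hid = 4, 3
N = n_vis + n_hid
vis = np.arange(N) < n_vis                          # Θ_iV as a boolean mask
rng = np.random.default_rng(4)
memories = [rng.uniform(0.1, 0.9, n_vis) for _ in range(2)]
w = 0.1 * rng.standard_normal((N, N))

def clamp(z, T):
    # eq. (30): Z agrees with the targets on V and with z on H
    Z = z.copy()
    Z[:n_vis] = T
    return Z

def master_epoch(w, eta=1.0):
    dw = np.zeros_like(w)
    err = 0.0
    for T in memories:
        T_full = np.concatenate([T, np.zeros(n_hid)])
        z = settle(lambda z: -z + g(w @ clamp(z, T)),
                   np.concatenate([T, np.full(n_hid, 0.5)]))   # eq. (39) start
        Z = clamp(z, T)
        J = np.where(vis, T_full - z, 0.0)          # eq. (34)
        v = w @ Z                                   # eq. (37)
        gp = g(v) * (1.0 - g(v))
        y = settle(lambda y: -y + gp * (~vis * (w.T @ y) + J),
                   np.zeros(N))                     # eq. (36)
        dw += np.outer(y, Z)                        # eq. (35)
        err += 0.5 * np.sum(J * J)                  # eq. (33)
    return w + eta * dw, err

errs = []
for _ in range(40):
    w, e = master_epoch(w)
    errs.append(e)
```

Because the weights are shared, every step of E_master descent for the master network simultaneously shapes the fixed points of the slave network, eq. (1).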
Only preliminary computer simulations have been performed with this algorithm to verify its validity, but more extensive experiments are in progress. The first simulation was with a fully connected network with 10 visible units and 5 hidden units. The training set consisted of four random binary vectors, with the magnitudes of the vectors adjusted so that 0.1 ≤ T_i ≤ 0.9. The equations were approximated by first order finite difference equations with Δt = 1 and η = 1. The training was performed with the deterministic method for learning multiple associations. Figure 1 shows E_total as a function of the number of updates for both the master and slave networks. E_total for the slave exhibits discontinuous behaviour because the trajectory through the weight space causes x(t₀) to cut across the basins of attraction for the fixed points of equation (1).

The number of updates required for the network to learn the patterns is relatively modest and can be reduced further by increasing η. This suggests that learning can occur very rapidly in this type of network.

Discussion

The algorithms presented here by no means exhaust the class of possible adaptive algorithms which can be obtained with this formalism. Nor is the choice of gradient descent a crucial feature in this formalism. The key idea is that it is possible to express the gradient of an objective function as the outer product of vectors which can be calculated by dynamical systems. This outer product form is also responsible for the fact that the gradient can be calculated with only O(N²) operations in a fully connected or randomly connected network. In fact, the number of operations per weight update is proportional to the number of connections in the network. The methods used here will generalize to calculate higher order derivatives of the objective function as well.
The fact that the algorithms are expressed as differential equations suggests that they may be implemented in analog electronic or optical hardware.

Figure 1. E_total as a function of the number of updates, for the master and slave networks.

References

(1) J. J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proc. Nat. Acad. Sci. USA 79, 2554-2558 (1982)
(2) J. J. Hopfield, Neurons with graded response have collective computational properties like those of two-state neurons, Proc. Nat. Acad. Sci. USA 81, 3088-3092 (1984)
(3) Shun-Ichi Amari, IEEE Trans. on Systems, Man and Cybernetics 2, 643-657 (1972)
(4) Shun-Ichi Amari, in Systems Neuroscience, ed. Jacqueline Metzler, Academic Press (1977)
(5) D. E. Rumelhart, G. E. Hinton and R. J. Williams, in Parallel Distributed Processing, edited by D. E. Rumelhart and J. L. McClelland, M.I.T. Press (1986)
(6) David B. Parker, Learning-Logic, Invention Report S81-64, File 1, Office of Technology Licensing, Stanford University, October 1982
(7) Y. LeCun, Proceedings of Cognitiva 85, p. 599 (1985)
(8) David B. Parker, Second Order Backpropagation: Implementing an Optimal O(n) Approximation to Newton's Method as an Artificial Neural Network, submitted to Computer (1987)
(9) Fernando J. Pineda, Generalization of backpropagation to recurrent neural networks, Phys. Rev. Lett. 59, 2229-2232 (1987)
(10) Luis B. Almeida, in Proceedings of the IEEE First Annual International Conference on Neural Networks, San Diego, California, June 1987, edited by M. Caudill and C.
Butler (to be published). This is a discrete version of the algorithm presented as the first example.

(11) Alan Lapedes and Robert Farber, A self-optimizing, nonsymmetrical neural net for content addressable memory and pattern recognition, Physica D22, 247-259 (1986); see also, Programming a Massively Parallel, Computation Universal System: Static Behaviour, in Neural Networks for Computing, Snowbird, UT 1986, AIP Conference Proceedings 151 (1986), edited by John S. Denker
(12) Terrence J. Sejnowski, Higher-order Boltzmann Machines, draft preprint obtained from author
(13) Y. C. Lee, Gary Doolen, H. H. Chen, G. Z. Sun, Tom Maxwell, H. Y. Lee and C. Lee Giles, Machine learning using a higher order correlation network, Physica D22, 276-306 (1986)
", "award": [], "sourceid": 67, "authors": [{"given_name": "Fernando", "family_name": "Pineda", "institution": null}]}