{"title": "Green's Function Method for Fast On-Line Learning Algorithm of Recurrent Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 333, "page_last": 340, "abstract": null, "full_text": "Green's Function Method for Fast On-line Learning Algorithm of Recurrent Neural Networks\n\nGuo-Zheng Sun, Hsing-Hen Chen and Yee-Chun Lee\n\nInstitute for Advanced Computer Studies\nand\nLaboratory for Plasma Research,\nUniversity of Maryland\nCollege Park, MD 20742\n\nAbstract\n\nThe two well-known learning algorithms for recurrent neural networks are back-propagation (Rumelhart et al., Werbos) and forward propagation (Williams and Zipser). The main drawback of back-propagation is its off-line backward path in time for error accumulation. This violates the on-line requirement of many practical applications. Although the forward propagation algorithm can be used in an on-line manner, its drawback is the heavy computational load required to update the high-dimensional sensitivity matrix ($O(N^4)$ operations for each time step). Developing a fast forward algorithm is therefore a challenging task. In this paper we propose a forward learning algorithm which is one order faster (only $O(N^3)$ operations for each time step) than the sensitivity matrix algorithm. The basic idea is that, instead of integrating the high-dimensional sensitivity dynamic equation, we solve forward in time for its Green's function to avoid the redundant computations, and then update the weights whenever the error is to be corrected.\n\nA numerical example of classifying state trajectories using a recurrent network is presented. It substantiates that the proposed algorithm is faster than Williams and Zipser's algorithm.\n\nI. Introduction. 
\n\nRecurrent neural networks are often put forward as a useful model for dealing with sequential signals. A particularly pressing issue concerning recurrent networks is the search for an efficient on-line training algorithm. The error back-propagation method (Rumelhart, Hinton, and Williams [1]) was originally proposed to handle feedforward networks. This method can be applied to train recurrent networks if one unfolds the time sequence of mappings into a multilayer feed-forward net, each layer with identical weights. Due to the nature of the backward path, it is basically an off-line method. Pineda [2] generalized it to recurrent networks with hidden neurons. However, he was mostly interested in time-independent fixed-point types of behavior. Pearlmutter [3] proposed a scheme to learn temporal trajectories which involves equations to be solved backward in time. It is essentially a generalized version of error back-propagation applied to the problem of learning a target state trajectory. The viable on-line method to date is the RTRL (Real Time Recurrent Learning) algorithm (Williams and Zipser [4]), which propagates a sensitivity matrix forward in time. The main drawback of this algorithm is its high computational cost. It needs $O(N^4)$ operations per time step. Therefore, a faster on-line algorithm (fewer than $O(N^4)$ operations) appears to be desirable.\n\nToomarian and Barhen [5] proposed an $O(N^2)$ on-line algorithm. They derived the same equations as Pearlmutter's back-propagation using an adjoint-operator approach. They then tried to convert the backward path into a forward path by adding a Delta function to its source term. But this is not correct. The problem is not merely that it \"precludes straightforward numerical implementation\", as they acknowledged later [6]. 
Even in theory, the result is not correct. The mistake lies in their use of a not well-defined equality for the Delta function integration. Briefly speaking, the equality $\\int_{t_0}^{t_f} \\delta(t - t_f) f(t)\\,dt = f(t_f)$ is not right if the function $f(t)$ is discontinuous at $t = t_f$. The value of the left-side integral depends on the distribution of the function $f(t)$ and therefore is not uniquely defined. If we deal with the discontinuity carefully by splitting the time interval from $t_0$ to $t_f$ into two segments, $t_0$ to $t_f - \\epsilon$ and $t_f - \\epsilon$ to $t_f$, and let $\\epsilon \\to 0$, we find that adding a Delta function to the source term does not affect the basic property of the adjoint equation. Namely, it still has to be solved backward in time.\n\nRecently, Toomarian and Barhen [6] modified their adjoint-operator approach and proposed an alternative $O(N^3)$ on-line training algorithm. Although, in nature, their result is very similar to what we present in this paper, it will be seen that our approach is more straightforward and can be easily implemented numerically.\n\nSchmidhuber [7] proposed an $O(N^3)$ algorithm which is a combination of back-propagation (within each data block of size $N$) and forward propagation (between blocks). It is therefore not truly an on-line algorithm.\n\nSun, Chen and Lee [8] studied this problem using a more general approach - the variational approach, in which a constrained optimization problem with Lagrangian multipliers was considered. The dynamic equation of the Lagrangian multiplier was derived, which is exactly the same as the adjoint equation [5]. By taking advantage of the linearity of this equation, an $O(N^3)$ on-line algorithm was derived. However, the numerical implementation of the algorithm, especially the numerical instabilities, was not addressed in that paper.\n\nIn this paper we present a new approach to this problem - the Green's function method. 
The advantages of this method are its simple mathematical formulation and easy numerical implementation. One numerical example of trajectory classification is presented to substantiate the faster speed of the proposed algorithm. The numerical results are benchmarked against Williams and Zipser's algorithm.\n\nII. Green's Function Approach.\n\n(a) Definition of the Problem\n\nConsider a fully recurrent network with neural activity represented by an $N$-dimensional vector $x(t)$. The dynamic equations can be written in general as a set of first-order differential equations:\n\n$$\\dot{x}(t) = F(x(t), w, I(t)) \\tag{1}$$\n\nwhere $w$ is a matrix representing the set of weights and all other adjustable parameters, and $I(t)$ is a vector representing the neuron units clamped by external input signals at time $t$. For a simple network connected by first-order weights the nonlinear function $F$ may look like\n\n$$F(x(t), w, I(t)) = -x(t) + g(w \\cdot x) + I(t) \\tag{2}$$\n\nwhere the scalar function $g(u)$ could be, for instance, the sigmoid function $g(u) = 1/(1 + e^{-u})$. Suppose that part of the state neurons $\\{x_i \\mid i \\in M\\}$ are measurable and part of the neurons $\\{x_i \\mid i \\in H\\}$ are hidden. For the measurable units we may have a desired output $\\hat{x}(t)$. In order to train the network, an objective functional (or an error measure functional) is often given as\n\n$$E(x, \\hat{x}) = \\int_{t_0}^{t_f} e(x(t), \\hat{x}(t))\\,dt \\tag{3}$$\n\nwhere the functional $E$ depends on the weights $w$ implicitly through the measurable neurons $\\{x_i \\mid i \\in M\\}$. A typical error function is\n\n$$e(x(t), \\hat{x}(t)) = (x(t) - \\hat{x}(t))^2 \\tag{4}$$\n\nGradient descent learning modifies the weights according to\n\n$$\\Delta w \\propto -\\frac{\\partial E}{\\partial w} = -\\int_{t_0}^{t_f} \\frac{\\partial e}{\\partial x} \\cdot \\frac{\\partial x}{\\partial w}\\,dt \\tag{5}$$\n\nIn order to evaluate the integral in Eq. (5) one needs to know both $\\partial e/\\partial x$ and $\\partial x/\\partial w$. 
The first term can be easily obtained by taking the derivative of the given error function $e(x(t), \\hat{x}(t))$. For the second term one needs to solve the differential equation\n\n$$\\frac{d}{dt}\\left(\\frac{\\partial x}{\\partial w}\\right) = \\frac{\\partial F}{\\partial x} \\cdot \\frac{\\partial x}{\\partial w} + \\frac{\\partial F}{\\partial w} \\tag{6}$$\n\nwhich is easily derived by taking the derivative of Eq. (1) with respect to $w$. The well-known forward algorithm for recurrent networks [4] solves Eq. (6) forward in time and makes the weight correction at the end ($t = t_f$) of the input sequence. (This algorithm was developed independently by several researchers, but due to the page limitation we cannot cite all related papers and simply call it Williams and Zipser's algorithm.) On-line learning makes a weight correction whenever an error is to be corrected during the input sequence:\n\n$$\\Delta w(t) = -\\eta \\left(\\frac{\\partial e}{\\partial x} \\cdot \\frac{\\partial x}{\\partial w}\\right) \\tag{7}$$\n\nThe proof of convergence of the on-line learning algorithm will be addressed elsewhere.\n\nThe main drawback of this forward algorithm is that it requires $O(N^4)$ operations per time step to update the matrix $\\partial x/\\partial w$. The goal of the Green's function approach is to find an on-line algorithm which requires less computation.\n\n(b) Green's Function Solution\n\nFirst let us analyze the computational complexity of integrating Eq. (6) directly. Rewrite Eq. (6) as\n\n$$L \\cdot \\frac{\\partial x}{\\partial w} = \\frac{\\partial F}{\\partial w} \\tag{8}$$\n\nwhere the linear operator $L$ is defined as $L = \\frac{d}{dt} - \\frac{\\partial F}{\\partial x}$.\n\nTwo types of redundancy can be seen in Eq. (8). First, the operator $L$ does not depend on $w$ explicitly, which means that in solving for $\\partial x/\\partial w$ we repeatedly solve the identical differential equation for each component of $w$. This is redundant. It is especially wasteful when higher-order connection weights are used. 
The second redundancy is in the special form of $\\partial F/\\partial w$ for neural computations where the same activation function (say, the sigmoid function) is used for every neuron, so that\n\n$$\\frac{\\partial F_k}{\\partial w_{ij}} = g'\\left(\\sum_l w_{kl} x_l\\right) \\delta_{ki}\\, x_j \\tag{9}$$\n\nwhere $\\delta_{ki}$ is the Kronecker delta. It is seen from Eq. (9) that among the $N^3$ components of this third-order tensor most of them, $N^2(N-1)$, are zero (when $k \\neq i$) and need not be computed repeatedly. The original forward learning scheme pays no attention to this redundancy.\n\nOur Green's function approach avoids the redundancy by solving for the low-dimensional Green's function. We then construct the solution of Eq. (8) by the dot product of $\\partial F/\\partial w$ with the Green's function, which can in turn be reduced to a scalar product due to Eq. (9).\n\nThe Green's function of the operator $L$ is defined as a dual-time tensor function $G(t - \\tau)$ which satisfies the following equation\n\n$$\\frac{d}{dt} G(t - \\tau) - \\frac{\\partial F}{\\partial x} \\cdot G(t - \\tau) = \\delta(t - \\tau) \\tag{10}$$\n\nIt is well known that, if the solution of Eq. (10) is known, the solution of the original equation Eq. (6) (or (8)) can be constructed using the source term $\\partial F/\\partial w$ through the integral\n\n$$\\frac{\\partial x}{\\partial w}(t) = \\int_{t_0}^{t} \\left(G(t - \\tau) \\cdot \\frac{\\partial F}{\\partial w}(\\tau)\\right) d\\tau \\tag{11}$$\n\nTo find the Green's function solution we first introduce a tensor function $V(t)$ that satisfies the homogeneous form of Eq. (10)\n\n$$\\frac{d}{dt} V(t) - \\frac{\\partial F}{\\partial x} \\cdot V(t) = 0, \\qquad V(t_0) = 1 \\tag{12}$$\n\nThe solution of Eq. (10), i.e. the Green's function, can then be constructed as\n\n$$G(t - \\tau) = V(t) \\cdot V^{-1}(\\tau)\\, H(t - \\tau) \\tag{13}$$\n\nwhere $H(t - \\tau)$ is the Heaviside function defined as\n\n$$H(t - \\tau) = \\begin{cases} 1 & t \\geq \\tau \\\\\\\\ 0 & \\text{otherwise} \\end{cases}$$\n\nUsing the well-known equalities\n\n$$\\frac{d}{dt} H(t - \\tau) = \\delta(t - \\tau)$$\n\nand\n\n$$f(t, \\tau)\\, \\delta(t - \\tau) = f(t, t)\\, \\delta(t - \\tau)$$\n\none can easily verify that the constructed Green's function shown in Eq. 
(13) is correct, that is, it satisfies Eq. (10). Substituting $G(t - \\tau)$ from Eq. (13) into Eq. (11) we obtain the solution of Eq. (6) as\n\n$$\\frac{\\partial x}{\\partial w}(t) = V(t) \\cdot \\int_{t_0}^{t} \\left((V(\\tau))^{-1} \\cdot \\frac{\\partial F}{\\partial w}(\\tau)\\right) d\\tau \\tag{14}$$\n\nWe note that this formal solution not only satisfies Eq. (6) but also satisfies the required initial condition\n\n$$\\frac{\\partial x}{\\partial w}(t_0) = 0. \\tag{15}$$\n\nThe \"on-line\" weight correction at time $t$ is obtained easily from Eq. (5)\n\n$$\\delta w = -\\eta\\, \\frac{\\partial e}{\\partial x} \\cdot \\frac{\\partial x}{\\partial w} = -\\eta \\left(\\frac{\\partial e}{\\partial x} \\cdot V(t) \\int_{t_0}^{t} \\left((V(\\tau))^{-1} \\cdot \\frac{\\partial F}{\\partial w}(\\tau)\\right) d\\tau\\right) \\tag{16}$$\n\n(c) Implementation\n\nTo implement Eq. (16) numerically we introduce two auxiliary memories. First, we define $U(t)$ to be the inverse of the matrix $V(t)$, i.e. $U(t) = V^{-1}(t)$. It is easy to see that the dynamic equation of $U(t)$ is\n\n$$\\frac{d}{dt} U(t) + U(t) \\cdot \\frac{\\partial F}{\\partial x} = 0, \\qquad U(t_0) = 1 \\tag{17}$$\n\nSecondly, we define a third-order tensor $\\Pi_{ijk}$ that satisfies\n\n$$\\frac{d\\Pi}{dt} = U(t) \\cdot \\frac{\\partial F}{\\partial w}, \\qquad \\Pi(t_0) = 0 \\tag{18}$$\n\nThen the weight correction in Eq. (16) becomes\n\n$$\\delta w = -\\eta\\, (v(t) \\cdot \\Pi(t)) \\tag{19}$$\n\nwhere the vector $v(t)$ is the solution of the linear equation\n\n$$v(t) \\cdot U(t) = \\frac{\\partial e}{\\partial x} \\tag{20}$$\n\nIn discrete time, Eqs. (17) - (20) become:\n\n$$U_{ij}(t) = U_{ij}(t-1) - \\Delta t \\sum_k U_{ik}(t-1) \\frac{\\partial F_k}{\\partial x_j}, \\qquad U_{ij}(0) = \\delta_{ij} \\tag{21}$$\n\n$$\\Pi_{ijk}(t) = \\Pi_{ijk}(t-1) + (\\Delta t)\\, U_{ij}(t-1) \\frac{\\partial F_j}{\\partial w_{jk}}, \\qquad \\Pi_{ijk}(0) = 0 \\tag{22}$$\n\n$$\\sum_i v_i(t)\\, U_{ij}(t) = \\frac{\\partial e}{\\partial x_j} \\tag{23}$$\n\n$$\\delta w_{ij} = -\\eta \\left(\\sum_k v_k(t)\\, \\Pi_{kij}(t)\\right) \\tag{24}$$\n\nTo summarize the procedure of the Green's function method, we need to simultaneously integrate Eq. (21) and Eq. (22) for $U(t)$ and $\\Pi$ forward in time starting from $U_{ij}(0) = \\delta_{ij}$ and $\\Pi_{ijk}(0) = 0$. 
Whenever an error message is generated, we solve Eq. (23) for $v(t)$ and update the weights according to Eq. (24).\n\nThe memory size required by this algorithm is simply $N^3 + N^2$ for storing $U(t)$ and $\\Pi(t)$.\n\nThe speed of the algorithm is analyzed as follows. From Eq. (21) and Eq. (22) we see that the updates of $U(t)$ and $\\Pi$ both need $N^3$ operations per time step. To solve for $v(t)$ and update $w$, we also need $N^3$ operations per time step. So the on-line updating of weights needs $4N^3$ operations per time step in total. This is one order of magnitude faster than the current forward learning scheme.\n\nIII. Numerical Simulation\n\nWe present in this section numerical examples to demonstrate the proposed learning algorithm and benchmark it against Williams and Zipser's algorithm.\n\nFig. 1. Phase space trajectories (panels: Class 1, Class 2, Class 3). Three different shapes of 2-D trajectory, each shown in one column with three examples. Recurrent neural networks are trained to recognize the different shapes of trajectory.\n\nWe consider the trajectory classification problem. The input data are the time series of two-dimensional coordinate pairs $\\{x(t), y(t)\\}$ sampled along three different types of trajectories in the phase space. The sampling is taken uniformly with $\\Delta t = 2\\pi/60$. The trajectory equations are\n\n$$\\begin{cases} x(t) = \\sin(t + \\beta)\\,|\\sin(t)| \\\\\\\\ y(t) = \\cos(t + \\beta)\\,|\\sin(t)| \\end{cases} \\qquad \\begin{cases} x(t) = \\sin(0.5t + \\beta)\\sin(1.5t) \\\\\\\\ y(t) = \\cos(0.5t + \\beta)\\sin(1.5t) \\end{cases} \\qquad \\begin{cases} x(t) = \\sin(t + \\beta)\\sin(2t) \\\\\\\\ y(t) = \\cos(t + \\beta)\\sin(2t) \\end{cases}$$\n\nwhere $\\beta$ is a uniformly distributed random parameter. When $\\beta$ is changed, these trajectories are distorted accordingly. Nine examples (three for each class) are shown in Fig. 1. 
The neural net used here is a fully recurrent first-order network with dynamics\n\n$$S_i(t+1) = S_i(t) + \\tanh\\left(\\sum_{j=1}^{N+6} W_{ij}\\,(S \\oplus I)_j\\right) \\tag{25}$$\n\nwhere $S$ and $I$ are the vectors of state and input neurons, the symbol $\\oplus$ represents concatenation, and $N$ is the number of state neurons. Six input neurons are used to represent the normalized vector $\\{1, x(t), y(t), x(t)^2, y(t)^2, x(t)y(t)\\}$. The neural network structure is shown in Fig. 2.\n\nFig. 2. Recurrent Neural Network for Trajectory Classification (diagram labels: State(t), Input(t), State(t+1); the state neurons are checked at the end of the input sequence and compared with the target to form the error).\n\nFor recognition, each trajectory data sequence is fed to the input neurons and the state neurons evolve according to the dynamics in Eq. (25). At the end of the input series we check the last three state neurons and classify the input trajectory according to the \"winner-take-all\" rule. For training, we assign the desired final outputs for the three trajectory classes to be (1,0,0), (0,1,0) and (0,0,1) respectively. Meanwhile, we simultaneously integrate Eq. (21) for $U(t)$ and Eq. (22) for $\\Pi$. At the end, we calculate the error from Eq. (4) and solve Eq. (23) for $v(t)$ using the LU decomposition algorithm. Finally, we update the weights according to Eq. (24). Since the classification error is generated at the end of the input sequence, this learning does not have to be on-line. We present this example only to compare the speed of the proposed fast algorithm against that of Williams and Zipser's. We run the two algorithms for the same number of iterations and compare the CPU time used. The results are shown in Table 1, where in each iteration we present 150 training patterns, 50 for each class. These patterns are chosen by randomly selecting $\\beta$ values. It is seen that the CPU time ratio is $O(1/N)$, indicating that the Green's 
function algorithm is one order faster in $N$.\n\nAnother issue to be considered is the error convergence rate (or learning rate, as it is usually called). Although the two algorithms calculate the same weight correction as in Eq. (7), the outcomes may differ because of the different numerical schemes. As a result, the error convergence rates are slightly different even if the same learning rate $\\eta$ is used. In all numerical simulations we have conducted the learning results are very good (in testing, the recognition is perfect; not a single misclassification was found). But during training the error convergence rates are different. The numerical experiments show that the proposed fast algorithm converges more slowly than Williams and Zipser's for the small neural nets but faster for the large neural net.\n\nSimulation                            Fast Algorithm   Williams & Zipser's   Ratio\nN=4  (Number of Iterations = 200)     1607.4           5020.8                1:3\nN=8  (Number of Iterations = 50)      1981.7           10807.0               1:5\nN=12 (Number of Iterations = 50)      5947.6           45503.0               1:8\n\nTable 1. The CPU time (in seconds) comparison, implemented on a DEC3100 Workstation, for learning the trajectory classification example.\n\nIV. Conclusion\n\nThe Green's function has been used to develop a faster on-line learning algorithm for recurrent neural networks. This algorithm requires $O(N^3)$ operations for each time step, which is one order faster than Williams and Zipser's algorithm. The memory required is $O(N^3)$.\n\nOne feature of this algorithm is its straightforward formulation, which can be easily implemented numerically. A numerical example of trajectory classification has been used to demonstrate the speed of this fast algorithm compared to Williams and Zipser's algorithm.\n\nReferences\n\n[1] D. Rumelhart, G. Hinton, and R. Williams. 
Learning internal representations by error propagation. In Parallel Distributed Processing: Vol. I, MIT Press, 1986. P. Werbos, Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. thesis, Harvard University, 1974.\n\n[2] F. Pineda, Generalization of back-propagation to recurrent neural networks. Phys. Rev. Letters, 59(19):2229, 1987.\n\n[3] B. Pearlmutter, Learning state space trajectories in recurrent neural networks. Neural Computation, 1(2):263, 1989.\n\n[4] R. Williams and D. Zipser, A learning algorithm for continually running fully recurrent neural networks. Tech. Report ICS Report 8805, UCSD, La Jolla, CA 92093, November 1988.\n\n[5] N. Toomarian, J. Barhen and S. Gulati, Application of Adjoint Operators to Neural Learning. Appl. Math. Lett., 3(3):13-18, 1990.\n\n[6] N. Toomarian and J. Barhen, Adjoint-Functions and Temporal Learning Algorithms in Neural Networks. Advances in Neural Information Processing Systems 3, p. 113-120, Ed. by R. Lippmann, J. Moody and D. Touretzky, Morgan Kaufmann, 1991.\n\n[7] J. H. Schmidhuber, An $O(N^3)$ Learning Algorithm for Fully Recurrent Networks. Tech. Report FKI-151-91, Institut für Informatik, Technische Universität München, May 1991.\n\n[8] Guo-Zheng Sun, Hsing-Hen Chen and Yee-Chun Lee, A Fast On-line Learning Algorithm for Recurrent Neural Networks. Proceedings of International Joint Conference on Neural Networks, Seattle, Washington, page 11-13, June 1991.", "award": [], "sourceid": 504, "authors": [{"given_name": "Guo-Zheng", "family_name": "Sun", "institution": null}, {"given_name": "Hsing-Hen", "family_name": "Chen", "institution": null}, {"given_name": "Yee-Chun", "family_name": "Lee", "institution": null}]}