{"title": "Effects of Noise on Convergence and Generalization in Recurrent Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 649, "page_last": 656, "abstract": null, "full_text": "Effects of Noise on Convergence and \nGeneralization in Recurrent Networks \n\nKam Jim \n\nBill G. Horne \n\nc. Lee Giles* \n\nNEC Research Institute, Inc., 4 Independence Way, Princeton, NJ 08540 \n\n{kamjim,horne,giles}~research.nj.nec.com \n\nUMIACS, University of Maryland, College Park, MD 20742 \n\n*Also with \n\nAbstract \n\nWe introduce and study methods of inserting synaptic noise into \ndynamically-driven recurrent neural networks and show that ap(cid:173)\nplying a controlled amount of noise during training may improve \nconvergence and generalization. In addition, we analyze the effects \nof each noise parameter (additive vs. multiplicative, cumulative vs. \nnon-cumulative, per time step vs. per string) and predict that best \noverall performance can be achieved by injecting additive noise at \neach time step. Extensive simulations on learning the dual parity \ngrammar from temporal strings substantiate these predictions. \n\n1 \n\nINTRODUCTION \n\nThere has been much research in applying noise to neural networks to improve net(cid:173)\nwork performance. It has been shown that using noisy hidden nodes during training \ncan result in error-correcting codes which increase the tolerance of feedforward nets \nto unreliable nodes (Judd and Munro, 1992) . Also, randomly disabling hidden \nnodes during the training phase increases the tolerance of MLP's to node failures \n(Sequin and Clay, 1990). Bishop showed that training with noisy data is equivalent \nto Tikhonov Regularization and suggested directly minimizing the regularized error \nfunction as a practical alternative (Bishop , 1994) . Hanson developed a stochastic \nversion of the delta rule which adapt weight means and standard deviations instead \n\n\f650 \n\nKam Jim, Bill G. Home, C. 
Lee Giles \n\nof clean weight values (Hanson, 1990). (Mpitsos and Burton, 1992) demonstrated \nfaster learning rates by adding noise to the weight updates and adapting the mag(cid:173)\nnitude of such noise to the output error. Most relevant to this paper, synaptic noise \nhas been applied to MLP's during training to improve fault tolerance and training \nquality. (Murray and Edwards, 1993) \nIn this paper, we extend these results by introducing several methods of inserting \nsynaptic noise into recurrent networks, and demonstrate that these methods can \nimprove both convergence and generalization. Previous work on improving these \ntwo performance measures have focused on ways of simplifying the network and \nmethods of searching the coarse regions of state space before the fine regions. Our \nwork shows that synaptic noise can improve convergence by searching for promising \nregions of state space, and enhance generalization by enforcing saturated states. \n\n2 NOISE INJECTION IN RECURRENT NETWORKS \n\nIn this paper, we inject noise into a High Order recurrent network (Giles et al., 1992) \nconsisting of N recurrent state neurons Sj, L non-recurrent input neurons lie, and \nN 2 L weights Wijle. (For justification ofits use see Section 4.) The recurrent network \noperation is defined by the state process S:+1 = g(Lj,1e WijleSJlD, where g(.) is a \nsigmoid discriminant function. During training, an error function is computed as \nEp = !f;, where fp = Sb -dp, Sb is the output neuron, and dp is the target output \nvalue tor pattern p. \n\nSynaptic noise has been simulated on Multi-Layered-Perceptrons by inserting noise \nto the weights of each layer during training (Murray et al., 1993). Applying this \nmethod to recurrent networks is not straightforward because effectively the same \nweights are propagated forward in time. 
This can be seen by recalling the BPTT representation of unrolling a recurrent network in time into T layers with identical weights, where T is the length of the input string. In Tables 2 and 3, we introduce the noise injection steps for eight recurrent network noise models representing all combinations of the following noise parameters: additive vs. multiplicative, cumulative vs. non-cumulative, per time step vs. per string. As their names imply, additive and multiplicative noise add or multiply the weights by a small noise term. In cumulative noise, the injected noise is accumulated, while in non-cumulative noise the noise from the current step is removed before more noise is injected on the next step. Per time step and per string noise refer to when the noise is inserted: either at each time step or only once for each training string, respectively. Table 1 illustrates noise accumulation examples for all additive models (the multiplicative case is analogous). \n\n3 ANALYSIS OF THE EFFECTS OF SYNAPTIC NOISE \n\nThe effects of each noise model are analyzed by taking the Taylor expansion of the error function around the noise-free weight set. By truncating this expansion to second and lower order terms, we can interpret the effect of noise as a set of regularization terms applied to the error function. From these terms, predictions can be made about the effects on generalization and convergence. A similar analysis was performed on MLPs to demonstrate the effects of synaptic noise on fault tolerance and training quality (Murray and Edwards, 1993). Tables 2 and 3 list the noise injection step and the resulting first and second order Taylor expansion terms for all noise models. These results are derived by assuming the noise to be zero-mean white noise with variance sigma^2 and uncorrelated in time. 
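The noise models above can be made concrete with a short sketch. The following is an illustrative implementation (ours, not the authors' code) of a noisy forward pass through the second-order network S_i^{t+1} = g(sum_{j,k} W_{ijk} S_j^t I_k^t), covering all combinations of additive/multiplicative, cumulative/non-cumulative, and per time step/per string noise; all function and parameter names are our own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def noisy_forward(W, inputs, s0, mode="additive", cumulative=False,
                  per_string=False, sigma=0.1, rng=None):
    """Forward pass S_i(t+1) = g(sum_jk W[i,j,k] S_j(t) I_k(t)) with synaptic
    noise injected into the weights during training.
    W: (N, N, L) weight tensor; inputs: (T, L) one-hot sequence; s0: (N,) state."""
    rng = np.random.default_rng() if rng is None else rng
    s = s0.copy()
    string_noise = rng.normal(0.0, sigma, W.shape)  # drawn once per string
    acc = np.zeros_like(W)                          # accumulator for cumulative noise
    for x in inputs:
        # per-string noise reuses one draw; per-time-step noise redraws each step
        step_noise = string_noise if per_string else rng.normal(0.0, sigma, W.shape)
        if cumulative:
            acc += step_noise   # noise builds up over time steps (cf. Table 1)
            delta = acc
        else:
            delta = step_noise  # previous step's noise is discarded first
        Wt = W + delta if mode == "additive" else W * (1.0 + delta)
        s = sigmoid(np.einsum('ijk,j,k->i', Wt, s, x))
    return s

# Demo: with sigma = 0 every model reduces to the same noiseless network.
rng = np.random.default_rng(0)
W0 = rng.normal(size=(3, 3, 2))
xs = np.eye(2)[rng.integers(0, 2, size=5)]
s_init = np.array([1.0, 0.0, 0.0])
clean = noisy_forward(W0, xs, s_init, sigma=0.0, rng=rng)
```

During testing, as in the paper, no noise would be injected (sigma = 0 or the noiseless weights used directly).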
\n\n3.1 Predictions on Generalization \n\nOne common cause of bad generalization in recurrent networks is the presence of unsaturated state representations. Typically, a network cannot revisit the exact same point in state space, but tends to wander away from its learned state representation. One approach to alleviate this problem is to encourage state nodes to operate in the saturated regions of the sigmoid. The first order error expansion terms of most of the noise models considered are capable of encouraging the network to achieve saturated states. This can be shown by applying the chain rule to the partial derivative in the first order expansion terms: \n\ndS_0^T / dW_{t,ijk} = (dS_0^T / dxi_t^i) (dxi_t^i / dW_{t,ijk}) = (dS_0^T / dxi_t^i) S_t^j I_t^k,    (1) \n\nwhere xi_t^i is the net input to state node i at time step t. The partial derivatives dS_0^T / dxi_t^i favor internal representations such that the effects of perturbations to the net inputs xi_t^i are minimized. \n\nMultiplicative noise implements a form of weight decay because the error expansion terms include the weight products W_{t,ijk}^2 or W_{t,ijk} W_{u,ijk}. Although weight decay has been shown to improve generalization on feedforward networks (Krogh and Hertz, 1992), we hypothesize this may not be the case for recurrent networks that are learning to solve FSA problems. Large weights are necessary to saturate the state nodes to the upper and lower limits of the sigmoid discriminant function. Therefore, we predict additive noise will allow better generalization because of its absence of weight decay. \n\nNoise models whose first order error term contains the expression dS_0^T / dW_{t,ijk} . dS_0^T / dW_{u,lmn} will favor saturated states for those partials whose sign corresponds to the sign of a majority of the partials. They will favor unsaturated states, operating in the linear region of the sigmoid, for partials whose sign is in the minority. Such sign-dependent enforcement is not optimal. 
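The claim that saturated states suppress the effect of net-input perturbations can be checked numerically for the logistic sigmoid. This small demonstration is ours, not the paper's:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A perturbation eps to the net input xi moves the node output by roughly
# g'(xi) * eps, and g'(xi) = g(xi) * (1 - g(xi)) shrinks as xi saturates.
eps = 0.1
swings = {}
for xi in (0.0, 2.0, 6.0):  # linear region -> increasingly saturated
    swings[xi] = abs(sigmoid(xi + eps) - sigmoid(xi - eps))
    print(f"net input {xi:4.1f}: output swing {swings[xi]:.5f}")
```

The output swing drops by roughly two orders of magnitude between the linear region (xi near 0) and the saturated region (xi near 6), which is why partials of the form dS_0^T / dxi are minimized by saturated internal representations.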
\nThe error terms for cumulative per time step noise sum a product with the expression v (dS_0^T / dW_{t,ijk}) (dS_0^T / dW_{u,lmn}), where v = min(t + 1, u + 1). The effect of cumulative noise increases more rapidly because of v, and thus optimal generalization and detrimental noise effects will occur at lower amplitudes than for non-cumulative noise. \n\nFor cumulative per string noise models, the products (t + 1)(u + 1) and W_{t,ijk} W_{u,lmn} in the expansion terms rapidly overwhelm the raw error term. Generalization improvement is not expected for these models. \n\nWe also reason that all generalization enhancements will be valid only for a range of noise values, above which noise overwhelms the raw error information. \n\n3.2 Predictions on Convergence \n\nSynaptic noise can improve convergence by favoring promising weights in the beginning stages of training. This can be demonstrated by examining the second order error expansion term for non-cumulative, multiplicative, per time step noise: \n\n(sigma^2 / 2) e_p sum_{t=0}^{T-1} sum_{ijk} W_{t,ijk}^2 (d^2 S_0^T / dW_{t,ijk}^2). \n\nWhen e_p is negative, solutions with a negative second order state-weight partial derivative will be de-stabilized. In other words, when the output S_0^T is too small, the network will favor updating in a direction such that the first order partial derivative is increasing. A corresponding relationship can be observed for the case when e_p is positive. Thus the second order term of the error function will allow a higher raw error e_p to be favored if such an update will place the weights in a more promising area, i.e., a region where weight changes are likely to move S_0^T in a direction that reduces the raw error. The anticipatory effect of this term is more important in the beginning stages of training where e_p is large, and will become insignificant in the finishing stages of training as e_p approaches zero. 
\n\nSimilar to the arguments in Section 3.1, the absence of weight decay will make the learning task easier and improve convergence. \n\nFrom this discussion it can be inferred that additive per time step noise models should yield the best generalization and convergence performance because of their sign-independent favoring of saturated states and the absence of weight decay. Furthermore, convergence and generalization performance is more sensitive to cumulative noise, i.e., optimal performance and detrimental effects will occur at lower amplitudes than in non-cumulative noise. \n\n4 SIMULATION RESULTS \n\nIn order to perform many experiments in a reasonable amount of computation time, we attempt to learn the simple \"hidden-state\" dual parity automaton from sample strings encoded as temporal sequences. (Dual parity is a 4-state automaton that recognizes binary strings containing an even number of ones and an even number of zeros.) We choose a second-order recurrent network since such networks have demonstrated good performance on such problems (Giles et al., 1992). Our experiments consist of 500 simulations for each data point and achieve useful (90%) confidence levels. Experiments are performed with both 3- and 4-state networks, both of which are adequate to learn the automaton. The learning rate and momentum are set to 0.5, and the weights are initialized to random values in [-1.0, 1.0]. The data consists of 8191 strings of lengths 0 to 12. The networks are trained on a subset of the training set, called the working set, which gradually increases in size until the entire training set is classified correctly. Strings from the working set are presented in alphabetical order. The training set consists of the first 1023 strings of lengths 0 to 9, while the initial working set consists of the 31 strings of lengths 0 to 4. During testing no noise is added to the weights of the network. 
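As a concrete illustration, the data set described above can be generated as follows. This sketch is ours; it only mirrors the counts stated in the text (8191 strings of lengths 0 to 12, a 1023-string training set of lengths 0 to 9, and a 31-string initial working set of lengths 0 to 4):

```python
from itertools import product

def dual_parity_label(s):
    """Accept iff the string has an even number of ones AND an even number of zeros."""
    return s.count('1') % 2 == 0 and s.count('0') % 2 == 0

def strings_up_to(max_len):
    """All binary strings of length 0..max_len, shortest first,
    alphabetical within each length (product('01', ...) is lexicographic)."""
    out = ['']
    for n in range(1, max_len + 1):
        out.extend(''.join(bits) for bits in product('01', repeat=n))
    return out

data = [(s, dual_parity_label(s)) for s in strings_up_to(12)]   # 8191 strings
training_set = [(s, y) for s, y in data if len(s) <= 9]         # first 1023 strings
working_set = [(s, y) for s, y in data if len(s) <= 4]          # initial 31 strings
```

The empty string is accepted (zero is even for both symbol counts), matching the automaton's accepting start state.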
\n\nFigure 1: Best Convergence/Generalization for Additive and Multiplicative Noises. (a) multiplicative non-cumulative per time step; (b) additive cumulative per time step. (Axes: training noise standard deviation vs. convergence in epochs.) \n\n4.1 Convergence and Generalization Performance \n\nSimulated performance closely mirrors our predictions. Improvements were observed for all noise models except the cumulative per string noises, which failed to converge on all runs. Generalization improvement was more pronounced on networks with 4 states, while convergence enhancement was more noticeable on 3-state networks. The simulations show the following results: \n\n\u2022 Additive noise is better tolerated than multiplicative noise, and achieves better convergence and generalization (Figure 1). \n\n\u2022 Cumulative noise achieves optimal generalization and convergence at lower amplitudes than non-cumulative noise. Cumulative noise also has a narrower range of beneficial noise, which is defined as the range of noise amplitudes yielding better performance than that of a noiseless network (Figure 2a illustrates this for generalization). \n\n\u2022 Per time step noise achieves better convergence/generalization and has a wider range of beneficial values than per string noise (Figure 2b). \n\nOverall, the best performance is obtained by applying cumulative and non-cumulative additive noise at each time step. These results closely match the predictions of Section 3.1. The only exception is that all multiplicative noise models seem to yield equivalent performance. 
This discrepancy between prediction and simulation may be due to the detrimental effects of weight decay in multiplicative noise, which can conflict with the advantages of cumulative and per time step noise. \n\n4.2 The Payoff Picture: Generalization vs. Convergence \n\nGeneralization vs. convergence results are plotted in Figure 3. Increasing noise amplitudes proceed from the left end-point of each curve to the right end-point. \n\nTable 1: Examples of Additive Noise Accumulation. Delta_i is the noise injected at time step t_i. \n\nNOISE MODEL                    | t1           | t2                    | t3                              | ... \nper time step non-cumulative   | W + Delta_1  | W + Delta_2           | W + Delta_3                     | ... \nper time step cumulative       | W + Delta_1  | W + Delta_1 + Delta_2 | W + Delta_1 + Delta_2 + Delta_3 | ... \nper string non-cumulative      | W + Delta_1  | W + Delta_1           | W + Delta_1                     | ... \nper string cumulative          | W + Delta_1  | W + 2 Delta_1         | W + 3 Delta_1                   | ... \n\nFigure 2: (I) Best Generalization for Cumulative and Non-Cumulative Noises: (a) cumulative additive per time step; (b) non-cumulative additive per time step. (II) Best Generalization for Per Time Step and Per String Noises: (a) non-cumulative per string additive; (b) non-cumulative per time step additive. \n\nFigure 3: Payoff: Mean Generalization vs. Convergence for 4-state (I) and 3-state (II) recurrent networks. 
(Ia) Worst 4-state - non-cumulative multiplicative per string; (Ib, Ic) Best 4-state - cumulative and non-cumulative additive per time step, respectively; (IIa) Worst 3-state - non-cumulative multiplicative per string; (IIb) Best 3-state - cumulative additive per time step. \n\nTable 2: Noise injection step and error expansion terms for per time step noise models. v = min(t + 1, u + 1); W* is the noise-free weight set; e_p is the raw error; Delta is zero-mean white noise with variance sigma^2. \n\nNON-CUMULATIVE, additive: \nNoise step: W_{t,ijk} = W*_{ijk} + Delta_{t,ijk} \n1st order: (sigma^2 / 2) sum_{t=0}^{T-1} sum_{ijk} (dS_0^T / dW_{t,ijk})^2 \n2nd order: (sigma^2 / 2) e_p sum_{t=0}^{T-1} sum_{ijk} d^2 S_0^T / dW_{t,ijk}^2 \n\nNON-CUMULATIVE, multiplicative: \nNoise step: W_{t,ijk} = W*_{ijk} (1 + Delta_{t,ijk}) \n1st order: (sigma^2 / 2) sum_{t=0}^{T-1} sum_{ijk} W_{t,ijk}^2 (dS_0^T / dW_{t,ijk})^2 \n2nd order: (sigma^2 / 2) e_p sum_{t=0}^{T-1} sum_{ijk} W_{t,ijk}^2 d^2 S_0^T / dW_{t,ijk}^2 \n\nCUMULATIVE, additive: \nNoise step: W_{t,ijk} = W*_{ijk} + sum_{tau=0}^{t} Delta_{tau,ijk} \n1st order: (sigma^2 / 2) sum_{t,u=0}^{T-1} sum_{ijk} v (dS_0^T / dW_{t,ijk}) (dS_0^T / dW_{u,ijk}) \n2nd order: (sigma^2 / 2) e_p sum_{t,u=0}^{T-1} sum_{ijk} v d^2 S_0^T / (dW_{t,ijk} dW_{u,ijk}) \n\nCUMULATIVE, multiplicative: \nNoise step: W_{t,ijk} = W*_{ijk} prod_{tau=0}^{t} (1 + Delta_{tau,ijk}) \n1st order: (sigma^2 / 2) sum_{t,u=0}^{T-1} sum_{ijk} v W_{t,ijk} W_{u,ijk} (dS_0^T / dW_{t,ijk}) (dS_0^T / dW_{u,ijk}) \n2nd order: (sigma^2 / 2) e_p sum_{t,u=0}^{T-1} sum_{ijk} v W_{t,ijk} W_{u,ijk} d^2 S_0^T / (dW_{t,ijk} dW_{u,ijk}) \n\nThese plots illustrate the cases where both convergence and generalization are improved. In Figure 3(II) the curves clearly curl down and to the left for lower noise amplitudes before rising to the right at higher noise amplitudes. These lower regions are important because they represent noise values where generalization and convergence improve simultaneously and do not trade off. \n\n5 CONCLUSIONS \n\nWe have presented several methods of injecting synaptic noise into recurrent neural networks. We summarized the results of an analysis of these methods and empirically tested them on learning the dual parity automaton from strings encoded as temporal sequences. (For a complete discussion of the results, see (Jim, Giles, and Horne, 1994).) The results show that most of these methods can improve generalization and convergence simultaneously - most other methods previously discussed in the literature cast convergence as a cost for improved generalization performance. \n\nReferences \n\n[1] Chris M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 1994. To appear. \n\n[2] Robert M. Burton, Jr. and George J. Mpitsos. Event-dependent control of noise enhances learning in neural networks. Neural Networks, 5:627-637, 1992. \n\n[3] C.L. Giles, C.B. Miller, D. Chen, H.H. Chen, G.Z. Sun, and Y.C. Lee. Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4(3):393-405, 1992. \n\n[4] Stephen Jose Hanson. A stochastic version of the delta rule. Physica D, 42:265-272, 1990. \n\n[5] Kam Jim, C.L. Giles, and B.G. Horne. Synaptic noise in dynamically-driven recurrent neural networks: Convergence and generalization. 
Technical Report UMIACS-TR-94-89 and CS-TR-3322, Institute for Advanced Computer Studies, University of Maryland, College Park, MD, 1994. \n\n[6] Stephen Judd and Paul W. Munro. Nets with unreliable hidden nodes learn error-correcting codes. In S.J. Hanson, J.D. Cowan, and C.L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 89-96, San Mateo, CA, 1993. Morgan Kaufmann Publishers. \n\n[7] Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 950-957, San Mateo, CA, 1992. Morgan Kaufmann Publishers. \n\n[8] Alan F. Murray and Peter J. Edwards. Synaptic weight noise during multilayer perceptron training: Fault tolerance and training improvements. IEEE Trans. on Neural Networks, 4(4):722-725, 1993. \n\n[9] Carlo H. Sequin and Reed D. Clay. Fault tolerance in artificial neural networks. In Proc. of IJCNN, volume I, pages 703-708, 1990. \n", "award": [], "sourceid": 882, "authors": [{"given_name": "Kam", "family_name": "Jim", "institution": null}, {"given_name": "Bill", "family_name": "Horne", "institution": null}, {"given_name": "C.", "family_name": "Giles", "institution": null}]}