{"title": "Back Propagation is Sensitive to Initial Conditions", "book": "Advances in Neural Information Processing Systems", "page_first": 860, "page_last": 867, "abstract": null, "full_text": "Back Propagation is Sensitive to Initial Conditions \n\nJohn F. Kolen \n\nJordan B. Pollack \n\nLaboratory for Artificial Intelligence Research \n\nThe Ohio State University \nColumbus. OH 43210. USA \nkolen-j@cis.ohio-state.edu \npollack@cis.ohio-state.edu \n\nAbstract \n\nfunctions with \n\nlearning simple \n\nThis paper explores the effect of initial weight selection on feed-forward \nthe back-propagation \nnetworks \ntechnique. We first demonstrate. through the use of Monte Carlo \ntechniques. that the magnitude of the initial condition vector (in weight \nspace) is a very significant parameter in convergence time variability. In \norder to further understand \nthis result. additional deterministic \nexperiments were performed. The results of these experiments \ndemon~trate the extreme sensitivity of back propagation to initial weight \nconfiguration. \n\nINTRODUCTION \n\n1 \nBack Propagation (Rwnelhart et al .\u2022 1986) is the network training method of choice for \nmany neural network projects. and for good reason. Like other weak methods, it is simple \nto implement, faster than many other \"general\" approaches. well-tested by the field. and \neasy to mold (with domain knowledge encoded in the learning environment) into very \nspecific and efficient algorithms. \nRumelhart et al. made a confident statement: for many tasks. \"the network: rarely gets \nstuck in poor local mininla that are significantly worse than the global minima. \"(p. 536) \nAccording to them. initial weights of exactly 0 cannot be used. since symmetries in the \nenvironment are not sufficient to break symmetries in initial weights. Since their paper \nwas published. the convention in the field has been to choose initial weights with a \nuniform distribution between plus and minus P. 
usually set to 0.5 or less. \n\nThe convergence claim was based solely upon their empirical experience with the back propagation technique. Since then, Minsky & Papert (1988) have argued that there exists no proof of convergence for the technique, and several researchers (e.g. Judd 1988) have found that the convergence time must be related to the difficulty of the problem, otherwise an unsolved computer science question (P = NP) would finally be answered. We do not wish to make claims about convergence of the technique in the limit (with vanishing step-size), or the relationship between task and performance, but wish to talk about a pervasive behavior of the technique which has gone unnoticed for several years: the sensitivity of back propagation to initial conditions. \n\n2 THE MONTE-CARLO EXPERIMENT \n\nInitially, we performed empirical studies to determine the effect of learning rate, momentum rate, and the range of initial weights on t-convergence (Kolen and Goel, to appear). We use the term t-convergence to refer to whether or not a network, starting at a precise initial configuration, could learn to separate the input patterns according to a boolean function (correct outputs above or below 0.5) within t epochs. The experiment consisted of training a 2-2-1 network on exclusive-or while varying three independent variables in 114 combinations: learning rate, η, equal to 1.0 or 2.0; momentum rate, α, equal to 0.0, 0.5, or 0.9; and initial weight range, ρ, equal to 0.1 to 0.9 in 0.1 increments, and 1.0 to 10.0 in 1.0 increments. Each combination of parameters was used to initialize and train a number of networks.¹ Figure 1 plots the percentage of t-convergent (where t = 50,000 epochs of 4 presentations) initial conditions for the 2-2-1 network trained on the exclusive-or problem. 
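For concreteness, the Monte Carlo t-convergence test just described can be sketched in a few lines of Python. This is a minimal reconstruction under stated assumptions (the weight layout, per-pattern updates with momentum, and the sigmoid clamp are our choices), not the authors' code:

```python
# Minimal sketch of the Monte Carlo t-convergence test: a 2-2-1 network
# is trained on exclusive-or by back-propagation with momentum, starting
# from weights drawn uniformly from [-rho, rho]. Layout and update
# details are assumptions, not the authors' implementation.
import math
import random

XOR = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
       ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]

def sigmoid(x):
    if x < -60.0:   # clamp to avoid math.exp overflow on divergent runs
        return 0.0
    if x > 60.0:
        return 1.0
    return 1.0 / (1.0 + math.exp(-x))

def t_convergent(eta, alpha, rho, t_max, seed=0):
    """True if the net separates XOR (outputs on the correct side of
    0.5 for all four patterns) within t_max epochs."""
    rng = random.Random(seed)
    # 9 weights: hidden1 (w0, w1, bias w2), hidden2 (w3, w4, bias w5),
    # output (w6, w7, bias w8), matching the 2-2-1 architecture.
    w = [rng.uniform(-rho, rho) for _ in range(9)]
    dw = [0.0] * 9

    def forward(x):
        h1 = sigmoid(w[0] * x[0] + w[1] * x[1] + w[2])
        h2 = sigmoid(w[3] * x[0] + w[4] * x[1] + w[5])
        o = sigmoid(w[6] * h1 + w[7] * h2 + w[8])
        return h1, h2, o

    for _ in range(t_max):
        if all((forward(x)[2] > 0.5) == (y > 0.5) for x, y in XOR):
            return True
        for x, y in XOR:  # one epoch = 4 presentations
            h1, h2, o = forward(x)
            go = (y - o) * o * (1.0 - o)      # output unit delta
            g1 = go * w[6] * h1 * (1.0 - h1)  # hidden unit deltas
            g2 = go * w[7] * h2 * (1.0 - h2)
            grads = [g1 * x[0], g1 * x[1], g1,
                     g2 * x[0], g2 * x[1], g2,
                     go * h1, go * h2, go]
            for i in range(9):
                dw[i] = eta * grads[i] + alpha * dw[i]  # momentum term
                w[i] += dw[i]
    return False

def percent_nonconvergent(eta, alpha, rho, trials=20, t_max=2000):
    """Estimate percent non-t-convergence for one parameter setting."""
    misses = sum(not t_convergent(eta, alpha, rho, t_max, seed=s)
                 for s in range(trials))
    return 100.0 * misses / trials
```

Sweeping rho with `percent_nonconvergent` would give a coarse version of the curves in Figure 1, though the sample sizes and epoch limits here are far smaller than the paper's.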
From the figure we thus conclude that the choice of ρ ≤ 0.5 is more than a convenient symmetry-breaking default, but is quite necessary to obtain low levels of nonconvergent behavior. \n\n[Figure 1: Percentage T-Convergence vs. Initial Weight Range. Percent non-convergence after 50,000 trials, plotted against ρ from 0.0 to 10.0, with one curve per combination: η=1.0/α=0.0, η=1.0/α=0.5, η=1.0/α=0.9, and η=2.0/α=0.9.] \n\n3 SCENES FROM EXCLUSIVE-OR \n\nWhy do networks exhibit the behavior illustrated in Figure 1? While some might argue that very high initial weights (i.e., ρ > 10.0) lead to very long convergence times since the derivative of the semi-linear sigmoid function is effectively zero for large weights, this does not explain the fact that when ρ is between 2.0 and 4.0, the non-t-convergence rate varies from 5 to 50 percent. \n\n1. Numbers ranged from 8 to 8355, depending on availability of computational resources. Those data points calculated with small samples were usually surrounded by data points with larger samples. \n\nFigure 2: (Schematic Network) \nFigure 3: (-5-3+3+6Y-1-6+7X) η=3.25 α=0.40 \nFigure 4: (+4-7+6+0-3Y+1X+1) η=2.75 α=0.00 \nFigure 5: (-5+5+1-6+3XY+8+3) η=2.75 α=0.80 \nFigure 6: (YX-3+6+8+3+1+7-3) η=3.25 α=0.00 \nFigure 7: (Y+3-9-2+6+7-3X+7) η=3.25 α=0.60 \nFigure 8: (-6-4XY-6-6+9-4-9) η=3.00 α=0.50 \nFigure 9: (-2+1+9-1X-3+8Y-4) η=2.75 α=0.20 \nFigure 10: (+1+8-3-6X-1+1+8Y) η=3.50 α=0.90 \nFigure 11: (+7+4-9-9-5Y-3+9X) η=3.00 α=0.70 \nFigure 12: (-9.0,-1.8) step 0.018 \nFigure 13: (-6.966,-0.500) step 0.004 
\nThus, we decided to utilize a more deterministic approach for eliciting the structure of initial conditions giving rise to t-convergence. Unfortunately, most networks have many weights, and thus many dimensions in initial-condition space. We can, however, examine 2-dimensional slices through the space in great detail. A slice is specified by an origin and two orthogonal directions (the X and Y axes). In the figures below, we vary the initial weights regularly throughout the plane formed by the axes (with the origin in the lower left-hand corner) and collect the results of running back-propagation to a particular time limit for each initial condition. The map is displayed with grey-level linearly related to time of convergence: black meaning not t-convergent and white representing the fastest convergence time in the picture. Figure 2 is a schematic representation of the networks used in this and the following experiment. The numbers on the links and in the nodes will be used for identification purposes. Figures 3 through 11 show several interesting \"slices\" of the initial condition space for 2-2-1 networks trained on exclusive-or. Each slice is compactly identified by its 9-dimensional weight vector and associated learning/momentum rates. For instance, the vector (-3+2+7-4X+5-2-6Y) describes a network with an initial weight of -0.3 between the left hidden unit and the left input unit. Likewise, \"+5\" in the sixth position represents an initial bias of 0.5 to the right hidden unit. The letters \"X\" and \"Y\" indicate that the corresponding weight is varied along the X- or Y-axis from -10.0 to +10.0 in steps of 0.1. All the figures in this paper contain the results of 40,000 runs of back-propagation (i.e. 200 pixels by 200 pixels) for up to 200 epochs (where an epoch consists of 4 training examples). \n\nFigures 12 and 13 present a closer look at the sensitivity of back-propagation to initial conditions. 
These figures zoom into a complex region of Figure 11; the captions list the location of the origin and step size used to generate each picture. \n\nSensitivity behavior can also be demonstrated with even simpler functions. Take the case of a 2-2-1 network learning the or function. Figure 14 shows the effect of learning \"or\" on networks (+5+5-1X+5-1Y+3-1) and varying weights 4 (X-axis) and 7 (Y-axis) from -20.0 to 20.0 in steps of 0.2. Figure 15 shows the same region, except that it partitions the display according to equivalent solution networks after t-convergence (200 epoch limit), rather than the time to convergence. Two networks are considered equivalent² if their weights have the same sign. Since there are 9 weights, there are 512 (2^9) possible network equivalence classes. Figures 16 through 25 show successive zooms into the central swirl identified by the XY coordinate of the lower-left corner and pixel step size. After 200 iterations, the resulting networks could be partitioned into 37 (both convergent and nonconvergent) classes. Obviously, the smooth behavior of the t-convergence plots can be deceiving, since two initial conditions, arbitrarily alike, can obtain quite different final network configurations. \n\nNote the triangles appearing in Figures 19, 21, 23 and the mosaic in Figure 25 corresponding to the area which did not converge in 200 iterations in Figure 24. The triangular boundaries are similar to fractal structures generated under iterated function systems (Barnsley 1988): in this case, the iterated function is the back propagation \n\n2. For rendering purposes only. It is extremely difficult to know precisely the equivalence classes of solutions, so we approximated. 
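The sign-based equivalence relation used for rendering can be sketched as follows; packing the signs into an integer class id is an illustrative choice, not the authors' encoding:

```python
# Minimal sketch of the rendering-only equivalence relation: two
# solution networks fall in the same class exactly when each of the
# 9 weights has the same sign, giving 2**9 = 512 possible classes.
def sign_class(weights):
    """Pack the weight signs into an integer in [0, 2**len(weights))."""
    cid = 0
    for w in weights:
        cid = (cid << 1) | (1 if w > 0 else 0)
    return cid

# All-positive and all-negative nets occupy the two extreme classes.
assert sign_class([0.5] * 9) == 511   # binary 111111111
assert sign_class([-0.5] * 9) == 0
```

Coloring each pixel of a slice by `sign_class` of its final weights reproduces the kind of partition shown in the "Solution Networks" panels.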
\nFigure 14: (-20.00000, -20.00000) Step 0.200000 \nFigure 15: Solution Networks \nFigure 16: (-4.500000, -4.500000) Step 0.030000 \nFigure 17: Solution Networks \nFigure 18: (-1.680000, -1.350000) Step 0.002400 \nFigure 19: Solution Networks \nFigure 20: (-1.536000, -1.197000) Step 0.000780 \nFigure 21: Solution Networks \nFigure 22: (-1.472820, -1.145520) Step 0.000070 \nFigure 23: Solution Networks \nFigure 24: (-1.467150, -1.140760) Step 0.000016 \nFigure 25: Solution Networks \n\n           Figure 26      Figure 28      Figures 27, 29, 30 \nWeight 1   -0.34959000    -0.34959000    -0.34959000 \nWeight 2    0.00560000     0.00560000     0.00560000 \nWeight 3   -0.26338813     0.39881098     0.65060705 \nWeight 4    0.75501968    -0.16718577     0.75501968 \nWeight 5    0.47040862    -0.28598450     0.91281711 \nWeight 6   -0.18438011    -0.18438011    -0.19279729 \nWeight 7    0.46700363    -0.06778983     0.56181073 \nWeight 8   -0.48619500     0.66061292     0.20220653 \nWeight 9    0.62821201    -0.39539510     0.11201949 \nWeight 10  -0.90039973     0.55021922     0.67401200 \nWeight 11   0.48940201     0.35141364    -0.54978875 \nWeight 12  -0.70239312    -0.17438740    -0.69839197 \nWeight 13  -0.95838741    -0.07619988    -0.19659844 \nWeight 14   0.46940394     0.88460041     0.89221204 \nWeight 15  -0.73719884     0.67141031    -0.56879740 \nWeight 16   0.96140103    -0.10578894     0.20201484 \n\nTable 1: Network Weights for Figures 26 through 30 \n\nlearning method. We propose that these fractal-like boundaries arise in back-propagation due to the existence of multiple solutions (attractors), the non-zero learning parameters, and the non-linear deterministic nature of the gradient descent approach. When more than one hidden unit is utilized, or when an environment has internal symmetry or is very underconstrained, 
then there will be multiple attractors corresponding to the large number of hidden-unit permutations which form equivalence classes of functionality. As the number of solutions available to the gradient descent method increases, the more complicated the non-local interactions between them. This explains the puzzling result that several researchers have noted, that as more hidden units are added, instead of speeding up, back-propagation slows down (e.g. Lippman and Gold, 1987). Rather than a hill-climbing metaphor with local peaks to get stuck on, we should instead think of a many-body metaphor: the existence of many bodies does not imply that a particle will take a simple path to land on one. From this view, we see that Rumelhart et al.'s claim of back-propagation usually converging is due to a very tight focus inside the \"eye of the storm\". \n\nCould learning and momentum rates also be involved in the storm? Such a question prompted another study, this time focused on the interaction of learning and momentum rates. Rather than alter the initial weights of a set of networks, we varied the learning rate along the X axis and momentum rate along the Y axis. Figures 26, 27, and 28 were produced by training a 3-3-1 network on 3-bit parity until t-convergence (250 epoch limit). Table 1 lists the initial weights of the networks trained in Figures 26 through 31. Examination of the fuzzy area in Figure 26 shows how small changes in learning and/or momentum rate can drastically affect t-convergence (Figures 30 and 31). 
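The learning-rate/momentum scan can be sketched the same way as the weight-space slices: fix the 16 initial weights of a 3-3-1 parity network and vary (η, α) over a grid. This is a hypothetical reconstruction (the weight layout and update rule are assumptions), not the authors' code:

```python
# Hypothetical sketch of the eta/alpha grid scan: a 3-3-1 network with
# fixed initial weights (16 of them, as in Table 1) is trained on 3-bit
# parity at each (eta, alpha) grid point, recording epochs needed for
# t-convergence (t_max meaning "not t-convergent").
import math

PARITY3 = [([float(i >> 2 & 1), float(i >> 1 & 1), float(i & 1)],
            float((i >> 2 & 1) ^ (i >> 1 & 1) ^ (i & 1)))
           for i in range(8)]

def sig(x):
    if x < -60.0:   # clamp to avoid overflow when training diverges
        return 0.0
    if x > 60.0:
        return 1.0
    return 1.0 / (1.0 + math.exp(-x))

def epochs_to_converge(w0, eta, alpha, t_max=250):
    # Assumed layout: hidden j uses w[4j..4j+2] (inputs) and w[4j+3]
    # (bias) for j = 0..2; output uses w[12..14] (hidden) and w[15].
    w, dw = list(w0), [0.0] * 16
    for epoch in range(t_max):
        ok = True
        for x, y in PARITY3:
            h = [sig(sum(w[4*j+k] * x[k] for k in range(3)) + w[4*j+3])
                 for j in range(3)]
            o = sig(sum(w[12+j] * h[j] for j in range(3)) + w[15])
            ok = ok and ((o > 0.5) == (y > 0.5))
            go = (y - o) * o * (1.0 - o)
            gh = [go * w[12+j] * h[j] * (1.0 - h[j]) for j in range(3)]
            grads = [gh[j] * [*x, 1.0][k]
                     for j in range(3) for k in range(4)]
            grads += [go * h[j] for j in range(3)] + [go]
            for i in range(16):
                dw[i] = eta * grads[i] + alpha * dw[i]
                w[i] += dw[i]
        if ok:
            return epoch
    return t_max

def scan_eta_alpha(w0, etas, alphas, t_max=250):
    """Rows indexed by alpha (Y axis), columns by eta (X axis)."""
    return [[epochs_to_converge(w0, eta, alpha, t_max)
             for eta in etas] for alpha in alphas]
```

Rendering the returned grid as grey levels, with t_max drawn black, gives the kind of map shown in Figures 26 through 30.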
\nFigure 26: η=(0.0, 4.0) α=(0.0, 1.25) \nFigure 27: η=(0.0, 4.0) α=(0.0, 1.25) \nFigure 28: η=(0.0, 4.0) α=(0.0, 1.25) \nFigure 29: η=(3.456, 3.504) α=(0.835, 0.840) \nFigure 30: η=(3.84, 3.936) α=(0.59, 0.62) \n\n4 DISCUSSION \n\nChaotic behavior has been carefully circumvented by many neural network researchers (through the choice of symmetric weights by Hopfield (1982), for example), but has been reported with increasing frequency over the past few years (e.g. Kurten and Clark, 1986). Connectionists, who use neural models for cognitive modeling, disregard these reports of extreme non-linear behavior in spite of common knowledge that non-linearity is what enables network models to perform non-trivial computations in the first place. All work to date has noticed various forms of chaos in network dynamics, but not in learning dynamics. Even if back-propagation is shown to be non-chaotic in the limit, this still does not preclude the existence of fractal boundaries between attractor basins, since other non-chaotic non-linear systems produce such boundaries (e.g. forced pendulums with two attractors (D'Humieres et al., 1982)). \n\nWhat does this mean to the back-propagation community? From an engineering applications standpoint, where only the solution matters, nothing at all. When an optimal set of weights for a particular problem is discovered, it can be reproduced through digital means. From a scientific standpoint, however, this sensitivity to initial conditions demands that neural network learning results must be specially treated to guarantee replicability. 
\nWhen theoretical claims are made (from experience) regarding the power of an adaptive network to model some phenomena, or when claims are made regarding the similarity between psychological data and network performance, the initial conditions for the network need to be precisely specified or filed in a public scientific database. \n\nWhat about the future of back-propagation? We remain neutral on the issue of its ultimate convergence, but our result points to a few directions for improved methods. Since the slowdown occurs as a result of global influences of multiple solutions, an algorithm for first factoring the symmetry out of both network and training environment (e.g. domain knowledge) may be helpful. Furthermore, it may also turn out that search methods which harness \"strange attractors\" ergodically guaranteed to come arbitrarily close to some subset of solutions might work better than methods based on strict gradient descent. Finally, we view this result as strong impetus to discover how to exploit the information-creative aspects of non-linear dynamical systems for future models of cognition (Pollack 1989). \n\nAcknowledgements \n\nThis work was supported by Office of Naval Research grant number N00014-89-J1200. Substantial free use of over 200 Sun workstations was generously provided by our department. \n\nReferences \n\nM. Barnsley, Fractals Everywhere, Academic Press, San Diego, CA, (1988). \nJ. J. Hopfield, \"Neural Networks and Physical Systems with Emergent Collective Computational Abilities\", Proceedings US National Academy of Science, 79:2554-2558, (1982). \nD. D'Humieres, M. R. Beasley, B. A. Huberman, and A. Libchaber, \"Chaotic States and Routes to Chaos in the Forced Pendulum\", Physical Review A, 26:3483-96, (1982). \nS. Judd, \"Learning in Networks is Hard\", Journal of Complexity, 4:177-192, (1988). \nJ. Kolen and A. 
Goel, \"Learning in Parallel Distributed Processing Networks: \nComputational Complexity and Information Content\", IEEE Transactions on Systems, \nMan, and Cybernetics, in press. \nK. E. KUrten and J. W. Clark, \"Chaos in Neural Networks\", Physics Letters, 114A,413-\n418, (1986). \nR. P. Lippman and B. Gold, \"Neural Oassifiers Useful for Speech Recognition\", In 1 st \nInternational Conference on Neural Networks ,IEEE, IV:417-426, (1987). \nM. L. Minsky and S. A. Papert, Perceptrons. MIT Press, (1988). \nJ. B. Pollack, \"Implications of Recursive Auto Associative Memories\", In Advances ;12 \nNeural Information Processing Systems. (ed. D. Touretzky) pp 527-536, Morgan \nKaufman, San Mateo, (1989) . \nD. E. Rumelhart, G. E. Hinton, and R. J. Williams, \"Learning Representation by Back(cid:173)\nPropagating Errors\", Nature, 323:533-536, (1986). \n\n\f", "award": [], "sourceid": 395, "authors": [{"given_name": "John", "family_name": "Kolen", "institution": null}, {"given_name": "Jordan", "family_name": "Pollack", "institution": null}]}