{"title": "Microscopic Equations in Rough Energy Landscape for Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 302, "page_last": 308, "abstract": null, "full_text": "Microscopic Equations in Rough Energy \n\nLandscape for Neural Networks \n\nK. Y. Michael Wong \nDepartment of Physics, \n\nThe Hong Kong University of Science and Technology, \n\nClear Water Bay, Kowloon, Hong Kong. \n\nE-mail: phkywong@usthk.ust.hk \n\nAbstract \n\nWe consider the microscopic equations for learning problems in \nneural networks. The aligning fields of an example are obtained \nfrom the cavity fields, which are the fields if that example were \nabsent in the learning process. In a rough energy landscape, we \nassume that the density of the local minima obey an exponential \ndistribution, yielding macroscopic properties agreeing with the first \nstep replica symmetry breaking solution. Iterating the microscopic \nequations provide a learning algorithm, which results in a higher \nstability than conventional algorithms. \n\n1 \n\nINTRODUCTION \n\nMost neural networks learn iteratively by gradient descent. As a result, closed ex(cid:173)\npressions for the final network state after learning are rarely known. This precludes \nfurther analysis of their properties, and insights into the design of learning algo(cid:173)\nrithms. To complicate the situation, metastable states (i.e. local minima) are often \npresent in the energy landscape of the learning space so that, depending on the \ninitial configuration, each one is likely to be the final state. \n\nHowever, large neural networks are mean field systems since the examples and \nweights strongly interact with each other during the learning process. This means \nthat when one example or weight is considered, the influence of the rest of the system \ncan be regarded as a background satisfying Some averaged properties. 
The situation is similar to a number of disordered systems such as spin glasses, in which mean field theories are applicable (Mezard, Parisi & Virasoro, 1987). This explains the success of statistical mechanical techniques such as the replica method in deriving the macroscopic properties of neural networks, e.g. the storage capacity (Gardner & Derrida, 1988) and the generalization ability (Watkin, Rau & Biehl, 1993). The replica method, though, provides much less information on the microscopic conditions of the individual dynamical variables. \n\nAn alternative mean field approach is the cavity method. It is a generalization of the Thouless-Anderson-Palmer approach to spin glasses, which started from microscopic equations of the system elements (Thouless, Anderson & Palmer, 1977). Mezard applied the method to neural network learning (Mezard, 1989). Subsequent extensions were made to the teacher-student perceptron (Bouten, Schietse & Van den Broeck, 1995), the AND machine (Griniasty, 1993) and the multiclass perceptron (Gerl & Krey, 1995). They yielded macroscopic properties identical to the replica approach, but the microscopic equations were not discussed, and the existence of local minima was neglected. \n\nRecently, the cavity method was applied to general classes of single and multilayer networks with smooth energy landscapes, i.e. without the local minima (Wong, 1995a). The aligning fields of the examples satisfy a set of microscopic equations. Solving these equations iteratively provides a learning algorithm, as confirmed by simulations in the maximally stable perceptron and the committee tree. The method is also useful in solving the dynamics of feedforward networks which were previously unsolvable (Wong, 1995b). \n\nDespite its success, the theory is so far applicable only to the regime of smooth energy landscapes. 
Beyond this regime, a stability condition is violated, and local minima begin to appear (Wong, 1995a). In this paper I present a mean field theory for the regime of rough energy landscapes. The complete analysis will be published elsewhere; here I sketch the derivations, emphasizing the underlying physical picture. As shown below, a similar set of microscopic equations holds in this case, as confirmed by simulations in the committee tree. In fact, we find that the solutions to these equations have a higher stability than other conventional learning algorithms. \n\n2 MICROSCOPIC EQUATIONS FOR SMOOTH ENERGY LANDSCAPES \n\nWe proceed by reviewing the cavity method for the case of smooth energy landscapes. For illustration we consider the single layer neural network (for two layer networks see Wong, 1995a). There are N \gg 1 input nodes \{S_j\} connecting to a single output node by the synaptic weights \{J_j\}. The output state is determined by the sign of the local field at the output node, i.e. S_{out} = \mathrm{sgn}(\sum_j J_j S_j). Learning a set of p examples means finding the weights \{J_j\} such that the network gives the correct input-to-output mapping for the examples. If example \mu maps the inputs S_j^\mu to the output \sigma^\mu, then a successful learning process should find a weight vector such that \mathrm{sgn}(\sum_j J_j \xi_j^\mu) = 1, where \xi_j^\mu \equiv \sigma^\mu S_j^\mu. Thus the usual approach to learning is to first define an energy function (or error function) E = \sum_\mu g(\Lambda^\mu), where \Lambda^\mu \equiv \sum_j J_j \xi_j^\mu / \sqrt{N} are the aligning fields, i.e. the local fields in the direction of the correct output, normalized by the factor \sqrt{N}. For example, the Adatron algorithm uses the energy function g(\Lambda) = (\kappa - \Lambda)\theta(\kappa - \Lambda), where \kappa is the stability parameter and \theta is the step function (Anlauf & Biehl, 1989). Next, one should minimize E by gradient descent dynamics. To avoid ambiguity, the weights are normalized to \sum_j (S_j^\mu)^2 = \sum_j J_j^2 = N. \n\nThe cavity method uses a self-consistency argument to consider what happens when a set of p examples is expanded to p + 1 examples. The central quantity in this method is the cavity field. For an added example labelled 0, the cavity field is the aligning field when it is fed to a network which learns examples 1 to p (but never learns example 0), i.e. t_0 \equiv \sum_j J_j \xi_j^0 / \sqrt{N}. Since the original network has no information about example 0, J_j and \xi_j^0 are uncorrelated. Thus the cavity field obeys a Gaussian distribution for random example inputs. \n\nAfter the network has learned examples 0 to p, the weights \{J_j\} adjust, and the cavity field t_0 adjusts to the generic aligning field \Lambda_0. As shown schematically in Fig. 1(a), we assume that the adjustments of the aligning fields of the original examples are small, typically of the order O(N^{-1/2}). Perturbative analysis concludes that the aligning field is a well defined function of the cavity field, i.e. \Lambda_0 = \Lambda(t_0), where \Lambda(t) is the inverse function of \n\nt = \Lambda + \gamma g'(\Lambda), (1) \n\nand \gamma is called the local susceptibility. The cavity fields satisfy a set of self-consistent equations \n\nt^\mu = \sum_{\nu \neq \mu} [\Lambda(t^\nu) - t^\nu] Q^{\nu\mu} + \alpha\chi \Lambda(t^\mu), (2) \n\nwhere Q^{\nu\mu} = \sum_j \xi_j^\nu \xi_j^\mu / N, \chi is called the nonlocal susceptibility, and \alpha \equiv p/N. The weights J_j are given by \n\nJ_j = (1 - \alpha\chi)^{-1} \frac{1}{\sqrt{N}} \sum_\mu [\Lambda(t^\mu) - t^\mu] \xi_j^\mu. (3) \n\nNoting the Gaussian distribution of the cavity fields, the macroscopic properties of the neural network, such as the storage capacity, can be derived, and the results are identical to those obtained by the replica method (Gardner & Derrida, 1988). However, the real advantage of the cavity method lies in the microscopic information it provides. The above equations can be iterated sequentially, resulting in a general learning algorithm. 
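For illustration, Eqs. (1)-(3) can be iterated as in the following minimal sketch, using the energy function g(\Lambda) = (\kappa - \Lambda)\theta(\kappa - \Lambda) quoted above. The susceptibilities \gamma and \chi are treated here as fixed inputs rather than solved self-consistently, and the damping, initialization, and function names are illustrative choices, not taken from the original scheme.

```python
import numpy as np

def aligning_field(t, kappa, gamma):
    """Inverse of t = L + gamma*g'(L) for g(L) = (kappa - L)*theta(kappa - L).
    Since g'(L) = -1 below kappa, the inverse function has a gap of width gamma."""
    t = np.asarray(t, dtype=float)
    return np.where(t >= kappa, t, np.where(t >= kappa - gamma, kappa, t + gamma))

def iterate_microscopic(xi, gamma, chi, kappa=1.0, sweeps=200, damp=0.5):
    """Damped fixed-point iteration of Eq. (2) for the cavity fields t^mu,
    followed by the weights of Eq. (3).  xi has shape (p, N)."""
    p, N = xi.shape
    alpha = p / N
    Q = xi @ xi.T / N                          # overlap matrix Q^{nu mu}
    t = np.zeros(p)
    for _ in range(sweeps):
        lam = aligning_field(t, kappa, gamma)
        # sum over nu != mu, plus the Onsager-like reaction term alpha*chi*Lambda
        rhs = Q @ (lam - t) - np.diag(Q) * (lam - t) + alpha * chi * lam
        t = (1 - damp) * rhs + damp * t        # damping for numerical stability
    lam = aligning_field(t, kappa, gamma)
    J = (lam - t) @ xi / (np.sqrt(N) * (1 - alpha * chi))
    return t, J

rng = np.random.default_rng(0)
p, N = 20, 100
xi = rng.choice([-1.0, 1.0], size=(p, N))      # random binary examples
t, J = iterate_microscopic(xi, gamma=0.5, chi=0.3)
```

The gap of width \gamma in aligning_field is exactly the discontinuity whose instability is discussed in Section 3 below.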
Simulations confirm that the equations are satisfied in the single layer perceptron, and that their generalized version holds in the committee tree at low loading (Wong, 1995a). \n\nFigure 1: Schematic drawing of the change in the energy landscape in the weight space when example 0 is added, for the regime of (a) smooth energy landscape, (b) rough energy landscape. \n\n3 MICROSCOPIC EQUATIONS FOR ROUGH ENERGY LANDSCAPES \n\nHowever, the above argument holds under the assumption that the adjustment due to the addition of a new example is controllable. We can derive a stability condition for this assumption, and we find that it is equivalent to the Almeida-Thouless condition in the replica method (Mezard, Parisi & Virasoro, 1987). \n\nAn example of such instability occurs in the committee tree, which consists of hidden nodes a = 1, ..., K with binary outputs, each fed by one of K nonoverlapping groups of N/K input nodes. The output of the committee tree is the majority state of the K hidden nodes. The solution in the cavity method minimizes the change from the cavity fields \{t_a\} to the aligning fields \{\Lambda_a\}, as measured by \sum_a (\Lambda_a - t_a)^2, in the space of correct outputs. Thus for a stability parameter \kappa, \Lambda_a = \kappa when t_a < \kappa and the value of t_a is at or above the median among the K hidden nodes; otherwise \Lambda_a = t_a. Note that a discontinuity exists in the aligning field function. Now suppose t_a < \kappa is the median, but the next highest value t_b happens to be slightly less than t_a. Then the addition of example 0 may induce a change from t_b < t_a to t_{b0} > t_{a0}. Hence \Lambda_{b0} changes from t_b to \kappa, whereas \Lambda_{a0} changes from \kappa to t_{a0}. The adjustment of the system is no longer small, and the previous perturbative analysis is not valid. 
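The minimal-change rule for the committee tree, and its discontinuity, can be made concrete in a short sketch. The function name and the tie-handling at the median are illustrative assumptions, not taken from the original text.

```python
import numpy as np

def committee_aligning_fields(t, kappa):
    """Minimal-change assignment for a K-branch committee tree: branches below
    stability kappa whose cavity field is at or above the median are raised to
    kappa, so the majority of hidden nodes become stable; the rest keep
    Lambda_a = t_a."""
    t = np.asarray(t, dtype=float)
    med = np.median(t)
    lam = t.copy()
    lam[(t < kappa) & (t >= med)] = kappa
    return lam

# The discontinuity: a shift of 0.01 in one cavity field swaps which branch
# is the median and changes the aligning fields by an O(1) amount.
a = committee_aligning_fields([0.50, 0.49, 2.0], kappa=1.0)  # branch 0 raised
b = committee_aligning_fields([0.49, 0.50, 2.0], kappa=1.0)  # branch 1 raised
```

Such an O(1) response to an O(N^{-1/2}) perturbation is precisely why the perturbative argument of Section 2 breaks down here.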
In fact, it has been shown that all networks having a gap in the aligning field function are not stable against the addition of examples (Wong, 1995a). \n\nTo consider what happens beyond the stability regime, one has to take into account the rough energy landscape of the learning space. Suppose that the original global minimum for examples 1 to p is \alpha. After adding example 0, a nonvanishing change to the system is induced, and the global minimum shifts to the neighborhood of the local minimum \beta, as schematically shown in Fig. 1(b). Hence the resultant aligning fields \Lambda_0^\beta are no longer well-defined functions of the cavity fields t_0^\alpha. Instead they are well-defined functions of the cavity fields t_0^\beta. Nevertheless, one may expect that correlations exist between the states \alpha and \beta. \n\nLet q_0 be the correlation between the network states, i.e. \langle J_j^\alpha J_j^\beta \rangle = q_0. Since both states \alpha and \beta are determined in the absence of the added example 0, the correlation \langle t_0^\alpha t_0^\beta \rangle = q_0 as well. Knowing that both t_0^\alpha and t_0^\beta obey Gaussian distributions, the cavity field distribution can be determined if we know the prior distribution of the local minima. \n\nAt this point we introduce the central assumption in the cavity method for rough energy landscapes: we assume that the number of local minima at energy E obeys an exponential distribution dn(E) = C \exp(-wE) dE. Similar assumptions have been used in specifying the density of states in disordered systems (Mezard, Parisi & Virasoro, 1987). Thus for single layer networks (and for two layer networks with appropriate generalizations), the cavity field distribution is given by \n\nP(t_0^\beta | t_0^\alpha) = \frac{G(t_0^\beta | t_0^\alpha) \exp[-w \Delta E(\Lambda(t_0^\beta))]}{\int dt_0^\beta \, G(t_0^\beta | t_0^\alpha) \exp[-w \Delta E(\Lambda(t_0^\beta))]}, (4) \n\nwhere G(t_0^\beta | t_0^\alpha) is a Gaussian distribution, w is a parameter describing the distribution, and \Lambda(t_0^\beta) is the aligning field function. The weights J_j^\beta are given by \n\nJ_j^\beta = (1 - \alpha\chi)^{-1} \frac{1}{\sqrt{N}} \sum_\mu [\Lambda(t_\mu^\beta) - t_\mu^\beta] \xi_j^\mu. (5) \n\nNoting the Gaussian distribution of the cavity fields, self-consistent equations for both q_0 and the local susceptibility \gamma can be derived. \n\nTo determine the distribution of local minima, namely the parameters C and w, we introduce a \"free energy\" F(p, N) for p examples and N input nodes, given by dn(E) = \exp[w(F(p, N) - E)] dE. This \"free energy\" determines the averaged energy of the local minima and should be an extensive quantity, i.e. it should scale as the system size. Cavity arguments enable us to find an expression for F(p + 1, N) - F(p, N). Similarly, we may consider a cavity argument for the addition of one input node, expanding the network size from N to N + 1. This yields an expression for F(p, N + 1) - F(p, N). Since F is an extensive quantity, F(p, N) should scale as N for a given ratio \alpha = p/N. This implies \n\nF/N = \alpha [F(p + 1, N) - F(p, N)] + [F(p, N + 1) - F(p, N)]. (6) \n\nWe have thus obtained an expression for the averaged energy of the local minima. Minimizing the free energy with respect to the parameter w gives a self-consistent equation. \n\nThe three equations for q_0, \gamma and w completely determine the model. The macroscopic properties of the neural network, such as the storage capacity, can be derived, and the results are identical to the first step replica symmetry breaking solution in the replica method. \n\nIt remains to check whether the microscopic equations have been modified by the roughening of the energy landscape. It turns out that while the cavity fields in the initial state \alpha do not satisfy the microscopic equations (2), those at the final metastable state \beta do, except that the nonlocal susceptibility \chi has to be replaced by its average over the distribution of the local minima. 
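For intuition, the reweighted cavity field distribution of Eq. (4) can be evaluated numerically on a grid. The sketch below assumes unit-variance cavity fields, so that G(t_0^\beta | t_0^\alpha) is Gaussian with mean q_0 t_0^\alpha and variance 1 - q_0^2, and uses \Delta E = (\kappa - t)\theta(\kappa - t) as an illustrative single-example energy cost; both choices are assumptions for illustration, not specifics from the text.

```python
import numpy as np

def cavity_field_distribution(t_alpha, w, q0, kappa=1.0):
    """Grid evaluation of Eq. (4): a conditional Gaussian G(t_beta | t_alpha)
    reweighted by exp(-w * DeltaE) and renormalized.  Larger w suppresses
    local minima that pay a high energy cost for the new example."""
    grid = np.linspace(-6.0, 6.0, 2001)
    dx = grid[1] - grid[0]
    mean, var = q0 * t_alpha, 1.0 - q0 ** 2
    G = np.exp(-(grid - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    energy = np.where(grid < kappa, kappa - grid, 0.0)  # illustrative DeltaE
    P = G * np.exp(-w * energy)
    P /= P.sum() * dx                                    # normalize on the grid
    return grid, P

grid, P0 = cavity_field_distribution(t_alpha=0.0, w=0.0, q0=0.5)
grid, P2 = cavity_field_distribution(t_alpha=0.0, w=2.0, q0=0.5)
dx = grid[1] - grid[0]
mean0 = (grid * P0).sum() * dx
mean2 = (grid * P2).sum() * dx
```

At w = 0 the reweighting is inactive and the bare Gaussian is recovered; increasing w shifts probability mass toward cavity fields above the stability \kappa.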
In fact, the nonlocal susceptibility describes the reactive effects due to the background examples, which adjust on the addition of the new example. (Technically, this is called the Onsager reaction.) The adjustments due to hopping between valleys in a rough energy landscape have thus been taken into account. \n\n4 SIMULATION RESULTS \n\nTo verify the theory, I simulate a committee tree learning random examples. Learning can be done by the more conventional Least Action algorithm (Nilsson, 1965), or by iterating the microscopic equations. \n\nWe verify that the Least Action algorithm yields an aligning field function \Lambda(t) consistent with the cavity theory. Suppose the weights from input j to hidden node a are given by J_{aj} = \sum_\mu x_{a\mu} \xi_j^\mu / \sqrt{N}. Comparing with J_{aj} = (1 - \alpha\chi)^{-1} \sum_\mu (\Lambda_{a\mu} - t_{a\mu}) \xi_j^\mu / \sqrt{N}, we estimate the nonlocal susceptibility \chi by requiring the distribution of \hat{t}_{a\mu} \equiv \Lambda_{a\mu} - (1 - \alpha\chi) x_{a\mu} to have a zero first moment. \hat{t}_{a\mu} is then an estimate of t_{a\mu}. Fig. 2 shows the resultant relation between \Lambda_{a\mu} and \hat{t}_{a\mu}. It agrees with the predictions of the cavity theory. Fig. 3 shows the values of the stability parameter \kappa measured from the Least Action algorithm and the microscopic equations. They have better agreement with the predictions of the rough energy landscape (first step replica symmetry breaking solution) than with the smooth energy landscape (replica symmetric solution). Note that the microscopic equations yield a higher stability than the Least Action algorithm. \n\nFigure 2: The aligning fields versus the cavity fields for a branch of the committee tree with K = 3, \alpha = 0.8 and N = 600. The dashed line is the prediction of the cavity theory for the regime of rough energy landscape. \n\nFigure 3: The stability parameter \kappa versus the storage level \alpha in the committee tree with K = 3 for the cavity theory of: (a) smooth energy landscape (dashed line), (b) rough energy landscape (solid line), and the simulation of: (c) iterating the microscopic equations (circles), (d) the Least Action algorithm (squares). Error bars are smaller than the size of the symbols. \n\n5 CONCLUSION \n\nIn summary, we have derived the microscopic equations for neural network learning in the regime of rough energy landscapes. They turn out to have the same form as in the case of smooth energy landscapes, except that the parameters are averaged over the distribution of local minima. Iterating the equations results in a learning algorithm, which yields a higher stability than more conventional algorithms in the committee tree. However, for high loading, the iterations may not converge. \n\nThe success of the present scheme lies in its ability to take into account the underlying physical picture of many local minima of comparable energy. It correctly describes the experience that slightly different training sets may lead to vastly different neural networks. The stability parameter predicted by the rough landscape ansatz has better agreement with simulations than the smooth one. It provides a physical interpretation of the replica symmetry breaking solution in the replica method. It is possible to generalize the theory to the physical picture with hierarchies of clusters of local minima, which corresponds to the infinite step replica symmetry breaking solution, though the mathematics is much more involved. \n\nAcknowledgements \n\nThis work is supported by the Hong Kong Telecom Institute of Information Technology, HKUST. 
\n\nReferences \n\nAnlauf, J. K. & Biehl, M. (1989) The AdaTron: an adaptive perceptron algorithm. Europhysics Letters 10(7):687-692. \n\nBouten, M., Schietse, J. & Van den Broeck, C. (1995) Gradient descent learning in perceptrons: A review of its possibilities. Physical Review E 52(2):1958-1967. \n\nGardner, E. & Derrida, B. (1988) Optimal storage properties of neural network models. Journal of Physics A: Mathematical and General 21(1):271-284. \n\nGerl, F. & Krey, U. (1995) A Kuhn-Tucker cavity method for generalization with applications to perceptrons with Ising and Potts neurons. Journal of Physics A: Mathematical and General 28(23):6501-6516. \n\nGriniasty, M. (1993) \"Cavity-approach\" analysis of the neural-network learning problem. Physical Review E 47(6):4496-4513. \n\nMezard, M. (1989) The space of interactions in neural networks: Gardner's computation with the cavity method. Journal of Physics A: Mathematical and General 22(12):2181-2190. \n\nMezard, M., Parisi, G. & Virasoro, M. (1987) Spin Glass Theory and Beyond. Singapore: World Scientific. \n\nNilsson, N. J. (1965) Learning Machines. New York: McGraw-Hill. \n\nThouless, D. J., Anderson, P. W. & Palmer, R. G. (1977) Solution of 'solvable model of a spin glass'. Philosophical Magazine 35(3):593-601. \n\nWatkin, T. L. H., Rau, A. & Biehl, M. (1993) The statistical mechanics of learning a rule. Reviews of Modern Physics 65(2):499-556. \n\nWong, K. Y. M. (1995a) Microscopic equations and stability conditions in optimal neural networks. Europhysics Letters 30(4):245-250. \n\nWong, K. Y. M. (1995b) The cavity method: Applications to learning and retrieval in neural networks. In J.-H. Oh, C. Kwon and S. Cho (eds.), Neural Networks: The Statistical Mechanics Perspective, pp. 175-190. Singapore: World Scientific.", "award": [], "sourceid": 1177, "authors": [{"given_name": "K. Y. Michael", "family_name": "Wong", "institution": null}]}