{"title": "A Theory of Mean Field Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 351, "page_last": 360, "abstract": null, "full_text": "A Theory of Mean Field Approximation \n\nT.Tanaka \n\nDepartment of Electronics and Information Engineering \n\nTokyo Metropolitan University \n\nI-I, Minami-Osawa, Hachioji , Tokyo 192-0397 Japan \n\nAbstract \n\nI present a theory of mean field approximation based on information ge(cid:173)\nometry. This theory includes in a consistent way the naive mean field \napproximation, as well as the TAP approach and the linear response the(cid:173)\norem in statistical physics, giving clear information-theoretic interpreta(cid:173)\ntions to them. \n\n1 \n\nINTRODUCTION \n\nMany problems of neural networks, such as learning and pattern recognition, can be cast \ninto a framework of statistical estimation problem. How difficult it is to solve a particular \nproblem depends on a statistical model one employs in solving the problem. For Boltzmann \nmachines[ 1] for example, it is computationally very hard to evaluate expectations of state \nvariables from the model parameters. \n\nMean field approximation[2], which is originated in statistical physics, has been frequently \nused in practical situations in order to circumvent this difficulty. In the context of statistical \nphysics several advanced theories have been known , such as the TAP approach[3], linear \nresponse theorem[4], and so on. For neural networks, application of mean field approxi(cid:173)\nmation has been mostly confined to that of the so-called naive mean field approximation, \nbut there are also attempts to utilize those advanced theories[5, 6, 7, 8] . \nIn this paper I present an information-theoretic formulation of mean field approximation. It \nis based on information geometry[9], which has been successfully applied to several prob(cid:173)\nlems in neural networks[ 1 0]. 
This formulation includes the naive mean field approximation as well as the advanced theories in a consistent way. I give the formulation for Boltzmann machines, but its extension to wider classes of statistical models is possible, as described elsewhere[11]. \n\n2 BOLTZMANN MACHINES \n\nA Boltzmann machine is a statistical model with N binary random variables s_i ∈ {-1, 1}, i = 1, ..., N. The vector s = (s_1, ..., s_N) is called the state of the Boltzmann machine. The state s is also a random variable, and its probability law is given by the Boltzmann-Gibbs distribution \n\np(s) = e^{-E(s) - ψ(p)}, (1) \n\nwhere E(s) is the \"energy\" defined by \n\nE(s) = -Σ_i h_i s_i - Σ_{(ij)} w_{ij} s_i s_j, (2) \n\nwith h_i and w_{ij} the parameters, and -ψ(p) is determined by the normalization condition and is called the Helmholtz free energy of p. The notation (ij) means that the summation should be taken over all distinct pairs. \n\nLet η_i(p) ≡ ⟨s_i⟩_p and η_{ij}(p) ≡ ⟨s_i s_j⟩_p, where ⟨·⟩_p means the expectation with respect to p. The following problem is essential for Boltzmann machines: \n\nProblem 1 Evaluate the expectations η_i(p) and η_{ij}(p) from the parameters h_i and w_{ij} of the Boltzmann machine p. \n\n3 INFORMATION GEOMETRY \n\n3.1 ORTHOGONAL DUAL FOLIATIONS \n\nThe whole set M of Boltzmann-Gibbs distributions (1) realizable by a Boltzmann machine is regarded as an exponential family. Let us use the shorthand notations I, J, ..., to represent distinct pairs of indices, such as ij. The parameters h_i and w^I constitute a coordinate system of M, called the canonical parameters of M. The expectations η_i and η_I constitute another coordinate system of M, called the expectation parameters of M. \n\nLet F_0 be the subset of M on which the w^I are all equal to zero. I call F_0 the factorizable submodel of M since p(s) ∈ F_0 can be factorized with respect to s_i. 
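Before introducing the geometric machinery, it is worth noting that Problem 1 can be solved exactly by brute force for small N, by enumerating all 2^N states. The following sketch (Python, with hypothetical example parameters h and w) does exactly this; its cost grows as 2^N, which is why mean field approximation is needed in practice.

```python
import itertools
import math

def exact_expectations(h, w):
    """Exact eta_i = <s_i> and eta_ij = <s_i s_j> under the Boltzmann-Gibbs
    distribution p(s) proportional to exp(sum_i h_i s_i + sum_(ij) w_ij s_i s_j)."""
    n = len(h)
    states = list(itertools.product((-1, 1), repeat=n))
    # Unnormalized weights exp(-E(s)); the log of their sum is psi(p).
    weights = [math.exp(sum(h[i] * s[i] for i in range(n))
                        + sum(w[i][j] * s[i] * s[j]
                              for i in range(n) for j in range(i + 1, n)))
               for s in states]
    z = sum(weights)
    eta_i = [sum(wt * s[i] for s, wt in zip(states, weights)) / z
             for i in range(n)]
    eta_ij = {(i, j): sum(wt * s[i] * s[j] for s, wt in zip(states, weights)) / z
              for i in range(n) for j in range(i + 1, n)}
    return eta_i, eta_ij

# Hypothetical parameters for a 3-unit machine (only the upper triangle of w is used).
h = [0.3, -0.2, 0.1]
w = [[0.0, 0.5, 0.2],
     [0.0, 0.0, -0.3],
     [0.0, 0.0, 0.0]]
eta_i, eta_ij = exact_expectations(h, w)
```

This exact routine also serves as a reference against which the approximations developed below can be checked on toy problems.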
On F_0 the problem is easy: since the w^I are all zero, the s_i are statistically independent of one another, and therefore η_i = tanh h_i and η_{ij} = η_i η_j hold. \n\nMean field approximation systematically reduces the problem onto the factorizable submodel F_0. For this reduction, I introduce dual foliations F and A onto M. The foliation F = {F(w)}, M = ∪_w F(w), is parametrized by w ≡ (w^I), and each leaf F(w) is defined as \n\nF(w) = {p(s) | w^I(p) = w^I}. (3) \n\nThe leaf F(0) is the same as F_0, the factorizable submodel. Each leaf F(w) is again an exponential family with h_i and η_i the canonical and the expectation parameters, respectively. A pair of dual potentials is defined on each leaf: one is the Helmholtz free energy ψ(p), and the other is its Legendre transform, the Gibbs free energy, \n\nφ(p) = Σ_i h_i(p) η_i(p) - ψ(p), (4) \n\nand the parameters of p ∈ F(w) are given by \n\nη_i(p) = ∂_i ψ(p), h_i(p) = ∂^i φ(p), (5) \n\nwhere ∂_i ≡ ∂/∂h_i and ∂^i ≡ ∂/∂η_i. Another foliation A = {A(m)}, M = ∪_m A(m), is parametrized by m ≡ (m_i), and each leaf A(m) is defined as \n\nA(m) = {p(s) | η_i(p) = m_i}. (6) \n\nEach leaf A(m) is not an exponential family, but again a pair of dual potentials ψ̃ and φ̃ is defined on each leaf, the former given by \n\nψ̃(p) = ψ(p) - Σ_i h_i(p) m_i, (7) \n\nand the latter by its Legendre transform, \n\nφ̃(p) = Σ_I w^I(p) η_I(p) - ψ̃(p), (8) \n\nand the parameters of p ∈ A(m) are given by \n\nη_I(p) = ∂_I ψ̃(p), w^I(p) = ∂^I φ̃(p), (9) \n\nwhere ∂_I ≡ ∂/∂w^I and ∂^I ≡ ∂/∂η_I. These two foliations form orthogonal dual foliations, since the leaves F(w) and A(m) are orthogonal at their intersecting point. I introduce still another coordinate system on M, called the mixed coordinate system, on the basis of the orthogonal dual foliations. 
It uses a pair (m, w) of the expectation and the canonical parameters to specify a single element p ∈ M. The m part specifies the leaf A(m) on which p resides, and the w part specifies the leaf F(w). \n\n3.2 REFORMULATION OF PROBLEM \n\nAssume that a target Boltzmann machine q is given by specifying its parameters h_i(q) and w^I(q). Problem 1 is restated as follows: evaluate its expectations η_i(q) and η_I(q) from those parameters. To evaluate η_i, mean field approximation translates the problem into the following one: \n\nProblem 2 Let F(w) be the leaf on which q resides. Find p ∈ F(w) which is the closest to q. \n\nAt first sight this problem is trivial, since one immediately finds the solution p = q. However, solving this problem with respect to η_i(p) is nontrivial, and it is the key to understanding mean field approximation, including the advanced theories. \n\nLet us measure the proximity of p to q by the Kullback divergence \n\nD(p‖q) = Σ_s p(s) log [p(s)/q(s)]; (10) \n\nthen solving Problem 2 reduces to finding a minimizer p ∈ F(w) of D(p‖q) for a given q. For p, q ∈ F(w), D(p‖q) is expressed in terms of the dual potentials ψ and φ of F(w) as \n\nD(p‖q) = φ(p) + ψ(q) - Σ_i h_i(q) η_i(p). (11) \n\nThe minimization problem is thus equivalent to minimizing \n\nG(p) = φ(p) - Σ_i h_i(q) η_i(p), (12) \n\nsince ψ(q) in eq. (11) does not depend on p. Solving the stationary condition ∂^i G(p) = 0 with respect to η_i(p) will give the correct expectations η_i(q), since the true minimizer is p = q. However, this scenario is in general intractable, since φ(p) cannot be given explicitly as a function of η_i(p). \n\n3.3 PLEFKA EXPANSION \n\nThe problem is easy if w^I = 0. In this case φ(p) is given explicitly as a function of m_i ≡ η_i(p) as \n\nφ(p) = (1/2) Σ_i [(1 + m_i) log((1 + m_i)/2) + (1 - m_i) log((1 - m_i)/2)]. (13) \n\nMinimization of G(p) with respect to m_i gives the solution m_i = tanh h_i, as expected. 
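The w^I = 0 case above can be checked directly. A minimal sketch (hypothetical field values h): eq. (13) gives φ on F(0), the objective G of eq. (12) restricted to F(0) follows, and the stationary condition ∂G/∂m_i = tanh⁻¹ m_i - h_i = 0 is verified at m_i = tanh h_i.

```python
import math

def phi0(m):
    # Gibbs free energy on F(0), eq. (13): the negative entropy of
    # independent +/-1 spins with means m_i.
    return 0.5 * sum((1 + mi) * math.log((1 + mi) / 2)
                     + (1 - mi) * math.log((1 - mi) / 2) for mi in m)

def G0(m, h):
    # G(p) = phi(p) - sum_i h_i(q) m_i, eq. (12), restricted to F(0).
    return phi0(m) - sum(hi * mi for hi, mi in zip(h, m))

h = [0.7, -0.4]                          # hypothetical target fields
m_star = [math.tanh(hi) for hi in h]     # claimed minimizer
# Stationarity: dG/dm_i = atanh(m_i) - h_i vanishes at m_star.
residual = [math.atanh(mi) - hi for mi, hi in zip(m_star, h)]
```

Comparing G0 at m_star against nearby points confirms numerically that m_i = tanh h_i is the minimizer.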
When w^I ≠ 0 the expression (13) is no longer exact, but to compensate for the error one may use, leaving convergence problems aside, the Taylor expansion of φ(w) ≡ φ(p) around w = 0: \n\nφ(w) = φ(0) + Σ_I (∂_I φ(0)) w^I + (1/2) Σ_{IJ} (∂_I ∂_J φ(0)) w^I w^J + (1/6) Σ_{IJK} (∂_I ∂_J ∂_K φ(0)) w^I w^J w^K + ⋯. (14) \n\nThis expansion has been called the Plefka expansion[12] in the literature of spin glasses. Note that in considering the expansion one should temporarily assume that m is fixed: one can rely on the solution m evaluated from the stationary condition ∂G(p) = 0 only if the expansion does not change the value of m. \n\nThe coefficients in the expansion can be efficiently computed by fully utilizing the orthogonal dual structure of the foliations. First, we have the following theorem: \n\nTheorem 1 The coefficients of the expansion (14) are given by the cumulant tensors of the corresponding orders, defined on A(m). \n\nBecause φ = -ψ̃ holds, one can consider derivatives of ψ̃ instead of those of φ. The first-order derivatives ∂_I ψ̃ are immediately given by the property of the potential of the leaf A(m) (eq. (9)), yielding \n\n∂_I ψ̃(0) = η_I(p_0), (15) \n\nwhere p_0 denotes the distribution on A(m) corresponding to w = 0. The coefficients of the lowest orders, including the first-order one, are given by the following theorem. \n\nTheorem 2 The first-, second-, and third-order coefficients of the expansion (14) are given by: \n\n∂_I ψ̃(0) = η_I(p_0), \n∂_I ∂_J ψ̃(0) = ⟨(∂_I l)(∂_J l)⟩_{p_0}, \n∂_I ∂_J ∂_K ψ̃(0) = ⟨(∂_I l)(∂_J l)(∂_K l)⟩_{p_0}, (16) \n\nwhere l ≡ log p_0. \n\nThe proofs will be found in [11]. It should be noted that, although these results happen to be the same as those which would be obtained by regarding A(m) as an exponential family, they are not the same in general, since A(m) is actually not an exponential family; for example, they are different for the fourth-order coefficients. 
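The lowest-order coefficients can be verified numerically for a 2-unit machine: compute ψ̃(w) = ψ - Σ_i h_i m_i on the leaf A(m) by solving for the fields h(w) that keep the means fixed at m, then differentiate by finite differences at w = 0. A sketch under stated assumptions: the damped fixed-point iteration used to find h(w) is an ad hoc choice that converges for small couplings, and m = (0.5, 0.5) is a hypothetical example (matching the setting of Fig. 1).

```python
import math

def psi(h, w12):
    # Log partition function of a 2-unit Boltzmann machine.
    return math.log(sum(math.exp(h[0] * s1 + h[1] * s2 + w12 * s1 * s2)
                        for s1 in (-1, 1) for s2 in (-1, 1)))

def means(h, w12):
    z, e = 0.0, [0.0, 0.0]
    for s1 in (-1, 1):
        for s2 in (-1, 1):
            p = math.exp(h[0] * s1 + h[1] * s2 + w12 * s1 * s2)
            z += p; e[0] += p * s1; e[1] += p * s2
    return [e[0] / z, e[1] / z]

def psi_tilde(m, w12, iters=200):
    # Find h(w) on the leaf A(m) by damped fixed-point iteration
    # (assumed to converge for small |w12|), then evaluate
    # psi_tilde = psi - sum_i h_i m_i.
    h = [math.atanh(m[0]), math.atanh(m[1])]
    for _ in range(iters):
        mu = means(h, w12)
        h = [h[i] + 0.5 * (math.atanh(m[i]) - math.atanh(mu[i])) for i in range(2)]
    return psi(h, w12) - h[0] * m[0] - h[1] * m[1]

m = [0.5, 0.5]
eps = 1e-3
d1 = (psi_tilde(m, eps) - psi_tilde(m, -eps)) / (2 * eps)
d2 = (psi_tilde(m, eps) - 2 * psi_tilde(m, 0.0) + psi_tilde(m, -eps)) / eps ** 2
# Per eqs. (15), (17), (18): d1 should approach m1*m2 and
# d2 should approach (1 - m1^2)(1 - m2^2).
```

Both finite differences agree with the closed-form coefficients to within the discretization error.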
\n\nThe explicit formulas for these coefficients for Boltzmann machines are given as follows : \n\n\u2022 For the first-order, \n\n(17) \n\n\fA Theory of Mean Field Approximation \n\n\u2022 For the second-order, \n\n(th )2~(O) = (1 - mr)(1 - m;,) \n\n(I = ii'), \n\nand \n\n\u2022 For the third-order, \n\n(th )3~(O) = 4mi mi' (1 - mn(1 - mr,) (I = ii'), \nand for 1 = ij, J = j k, K = ik for three distinct indices i , j, and k, \n\n(h{h8K~(O) = (1 - m;)(1 - m;)(1 - m~) \n\nFor other combinations of I , J, and K , \n\n355 \n\n(18) \n\n(19) \n\n(20) \n\n(21) \n\n(22) \n\n4 MEAN FIELD APPROXIMATION \n\n4.1 MEAN FIELD EQUATION \n\nTruncating the Plefka expansion (14) up to n-th order term gives n-th order approximations, \n~n (P) and Gn(p) == ~n(P) - L:i hi(q)mi . The Weiss free energy, which is used in the naive \nmean field approximation, is given by ~l (p). The TAP approach picks up all relevant terms \nof the Plefka expansion[ 12], and for the SK model it gives the second-order approximation \n~2(P) . \nThe stationary condition 8i Gn (p) = 0 gives the so-called mean field equation, from which \na solution of the approximate minimization problem is to be determined. For n = 1 it takes \nthe following familiar form , \n\ntanh - 1 mi - hi - 2: wijmj = 0 \n\nand for n = 2 it includes the so-called On sager reaction term. \n\ntanh- 1 m i - hi - 2: wijmj + 2:(wij )2(1 - m;)mi = 0 \n\n#i \n\n# i \n\n# i \n\n(23) \n\n(24) \n\nNote that all of these are expressed as functions of ffii. \nGeometrically, the mean field equation approximately represents the \"surface\" hf(p) \nhi(q) in terms of the mixed coordinate system of M , since for the exact Gibbs free energy \nG, the stationary condition QiG(p) = 0 gives hi(p) - hi(q) = O. Accordingly, the ap-\nproximate relation hi(p) = 8i~n(P), for fixed m, represents the n-th order approximate \nexpression of the leaf A(m) in the canonical coordinate system . 
The fit of this expression to the true leaf A(m) around the point w = 0 becomes better as the order of approximation gets higher, as seen in Fig. 1. Such behavior is well expected, since the Plefka expansion is essentially a Taylor expansion. \n\nFigure 1: Approximate expressions of A(m) by mean field approximations of several orders for a 2-unit Boltzmann machine, with (m_1, m_2) = (0.5, 0.5) (left), and their magnified view (right). \n\nFigure 2: Relation between the \"naive\" approximation and the present theory. \n\n4.2 LINEAR RESPONSE \n\nFor estimating η_I(p) one can utilize the linear response theorem. In the information-geometrical framework it is represented as a trivial identity relation for the Fisher information on the leaf F(w). The Fisher information matrix (g_ij), or the Riemannian metric tensor, on the leaf F(w), and its inverse (g^ij), are given by \n\ng_ij = ∂_i ∂_j ψ(p) = η_ij(p) - η_i(p) η_j(p) (25) \n\nand \n\ng^ij = ∂^i ∂^j φ(p), (26) \n\nrespectively. In the framework here, the linear response theorem states the trivial fact that each of these is the inverse of the other. In mean field approximation, one substitutes an approximation φ_n(p) in place of φ(p) in eq. (26) to get an approximate inverse (ĝ^ij) of the metric. The derivatives in eq. (26) can be analytically calculated, and therefore (ĝ^ij) can be numerically evaluated by substituting into it a solution m_i of the mean field equation. Equating its inverse to (g_ij) gives an estimate of η_ij(p) by using eq. (25). 
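The first-order version of this scheme (the one examined by Kappen and Rodríguez[7, 8]) can be sketched as follows. Assumptions beyond the text: m stands in for a mean field solution (here a hypothetical value), and the first-order approximation of eq. (26) is ∂^i ∂^j φ_1 = δ_ij/(1 - m_i²) - w_ij, obtained from eq. (13) plus the first-order Plefka term.

```python
def mat_inv(a):
    # Gauss-Jordan inverse; adequate for the small dense matrices used here.
    n = len(a)
    aug = [row[:] + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def linear_response_eta(w, m):
    """Estimate eta_ij from a mean field solution m: build the first-order
    approximation of (g^ij), invert it to get the covariance (g_ij), and
    recover eta_ij via eq. (25)."""
    n = len(m)
    g_inv = [[(1.0 / (1.0 - m[i] ** 2) if i == j else 0.0) - w[i][j]
              for j in range(n)] for i in range(n)]
    g = mat_inv(g_inv)          # approximates g_ij = eta_ij - eta_i eta_j
    return {(i, j): g[i][j] + m[i] * m[j]
            for i in range(n) for j in range(i + 1, n)}

# Hypothetical couplings and a stand-in mean field solution.
w = [[0.0, 0.2], [0.2, 0.0]]
m = [0.35, -0.15]
eta_est = linear_response_eta(w, m)
```

Note that for w = 0 the estimate reduces to η_ij = m_i m_j, consistent with the factorizable submodel.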
So far, Problem 1 has been solved within the framework of mean field approximation, with η_i and η_ij obtained by the mean field equation and the linear response theorem, respectively. \n\n5 DISCUSSION \n\nFollowing the framework presented so far, one can in principle construct mean field approximation algorithms of desired orders. The first-order algorithm with linear response was first proposed and examined by Kappen and Rodríguez[7, 8]. Tanaka[13] has formulated second- and third-order algorithms and explored them by computer simulations. It is also possible to extend the present formulation so that it is applicable to higher-order Boltzmann machines. Tanaka[14] discusses an extension of the present formulation to third-order Boltzmann machines: it is possible to extend the linear response theorem to higher orders, and this allows us to treat higher-order correlations within the framework of mean field approximation. \n\nThe common understanding of the \"naive\" mean field approximation is that it minimizes the Kullback divergence D(p_0‖q) with respect to p_0 ∈ F_0 for a given q. It can be shown that this view is consistent with the theory presented in this paper. Assume that q ∈ F(w) and p_0 ∈ A(m), and let p be the distribution corresponding to the intersecting point of the leaves F(w) and A(m). Because of the orthogonality of the two foliations F and A, the following \"Pythagorean law\"[9] holds (Fig. 2): \n\nD(p_0‖q) = D(p_0‖p) + D(p‖q). (27) \n\nIntuitively, D(p_0‖p) measures the squared distance between F(w) and F_0, and is a second-order quantity in w. It should be ignored in the first-order approximation, and thus D(p_0‖q) ≈ D(p‖q) holds. 
Under this approximation, minimization of the former with respect to p_0 is equivalent to that of the latter with respect to p, which establishes the relation between the \"naive\" approximation and the present theory. It can also be checked directly that the first-order approximation of D(p‖q) exactly gives D(p_0‖q), the Weiss free energy. \n\nThe present theory provides an alternative view of the validity of mean field approximation: as opposed to the common \"belief\" that mean field approximation is good when N is sufficiently large, one can state from the present formulation that it is good whenever the higher-order contributions of the Plefka expansion vanish, regardless of whether N is large or not. This provides a theoretical basis for the observation that mean field approximation often works well for small networks. \n\nThe author would like to thank the Telecommunications Advancement Foundation for financial support. \n\nReferences \n\n[1] Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. (1985) A learning algorithm for Boltzmann machines. Cognitive Science 9: 147-169. \n\n[2] Peterson, C., and Anderson, J. R. (1987) A mean field theory learning algorithm for neural networks. Complex Systems 1: 995-1019. \n\n[3] Thouless, D. J., Anderson, P. W., and Palmer, R. G. (1977) Solution of 'Solvable model of a spin glass'. Phil. Mag. 35 (3): 593-601. \n\n[4] Parisi, G. (1988) Statistical Field Theory. Addison-Wesley. \n\n[5] Galland, C. C. (1993) The limitations of deterministic Boltzmann machine learning. Network 4 (3): 355-379. \n\n[6] Hofmann, T. and Buhmann, J. M. (1997) Pairwise data clustering by deterministic annealing. IEEE Trans. Patt. Anal. & Machine Intell. 19 (1): 1-14; Errata, ibid. 19 (2): 197 (1997). \n\n[7] Kappen, H. J. and Rodríguez, F. B. (1998) Efficient learning in Boltzmann machines using linear response theory. Neural Computation 10 (5): 1137-1156. 
\n\n[8] Kappen, H. J. and Rodríguez, F. B. (1998) Boltzmann machine learning using mean field theory and linear response correction. In M. I. Jordan, M. J. Kearns, and S. A. Solla (Eds.), Advances in Neural Information Processing Systems 10, pp. 280-286. The MIT Press. \n\n[9] Amari, S.-I. (1985) Differential-Geometrical Methods in Statistics. Lecture Notes in Statistics 28, Springer-Verlag. \n\n[10] Amari, S.-I., Kurata, K., and Nagaoka, H. (1992) Information geometry of Boltzmann machines. IEEE Trans. Neural Networks 3 (2): 260-271. \n\n[11] Tanaka, T. Information geometry of mean field approximation. Preprint. \n\n[12] Plefka, T. (1982) Convergence condition of the TAP equation for the infinite-ranged Ising spin glass model. J. Phys. A: Math. Gen. 15 (6): 1971-1978. \n\n[13] Tanaka, T. (1998) Mean field theory of Boltzmann machine learning. Phys. Rev. E 58 (2): 2302-2310. \n\n[14] Tanaka, T. (1998) Estimation of third-order correlations within mean field approximation. In S. Usui and T. Omori (Eds.), Proc. Fifth International Conference on Neural Information Processing, vol. 1, pp. 554-557.", "award": [], "sourceid": 1604, "authors": [{"given_name": "Toshiyuki", "family_name": "Tanaka", "institution": null}]}