{"title": "High-temperature Expansions for Learning Models of Nonnegative Data", "book": "Advances in Neural Information Processing Systems", "page_first": 465, "page_last": 471, "abstract": null, "full_text": "High-temperature expansions for learning models of nonnegative data

Oliver B. Downs
Dept. of Mathematics
Princeton University
Princeton, NJ 08544
obdowns@princeton.edu

Abstract

Recent work has exploited boundedness of data in the unsupervised learning of new types of generative model. For nonnegative data it was recently shown that, when the model is constrained to match the first- and second-order statistics of the data, the maximum-entropy generative model is a Nonnegative Boltzmann Distribution, not a Gaussian distribution. Learning for practical-sized problems is made difficult by the need to compute expectations under the model distribution. The computational cost of Markov chain Monte Carlo methods and the low fidelity of naive mean-field techniques have led to increasing interest in advanced mean-field theories and variational methods. Here I present a second-order mean-field approximation for the Nonnegative Boltzmann Machine model, obtained using a \"high-temperature\" expansion. The theory is tested on learning a bimodal 2-dimensional model, a high-dimensional translationally invariant distribution, and a generative model for handwritten digits.

1 Introduction

Unsupervised learning of generative and feature-extracting models for continuous nonnegative data has recently been proposed [1], [2]. In [1], it was pointed out that the maximum-entropy distribution (matching 1st- and 2nd-order statistics) for continuous nonnegative data is not Gaussian, and indeed that a Gaussian is not in general a good approximation to that distribution. 
The true maximum-entropy distribution is known as the Nonnegative Boltzmann Distribution (NNBD) (previously the rectified Gaussian distribution [3]), which has the functional form

p(x) = { (1/Z) exp[-E(x)]   if x_i ≥ 0 for all i,
       { 0                  if any x_i < 0,    (1)

where the energy function E(x) and normalisation constant Z are:

E(x) = β x^T A x - b^T x,    (2)

Z = ∫_{x≥0} dx exp[-E(x)].    (3)

In contrast to the Gaussian distribution, the NNBD can be multimodal, in which case its modes are confined to the boundaries of the nonnegative orthant.

The Nonnegative Boltzmann Machine (NNBM) has been proposed as a method for learning the maximum-likelihood parameters of this maximum-entropy model from data. Without hidden units, it has the stochastic-EM learning rule:

ΔA_ij ∝ ⟨x_i x_j⟩_f - ⟨x_i x_j⟩_c,    (4)
Δb_i ∝ ⟨x_i⟩_c - ⟨x_i⟩_f,    (5)

where the subscript \"c\" denotes a \"clamped\" average over the data, and the subscript \"f\" denotes a \"free\" average over the NNBD:

⟨f(x)⟩_c = (1/M) Σ_{μ=1}^{M} f(x^(μ)),    (6)

⟨f(x)⟩_f = ∫_{x≥0} dx p(x) f(x).    (7)

This learning rule has hitherto been extremely computationally costly to implement, since naive variational/mean-field approximations for ⟨x x^T⟩_f are found empirically to be poor, leading to the need to use Markov chain Monte Carlo methods. This has made the NNBM impractical for application to high-dimensional data.

While the NNBD is generally skewed and hence has moments of order greater than 2, the maximum-likelihood learning rule suggests that the distribution can be described solely in terms of the 1st- and 2nd-order statistics of the data. With that in mind, I have pursued advanced approximate models for the NNBM.

In the following section I derive a second-order approximation for ⟨x_i x_j⟩_f, analogous to the TAP-Onsager correction for the mean-field Ising model, using a high-temperature expansion [4]. 
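As a concrete illustration (a hedged sketch, not code from the paper), the energy of Eq. 2 and the clamped averages of Eq. 6 that enter the learning rule of Eqs. 4-5 can be written as:

```python
import numpy as np

def nnbm_energy(x, A, b, beta=1.0):
    """Energy of the NNBM, E(x) = beta * x^T A x - b^T x (Eq. 2).
    The density is proportional to exp(-E(x)) on the nonnegative orthant;
    returning +inf encodes p(x) = 0 outside it (Eq. 1)."""
    x = np.asarray(x, dtype=float)
    if np.any(x < 0):
        return np.inf
    return beta * x @ A @ x - b @ x

def clamped_moments(X):
    """Clamped averages over the M data vectors in the rows of X (Eq. 6):
    the mean <x_i>_c and the second moments <x_i x_j>_c."""
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)
    C = X.T @ X / X.shape[0]
    return m, C

# Toy usage on illustrative nonnegative data: the learning rule moves A
# along the gap in Eq. 4 and b along the gap in Eq. 5.
rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=(200, 2))
m, C = clamped_moments(X)
```

The free averages ⟨·⟩_f are the expensive part; the sections below replace them with an analytic approximation.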
This produces an analytic approximation for the parameters A_ij, b_i in terms of the mean and cross-correlation matrix of the training data.

2 Learning approximate NNBM parameters using high-temperature expansion

Here I use a Taylor expansion, about the β = 0 limit, of a \"free energy\" directly related to the partition function Z of the distribution, to derive a second-order approximation for the NNBM model parameters. In this free energy we embody the constraint that Eq. 5 is satisfied:

G(β, m) = -ln ∫_{x≥0} dx exp[ -β x^T A x - Σ_i λ_i(β)(x_i - m_i) ],    (8)

where β is an \"inverse temperature\". There is a direct relationship between the \"free energy\" G and the normalisation Z of the NNBD, Eq. 3:

-ln Z = G(β, m) + Constant(b, m).    (9)

Thus,

β⟨x_i x_j⟩_f = ∂G/∂A_ij.    (10)

The Lagrange multipliers λ_i embody the constraint that ⟨x_i⟩_f match the mean field of the patterns, m_i = ⟨x_i⟩_c. This effectively forces Δb = 0 in Eq. 5, with b_i = -λ_i(β). Since the Lagrange constraint is enforced for all temperatures, we can solve for the specific case β = 0:

m_i = ⟨x_i⟩_f |β=0 = [ Π_k ∫_0^∞ dx_k x_i exp(-Σ_l λ_l(0)(x_l - m_l)) ] / [ Π_k ∫_0^∞ dx_k exp(-Σ_l λ_l(0)(x_l - m_l)) ] = 1/λ_i(0).    (11)

Note that this embodies the unboundedness of x_k in the nonnegative orthant, as compared to the equivalent term of Georges & Yedidia for the Ising model, m_i = tanh(λ_i(0)).

We consider the Taylor expansion of Eq. 8 about the \"high-temperature\" limit, β = 0:

G(β, m) = G(0, m) + β ∂G/∂β |β=0 + (β²/2) ∂²G/∂β² |β=0 + ...    (12)

Since the integrand becomes factorable in the x_i in this limit, the infinite-temperature values of G and its derivatives are analytically calculable:

G(β, m)|β=0 = -Σ_k ln ∫_0^∞ dx_k exp( -λ_k(0)(x_k - m_k) ),

using Eq. 
11,

G(β, m)|β=0 = -Σ_k ln[ (1/λ_k(0)) exp(λ_k(0) m_k) ]    (13)
            = -N - Σ_k ln m_k.    (14)

The first derivative is then as follows:

∂G/∂β |β=0 = ⟨ Σ_{i,j} A_ij x_i x_j + Σ_i (x_i - m_i) ∂λ_i/∂β ⟩ |β=0    (15)
           = Σ_{i,j} (1 + δ_ij) A_ij m_i m_j.    (16)

This term is exactly the result of applying naive mean-field theory to this system, as in [1]. Likewise we obtain the second derivative:

∂²G/∂β² |β=0 = -⟨ (Σ_{i,j} A_ij x_i x_j)² ⟩|β=0 + ( Σ_{i,j} (1 + δ_ij) A_ij m_i m_j )² + 2⟨ Σ_{i,j} A_ij x_i x_j Σ_k ∂λ_k/∂β (x_k - m_k) ⟩|β=0    (17)
             = -Σ_{i,j} Σ_{k,l} Q_ijkl A_ij A_kl m_i m_j m_k m_l,    (18)

where Q_ijkl contains the integer coefficients arising from integration by parts in the first and second terms, and from the factor (1 + δ_ij) in the second term of Eq. 17.

This expansion is to the same order as the TAP-Onsager correction term for the Ising model, which can be derived by an analogous approach to the equivalent free energy [4]. Substituting these results into Eq. 10, we obtain

β⟨x_i x_j⟩_f ≈ β(1 + δ_ij) m_i m_j - (β²/2) Σ_{k,l} Q_ijkl A_kl m_i m_j m_k m_l.    (19)

We arrive at an analytic approximation for A_ij as a function of the 1st and 2nd moments of the data by using Eq. 19 in the learning rule, Eq. 4, setting ΔA_ij = 0 and solving the resulting linear equation for A.

We can obtain an equivalent expansion for λ_i(β), and hence for b_i. To first order in β (equivalent to the order of β in the approximation for A), we have

λ_i(β) ≈ λ_i(0) + β ∂λ_i/∂β |β=0 + ...    (20)

Using Eqs. 11 & 15,

∂λ_i/∂β |β=0 = -Σ_j (1 + δ_ij) A_ij m_j.    (21)

Hence

b_i = -λ_i(β) ≈ -1/m_i + β Σ_j (1 + δ_ij) A_ij m_j.    (22)

The approach presented here makes an explicit approximation of the statistics ⟨x x^T⟩_f required for the NNBM learning rule, which can be substituted into the fixed-point equation Eq. 4 and yields a linear equation in A to be solved. 
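The β = 0 factorisation that underlies this derivation can be checked numerically: at infinite temperature each coordinate is exponentially distributed with rate λ_i(0) = 1/m_i (Eq. 11), which gives second moments ⟨x_i x_j⟩ = (1 + δ_ij) m_i m_j, the factor appearing in Eq. 16 and in Eq. 19. A minimal Monte Carlo sketch (the mean-field values below are illustrative):

```python
import numpy as np

# At beta = 0 the NNBD factorises into independent exponentials with
# mean m_i, i.e. rate lambda_i(0) = 1/m_i (Eq. 11). Their second moments
# are <x_i x_j> = (1 + delta_ij) m_i m_j: m_i m_j off-diagonal, and
# 2 m_i^2 on the diagonal.
rng = np.random.default_rng(1)
m = np.array([0.5, 2.0, 1.5])                       # illustrative mean fields
samples = rng.exponential(scale=m, size=(200_000, 3))

second_moment = samples.T @ samples / samples.shape[0]
predicted = (1.0 + np.eye(3)) * np.outer(m, m)
max_rel_err = np.max(np.abs(second_moment - predicted) / predicted)
```

The on-diagonal factor of 2 (rather than a Gaussian's variance term) is what distinguishes this expansion from its Ising-model counterpart.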
This is in contrast to the linear response theory approach of Kappen & Rodriguez [6] to the Boltzmann Machine, which exploits the relationship

∂²ln Z / ∂b_i ∂b_j = ⟨x_i x_j⟩ - ⟨x_i⟩⟨x_j⟩ = χ_ij    (25)

between the free energy and the covariance matrix χ of the model. In the learning problem, this produces a quadratic equation in A, the solution of which is non-trivial. Computationally efficient solutions of the linear response theory are then obtained by secondary approximation of the 2nd-order term, compromising the fidelity of the model.

3 Learning a 'Competitive' Nonnegative Boltzmann Distribution

A visualisable test problem is that of learning a bimodal NNBD in 2 dimensions. Monte Carlo slice sampling (see [1] & [5]) was used to generate 200 samples from a NNBD, as shown in Fig. 1(a). The high-temperature expansion was then used to learn approximate parameters for the NNBM model of this data. A surface plot of the resulting model distribution is shown in Fig. 1(b); it is clearly a valid candidate generative distribution for the data. This is in strong contrast with a naive mean-field (β = 0) model, which by construction would be unable to produce a multiple-peaked approximation, as previously described [1].

Figure 1: (a) Training data, generated from a 2-dimensional 'competitive' NNBD; (b) learned model distribution, under the high-temperature expansion.

4 Orientation Tuning in Visual Cortex - a translationally invariant model

The neural network model of Ben-Yishai et al. [7] for orientation tuning in visual cortex has the property that its dynamics exhibit a continuum of stable states which are translationally invariant across the network. 
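The slice-sampling step used to generate training data here and in the next section can be sketched as a generic coordinate-wise sampler in the spirit of Neal [5]; this is a hedged illustration, not the paper's code, and the matrix A, vector b and parameter values below are illustrative:

```python
import numpy as np

def slice_sample_1d(logp, x0, w=1.0, rng=None, max_steps=50):
    """One update of a 1-D slice sampler (stepping-out and shrinkage)."""
    rng = rng or np.random.default_rng()
    logy = logp(x0) - rng.exponential()        # auxiliary slice height
    left = x0 - w * rng.uniform()              # randomly placed bracket
    right = left + w
    for _ in range(max_steps):                 # step out until outside the slice
        if logp(left) < logy:
            break
        left -= w
    for _ in range(max_steps):
        if logp(right) < logy:
            break
        right += w
    while True:                                # shrink until a point is accepted
        x1 = rng.uniform(left, right)
        if logp(x1) >= logy:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1

def nnbd_samples(A, b, beta=1.0, n=500, rng=None):
    """Coordinate-wise slice sampling from the NNBD, whose density is
    proportional to exp(-beta x^T A x + b^T x) on x >= 0."""
    rng = rng or np.random.default_rng(2)
    d = len(b)
    x = np.ones(d)
    out = np.empty((n, d))
    for t in range(n):
        for i in range(d):
            def logp(xi, i=i):
                if xi < 0:                     # zero density off the orthant
                    return -np.inf
                y = x.copy()
                y[i] = xi
                return -(beta * y @ A @ y - b @ y)
            x[i] = slice_sample_1d(logp, x[i], rng=rng)
        out[t] = x
    return out

# Illustrative 'competitive' NNBD in 2-D: a strong positive off-diagonal
# coupling pushes probability mass toward the two axes.
A = np.array([[1.0, 3.0], [3.0, 1.0]])
b = np.array([3.0, 3.0])
S = nnbd_samples(A, b, n=200)
```

Every proposed point is screened through logp, so all returned samples lie in the nonnegative orthant by construction.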
The energy function of the network model is a translationally invariant function of the angles of maximal response, θ_i, of the N neurons, and can be mapped directly onto the energy of the NNBM, as described in [1]:

A_ij = γ( δ_ij + 1/N - (ε/N) cos( (2π/N)|i - j| ) ),  b_i = γ.    (26)

We can generate training data for the NNBM by sampling from the neural network model with known parameters. It is easily shown that A_ij has 2 equal negative eigenvalues, the remainder being positive and equal in value. The corresponding pair of eigenvectors of A are sinusoids of period equal to the width of the stable activation bumps of the network, with a small relative phase.

Here, the NNBM parameters have been solved using the high-temperature expansion for training data generated by Monte Carlo slice sampling [5] from a 10-neuron model with parameters ε = 4, γ = 100 in Eq. 26. Fig. 2 illustrates modal activity patterns of the learned NNBM model distribution, found using gradient ascent of the log-likelihood function from a random initialisation of the variables:

Δx ∝ [-Ax + b]^+,    (27)

where the superscript + denotes rectification.

These modes of the approximate NNBM model are highly similar to the training patterns, and the eigenvectors and eigenvalues of A exhibit similar properties between their learned and training forms. This gives evidence that the approximation is successful in learning a high-dimensional translationally invariant NNBM model. 
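The eigenvalue structure of this weight matrix is easy to verify numerically. The sketch below assumes the reconstruction A_ij = γ(δ_ij + 1/N - (ε/N) cos(2π|i-j|/N)), b_i = γ with ε = 4, γ = 100, N = 10 as read from the text:

```python
import numpy as np

# Translationally invariant NNBM weights (assumed reconstruction of Eq. 26):
# A_ij = gamma * (delta_ij + 1/N - (eps/N) * cos(2*pi*|i-j|/N)), b_i = gamma.
N, eps, gamma = 10, 4.0, 100.0
idx = np.arange(N)
diff = idx[:, None] - idx[None, :]
A = gamma * (np.eye(N) + 1.0 / N
             - (eps / N) * np.cos(2.0 * np.pi * np.abs(diff) / N))
b = gamma * np.ones(N)

# A is circulant, so its eigenvectors are Fourier modes; the two k = +/-1
# sinusoids share the eigenvalue gamma * (1 - eps/2), which is negative
# for eps > 2.
evals = np.sort(np.linalg.eigvalsh(A))
```

With these parameter values the two lowest eigenvalues both equal γ(1 - ε/2) = -100, matching the claim that A has exactly two equal negative eigenvalues with sinusoidal eigenvectors.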
\n\n5 Generative Model for Handwritten Digits \n\nIn figure 3, I show the results of applying the high-temperature NNBM to learning a gen(cid:173)\nerative model for the feature coactivations of the Nonnegative Matrix Factorization [2] \n\n\f6 \n\nQ) \nrn4 \n0: \nOJ \nc:: \n.;:: \nU:2 \n\n0 \n\n(a) \n\n0.4 \n\nQ) 0.2 \nrn \n0: \nOJ \nc:: \n\n0 \n.;:: u: -0.2 \n\n6 \n~ \n0: 4 \nOJ \nc:: \n.;:: \nu: \n\n2 \nO~ \n\n1 2 3 4 5 6 7 8 9 10 \n\nNeuron Number \n\n1 2 3 4 5 6 7 8 9 10 \n\nNeuron Number \n\n(b) \n\n~ \n0: \n\n10 \n\n2 \n\n4 \n\n6 \n\n8 \n\n10 \n\nNeuron Number \n\n2 \n\n4 \n\n6 \n\n8 \n\nNeuron Number \n\nFigure 2: Upper: 2 modal states of the NNBM model density, located by gradient-ascent \nof the log-likelihood from different random initialisations, Lower: The two negative(cid:173)\neigenvalue eigenvectors of A - a) in the learned model, and b) as used to generate the \ntraining data. \n\ndecomposition of a database of the handwritten digits, 0-9. This problem contains none of \nthe space-filling symmetry of the visual cortex model, and hence requires a more strongly \nmultimodal generative model distribution to generate distinct digits. Here performance is \npoor, although superior to uniformly-sampled feature activitations. \n\n6 Discussion \n\nIn this work, an approximate technique has been derived for directly determining the \nNNBM parameters A, b in terms of the Ist- and 2nd-order statistics of the data, using \nthe method of high-temperature expansion. To second order this produces corrections to \nthe naive mean field approximation of the system analogous to the TAP term for the Ising \nModel/Boltzmann Machine. The efficacy of this approximation has been demonstrated \nin the pathological case of learning the 'competitive' NNBD, learning the translationally \ninvariant model in 10 dimensions, and a generative model for handwritten digits. 
\n\nThese results demonstrate an improvement in approximation to models in this class over \na naive mean field ((3 = 0) approach, without reversion to secondary assumptions such as \nthose made in the linear response theory for the Boltzmann Machine. \nThere is strong current interest in the relationship between TAP-like mean field theory, \nvariational approximation and belief-propagation in graphical models with loops. All of \nthese can be interpreted in terms of minimising an effective free energy of the system [8]. \nThe distinction in the work presented here lies in choosing optimal approximate statistics \nto learn the true model, under the assumption that satisfaction of the fixed-point equations \nof the true model optimises the free energy. This compares favourably with variational \n\n\fa) \n\nb) \n\nFigure 3: Digit images generated with feature activations sampled from a) a uniform dis(cid:173)\ntribution, and b) a high-temperature NNBM model for the digits. \n\napproaches which directly optimise an approximate model distribution. \n\nMethods of this type fail when they add spurious fixed points to the learning dynamics. \nFuture work will focus on understanding the origins of such fixed points, and the regimes \nin which they lead to a poor approximation of the model parameters. \n\n7 Acknowledgements \n\nThis work was inspired by the NIPS 1999 Workshop on Advanced Mean Field Methods. \nThe author is especially grateful to David MacKay and Gayle Wittenberg for comments on \nearly versions of this manuscript. I also acknowledge guidance from John Hopfield and \nDavid Heckerman, detailed discussion with Bert Kappen, Daniel Lee and David Barber \nand encouragement from Kim Midwood. \n\nReferences \n\n[1] Downs, DB, MacKay, DJC, & Lee, DD (2000). The Nonnegative Boltzmann Machine. Ad(cid:173)\n\nvances in Neural Information Processing Systems 12, 428-434. 
\n\n[2] Lee, DD, and Seung, HS (1999) Learning the parts of objects by non-negative matrix factor(cid:173)\n\nization. Nature 401,788-791. \n\n[3] Socci, ND, Lee, DD, and Seung, HS (1998). The rectified Gaussian distribution. Advances in \n\nNeural Information Processing Systems 10, 350-356. \n\n[4] Georges, A, & Yedidia, JS (1991). How to expand around mean-field theory using high(cid:173)\n\ntemperature expansions. Journal of Physics A 24, 2173- 2192. \n\n[5] Neal, RM (1997). Markov chain Monte Carlo methods based on 'slicing' the density function. \n\nTechnical Report 9722, Dept. of Statistics, University of Toronto. \n\n[6] Kappen, HJ & Rodriguez, FB (1998). Efficient learning in Boltzmann Machines using linear \n\nresponse theory. Neural Computation 10, 1137-1156. \n\n[7] Ben-Yishai, R, Bar-Or, RL, & Sompolinsky, H (1995). Theory of orientation tuning in visual \n\ncortex. Proc. Nat. Acad. Sci. USA,92(9):3844-3848. \n\n[8] Yedidia, JS , Freeman, WT, & Weiss, Y (2000). Generalized Belief Propagation. Mitsubishi \n\nElectric Research Laboratory Technical Report, TR-2000-26. \n\n\f", "award": [], "sourceid": 1929, "authors": [{"given_name": "Oliver", "family_name": "Downs", "institution": null}]}