{"title": "The Nonnegative Boltzmann Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 428, "page_last": 434, "abstract": null, "full_text": "The Nonnegative Boltzmann Machine\n\nOliver B. Downs\nHopfield Group\nSchultz Building\nPrinceton University\nPrinceton, NJ 08544\nobdowns@princeton.edu\n\nDavid J.C. MacKay\nCavendish Laboratory\nMadingley Road\nCambridge, CB3 0HE\nUnited Kingdom\nmackay@mrao.cam.ac.uk\n\nDaniel D. Lee\nBell Laboratories\nLucent Technologies\n700 Mountain Ave.\nMurray Hill, NJ 07974\nddlee@bell-labs.com\n\nAbstract\n\nThe nonnegative Boltzmann machine (NNBM) is a recurrent neural network model that can describe multimodal nonnegative data. Applying maximum likelihood estimation to this model gives a learning rule analogous to that of the binary Boltzmann machine. We examine the utility of the mean field approximation for the NNBM, and describe how Monte Carlo sampling techniques can be used to learn its parameters. Reflective slice sampling is particularly well suited to this distribution and can be implemented efficiently. We illustrate learning of the NNBM on a translationally invariant distribution, as well as on a generative model for images of human faces.\n\nIntroduction\n\nThe multivariate Gaussian is the most elementary distribution used to model generic data. It is the maximum entropy distribution under the constraint that the mean and covariance matrix of the distribution match those of the data. For binary data, the maximum entropy distribution that matches the first and second order statistics of the data is given by the Boltzmann machine [1]. The probability of a particular state in the Boltzmann machine is given by the exponential form:\n\n$$P(\\{s_i = \\pm 1\\}) = \\frac{1}{Z} \\exp\\Big( -\\frac{1}{2} \\sum_{ij} s_i A_{ij} s_j + \\sum_i b_i s_i \\Big). \\qquad (1)$$
\n\nInterpreting Eq. 1 as a neural network, the parameters $A_{ij}$ represent symmetric, recurrent weights between the different units in the network, and the $b_i$ represent local biases. Unfortunately, these parameters are not simply related to the observed mean and covariance of the data as they are for the normal Gaussian. Instead, they need to be adapted using an iterative learning rule that involves difficult sampling from the binary distribution [2].\n\nFigure 1: a) Probability density and b) shaded contour plot of a two-dimensional competitive NNBM distribution. The energy function $E(x)$ for this distribution contains a saddle point and two local minima, which generate the observed multimodal distribution.\n\nThe Boltzmann machine can also be generalized to continuous and nonnegative variables. In this case, the maximum entropy distribution for nonnegative data with known first and second order statistics is described by a distribution previously called the \"rectified Gaussian\" distribution [3]:\n\n$$P(x) = \\begin{cases} \\frac{1}{Z} \\exp[-E(x)] & \\text{if } x_i \\ge 0 \\; \\forall i, \\\\ 0 & \\text{if any } x_i < 0, \\end{cases} \\qquad (2)$$\n\nwhere the energy function $E(x)$ and normalization constant $Z$ are:\n\n$$E(x) = \\frac{1}{2} x^T A x - b^T x, \\qquad (3)$$\n\n$$Z = \\int_{x \\ge 0} dx \\, \\exp[-E(x)]. \\qquad (4)$$\n\nThe properties of this nonnegative Boltzmann machine (NNBM) distribution differ quite substantially from those of the normal Gaussian. In particular, the nonnegativity constraints allow the distribution to have multiple modes. For example, Fig. 1 shows a two-dimensional NNBM distribution with two separate maxima located against the rectifying axes. Such a multimodal distribution would be poorly modelled by a single normal Gaussian. 
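As a concrete illustration of Eqs. 2-4, the following minimal Python sketch evaluates the energy E(x) and the unnormalized density exp[-E(x)] for a two-dimensional example. The parameter values A and b and the function names are our own assumptions, chosen to give a competitive, bimodal case; they are not the values used to produce Fig. 1.

```python
import math

def energy(x, A, b):
    # E(x) = 1/2 x^T A x - b^T x, as in Eq. 3
    n = len(x)
    quad = sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))
    lin = sum(b[i] * x[i] for i in range(n))
    return 0.5 * quad - lin

def unnormalized_density(x, A, b):
    # exp[-E(x)] on the nonnegative orthant and zero elsewhere
    # (Eq. 2 without the 1/Z factor, which has no closed form in general)
    if any(xi < 0 for xi in x):
        return 0.0
    return math.exp(-energy(x, A, b))

# Hypothetical competitive parameters (an assumption, not the values behind Fig. 1)
A = [[1.0, 2.0], [2.0, 1.0]]
b = [1.0, 1.0]

on_axis = [1.0, 0.0]          # constrained minimum lying against the x_1 axis
saddle = [1.0 / 3, 1.0 / 3]   # solves A x = b; a saddle of E because A is indefinite

print(energy(on_axis, A, b), energy(saddle, A, b))
print(unnormalized_density(on_axis, A, b) > unnormalized_density(saddle, A, b))
```

With these assumed values, the two on-axis points (1, 0) and (0, 1) have lower energy than the interior stationary point, which is the qualitative picture described for Fig. 1.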
\n\nIn this submission, we discuss how a multimodal NNBM distribution can be learned from nonnegative data. We show the limitations of mean field approximations for this distribution, and illustrate how recent developments in efficient sampling techniques for continuous belief networks can be used to tune the weights of the network [4]. Specific examples of learning are demonstrated on a translationally invariant distribution, as well as on a generative model for face images.\n\nMaximum Likelihood\n\nThe learning rule for the NNBM can be derived by maximizing the log likelihood of the observed data under Eq. 2. Given a set of nonnegative vectors $\\{x^\\mu\\}$, where $\\mu = 1, \\ldots, M$ indexes the different examples, the log likelihood is:\n\n$$L = \\frac{1}{M} \\sum_{\\mu=1}^{M} \\log P(x^\\mu) = -\\frac{1}{M} \\sum_{\\mu=1}^{M} E(x^\\mu) - \\log Z. \\qquad (5)$$\n\nTaking the derivatives of Eq. 5 with respect to the parameters $A$ and $b$ gives:\n\n$$\\frac{\\partial L}{\\partial A_{ij}} = \\frac{1}{2} \\big[ \\langle x_i x_j \\rangle_f - \\langle x_i x_j \\rangle_c \\big], \\qquad (6)$$\n\n$$\\frac{\\partial L}{\\partial b_i} = \\langle x_i \\rangle_c - \\langle x_i \\rangle_f, \\qquad (7)$$\n\nwhere the subscript \"c\" denotes a \"clamped\" average over the data, and the subscript \"f\" denotes a \"free\" average over the NNBM distribution:\n\n$$\\langle f(x) \\rangle_c = \\frac{1}{M} \\sum_{\\mu=1}^{M} f(x^\\mu), \\qquad (8)$$\n\n$$\\langle f(x) \\rangle_f = \\int_{x \\ge 0} dx \\, P(x) f(x). \\qquad (9)$$\n\nThese derivatives are used to define a gradient ascent learning rule for the NNBM that is similar to that of the binary Boltzmann machine. The contrast between the clamped and free covariance matrices is used to update the interactions $A$, while the difference between the clamped and free means is used to update the local biases $b$.\n\nMean field approximation\n\nThe major difficulty with this learning algorithm lies in evaluating the averages $\\langle x_i x_j \\rangle_f$ and $\\langle x_i \\rangle_f$. Because it is analytically intractable to calculate these free averages exactly, approximations are necessary for learning. 
Mean field approximations have previously been proposed as a deterministic alternative for learning in the binary Boltzmann machine, although there have been contrasting views on their validity [5,6]. Here, we investigate the utility of mean field theory for approximating the NNBM distribution.\n\nThe mean field equations are derived by approximating the NNBM distribution in Eq. 2 with the factorized form:\n\n$$Q(x) = \\prod_i Q_{\\tau_i}(x_i) = \\prod_i \\frac{1}{\\gamma!} \\frac{1}{\\tau_i} \\Big( \\frac{x_i}{\\tau_i} \\Big)^{\\gamma} e^{-x_i/\\tau_i}, \\qquad (10)$$\n\nwhere the different marginal densities $Q_{\\tau_i}(x_i)$ are characterized by the parameters $\\tau_i$ with a fixed constant $\\gamma$. The product of $\\Gamma$-distributions is the natural factorizable distribution for nonnegative random variables.\n\nThe optimal mean field parameters $\\tau_i$ are determined by minimizing the Kullback-Leibler divergence between the NNBM distribution and the factorized distribution:\n\n$$D_{KL}(Q \\| P) = \\int dx \\, Q(x) \\log \\Big[ \\frac{Q(x)}{P(x)} \\Big] = \\langle E(x) \\rangle_{Q(x)} + \\log Z - H(Q). \\qquad (11)$$\n\nFinding the minimum of Eq. 11 by setting its derivatives with respect to the mean field parameters $\\tau_i$ to zero gives the simple mean field equations:\n\n$$A_{ii} \\tau_i = b_i - (\\gamma + 1) \\sum_j A_{ij} \\tau_j + \\frac{1}{(\\gamma + 1) \\tau_i}. \\qquad (12)$$\n\nFigure 2: a) Slice sampling in one dimension. Given the current sample point $x_i$, a height $y \\in [0, \\alpha P(x_i)]$ is randomly chosen. This defines a slice $\\{x \\in S \\mid \\alpha P(x) \\ge y\\}$ in which a new $x_{i+1}$ is chosen. b) For a multidimensional slice $S$, the new point $x_{i+1}$ is chosen using ballistic dynamics with specular reflections off the interior boundaries of the slice.\n\nThese equations can then be solved self-consistently for $\\tau_i$. The \"free\" statistics of the NNBM are then replaced by their statistics under the factorized distribution $Q(x)$:\n\n$$\\langle x_i \\rangle_f \\approx (\\gamma + 1) \\tau_i, \\qquad \\langle x_i x_j \\rangle_f \\approx \\big[ (\\gamma + 1)^2 + (\\gamma + 1) \\delta_{ij} \\big] \\tau_i \\tau_j. \\qquad (13)$$
\n\nThe fidelity of this approximation is determined by how well the factorized distribution $Q(x)$ models the NNBM distribution. Unfortunately, for distributions such as the one shown in Fig. 3, the mean field approximation is quite different from the true multimodal NNBM distribution. This suggests that the naive mean field approximation is inadequate for learning in the NNBM, and in fact attempts to use this approximation fail to learn the examples given in the following sections. However, the mean field approximation can still be used to initialize the parameters to reasonable values before using the sampling techniques that are described below.\n\nMonte-Carlo sampling\n\nA more direct approach to calculating the \"free\" averages in Eqs. 6-7 is to approximate them numerically. This can be accomplished by using Monte Carlo sampling to generate a representative set of points that sufficiently approximate the statistics of the continuous distribution. In particular, Markov chain Monte Carlo methods employ iterative stochastic dynamics whose equilibrium distribution is the desired distribution [4]. For the binary Boltzmann machine, such sampling dynamics involve random \"spin flips\" which change the value of a single binary component. Unfortunately, these single-component dynamics are easily caught in local energy minima, and can converge very slowly for large systems. This makes sampling the binary distribution very difficult, and more specialized computational techniques such as simulated annealing and cluster updates have been developed to try to circumvent this problem.\n\nFor the NNBM, the use of continuous variables makes it possible to investigate different stochastic dynamics in order to sample the distribution more efficiently. 
We first experimented with Gibbs sampling with ordered overrelaxation [7], but found that the required inversion of the error function was too computationally expensive. Instead, the recently developed method of slice sampling [8] seems particularly well suited to the NNBM.\n\nThe basic idea of the slice sampling algorithm is shown in Fig. 2. Given a sample point $x_i$, a random $y \\in [0, \\alpha P(x_i)]$ is first chosen uniformly. Then a slice $S$ is defined as the connected set of points $\\{x \\in S \\mid \\alpha P(x) \\ge y\\}$, and the new point $x_{i+1} \\in S$ is chosen randomly from this slice. The distribution of $x_n$ for large $n$ can be shown to converge to the desired density $P(x)$. For the NNBM, finding the boundary points along a particular direction in a given slice is quite simple, since it only involves finding the roots of a quadratic equation. In order to efficiently choose a new point within a particular slice, reflective \"billiard ball\" dynamics are used. A random initial velocity is chosen, and the new point is evolved by travelling a certain distance from the current point while specularly reflecting from the boundaries of the slice. Intuitively, the reversibility of these reflections allows the dynamics to satisfy detailed balance.\n\nFigure 3: Contours of the two-dimensional competitive NNBM distribution overlaid by a) the $\\gamma = 1$ mean field approximation and b) 500 reflected slice samples.\n\nIn Fig. 3, the mean field approximation and reflective slice sampling are used to model the two-dimensional competitive NNBM distribution. 
The poor fit of the mean field approximation is apparent from the unimodality of the factorized density, while the sample points from the reflective slice sampling algorithm are more representative of the underlying NNBM distribution. For higher dimensional data, the mean field approximation becomes progressively worse. It is therefore necessary to implement the numerical slice sampling algorithm in order to accurately approximate the NNBM distribution.\n\nTranslationally invariant model\n\nBen-Yishai et al. have proposed a model for orientation tuning in primary visual cortex that can be interpreted as a cooperative NNBM distribution [9]. In the absence of visual input, the firing rates of $N$ cortical neurons are described as minimizing the energy function $E(x)$ with parameters:\n\n$$A_{ij} = \\delta_{ij} + \\frac{1}{N} - \\frac{\\epsilon}{N} \\cos\\Big( \\frac{2\\pi}{N} |i - j| \\Big), \\qquad b_i = 1. \\qquad (14)$$\n\nThis distribution was used to test the NNBM learning algorithm. First, a large set of $N = 25$ dimensional nonnegative training vectors was generated by sampling the distribution with $\\beta = 50$ and $\\epsilon = 4$. Using these samples as training data, the $A$ and $b$ parameters were learned from a unimodal initialization by evolving the training vectors using reflective slice sampling, and these evolved vectors were used to calculate the \"free\" averages in Eqs. 6-7. The $A$ and $b$ estimates were then updated, and this procedure was iterated until the evolved averages matched those of the training data. The learned $A$ and $b$ parameters were then found to almost exactly match the original form in Eq. 14. Some representative samples from the learned NNBM distribution are shown in Fig. 4.\n\nFigure 4: Representative samples taken from an NNBM after training to learn a translationally invariant cooperative distribution with $\\beta = 50$ and $\\epsilon = 4$. 
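The cooperative parameters of Eq. 14 can be tabulated directly. The short Python sketch below assumes the reading A_ij = delta_ij + 1/N - (epsilon/N) cos(2 pi |i - j| / N) with b_i = 1, and checks that the resulting matrix is unchanged when both indices are shifted cyclically, which is the translational invariance the section refers to. The function names are hypothetical, and the role of beta is not modelled here since it does not appear in the assumed form.

```python
import math

def cooperative_parameters(n, eps):
    # Assumed reading of Eq. 14:
    # A_ij = delta_ij + 1/n - (eps/n) * cos(2*pi*|i - j| / n),  b_i = 1
    A = [[(1.0 if i == j else 0.0) + 1.0 / n
          - (eps / n) * math.cos(2.0 * math.pi * abs(i - j) / n)
          for j in range(n)] for i in range(n)]
    b = [1.0] * n
    return A, b

def is_translation_invariant(A, tol=1e-12):
    # Shifting both indices by one unit (with wrap-around) leaves A unchanged
    n = len(A)
    return all(abs(A[i][j] - A[(i + 1) % n][(j + 1) % n]) < tol
               for i in range(n) for j in range(n))

# N = 25 and epsilon = 4 are the values quoted in the text
A, b = cooperative_parameters(25, 4.0)
print(A[0][0], is_translation_invariant(A))
```

Because the matrix depends on i and j only through the cyclic distance between them, a learned A can be compared against this form simply by inspecting one row.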
\n\nFigure 5: a) Morphing of a face image by successive sampling from the learned NNBM distribution. b) Samples generated from a normal Gaussian.\n\nGenerative model for faces\n\nWe have also used the NNBM to learn a generative model for images of human faces. The NNBM is used to model the correlations in the coefficients of the nonnegative matrix factorization (NMF) of the face images [10]. NMF reduces the dimensionality of nonnegative data by decomposing the face images into parts corresponding to eyes, noses, ears, etc. Since the different parts are coactivated in reconstructing a face, the activations of these parts contain significant correlations that need to be captured by a generative model. Here we briefly demonstrate how the NNBM is able to learn these correlations.\n\nSampling from the NNBM stochastically generates coefficients which can be displayed graphically as face images. Fig. 5 shows some representative face images as the reflective slice sampling dynamics evolves the coefficients. Also displayed in the figure are the analogous images generated if a normal Gaussian is used to model the correlations instead. It is clear that the nonnegativity constraints and multimodal nature of the NNBM result in samples which are cleaner and more distinct as faces.\n\nDiscussion\n\nHere we have introduced the NNBM as a recurrent neural network model that is able to describe multimodal nonnegative data. Its application is made practical by the efficiency of the slice sampling Monte Carlo method. The learning algorithm incorporates numerical sampling from the NNBM distribution and is able to learn from observations of nonnegative data. 
We have demonstrated the application of NNBM learning to a cooperative, translationally invariant distribution, as well as to real data from images of human faces.\n\nExtensions to the present work include incorporating hidden units into the recurrent network. The addition of hidden units implies modelling certain higher order statistics in the data, and requires calculating averages over these hidden units. We anticipate the marginal distribution over these units to be most commonly unimodal, and hence mean field theory should be valid for approximating these averages.\n\nAnother possible extension involves generalizing the NNBM to model continuous data confined within a certain range, i.e. $0 \\le x_i \\le 1$. In this situation, slice sampling techniques would also be used to efficiently generate representative samples. In any case, we hope that this work stimulates more research into using these types of recurrent neural networks to model complex, multimodal data.\n\nAcknowledgements\n\nThe authors acknowledge useful discussion with John Hopfield, Sebastian Seung, Nicholas Socci, and Gayle Wittenberg, and are indebted to Haim Sompolinsky for pointing out the maximum entropy interpretation of the Boltzmann machine. This work was funded by Bell Laboratories, Lucent Technologies.\n\nO.B. Downs is grateful for the moral support, and open ears and minds of Beth Brittle, Gunther Lenz, and Sandra Scheitz.\n\nReferences\n\n[1] Hinton, GE & Sejnowski, TJ (1983). Optimal perceptual inference. IEEE Conference on Computer Vision and Pattern Recognition, Washington, DC, 448-453.\n\n[2] Ackley, DH, Hinton, GE, & Sejnowski, TJ (1985). A learning algorithm for Boltzmann machines. Cognitive Science 9, 147-169.\n\n[3] Socci, ND, Lee, DD, & Seung, HS (1998). The rectified Gaussian distribution. Advances in Neural Information Processing Systems 10, 350-356. 
\n\n[4] MacKay, DJC (1998). Introduction to Monte Carlo Methods. Learning in Graphical Models. Kluwer Academic Press, NATO Science Series, 175-204.\n\n[5] Galland, CC (1993). The limitations of deterministic Boltzmann machine learning. Network 4, 355-380.\n\n[6] Kappen, HJ & Rodriguez, FB (1997). Mean field approach to learning in Boltzmann machines. Pattern Recognition in Practice V, Amsterdam.\n\n[7] Neal, RM (1995). Suppressing random walks in Markov chain Monte Carlo using ordered overrelaxation. Technical Report 9508, Dept. of Statistics, University of Toronto.\n\n[8] Neal, RM (1997). Markov chain Monte Carlo methods based on \"slicing\" the density function. Technical Report 9722, Dept. of Statistics, University of Toronto.\n\n[9] Ben-Yishai, R, Bar-Or, RL, & Sompolinsky, H (1995). Theory of orientation tuning in visual cortex. Proc. Nat. Acad. Sci. USA 92, 3844-3848.\n\n[10] Lee, DD & Seung, HS (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791.\n", "award": [], "sourceid": 1743, "authors": [{"given_name": "Oliver", "family_name": "Downs", "institution": null}, {"given_name": "David", "family_name": "MacKay", "institution": null}, {"given_name": "Daniel", "family_name": "Lee", "institution": null}]}