{"title": "The Rectified Gaussian Distribution", "book": "Advances in Neural Information Processing Systems", "page_first": 350, "page_last": 356, "abstract": null, "full_text": "The Rectified Gaussian Distribution\n\nN. D. Socci, D. D. Lee and H. S. Seung\n\nBell Laboratories, Lucent Technologies\n\nMurray Hill, NJ 07974\n\n{nds|ddlee|seung}@bell-labs.com\n\nAbstract\n\nA simple but powerful modification of the standard Gaussian distribution is studied. The variables of the rectified Gaussian are constrained to be nonnegative, enabling the use of nonconvex energy functions. Two multimodal examples, the competitive and cooperative distributions, illustrate the representational power of the rectified Gaussian. Since the cooperative distribution can represent the translations of a pattern, it demonstrates the potential of the rectified Gaussian for modeling pattern manifolds.\n\n1 INTRODUCTION\n\nThe rectified Gaussian distribution is a modification of the standard Gaussian in which the variables are constrained to be nonnegative. This simple modification brings increased representational power, as illustrated by two multimodal examples of the rectified Gaussian, the competitive and the cooperative distributions. The modes of the competitive distribution are well-separated by regions of low probability. The modes of the cooperative distribution are closely spaced along a nonlinear continuous manifold. Neither distribution can be accurately approximated by a single standard Gaussian. In short, the rectified Gaussian is able to represent both discrete and continuous variability in a way that a standard Gaussian cannot.\n\nThis increased representational power comes at the price of increased complexity. 
While finding the mode of a standard Gaussian involves solving linear equations, finding the modes of a rectified Gaussian is a quadratic programming problem. Sampling from a standard Gaussian can be done by generating one-dimensional normal deviates, followed by a linear transformation. Sampling from a rectified Gaussian requires Monte Carlo methods. Mode-finding and sampling algorithms are basic tools that are important in probabilistic modeling.\n\nLike the Boltzmann machine[1], the rectified Gaussian is an undirected graphical model. The rectified Gaussian is a better representation for probabilistic modeling of continuous-valued data. It is unclear whether learning will be more tractable for the rectified Gaussian than it is for the Boltzmann machine.\n\nFigure 1: Three types of quadratic energy functions. (a) Bowl (b) Trough (c) Saddle\n\nA different version of the rectified Gaussian was recently introduced by Hinton and Ghahramani[2, 3]. Their version is for a single variable, and has a singularity at the origin designed to produce sparse activity in directed graphical models. Our version lacks this singularity, and is only interesting in the case of more than one variable, for it relies on undirected interactions between variables to produce the multimodal behavior that is of interest here.\n\nThe present work is inspired by biological neural network models that use continuous dynamical attractors[4]. In particular, the energy function of the cooperative distribution was previously studied in models of the visual cortex[5], motor cortex[6], and head direction system[7]. 
\n\n2 ENERGY FUNCTIONS: BOWL, TROUGH, AND SADDLE\n\nThe standard Gaussian distribution P(x) is defined as\n\n$P(x) = Z^{-1} e^{-βE(x)}$,   (1)\n\n$E(x) = (1/2) x^T A x - b^T x$.   (2)\n\nThe symmetric matrix A and vector b define the quadratic energy function E(x). The parameter β = 1/T is an inverse temperature. Lowering the temperature concentrates the distribution at the minimum of the energy function. The prefactor $Z^{-1}$ normalizes the integral of P(x) to unity.\n\nDepending on the matrix A, the quadratic energy function E(x) can have different types of curvature. The energy function shown in Figure 1(a) is convex. The minimum of the energy corresponds to the peak of the distribution. Such a distribution is often used in pattern recognition applications, when patterns are well-modeled as a single prototype corrupted by random noise.\n\nThe energy function shown in Figure 1(b) is flattened in one direction. Patterns generated by such a distribution come with roughly equal likelihood from anywhere along the trough. So the direction of the trough corresponds to the invariances of the pattern. Principal component analysis can be thought of as a procedure for learning distributions of this form.\n\nThe energy function shown in Figure 1(c) is saddle-shaped. It cannot be used in a Gaussian distribution, because the energy decreases without limit down the sides of the saddle, leading to a non-normalizable distribution. However, certain saddle-shaped energy functions can be used in the rectified Gaussian distribution, which is defined over vectors x whose components are all nonnegative. 
The energy functions that can be used are those for which the matrix A has the property $x^T A x > 0$ for all x > 0, a condition known as copositivity. Note that this set of matrices is larger than the set of positive definite matrices that can be used with a standard Gaussian. The nonnegativity constraints block the directions in which the energy diverges to negative infinity. Some concrete examples will be discussed shortly. The energy functions for these examples will have multiple minima, and the corresponding distribution will be multimodal, which is not possible with a standard Gaussian.\n\n3 MODE-FINDING\n\nBefore defining some example distributions, we must introduce some tools for analyzing them. The modes of a rectified Gaussian are the minima of the energy function (2), subject to nonnegativity constraints. At low temperatures, the modes of the distribution characterize much of its behavior.\n\nFinding the modes of a rectified Gaussian is a problem in quadratic programming. Algorithms for quadratic programming are particularly simple for the case of nonnegativity constraints. Perhaps the simplest algorithm is the projected gradient method, a discrete time dynamics consisting of a gradient step followed by a rectification\n\n$x ← [x - η(Ax - b)]^+$.   (3)\n\nThe rectification $[x]^+ = max(x, 0)$ keeps x within the nonnegative orthant (x ≥ 0). If the step size η is chosen correctly, this algorithm provably converges to a stationary point of the energy function[8]. In practice, this stationary point is generally a local minimum.\n\nNeural networks can also solve quadratic programming problems. 
We define the synaptic weight matrix W = I - A, and a continuous time dynamics\n\n$dx/dt + x = [b + Wx]^+$.   (4)\n\nFor any initial condition in the nonnegative orthant, the dynamics remains in the nonnegative orthant, and the quadratic function (2) is a Lyapunov function of the dynamics.\n\nBoth of these methods converge to a stationary point of the energy. The gradient of the energy is given by g = Ax - b. According to the Kuhn-Tucker conditions, a stationary point must satisfy the conditions that for all i, either $g_i = 0$ and $x_i > 0$, or $g_i > 0$ and $x_i = 0$. The intuitive explanation is that in the interior of the constraint region, the gradient must vanish, while at the boundary, the gradient must point toward the interior. For a stationary point to be a local minimum, the Kuhn-Tucker conditions must be augmented by the condition that the Hessian of the nonzero variables be positive definite.\n\nBoth methods are guaranteed to find a global minimum only in the case where A is positive definite, so that the energy function (2) is convex. This is because a convex energy function has a unique minimum. Convex quadratic programming is solvable in polynomial time. In contrast, for a nonconvex energy function (indefinite A), it is not generally possible to find the global minimum in polynomial time, because of the possible presence of local minima. In many practical situations, however, it is not too difficult to find a reasonable solution.\n\nFigure 2: The competitive distribution for two variables. (a) A non-convex energy function with two constrained minima on the x and y axes. Shown are contours of constant energy, and arrows that represent the negative gradient of the energy. (b) The rectified Gaussian distribution has two peaks. 
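The projected gradient iteration and the Kuhn-Tucker conditions of Section 3 can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the step size of 0.1 and the fixed count of 500 iterations are assumptions chosen for illustration. The example uses the two-variable energy E(x, y) = -(x^2 + y^2)/2 + (x + y)^2 - (x + y), that is A = [[1, 2], [2, 1]] and b = [1, 1], whose constrained minima the paper gives as (1, 0) and (0, 1).

```python
# Illustrative sketch only: names, step size and iteration count are assumptions.

def energy_gradient(A, b, x):
    '''Return g = A x - b for a list-of-lists matrix A and list vectors b, x.'''
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) - b_i
            for row, b_i in zip(A, b)]


def projected_gradient(A, b, x0, eta=0.1, steps=500):
    '''Iterate x <- [x - eta * (A x - b)]^+ for a fixed number of steps.'''
    x = list(x0)
    for _ in range(steps):
        g = energy_gradient(A, b, x)
        # Gradient step followed by rectification onto the nonnegative orthant.
        x = [max(x_i - eta * g_i, 0.0) for x_i, g_i in zip(x, g)]
    return x


def satisfies_kuhn_tucker(A, b, x, tol=1e-6):
    '''Check, up to tol, that each i has g_i = 0 with x_i > 0, or g_i >= 0 with x_i = 0.'''
    g = energy_gradient(A, b, x)
    for x_i, g_i in zip(x, g):
        if x_i > tol:
            if abs(g_i) > tol:
                return False  # interior coordinate with nonvanishing gradient
        elif g_i < -tol:
            return False      # boundary coordinate whose energy decreases inward
    return True


# Two-variable competitive example: A_ij = -delta_ij + 2, b_i = 1.
A = [[1.0, 2.0], [2.0, 1.0]]
b = [1.0, 1.0]

mode_x = projected_gradient(A, b, [0.6, 0.4])  # start closer to the x axis
mode_y = projected_gradient(A, b, [0.4, 0.6])  # start closer to the y axis
print(mode_x, satisfies_kuhn_tucker(A, b, mode_x))
print(mode_y, satisfies_kuhn_tucker(A, b, mode_y))
```

Which mode is reached depends on the starting point, which is the local-minimum behavior the text describes for nonconvex energies. The Kuhn-Tucker check only certifies stationarity; it does not test the additional Hessian condition for a local minimum.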
\n\nThe rectified Gaussian happens to be most interesting in the nonconvex case, precisely because of the possibility of multiple minima. The consequence of multiple minima is a multimodal distribution, which cannot be well-approximated by a standard Gaussian. We now consider two examples of a multimodal rectified Gaussian.\n\n4 COMPETITIVE DISTRIBUTION\n\nThe competitive distribution is defined by\n\n$A_{ij} = -δ_{ij} + 2$,   (5)\n\n$b_i = 1$.   (6)\n\nWe first consider the simple case N = 2. Then the energy function given by\n\n$E(x, y) = -(x^2 + y^2)/2 + (x + y)^2 - (x + y)$   (7)\n\nhas two constrained minima at (1,0) and (0,1) and is shown in figure 2(a). It does not lead to a normalizable distribution unless the nonnegativity constraints are imposed. The two constrained minima of this nonconvex energy function correspond to two peaks in the distribution (fig 2(b)). While such a bimodal distribution could be approximated by a mixture of two standard Gaussians, a single Gaussian distribution cannot approximate such a distribution. In particular, the reduced probability density between the two peaks would not be representable at all with a single Gaussian.\n\nThe competitive distribution gets its name because its energy function is similar to the ones that govern winner-take-all networks[9]. When N becomes large, the N global minima of the energy function are singleton vectors (fig 3), with one component equal to unity, and the rest zero. This is due to a competitive interaction between the components. The mean of the zero temperature distribution is given by\n\n$⟨x_i⟩ = 1/N$.   (8)\n\nThe eigenvalues of the covariance\n\n$⟨x_i x_j⟩ - ⟨x_i⟩⟨x_j⟩ = δ_{ij}/N - 1/N^2$   (9)\n\nare all equal to 1/N, except for a single zero mode. The zero mode is 1, the vector of all ones, and the other eigenvectors span the (N - 1)-dimensional space perpendicular to 1. Figure 3 shows two samples: one (b) drawn at finite temperature from the competitive distribution, and the other (c) drawn from a standard Gaussian distribution with the same mean and covariance. Even if the sample from the standard Gaussian is cut so that negative values are set to zero, the sample does not look at all like the original distribution. Most importantly, a standard Gaussian will never be able to capture the strongly competitive character of this distribution.\n\nFigure 3: The competitive distribution for N = 10 variables. (a) One mode (zero temperature state) of the distribution. The strong competition between the variables results in only one variable on. There are N modes of this form, each with a different winner variable. (b) A sample at finite temperature (β ≈ 110) using Monte Carlo sampling. There is still a clear winner variable. (c) Sample from a standard Gaussian with matched mean and covariance. Even if we cut off the negative values this sample still bears little resemblance to the states shown in (a) and (b), since there is no clear winner variable.\n\n5 COOPERATIVE DISTRIBUTION\n\nTo define the cooperative distribution on N variables, an angle $θ_i = 2πi/N$ is associated with each variable $x_i$, so that the variables can be regarded as sitting on a ring. 
The energy function is defined by\n\n$A_{ij} = δ_{ij} + 1/N - (4/N) cos(θ_i - θ_j)$,   (10)\n\n$b_i = 1$.   (11)\n\nThe coupling $A_{ij}$ between $x_i$ and $x_j$ depends only on the separation $θ_i - θ_j$ between them on the ring.\n\nThe minima, or ground states, of the energy function can be found numerically by the methods described earlier. An analytic calculation of the ground states in the large N limit is also possible[5]. As shown in Figure 4(a), each ground state is a lump of activity centered at some angle on the ring. This delocalized pattern of activity is different from the singleton modes of the competitive distribution, and arises from the cooperative interactions between neurons on the ring. Because the distribution is invariant to rotations of the ring (cyclic permutations of the variables $x_i$), there are N ground states, each with the lump at a different angle.\n\nThe mean and the covariance of the cooperative distribution are given by\n\n$⟨x_i⟩ = const$,   (12)\n\n$⟨x_i x_j⟩ - ⟨x_i⟩⟨x_j⟩ = C(θ_i - θ_j)$.   (13)\n\nA given sample of x, shown in Figure 4(a), does not look anything like the mean, which is completely uniform. Samples generated from a Gaussian distribution with the same mean and covariance look completely different from the ground states of the cooperative distribution (fig 4(c)).\n\nFigure 4: The cooperative distribution for N = 25 variables. (a) Zero temperature state. A cooperative interaction between the variables leads to a delocalized pattern of activity that can sit at different locations on the ring. (b) A finite temperature (β = 50) sample. (c) A sample from a standard Gaussian with matched mean and covariance.\n\nThese deviations from standard Gaussian behavior reflect fundamental differences in the underlying energy function. 
Here the energy function has N discrete minima arranged along a ring. In the limit of large N the barriers between these minima become quite small. A reasonable approximation is to regard the energy function as having a continuous line of minima with a ring geometry[5]. In other words, the energy surface looks like a curved trough, similar to the bottom of a wine bottle. The mean is the centroid of the ring and is not close to any minimum.\n\nThe cooperative distribution is able to model the set of all translations of the lump pattern of activity. This suggests that the rectified Gaussian may be useful in invariant object recognition, in cases where a continuous manifold of instantiations of an object must be modeled. One such case is visual object recognition, where the images of an object from different viewpoints form a continuous manifold.\n\n6 SAMPLING\n\nFigures 3 and 4 depict samples drawn from the competitive and cooperative distributions. These samples were generated using the Metropolis Monte Carlo algorithm. Since full descriptions of this algorithm can be found elsewhere, we give only a brief description of the particular features used here. The basic procedure is to generate a new configuration of the system and calculate the change in energy ΔE (given by eq. 2). If the energy decreases, one accepts the new configuration unconditionally. If it increases then the new configuration is accepted with probability $e^{-βΔE}$.\n\nIn our sampling algorithm one variable is updated at a time (analogous to single spin flips). The acceptance ratio is much higher this way than if we update all the spins simultaneously. However, for some distributions the energy function may have approximately marginal directions, that is, directions in which there is little or no barrier. The cooperative distribution has this property. 
We can expect critical slowing down due to this, and consequently some sort of collective update (analogous to multi-spin updates or cluster updates) might make sampling more efficient. However, the type of update will depend on the specifics of the energy function and is not easy to determine.\n\n7 DISCUSSION\n\nThe competitive and cooperative distributions are examples of rectified Gaussians for which no good approximation by a standard Gaussian is possible. However, both distributions can be approximated by mixtures of standard Gaussians. The competitive distribution can be approximated by a mixture of N Gaussians, one for each singleton state. The cooperative distribution can also be approximated by a mixture of N Gaussians, one for each location of the lump on the ring. A more economical approximation would reduce the number of Gaussians in the mixture, but make each one anisotropic[10].\n\nWhether the rectified Gaussian is superior to these mixture models is an empirical question that should be investigated on specific real-world probabilistic modeling tasks. Our intuition is that the rectified Gaussian will turn out to be a good representation for nonlinear pattern manifolds, and the aim of this paper has been to make this intuition concrete.\n\nTo make the rectified Gaussian useful in practical applications, it is critical to find tractable learning algorithms. It is not yet clear whether learning will be more tractable for the rectified Gaussian than it was for the Boltzmann machine. Perhaps the continuous variables of the rectified Gaussian will be easier to work with than the binary variables of the Boltzmann machine.\n\nAcknowledgments We would like to thank P. Mitra, L. Saul, B. Shraiman and H. 
Sompolinsky for helpful discussions. Work on this project was supported by Bell Laboratories, Lucent Technologies.\n\nReferences\n\n[1] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9:147-169, 1985.\n\n[2] G. E. Hinton and Z. Ghahramani. Generative models for discovering sparse distributed representations. Phil. Trans. Roy. Soc., B352:1177-90, 1997.\n\n[3] Z. Ghahramani and G. E. Hinton. Hierarchical non-linear factor analysis and topographic maps. Adv. Neural Info. Proc. Syst., 11, 1998.\n\n[4] H. S. Seung. How the brain keeps the eyes still. Proc. Natl. Acad. Sci. USA, 93:13339-13344, 1996.\n\n[5] R. Ben-Yishai, R. L. Bar-Or, and H. Sompolinsky. Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. USA, 92:3844-3848, 1995.\n\n[6] A. P. Georgopoulos, M. Taira, and A. Lukashin. Cognitive neurophysiology of the motor cortex. Science, 260:47-52, 1993.\n\n[7] K. Zhang. Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: a theory. J. Neurosci., 16:2112-2126, 1996.\n\n[8] D. P. Bertsekas. Nonlinear programming. Athena Scientific, Belmont, MA, 1995.\n\n[9] S. Amari and M. A. Arbib. Competition and cooperation in neural nets. In J. Metzler, editor, Systems Neuroscience, pages 119-165. Academic Press, New York, 1977.\n\n[10] G. E. Hinton, P. Dayan, and M. Revow. Modeling the manifolds of images of handwritten digits. IEEE Trans. Neural Networks, 8:65-74, 1997.", "award": [], "sourceid": 1402, "authors": [{"given_name": "Nicholas", "family_name": "Socci", "institution": null}, {"given_name": "Daniel", "family_name": "Lee", "institution": null}, {"given_name": "H. Sebastian", "family_name": "Seung", "institution": null}]}