{"title": "Beyond Maximum Likelihood and Density Estimation: A Sample-Based Criterion for Unsupervised Learning of Complex Models", "book": "Advances in Neural Information Processing Systems", "page_first": 535, "page_last": 541, "abstract": null, "full_text": "Beyond maximum likelihood and density estimation: A sample-based criterion for unsupervised learning of complex models

Sepp Hochreiter and Michael C. Mozer

Department of Computer Science
University of Colorado
Boulder, CO 80309-0430
{hochreit,mozer}@cs.colorado.edu

Abstract

The goal of many unsupervised learning procedures is to bring two probability distributions into alignment. Generative models such as Gaussian mixtures and Boltzmann machines can be cast in this light, as can recoding models such as ICA and projection pursuit. We propose a novel sample-based error measure for these classes of models, which applies even in situations where maximum likelihood (ML) and probability density estimation-based formulations cannot be applied, e.g., models that are nonlinear or have intractable posteriors. Furthermore, our sample-based error measure avoids the difficulties of approximating a density function. We prove that with an unconstrained model, (1) our approach converges on the correct solution as the number of samples goes to infinity, and (2) the expected solution of our approach in the generative framework is the ML solution. Finally, we evaluate our approach via simulations of linear and nonlinear models on mixture of Gaussians and ICA problems. The experiments show the broad applicability and generality of our approach.

1 Introduction

Many unsupervised learning procedures can be viewed as trying to bring two probability distributions into alignment.
Two well known classes of unsupervised procedures that can be cast in this manner are generative and recoding models. In a generative unsupervised framework, the environment generates training examples (which we will refer to as observations) by sampling from one distribution; the other distribution is embodied in the model. Examples of generative frameworks are mixtures of Gaussians (MoG) [2], factor analysis [4], and Boltzmann machines [8]. In the recoding unsupervised framework, the model transforms points from an observation space to an output space, and the output distribution is compared either to a reference distribution or to a distribution derived from the output distribution. An example is independent component analysis (ICA) [11], a method that discovers a representation of vector-valued observations in which the statistical dependence among the vector elements in the output space is minimized. With ICA, the model demixes observation vectors and the output distribution is compared against a factorial distribution which is derived either from assumptions about the distribution (e.g., supergaussian) or from a factorization of the output distribution. Other examples within the recoding framework are projection methods such as projection pursuit (e.g., [14]) and principal component analysis. In each case we have described for the unsupervised learning of a model, the objective is to bring two probability distributions, one or both of which is produced by the model, into alignment. To improve the model, we need to define a measure of the discrepancy between the two distributions, and to know how the model parameters influence the discrepancy.

One natural approach is to use outputs from the model to construct a probability density estimator (PDE).
The primary disadvantage of such an approach is that the accuracy of the learning procedure depends highly on the quality of the PDE, and PDEs face the bias-variance trade-off. For the learning of generative models, maximum likelihood (ML) is a popular approach that avoids PDEs. In an ML approach, the model's generative distribution is expressed analytically, which makes it straightforward to evaluate the posterior, p(data | model), and therefore, to adjust the model parameters to maximize the likelihood of the data being generated by the model. This limits the ML approach to models that have tractable posteriors, true only of the simplest models [1, 6, 9].

We describe an approach which, like ML, avoids the construction of an explicit PDE, yet does so without requiring an analytic expression for the posterior. Our approach, which we call a sample-based method, assumes a set of samples from each distribution and proposes an error measure of the disagreement defined directly in terms of the samples. Thus, a second set of samples drawn from the model serves in place of a PDE or an analytic expression of the model's density. The sample-based method is inspired by the theory of electric fields, which describes the interactions among charged particles. For more details on the metaphor, see [10].

In this paper, we prove that our approach converges to the optimal solution as the sample size goes to infinity, assuming an unconstrained (maximally flexible) model. We also prove that the expected solution of our approach is the ML solution in a generative context. We present empirical results showing that the sample-based approach works for both linear and nonlinear models.

2 The Method

Consider a model to be learned, f_w, parameterized by weights w.
The model maps an input vector, z^i, indexed by i, to an output vector x^i = f_w(z^i). The model inputs are sampled from a distribution p_z(.), and the learning procedure calls for adjusting the model such that the output distribution, p_x(.), comes to match a target distribution, p_y(.). For unsupervised recoding models, z^i is an observation, x^i is the transformed representation of z^i, and p_y(.) specifies the desired code properties. For unsupervised generative models, p_z(.) is fixed and p_y(.) is the distribution of observations.

The Sample-based Method: The Intuitive Story

Assume that we have data points sampled from two different distributions, labeled "-" and "+" (Figure 1). The sample-based error measure specifies how samples should be moved so that the two distributions are brought into alignment. In the figure, samples from the lower left and upper right corners must be moved to the upper left and lower right corners. Our goal is to establish an explicit correspondence between each "-" sample and each "+" sample. Toward this end, our sample-based method relies on interactions among the samples: it introduces a repelling force between samples from the same distribution and an attractive force between samples from different distributions, and allows the samples to move according to these forces.

The Sample-based Method: The Formal Presentation

In conceiving of the problem in terms of samples that attract and repel one another, it is natural to think in terms of physical interactions among charged particles. Consider a set of positively charged particles at locations denoted by x^i, i = 1..N_x, and a set of negatively charged particles at locations denoted by y^j, j = 1..N_y.
The particles correspond to data samples from the two distributions. The interaction among particles is characterized by the Coulomb energy, E:

E = (1/2) [ N_x^{-2} Σ_{i=1}^{N_x} Σ_{j=1}^{N_x} Γ(x^i, x^j) − 2 N_x^{-1} N_y^{-1} Σ_{i=1}^{N_x} Σ_{j=1}^{N_y} Γ(x^i, y^j) + N_y^{-2} Σ_{i=1}^{N_y} Σ_{j=1}^{N_y} Γ(y^i, y^j) ],

where Γ(a, b) is a distance measure (Green's function) which results in nearby particles having a strong influence on the energy, but distant particles having only a weak influence. Green's function is defined as Γ(a, b) = c(d) ∥a − b∥^{2−d}, where d is the dimensionality of the space, c(d) is a constant depending only on d, and ∥.∥ denotes the Euclidean distance. For d = 2, Γ(a, b) = −k ln(∥a − b∥).

The Coulomb energy is low when negative and positive particles are near one another, positive particles are far from one another, and negative particles are far from one another. This is exactly the state we would like to achieve for our two distributions of samples: bringing the two distributions into alignment without collapsing either distribution into a trivial form. Consequently, our sample-based method proposes using the Coulomb energy as an objective function to be minimized.

The gradient of E with respect to a sample's location is readily computed (it is the force acting on that sample), and this gradient can be chained with the Jacobian of the location with respect to the model parameters w to obtain a gradient-based update rule:

Δw = −ε ∇_w E = −ε ( N_x^{-1} Σ_{k=1}^{N_x} (∂x^k/∂w)^T ∇_{x^k} φ(x^k) − N_y^{-1} Σ_{k=1}^{N_y} (∂y^k/∂w)^T ∇_{y^k} φ(y^k) ),

where ε is a step size, T denotes transposition, and φ(a) := N_x^{-1} Σ_{i=1}^{N_x} Γ(a, x^i) − N_y^{-1} Σ_{j=1}^{N_y} Γ(a, y^j) is the potential, which satisfies ∇_{x^k} E = N_x^{-1} ∇φ(x^k) and ∇_{y^k} E = −N_y^{-1} ∇φ(y^k). Here ∂x^k/∂w is the Jacobian of f_w(z^k), and the time derivative of x^k is ẋ^k = ḟ_w(z^k) = −∇φ(x^k). If y^k depends on w, the notation for y^k is analogous; otherwise ∂y^k/∂w is the zero matrix.
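The energy and potential above can be sketched numerically. In the following minimal NumPy sketch, the function names, the normalization c(d) = 1, the eps guard at the kernel singularity, and the exclusion of the i = j self-pairs (which would otherwise be infinite) are our own choices, not part of the paper:

```python
import numpy as np

def green(a, b, d=2, eps=1e-9):
    # Green's function Gamma(a, b): -log ||a - b|| for d = 2,
    # ||a - b||^(2 - d) for d >= 3 (normalization c(d) taken as 1 here).
    r = max(float(np.linalg.norm(np.asarray(a) - np.asarray(b))), eps)
    return -np.log(r) if d == 2 else r ** (2 - d)

def potential(a, X, Y, d=2):
    # phi(a) = (1/Nx) sum_i Gamma(a, x^i) - (1/Ny) sum_j Gamma(a, y^j)
    return (np.mean([green(a, x, d) for x in X])
            - np.mean([green(a, y, d) for y in Y]))

def coulomb_energy(X, Y, d=2):
    # E = (1/2) [ self(X) - 2 * cross(X, Y) + self(Y) ]; the i == j
    # self-pairs are excluded from the within-set sums to keep E finite.
    def mean_pairs(A, B, skip_diagonal):
        vals = [green(a, b, d) for i, a in enumerate(A)
                for j, b in enumerate(B) if not (skip_diagonal and i == j)]
        return float(np.mean(vals))
    return 0.5 * (mean_pairs(X, X, True)
                  - 2.0 * mean_pairs(X, Y, False)
                  + mean_pairs(Y, Y, True))
```

With this sign convention the energy drops as the model samples X move onto the observations Y, while the within-set terms still repel and prevent either cloud from collapsing.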
There turns out to be an advantage to using Green's function as the basis for the particle interactions over other possibilities, e.g., a Gaussian function (e.g., [12, 13, 3]). The advantage stems from the fact that with Green's function, the force between two nearby points goes to infinity as the points are pushed together, whereas with the Gaussian, the force goes to zero. Consequently, without Green's function, one might expect local optima in which clusters of points collapse onto a single location. Empirically, simulations confirmed this conjecture.

Proof: Correctness of the Update Rule

As the numbers of samples N_x and N_y go to infinity, φ can be expressed as φ(a) = ∫ ρ(b) Γ(a, b) db, where ρ(b) := p_x(b) − p_y(b). Our sample-based method moves data points, but by moving data points, the method implicitly alters the probability density which gave rise to the data. The relation between the movement of data points and the change in the density can be expressed using an operator from vector analysis, the divergence. The divergence at a location a is the number of data points moving out of a volume surrounding a minus the number of data points moving into the same volume. Thus, the negative divergence of the movements at a gives the density change at a. The movement of data points is given by −∇φ(a). We get ρ̇(a) = ṗ_x(a) − ṗ_y(a) = −div(−∇φ(a)). For Cartesian (orthogonal) coordinates, the divergence div of a vector field V at a is defined as div(V(a)) := Σ_{l=1}^{d} ∂V_l(a)/∂a_l. The Laplace operator Δ
of a scalar function A is defined as ΔA(a) := div(∇A(a)) = Σ_{l=1}^{d} ∂²A(a)/∂a_l². The Laplace operator allows an important characterization of Green's function: Δ_a Γ(a, b) = −δ(a − b), where δ is the Dirac delta function. This characterization gives Δφ(a) = −ρ(a). Letting μ(a), with μ(a) ≥ μ_0 > 0, denote the effectiveness of the algorithm in moving a sample at a, we obtain

ρ̇(a) = μ(a) div(∇φ(a)) = μ(a) Δφ(a) = −μ(a) ρ(a),

and hence ρ(a, t) = ρ(a, 0) exp(−μ(a) t). For the integrated squared error (ISE) of the two distributions we obtain

ISE(t) = ∫ (ρ(a, t))² da ≤ exp(−μ_0 t) ∫ (ρ(a, 0))² da = exp(−μ_0 t) ISE(0),

where ISE(0) is independent of t. Thus, the ISE between the two distributions is guaranteed to decrease during learning as the sample size goes to infinity.

Proof: Expected Generative Solution is ML Solution

In the case of a generative model which has no constraints (i.e., can model any distribution), the maximum likelihood solution will have distribution p_x(a) = N_y^{-1} Σ_{j=1}^{N_y} δ(y^j − a), i.e., the model will produce only the observations, and all of them with equal probability. For this case, we show that our sample-based method will yield the same solution in expectation as ML.

The sample-based method converges to a local minimum of the energy, where ⟨∇_a φ(a)⟩_x = 0 for all a, with ⟨.⟩_x the expectation over model output. Equivalently, ⟨∇_a Γ(a, x)⟩_x − N_y^{-1} Σ_{j=1}^{N_y} ∇_a Γ(a, y^j) = 0, or

⟨∇_a Γ(a, x)⟩_x = ∫ p_x(x) ∇_a Γ(a, x) dx = N_y^{-1} Σ_{j=1}^{N_y} ∇_a Γ(a, y^j).

Because this equation holds for all a, we obtain p_x(a) = N_y^{-1} Σ_{j=1}^{N_y} δ(y^j − a), which is the ML solution. Thus, the sample-based method can be viewed as an approximation to ML which gets more exact as the number of samples goes to infinity.
3 Experiments

We illustrate the sample-based approach for two common unsupervised learning problems: MoG and ICA. In both cases, we demonstrate that the sample-based approach works in the linear case. We also consider a nonlinear case to illustrate the power of the sample-based approach.

Mixture of Gaussians

In this generative model framework, m denotes a mixture component which is chosen with probability v_m from M components and has associated model parameters W_m = (O_m, M_m). In the standard MoG model, given a choice of component m, the (linear) model output is obtained by x^i = f_{W_m}(z^i) = O_m z^i + M_m, where z^i is drawn from the Gaussian distribution with zero mean and identity covariance matrix. For a nonlinear mixture model, we used a 3-layer sigmoidal neural network for f_{W_m}(z^i). An update rule for v_m can be derived for our approach analogously to the rule for Δw, with its own step size ε_v; the constraint Σ_{m=1}^{M} v_m = 1 is enforced.

We trained a linear MoG model with the standard expectation maximization (EM) algorithm (using code from [5]), and a linear and a nonlinear MoG with our sample-based approach. A fixed training set of N_y = 100 samples was used for all models, and all models had M = 10 except one nonlinear model, which had M = 1. In the sample-based approach, we generated 100 samples from our model (the x^i) following every training epoch. The nonlinear model was trained with backpropagation.

Figure 2 shows the results.
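The linear generator described above, x^i = O_m z^i + M_m with component m drawn according to v, can be sketched as follows; the function name, parameter shapes, and toy values are our own illustration, not the authors' code:

```python
import numpy as np

def sample_mog(v, O, M, n, rng):
    # Draw n outputs x = O_m z + M_m, with component m chosen with
    # probability v_m and z ~ N(0, I).
    # v: (K,) mixing proportions; O: (K, D, D) linear maps; M: (K, D) offsets.
    comps = rng.choice(len(v), size=n, p=v)
    Z = rng.normal(size=(n, O.shape[2]))
    return np.einsum('nij,nj->ni', O[comps], Z) + M[comps]

rng = np.random.default_rng(2)
v = np.array([0.5, 0.5])
O = 0.1 * np.stack([np.eye(2), np.eye(2)])    # two narrow components
M = np.array([[-5.0, 0.0], [5.0, 0.0]])       # centers at x = -5 and x = +5
X = sample_mog(v, O, M, 2000, rng)
```

In the sample-based training loop, a fresh batch of such x^i would be drawn after every epoch and compared with the observations through the Coulomb criterion.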
The linear ML model is better than the sample-based model. That is not surprising, because ML computes the model probability values analytically (the posterior is tractable), whereas our algorithm uses only samples to approximate the model probability values. Even though we used only 100 model samples in each epoch, the linear sample-based model found an acceptable solution and is not much worse than the ML model. The nonlinear models fit the true ring-like distribution better and do not suffer from sharp corners and edges.

Figure 2: (upper panels, left to right) training samples chosen from a ring density, a larger sample from this density, the solution obtained from the linear model trained with EM, "ML (10, linear)"; (lower panels, left to right) models trained with the sample-based method: linear model, nonlinear model, nonlinear model with one component.

Independent Component Analysis

With a recoding model we tried to demix subgaussian source distributions where each has supergaussian modes. Most ICA methods are not able to demix subgaussian sources. Figure 3 shows the results, which are nearly perfect. The ideal result is a scaled and permuted identity matrix when the mixing and demixing matrices are multiplied. For more details see [10].

Figure 3: For a three-dimensional linear mixture, projections of sources (first row), mixtures (second row), and sources recovered by our approach (third row) onto a two-dimensional plane are shown. The demixing matrix multiplied with the mixing matrix yields:

-0.0017  -0.0014  -0.1755
 0.0010   0.1850   0.0003
 0.2523  -0.0101   0.0053

In a second experiment, we tried to recover sources from two nonlinear mixings.
This problem is impossible for standard ICA methods because they are designed for linear mixings. The result is shown in Figure 4. An exact demixing cannot be expected, because nonlinear ICA has no unique solution. For more details see [10].

Figure 4: For two two-dimensional nonlinear mixing functions (upper row, (z + a)², and lower row, √(z + a), with complex variable z), the sources, mixtures, and recovered sources are shown. The mixing function is not completely inverted, but the sources are recovered recognizably.

4 Discussion

Although our sample-based approach is intuitively straightforward, its implementation has two drawbacks: (1) one has to be cautious of samples that are close together, because they lead to unbounded gradients; and (2) all samples must be considered when computing the force on a data point, which makes the approach computation intensive. However, in [10, 7] approximations are proposed that reduce the computational complexity of the approach.

In this paper, we have presented simulations showing the generality and power of our sample-based approach to unsupervised learning problems, and have also proven two important properties of the approach: (1) with certain assumptions, the approach will find the correct solution; (2) with an unconstrained model, the expected solution of our approach is the ML solution. In conclusion, our sample-based approach can be applied to unsupervised learning of complex models where ML does not work, and our method avoids the drawbacks of PDE approaches.

Acknowledgments

We thank Geoffrey Hinton for inspirational suggestions regarding this work.
The work was supported by the Deutsche Forschungsgemeinschaft (Ho 1749/1-1), McDonnell-Pew award 97-18, and NSF award IBN-9873492.

References

[1] P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel. The Helmholtz machine. Neural Computation, 7(5):889-904, 1995.

[2] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.

[3] D. Erdogmus and J. C. Principe. Comparison of entropy and mean square error criteria in adaptive system training using higher order statistics. In P. Pajunen and J. Karhunen, editors, Proceedings of the Second International Workshop on Independent Component Analysis and Blind Signal Separation, Helsinki, Finland, pages 75-80. Otamedia, Espoo, Finland, ISBN: 951-22-5017-9, 2000.

[4] B. S. Everitt. An Introduction to Latent Variable Models. Chapman and Hall, 1984.

[5] Z. Ghahramani and G. E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, Dept. of Comp. Science, 1996.

[6] Z. Ghahramani and G. E. Hinton. Hierarchical non-linear factor analysis and topographic maps. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems 10, pages 486-492. MIT Press, 1998.

[7] A. Gray and A. W. Moore. 'N-body' problems in statistical learning. In T. K. Leen, T. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, 2001. In this proceeding.

[8] G. E. Hinton and T. J. Sejnowski. Learning and relearning in Boltzmann machines. In Parallel Distributed Processing, volume 1, pages 282-317. MIT Press, 1986.

[9] G. E. Hinton and T. J. Sejnowski. Introduction. In G. E. Hinton and T. J.
Sejnowski, editors, Unsupervised Learning: Foundations of Neural Computation, pages VII-XVI. The MIT Press, Cambridge, MA, London, England, 1999.

[10] S. Hochreiter and M. C. Mozer. An electric field approach to independent component analysis. In P. Pajunen and J. Karhunen, editors, Proceedings of the Second International Workshop on Independent Component Analysis and Blind Signal Separation, Helsinki, Finland, pages 45-50. Otamedia, Espoo, Finland, ISBN: 951-22-5017-9, 2000.

[11] A. Hyvärinen. Survey on independent component analysis. Neural Computing Surveys, 2:94-128, 1999.

[12] G. C. Marques and L. B. Almeida. Separation of nonlinear mixtures using pattern repulsion. In J.-F. Cardoso, C. Jutten, and P. Loubaton, editors, Proceedings of the First International Workshop on Independent Component Analysis and Signal Separation, Aussois, France, pages 277-282, 1999.

[13] J. C. Principe and D. Xu. Information-theoretic learning using Renyi's quadratic entropy. In J.-F. Cardoso, C. Jutten, and P. Loubaton, editors, Proceedings of the First International Workshop on Independent Component Analysis and Signal Separation, Aussois, France, pages 407-412, 1999.

[14] Y. Zhao and C. G. Atkeson. Implementing projection pursuit learning. IEEE Transactions on Neural Networks, 7(2):362-373, 1996.
", "award": [], "sourceid": 1810, "authors": [{"given_name": "Sepp", "family_name": "Hochreiter", "institution": null}, {"given_name": "Michael", "family_name": "Mozer", "institution": null}]}