{"title": "Learning a Hierarchical Belief Network of Independent Factor Analyzers", "book": "Advances in Neural Information Processing Systems", "page_first": 361, "page_last": 367, "abstract": null, "full_text": "Learning a  Hierarchical  Belief Network of \n\nIndependent  Factor  Analyzers \n\nH.  Attias* \n\nhagai@gatsby.ucl.ac.uk \n\nSloan Center for  Theoretical Neurobiology,  Box 0444 \n\nUniversity of California at San Francisco \n\nSan  Francisco, CA  94143-0444 \n\nAbstract \n\nMany  belief  networks  have  been  proposed  that  are  composed  of \nbinary units.  However,  for  tasks such  as  object  and speech  recog(cid:173)\nnition  which  produce real-valued  data,  binary network models  are \nusually inadequate.  Independent component analysis  (ICA)  learns \na  model  from  real  data,  but  the  descriptive  power  of this  model \nis  severly  limited.  We  begin  by  describing the  independent  factor \nanalysis  (IFA)  technique,  which  overcomes some of the limitations \nof ICA.  We  then  create  a  multilayer  network  by  cascading single(cid:173)\nlayer  IFA  models.  At  each  level,  the  IFA  network  extracts  real(cid:173)\nvalued  latent  variables  that  are  non-linear  functions  of the  input \ndata  with  a  highly  adaptive  functional  form,  resulting  in  a  hier(cid:173)\narchical  distributed  representation  of  these  data.  Whereas  exact \nmaximum-likelihood  learning of the network  is  intractable,  we  de(cid:173)\nrive an algorithm that maximizes a  lower  bound on the likelihood, \nbased on a  variational approach. \n\n1 \n\nIntroduction \n\nAn  intriguing  hypothesis  for  how  the  brain  represents  incoming  sensory  informa(cid:173)\ntion holds that it constructs a  hierarchical probabilistic model of the observed data. \nThe model  parameters are  learned  in  an  unsupervised  manner  by  maximizing  the \nlikelihood  that  these  data  are  generated  by  the  model.  A  multilayer  belief  net(cid:173)\nwork  is  a  realization  of such  a  model.  Many  belief networks  have  been  proposed \nthat  are  composed  of binary  units.  The  hidden  units  in  such  networks  represent \nlatent variables that explain different features of the data, and whose relation to the \n\n\u00b7Current  address:  Gatsby  Computational  Neuroscience  Unit,  University  College  Lon(cid:173)\n\ndon,  17  Queen Square,  London WC1N  3AR,  U .K. \n\n\f362 \n\nH.  Attias \n\ndata is  highly  non-linear.  However, for  tasks such as object and speech recognition \nwhich  produce real-valued  data,  the models  provided by binary networks are often \ninadequate.  Independent component analysis (ICA)  learns a generative model from \nreal  data,  and  extracts  real-valued  latent  variables  that  are  mutually  statistically \nindependent.  Unfortunately,  this model is restricted  to a single layer and the latent \nvariables  are simple linear functions  of the data;  hence,  underlying degrees of free(cid:173)\ndom  that are  non-linear cannot be extracted by  ICA.  In addition, the requirement \nof equal  numbers of hidden and observed  variables and the assumption of noiseless \ndata render the ICA  model inappropriate. \n\nThis  paper begins by  introducing the independent factor  analysis  (IFA)  technique. \nIFA  is  an  extension  of ICA,  that  allows  different  numbers  of latent  and  observed \nvariables  and  can  handle  noisy  data.  The  paper  proceeds  to  create  a  multilayer \nnetwork by  cascading single-layer IFA  models.  
The resulting generative model produces a hierarchical distributed representation of the input data, where the latent variables extracted at each level are non-linear functions of the data with a highly adaptive functional form. Whereas exact maximum-likelihood (ML) learning in this network is intractable due to the difficulty of computing the posterior density over the hidden layers, we present an algorithm that maximizes a lower bound on the likelihood. This algorithm is based on a general variational approach that we develop for the IFA network. \n\n2 Independent Component and Independent Factor Analysis \n\nAlthough the concept of ICA originated in the field of signal processing, it is actually a density estimation problem. Given an L'×1 observed data vector y, the task is to explain it in terms of an L×1 vector x of unobserved 'sources' that are mutually statistically independent. The relation between the two is assumed linear, \n\ny = Hx + u,    (1) \n\nwhere H is the 'mixing' matrix; the noise vector u is usually assumed zero-mean Gaussian with a covariance matrix Λ. In the context of blind source separation [1]-[4], the source signals x should be recovered from the mixed noisy signals y with no knowledge of H, Λ, or the source densities p(x_i), hence the term 'blind'. In the density estimation approach, one regards (1) as a probabilistic generative model for the observed p(y), with the mixing matrix, noise covariance, and source densities serving as model parameters. In principle, these parameters should be learned by ML, followed by inferring the sources via a MAP estimator. \n\nFor Gaussian sources, (1) is the factor analysis model, for which an EM algorithm exists and the MAP estimator is linear. The problem becomes interesting and more difficult for non-Gaussian sources. Most ICA algorithms focus on square (L' = L), noiseless (y = Hx) mixing, and fix p(x_i) using prior knowledge (but see [5] for the case of noisy mixing with a fixed Laplacian source prior). Learning H occurs via gradient-ascent maximization of the likelihood [1]-[4]. Source density parameters can also be adapted in this way [3],[4], but the resulting gradient-ascent learning is rather slow. This state of affairs presented a problem to ICA algorithms, since the ability to learn arbitrary source densities that are not known in advance is crucial: using an inaccurate p(x_i) often leads to a bad H estimate and failed separation. \n\nThis problem was recently solved by introducing the IFA technique [6]. IFA employs a semi-parametric model of the source densities, which allows learning them (as well as the mixing matrix) using expectation-maximization (EM). Specifically, p(x_i) is described as a mixture of Gaussians (MOG), where the mixture components are labeled by s = 1,...,n_i and have means μ_{i,s} and variances γ_{i,s}: p(x_i) = Σ_s p(s_i = s) G(x_i - μ_{i,s}, γ_{i,s}).¹ The mixing proportions are parametrized using the softmax form: p(s_i = s) = exp(a_{i,s}) / Σ_{s'} exp(a_{i,s'}). Beyond noiseless ICA, an EM algorithm for the noisy case (1) with any L, L' was also derived in [6] using the MOG description;² a small generative sketch of this model is given below.
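The following minimal sketch, which is not part of the original paper, draws samples from the noisy generative model (1) with MOG source priors; all dimensions and parameter values are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative dimensions (not from the paper): L = 2 sources, L' = 3 observations,
    # n_i = 3 mixture states per source.
    L, L_obs, n_states = 2, 3, 3

    # Source-density parameters: softmax weights a_{i,s}, means mu_{i,s}, variances gamma_{i,s}.
    a = rng.normal(size=(L, n_states))
    mu = rng.normal(size=(L, n_states))
    gamma = rng.uniform(0.1, 1.0, size=(L, n_states))

    H = rng.normal(size=(L_obs, L))      # mixing matrix
    Lam = 0.05 * np.eye(L_obs)           # observation-noise covariance Lambda

    def sample_ifa(T):
        """Draw T samples of y from y = Hx + u with independent MOG sources x_i."""
        p_s = np.exp(a) / np.exp(a).sum(axis=1, keepdims=True)   # softmax mixing proportions p(s_i = s)
        x = np.empty((T, L))
        for i in range(L):
            s = rng.choice(n_states, size=T, p=p_s[i])           # draw a mixture state per sample
            x[:, i] = rng.normal(mu[i, s], np.sqrt(gamma[i, s])) # x_i ~ G(x_i - mu_{i,s}, gamma_{i,s})
        u = rng.multivariate_normal(np.zeros(L_obs), Lam, size=T)
        return x @ H.T + u

    y = sample_ifa(1000)
    print(y.shape)   # (1000, 3)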
This algorithm learns a probabilistic model p(y | W) for the observed data, parametrized by W = (H, Λ, {a_{i,s}, μ_{i,s}, γ_{i,s}}). A graphical representation of this model is provided by Fig. 1, if we set n = 1 and y^0_j = b^1_{j,s} = v^1_{j,s} = 0. \n\n¹Throughout this paper, G(x, Σ) = |2πΣ|^{-1/2} exp(-x^T Σ^{-1} x / 2). \n\n²However, for many sources the E-step becomes intractable, since the number Π_i n_i of source state configurations s = (s_1,...,s_L) depends exponentially on L. Such cases are treated in [6] using a variational approximation. \n\n3 Hierarchical Independent Factor Analysis \n\nIn the following we develop a multilayer generalization of IFA by cascading duplicates of the generative model introduced in [6]. Each layer n = 1,...,N is composed of two sublayers: a source sublayer which consists of the units x^n_i, i = 1,...,L_n, and an output sublayer which consists of y^n_j, j = 1,...,L'_n. The two are linearly related via y^n = H^n x^n + u^n as in (1); u^n is a Gaussian noise vector with covariance Λ^n. The nth-layer source x^n_i is described by a MOG density model with parameters a^n_{i,s}, μ^n_{i,s}, and γ^n_{i,s}, in analogy to the IFA sources above. \n\nThe important step is to determine how layer n depends on the previous layers. We choose to introduce a dependence of the ith source of layer n only on the ith output of layer n-1. Notice that matching L_n = L'_{n-1} is now required. This dependence is implemented by making the means and mixture proportions of the Gaussians which compose p(x^n_i) dependent on y^{n-1}_i. Specifically, we make the replacements a^n_{i,s} → a^n_{i,s} + b^n_{i,s} y^{n-1}_i and μ^n_{i,s} → μ^n_{i,s} + v^n_{i,s} y^{n-1}_i. The resulting joint density of layer n, conditioned on layer n-1, is \n\np(s^n, x^n, y^n | y^{n-1}, W^n) = [ Π_{i=1}^{L_n} p(s^n_i | y^{n-1}_i) p(x^n_i | s^n_i, y^{n-1}_i) ] p(y^n | x^n),    (2) \n\nwhere W^n are the parameters of layer n and \n\np(s^n_i = s | y^{n-1}_i) = exp(a^n_{i,s} + b^n_{i,s} y^{n-1}_i) / Σ_{s'} exp(a^n_{i,s'} + b^n_{i,s'} y^{n-1}_i), \np(x^n_i | s^n_i = s, y^{n-1}_i) = G(x^n_i - μ^n_{i,s} - v^n_{i,s} y^{n-1}_i, γ^n_{i,s}). \n\nThe full model joint density is given by the product of (2) over n = 1,...,N (setting y^0 = 0). A graphical representation of layer n of the hierarchical IFA network is given in Fig. 1. All units are hidden except y^N. \n\n[Figure 1: Layer n of the hierarchical IFA generative model.] \n\nTo gain some insight into our network, we examine the relation between the nth-layer source x^n_i and the (n-1)th-layer output y^{n-1}_i. This relation is probabilistic and is determined by the conditional density p(x^n_i | y^{n-1}_i) = Σ_{s^n_i} p(s^n_i | y^{n-1}_i) p(x^n_i | s^n_i, y^{n-1}_i). Notice from (2) that this is a MOG density. Its y^{n-1}_i-dependent mean is given by \n\nx̄^n_i = f^n_i(y^{n-1}_i) = Σ_s p(s^n_i = s | y^{n-1}_i) (μ^n_{i,s} + v^n_{i,s} y^{n-1}_i),    (3) \n\nand is a non-linear function of y^{n-1}_i due to the softmax form of p(s^n_i | y^{n-1}_i); a small numerical sketch of this non-linearity is given below.
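To illustrate (3), here is a small sketch, not taken from the paper, that evaluates the conditional mean f for a single unit with three mixture states; each state contributes the oriented line segment mu_s + v_s*y, and the softmax gates p(s | y) blend the segments smoothly. All parameter values are arbitrary.

    import numpy as np

    # Adaptive non-linearity of (3): f(y) = sum_s p(s|y) (mu_s + v_s*y),
    # with softmax gates p(s|y) proportional to exp(a_s + b_s*y).
    # Parameter values below are illustrative, not from the paper.
    a = np.array([0.0, 0.0, 0.0])     # softmax offsets a_s
    b = np.array([-4.0, 0.0, 4.0])    # softmax slopes b_s: each state dominates a range of y
    mu = np.array([-1.0, 0.0, 1.0])   # mean offsets mu_s
    v = np.array([0.2, 1.5, 0.2])     # mean slopes v_s: the oriented line segments

    def f(y):
        """Conditional mean E[x | y]: a smooth join of the line segments mu_s + v_s*y."""
        logits = a + b * y
        p = np.exp(logits - logits.max())
        p /= p.sum()                   # p(s | y)
        return np.dot(p, mu + v * y)

    for y in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(y, round(f(y), 3))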
By adjusting the parameters, the function f^n_i can assume a very wide range of forms: suppose that for state s^n_i = s, a^n_{i,s} and b^n_{i,s} are set so that p(s^n_i = s | y^{n-1}_i) is significant only in a small, continuous range of y^{n-1}_i values, with different ranges associated with different s's. In this range, f^n_i will be dominated by the linear term μ^n_{i,s} + v^n_{i,s} y^{n-1}_i. Hence, a desired f^n_i can be produced by placing oriented line segments at appropriate points above the y^{n-1}_i-axis, then smoothly joining them together by the p(s^n_i | y^{n-1}_i). Using the algorithm below, the optimal form of f^n_i will be learned from the data. Therefore, our model describes the data y^N_j as a potentially highly complex function of the top-layer sources, produced by repeated application of linear mixing followed by a non-linearity, with noise allowed at each stage. \n\n4 Learning and Inference by Variational EM \n\nThe need for summing over an exponentially large number of source state configurations (s^n_1,...,s^n_{L_n}), and integrating over the softmax functions p(s^n_i | y^{n-1}_i), makes exact learning intractable in our network. Thus, approximations must be made. In the following we develop a variational approach, in the spirit of [8], to hierarchical IFA. We begin, following the approach of [7] to EM, by bounding the log-likelihood from below: L = log p(y^N) ≥ Σ_n { E log p(y^n | x^n) + Σ_i [ E log p(x^n_i | s^n_i, y^{n-1}_i) + E log p(s^n_i | y^{n-1}_i) ] } - E log q, where E denotes averaging over the hidden layers using an arbitrary posterior q = q(s^{1...N}, x^{1...N}, y^{1...N-1} | y^N). In exact EM, q at each iteration is the true posterior, parametrized by W^{1...N} from the previous iteration. In variational EM, q is chosen to have a form which makes learning tractable, and is parametrized by a separate set of parameters V^{1...N}. These are optimized to bring q as close to the true posterior as possible. \n\nE-step. We use a variational posterior that is factorized across layers. Within layer n it has the form \n\nq(s^n, x^n, y^n | V^n) = [ Π_{i=1}^{L_n} ν^n_{i,s_i} ] G(z^n - ρ^n, Σ^n),    (4) \n\nfor n < N, where z^n = (x^n, y^n)^T stacks the sources and outputs of layer n, and q(s^N, x^N | V^N) = [ Π_i ν^N_{i,s_i} ] G(x^N - ρ^N, Σ^N). The variational parameters V^n = (ρ^n, Σ^n, {ν^n_{i,s}}) depend on the data y^N. The full N-layer posterior is simply a product of (4) over n. Hence, given the data, the nth-layer sources and outputs are jointly Gaussian, whereas the states s^n_i are independent.³ \n\nEven with the variational posterior (4), the term E log p(s^n_i | y^{n-1}_i) in the lower bound cannot be calculated analytically, since it involves integration over the softmax function. Instead, we calculate yet a lower bound on this term. Let c_s = a_s + b_s y^{n-1} and drop the unit and layer indices i, n; then log p(s | y) = -log(1 + e^{-c_s} Σ_{s'≠s} e^{c_{s'}}). Borrowing an idea from [8], we multiply and divide by e^{η_s c_s} under the logarithm sign and use Jensen's inequality to get E log p(s | y) ≥ -η_s E c_s - log E[ e^{-η_s c_s} + e^{-(1+η_s) c_s} Σ_{s'≠s} e^{c_{s'}} ]; a quick numerical check of this inequality is sketched below.
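As a sanity check, the following sketch, which is not part of the paper, verifies the inequality above by Monte Carlo for a scalar previous-layer output y drawn from its Gaussian variational posterior and a three-state softmax; all numbers are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)

    # Check E log p(s|y) >= -eta*E[c_s] - log E[exp(-eta*c_s) + exp(-(1+eta)*c_s) * sum_{s'!=s} exp(c_{s'})]
    # with c_{s'} = a_{s'} + b_{s'}*y and y ~ N(rho, var). Values below are arbitrary.
    a = np.array([0.5, -0.3, 0.1])
    b = np.array([1.0, -2.0, 0.5])
    rho, var = 0.3, 0.7
    s, eta = 0, 0.4

    y = rng.normal(rho, np.sqrt(var), size=200_000)
    c = a[None, :] + b[None, :] * y[:, None]           # c_{s'}(y) for every sample
    log_p = c[:, s] - np.log(np.exp(c).sum(axis=1))    # log p(s | y)

    lhs = log_p.mean()
    others = np.delete(np.exp(c), s, axis=1).sum(axis=1)
    rhs = -eta * c[:, s].mean() - np.log(np.mean(np.exp(-eta * c[:, s]) + np.exp(-(1.0 + eta) * c[:, s]) * others))
    print(lhs, rhs, lhs >= rhs)                        # the bound holds up to Monte Carlo noise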
This results in a bound that can be calculated in closed form: \n\nE log p(s^n_i = s | y^{n-1}_i) ≥ -ν^n_s η^n_s c̄^n_s - ν^n_s log( e^{f^n_s} + Σ_{s'≠s} e^{f^n_{s,s'}} ) ≡ F^n_{i,s},    (5) \n\nwhere c̄^n_s = a^n_s + b^n_s ρ^{n-1}_y, f^n_s = -η^n_s c̄^n_s + (η^n_s b^n_s)^2 Σ^{n-1}_{yy} / 2, f^n_{s,s'} = -(1 + η^n_s) c̄^n_s + c̄^n_{s'} + [(1 + η^n_s) b^n_s - b^n_{s'}]^2 Σ^{n-1}_{yy} / 2, and the subscript i is omitted. We also defined ρ^n = (ρ^n_x, ρ^n_y)^T, and similarly Σ^n_{xx}, Σ^n_{yy}, Σ^n_{xy} = Σ^{nT}_{yx} are the subblocks of Σ^n. Since (5) holds for arbitrary η^n_{i,s}, the latter are treated as additional variational parameters which are optimized to tighten this bound.⁴ \n\n³It is easy to introduce more structure into (4) by allowing the means ρ^n_i to depend on s^n_i, and the covariances Σ^n_{ij} to depend on s^n_i, s^n_j, thus making the approximation more accurate (but more complex) while maintaining tractability. \n\n⁴An alternative approach to handling E log p(s^n_i | y^{n-1}_i) is to approximate the required integral by, e.g., the maximum value of the integrand, possibly including Gaussian corrections. The resulting approximation is simpler than (5); however, it is no longer guaranteed to bound the log-likelihood from below. \n\nTo optimize the variational parameters V^{1...N}, we equate the gradient of the lower bound on L to zero and obtain, for the Gaussian part of layer n, a pair of coupled fixed-point equations for the mean ρ^n and covariance Σ^n: \n\n[ (H^T Λ^{-1} H)^n + A^n    -(H^T Λ^{-1})^n ; -(Λ^{-1} H)^n    (Λ^{-1})^n + B^{n+1} ] ρ^n = r^n,    (6) \n\ntogether with an analogous equation of the same block structure determining Σ^n,    (7) \n\nwhere A^n_{ij} = Σ_s (ν_{i,s}/γ_{i,s})^n δ_{ij}, B^n_{ij} = Σ_s (ν_{i,s} v^2_{i,s}/γ_{i,s})^n δ_{ij}, α^n_i = Σ_s (ν_{i,s} μ_{i,s}/γ_{i,s})^n, and β^n_i = Σ_s (ν_{i,s} μ_{i,s} v_{i,s}/γ_{i,s})^n (all parameters within (···)^n belong to layer n). The vector r^n collects the linear terms, which involve α^n, β^{n+1}, the neighbouring-layer means ρ^{n-1}_y and ρ^{n+1}_x, and the derivatives of F^{n+1}_{i,s} (5) with respect to ρ^n, summed over s. For the state posteriors we have \n\nν^n_s = (1/Z^n) exp{ -(1/2) log γ^n_s - [ (ρ^n_x - μ^n_s - v^n_s ρ^{n-1}_y)^2 + Σ^n_{xx} + (v^n_s)^2 Σ^{n-1}_{yy} ] / (2γ^n_s) + ∂F^n_s/∂ν^n_s },    (8) \n\nwhere the unit subscript i is omitted (i.e., Σ^n_{xx} = Σ^n_{xx,ii}), and Z^n = Z^n_i is set such that Σ_s ν^n_s = 1. A simple modification of these equations is required for layer n = N. \n\nThe optimal V^{1...N} are obtained by solving the fixed-point equations (6)-(8) iteratively for each data vector y^N, keeping the generative parameters W^{1...N} fixed. Notice that these equations couple layer n to layers n ± 1. The additional parameters η^n_{i,s} are adjusted using gradient ascent on F^n_{i,s}. Once learning is complete, the inference problem is solved, since the MAP estimate of the hidden unit values given the data is readily available from ρ^n_i and ν^n_{i,s}. \n\nM-step. In terms of the variational parameters obtained in the E-step, the new generative parameters are given by \n\nH^n = (ρ^n_y ρ^{nT}_x + Σ^n_{yx}) (ρ^n_x ρ^{nT}_x + Σ^n_{xx})^{-1},    Λ^n = ρ^n_y ρ^{nT}_y + Σ^n_{yy} - H^n (ρ^n_x ρ^{nT}_y + Σ^n_{xy}),    (9) \n\nγ^n_s = (1/ν^n_s) [ (ρ^n_x - μ^n_s - v^n_s ρ^{n-1}_y)^2 + Σ^n_{xx} + (v^n_s)^2 Σ^{n-1}_{yy} ] ν^n_s,    (10) \n\ntogether with a corresponding update for the pair (μ^n_s, v^n_s), omitting the subscript i as in (8); all updates are slightly modified for layer N. In batch mode, averaging over the data is implied and the ν^n_s do not cancel out. Finally, the softmax parameters a^n_{i,s}, b^n_{i,s} are adapted by gradient ascent on the bound (5); a small numerical sketch of update (9) is given below.
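For concreteness, here is a minimal sketch, not from the paper, of the closed-form M-step update (9) for the mixing matrix and noise covariance of a single layer, given E-step moments for one data vector; in batch mode the second moments would be averaged over the data, and all dimensions and values below are placeholders.

    import numpy as np

    rng = np.random.default_rng(2)

    # Placeholder E-step moments for one layer: variational means rho_x, rho_y and
    # covariance subblocks Sxx, Syy, Sxy (Syx = Sxy^T). Values are illustrative only.
    Lx, Ly = 2, 3
    rho_x, rho_y = rng.normal(size=Lx), rng.normal(size=Ly)
    Sxx = 0.1 * np.eye(Lx)
    Syy = 0.1 * np.eye(Ly)
    Sxy = np.zeros((Lx, Ly))

    # Second moments under the Gaussian posterior.
    Exx = np.outer(rho_x, rho_x) + Sxx        # rho_x rho_x^T + Sigma_xx
    Eyx = np.outer(rho_y, rho_x) + Sxy.T      # rho_y rho_x^T + Sigma_yx
    Eyy = np.outer(rho_y, rho_y) + Syy        # rho_y rho_y^T + Sigma_yy
    Exy = np.outer(rho_x, rho_y) + Sxy        # rho_x rho_y^T + Sigma_xy

    H_new = Eyx @ np.linalg.inv(Exx)          # update (9) for H^n
    Lam_new = Eyy - H_new @ Exy               # update (9) for Lambda^n
    print(H_new.shape, Lam_new.shape)         # (3, 2) (3, 3)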
5 Discussion \n\nThe hierarchical IFA network presented here constitutes a quite general framework for learning and inference using real-valued probabilistic models that are strongly non-linear but highly adaptive. Notice that this network includes both continuous (x^n_i, y^n_i) and binary (s^n_i) units, and can thus extract both types of latent variables. In particular, the uppermost units s^N_i may represent class labels in classification tasks. The models proposed in [9]-[11] can be viewed as special cases where x^n_i is a prescribed deterministic function (e.g., a rectifier) of the previous outputs y^{n-1}_j: in the IFA network, a deterministic (but still adaptive) dependence can be obtained by setting the variances γ^n_{i,s} = 0. Note that the source x^n_i in such a case assumes only the values μ^n_{i,s}, and thus corresponds to a discrete latent variable. \n\nThe learning and inference algorithm presented here is based on the variational approach. Unlike variational approximations in other belief networks [8],[10], which use a completely factorized approximation, the structure of the hierarchical IFA network facilitates using a variational posterior that allows correlations among hidden units occupying the same layer, thus providing a more accurate description of the true posterior. It would be interesting to compare the performance of our variational algorithm with the belief propagation algorithm [12] which, when adapted to the densely connected IFA network, would also be an approximation. Markov chain Monte Carlo methods, including the more recent slice sampling procedure used in [11], would become very slow as the network size increases. \n\nIt is possible to consider a more general non-linear network along the lines of hierarchical IFA. Notice from (2) that, given the previous-layer output y^{n-1}, the mean output of the next layer is ȳ^n_i = Σ_j H^n_{ij} f^n_j(y^{n-1}_j) (see (3)), i.e., a linear mixing preceded by a non-linear function operating on each output component separately. However, if we eliminate the sources x^n_j, replace the individual source states s^n_j by collective states s^n, and allow the linear transformation to depend on s^n, we arrive at the following model: p(s^n = s | y^{n-1}) ∝ exp(a^n_s + b^{nT}_s y^{n-1}), p(y^n | s^n = s, y^{n-1}) = G(y^n - h^n_s - H^n_s y^{n-1}, Λ^n). Now we have ȳ^n = Σ_s p(s^n = s | y^{n-1}) (h^n_s + H^n_s y^{n-1}) ≡ F(y^{n-1}), which is a more general non-linearity. \n\nFinally, the blocks {y^n, x^n, s^n | y^{n-1}} (Fig. 1), or alternatively the blocks {y^n, s^n | y^{n-1}} described above, can be connected not only vertically (as in this paper) and horizontally (creating layers with multiple blocks), but in any directed acyclic graph architecture, with the variational EM algorithm extended accordingly. \n\nAcknowledgements \n\nI thank V. de Sa for helpful discussions. Supported by the Office of Naval Research (N00014-94-1-0547), NIDCD (R01-02260), and the Sloan Foundation. \n\nReferences \n\n[1] Bell, A.J. and Sejnowski, T.J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation 7, 1129-1159.
\n\n[2] Cardoso, J.-F. (1997). Infomax and maximum likelihood for source separation. IEEE Signal Processing Letters 4, 112-114. \n\n[3] Pearlmutter, B.A. and Parra, L.C. (1997). Maximum likelihood blind source separation: A context-sensitive generalization of ICA. Advances in Neural Information Processing Systems 9 (Ed. Mozer, M.C. et al), 613-619. MIT Press. \n\n[4] Attias, H. and Schreiner, C.E. (1998). Blind source separation and deconvolution: the dynamic component analysis algorithm. Neural Computation 10, 1373-1424. \n\n[5] Lewicki, M.S. and Sejnowski, T.J. (1998). Learning nonlinear overcomplete representations for efficient coding. Advances in Neural Information Processing Systems 10 (Ed. Jordan, M.I. et al). MIT Press. \n\n[6] Attias, H. (1999). Independent factor analysis. Neural Computation, in press. \n\n[7] Neal, R.M. and Hinton, G.E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models (Ed. Jordan, M.I.), Kluwer Academic Press. \n\n[8] Saul, L.K., Jaakkola, T., and Jordan, M.I. (1996). Mean field theory of sigmoid belief networks. Journal of Artificial Intelligence Research 4, 61-76. \n\n[9] Frey, B.J. (1997). Continuous sigmoidal belief networks trained using slice sampling. Advances in Neural Information Processing Systems 9 (Ed. Mozer, M.C. et al). MIT Press. \n\n[10] Frey, B.J. and Hinton, G.E. (1999). Variational learning in non-linear Gaussian belief networks. Neural Computation, in press. \n\n[11] Ghahramani, Z. and Hinton, G.E. (1998). Hierarchical non-linear factor analysis and topographic maps. Advances in Neural Information Processing Systems 10 (Ed. Jordan, M.I. et al). MIT Press. \n\n[12] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, CA.", "award": [], "sourceid": 1631, "authors": [{"given_name": "Hagai", "family_name": "Attias", "institution": null}]}