{"title": "Gaussian Fields for Approximate Inference in Layered Sigmoid Belief Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 393, "page_last": 399, "abstract": null, "full_text": "Correctness of belief propagation in Gaussian \ngraphical models of arbitrary topology \n\nYair Weiss \nComputer Science Division \nUC Berkeley, 485 Soda Hall \nBerkeley, CA 94720-1776 \nPhone: 510-642-5029 \nyweiss@cs.berkeley.edu \n\nWilliam T. Freeman \nMitsubishi Electric Research Lab \n201 Broadway \nCambridge, MA 02139 \nPhone: 617-621-7527 \nfreeman@merl.com \n\nAbstract \n\nLocal \"belief propagation\" rules of the sort proposed by Pearl [15] are \nguaranteed to converge to the correct posterior probabilities in singly \nconnected graphical models. Recently, a number of researchers have \nempirically demonstrated good performance of \"loopy belief propagation\", \nthat is, using these same rules on graphs with loops. Perhaps the most dramatic \ninstance is the near Shannon-limit performance of \"Turbo codes\", whose \ndecoding algorithm is equivalent to loopy belief propagation. \nExcept for the case of graphs with a single loop, there has been little \ntheoretical understanding of the performance of loopy propagation. Here we \nanalyze belief propagation in networks with arbitrary topologies when \nthe nodes in the graph describe jointly Gaussian random variables. We \ngive an analytical formula relating the true posterior probabilities with \nthose calculated using loopy propagation. We give sufficient conditions \nfor convergence and show that when belief propagation converges it gives \nthe correct posterior means for all graph topologies, not just networks \nwith a single loop. \nThe related \"max-product\" belief propagation algorithm finds the \nmaximum posterior probability estimate for singly connected networks. 
We \nshow that, even for non-Gaussian probability distributions, the \nconvergence points of the max-product algorithm in loopy networks are \nmaxima over a particular large local neighborhood of the posterior \nprobability. These results help clarify the empirical performance results and \nmotivate using the powerful belief propagation algorithm in a broader \nclass of networks. \n\nProblems involving probabilistic belief propagation arise in a wide variety of applications, \nincluding error correcting codes, speech recognition and medical diagnosis. If the graph \nis singly connected, there exist local message-passing schemes to calculate the posterior \nprobability of an unobserved variable given the observed variables. Pearl [15] derived such \na scheme for singly connected Bayesian networks and showed that this \"belief propagation\" \nalgorithm is guaranteed to converge to the correct posterior probabilities (or \"beliefs\"). \n\nSeveral groups have recently reported excellent experimental results by running algorithms \nequivalent to Pearl's algorithm on networks with loops [8, 13, 6]. Perhaps the most dramatic \ninstance of this performance is for \"Turbo code\" [2] error correcting codes. These codes \nhave been described as \"the most exciting and potentially important development in coding \ntheory in many years\" [12] and have recently been shown [10, 11] to utilize an algorithm \nequivalent to belief propagation in a network with loops. \n\nProgress in the analysis of loopy belief propagation has been made for the case of networks \nwith a single loop [17, 18, 4, 1]. For these networks, it can be shown that (1) unless \nall the compatibilities are deterministic, loopy belief propagation will converge. 
(2) The \ndifference between the loopy beliefs and the true beliefs is related to the convergence rate \nof the messages: the faster the convergence, the more exact the approximation. (3) If \nthe hidden nodes are binary, then the loopy beliefs and the true beliefs are both maximized \nby the same assignments, although the confidence in that assignment is wrong for the loopy \nbeliefs. \n\nIn this paper we analyze belief propagation in graphs of arbitrary topology, for nodes \ndescribing jointly Gaussian random variables. We give an exact formula relating the correct \nmarginal posterior probabilities with the ones calculated using loopy belief propagation. \nWe show that if belief propagation converges, then it will give the correct posterior means \nfor all graph topologies, not just networks with a single loop. We show that the \ncovariance estimates will generally be incorrect but present a relationship between the error in \nthe covariance estimates and the convergence speed. For Gaussian or non-Gaussian \nvariables, we show that the \"max-product\" algorithm, which calculates the MAP estimate in \nsingly connected networks, only converges to points that are maxima over a particular large \nneighborhood of the posterior probability of loopy networks. \n\n1 Analysis \n\nTo simplify the notation, we assume the graphical model has been preprocessed into an \nundirected graphical model with pairwise potentials. Any graphical model can be \nconverted into this form, and running belief propagation on the pairwise graph is equivalent \nto running belief propagation on the original graph [18]. We assume each node $x_i$ has a \nlocal observation $y_i$. 
In each iteration of belief propagation, each node $x_i$ sends a message \nto each neighboring $x_j$ that is based on the messages it received from the other neighbors, \nits local observation $y_i$ and the pairwise potentials $\\Psi_{ij}(x_i, x_j)$ and $\\Psi_{ii}(x_i, y_i)$. We assume \nthe message-passing occurs in parallel. \n\nThe idea behind the analysis is to build an unwrapped tree. The unwrapped tree is the \ngraphical model which belief propagation is solving exactly when one applies the belief \npropagation rules in a loopy network [9, 20, 18]. It is constructed by maintaining the same \nlocal neighborhood structure as the loopy network but nodes are replicated so there are no \nloops. The potentials and the observations are replicated from the loopy graph. Figure 1 (a) \nshows an unwrapped tree for the diamond shaped graph in (b). By construction, the belief \nat the root node $\\tilde{x}_1$ is identical to that at node $x_1$ in the loopy graph after four iterations of \nbelief propagation. Each node has a shaded observed node attached to it, omitted here for \nclarity. \n\nBecause the original network represents jointly Gaussian variables, so will the unwrapped \ntree. Since it is a tree, belief propagation is guaranteed to give the correct answer for the \nunwrapped graph. We can thus use Gaussian marginalization formulae to calculate the \ntrue mean and variances in both the original and the unwrapped networks. In this way, we \ncalculate the accuracy of belief propagation for Gaussian networks of arbitrary topology. \n\nFigure 1: Left: A Markov network with multiple loops. Right: The unwrapped network \ncorresponding to this structure. \n\nWe assume that the joint mean is zero (the means can be added in later). The joint \ndistribution of $z = \\begin{pmatrix} x \\\\ y \\end{pmatrix}$ is given by $P(z) = \\alpha e^{-\\frac{1}{2} z^T V z}$, where $V = \\begin{pmatrix} V_{xx} & V_{xy} \\\\ V_{yx} & V_{yy} \\end{pmatrix}$. 
It \nis straightforward to construct the inverse covariance matrix $V$ of the joint Gaussian that \ndescribes a given Gaussian graphical model [3]. \n\nWriting out the exponent of the joint and completing the square shows that the mean $\\mu$ of \n$x$, given the observations $y$, is given by: \n\n$$\\mu = -V_{xx}^{-1} V_{xy} y \\qquad (1)$$ \n\nand the covariance matrix $C_{x|y}$ of $x$ given $y$ is: $C_{x|y} = V_{xx}^{-1}$. We will denote by $C_{x_i|y}$ the \nith row of $C_{x|y}$ so the marginal posterior variance of $x_i$ given the data is $\\sigma^2(i) = C_{x_i|y}(i)$. \n\nWe will use a tilde for unwrapped quantities. We scan the tree in breadth first order and denote by \n$\\tilde{x}$ the vector of values in the hidden nodes of the tree when so scanned. Similarly, we denote \nby $\\tilde{y}$ the observed nodes scanned in the same order and by $\\tilde{V}_{xx}$, $\\tilde{V}_{xy}$ the inverse covariance \nmatrices. Since we are scanning in breadth first order the last nodes are the leaf nodes and \nwe denote by $L$ the number of leaf nodes. By the nature of unwrapping, $\\tilde{\\mu}(1)$ is the mean \nof the belief at node $x_1$ after $t$ iterations of belief propagation, where $t$ is the number of \nunwrappings. Similarly $\\tilde{\\sigma}^2(1) = \\tilde{C}_{x_1|y}(1)$ is the variance of the belief at node $x_1$ after $t$ \niterations. \n\nBecause the data is replicated we can write $\\tilde{y} = O y$ where $O(i, j) = 1$ if $\\tilde{y}_i$ is a replica of $y_j$ \nand 0 otherwise. Since the potentials $\\Psi(x_i, y_i)$ are replicated, we can write $\\tilde{V}_{xy} O = O V_{xy}$. \nSince the $\\Psi(x_i, x_j)$ are also replicated and all non-leaf $\\tilde{x}_i$ have the same connectivity as \nthe corresponding $x_i$, we can write $\\tilde{V}_{xx} O = O V_{xx} + E$ where $E$ is zero in all but the last \n$L$ rows. When these relationships between the loopy and unwrapped inverse covariance \nmatrices are substituted into the loopy and unwrapped versions of equation 1, one obtains \nthe following expression, true for any iteration [19]: \n\n$$\\tilde{\\mu}(1) = \\mu(1) + \\tilde{C}_{x_1|y} e \\qquad (2)$$ \n\nwhere $e$ is a vector that is zero everywhere but the last $L$ components (corresponding to the \nleaf nodes). 
Our choice of the node for the root of the tree is arbitrary, so this applies to \nall nodes of the loopy network. This formula relates, for any node of a network with loops, \nthe means calculated at each iteration by belief propagation with the true posterior means. \n\nSimilarly when the relationship between the loopy and unwrapped inverse covariance \nmatrices is substituted into the loopy and unwrapped definitions of $C_{x|y}$ we can relate the \nmarginalized covariances calculated by belief propagation to the true ones [19]: \n\n$$\\tilde{\\sigma}^2(1) = \\sigma^2(1) + \\tilde{C}_{x_1|y} e_1 - \\tilde{C}_{x_1|y} e_2 \\qquad (3)$$ \n\nwhere $e_1$ is a vector that is zero everywhere but the last $L$ components while $e_2$ is equal \nto 1 for all nodes in the unwrapped tree that are replicas of $x_1$ except for $\\tilde{x}_1$. All other \ncomponents of $e_2$ are zero. \n\nFigure 2: The conditional correlation between the root node and all other nodes in the \nunwrapped tree of Fig. 1 after eight iterations. Potentials were chosen randomly. Nodes \nare presented in breadth first order so the last elements are the correlations between the root \nnode and the leaf nodes. We show that if this correlation goes to zero, belief propagation \nconverges and the loopy means are exact. Symbols plotted with a star denote correlations \nwith nodes that correspond to the node $x_1$ in the loopy graph. The sum of these correlations \ngives the correct variance of node $x_1$ while loopy propagation uses only the first correlation. \n\nFigure 2 shows $\\tilde{C}_{x_1|y}$ for the diamond network in Fig. 1. We generated random potential \nfunctions and observations and calculated the conditional correlations in the unwrapped \ntree. 
Note that the conditional correlation decreases with distance in the tree: we are \nscanning in breadth first order so the last $L$ components correspond to the leaf nodes. \nAs the number of iterations of loopy propagation is increased the size of the unwrapped \ntree increases and the conditional correlation between the leaf nodes and the root node \ndecreases. \n\nFrom equations 2-3 it is clear that if the conditional correlation between the leaf nodes and \nthe root node is zero for all sufficiently large unwrappings then (1) belief propagation \nconverges (2) the means are exact and (3) the variances may be incorrect. In practice the \nconditional correlations will not actually be equal to zero for any finite unwrapping. In [19] \nwe give a more precise statement: if the conditional correlation of the root node and the \nleaf nodes decreases rapidly enough then (1) belief propagation converges (2) the means \nare exact and (3) the variances may be incorrect. We also show sufficient conditions on the \npotentials $\\Psi(x_i, x_j)$ for the correlation to decrease rapidly enough: the rate at which the \ncorrelation decreases is determined by the ratio of off-diagonal and diagonal components \nin the quadratic form defining the potentials [19]. \n\nHow wrong will the variances be? The term $\\tilde{C}_{x_1|y} e_2$ in equation 3 is simply the sum of \nmany components of $\\tilde{C}_{x_1|y}$. Figure 2 shows these components. The correct variance is \nthe sum of all the components while the belief propagation variance approximates this sum \nwith the first (and dominant) term. Whenever there is a positive correlation between the \nroot node and other replicas of $x_1$ the loopy variance is strictly less than the true variance: \nthe loopy estimate is overconfident. 
\n\nFigure 3: (a) 25 x 25 graphical model for simulation. The unobserved nodes (unfilled) were \nconnected to their four nearest neighbors and to an observation node (filled). (b) The error \nof the estimates of loopy propagation and successive over-relaxation (SOR) as a function \nof iteration. Note that belief propagation converges much faster than SOR. \n\nNote that when the conditional correlation decreases rapidly to zero two things happen. \nFirst, the convergence is faster (because $\\tilde{C}_{x_1|y} e_1$ approaches zero faster). Second, the \napproximation error of the variances is smaller (because $\\tilde{C}_{x_1|y} e_2$ is smaller). Thus we have \nshown that, as in the single loop case, quick convergence is correlated with good approximation. \n\n2 Simulations \n\nWe ran belief propagation on the 25 x 25 2D grid of Fig. 3a. The joint probability was: \n\n(4) \n\nwhere $w_{ij} = 0$ if nodes $x_i$, $x_j$ are not neighbors and 0.01 otherwise and $w_{ii}$ was randomly \nselected to be 0 or 1 for all $i$ with probability of 1 set to 0.2. The observations $y_i$ were \nchosen randomly. This problem corresponds to an approximation problem from sparse \ndata where only 20% of the points are visible. \n\nWe found the exact posterior by solving equation 1. We also ran belief propagation and \nfound that when it converged, the calculated means were identical to the true means up \nto machine precision. Also, as predicted by the theory, the calculated variances were too \nsmall: the belief propagation estimate was overconfident. \n\nIn many applications, the solution of equation 1 by matrix inversion is intractable and \niterative methods are used. 
Figure 3 compares the error in the means as a function of iterations \nfor loopy propagation and successive over-relaxation (SOR), considered one of the best \nrelaxation methods [16]. Note that after essentially five iterations loopy propagation gives \nthe right answer while SOR requires many more. As expected from the fast convergence, the \napproximation error in the variances was quite small. The median error was 0.018. For \ncomparison the true variances ranged from 0.01 to 0.94 with a mean of 0.322. Also, the \nnodes for which the approximation error was worse were indeed the nodes that converged \nmore slowly. \n\n3 Discussion \n\nIndependently, two other groups have recently analyzed special cases of Gaussian graphical \nmodels. Frey [7] analyzed the graphical model corresponding to factor analysis and gave \nconditions for the existence of a stable fixed-point. Rusmevichientong and Van Roy [14] \nanalyzed a graphical model with the topology of turbo decoding but a Gaussian joint \ndensity. For this specific graph they gave sufficient conditions for convergence and showed \nthat the means are exact. \n\nOur main interest in the Gaussian case is to understand the performance of belief \npropagation in general networks with multiple loops. We are struck by the similarity of our results \nfor Gaussians in arbitrary networks and the results for single loops of arbitrary \ndistributions [18]. First, in single loop networks with binary nodes, loopy belief at a node and the \ntrue belief at a node are maximized by the same assignment while the confidence in that \nassignment is incorrect. In Gaussian networks with multiple loops, the mean at each node \nis correct but the confidence around that mean may be incorrect. 
Second, for both single-loop \nand Gaussian networks, fast belief propagation convergence correlates with accurate \nbeliefs. Third, in both Gaussians and discrete valued single loop networks, the statistical \ndependence between root and leaf nodes governs the convergence rate and accuracy. \n\nThe two models are quite different. Mean field approximations are exact for Gaussian \nMRFs while they work poorly in sparsely connected discrete networks with a single loop. \nThe results for the Gaussian and single-loop cases lead us to believe that similar results \nmay hold for a larger class of networks. \n\nCan our analysis be extended to non-Gaussian distributions? The basic idea applies to \narbitrary graphs and arbitrary potentials: belief propagation is performing exact inference \non a tree that has the same local neighbor structure as the loopy graph. However, the linear \nalgebra that we used to calculate exact expressions for the error in belief propagation at any \niteration holds only for Gaussian variables. \n\nWe have used a similar approach to analyze the related \"max-product\" belief propagation \nalgorithm on arbitrary graphs with arbitrary distributions [5] (both discrete and continuous \nvalued nodes). We show that if the max-product algorithm converges, the max-product \nassignment has greater posterior probability than any assignment in a particular large region \naround that assignment. While this is a weaker condition than a global maximum, it is much \nstronger than a simple local maximum of the posterior probability. \n\nThe sum-product and max-product belief propagation algorithms are fast and \nparallelizable. Due to the well known hardness of probabilistic inference in graphical models, belief \npropagation will obviously not work for arbitrary networks and distributions. Nevertheless, \na growing body of empirical evidence shows its success in many networks with loops. 
Our \nresults justify applying belief propagation in certain networks with multiple loops. This \nmay enable fast, approximate probabilistic inference in a range of new applications. \n\nReferences \n\n[1] S.M. Aji, G.B. Horn, and R.J. McEliece. On the convergence of iterative decoding on graphs \nwith a single cycle. In Proc. 1998 ISIT, 1998. \n\n[2] C. Berrou, A. Glavieux, and P. Thitimajshima. Near Shannon limit error-correcting coding and \ndecoding: Turbo codes. In Proc. IEEE International Communications Conference '93, 1993. \n\n[3] R. Cowell. Advanced inference in Bayesian networks. In M.I. Jordan, editor, Learning in \nGraphical Models. MIT Press, 1998. \n\n[4] G.D. Forney, F.R. Kschischang, and B. Marcus. Iterative decoding of tail-biting trellises. \nPreprint presented at 1998 Information Theory Workshop in San Diego, 1998. \n\n[5] W.T. Freeman and Y. Weiss. On the fixed points of the max-product algorithm. Technical \nReport 99-39, MERL, 201 Broadway, Cambridge, MA 02139, 1999. \n\n[6] W.T. Freeman and E.C. Pasztor. Learning to estimate scenes from images. In M.S. Kearns, \nS.A. Solla, and D.A. Cohn, editors, Adv. Neural Information Processing Systems 11. MIT Press, \n1999. \n\n[7] B.J. Frey. Turbo factor analysis. In Adv. Neural Information Processing Systems 12. 2000. To \nappear. \n\n[8] Brendan J. Frey. Bayesian Networks for Pattern Classification, Data Compression and Channel \nCoding. MIT Press, 1998. \n\n[9] R.G. Gallager. Low Density Parity Check Codes. MIT Press, 1963. \n\n[10] F.R. Kschischang and B.J. Frey. Iterative decoding of compound codes by probability \npropagation in graphical models. IEEE Journal on Selected Areas in Communication, 16(2):219-230, \n1998. \n\n[11] R.J. McEliece, D.J.C. MacKay, and J.F. Cheng. 
Turbo decoding as an instance of Pearl's \n'belief propagation' algorithm. IEEE Journal on Selected Areas in Communication, 16(2):140-152, \n1998. \n\n[12] R.J. McEliece, E. Rodemich, and J.F. Cheng. The Turbo decision algorithm. In Proc. 33rd \nAllerton Conference on Communications, Control and Computing, pages 366-379, Monticello, \nIL, 1995. \n\n[13] K.P. Murphy, Y. Weiss, and M.I. Jordan. Loopy belief propagation for approximate inference: \nan empirical study. In Proceedings of Uncertainty in AI, 1999. \n\n[14] P. Rusmevichientong and B. Van Roy. An analysis of Turbo decoding with Gaussian densities. \nIn Adv. Neural Information Processing Systems 12. 2000. To appear. \n\n[15] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. \nMorgan Kaufmann, 1988. \n\n[16] Gilbert Strang. Introduction to Applied Mathematics. Wellesley-Cambridge, 1986. \n\n[17] Y. Weiss. Belief propagation and revision in networks with loops. Technical Report 1616, MIT \nAI lab, 1997. \n\n[18] Y. Weiss. Correctness of local probability propagation in graphical models with loops. Neural \nComputation, to appear, 2000. \n\n[19] Y. Weiss and W.T. Freeman. Loopy propagation gives the correct posterior means for \nGaussians. Technical Report UCB.CSD-99-1046, Berkeley Computer Science Dept., 1999. \nwww.cs.berkeley.edu/~yweiss/. \n\n[20] N. Wiberg. Codes and decoding on general graphs. PhD thesis, Department of Electrical \nEngineering, U. Linkoping, Sweden, 1996. \n\nGaussian Fields for Approximate Inference \nin Layered Sigmoid Belief Networks \n\nDavid Barber* \nStichting Neurale Netwerken \nMedical Physics and Biophysics \nNijmegen University, The Netherlands \nbarberd@aston.ac.uk \n\nPeter Sollich \nDepartment of Mathematics \nKing's College, University of London \nLondon WC2R 2LS, U.K. 
\npeter.sollich@kcl.ac.uk \n\nAbstract \n\nLayered Sigmoid Belief Networks are directed graphical models \nin which the local conditional probabilities are parameterised by \nweighted sums of parental states. Learning and inference in such \nnetworks are generally intractable, and approximations need to be \nconsidered. Progress in learning these networks has been made by \nusing variational procedures. We demonstrate, however, that \nvariational procedures can be inappropriate for the equally important \nissue of inference, that is, calculating marginals of the network. \nWe introduce an alternative procedure, based on assuming that the \nweighted input to a node is approximately Gaussian distributed. \nOur approach goes beyond previous Gaussian field assumptions in \nthat we take into account correlations between parents of nodes. \nThis procedure is specialized for calculating marginals and is \nsignificantly faster and simpler than the variational procedure. \n\n1 Introduction \n\nLayered Sigmoid Belief Networks [1] are directed graphical models [2] in which \nthe local conditional probabilities are parameterised by weighted sums of parental \nstates, see fig(1). This is a graphical representation of a distribution over a set of \nbinary variables $s_i \\in \\{0, 1\\}$. Typically, one supposes that the states of the nodes \nat the bottom of the network are generated by states in previous layers. Whilst, in \nprinciple, there is no restriction on the number of nodes in any layer, typically, one \nconsiders structures similar to the \"fan out\" in fig(1) in which higher level layers \nprovide an \"explanation\" for patterns generated in lower layers. Such graphical \nmodels are attractive since they correspond to layers of information processors, of \npotentially increasing complexity. 
Unfortunately, learning and inference in such \nnetworks are generally intractable, and approximations need to be considered. Progress \nin learning has been made by using variational procedures [3, 4, 5]. However, \nanother crucial aspect remains inference [2]. That is, given some evidence (or none), \ncalculate the marginal of a variable, conditional on this evidence. This assumes \nthat we have found a suitable network from some learning procedure, and now wish \nto query this network. Whilst the variational procedure is attractive for learning, \nsince it generally provides a bound on the likelihood of the visible units, we \ndemonstrate that it may not always be equally appropriate for the inference problem. \n\n*Present Address: NCRG, Aston University, Birmingham B4 7ET, U.K. \n\nA directed graphical model defines a distribution over \na set of variables $s = (s_1 \\ldots s_n)$ that factorises into \nthe local conditional distributions, \n\n$$p(s_1 \\ldots s_n) = \\prod_{i=1}^{n} p(s_i | \\pi_i) \\qquad (1)$$ \n\nwhere $\\pi_i$ denotes the parent nodes of node $i$. In a \nlayered network, these are the nodes in the preceding \nlayer that feed into node $i$. In a sigmoid belief \nnetwork the local probabilities are defined as \n\n$$p(s_i = 1 | \\pi_i) = \\sigma\\Big(\\sum_j w_{ij} s_j + \\theta_i\\Big) = \\sigma(h_i) \\qquad (2)$$ \n\nwhere the \"field\" at node $i$ is defined as $h_i = \\sum_j w_{ij} s_j + \\theta_i$ and $\\sigma(h) = 1/(1 + e^{-h})$. \n$w_{ij}$ is the strength of the connection between node $i$ and its parent node $j$; if $j$ is \nnot a parent of $i$ we set $w_{ij} = 0$. $\\theta_i$ is a bias term that gives a parent-independent \nbias to the state of node $i$. \n\nFigure 1: A Layered Sigmoid Belief Network \n\nWe are interested in inference, in particular calculating marginals of the network \nfor cases with and without evidential nodes. 
In section (2) we describe how to \napproximate the quantities $p(s_i = 1)$ and discuss in section (2.1) why our method \ncan improve on the standard variational mean field theory. Conditional marginals, \nsuch as $p(s_i = 1 | s_j = 1, s_k = 0)$, are considered in section (3). \n\n2 Gaussian Field Distributions \n\nUnder the 0/1 coding for the variables $s_i$, the mean of a variable, $m_i$, is given by the \nprobability that it is in state 1. Using the fact from (2) that the local conditional \ndistribution of node $i$ is dependent on its parents only through its field $h_i$, we have \n\n$$m_i = p(s_i = 1) = \\langle \\sigma(h_i) \\rangle_{p(h_i)} \\qquad (3)$$ \n\nwhere we use the notation $\\langle \\cdot \\rangle_p$ to denote an average with respect to the \ndistribution $p$. If there are many parents of node $i$, a reasonable assumption is that the \ndistribution of the field $h_i$ will be Gaussian, $p(h_i) \\approx N(\\mu_i, \\sigma_i^2)$. Under this \nGaussian Field (GF) assumption, we need to work out the mean and variance, which are \ngiven by \n\n$$\\mu_i = \\langle h_i \\rangle = \\sum_j w_{ij} \\langle s_j \\rangle + \\theta_i = \\sum_j w_{ij} m_j + \\theta_i \\qquad (4)$$ \n\n$$\\sigma_i^2 = \\langle (\\Delta h_i)^2 \\rangle = \\sum_{j,k} w_{ij} w_{ik} R_{jk} \\qquad (5)$$ \n\nwhere $R_{jk} = \\langle \\Delta s_j \\Delta s_k \\rangle$. We use the notation $\\Delta(\\cdot) \\equiv (\\cdot) - \\langle \\cdot \\rangle$. \nThe diagonal terms of the node covariance matrix are $R_{ii} = m_i (1 - m_i)$. In contrast \nto previous studies, we include off diagonal terms in the calculation of $R$ [4]. From \n(5) we only need to find correlations between parents $i$ and $j$ of a node. 
These are \neasy to calculate in the layered networks that we are considering, because neither $i$ \nnor $j$ is a descendant of the other: \n\n$$R_{ij} = p(s_i = 1, s_j = 1) - m_i m_j \\qquad (6)$$ \n\n$$= \\int p(s_i = 1 | h_i) \\, p(s_j = 1 | h_j) \\, p(h_i, h_j) \\, dh - m_i m_j \\qquad (7)$$ \n\n$$= \\langle \\sigma(h_i) \\sigma(h_j) \\rangle_{p(h_i, h_j)} - m_i m_j \\qquad (8)$$ \n\nAssuming that the joint distribution $p(h_i, h_j)$ is Gaussian, we again need its mean \nand covariance, given by \n\n$$\\mu_i = \\sum_k w_{ik} m_k + \\theta_i, \\qquad \\mu_j = \\sum_k w_{jk} m_k + \\theta_j \\qquad (9)$$ \n\n$$\\Sigma_{ij} = \\langle \\Delta h_i \\Delta h_j \\rangle = \\sum_{kl} w_{ik} w_{jl} \\langle \\Delta s_k \\Delta s_l \\rangle = \\sum_{kl} w_{ik} w_{jl} R_{kl} \\qquad (10)$$ \n\nUnder this scheme, we have a closed set of equations, (4,5,8,10) for the means \n$m_i$ and covariance matrix $R_{ij}$ which can be solved by forward propagation of the \nequations. That is, we start from nodes without parents, and then consider the \nnext layer of nodes, repeating the procedure until a full sweep through the network \nhas been completed. The one and two dimensional field averages, equations (3) \nand (8), are computed using Gaussian Quadrature. This results in an extremely \nfast procedure for approximating the marginals $m_i$, requiring only a single sweep \nthrough the network. \n\nOur approach is related to that of [6] by the common motivating assumption that \neach node has a large number of parents. This is used in [6] to obtain actual \nbounds on quantities of interest such as joint marginals. Our approach does not \ngive bounds. Its advantage, however, is that it allows fluctuations in the fields $h_i$, \nwhich are effectively excluded in [6] by the assumed scaling of the weights $w_{ij}$ with \nthe number of parents per node. \n\n2.1 Relation to Variational Mean Field Theory \n\nIn the variational approach, one fits a tractable approximating distribution $Q$ to \nthe SBN. Taking $Q$ factorised, $Q(s) = \\prod_i m_i^{s_i} (1 - m_i)^{1 - s_i}$, we have the bound \n\n$$\\ln p(s_1 \\ldots s_n) \\geq \\sum_i \\{ -m_i \\ln m_i - (1 - m_i) \\ln (1 - m_i) \\} + \\sum_i m_i \\Big( \\sum_j w_{ij} m_j + \\theta_i \\Big) - \\sum_i \\langle \\ln (1 + e^{h_i}) \\rangle_Q \\qquad (11)$$ \n\nThe final term in (11) causes some difficulty even in the case in which $Q$ is a \nfactorised model. Formally, this is because this term does not have the same graphical \nstructure as the tractable model $Q$. One way around this difficulty is to \nemploy a further bound, with associated variational parameters [7]. Another approach \nis to make the Gaussian assumption for the field $h_i$ as in section (2). Because $Q$ is \nfactorised, corresponding to a diagonal correlation matrix $R$, this gives [4] \n\n$$\\langle \\ln (1 + e^{h_i}) \\rangle_Q \\approx \\frac{1}{\\sqrt{2 \\pi \\sigma_i^2}} \\int \\ln (1 + e^{h}) \\, e^{-(h - \\mu_i)^2 / (2 \\sigma_i^2)} \\, dh \\qquad (12)$$ \n\nwhere $\\mu_i = \\sum_j w_{ij} m_j + \\theta_i$ and $\\sigma_i^2 = \\sum_j w_{ij}^2 m_j (1 - m_j)$. Note that this is a one \ndimensional integral of a smooth function. In contrast to [4] we therefore evaluate \nthis quantity using Gaussian Quadrature. This has the advantage that no extra \nvariational parameters need to be introduced. Technically, the assumption of a \nGaussian field distribution means that (11) is no longer a bound. Nevertheless, in \npractice it is found that this has little effect on the quality of the resulting solution. \nIn our implementation of the variational approach, we find the optimal parameters \n$m_i$ by maximising the above equation for each component $m_i$ separately, cycling \nthrough the nodes until the parameters $m_i$ do not change by more than $10^{-10}$. \nThis is repeated 5 times, and the solution with the highest bound score is chosen. \nNote that these equations cannot be solved by forward propagation alone since the \nfinal term contains contributions from all the nodes in the network. This is in \ncontrast to the GF approach of section (2). Finding appropriate parameters $m_i$ by \nthe variational approach is therefore rather slower than using the GF method. 
\n\nIn arriving at the above equations, we have made two assumptions. The first is that the intractable distribution is well approximated by a factorised model. The second is that the field distribution is Gaussian. The first assumption is necessary in order to obtain a bound on the likelihood of the model (although this is slightly compromised by the Gaussian field assumption). In the GF approach we dispense with this assumption of an effectively factorised network (partially because if we are only interested in inference, a bound on the model likelihood is less relevant). The GF method may therefore prove useful for a broader class of networks than the variational approach. \n\n2.2  Results for unconditional marginals \n\nWe compared three procedures for estimating the unconditional marginals $p(s_i = 1)$ for all the nodes in the network, namely the variational theory, as described in section (2.1), the diagonal Gaussian field theory, and the non-diagonal Gaussian field theory which includes correlation effects between parents. Results for small weight values $w_{ij}$ are shown in fig(2). In this case, all three methods perform reasonably well, although there is a significant improvement in using the GF methods over the variational procedure (mean error 0.0018 against 0.0377); parental correlations are not important (compare figs(2b) and (2c)). In fig(3) the weights and biases are chosen such that the exact mean variables $m_i$ are roughly 0.5 with non-trivial correlation effects between parents. Note that the variational mean field theory now provides a poor solution, whereas the GF methods are relatively accurate. The effect of using the non-diagonal $R$ terms is beneficial, although not dramatically so. 
\n\n(a) Error using factorised model. Mean error = 0.0377 \n(b) Error using Gaussian Field, diagonal covariance. Mean error = 0.0018 \n(c) Error using Gaussian Field, non-diagonal covariance. Mean error = 0.0017 \n\nFigure 2: Error in approximating $p(s_i = 1)$ for the network in fig(1), averaged over all the nodes in the network. In each of 100 trials, weights were drawn from a zero mean, unit variance Gaussian; biases were set to 0. Note the different scale in (b) and (c). In (a) we use the variational procedure with a factorised $Q$, as in section (2.1). In (b) we use the Gaussian field equations, assuming a diagonal covariance matrix $R$. This procedure was repeated in (c) including correlations between parents. \n\n3  Calculating Conditional  Marginals \n\nWe consider now how to calculate conditional marginals, given some evidential nodes. (In contrast to [6], any set of nodes in the network, not just output nodes, can be considered evidential.) We write the evidence in the following manner \n\n$$E = \\{ s_{c_1} = S_{c_1}, \\ldots, s_{c_n} = S_{c_n} \\} = \\{ E_{c_1} \\ldots E_{c_n} \\}$$ \n\nThe quantities that we are interested in are conditional marginals which, from Bayes' rule, are related to the joint distribution by \n\n$$p(s_i = 1 | E) = \\frac{p(s_i = 1, E)}{p(s_i = 0, E) + p(s_i = 1, E)} \\qquad (13)$$ \n\nThat is, provided that we have a procedure for estimating joint marginals, we can obtain conditional marginals too. Without loss of generality, we therefore consider $E^+ = E \\cup \\{ s_i = 1 \\}$, which then contains $n + 1$ \"evidential\" variables. That is, the desired marginal variable is absorbed into the evidence set. For convenience, we then split the nodes into two sets, those containing the evidential or \"clamped\" nodes, $C$, and the remaining \"free\" nodes $F$. 
The joint evidence is then given by \n\n$$p(E^+) = \\sum_{s_F} p(E^+, s_F) \\qquad (14)$$ \n\n$$= \\sum_{s_F} p(E_{c_1} | \\pi^*_{c_1}) \\ldots p(E_{c_{n+1}} | \\pi^*_{c_{n+1}}) \\, p(s_{f_1} | \\pi^*_{f_1}) \\ldots p(s_{f_m} | \\pi^*_{f_m}) \\qquad (15)$$ \n\nwhere $\\pi^*_i$ are the parents of node $i$, with any evidential parental nodes set to their values as specified in $E^+$. In the sigmoid belief network \n\nif $i$ is an evidential node \notherwise \n(16) \n\n$p(E_k | \\pi^*_k)$ is therefore determined by the distribution of the field $h^*_k = \\sum_i w_{ki} s^*_i + \\theta_k$. Examining (15), we see that the product over the \"free\" nodes defines a SBN in which the local probability distributions are given by those of the original network, but with any evidential parental nodes clamped to their evidence values. Therefore, \n\n(17) \n\nConsistent with our previous assumptions, we assume that the distribution of the fields $h^+ = (h^*_{c_1} \\ldots h^*_{c_{n+1}})$ is jointly Gaussian. We can then find the mean and covariance matrix for the distribution of $h^+$ by repeating the calculation of section (2) in which evidential nodes have been clamped to their evidence values. Once this Gaussian has been determined, it can be used in (17) to determine $p(E^+)$. Gaussian averages of products of sigmoids are calculated by drawing 1000 samples from the Gaussian over which we wish to integrate. (In one and two dimensions, $n = 0, 1$, we use Gaussian Quadrature.) Note that if there are evidential nodes in different layers, we require the correlations between their fields $h$ to evaluate (17). Such 'inter-layer' correlations were not required in section (2), and to be able to use the same calculational scheme we simply neglect them. (We leave a study of the effects of this assumption for future work.) The average in (17) then factors into groups, where each group contains evidential terms in a particular layer. \n\nThe conditional marginal for node $i$ is obtained from repeating the above procedure in which the desired marginal node is clamped to its opposite value, and then using these results in (13). The above procedure is repeated for each conditional marginal that we are interested in. Although this may seem computationally expensive, the marginal for each node is computed quickly, since the equations are solved by one forward propagation sweep only. \n\n(a) Error using factorised model. Mean error = 0.4188 \n(b) Error using Gaussian Field, diagonal covariance. Mean error = 0.0253 \n(c) Error using Gaussian Field, non-diagonal covariance. Mean error = 0.0198 \n\nFigure 3: All weights are drawn uniformly from 0 to 50. Biases are set to -0.5 of the summed parental weights plus a uniform random number from -2.5 to 2.5. The root node is set to be 1 with probability 0.5. This has the effect of making all the nodes in the exact network roughly 0.5 in mean, with non-negligible correlations between parental nodes. 160 simulations were made. \n\n(a) Error using factorised model. Mean error = 0.1534 \n(b) Error using Gaussian Field, diagonal covariance. Mean error = 0.0931 \n(c) Error using Gaussian Field, non-diagonal covariance. Mean error = 0.0865 \n\nFigure 4: Estimating the conditional marginal of the top node being in state 1, given that the four bottom nodes are in state 1. Weights were drawn from a zero mean Gaussian with variance 5, with biases set to -0.5 of the summed parental weights plus a uniform random number from -2.5 to 2.5. Results of 160 simulations. \n\n3.1  Results for conditional marginals \n\nWe used the same structure as in the previous experiments, as shown in fig(1). We are interested here in calculating the probability that the top node is in state 1, given that the four bottom nodes are in state 1. Weights were chosen from a zero mean Gaussian with variance 5. Biases were set to negative half of the summed parent weights, plus a uniform random value from -2.5 to 2.5. Correlation effects in these networks are not as strong as in the experiments in section (2.2), although the improvement of the GF theory over the variational theory seen in fig(4) remains clear (mean error 0.0931 against 0.1534). The improvement from the off diagonal terms in $R$ is minimal. \n\n4  Conclusion \n\nDespite their appropriateness for learning, variational methods may not be equally suited to inference, making more tailored methods attractive. We have considered an approximation procedure that is based on assuming that the distribution of the weighted input to a node is approximately Gaussian. Correlation effects between parents of a node were taken into account to improve the Gaussian theory, although in our examples this gave only relatively modest improvements. \n\nThe variational mean field theory performs poorly in networks with strong correlation effects between nodes. On the other hand, one may conjecture that the Gaussian Field approach will not generally perform catastrophically worse than the factorised variational mean field theory. 
One advantage of the variational theory is the presence of an objective function against which competing solutions can be compared. However, finding an optimum solution for the mean parameters $m_i$ from this function is numerically complex. Since the Gaussian Field theory is extremely fast to solve, an interesting compromise might be to prime the variational solution with the results from the Gaussian Field theory. \n\nAcknowledgments \n\nDB would like to thank Bert Kappen and Wim Wiegerinck for stimulating and helpful discussions. PS thanks the Royal Society for financial support. \n\nReferences \n\n[1] R. Neal. Connectionist learning of Belief Networks. Artificial Intelligence, 56:71-113, 1992. \n\n[2] E. Castillo, J. M. Gutierrez, and A. S. Hadi. Expert Systems and Probabilistic Network Models. Springer, 1997. \n\n[3] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An Introduction to Variational Methods for Graphical Models. In M. I. Jordan, editor, Learning in Graphical Models, pages 105-161. Kluwer, 1998. \n\n[4] L. Saul and M. I. Jordan. A mean field learning algorithm for unsupervised neural networks. In M. I. Jordan, editor, Learning in Graphical Models, 1998. \n\n[5] D. Barber and W. Wiegerinck. Tractable variational structures for approximating graphical models. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems NIPS 11. MIT Press, 1999. \n\n[6] M. Kearns and L. Saul. Inference in Multilayer Networks via Large Deviation Bounds. In Advances in Neural Information Processing Systems NIPS 11, 1999. \n\n[7] L. K. Saul, T. Jaakkola, and M. I. Jordan. Mean Field Theory for Sigmoid Belief Networks. Journal of Artificial Intelligence Research, 4:61-76, 1996. 
\n\n\f", "award": [], "sourceid": 1643, "authors": [{"given_name": "David", "family_name": "Barber", "institution": null}, {"given_name": "Peter", "family_name": "Sollich", "institution": null}]}