{"title": "Second Order Approximations for Probability Models", "book": "Advances in Neural Information Processing Systems", "page_first": 238, "page_last": 244, "abstract": null, "full_text": "Second order approximations for probability \n\nmodels \n\nHilbert J. Kappen \n\nDepartment of Biophysics \n\nNijmegen University \n\nNijmegen, the Netherlands \nbert@mbfys.kun.nl \n\nWim Wiegerinck \n\nDepartment of Biophysics \n\nNijmegen University \n\nNijmegen, the Netherlands \nwimw@mbfys.kun.nl \n\nAbstract \n\nIn  this  paper,  we  derive  a  second  order mean  field  theory  for  directed \ngraphical  probability  models.  By  using  an  information  theoretic  argu(cid:173)\nment  it  is  shown  how  this  can  be  done  in  the  absense  of a  partition \nfunction.  This method is  a direct generalisation of the well-known TAP \napproximation for  Boltzmann Machines.  In  a numerical  example,  it is \nshown  that the  method  greatly  improves  the  first  order mean  field  ap(cid:173)\nproximation.  For a restricted class of graphical models, so-called single \noverlap graphs, the second order method has comparable complexity to \nthe first order method.  For sigmoid belief networks, the method is shown \nto be particularly fast and effective. \n\n1  Introduction \n\nRecently, a number of authors have proposed deterministic methods for approximate infer(cid:173)\nence  in  large graphical models.  The simplest approach gives a lower bound on the prob(cid:173)\nability of a subset of variables using Jenssen's inequality (Saul et aI.,  1996).  The method \ninvolves the minimization of the KL divergence between the target probability distribution \np and some  'simple'  variational distribution q.  The method can be applied to a large class \nof probability  models,  such  as  sigmoid belief networks,  DAGs  and  Boltzmann Machines \n(BM). 
\n\nFor  Boltzmann-Gibbs  distributions,  it  is  possible  to  derive  the  lower  bound  as  the  first \nterm in  a Taylor series expansion of the free  energy around a factorized  model.  The free \nenergy is given by -log Z, where Z is the normalization constant of the Boltzmann-Gibbs \ndistribution: p(x) =  exp(~E(X)). This Taylor series can be continued and the second order \nterm is  known as  the  TAP  correction (Plefka,  1982;  Kappen and Rodriguez,  1998).  The \nsecond order term significantly improves the quality of the approximation, but is no longer \na bound. \n\nFor probability distributions that are  not Boltzmann-Gibbs distributions,  it is  not obvious \nhow  to  obtain  the  second  order approximation.  However,  there  is  an  alternative way  to \ncompute  the  higher order corrections,  based  on  an  information theoretic  argument.  Re(cid:173)\ncently, this  argument was  applied to  stochastic neural  networks with asymmetric connec(cid:173)\ntivity (Kappen and Spanjers, 1999). Here, we apply this idea to directed graphical models. \n\n\f2  The method \nLet x  = (Xl\"\", Xn)  be an n-dimensional vector,  with Xi  taking on discrete values.  Let \np(x)  be  a directed  graphical  model  on  x.  We  will  assume that p(x)  can  be  written  as  a \nproduct of potentials in the following way: \n\nn \n\nk=l \n\nn \n\nk=l \n\n(I) \n\nHere, Pk(Xkl'lrk)  denotes the conditional probability table of variable Xk  given the values \nof its parents 'Irk.  xk  = (X k, 'Irk)  denotes the subset of variables that appear in potential k \nand \u00a2k (xk)  =  logpk(xk l'lrk).  Potentials can be overlapping, xk n xl  =1=  0, and X =  Ukxk. \nWe  wish  to  compute the  marginal  probability  that  Xi  has  some  specific  value  Si  in  the \npresence of some evidence.  We  therefore denote X  = (e, s)  where e denote the subset of \nvariables  that constitute the evidence,  and  S  denotes  the remainder of the variables.  
The \nmarginal is given as \n\n( \n\nP s. e  -\n\n)  - p(si,e) \n. \n\np(e) \n\n_I \n\n(2) \n\nBoth  numerator and denominator contain sums over hidden  states.  These  sums  scale ex(cid:173)\nponentially  with  the  size  of the problem,  and  therefore  the  computation  of marginals  is \nintractable. \n\nWe  propose to approximate this problem by using a mean field  approach.  Consider a fac(cid:173)\ntorized distribution on the hidden variables h: \n\n(3) \n\nWe  wish  to  find  the  factorized distribution q that best approximates p(sle).  Consider as  a \ndistance measure \n\nKL = LP(sle) log (p~~!;)) . \n\n\u2022 \n\nIt is easy to see that the q that minimizes KL satisfies: \n\n(4) \n\n(5) \n\nWe  now  think of the  manifold of all  probability distributions of the form  Eq.  1,  spanned \nby  the  coordinates  \u00a2k (xk), k  =  1, ... , m.  For each  k,  \u00a2k (xk)  is  a  table  of numbers, \nindexed  by  xk.  This  manifold  contains  a  submanifold  of factorized  probability  distri(cid:173)\nbutions  in  which  the  potentials factorize:  \u00a2k(Xk)  = Li,iEk \u00a2ki(Xi).  When  in  addition, \nLk,iEk \u00a2ki(Xi) = logqi(xi), i  E h, p(sle) reduces to q(s). \nAssume now  that p(sle)  is  somehow close to  the factorized  submanifold.  The difference \n~P(Si Ie)  = P(Si Ie) - qi(Si) is then small, and we can expand this small difference in terms \nof changes in the parameters ~\u00a2k(xk) = \u00a2k(Xk) -logq(xk), k = 1, ... , m: \n\nt L  (810gp(s~le))  ~\u00a2k(xk) \n\nk=l  a;k \n\n8\u00a2k(x) \n(  8 2 10gp(sile)  ) \n1 \" \n\"2  L..J  L..J  8\u00a2  (xk)8\u00a2  (I) \n\n\" \n\nq \n\n1  Y \n\n+ \n\nkl  a;k ,ii' \n\nk \n+  higher order terms \n\n-k \n\n-I \n~\u00a2k(X )~\u00a2I(Y) \n\nq \n\n(6) \n\n\fThe differentials are evaluated in the factorized distribution q.  The left-hand size of Eq. 6 is \nzero because of Eq. 
5, and we solve for q(s_i). This factorized distribution gives the desired marginals up to the order of the expansion of \\Delta \\log p(s_i|e).\n\nIt is straightforward to compute the derivatives:\n\n\\partial \\log p(s_i|e) / \\partial \\phi_k(x^k) = p(x^k|s_i, e) - p(x^k|e)\n\\partial^2 \\log p(s_i|e) / \\partial \\phi_k(x^k) \\partial \\phi_l(y^l) = p(x^k, y^l|s_i, e) - p(x^k, y^l|e) - p(x^k|s_i, e) p(y^l|s_i, e) + p(x^k|e) p(y^l|e)   (7)\n\nWe introduce the notation (...)_{s_i} and (...) for the expectation values with respect to the factorized distributions q(x|s_i, e) and q(x|e), respectively. We define ((...))_{s_i} = (...)_{s_i} - (...). We obtain\n\n\\Delta \\log p(s_i|e) = \\sum_k ((\\Delta\\phi_k))_{s_i} + (1/2) \\sum_{k,l} (((\\Delta\\phi_k \\Delta\\phi_l))_{s_i} - (\\Delta\\phi_k)_{s_i} (\\Delta\\phi_l)_{s_i} + (\\Delta\\phi_k)(\\Delta\\phi_l)) + higher order terms   (8)\n\nTo first order, setting Eq. 8 equal to zero we obtain\n\n0 = \\sum_k ((\\Delta\\phi_k))_{s_i} = (\\log p(x))_{s_i} - \\log q(s_i) + const.,   (9)\n\nwhere we have absorbed all terms independent of i into a constant. Thus, we find the solution\n\nq(s_i) = (1/Z_i) \\exp((\\log p(x))_{s_i})   (10)\n\nin which the constants Z_i follow from normalisation. The first order term is equivalent to the standard mean field equations, obtained from Jensen's inequality.\n\nThe correction with second order terms is obtained in the same way, again dropping terms independent of i:\n\nq(s_i) = (1/Z_i) \\exp((\\log p(x))_{s_i} + (1/2) \\sum_{k,l} ((\\Delta\\phi_k \\Delta\\phi_l)_{s_i} - (\\Delta\\phi_k)_{s_i} (\\Delta\\phi_l)_{s_i}))   (11)\n\nwhere, again, the constants Z_i follow from normalisation. These equations, which form the main result of this paper, are a generalization of the mean field equations with TAP corrections to directed graphical models. Both the left- and right-hand sides of Eqs. 10 and 11 depend on the unknown probability distribution q(s), and can be solved by fixed point iteration.\n\n3 Complexity and single-overlap graphs\n\nThe complexity of the first order equations Eq. 
10 is exponential in the number of variables in the potentials \\phi_k of p: if the maximal clique size is c, then for each i we need of the order of n_i \\exp(c) computations, where n_i is the number of cliques that contain node i.\n\nThe second order term scales worse, since one must compute averages over the union of two overlapping cliques, and because of the double sum. However, things are not so bad.\n\nFigure 1: An example of a single-overlap graph. Left: the chest clinic model (ASIA) (Lauritzen and Spiegelhalter, 1988). Right: nodes within one potential are grouped together, showing that potentials share at most one node.\n\nFirst of all, notice that the sum over k and l can be restricted to overlapping cliques (k \\cap l \\neq \\emptyset) and that i must be in either k or l or both (i \\in k \\cup l). Denote by n_k the number of cliques that have at least one variable in common with clique k, and denote n_overlap = max_k n_k. Then the sum over k and l contains not more than n_i n_overlap terms.\n\nEach term is an average over the union of two cliques, which can be in the worst case of size 2c-1 (when only one variable is shared). However, since (\\Delta\\phi_k \\Delta\\phi_l)_{s_i} = ((\\Delta\\phi_k)_{k \\cap l} \\Delta\\phi_l)_{s_i}, where (.)_{k \\cap l} means expectation wrt q conditioned on the variables in k \\cap l, we can precompute (\\Delta\\phi_k)_{k \\cap l} for all pairs of overlapping cliques k, l, for all states in k \\cap l. Therefore, the worst case complexity of the second order term is less than n_i n_overlap \\exp(c). Thus, we see that the second order method has the same exponential complexity as the first order method, but with a different polynomial prefactor. Therefore, the first or second order method can be applied to directed graphical models as long as the number of parents is reasonably small. 
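The clique bookkeeping behind these complexity estimates can be made concrete with a short sketch (ours, not the authors'). The clique list encodes the family structure of the ASIA model from figure 1, one potential per node (the node plus its parents).

```python
from itertools import combinations

# Potentials of the ASIA model, one clique per node: the node plus its parents.
cliques = {
    'A': {'A'}, 'T': {'A', 'T'}, 'S': {'S'}, 'L': {'S', 'L'},
    'B': {'S', 'B'}, 'E': {'T', 'L', 'E'}, 'X': {'E', 'X'}, 'D': {'E', 'B', 'D'},
}

# Single-overlap check: every pair of potentials shares at most one node.
single_overlap = all(len(c1 & c2) <= 1
                     for c1, c2 in combinations(cliques.values(), 2))

# n_k: number of cliques sharing at least one variable with clique k;
# n_overlap = max_k n_k bounds the number of terms in the second order sum.
n = {k: sum(1 for l in cliques if l != k and cliques[k] & cliques[l])
     for k in cliques}
n_overlap = max(n.values())

# c: maximal clique size, which sets the exponential factor exp(c).
c = max(len(cl) for cl in cliques.values())
```

With these cliques, the check confirms the single-overlap property of the ASIA network, with c = 3 and n_overlap = 4.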
\n\nThe fact that the second order term has a worse complexity than the first order term is in contrast to Boltzmann machines, in which the TAP approximation has the same complexity as the standard mean field approximation. This phenomenon also occurs for a special class of DAGs, which we call single-overlap graphs. These are graphs in which the potentials \\phi_k share at most one node. Figure 1 shows an example of a single-overlap graph.\n\nFor single-overlap graphs, we can use the first order result Eq. 9 to simplify the second order correction. The derivation is rather tedious, and we just present the result\n\nq(s_i) = (1/Z_i) \\exp((\\log p(x))_{s_i} + (1/2) \\sum_{l: i \\in l} ((\\Delta\\phi_l^2)_{s_i} - (\\Delta\\phi_l)_{s_i}^2 + \\sum_{k \\neq l} (((\\Delta\\phi_k)_{s_i} \\Delta\\phi_l))_{s_i}))   (12)\n\nwhich has a complexity of order n_i(c-1)\\exp(c). For probability distributions with many small potentials that share nodes with many other potentials, Eq. 12 is more efficient than Eq. 11. For instance, for Boltzmann Machines n_i = n_overlap = n - 1 and c = 2. In this case, Eq. 12 is identical to the TAP equations (Thouless et al., 1977).\n\n4 Sigmoid belief networks\n\nIn this section, we consider sigmoid belief networks as an interesting class of directed graphical models. The reason is that one can expand in terms of the couplings instead of the potentials, which is more efficient. The sigmoid belief network is defined as\n\np(x) = \\prod_{i=1}^n \\sigma(x_i h_i(x))   (13)\n\nwhere \\sigma(x) = (1 + \\exp(-2x))^{-1}, x_i = \\pm 1, and h_i is the local field: h_i(x) = \\sum_j w_{ij} x_j + \\theta_i.\n\nFigure 2: Interpretation of the different interaction terms appearing in Eq. 16. The open and shaded nodes are hidden and evidence nodes, respectively (except in (a), where k can be any node). Solid arrows indicate the graphical structure in the network. Dashed arrows indicate interaction terms that appear in Eq. 16.
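The normalization of this definition is easy to verify numerically: since sigma(h) + sigma(-h) = 1, each factor is a proper conditional distribution of x_i given its parents. The sketch below (with an arbitrary small weight matrix of our own choosing) checks this for a three-unit feed-forward network.

```python
import itertools
import math

def sigma(x):
    # sigma(x) = (1 + exp(-2x))^(-1), the convention for units x_i in {-1, +1}
    return 1.0 / (1.0 + math.exp(-2.0 * x))

# Feed-forward weights: w[i][j] nonzero only for j < i, plus biases theta.
w = [[0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0],
     [-0.3, 0.8, 0.0]]
theta = [0.2, -0.1, 0.4]

def local_field(x, i):
    return sum(w[i][j] * x[j] for j in range(len(x))) + theta[i]

def joint(x):
    # p(x) = prod_i sigma(x_i * h_i(x)); each factor is a conditional
    # probability of x_i given its parents, so p normalizes automatically.
    return math.prod(sigma(x[i] * local_field(x, i)) for i in range(len(x)))

# Sum over all 2^3 joint states of the three binary units.
total = sum(joint(x) for x in itertools.product([-1, 1], repeat=3))
```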
\nWe separate the variables into evidence variables e and hidden variables s: x = (s, e). When couplings from hidden nodes to either hidden or evidence nodes are zero, w_{ij} = 0 for i \\in e, s and j \\in s, the probability distributions p(s|e) and p(e) reduce to\n\np(s|e) -> q(s) = \\prod_{i \\in s} \\sigma(s_i g_i)   (14)\n\np(e) -> \\prod_{i \\in e} r(e_i), with r(e_i) = \\sigma(e_i g_i)   (15)\n\nwhere g_i = \\sum_{j \\in e} w_{ij} e_j + \\theta_i depends on the evidence.\n\nWe expand to second order around this tractable distribution and obtain\n\nm_i = \\tanh(\\sum_{k \\in s,e} m_k w_{ik} + \\theta_i + 2 \\sum_{k \\in e} r(-e_k) e_k w_{ki} - m_i \\sum_{k \\in s} (1 - m_k^2) w_{ik}^2 + 4 m_i \\sum_{k \\in e} r(e_k) r(-e_k) w_{ki}^2 - 4 \\sum_{k \\in e, l \\in s} r(e_k) r(-e_k) m_l w_{kl} w_{ki} + 2 \\sum_{k \\in s, l \\in e} (1 - m_k^2) r(-e_l) e_l w_{lk} w_{ki})   (16)\n\nwith m_i = (s_i)_q \\approx (s_i)_p, and r given by Eq. 15.\n\nThe different terms that appear in this equation can be easily interpreted. The first term describes the lowest order forward influence on node i from its parents. Parents can be either evidence or hidden nodes (fig. 2a). The second term is the bias \\theta_i. The third term describes to lowest order the effect of Bayes' rule: it affects m_i such that the observed evidence on its children becomes most probable (fig. 2b). Note that this term is absent when the evidence is explained by the evidence nodes themselves: r(e_k) = 1. The fourth and fifth terms are the quadratic contributions to the first and third terms, respectively. The sixth term describes 'explaining away': the effect of hidden node l on node i when both have a common observed child k (fig. 2c). The last term describes the effect on node i when its grandchild is observed (fig. 2d).\n\nNote that these equations are different from Eq. 10. When one applies Eq. 10 to sigmoid belief networks, one requires additional approximations to compute (\\log \\sigma(x_i h_i)) (Saul et al., 1996).\n
\nFigure 3: Second order approximation for a fully connected sigmoid belief network of n nodes. a) Nodes 1, ..., n_1 are hidden (white) and nodes n_1 + 1, ..., n are clamped (grey), n_1 = n/2; b) CPU time for exact inference (dashed) and the second order approximation (solid) versus n_1 (J = 0.5); c) RMS of the exact hidden node marginals (solid) and RMS error of the second order approximation (dashed) versus coupling strength J (n_1 = 10).\n\nSince only feed-forward connections are present, one can order the nodes such that w_{ij} = 0 for i < j. Then the first order mean field equations can be solved in one single sweep, starting with node 1. The full second order equations can be solved by iteration, starting with the first order solution.\n\n5 Numerical results\n\nWe illustrate the theory with two toy problems. The first one is inference in Lauritzen's chest clinic model (ASIA), defined on 8 binary variables x = {A, T, S, L, B, E, X, D} (see figure 1, and (Lauritzen and Spiegelhalter, 1988) for more details about the model). We computed exact marginals and approximate marginals using the approximating methods up to first (Eq. 10) and second order (Eq. 11), respectively. The approximate marginals are determined by sequential iteration of (10) and (11), starting at q(x_i) = 0.5 for all variables i. The maximal error in the marginals using the first and second order method is 0.213 and 0.061, respectively. We verified that the single-overlap expression Eq. 12 gave similar results.\n\nIn fig. 3, we assess the accuracy and CPU time of the second order approximation Eq. 16 for sigmoid belief networks. 
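The single forward sweep for the first order equations described above can be sketched as follows (our illustration with arbitrary weights; in the no-evidence case the update reduces to m_i = tanh(sum_{j<i} w_ij m_j + theta_i)):

```python
import math

# First order mean field sweep for a feed-forward sigmoid belief network
# with no evidence: m_i = tanh(sum_{j<i} w_ij m_j + theta_i).
# Because w[i][j] = 0 for j >= i, one forward pass gives the fixed point.
def first_order_sweep(w, theta):
    n = len(theta)
    m = [0.0] * n
    for i in range(n):
        m[i] = math.tanh(sum(w[i][j] * m[j] for j in range(i)) + theta[i])
    return m

# Tiny example with arbitrary weights (ours, for illustration).
w = [[0.0, 0.0, 0.0],
     [0.7, 0.0, 0.0],
     [0.2, -0.4, 0.0]]
theta = [0.3, 0.0, 0.1]
m = first_order_sweep(w, theta)
```

Because each m_i depends only on earlier nodes, the sweep reaches the first order fixed point in a single pass; the second order equations Eq. 16 can then be iterated from this starting point.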
We generate random fully connected sigmoid belief networks with w_{ij} drawn from a normal distribution with mean zero and variance J^2/n, and \\theta_i = 0. We observe in fig. 3b that the computation time is very fast: for n_1 = 500, we obtained convergence in 37 seconds on a 300 MHz Pentium processor. The accuracy of the method depends on the size of the weights, and is computed for a network of n_1 = 10 (fig. 3c). In (Kappen and Wiegerinck, 2001), we compare this approach to Saul's variational approach (Saul et al., 1996) and show that our approach is much faster and slightly more accurate.\n\n6 Discussion\n\nIn this paper, we computed a second order mean field approximation for directed graphical models. We show that the second order approximation gives a significant improvement over the first order result. The method does not use explicitly that the graph is directed. Therefore, the result is equally valid for Markov graphs.\n\nThe complexity of the first and second order approximations is O(n_i \\exp(c)) and O(n_i n_overlap \\exp(c)), respectively, with c the number of variables in the largest potential. For single-overlap graphs, one can rewrite the second order equation such that the computational complexity reduces to O(n_i (c-1) \\exp(c)). Boltzmann machines and the ASIA network are examples of single-overlap graphs.\n\nFor large c, additional approximations are required, as was proposed by (Saul et al., 1996) for the first order mean field equations. It is evident that such additional approximations are then also required for the second order mean field equations.\n\nIt has been reported (Barber and Wiegerinck, 1999; Wiegerinck and Kappen, 1999) that similar numerical improvements can be obtained by using a very different approach, which is to use an approximating distribution q that is not factorized, but still tractable. 
A promising way to proceed is therefore to combine both approaches and to do a second order expansion around a manifold of non-factorized yet tractable distributions. In this approach, the sufficient statistics of the tractable structure are expanded, rather than the marginal probabilities.\n\nAcknowledgments\n\nThis research was supported in part by the Dutch Technology Foundation (STW).\n\nReferences\n\nBarber, D. and Wiegerinck, W. (1999). Tractable variational structures for approximating graphical models. In Kearns, M., Solla, S., and Cohn, D., editors, Advances in Neural Information Processing Systems, volume 11, pages 183-189. MIT Press.\n\nKappen, H. and Rodriguez, F. (1998). Efficient learning in Boltzmann Machines using linear response theory. Neural Computation, 10:1137-1156.\n\nKappen, H. and Spanjers, J. (1999). Mean field theory for asymmetric neural networks. Physical Review E, 61:5658-5663.\n\nKappen, H. and Wiegerinck, W. (2001). Mean field theory for graphical models. In Saad, D. and Opper, M., editors, Advanced Mean Field Theory. MIT Press.\n\nLauritzen, S. and Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. J. Royal Statistical Society B, 50:154-227.\n\nPlefka, T. (1982). Convergence condition of the TAP equation for the infinite-range Ising spin glass model. Journal of Physics A, 15:1971-1978.\n\nSaul, L., Jaakkola, T., and Jordan, M. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61-76.\n\nThouless, D., Anderson, P., and Palmer, R. (1977). Solution of 'Solvable Model of a Spin Glass'. Phil. Mag., 35:593-601.\n\nWiegerinck, W. and Kappen, H. (1999). 
Approximations of Bayesian networks through KL minimisation. New Generation Computing, 18:167-175.", "award": [], "sourceid": 1890, "authors": [{"given_name": "Hilbert", "family_name": "Kappen", "institution": null}, {"given_name": "Wim", "family_name": "Wiegerinck", "institution": null}]}