{"title": "Approximate Inference A lgorithms for Two-Layer Bayesian Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 533, "page_last": 539, "abstract": null, "full_text": "Approximate inference algorithms for two-layer \n\nBayesian networks \n\nAndrewY. Ng \n\nComputer Science Division \n\nUC Berkeley \n\nBerkeley, CA 94720 \nang@cs.berkeley.edu \n\nMichael I. Jordan \n\nComputer Science Division and \n\nDepartment of Statistics \n\nUC Berkeley \n\nBerkeley, CA 94720 \n\njordan@cs.berkeley.edu \n\nAbstract \n\nWe  present  a  class  of approximate  inference  algorithms  for  graphical \nmodels  of the  QMR-DT type.  We  give  convergence rates  for  these  al(cid:173)\ngorithms and  for  the  Jaakkola and  Jordan  (1999) algorithm,  and  verify \nthese  theoretical predictions empirically.  We  also present empirical re(cid:173)\nsults on the difficult QMR-DT network problem, obtaining performance \nof the  new  algorithms roughly  comparable  to  the  Jaakkola and  Jordan \nalgorithm. \n\n1  Introduction \n\nThe graphical models formalism provides an appealing framework for the design and anal(cid:173)\nysis of network-based learning and inference systems.  The formalism endows graphs with \na joint probability distribution and interprets most queries of interest as  marginal or con(cid:173)\nditional probabilities under this joint.  For a fixed  model one is  generally interested in the \nconditional probability of an output given an input (for prediction), or an input conditional \non  the  output (for diagnosis or control).  During learning the  focus  is  usually  on  the  like(cid:173)\nlihood (a marginal probability), on  the conditional probability of unobserved nodes given \nobserved nodes (e.g., for an EM or gradient-based algorithm), or on the conditional proba(cid:173)\nbility of the parameters given the observed data (in a Bayesian setting). \n\nIn all  of these cases the key  computational operation is that of marginalization.  There are \nseveral methods available for computing marginal probabilities in graphical models,  most \nof which involve some form of message-passing on the graph. Exact methods, while viable \nin many interesting cases (involving sparse graphs), are infeasible in the dense graphs that \nwe consider in the current paper. A number of approximation methods have evolved to treat \nsuch  cases;  these  include search-based methods,  loopy propagation, stochastic sampling, \nand variational methods. \n\nVariational  methods,  the  focus  of the  current paper,  have  been  applied  successfully  to  a \nnumber of large-scale inference problems.  In particular, Jaakkola and Jordan  (1999) de(cid:173)\nveloped a  variational  inference method for the QMR-DT network,  a benchmark network \ninvolving  over 4,000  nodes  (see  below).  The  variational  method  provided  accurate  ap(cid:173)\nproximation to  posterior probabilities within a second of computer time.  For this difficult \n\n\f534 \n\nA.  Y.  Ng and M.  1.  Jordan \n\ninference  problem exact  methods  are  entirely  infeasible  (see  below),  loopy  propagation \ndoes not converge to  correct posteriors (Murphy, Weiss,  &  Jordan,  1999), and  stochastic \nsampling methods are slow and unreliable (Jaakkola & Jordan, 1999). \n\nA significant step forward in the understanding of variational inference was made by Kearns \nand Saul  (1998),  who  used  large deviation techniques to  analyze the convergence rate  of \na simplified variational inference algorithm.  Imposing conditions on the magnitude of the \nweights in the network, they established a 0  ( Jlog N / N) rate of convergence for the error \nof their algorithm, where N  is the fan-in. \n\nIn the current paper we  utilize techniques similar to those of Kearns  and Saul to  derive a \nnew set of variational inference algorithms with rates that are faster than 0 ( Jlog N / N). \nOur techniques also allow  us  to  analyze the convergence rate of the Jaakkola and Jordan \n(1999)  algorithm.  We  test  these  algorithms on an  idealized  problem  and  verify  that our \nanalysis correctly predicts their rates of convergence.  We  then  apply  these algorithms to \nthe difficult the QMR-DT network problem. \n\n2  Background \n\n2.1  The QMR-DT network \n\nThe QMR-DT (Quick Medical Reference, Decision-Theoretic) network is a bipartite graph \nwith  approximately 600 top-level nodes di  representing diseases and approximately 4000 \nlower-level  nodes Ii representing findings  (observed symptoms).  All  nodes  are  binary(cid:173)\nvalued.  Each disease is  given a prior probability P(di  =  1), obtained from archival  data, \nand each finding is parameterized as a \"noisy-OR\" model: \n\nP(h =  lid) =  1- e-(lio-L:jE\"i (lijd j\n\n, \n\nwhere  7T'i  is  the  set  of parent diseases  for  finding  h  and  where  the  parameters  Oij  are \nobtained from assessments by medical experts (see Shwe, et aI.,  1991). \nLetting Zi  =  OiQ  + I:jE 1l'i  Oijdj , we have the following expression for the likelihood I  : \n\n(1) \n\nwhere the sum is a sum across the approximately 2600  configurations of the diseases.  Note \nthat the second product, a product over the negative findings, factorizes across the diseases \ndj ;  these factors can be absorbed into the priors P (dj )  and have no significant effect on the \ncomplexity of inference.  It is  the positive findings  which couple the diseases and prevent \nthe sum from being distributed across the product. \n\nGeneric  exact algorithms  such  as  the  junction  tree  algorithm  scale  exponentially  in  the \nsize of the maximal clique in a moralized, triangulated graph.  Jaakkola and Jordan (1999) \nfound  cliques of more  than  150 nodes  in  QMR-DT;  this  rules  out the junction tree algo(cid:173)\nrithm.  Heckerman (1989) discovered a factorization specific to QMR-DT that reduces the \ncomplexity substantially;  however the resulting algorithm still  scales exponentially in  the \nnumber of positive findings and is only feasible for a small subset of the benchmark cases. \n\nI In this expression, the factors P( dj) are the probabilities associated with the (parent-less) disease \nnodes, the factors  (1 - e - Zi) are the probabilities of the (child) finding nodes that are observed to be \nin their positive state, and the factors e -Zi  are the probabilities of the negative findings.  The resulting \nproduct is the joint probability P(f, d), which is marginalized to obtain the likelihood P(f). \n\n\fApproximate Inference Algorithms for Two-Layer Bayesian Networks \n\n535 \n\n2.2  The Jaakkola and Jordan (JJ) algorithm \n\nJaakkola and Jordan (1999) proposed a variational algorithm for approximate inference in \nthe  QMR-DT setting.  Briefly,  their approach is  to  make  use  of the  following  variational \ninequality: \n\nwhere  Ci  is  a  deterministic  function  of Ai.  This  inequality  holds  for  arbitrary  values  of \nthe  free  \"variational parameter\"  Ai.  Substituting  these  variational  upper bounds for  the \nprobabilities of positive findings in Eq.  (1), one obtains a factorizable upper bound on the \nlikelihood. Because of the factorizability, the sum across diseases can be distributed across \nthe joint probability,  yielding a product of sums rather than  a sum of products.  One then \nminimizes the resulting expression with respect to  the variational parameters to  obtain the \ntightest possible variational bound. \n\n2.3  The Kearns and Saul (KS) algorithm \n\nA simplified variational algorithm was proposed by  Kearns  and Saul (1998), whose  main \ngoal was the theoretical analysis of the rates of convergence for variational algorithms.  In \ntheir approach, the  local conditional probability for  the  finding Ii is  approximated by  its \nvalue at a point a small distance Ci  above or below (depending on whether upper or lower \nbounds are desired) the mean input E[Zi].  This yields a variational algorithm in  which the \nvalues  Ci  are  the  variational  parameters  to  be  optimized.  Under the  assumption that  the \nweights Oij  are bounded in magnitude by  T / N, where T  is a constant and N  is the number \nof parent (\"disease\") nodes,  Kearns and Saul showed that the error in  likelihood for  their \nalgorithm converges at a rate of O( Vlog N / N). \n\n3  Algorithms based on local expansions \n\nInspired by  Kearns and  Saul (1998),  we  describe the  design of approximation algorithms \nfor QMR-DT obtained by expansions around the mean input to the finding  nodes.  Rather \nthan  using  point approximations as  in  the  Kearns-Saul  (KS)  algorithm,  we  make  use  of \nTaylor expansions.  (See also Plefka (1982), and  Barber and  van de Laar (1999) for  other \nperturbational techniques.) \n\nConsider a generalized QMR-DT architecture in which the noisy-OR model is replaced by a \ngeneral function 'IjJ( z)  : R  -t [0, 1]  having uniformly bounded derivatives, i.e.,  \\'IjJ(i) (z) \\ :::; \nB i . Define F(Zl, . .. , ZK)  =  rr~l ('IjJ(zi))fi  rr~l (1  - 'IjJ(Zd)l-fi  so that the likelihood \ncan be written as \n\nP(f) =  E{z;}[F(Zl\"\"  ,ZK)]. \n\n(2) \n\nAlso define /-ti  = E[Zi]  = Ow  + 2:7=1 OijP(dj  = 1). \nA simple mean-field-like approximation can be obtained by evaluating F  at the mean values \n\nWe refer to this approximation as \"MF(O).\" \n\nP(f) ~ F(/-t1, ... ,/-tK). \n\nExpanding the function F  to second order, and defining (i =  Zi  - /-ti,  we have: \n\nP(f) \n\nE{fi}  F(jl) + L  Fi1 (J1)(i 1 + 21  .L L  Fid2 (J1)Eh (i2 + \n\nr \n\nK \n\n1  K  K \n\nL \n\n11 =1 \n\n11 =112=1 \n\n(3) \n\n(4) \n\n\f536 \n\nA.  Y.  Ng and M  I.  Jordan \n\nwhere the subscripts on F represent derivatives.  Dropping the remainder term and bringing \nthe expectation inside, we  have the \"MF(2)\" approximation: \n\nP(f) ~ F(il) + 2 L L Fili2(Jt)E[EilEi2] \n\nK \n\n1  K \n\nMore generally, we obtain a \"MF(i)\" approximation by carrying out a Taylor expansion to \ni-th order. \n\ni 1 =1 i2=1 \n\n3.1  Analysis \n\nIn this section, we give two theorems establishing convergence rates for the MF( i)  family \nof algorithms and  for  the  Jaakkola and Jordan algorithm.  As  in  Kearns  and  Saul (1998), \nour results  are  obtained under the assumption that the  weights are  of magnitude at  most \nO(lIN)  (recall that N  is  the number of disease  nodes).  For large  N, this  assumption of \n\"weak interactions\" implies that each Zi  will be close to its mean value with high probability \n(by the law of large numbers), and thereby gives justification to the use of local expansions \nfor the probabilities of the findings. \n\nDue to  space constraints, the detailed proofs of the  theorems given in  this section are  de(cid:173)\nferred  to  the  long version of this paper, and we  will  instead  only sketch the intuitions for \nthe proofs here. \nTheorem 1  Let K  (the number offindings) be fixed,  and suppose IfJij I :::;  ~ for all i, j for \nsome fixed constantT.  Then the absolute error of the MF(k) approximation is 0  (N(!:!1)/2) \nfor k odd and 0  (N(k72+1)  ) for k even. \nProof intuition.  First consider the case of odd k.  Since  IfJij I :::;  ~, the quantity  Ei  =  Zi  -\nJ-Li  =  2:j  fJij (dj - E[ dj ])  is like an average of N  random variables, and hence has standard \ndeviation on the order 11m. Since MF(k) matches F  up to the k-th order derivatives, we \nfind that when we take a Taylor expansion ofMF(k)'s error, the leading non-zero term is the \nk + 1-st order term, which contains quantities such as 10:+1.  Now because Ei  has standard \ndeviation on  the  order 11m, it is  unsurprising that E[E:+l]  is  on  the  order 1IN(k+l)/2, \nwhich gives the error of MF(k) for odd k. \nFor k even, the leading non-zero term in the Taylor expansion of the error is a k + 1-st order \nterm with  quantities such as 10:+1.  But if we think of Ei  as converging (via a central limit \ntheorem effect) to a symmetric distribution, then since symmetric distributions have small \nodd central moments, E[ 10:+1]  would be small. This means that for k even, we may look to \nthe order k + 2 term for the error, which leads to MF(k) having the the same big-O error as \nMF(k + 1).  Note this is also consistent with how MF(O) and MF(l) always give the same \nestimates and hence have the same absolute error. \n0 \n\nA  theorem  may  also  be proved for  the  convergence rate  of the  Jaakkola and  Jordan  (JJ) \nalgorithm.  For simplicity, we  state it here only for  noisy-OR networks. 2  A closely related \nresult  also  holds  for  sigmoid  networks  with  suitably  modified  assumptions;  see  the  full \npaper. \n\nTheorem 2  Let K  befixed, and suppose 'Ij;(z)  =  1-e-z  is the noisy-ORfunction. Suppose \nfurther that 0  :::;  fJij  :::;  ~ for all i, j  for some fixed constant T,  and that J-Li  ~ J-Lmin  for all \ni, for some fixed J-Lmin  > O.  Then the absolute error of the JJ approximation is 0  (~ ). \n\n2Note in any case that 11 can be applied only when 1/J  is log-concave, such as in noisy-OR networks \n\n(where incidentally all weights are non-negative). \n\n\fApproximate Inference Algorithms for Two-Layer Bayesian Networks \n\n537 \n\nThe condition of some Pmin  lowerbounding the Pi'S  ensures that the findings  are  not too \nunlikely; for it to hold, it is sufficient that there be bias (\"leak\") nodes in the network with \nweights bounded away from zero. \nProof intuition.  Neglecting negative findings,  (which as  discussed do not need to be han(cid:173)\ndled variationally,) this result is  proved for a \"simplified\" version of the JJ algorithm, that \nalways chooses the  variational parameters so that for each i, the exponential upperbound \non '1/J(Zi)  is  tangent to  '1/J  at Zi  =  Pi.  (The \"normal\" version of JJ can have error no  worse \nthan this simplified one.) Taking a Tay lor expansion again of the approximation's error, we \nfind  that since the upperbound has matched zeroth and first derivatives with F, the error is \na second order term with quantities such as f.t. As discussed  in the MF(k) proof outline, \nthis quantity has expectation on the order 1jN, and hence JJ's error is O(ljN). \n0 \n\nTo  summarize our results in  the most useful cases, we find  that MF(O)  has a convergence \nrate of O(ljN) , both MF(2) and MF(3) have rates of O(ljN2) , and JJ has a convergence \nrate of O(ljN). \n\n4  Simulation results \n\n4.1  Artificial networks \n\nWe carried out a set of simulations that were intended to verify the theoretical results pre(cid:173)\nsented in the previous section. We used bipartite noisy-OR networks, with full connectivity \nbetween layers and with the weights ()ij  chosen uniformly in  (0, 2jN). The number N  of \ntop-level  (\"disease\")  nodes  ranged  from  10  to  1000.  Priors  on  the  disease  nodes  were \nchosen uniformly in (0,1). \n\nThe results are shown in Figure 1 for one and five  positive findings  (similar results where \nobtained for additional positive findings). \n\n100 r--_ _ ___  ~------___. \n\n-... _--_._---... \n\n10\u00b7'  -- - - - - - __ _ \n\n... e \nw 10' \nth 1t \n\n--..... _--... -. \n\n10\u00b7'  ....\u2022. ...... .... . .. . ............... ... .. .... \n\n10 \n\n100 \n\n#diseases \n\n1000 \n\n10 \n\n100 \n\n#diseases \n\n1000 \n\nFigure 1:  Absolute error in likelihood (averaged over many randomly generated networks) as a func(cid:173)\ntion of the number of disease nodes  for  various  algorithms.  The short-dashed lines are the KS  upper \nand lower bounds (these curves overlap in the left panel), the long-dashed line is the 11 algorithm and \nthe solid lines are MF(O), MF(2) and MF(3) (the latter two curves overlap in the right panel). \n\nThe results are entirely consistent with the theoretical analysis, showing nearly exactly the \nexpected slopes of -112,  -1  and -2 on a loglog plot. 3  Moreover, the asymptotic  results are \n\n3The  anomalous  behavior  of the  KS  lower bound  in  the  second  panel  is  due  to  the  fact  that  the \nalgorithm  generally  finds  a vacuous  lower bound  of 0 in  this  case,  which  yields  an  error which  is \nessentially constant as  a function of the number of diseases. \n\n\f538 \n\nA. Y.  Ng and M.  1.  Jordan \n\nalso predictive of overall performance:  the MF(2)  and MF(3)  algorithms perform best in \nall cases, MF(O)  and JJ are roughly equivalent, and KS  is the least accurate. \n\n4.2  QMR-DT network \n\nWe  now  present results  for  the  QMR-DT network,  in  particular for  the  four  benchmark \nCPC  cases  studied  by  Jaakkola and  Jordan  (1999).  These  cases  all  have  fewer  than  20 \npositive findings;  thus it is possible to run the Heckerman (1989) \"Quickscore\" algorithm \nto obtain the true likelihood. \n\n10'''' \n\n10\u00b7:10 \n\n10'r. \n\n'0 \n\n8 \n~t O\u00b730 \nQj \n~ \n\n1 0~ \n\n'0-\n\n'0-\n\n\u00b0 \n\nCase 16 \n\n3  .  , \n\n7 \n#Exactly treated findings \n\n, \n\n, , \n, , \n\n'0' \n\n'0\u00b7  , , \n, \n, , \n\n'0\u00b7' \n\n8'\u00b0\u00b7 \n.c \n~to'\" \n\n10\" \u00b7 \n\nto'\" :----\n\n10 \u2022 \u2022 1 \n0 \n\nCase 32 \n\n3 \n\n5 \n\n#Exactly treated findings \n\n\u2022 \n\nFigure 2:  Results for CPC cases  16  and 32, for different numbers  of exactly  treated  findings.  The \nhorizontal  line  is  the  true  likelihood,  the  dashed  line  is  J1's  estimate,  and  the  lower  solid  line  is \nMF(3)'s estimate. \n\nlO\" ~-~--...___---,.---...,.--~------, \n\nCase 34 \n\nlO'\u00b7.--~--...___---,.---...,.--~--__, \n\nCase 46 \n\n10\u00b7\"'F - - - - - - - - - - - - --\n\n-- ----- ----\n\n----:;;\",.-\"\"==;J \n\nlO'\"F - - - - -____ _  -=\",...-=======j \n\n1O~ \n\n,0 \u00b7o~-----!---~. - - 7 ', --'-7'---:':lO:------!12 \n\nto ... o~-____!_--~. --7'.---!-'---::-----!12' \n\n#Exacdy treated findings \n\n#Exaclly treated findings \n\nFigure 3:  Results for CPC cases 34 and 46.  Same legend as  above. \n\nIn Jaakkola and Jordan (1999), a hybrid methodology was proposed in which only a portion \nof the findings were treated approximately ; exact methods were used to treat the remaining \nfindings .  Using  this hybrid methodology, Figures 2  and  3  show the results of running JJ \nand MF(3) on these four cases.4 \n\n4These  experiments  were  run  using  a version  of the 11  algorithm  that  optimizes  the  variational \nparameters just once  without  any  findings  treated  exactly,  and  then  uses  these  fixed  values  of the \nparameters thereafter.  The order in  which  findings  are chosen to  be treated exactly  is based  on 11's \nestimates, as described in Jaakkola and Jordan (1999).  Missing points in the graphs for cases  16 and \n\n\fApproximate Inference Algorithms for Two-Layer Bayesian Networks \n\n539 \n\nThe results show the MF algorithm yielding results that are comparable with  the  JJ  algo(cid:173)\nrithm. \n\n5  Conclusions and extension to multilayer networks \n\nThis paper has presented a class of approximate inference algorithms for graphical models \nof the QMR-DT type, supplied a theoretical analysis of convergence rates, verified the rates \nempirically, and presented promising empirical results for the difficult QMR-DT problem. \n\nAlthough  the  focus  of this  paper has  been  two-layer  networks,  the  MF(k)  family  of al(cid:173)\ngorithms can  also  be  extended  to  multilayer networks.  For example,  consider a  3-layer \nnetwork with nodes bi  being parents of nodes di  being parents of nodes Ii.  To approximate \nPr[J]  using (say)  MF(2),  we  first  write  Pr[J]  as  an  expectation  of a function  (F)  of the \nZi'S,  and approximate this function via a second-order Taylor expansion.  To calculate the \nexpectation of the Taylor approximation, we need to calculate terms in the expansion such \nas E[di ], E[didj ]  and E[dn When di  had no parents, these quantities were easily derived in \nterms of the disease prior probabilities.  Now,  they instead depend on the joint distribution \nof di  and dj , which we use our two-layer version of MF(k), applied to the first two (bi  and \ndi )  layers of the network, to approximate. It is important future work to carefully study the \nperformance of this algorithm in the multilayer setting. \n\nAcknowledgments \n\nWe  wish  to  acknowledge the helpful advice  of Tommi  Jaakkola,  Michael  Kearns,  Kevin \nMurphy, and Larry Saul. \n\nReferences \n\n[1]  Barber, D., &  van de Laar, P.  (1999) Variational cumulant expansions for intractable distributions. \nJournal of Artificial Intelligence Research, 10,435-455. \n\n[2]  Heckerman,  D.  (1989).  A  tractable  inference  algorithm  for  diagnosing  multiple  diseases.  In \nProceedings of the Fifth Conference on Uncertainty in Artificial Intelligence. \n\n[3]  Jaakkola,  T.  S.,  &  Jordan,  M.  1.  (1999).  Variational  probabilistic  inference  and  the  QMR-DT \nnetwork.  Journal of Artificial Intelligence Research,  10,291-322. \n\n[4] Jordan, M. 1., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1998).  An introduction to variational \nmethods for graphical models.  In Learning in Graphical Models.  Cambridge:  MIT Press. \n\n[5]  Kearns, M. 1., & Saul, L. K. (1998).  Large deviation methods for approximate probabilistic infer(cid:173)\nence,  with rates of convergence.  In G. F.  Cooper &  S.  Moral (Eds.),  Proceedings of the  Fourteenth \nConference on Uncertainty in Artificial Intelligence. San Mateo, CA: Morgan Kaufmann. \n\n[6]  Murphy,  K.  P.,  Weiss,  Y,  &  Jordan,  M.  1.  (1999).  Loopy  belief propagation  for  approximate \ninference:  An empirical study.  In Proceedings of the Fifteenth Conference on Uncertainty in Artificial \nIntelligence. \n\n[7]  Plefka, T.  (1982).  Convergence condition of the TAP equation for the infinite-ranged Ising spin \nglass model.  In 1.  Phys.  A: Math.  Gen.,  15(6). \n\n[8]  Shwe, M., Middleton, B., Heckerman, D., Henrion, M., Horvitz, E., Lehmann, H., & Cooper, G. \n(1991).  Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base 1. \nThe probabilistic model and inference algorithms.  Methods of Information in Medicine, 30, 241-255. \n\n34 correspond to runs where our implementation of the Quickscore algorithm encountered numerical \nproblems. \n\n\f", "award": [], "sourceid": 1640, "authors": [{"given_name": "Andrew", "family_name": "Ng", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}