{"title": "Approximate Expectation Maximization", "book": "Advances in Neural Information Processing Systems", "page_first": 353, "page_last": 360, "abstract": "", "full_text": "Approximate Expectation Maximization\n\nTom Heskes, Onno Zoeter, and Wim Wiegerinck\n\nSNN, University of Nijmegen\nGeert Grooteplein 21, 6525 EZ, Nijmegen, The Netherlands\n\nAbstract\n\nWe discuss the integration of the expectation-maximization (EM) algorithm for maximum likelihood learning of Bayesian networks with belief propagation algorithms for approximate inference. Specifically, we propose to combine the outer-loop step of convergent belief propagation algorithms with the M-step of the EM algorithm. This yields an approximate EM algorithm that is essentially still double loop, with the important advantage of an inner loop that is guaranteed to converge. Simulations illustrate the merits of such an approach.\n\n1 Introduction\n\nThe EM (expectation-maximization) algorithm [1, 2] is a popular method for maximum likelihood learning in probabilistic models with hidden variables. The E-step boils down to computing probabilities of the hidden variables given the observed variables (evidence) and the current set of parameters. The M-step then, given these probabilities, yields a new set of parameters guaranteed to increase the likelihood. In Bayesian networks, which are the focus of this article, the M-step is usually relatively straightforward. A complication may arise in the E-step, when computing the probability of the hidden variables given the evidence becomes intractable.\n\nAn often used approach is to replace the exact yet intractable inference in the E-step with approximate inference, either through sampling or using a deterministic variational method. 
The use of a \"mean-field\" variational method in this context leads to an algorithm known as variational EM, which can be given the interpretation of minimizing a free energy with respect to both a tractable approximate distribution (approximate E-step) and the parameters (M-step) [2].\n\nLoopy belief propagation [3] and variants thereof, such as generalized belief propagation [4] and expectation propagation [5], have become popular alternatives to the \"mean-field\" variational approaches, often yielding somewhat better approximations. And indeed, they can be and have been applied for approximate inference in the E-step of the EM algorithm (see e.g. [6, 7]). A possible worry, however, is that standard application of these belief propagation algorithms does not always lead to convergence. So-called double-loop algorithms with convergence guarantees have been derived, such as CCCP [8] and UPS [9], but they tend to be an order of magnitude slower than standard belief propagation.\n\nThe goal of this article is to integrate expectation-maximization with belief propagation. As for variational EM, this integration relies on the free-energy interpretation of EM, which is reviewed in Section 2. In Section 3 we describe how the exact free energy can be approximated with a Kikuchi free energy and how this leads to an approximate EM algorithm. Section 4 contains our main result: integrating the outer loop of a convergent double-loop algorithm with the M-step, we are left with an overall double-loop algorithm, where the inner loop is now a convex constrained optimization problem with a unique solution. The methods are illustrated in Section 5; implications and extensions are discussed in Section 6.\n\n2 The free energy interpretation of EM\n\nWe consider probabilistic models P(x; θ), with θ the model parameters to be learned and x the variables in the model. 
We subdivide the variables into hidden variables h and observed, evidenced variables e. For ease of notation, we consider just a single set of observed variables e (in fact, if we have N sets of observed variables, we can simply copy our probability model N times and view this as our single probability model with \"shared\" parameters θ). In maximum likelihood learning, the goal is to find the parameters θ that maximize the likelihood P(e; θ) or, equivalently, that minimize minus the loglikelihood\n\nL(θ) = -log P(e; θ) = -log [ Σ_h P(e, h; θ) ] .\n\nThe EM algorithm can be understood from the observation, made in [2], that\n\nL(θ) = min_{Q ∈ P} F(Q, θ) ,\n\nwith P the set of all probability distributions defined on h and F(Q, θ) the so-called free energy\n\nF(Q, θ) = L(θ) + Σ_h Q(h) log [ Q(h) / P(h|e; θ) ] = E(Q, θ) - S(Q) ,   (1)\n\nwith the \"energy\"\n\nE(Q, θ) = - Σ_h Q(h) log P(e, h; θ) ,\n\nand the \"entropy\"\n\nS(Q) = - Σ_h Q(h) log Q(h) .\n\nThe EM algorithm now boils down to alternate minimization with respect to Q and θ:\n\nE-step: fix θ and solve Q = argmin_{Q' ∈ P} F(Q', θ)\nM-step: fix Q and solve θ = argmin_{θ'} F(Q, θ') = argmin_{θ'} E(Q, θ')   (2)\n\nThe advantage of the M-step over direct minimization of -log P(e; θ) is that the summation over h is now outside the logarithm, which in many cases implies that the minimum with respect to θ can be computed explicitly. The main inference problem is then in the E-step. Its solution follows directly from (1):\n\nQ(h) = P(h|e; θ) = P(h, e; θ) / Σ_{h'} P(h', e; θ) ,   (3)\n\nwith θ the current setting of the parameters. However, in complex probability models P(h|e; θ) can be difficult and even intractable to compute, mainly because of the normalization in the denominator. For later purposes we note that the EM algorithm can be interpreted as a general \"bound optimization algorithm\" [10]. 
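The alternating minimization (2) with E-step solution (3) can be checked numerically. Below is a minimal sketch with a toy model (one binary hidden variable h, one observed binary variable e, hypothetical parameter values not taken from the paper), verifying that the E-step posterior attains the minimum of the free energy, i.e. F(Q*, θ) = L(θ), and that any other Q gives a larger value:

```python
import numpy as np

# Toy model: theta = (p(h=1), p(e=1|h=0), p(e=1|h=1)), all values hypothetical.
def joint(theta):
    """P(e, h; theta) as a 2x2 array indexed [e, h]."""
    p_h = np.array([1.0 - theta[0], theta[0]])
    p_e_given_h = np.array([[1.0 - theta[1], 1.0 - theta[2]],
                            [theta[1], theta[2]]])   # indexed [e, h]
    return p_e_given_h * p_h

def free_energy(Q, theta, e):
    """F(Q, theta) = E(Q, theta) - S(Q), cf. equation (1)."""
    p_eh = joint(theta)[e]                 # P(e, h; theta) as a function of h
    energy = -np.sum(Q * np.log(p_eh))
    entropy = -np.sum(Q * np.log(Q))
    return energy - entropy

theta = (0.3, 0.8, 0.4)
e = 1
p_eh = joint(theta)[e]
L = -np.log(p_eh.sum())                    # minus loglikelihood
Q_star = p_eh / p_eh.sum()                 # E-step posterior, equation (3)

# at the E-step posterior the free energy equals minus the loglikelihood
assert np.isclose(free_energy(Q_star, theta, e), L)
# any other Q gives a larger free energy, so F upper-bounds L
assert free_energy(np.array([0.5, 0.5]), theta, e) > L
```

The second assertion is the bound-optimization view: F(Q, θ) upper-bounds L(θ) for every Q, and the E-step makes the bound tight.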
In this interpretation the free energy F(Q, θ) is an upper bound on the function L(θ) that we try to minimize; the E-step corresponds to a reset of the bound and the M-step to the minimization of the upper bound.\n\nIn variational EM [2] one restricts the probability distribution Q to a specific set P', such that the E-step becomes tractable. Note that this restriction affects both the energy term and the entropy term. By construction the approximate min_{Q ∈ P'} F(Q, θ) is an upper bound on L(θ).\n\n3 Approximate free energies\n\nIn several studies, propagation algorithms like loopy belief propagation [6] and expectation propagation [7] have been applied to find approximate solutions for the E-step. As we will see, the corresponding approximate EM algorithm can be interpreted as alternate minimization of a Bethe or Kikuchi free energy. For the moment, we will consider the case of loopy and generalized belief propagation applied to probability models with just discrete variables. The generalization to expectation propagation is discussed in Section 6.\n\nThe joint probability implied by a Bayesian network can be written in the form\n\nP(x; θ) = Π_α Ψ_α(x_α; θ_α) ,\n\nwhere α denotes a subset of variables and Ψ_α is a potential function. The parameters θ_α may be shared, i.e., we may have θ_α = θ_α' for some α ≠ α'. For a Bayesian network, the energy term simplifies into a sum over local terms:\n\nE(Q, θ) = - Σ_α Σ_{h_α} Q(h_α) log Ψ_α(h_α, e_α; θ_α) .\n\nHowever, the entropy term is as intractable as the normalization in (3) that we try to prevent. In the Bethe or, more generally, Kikuchi approximation, this entropy term is approximated through [4]\n\nS(Q) = - Σ_h Q(h) log Q(h) ≈ Σ_α S_α(Q) + Σ_β c_β S_β(Q) ≡ S(Q) ,\n\nwith\n\nS_α(Q) = - Σ_{h_α} Q(h_α) log Q(h_α) ,\n\nand similarly for S_β(Q). The subsets indexed by β correspond to intersections between the subsets indexed by α, intersections of intersections, and so on. The parameters c_β are called Moebius or overcounting numbers. 
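A quick numerical check of the entropy approximation above: on a chain of three binary variables (a singly-connected structure), the Bethe choice of α-clusters {x1,x2} and {x2,x3} with β-subset {x2} (overcounting number c_β = 1 - 2 = -1) reproduces the exact entropy. A minimal sketch with randomly drawn conditional probability tables:

```python
import numpy as np

def entropy(p):
    """Entropy of a probability table, ignoring zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# P(x1, x2, x3) = P(x1) P(x2|x1) P(x3|x2): a chain, i.e. singly connected
rng = np.random.default_rng(0)
p1 = rng.dirichlet([1.0, 1.0])
p21 = rng.dirichlet([1.0, 1.0], size=2)    # P(x2|x1), rows indexed by x1
p32 = rng.dirichlet([1.0, 1.0], size=2)    # P(x3|x2), rows indexed by x2
P = np.einsum('i,ij,jk->ijk', p1, p21, p32)

# alpha-clusters {x1,x2}, {x2,x3}; beta-subset {x2} with c_beta = 1 - 2 = -1
S_bethe = (entropy(P.sum(axis=2)) + entropy(P.sum(axis=0))
           - entropy(P.sum(axis=(0, 2))))
assert np.isclose(entropy(P), S_bethe)
```

This is just the entropy chain rule H(x1,x2,x3) = H(x1,x2) + H(x2,x3) - H(x2), which holds exactly on a chain; on a loopy graph the same expression is only an approximation.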
In the above description, the α-clusters correspond to the potential subsets, i.e., the clusters in the moralized graph. However, we can also choose them to be larger, e.g., combining several potentials into a single cluster. The Kikuchi/Bethe approximation is exact if the α-clusters form a singly-connected structure. That is, exact inference is obtained when the α-clusters correspond to cliques in a junction tree. The β-subsets then play the role of the separators and have overcounting numbers 1 - n_β, with n_β the number of neighboring cliques. The larger the clusters, the higher the computational complexity.\n\nThere are different kinds of approximations (Bethe, CVM, junction graphs), each corresponding to a somewhat different choice of α-clusters, β-subsets and overcounting numbers (see [4] for an overview). In the following we will refer to all of them as Kikuchi approximations. The important point is that the approximate entropy is, like the energy, a sum of local terms. Furthermore, the Kikuchi free energy as a function of the probability distribution Q only depends on the marginals Q(x_α) and Q(x_β). The minimization of the exact free energy with respect to a probability distribution Q has been turned into the minimization of the Kikuchi free energy F(Q, θ) = E(Q, θ) - S(Q) with respect to a set of pseudo-marginals Q = {Q_α, Q_β}. For the approximation to make any sense, these pseudo-marginals have to be properly normalized as well as consistent, which boils down to a set of linear constraints of the form\n\nΣ_{x_α∖β} Q_α(x_α) = Q_β(x_β) .   (4)\n\nThe approximate EM algorithm based on the Kikuchi free energy now reads\n\napproximate E-step: fix θ and solve Q = argmin_{Q' ∈ P} F(Q', θ)\nM-step: fix Q and solve θ = argmin_{θ'} F(Q, θ') = argmin_{θ'} E(Q, θ')   (5)\n\nwhere P refers to all sets of consistent and properly normalized pseudo-marginals {Q_α, Q_β}. 
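The constraints (4) can be illustrated as follows. Marginals read off from a single joint distribution automatically satisfy normalization and consistency; general pseudo-marginals need not correspond to any joint, and the linear constraints below are all that the Kikuchi free energy requires. A hypothetical three-variable example:

```python
import numpy as np

# Hypothetical joint over (x1, x2, x3); clusters {x1,x2}, {x2,x3}, separator {x2}.
rng = np.random.default_rng(1)
P = rng.random((2, 2, 2))
P /= P.sum()                        # normalize the joint

Q_a = P.sum(axis=2)                 # cluster marginal Q_alpha(x1, x2)
Q_b = P.sum(axis=0)                 # cluster marginal Q_alpha'(x2, x3)
Q_beta = P.sum(axis=(0, 2))         # separator marginal Q_beta(x2)

# normalization constraints
assert np.isclose(Q_a.sum(), 1.0) and np.isclose(Q_b.sum(), 1.0)
# consistency constraints of the form (4)
assert np.allclose(Q_a.sum(axis=0), Q_beta)    # sum over x1
assert np.allclose(Q_b.sum(axis=1), Q_beta)    # sum over x3
```
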
Because the entropy does not depend on the parameters θ, the M-step of the approximate EM algorithm is completely equivalent to the M-step of the exact EM algorithm. The only difference is that the statistics required for this M-step are computed approximately rather than exactly. In other words, the seemingly naive procedure of using generalized or loopy belief propagation to compute the statistics in the E-step and using them in the M-step can be interpreted as alternate minimization of the Kikuchi approximation of the exact free energy. That is, algorithm (5) can be interpreted as a bound optimization algorithm for minimizing\n\nL(θ) = min_{Q ∈ P} F(Q, θ) ,\n\nwhich we hope to be a good approximation (not necessarily a bound) of the original L(θ).\n\n4 Constrained optimization\n\nThere are two kinds of approaches for finding the minimum of the Kikuchi free energy. The first one is to run loopy or generalized belief propagation, e.g., using Algorithm 1, in the hope that it converges to such a minimum. However, convergence guarantees can only be given in special cases, and in practice one does observe convergence problems. In the following we will refer to the use of standard belief propagation in the E-step as the \"naive algorithm\".\n\nRecently, double-loop algorithms have been derived that explicitly minimize the Kikuchi free energy [8, 9, 11]. Technically, finding the minimum of the Kikuchi free energy with respect to consistent marginals corresponds to a non-convex constrained optimization problem. The consistency and normalization constraints on the marginals are linear in Q, and so is the energy term E(Q, θ). The non-convexity stems from the entropy terms, specifically those with negative overcounting numbers. Most currently described techniques, such as CCCP [8], UPS [9] and variants thereof, can be understood as general bound optimization algorithms. 
In CCCP concave terms are bounded with a linear term, yielding a convex bound and thus, in combination with the linear constraints, a convex optimization problem to be solved in the inner loop. In particular we can write\n\nF(Q, θ) = min_{R ∈ P} G(Q, R, θ) with G(Q, R, θ) = F(Q, θ) + K(Q, R) ,   (6)\n\nAlgorithm 1 Generalized belief propagation.\n1: while not converged do\n2:   for all β do\n3:     for all α ⊃ β do\n4:       Q_α(x_β) = Σ_{x_α∖β} Q_α(x_α); μ_{α→β}(x_β) = Q_α(x_β) / μ_{β→α}(x_β)\n5:     end for\n6:     Q_β(x_β) ∝ [ Π_{α ⊃ β} μ_{α→β}(x_β) ]^{1/(n_β + c_β)}\n7:     for all α ⊃ β do\n8:       μ_{β→α}(x_β) = Q_β(x_β) / μ_{α→β}(x_β); Q_α(x_α) ∝ Ψ_α(x_α) Π_{β ⊂ α} μ_{β→α}(x_β)\n9:     end for\n10:   end for\n11: end while\n\nwhere\n\nK(Q, R) = Σ_{β: c_β < 0} |c_β| Σ_{h_β} Q_β(h_β) log [ Q_β(h_β) / R_β(h_β) ]\n\nis a weighted sum of local Kullback-Leibler divergences. By construction G(Q, R, θ) is convex in Q - the concave Q_β log Q_β terms in F(Q, θ) cancel with those in K(Q, R) - as well as an upper bound on F(Q, θ), since K(Q, R) ≥ 0. The now convex optimization problem in the inner loop can be solved with a message passing algorithm very similar to standard loopy or generalized belief propagation. In fact, we can use Algorithm 1, with c_β = 0 and after a slight redefinition of the potentials Ψ_α such that they incorporate the linear bound of the concave entropy terms (see [11] for details). The messages in this algorithm are in one-to-one correspondence with the Lagrange multipliers of the concave dual. Most importantly, with the particular scheduling in Algorithm 1, each update is guaranteed to increase the dual and therefore the inner-loop algorithm must converge to its unique solution. 
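The key properties of the bound (6) are easy to check numerically: K(Q, R) is a weighted sum of KL divergences, so K ≥ 0 everywhere and K = 0 at R = Q, which makes G(Q, R, θ) = F(Q, θ) + K(Q, R) an upper bound on F that touches at R = Q. A minimal sketch with hypothetical β-subsets and overcounting numbers:

```python
import numpy as np

def kl(q, r):
    """Kullback-Leibler divergence between two discrete distributions."""
    return np.sum(q * np.log(q / r))

# Hypothetical beta-subsets and pseudo-marginals, for illustration only.
rng = np.random.default_rng(2)
c_beta = {'b1': -1.0, 'b2': -2.0}   # only subsets with c_beta < 0 enter K
Q = {b: rng.dirichlet(np.ones(3)) for b in c_beta}
R = {b: rng.dirichlet(np.ones(3)) for b in c_beta}

def K(Q, R):
    return sum(abs(c) * kl(Q[b], R[b]) for b, c in c_beta.items())

assert K(Q, R) >= 0.0               # so G = F + K upper-bounds F
assert np.isclose(K(Q, Q), 0.0)     # and the bound touches at R = Q
```
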
The outer loop simply sets R = Q and corresponds to a reset of the bound.\n\nIncorporating this double-loop algorithm into our approximate EM algorithm (5), we obtain\n\ninner-loop E-step: fix {θ, R} and solve Q = argmin_{Q' ∈ P} G(Q', R, θ)\nouter-loop E-step: fix {Q, θ} and solve R = argmin_{R' ∈ P} G(Q, R', θ) = argmin_{R'} K(Q, R')\nM-step: fix {Q, R} and solve θ = argmin_{θ'} G(Q, R, θ') = argmin_{θ'} E(Q, θ')   (7)\n\nTo distinguish it from the naive algorithm, we will refer to (7) as the \"convergent algorithm\". The crucial observation is that we can combine the outer-loop E-step with the usual M-step: there is no need to run the double-loop algorithm in the E-step until convergence. This gives us an overall double-loop rather than triple-loop algorithm. In principle (see however the next section) the algorithmic complexity of the convergent algorithm is the same as that of the naive algorithm.\n\nFigure 1: Learning a coupled hidden Markov model. (a) Architecture for 3 time slices and 4 hidden nodes per time slice. (b) Minus the loglikelihood in the Kikuchi/Bethe approximation as a function of the number of M-steps. Naive algorithm (solid line), convergent algorithm (dashed), convergent algorithm with tighter bound and overrelaxation (dash-dotted), same for a Kikuchi approximation (dotted). See text for details.\n\n5 Simulations\n\nFor illustration, we compare the naive and convergent approximate EM algorithms for learning in a coupled hidden Markov model. 
The architecture of coupled hidden Markov models is sketched in Figure 1(a) for T = 3 time slices and M = 4 hidden-variable nodes per time slice. In our simulations we used M = 5 and T = 20; all nodes are binary. The parameters to be learned are the observation matrix p(e_{m,t} = i | h_{m,t} = j) and two transition matrices: p(h_{1,t+1} = i | h_{1,t} = j, h_{2,t} = k) = p(h_{M,t+1} = i | h_{M,t} = j, h_{M-1,t} = k) for the outer nodes and p(h_{m,t+1} = i | h_{m-1,t} = j, h_{m,t} = k, h_{m+1,t} = l) for the middle nodes. The prior for the first time slice is fixed and uniform. We randomly generated properly normalized transition and observation matrices and evidence given those matrices. Initial parameters were set to another randomly generated instance. In the inner loop of both the naive and the convergent algorithm, Algorithm 1 was run for 10 iterations.\n\nLoopy belief propagation, which for dynamic Bayesian networks can be interpreted as an iterative version of the Boyen-Koller algorithm [12], converged just fine for the many instances that we have seen. The naive algorithm nicely minimizes the Bethe approximation of minus the loglikelihood L(θ), as can be seen from the solid line in Figure 1(b). The Bethe approximation is fairly accurate in this model, and plots of the exact loglikelihood, both for parameters learned with exact and with approximate EM, are very similar (not shown). The convergent algorithm also works fine, but takes more time to converge (dashed line). This is to be expected: the additional bound implied by the outer-loop E-step makes G(Q, R, θ) a looser bound of L(θ) than F(Q, θ), and the tighter the bound in a bound optimization algorithm, the faster the convergence. Therefore, it makes sense to use tighter convex bounds on F(Q, θ), for example those derived in [11]. On top of that, we can use overrelaxation, i.e., set log Q = η log R + (1 - η) log Q_old (up to normalization), with Q_old the previous set of pseudo-marginals. See e.g. 
[10] for the general idea; here we took η = 1.4 fixed. Application of these two \"tricks\" yields the dash-dotted line. It gives an indication of how close one can bring the convergent to the naive algorithm (overrelaxation applied to the M-step affects both algorithms in the same way and is therefore not considered here). Another option is to repeat the inner and outer E-steps N times before updating the parameters in the M-step. Plots for N ≥ 3 are indistinguishable from the solid line for the naive algorithm.\n\nThe above shows that the price to be paid for an algorithm that is guaranteed to converge is relatively low. Obviously, the true value of the convergent algorithm becomes clear when the naive algorithm fails. Many instances of non-convergence of loopy and especially generalized belief propagation have been reported (see e.g. [3, 11] and [12] specifically on coupled hidden Markov models). Some but not all of these problems disappear when the updates are damped, which further has the drawback of slowing down convergence as well as requiring additional tuning. In the context of the coupled hidden Markov models we observed serious problems with generalized belief propagation. For example, with α-clusters of size 12, consisting of 3 neighboring hidden and evidence nodes in two subsequent time slices, we did not manage to get the naive algorithm to converge properly. 
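The overrelaxation update used above can be sketched in a few lines; the distributions below are hypothetical placeholders, and the renormalization absorbs the \"up to normalization\" in the text:

```python
import numpy as np

# Overrelaxation: log Q = eta * log R + (1 - eta) * log Q_old, renormalized.
def overrelax(R, Q_old, eta=1.4):
    log_q = eta * np.log(R) + (1.0 - eta) * np.log(Q_old)
    q = np.exp(log_q - log_q.max())    # subtract the max for numerical stability
    return q / q.sum()

R = np.array([0.7, 0.2, 0.1])          # new pseudo-marginal (hypothetical)
Q_old = np.array([0.5, 0.3, 0.2])      # previous pseudo-marginal (hypothetical)
Q = overrelax(R, Q_old)
assert np.isclose(Q.sum(), 1.0)
# eta = 1 recovers the plain, unrelaxed update
assert np.allclose(overrelax(R, Q_old, eta=1.0), R)
```

With η > 1 the update overshoots in the direction of the new estimate, which often speeds up bound optimization at no extra cost per iteration [10].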
The convergent algorithm always converged without any problem, yielding the dotted line in Figure 1(b) for the particular problem instance considered for the Bethe approximation as well. Note that, where the inner loops for the Bethe approximations take about the same amount of time (which makes the number of outer loops roughly proportional to cpu time), an inner loop for the Kikuchi approximation is in this case about two times slower.\n\n6 Discussion\n\nThe main idea of this article, that there is no need to run a converging double-loop algorithm in an approximate E-step until convergence, only applies to directed probabilistic graphical models like Bayesian networks. In undirected graphical models like Boltzmann machines there is a global normalization constant that typically depends on all parameters θ and is intractable to compute analytically. For this so-called partition function, the bound used in converging double-loop algorithms works in the opposite direction to the bound implicit in the EM algorithm. The convex bound of [13] does work in the right direction, but cannot (yet) handle missing values. In [14] standard loopy belief propagation is used in the inner loop of iterative proportional fitting (IPF). Also here it is not yet clear how to integrate IPF with convergent belief propagation without ending up with a triple-loop algorithm.\n\nFollowing the same line of reasoning, expectation maximization can be combined with expectation propagation (EP) [5]. EP can be understood as a generalization of loopy belief propagation. Besides neglecting possible loops in the graphical structure, expectation propagation can also handle projections onto an exponential family of distributions. The approximate free energy for EP is the same Bethe free energy; only the constraints are different. 
That is, the \"strong\" marginalization constraints (4) are replaced by the \"weak\" marginalization constraints that all subset marginals agree upon their moments. These constraints are still linear in Q_α and Q_β, and we can make the same decomposition (6) of the Bethe free energy into a convex and a concave term to derive a double-loop algorithm with a convex optimization problem in the inner loop. However, EP can have reasons for non-convergence that are not necessarily resolved with a double-loop version. For example, it can happen that negative covariance matrices appear while projecting onto Gaussians. This problem has, to the best of our knowledge, not yet been solved and is subject to ongoing research.\n\nIt has been emphasized before [13] that it makes no sense to learn with approximate inference and then apply exact inference given the learned parameters. The intuition is that we tune the parameters to the evidence, incorporating the errors that are made while doing approximate inference. In that context it is important that the results of approximate inference are reproducible, and the use of convergent algorithms is a relevant step in that direction.\n\nReferences\n\n[1] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1-38, 1977.\n\n[2] R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. Jordan, editor, Learning in Graphical Models, pages 355-368. Kluwer Academic Publishers, Dordrecht, 1998.\n\n[3] K. Murphy, Y. Weiss, and M. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 467-475, San Francisco, CA, 1999. Morgan Kaufmann.\n\n[4] J. Yedidia, W. Freeman, and Y. Weiss. 
Constructing free energy approximations and generalized belief propagation algorithms. Technical report, Mitsubishi Electric Research Laboratories, 2002.\n\n[5] T. Minka. Expectation propagation for approximate Bayesian inference. In Uncertainty in Artificial Intelligence: Proceedings of the Seventeenth Conference (UAI-2001), pages 362-369, San Francisco, CA, 2001. Morgan Kaufmann Publishers.\n\n[6] B. Frey and A. Kannan. Accumulator networks: Suitors of local probability propagation. In T. Leen, T. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 486-492. MIT Press, 2001.\n\n[7] T. Minka and J. Lafferty. Expectation propagation for the generative aspect model. In Proceedings of UAI-2002, pages 352-359, 2002.\n\n[8] A. Yuille. CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14:1691-1722, 2002.\n\n[9] Y. Teh and M. Welling. The unified propagation and scaling algorithm. In NIPS 14, 2002.\n\n[10] R. Salakhutdinov and S. Roweis. Adaptive overrelaxed bound optimization methods. In ICML-2003, 2003.\n\n[11] T. Heskes, K. Albers, and B. Kappen. Approximate inference and constrained optimization. In UAI-2003, 2003.\n\n[12] K. Murphy and Y. Weiss. The factored frontier algorithm for approximate inference in DBNs. In UAI-2001, pages 378-385, 2001.\n\n[13] M. Wainwright, T. Jaakkola, and A. Willsky. Tree-reweighted belief propagation algorithms and approximate ML estimation via pseudo-moment matching. In AISTATS-2003, 2003.\n\n[14] Y. Teh and M. Welling. On improving the efficiency of the iterative proportional fitting procedure. 
In AISTATS-2003, 2003.\n", "award": [], "sourceid": 2404, "authors": [{"given_name": "Tom", "family_name": "Heskes", "institution": null}, {"given_name": "Onno", "family_name": "Zoeter", "institution": null}, {"given_name": "Wim", "family_name": "Wiegerinck", "institution": null}]}