{"title": "VIBES: A Variational Inference Engine for Bayesian Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 793, "page_last": 800, "abstract": "", "full_text": "VIBES: A Variational Inference Engine for Bayesian Networks

Christopher M. Bishop
Microsoft Research
Cambridge, CB3 0FB, U.K.
research.microsoft.com/~cmbishop

David Spiegelhalter
MRC Biostatistics Unit
Cambridge, U.K.
david.spiegelhalter@mrc-bsu.cam.ac.uk

John Winn
Department of Physics
University of Cambridge, U.K.
www.inference.phy.cam.ac.uk/jmw39

Abstract

In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. Each of these steps is both time consuming and error prone. In this paper we describe a general purpose inference engine called VIBES ('Variational Inference for Bayesian Networks') which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified either through a simple script or via a graphical interface analogous to a drawing package. VIBES then automatically generates and solves the variational equations. We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling.

1 Introduction

Variational methods [1, 2] have been used successfully for a wide range of models, and new applications are constantly being explored.
In many ways the variational framework can be seen as a complementary approach to that of Markov chain Monte Carlo (MCMC), with different strengths and weaknesses.

For many years there has existed a powerful tool for tackling new problems using MCMC, called BUGS ('Bayesian inference Using Gibbs Sampling') [3]. In BUGS a new probabilistic model, expressed as a directed acyclic graph, can be encoded using a simple scripting notation, and then samples can be drawn from the posterior distribution (given some data set of observed values) using Gibbs sampling in a way that is largely automatic. Furthermore, an extension called WinBUGS provides a graphical front end to BUGS in which the user draws a pictorial representation of the directed graph, and this automatically generates the required script.

We have been inspired by the success of BUGS to produce an analogous tool for the solution of problems using variational methods. The challenge is to build a system that can handle a wide range of graph structures, a broad variety of common conditional probability distributions at the nodes, and a range of variational approximating distributions. All of this must be achieved whilst also remaining computationally efficient.

2 A General Framework for Variational Inference

In this section we briefly review the variational framework, and then we characterise a large class of models for which the variational method can be implemented automatically. We denote the set of all variables in the model by W = (V, X), where V are the visible (observed) variables and X are the hidden (latent) variables. As with BUGS, we focus on models that are specified in terms of an acyclic directed graph (treatment of undirected graphical models is equally possible and is somewhat more straightforward).
The joint distribution P(V, X) is then expressed in terms of conditional distributions P(W_i | pa_i) at each node i, where pa_i denotes the set of variables corresponding to the parents of node i, and W_i denotes the variable, or group of variables, associated with node i. The joint distribution of all variables is then given by the product of the conditionals P(V, X) = Π_i P(W_i | pa_i).

Our goal is to find a variational distribution Q(X|V) that approximates the true posterior distribution P(X|V). To do this we note the following decomposition of the log marginal probability of the observed data, which holds for any choice of distribution Q(X|V):

ln P(V) = L(Q) + KL(Q||P)   (1)

where

L(Q) = Σ_X Q(X|V) ln [ P(V, X) / Q(X|V) ]   (2)

KL(Q||P) = − Σ_X Q(X|V) ln [ P(X|V) / Q(X|V) ]   (3)

and the sums are replaced by integrals in the case of continuous variables. Here KL(Q||P) is the Kullback-Leibler divergence between the variational approximation Q(X|V) and the true posterior P(X|V). Since this satisfies KL(Q||P) ≥ 0, it follows from (1) that the quantity L(Q) forms a lower bound on ln P(V).

We now choose some family of distributions to represent Q(X|V) and then seek a member of that family that maximizes the lower bound L(Q). If we allow Q(X|V) complete flexibility then we see that the maximum of the lower bound occurs for Q(X|V) = P(X|V), so that the variational posterior distribution equals the true posterior. In this case the Kullback-Leibler divergence vanishes and L(Q) = ln P(V). However, working with the true posterior distribution is computationally intractable (otherwise we wouldn't be resorting to variational methods).
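The decomposition (1)-(3) can be verified exactly on a toy discrete model; the following sketch uses invented numbers and is not part of the paper's implementation:

```python
import math

# Toy model: one hidden variable X in {0,1}, one observed value V = 1.
# P(X) and P(V=1|X) are made-up numbers for illustration only.
p_x = [0.6, 0.4]
p_v_given_x = [0.2, 0.9]                                # P(V=1|X=0), P(V=1|X=1)

p_joint = [p_x[i] * p_v_given_x[i] for i in range(2)]   # P(V=1, X=i)
p_v = sum(p_joint)                                      # marginal P(V=1)
post = [p / p_v for p in p_joint]                       # true posterior P(X|V=1)

def bound_and_kl(q):
    """L(Q) from (2) and KL(Q||P) from (3) for a variational distribution q."""
    L = sum(q[i] * math.log(p_joint[i] / q[i]) for i in range(2))
    KL = -sum(q[i] * math.log(post[i] / q[i]) for i in range(2))
    return L, KL

q = [0.5, 0.5]                          # an arbitrary choice of Q(X|V)
L, KL = bound_and_kl(q)
assert abs(math.log(p_v) - (L + KL)) < 1e-12   # decomposition (1) holds exactly
assert KL >= 0                                 # hence L(Q) <= ln P(V)
```

For any q the identity (1) holds exactly, and KL ≥ 0 confirms that L(Q) is a lower bound on ln P(V).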
We must therefore consider a more restricted family of Q distributions which has the property that the lower bound (2) can be evaluated and optimized efficiently and yet which is still sufficiently flexible as to give a good approximation to the true posterior distribution.

2.1 Factorized Distributions

For the purposes of building VIBES we have focussed attention initially on distributions that factorize with respect to disjoint groups X_i of variables

Q(X|V) = Π_i Q_i(X_i).   (4)

This approximation has been successfully used in many applications of variational methods [4, 5, 6]. Substituting (4) into (2) we can maximize L(Q) variationally with respect to Q_i(X_i), keeping all Q_j for j ≠ i fixed. This leads to the solution

ln Q*_i(X_i) = ⟨ln P(V, X)⟩_{j ≠ i} + const.   (5)

where ⟨·⟩_k denotes an expectation with respect to the distribution Q_k(X_k). Taking exponentials of both sides and normalizing we obtain

Q*_i(X_i) = exp⟨ln P(V, X)⟩_{j ≠ i} / Σ_{X_i} exp⟨ln P(V, X)⟩_{j ≠ i}.   (6)

Note that these are coupled equations, since the solution for each Q_i(X_i) depends on expectations with respect to the other factors Q_{j ≠ i}. The variational optimization proceeds by initializing each of the Q_i(X_i) and then cycling through each factor in turn, replacing the current distribution with a revised estimate given by (6).
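The cyclic update scheme of (6) can be sketched for a model with two binary hidden variables; the joint table below is invented for illustration:

```python
import math

# Unnormalized joint P(V, X1, X2) over two binary hidden variables, with the
# observed V absorbed into the table (illustrative numbers).
logp = [[math.log(w) for w in row] for row in [[0.30, 0.05], [0.10, 0.55]]]

q1, q2 = [0.5, 0.5], [0.5, 0.5]        # initial factors Q1(X1), Q2(X2)

def normalise(logs):
    """Exponentiate and normalize a vector of log values."""
    m = max(logs)
    e = [math.exp(v - m) for v in logs]
    s = sum(e)
    return [v / s for v in e]

def lower_bound():
    """L(Q) = sum_X Q(X) [ln P(V, X) - ln Q(X)] for the factorized Q."""
    return sum(q1[i] * q2[j] * (logp[i][j] - math.log(q1[i] * q2[j]))
               for i in range(2) for j in range(2))

bounds = []
for _ in range(20):                    # cycle through the factors, as in (6)
    q1 = normalise([sum(q2[j] * logp[i][j] for j in range(2)) for i in range(2)])
    q2 = normalise([sum(q1[i] * logp[i][j] for i in range(2)) for j in range(2)])
    bounds.append(lower_bound())

# Each sweep of updates can only increase the lower bound.
assert all(b >= a - 1e-12 for a, b in zip(bounds, bounds[1:]))
```

Each factor update sets ln Q_i proportional to the expectation of ln P under the other factor, so the bound is non-decreasing over the sweeps, which is exactly the monitoring property exploited later for convergence checking.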
The current version of VIBES is based on a factorization of the form (4) in which each factor Q_i(X_i) corresponds to one of the nodes of the graph (each of which can be a composite node, as discussed shortly).

An important property of the variational update equations, from the point of view of VIBES, is that the right hand side of (6) does not depend on all of the conditional distributions P(W_i | pa_i) that define the joint distribution, but only on those that have a functional dependence on X_i, namely the conditional P(X_i | pa_i) together with the conditional distributions for any children of node i, since these have X_i in their parent set. Thus the expectations that must be performed on the right hand side of (6) involve only those variables lying in the Markov blanket of node i, in other words the parents, children and co-parents of i, as illustrated in Figure 1(a). This is a key concept in VIBES since it allows the variational update equations to be expressed in terms of local operations, which can therefore be implemented in generic code that is independent of the global structure of the graph.

2.2 Conjugate Exponential Models

It has already been noted [4, 5] that important simplifications to the variational update equations occur when the distributions of the latent variables, conditioned on their parameters, are drawn from the exponential family and are conjugate with respect to the prior distributions of the parameters. Here we adopt a somewhat different viewpoint in that we make no distinction between latent variables and model parameters. In a Bayesian setting these both correspond to unobserved stochastic variables and can be treated on an equal footing.
This allows us to consider conjugacy not just between variables and their parameters, but hierarchically between all parent-child pairs in the graph.

Thus we consider models in which each conditional distribution takes the standard exponential family form

ln P(X_i | Y) = φ_i(Y)^T u_i(X_i) + f_i(X_i) + g_i(Y)   (7)

where the vector φ_i(Y) is called the natural parameter of the distribution. Now consider a node Z_j with parent X_i and co-parents cp_j^(i), as indicated in Figure 1(a).

Figure 1: (a) A central observation is that the variational update equations for node X_i depend only on expectations over variables appearing in the Markov blanket of X_i, namely the set of parents, children and co-parents. (b) Hinton diagram of ⟨W⟩ from one of the components in the Bayesian PCA model, illustrating how all but three of the PCA eigenvectors have been suppressed.

As far as the pair of nodes X_i and Z_j are concerned, we can think of P(X_i | Y) as a prior over X_i and the conditional P(Z_j | X_i, cp_j^(i)) as a (contribution to) the likelihood function. Conjugacy requires that, as a function of X_i, the product of these two conditionals must take the same form as (7). Since the conditional P(Z_j | X_i, cp_j^(i)) is also in the exponential family it can be expressed as

ln P(Z_j | X_i, cp_j^(i)) = φ_j(X_i, cp_j^(i))^T u_j(Z_j) + f_j(Z_j) + g_j(X_i, cp_j^(i)).   (8)

Conjugacy then requires that this be expressible in the form

ln P(Z_j | X_i, cp_j^(i)) = φ̃_{j→i}(Z_j, cp_j^(i))^T u_i(X_i) + λ(Z_j, cp_j^(i))   (9)

for some choice of functions φ̃ and λ. Since this must hold for each of the parents of Z_j, it follows that ln P(Z_j | X_i, cp_j^(i)) must be a multi-linear function of the u_k(X_k) for each of the parents X_k of node Z_j.
Also, we observe from (8) that the dependence of ln P(Z_j | X_i, cp_j^(i)) on Z_j is again linear in the function u_j(Z_j). We can apply a similar argument to the conjugate relationship between node X_i and each of its parents, showing that the contribution from the conditional P(X_i | Y) can again be expressed in terms of expectations of the natural parameters for the parent node distributions. Hence the right hand side of the variational update equation (5) for a particular node X_i will be a multi-linear function of the expectations ⟨u⟩ for each node in the Markov blanket of X_i.

The variational update equation then takes the form

ln Q*_i(X_i) = { ⟨φ_i(Y)⟩_Y + Σ_{j=1}^M ⟨φ̃_{j→i}(Z_j, cp_j^(i))⟩_{Z_j, cp_j^(i)} }^T u_i(X_i) + const.   (10)

which involves the summation of bottom-up 'messages' ⟨φ̃_{j→i}⟩_{Z_j, cp_j^(i)} from the children together with a top-down message ⟨φ_i(Y)⟩_Y from the parents. Since all of these messages are expressed in terms of the same basis u_i(X_i), we can write compact, generic code for updating any type of node, instead of having to take account explicitly of the many possible combinations of node types in each Markov blanket.

As an example, consider the Gaussian N(X | μ, τ^-1) for a single variable X with mean μ and precision (inverse variance) τ. The natural coordinates are u_X = [X, X^2]^T and the natural parameterization is φ = [μτ, −τ/2]^T. Then ⟨u⟩ = [μ, μ^2 + τ^-1]^T, and the function f_i(X_i) is simply zero in this case. Conjugacy allows us to choose a distribution for the parent μ that is Gaussian and a prior for τ that is a Gamma distribution. The corresponding natural parameterizations and update messages are given by

u_μ = [μ, μ^2]^T,   ⟨φ̃_{X→μ}⟩ = [⟨τ⟩⟨X⟩, −⟨τ⟩/2]^T,
u_τ = [τ, ln τ]^T,   ⟨φ̃_{X→τ}⟩ = [−⟨(X − μ)^2⟩/2, 1/2]^T.

We can similarly consider multi-dimensional Gaussian distributions, with a Gaussian prior for the mean and a Wishart prior for the inverse covariance matrix.

A generalization of the Gaussian is the rectified Gaussian, which is defined as P(X | μ, τ) ∝ N(X | μ, τ) for X ≥ 0 and P(X | μ, τ) = 0 for X < 0, for which moments can be expressed in terms of the 'erf' function. This rectification corresponds to the introduction of a step function, whose logarithm corresponds to f_i(X_i) in (7), which is carried through the variational update equations unchanged. Similarly, we can consider doubly truncated Gaussians, which are non-zero only over some finite interval.

Another example is the discrete distribution for categorical variables. These are most conveniently represented using the 1-of-K scheme in which S = {S_k} with k = 1, ..., K, S_k ∈ {0, 1} and Σ_k S_k = 1. This has distribution P(S | π) = Π_{k=1}^K π_k^{S_k}, and we can place a conjugate Dirichlet distribution over the parameters {π_k}.

2.3 Allowable Distributions

We now characterize the class of models that can be solved by VIBES using the factorized variational distribution given by (4). First of all we note that, since a Gaussian variable can have a Gaussian parent for its mean, we can extend this hierarchically to any number of levels to give a sub-graph which is a DAG of Gaussian nodes of arbitrary topology.
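As a minimal concrete instance of such a sub-graph, the following sketch runs the message-based updates of Section 2.2 for a single Gaussian node with a Gaussian prior on its mean and a Gamma prior on its precision; the priors, data and tolerances are invented for illustration:

```python
import random
import statistics

random.seed(1)
# Synthetic data drawn from N(mu=2, sd=0.5); not from the paper.
data = [random.gauss(2.0, 0.5) for _ in range(200)]
N, s1 = len(data), sum(data)

# Assumed broad priors: mu ~ N(m0, 1/beta0), tau ~ Gamma(a0, b0).
m0, beta0, a0, b0 = 0.0, 1e-3, 1e-3, 1e-3

E_tau = 1.0                            # initialise <tau>
for _ in range(50):                    # cycle the two factor updates
    # Q(mu): prior message [beta0*m0, -beta0/2] plus the N child messages
    # [<tau> x_n, -<tau>/2], all in the basis u_mu = [mu, mu^2].
    beta = beta0 + N * E_tau
    m = (beta0 * m0 + E_tau * s1) / beta
    E_mu, E_mu2 = m, m * m + 1.0 / beta
    # Q(tau): prior Gamma(a0, b0) plus child messages [-<(x-mu)^2>/2, 1/2]
    # in the basis u_tau = [tau, ln tau].
    a = a0 + N / 2.0
    b = b0 + 0.5 * sum(x * x - 2 * x * E_mu + E_mu2 for x in data)
    E_tau = a / b

# With broad priors the posterior means sit near the sample statistics.
assert abs(E_mu - statistics.mean(data)) < 1e-4
assert abs(1.0 / E_tau - statistics.pvariance(data)) < 0.05
```

The two updates are exactly the coupled equations (6) for this model; because each message lives in the node's own basis, the same update code would serve any conjugate parent-child pair.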
Each Gaussian can have a Gamma (or Wishart) prior over its precision.

Next, we observe that discrete variables S = {S_k} can be used to construct 'pick' functions which choose a particular parent node Ŷ from amongst several conjugate parents {Y_k}, so that Ŷ = Y_k when S_k = 1, which can be written Ŷ = Π_{k=1}^K Y_k^{S_k}. Under any non-linear function h(·) we have h(Ŷ) = Π_{k=1}^K h(Y_k)^{S_k}. Furthermore, the expectation under S takes the form ⟨h(Ŷ)⟩_S = Σ_k ⟨S_k⟩ h(Y_k). Variational inference will therefore be tractable for this model provided it is tractable for each of the parents Y_k individually.

Thus we can handle the following very general architecture: an arbitrary DAG of multinomial discrete variables (each having Dirichlet priors) together with an arbitrary DAG of linear Gaussian nodes (each having Wishart priors) and with arbitrary pick links from the discrete nodes to the Gaussian nodes. This graph represents a generalization of the Gaussian mixture model, and includes as special cases models such as hidden Markov models, Kalman filters, factor analysers and principal component analysers, as well as mixtures and hierarchical mixtures of all of these.

There are other classes of models that are tractable under this scheme, for example Poisson variables having Gamma priors, although these may be of limited interest.

We can further extend the class of tractable models by considering nodes whose natural parameters are formed from deterministic functions of the states of several parents. This is a key property of the VIBES approach which, as with BUGS, greatly extends its applicability. Suppose we have some conditional distribution P(X | Y, ...) and we want to make Y some deterministic function of the states of some other nodes (Z_1, ..., Z_M).
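The identity ⟨h(Ŷ)⟩_S = Σ_k ⟨S_k⟩ h(Y_k) that makes pick functions tractable can be checked directly against brute-force enumeration; the values below are invented for illustration:

```python
import math

# Pick function: Yhat = prod_k Y_k^{S_k} selects Y_k when S_k = 1.
Y = [1.5, 2.0, 4.0]               # candidate parent values (illustrative)
h = math.log                      # any non-linear function
ES = [0.2, 0.3, 0.5]              # responsibilities <S_k>, summing to 1

# Closed form: <h(Yhat)>_S = sum_k <S_k> h(Y_k).
expect_fast = sum(ES[k] * h(Y[k]) for k in range(3))

# Brute force: enumerate the K one-hot configurations of S.
expect_brute = 0.0
for k in range(3):
    yhat = 1.0
    for j in range(3):
        yhat *= Y[j] ** (1 if j == k else 0)   # h applied after the pick
    expect_brute += ES[k] * h(yhat)

assert abs(expect_fast - expect_brute) < 1e-12
```

Because the expectation distributes over the one-hot settings of S, the cost is linear in K rather than requiring any non-linear moment of a mixture.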
In effect we have a pseudo-parent that is a deterministic function of other nodes, and indeed it is represented explicitly through additional deterministic nodes in the graphical interface both to WinBUGS and to VIBES. This will be tractable under VIBES provided the expectation of u_Y(Y) can be expressed in terms of the expectations of the corresponding functions u_j(Z_j) of the parents. The pick functions discussed earlier are a special case of these deterministic functions.

Thus for a Gaussian node the mean can be formed from products and sums of the states of other Gaussian nodes, provided the function is linear with respect to each of the nodes. Similarly, the precision of the Gaussian can comprise the product (but not sums) of any number of Gamma distributed variables.

Finally, we have seen that continuous nodes can have both discrete and continuous parents, but that discrete nodes can only have discrete parents. We can allow discrete nodes to have continuous parents by stepping outside the conjugate-exponential framework and exploiting a variational bound on the logistic sigmoid function [1].

We also wish to be able to evaluate the lower bound (2), both to confirm the correctness of the variational updates (since the value of the bound should never decrease) and to monitor convergence and set termination criteria. This can be done efficiently, largely using quantities that have already been calculated during the variational updates.

3 VIBES: A Software Implementation

Creation of a model in VIBES simply involves drawing the graph (using operations similar to those in a simple drawing package) and then assigning properties to each node, such as the functional form of the distribution, a list of the other variables it is conditioned on, and the location of the corresponding data file if the node is observed.
The menu of distributions available to the user is dynamically adjusted at each stage to ensure that only valid conjugate models can be constructed.

As in WinBUGS, we have adopted the convention of making logical (deterministic) nodes explicit in the graphical representation, as this greatly simplifies the specification and interpretation of the model. We also use the 'plate' notation of a box surrounding one or more nodes to denote that those nodes are replicated some number of times, as specified by the parameter appearing in the bottom right hand corner of the box.

3.1 Example: Bayesian Mixture Models

We illustrate VIBES using a Bayesian model for a mixture of M probabilistic PCA distributions, each having maximum intrinsic dimensionality of q, with a sparse prior [6], for which the VIBES implementation is shown in Figure 2. Here there are N observations of the vector t whose dimensionality is d, as indicated by the plates. The dimensionality of the other variables is also determined by which plates they are contained in (e.g. W has dimension d × q × M whereas τ is a scalar). Variables t, x, W and μ are Gaussian, τ and α have Gamma distributions, S is discrete and π is Dirichlet.

Figure 2: Screen shot from VIBES showing the graph for a mixture of probabilistic PCA distributions. The node t is coloured black to denote that this variable is observed, and the node 'alpha' has been highlighted so that its properties (e.g. the form of the distribution) can be changed using the menus on the left hand side. The node labelled 'x.W+mu' is a deterministic node, and the double arrows denote deterministic relationships.

Once the model is completed (and the file or files containing the observed variables are specified) it is then 'compiled', which involves allocating memory for the variables and initializing the distributions Q_i (this is done using simple heuristics but can also be over-ridden by the user). If desired, monitoring of the lower bound (2) can be switched on (at the expense of slightly increased computation), and this can also be used to set a termination criterion. Alternatively the variational optimization can be run for a fixed number of iterations.

Once the optimization is complete, various diagnostics can be used to probe the results, such as the Hinton diagram plot shown in Figure 1(b).

Now suppose we wish to modify the model, for instance by having a single set of hyper-parameters α whose values are shared by all of the M components in the mixture, instead of having a separate set for each component. This simply involves dragging the α node outside of the M plate using the mouse and then recompiling (since α is now a vector of length q instead of a matrix of size M × q). This literally takes a few seconds, in contrast to the effort required to formulate the variational inference equations, and to develop bespoke code, for a new model! The result is then optimized as before. A screen shot of the corresponding VIBES model is shown in Figure 3.

4 Discussion

Our early experiences with VIBES have shown that it dramatically simplifies the construction and testing of new variational models, and readily allows a range of alternative models to be evaluated on a given problem.
Currently we are extending VIBES to cater for a broader range of variational distributions by allowing the user to specify a Q distribution defined over a subgraph of the true graph [7].

Finally, there are many possible extensions to the basic VIBES we have described here. For example, in order to broaden the range of models that can be tackled, we can combine variational inference with other methods such as Gibbs sampling or optimization (empirical Bayes), to allow for non-conjugate hyper-priors for instance. Similarly, there is scope for exploiting exact methods where there exist tractable sub-graphs.

Figure 3: As in Figure 2 but with the vector α of hyper-parameters moved outside the M 'plate'. This causes there to be only q terms in α, which are shared over the mixture components, rather than M × q. Note that, with no nodes highlighted, the side menus disappear.

References

[1] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. In M. I. Jordan, editor, Learning in Graphical Models, pages 105-162. Kluwer, 1998.

[2] R. M. Neal and G. E. Hinton. A new view of the EM algorithm that justifies incremental and other variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355-368. Kluwer, 1998.

[3] D. J. Lunn, A. Thomas, N. G. Best, and D. J. Spiegelhalter. WinBUGS - a Bayesian modelling framework: concepts, structure and extensibility. Statistics and Computing, 10:321-333, 2000. http://www.mrc-bsu.cam.ac.uk/bugs/.

[4] Z. Ghahramani and M. J. Beal. Propagation algorithms for variational Bayesian learning. In T. K. Leen, T. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13, Cambridge MA, 2001. MIT Press.

[5] H. Attias. A variational Bayesian framework for graphical models. In S. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, volume 12, pages 209-215, Cambridge MA, 2000. MIT Press.

[6] C. M. Bishop. Variational principal components. In Proceedings Ninth International Conference on Artificial Neural Networks, ICANN'99, volume 1, pages 509-514. IEE, 1999.

[7] Christopher M. Bishop and John Winn. Structured variational distributions in VIBES. In Proceedings Artificial Intelligence and Statistics, Key West, Florida, 2003. Accepted for publication.
", "award": [], "sourceid": 2172, "authors": [{"given_name": "Christopher", "family_name": "Bishop", "institution": null}, {"given_name": "David", "family_name": "Spiegelhalter", "institution": null}, {"given_name": "John", "family_name": "Winn", "institution": null}]}