{"title": "Approximate Inference by Compilation to Arithmetic Circuits", "book": "Advances in Neural Information Processing Systems", "page_first": 1477, "page_last": 1485, "abstract": "Arithmetic circuits (ACs) exploit context-specific independence and determinism to allow exact inference even in networks with high treewidth. In this paper, we introduce the first ever approximate inference methods using ACs, for domains where exact inference remains intractable. We propose and evaluate a variety of techniques based on exact compilation, forward sampling, AC structure learning, Markov network parameter learning, variational inference, and Gibbs sampling. In experiments on eight challenging real-world domains, we find that the methods based on sampling and learning work best: one such method (AC2-F) is faster and usually more accurate than loopy belief propagation, mean field, and Gibbs sampling; another (AC2-G) has a running time similar to Gibbs sampling but is consistently more accurate than all baselines.", "full_text": "Approximate Inference by Compilation to\n\nArithmetic Circuits\n\nDepartment of Computer and Information Science\n\nDaniel Lowd\n\nUniversity of Oregon\n\nEugene, OR 97403-1202\n\nlowd@cs.uoregon.edu\n\nPedro Domingos\n\nDepartment of Computer Science and Engineering\n\nUniversity of Washington\nSeattle, WA 98195-2350\n\npedrod@cs.washington.edu\n\nAbstract\n\nArithmetic circuits (ACs) exploit context-speci\ufb01c independence and determinism\nto allow exact inference even in networks with high treewidth. In this paper, we\nintroduce the \ufb01rst ever approximate inference methods using ACs, for domains\nwhere exact inference remains intractable. 
We propose and evaluate a variety of techniques based on exact compilation, forward sampling, AC structure learning, Markov network parameter learning, variational inference, and Gibbs sampling. In experiments on eight challenging real-world domains, we find that the methods based on sampling and learning work best: one such method (AC2-F) is faster and usually more accurate than loopy belief propagation, mean field, and Gibbs sampling; another (AC2-G) has a running time similar to Gibbs sampling but is consistently more accurate than all baselines.\n\n1 Introduction\n\nCompilation to arithmetic circuits (ACs) [1] is one of the most effective methods for exact inference in Bayesian networks. An AC represents a probability distribution as a directed acyclic graph of addition and multiplication nodes, with real-valued parameters and indicator variables at the leaves. This representation allows for exact inference in time linear in the size of the circuit. Compared to a junction tree, an AC can be exponentially smaller by omitting unnecessary computations, or by performing repeated subcomputations only once and referencing them multiple times. Given an AC, we can efficiently condition on evidence or marginalize variables to yield a simpler AC for the conditional or marginal distribution, respectively. We can also compute all marginals in parallel by differentiating the circuit. These many attractive properties make ACs an interesting and important representation, especially when answering many queries on the same domain. However, as with junction trees, compiling a BN to an equivalent AC yields an exponentially-sized AC in the worst case, preventing their application to many domains of interest.\nIn this paper, we introduce approximate compilation methods, allowing us to construct effective ACs for previously intractable domains. 
For selecting circuit structure, we compare exact compilation of a simplified network to learning it from samples. Structure selection is done once per domain, so the cost is amortized over all future queries. For selecting circuit parameters, we compare variational inference to maximum likelihood learning from samples. We find that learning from samples works best for both structure and parameters, achieving the highest accuracy on eight challenging, real-world domains. Compared to loopy belief propagation, mean field, and Gibbs sampling, our AC2-F method, which selects parameters once per domain, is faster and usually more accurate. Our AC2-G method, which optimizes parameters at query time, achieves higher accuracy on every domain with a running time similar to Gibbs sampling.\nThe remainder of this paper is organized as follows. In Section 2, we provide background on Bayesian networks and arithmetic circuits. In Section 3, we present our methods and discuss related work. We evaluate the methods empirically in Section 4 and conclude in Section 5.\n\n2 Background\n\n2.1 Bayesian networks\n\nBayesian networks (BNs) exploit conditional independence to compactly represent a probability distribution over a set of variables, {X_1, . . . , X_n}. A BN consists of a directed, acyclic graph with a node for each variable, and a set of conditional probability distributions (CPDs) describing the probability of each variable, X_i, given its parents in the graph, denoted π_i [2]. The full probability distribution is the product of the CPDs: P(X) = ∏_{i=1}^n P(X_i | π_i).\nEach variable in a BN is conditionally independent of its non-descendants given its parents. Depending on how the CPDs are parametrized, there may be additional independencies. For discrete domains, the simplest form of CPD is a conditional probability table, but this requires space exponential in the number of parents of the variable. 
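The factorization P(X) = ∏_i P(X_i | π_i) can be illustrated with a minimal sketch; the three-variable chain A → B → C and all CPT values below are invented for illustration, not taken from the paper:

```python
# Minimal sketch: the joint probability of a tiny Bayesian network as the
# product of its CPDs, P(X) = prod_i P(X_i | parents(X_i)).
# Network A -> B -> C; all probabilities are invented for illustration.

def p_a(a):
    return 0.6 if a else 0.4

def p_b_given_a(b, a):
    p = 0.9 if a else 0.2   # P(B=1 | A=a)
    return p if b else 1.0 - p

def p_c_given_b(c, b):
    p = 0.7 if b else 0.1   # P(C=1 | B=b)
    return p if c else 1.0 - p

def joint(a, b, c):
    return p_a(a) * p_b_given_a(b, a) * p_c_given_b(c, b)

# The joint sums to 1 over all 8 instantiations (up to floating point).
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

A conditional probability table for C with many parents would need a row per parent configuration; the decision-tree CPDs discussed next avoid that blow-up when the dependence is context-specific.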
A more scalable approach is to use decision trees as CPDs, taking advantage of context-specific independencies [3, 4, 5]. In a decision tree CPD for variable X_i, each interior node is labeled with one of the parent variables, and each of its outgoing edges is labeled with a value of that variable. Each leaf node is a multinomial representing the marginal distribution of X_i conditioned on the parent values specified by its ancestor nodes and edges in the tree.\nBayesian networks can be represented as log-linear models:\n\nlog P(X = x) = −log Z + Σ_i w_i f_i(x)   (1)\n\nwhere each f_i is a feature, each w_i is a real-valued weight, and Z is the partition function. In BNs, Z is 1, since the conditional distributions ensure global normalization. After conditioning on evidence, the resulting distribution may no longer be a BN, but it can still be represented as a log-linear model.\nThe goal of inference in Bayesian networks and other graphical models is to answer arbitrary marginal and conditional queries (i.e., to compute the marginal distribution of a set of query variables, possibly conditioned on the values of a set of evidence variables). Popular methods include variational inference, Gibbs sampling, and loopy belief propagation.\nIn variational inference, the goal is to select a tractable distribution Q that is as close as possible to the original, intractable distribution P. 
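Closeness between distributions is measured here by KL divergence; a minimal sketch for two discrete distributions (toy numbers, for illustration only):

```python
# Minimal sketch: KL divergence KL(P || Q) between two discrete
# distributions over the same finite set of states. In variational
# inference, Q would be restricted to a tractable family.
import math

def kl(p, q):
    # p, q: {state: probability}; terms with p(x) = 0 contribute nothing
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

p = {0: 0.7, 1: 0.3}
q = {0: 0.5, 1: 0.5}
# kl(p, p) is 0; kl(p, q) > 0; and in general kl(p, q) != kl(q, p),
# which is why the text distinguishes KL(P || Q) from the "reverse" KL(Q || P).
```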
Minimizing the KL divergence from P to Q (KL(P ∥ Q)) is generally intractable, so the “reverse” KL divergence is typically used instead:\n\nKL(Q ∥ P) = Σ_x Q(x) log(Q(x)/P(x)) = −H_Q(x) − Σ_i w_i E_Q[f_i] + log Z_P   (2)\n\nwhere H_Q(x) is the entropy of Q, E_Q is an expectation computed over the probability distribution Q, Z_P is the partition function of P, and w_i and f_i are the weights and features of P (see Equation 1). This quantity can be minimized by fixed-point iteration or by using a gradient-based numerical optimization method. What makes the reverse KL divergence more tractable to optimize is that the expectations are taken over Q instead of P. This minimization also yields bounds on the log partition function, or the probability of evidence in a BN. Specifically, because KL(Q ∥ P) is non-negative, log Z_P ≥ H_Q(x) + Σ_i w_i E_Q[f_i].\nThe most commonly applied variational method is mean field, in which Q is chosen from the set of fully factorized distributions. Generalized or structured mean field operates on a set of clusters (possibly overlapping), or a junction tree formed from a subset of the edges [6, 7, 8]. Selecting the best tractable substructure is a difficult problem. One approach is to greedily delete arcs until the junction tree is tractable [6]. Alternately, Xing et al. [7] use weighted graph cuts to select clusters for structured mean field.\n\n2.2 Arithmetic circuits\n\nThe probability distribution represented by a Bayesian network can be equivalently represented by a multilinear function known as the network polynomial [1]: P(X_1 = x_1, . . . , X_n = x_n) = Σ_x ∏_{i=1}^n I(X_i = x_i) P(X_i = x_i | Π_i = π_i), where the sum ranges over all possible instantiations of the variables, I() is the indicator function (1 if the argument is true, 0 otherwise), and the P(X_i | Π_i) are the parameters of the BN. 
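The definition above can be sketched by brute force for a toy two-variable network (invented CPTs); a real AC avoids this exponential enumeration:

```python
# Brute-force sketch of the network polynomial for a toy network A -> B
# (invented CPTs). Evidence is encoded by setting to 1 every indicator
# consistent with it and to 0 all others, then evaluating
# sum_x prod_i I(X_i = x_i) * P(x_i | parents).
from itertools import product

P_A = {1: 0.6, 0: 0.4}
P_B = {(1, 1): 0.9, (0, 1): 0.1, (1, 0): 0.2, (0, 0): 0.8}  # P(B=b|A=a), key (b, a)

def network_polynomial(indicator_a, indicator_b):
    total = 0.0
    for a, b in product((0, 1), repeat=2):
        total += indicator_a[a] * indicator_b[b] * P_A[a] * P_B[(b, a)]
    return total

# Evidence B = 1: zero out the indicator for B = 0, leave the rest at 1.
p_b1 = network_polynomial({0: 1, 1: 1}, {0: 0, 1: 1})
# equals P(A=0)P(B=1|A=0) + P(A=1)P(B=1|A=1) = 0.4*0.2 + 0.6*0.9 = 0.62
```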
The probability of any partial instantiation of the variables can now be computed simply by setting to 1 all the indicators consistent with the instantiation, and to 0 all others. This allows arbitrary marginal and conditional queries to be answered in time linear in the size of the polynomial. Furthermore, differentiating the network polynomial with respect to its weight parameters (w_i) yields the probabilities of the corresponding features (f_i).\nThe size of the network polynomial is exponential in the number of variables, but it can be more compactly represented using an arithmetic circuit (AC). An AC is a rooted, directed acyclic graph whose leaves are numeric constants or variables, and whose interior nodes are addition and multiplication operations. The value of the function for an input tuple is computed by setting the variable leaves to the corresponding values and computing the value of each node from the values of its children, starting at the leaves. In the case of the network polynomial, the leaves are the indicators and network parameters. The AC avoids the redundancy present in the network polynomial, and can be exponentially more compact.\nEvery junction tree has a corresponding AC, with an addition node for every instantiation of a separator, a multiplication node for every instantiation of a clique, and a summation node as the root. Thus, one way to compile a BN into an AC is via a junction tree. However, when the network contains context-specific independences, a much more compact circuit can be obtained. 
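Bottom-up AC evaluation can be sketched on a hand-built circuit for a toy two-variable network A → B; the circuit structure and all numbers are invented for illustration:

```python
# Sketch of bottom-up evaluation of a (hypothetical, hand-built) arithmetic
# circuit for a toy network A -> B. Leaves are indicators and parameters;
# interior nodes are '+' and '*'. Structure and numbers are invented.

def evaluate(node, indicators):
    kind = node[0]
    if kind == 'const':            # network parameter leaf
        return node[1]
    if kind == 'ind':              # indicator leaf, e.g. ('ind', 'A', 1)
        return indicators[(node[1], node[2])]
    children = [evaluate(c, indicators) for c in node[1]]
    if kind == '*':
        result = 1.0
        for v in children:
            result *= v
        return result
    return sum(children)           # '+'

def branch(a):
    # lambda_a * theta_a * sum_b lambda_b * theta_{b|a}
    theta_a = 0.6 if a else 0.4
    theta_b1 = 0.9 if a else 0.2
    return ('*', [('ind', 'A', a), ('const', theta_a),
                  ('+', [('*', [('ind', 'B', 1), ('const', theta_b1)]),
                         ('*', [('ind', 'B', 0), ('const', 1 - theta_b1)])])])

root = ('+', [branch(0), branch(1)])

all_ones = {('A', 0): 1, ('A', 1): 1, ('B', 0): 1, ('B', 1): 1}
evidence_b1 = {**all_ones, ('B', 0): 0}
# evaluate(root, all_ones) sums over all instantiations (close to 1.0);
# evaluate(root, evidence_b1) gives P(B=1) = 0.62 for these toy CPTs.
```

Sharing the inner sum over B across both branches is what lets a real AC perform repeated subcomputations only once.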
Darwiche [1] describes one way to do this, by encoding the network into a special logical form, factoring the logical form, and extracting the corresponding AC.\nOther exact inference methods include variable elimination with algebraic decision diagrams (which can also be done with ACs [9]), AND/OR graphs [10], bucket elimination [11], and more.\n\n3 Approximate Compilation of Arithmetic Circuits\n\nIn this section, we describe AC2 (Approximate Compilation of Arithmetic Circuits), an approach for constructing an AC to approximate a given BN. AC2 does this in two stages: structure search and parameter optimization. The structure search is done in advance, once per network, while the parameters may be selected at query time, conditioned on evidence. This amortizes the cost of the structure search over all future queries. The parameter optimization allows us to fine-tune the circuit to specific pieces of evidence. Just as in variational inference methods such as mean field, we optimize the parameters of a tractable distribution to best approximate an intractable one. Note that, if the BN could be compiled exactly, this step would be unnecessary, since the conditional distribution would always be optimal.\n\n3.1 Structure search\n\nWe considered two methods for generating circuit structures. The first is to prune the BN structure and then compile the simplified BN exactly. The second is to approximate the BN distribution with a set of samples and learn a circuit from this pseudo-empirical data.\n\n3.1.1 Pruning and compiling\n\nPruning and compiling a BN is somewhat analogous to edge deletion methods (e.g., [6]), except that instead of removing entire edges and building the full junction tree, we introduce context-specific independencies and build an arithmetic circuit that can exploit them. This finer-grained simplification offers the potential of much richer models or smaller circuits. 
However, it also poses more challenging search problems that must be solved heuristically.\nWe explored several techniques for greedily simplifying a network into a tractable AC by pruning splits from its decision-tree CPDs. Ideally, we would like to have bounds on the error of our simplified model, relative to the original. This can be accomplished by bounding the ratio of each log conditional probability distribution, so that the approximated log probability of every instance is within a constant factor of the truth, as done by the Multiplicative Approximation Scheme (MAS) [12]. However, we found that the bounds for our networks were very large, with ratios in the hundreds or thousands. This occurs because our networks have probabilities close to 0 and 1 (with logs close to negative infinity and zero), and because the bounds focus on the worst case.\nTherefore, we chose to focus instead on the average case by attempting to minimize the KL divergence between the original model and the simplified approximation: KL(P ∥ Q) = Σ_x P(x) log(P(x)/Q(x)), where P is the original network and Q is the simplified approximate network, in which each of P’s conditional probability distributions has been simplified. We choose to optimize the KL divergence here because the reverse KL is prone to fitting only a single mode, and we want to avoid excluding any significant parts of the distribution before seeing evidence. Since Q’s structure is a subset of P’s, we can decompose the KL divergence as follows:\n\nKL(P ∥ Q) = Σ_i Σ_{π_i} P(π_i) Σ_{x_i} P(x_i | π_i) log(P(x_i | π_i)/Q(x_i | π_i))   (3)\n\nwhere the middle summation is over all states of X_i’s parents, Π_i. In other words, the KL divergence can be computed by adding the expected divergence of each local factor, where the expectation is computed according to the global probability distribution. 
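The per-CPD decomposition in Equation 3 can be sketched as follows; the distributions are toy numbers, and in practice the parent distribution would itself have to be approximated, as discussed next:

```python
# Sketch of one term of Equation 3: the contribution of a single CPD to
# KL(P || Q), an expectation over the parent distribution of the local
# conditional KL. All numbers below are invented for illustration.
import math

def local_kl(parent_dist, p_cpd, q_cpd):
    # parent_dist: {parent_state: P(parent_state)}
    # p_cpd, q_cpd: {parent_state: {x: prob}}
    total = 0.0
    for pi, p_pi in parent_dist.items():
        for x, p in p_cpd[pi].items():
            if p > 0:
                total += p_pi * p * math.log(p / q_cpd[pi][x])
    return total

parent_dist = {0: 0.3, 1: 0.7}
p_cpd = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}
q_cpd = {0: {0: 0.55, 1: 0.45}, 1: {0: 0.55, 1: 0.45}}  # pruned: ignores the parent
cost_of_pruning = local_kl(parent_dist, p_cpd, q_cpd)   # > 0 nats
```

Summing such terms over all CPDs gives KL(P ∥ Q); a greedy procedure can then add back the splits whose restoration most decreases this quantity.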
For the case of BNs with tree CPDs (as described in Section 2.1), this means that knowing the distribution of the parent variables allows us to compute the change in KL divergence from pruning a tree CPD.\nUnfortunately, computing the distribution of each variable’s parents is intractable and must be approximated in some way. We tried two different methods for computing these distributions: estimating the joint parent probabilities from a large number of samples (one million in our experiments) (“P-Samp”), and forming the product of the parent marginals estimated using mean field (“P-MF”). Given a method for computing the parent marginals, we remove the splits that least increase the KL divergence. We implement this by starting from a fully pruned network and greedily adding the splits that most decrease KL divergence. After every 10 splits, we check the number of edges by compiling the candidate network to an AC using the C2D compiler.1 We stop when the number of edges exceeds our prespecified bound.\n\n3.1.2 Learning from samples\n\nThe second approach we tried is learning a circuit from a set of generated samples. The samples themselves are generated using forward sampling, in which each variable in the BN is sampled in topological order according to its conditional distribution given its parents. The circuit learning method we chose is the LearnAC algorithm by Lowd and Domingos [13], which greedily learns an AC representing a BN with decision tree CPDs by trading off log likelihood and circuit size. We made one modification to the LearnAC (LAC) algorithm in order to learn circuits with a fixed number of edges. Instead of using a fixed edge penalty, we start with an edge penalty of 100 and halve it every time we run out of candidate splits with non-negative scores. 
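The forward-sampling step used to generate this pseudo-empirical training data can be sketched on a toy two-variable network (invented CPTs):

```python
# Sketch of forward sampling: each variable is sampled in topological order
# from its conditional distribution given its already-sampled parents.
# The network A -> B and its CPTs are invented for illustration.
import random

def forward_sample(rng):
    a = 1 if rng.random() < 0.6 else 0     # P(A=1) = 0.6
    p_b1 = 0.9 if a else 0.2               # P(B=1 | A=a)
    b = 1 if rng.random() < p_b1 else 0
    return a, b

rng = random.Random(0)
samples = [forward_sample(rng) for _ in range(100000)]
est_b1 = sum(b for _, b in samples) / len(samples)
# est_b1 approximates P(B=1) = 0.6*0.9 + 0.4*0.2 = 0.62
```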
The effect of this modified procedure is to conservatively select splits that add few edges to the circuit at first, and to become increasingly liberal until the edge limit is reached. Tuning the initial edge penalty can lead to slightly better performance at the cost of additional training time. We also explored using the BN structure to guide the AC structure search (for example, by excluding splits that would violate the partial order of the original BN), but these restrictions offered no significant advantage in accuracy.\nMany modifications to this procedure are possible. Larger edge budgets or different heuristics could yield more accurate circuits. With additional engineering, the LearnAC algorithm could be adapted to dynamically request only as many samples as necessary to be confident in its choices. For example, Hulten and Domingos [14] have developed methods that scale learning algorithms to datasets of arbitrary size; the same approach could be used here, except in a “pull” setting where the data is generated on-demand. Spending a long time finding the most accurate circuit may be worthwhile, since the cost is amortized over all queries.\nWe are not the first to propose sampling as a method for converting intractable models into tractable ones. Wang et al. [15] used a similar procedure for learning a latent tree model to approximate a BN. They found that the learned models had faster or more accurate inference on a wide range of standard BNs (where exact inference is somewhat tractable).\n\n1 Available at http://reasoning.cs.ucla.edu/c2d/.\n\nIn a semi-supervised setting, Liang et al. 
[16] trained a conditional random field (CRF) from a small amount of labeled training data, used the CRF to label additional examples, and learned independent logistic regression models from this expanded dataset.\n\n3.2 Parameter optimization\n\nIn this section, we describe three methods for selecting AC parameters: forward sampling, variational optimization, and Gibbs sampling.\n\n3.2.1 Forward sampling\n\nIn AC2-F, we use forward sampling to generate a set of samples from the original BN (one million in our experiments) and maximum likelihood estimation to estimate the AC parameters from those samples. This can be done in closed form because, before conditioning on evidence, the AC structure also represents a BN. AC2-F selects these parameters once per domain, before conditioning on any evidence. This makes it very fast at query time.\nAC2-F can be viewed as approximately minimizing the KL divergence KL(P ∥ Q) between the BN distribution P and the AC distribution Q. For conditional queries P(Y | X = x_ev), we are more interested in the divergence of the conditional distributions, KL(P(·|x_ev) ∥ Q(·|x_ev)). The following theorem bounds the conditional KL divergence as a function of the unconditional KL divergence:\nTheorem 1. For discrete probability distributions P and Q, and evidence x_ev,\n\nKL(P(·|x_ev) ∥ Q(·|x_ev)) ≤ (1/P(x_ev)) · KL(P ∥ Q)\n\n(See the supplementary materials for the proof.) From this theorem, we expect AC2-F to work better when evidence is likely (i.e., P(x_ev) is not too small). For rare evidence, the conditional KL divergence could be much larger than the unconditional KL divergence.\n\n3.2.2 Variational optimization\n\nSince AC2-F selects parameters based on the unconditioned BN, it may do poorly when conditioning on rare evidence. An alternative is to choose AC parameters that (locally) minimize the reverse KL divergence to the BN conditioned on evidence. 
Let P and Q be log-linear models, i.e.:\n\nlog P(x) = −log Z_P + Σ_i w_i f_i(x)\nlog Q(x) = −log Z_Q + Σ_j v_j g_j(x)   (4)\n\nThe reverse KL divergence and its gradient can now be written as follows:\n\nKL(Q ∥ P) = Σ_j v_j E_Q(g_j) − Σ_i w_i E_Q(f_i) + log(Z_P/Z_Q)\n∂/∂v_j KL(Q ∥ P) = Σ_k v_k (E_Q(g_k g_j) − E_Q(g_k) E_Q(g_j)) − Σ_i w_i (E_Q(f_i g_j) − E_Q(f_i) E_Q(g_j))   (5)\n\nwhere E_Q(g_k g_j) is the expected value of g_k(x) × g_j(x) according to Q. In our application, P is the BN conditioned on evidence and Q is the AC. Since inference in Q (the AC) is tractable, the gradient can be computed exactly.\nWe can optimize this using any numerical optimization method, such as gradient descent. Due to local optima, the results may depend on the optimization procedure and its initialization. In experiments, we used the limited-memory BFGS algorithm (L-BFGS) [17], initialized with AC2-F.\nWe now discuss how to compute the gradient efficiently in a circuit with e edges. By setting leaf values and evaluating the circuit as described by Darwiche [1], we can compute the probability of any conjunctive feature Q(f_i) (or Q(g_k)) in O(e) operations. If we differentiate the circuit after conditioning on a feature f_i (or g_k), we can obtain the probabilities of the conjunctions Q(f_i g_j) (or Q(g_k g_j)) for all g_j in O(e) time. Therefore, if there are n features in P and m features in Q, then the total complexity of computing the derivative is O((n + m)e). Since there are typically fewer features in Q than in P, this simplifies to O(ne).\nThese methods are applicable to any tractable structure represented as an AC, including low-treewidth models, mixture models, latent tree models, etc. We refer to this method as AC2-V.\n\n3.2.3 Gibbs sampling\n\nWhile optimizing the reverse KL is a popular choice for approximate inference, there are certain risks. Even if KL(Q ∥ P) is small, Q may assign very small or zero probabilities to important modes of P. Furthermore, we are only guaranteed to find a local optimum, which may be much worse than the global optimum. The “regular” KL divergence does not suffer these disadvantages, but is impractical to compute, since it involves expectations according to P:\n\nKL(P ∥ Q) = Σ_i w_i E_P(f_i) − Σ_j v_j E_P(g_j) + log(Z_Q/Z_P)   (6)\n∂/∂v_j KL(P ∥ Q) = E_Q(g_j) − E_P(g_j)   (7)\n\nTherefore, minimizing KL(P ∥ Q) by gradient descent or L-BFGS requires computing the conditional probability of each AC feature according to the BN, E_P(g_j). Note that these only need to be computed once, since they are unaffected by the AC feature weights, v_j. We chose to approximate these expectations using Gibbs sampling, but an alternate inference method (e.g., importance sampling) could be substituted. The probabilities of the AC features according to the AC, E_Q(g_j), can be computed in parallel by differentiating the circuit, requiring time O(e).2 This is typically orders of magnitude faster than the variational approach described above, since each optimization step runs in O(e) instead of O(ne), where n is the number of BN features. We refer to this method as AC2-G.\n\n4 Experiments\n\nIn this section, we compare the proposed methods experimentally and demonstrate that approximate compilation is an accurate and efficient technique for inference in intractable networks.\n\n4.1 Datasets\n\nWe wanted to evaluate our methods on challenging, realistic networks where exact inference is intractable, even for the most sophisticated arithmetic-circuit-based techniques. This ruled out most traditional benchmarks, for which ACs can already perform exact inference [9]. We generated intractable networks by learning them from eight real-world datasets using the WinMine Toolkit [18]. The WinMine Toolkit learns BNs with tree-structured CPDs, leading to complex models with high treewidth. 
In theory, this additional structure can be exploited by existing arithmetic circuit techniques, but in practice, compilation techniques ran out of memory on all eight networks. See Davis and Domingos [19] and our supplementary material for more details on the datasets and the networks learned from them, respectively.\n\n4.2 Structure selection\n\nIn our first set of experiments, we compared the structure selection algorithms from Section 3.1 according to their ability to approximate the original models. Since computing the KL divergence directly is intractable, we approximated it using random samples x(i):\n\nKL(P ∥ Q) = Σ_x P(x) log(P(x)/Q(x)) = E_P[log(P(x)/Q(x))] ≈ (1/m) Σ_i log(P(x(i))/Q(x(i)))   (8)\n\nwhere m is the number of samples (10,000 in our experiments). These samples were distinct from the training data, and the same set of samples was used to evaluate each algorithm.\nFor LearnAC, we trained circuits with a limit of 100,000 edges. All circuits were learned using 100,000 samples, and then the parameters were set using AC2-F with 1 million samples.3 Training time ranged from 17 minutes (KDD Cup) to 8 hours (EachMovie). As an additional baseline, we also learned tree-structured BNs from the same 1 million samples using the Chow-Liu algorithm [20].\nResults are in Table 1. The learned arithmetic circuit (LAC) achieves the best performance on all datasets, often by a wide margin. We also observe that, of the pruning methods, samples (P-Samp) work better than mean field marginals (P-MF). Chow-Liu trees (C-L) typically perform somewhere between P-MF and P-Samp. For the rest of this paper, we focus on structures selected by LearnAC.\n\n2 To support optimization methods that perform line search (including L-BFGS), we can similarly approximate KL(P ∥ Q). 
log Z_Q can also be computed in O(e) time.\n3 With 1 million samples, we ran into memory limitations that a more careful implementation might avoid.\n\nTable 1: KL divergence of different structure selection algorithms.\n\nDataset    P-MF   P-Samp  C-L    LAC\nKDD Cup    2.44   0.10    0.23   0.07\nPlants     8.41   2.29    4.48   1.27\nAudio      4.99   3.31    4.47   2.12\nJester     5.14   3.55    5.08   2.82\nNetflix    3.83   3.06    4.14   2.24\nMSWeb      1.78   0.52    0.70   0.38\nBook       4.90   2.43    2.84   1.89\nEachMovie  29.66  17.61   17.11  11.12\n\nTable 2: Mean time for answering a single conditional query, in seconds.\n\nDataset    AC2-F  AC2-V  AC2-G  BP     MF     Gibbs\nKDD Cup    0.022  3803   11.2   0.050  0.025  2.5\nPlants     0.022  2741   11.2   0.081  0.073  2.8\nAudio      0.023  4184   14.4   0.063  0.048  3.4\nJester     0.019  3448   13.8   0.054  0.057  3.3\nNetflix    0.021  3050   12.3   0.057  0.053  3.3\nMSWeb      0.022  2831   12.2   0.277  0.046  4.3\nBook       0.020  5190   16.1   0.864  0.059  6.6\nEachMovie  0.022  10204  28.6   1.441  0.342  11.0\n\nFigure 1: Average conditional log likelihood of the query variables (y axis), divided by the number of query variables (x axis). Higher is better. Gibbs often performs too badly to appear in the frame.\n\n4.3 Conditional probabilities\n\nUsing structures selected by LearnAC, we compared the accuracy of AC2-F, AC2-V, and AC2-G to mean field (MF), loopy belief propagation (BP), and Gibbs sampling (Gibbs) on conditional probability queries. We ran MF and BP to convergence. For Gibbs sampling, we ran 10 chains, each with 1000 burn-in iterations and 10,000 sampling iterations. All methods exploited CPD structure whenever possible (e.g., in the computation of BP messages). All code will be publicly released.\nSince most of these queries are intractable to compute exactly, we cannot determine the true probabilities directly. 
Instead, we generated 100 random samples from each network, selected a random subset of the variables to use as evidence (10%-50% of the total variables), and measured the log conditional probability of the non-evidence variables according to each inference method. Different queries used different evidence variables. This approximates the KL divergence between the true and inferred conditional distributions up to a constant. We reduced the variance of this approximation by selecting additional queries for each evidence configuration. Specifically, we generated 100,000 samples and kept the ones compatible with the evidence, up to 10,000 per configuration. For some evidence, none of the 100,000 samples were compatible, leaving just the original query.\nFull results are in Figure 1. Table 2 contains the average inference time for each method.\nOverall, AC2-F does very well against BP and even better against MF and Gibbs, especially with lesser amounts of evidence. Its somewhat worse performance at greater amounts of evidence is consistent with Theorem 1. AC2-F is also the fastest of the inference methods, making it a very good choice for speedy inference with small to moderate amounts of evidence.\nAC2-V obtains higher accuracy than AC2-F at higher levels of evidence, but is often less accurate at lesser amounts of evidence. 
This can be attributed to different optimization and evaluation metrics: reducing KL(Q ∥ P) may sometimes lead to increased KL(P ∥ Q). On EachMovie, AC2-V does particularly poorly, getting stuck in a worse local optimum than the much simpler MF. AC2-V is also the slowest method, by far.\nAC2-G is the most accurate method overall. It dominates BP, MF, and Gibbs on all datasets. With the same number of samples, AC2-G takes 2-4 times longer than Gibbs. This additional running time is partly due to the parameter optimization step and partly due to the fact that AC2-G is computing many expectations in parallel, and therefore has more bookkeeping per sample. If we increase the number of samples in Gibbs by a factor of 10 (not shown), then Gibbs wins on KDD at 40 and 50% and Plants at 50% evidence, but is also significantly slower than AC2-G. Compared to the other AC methods, AC2-G wins everywhere except for KDD at 10-40% evidence and Netflix at 10% evidence. If we increase the number of samples in AC2-G by a factor of 10 (not shown), then it beats AC2-F and AC2-V on every dataset. 
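For reference, the Gibbs baseline's estimate of a conditional marginal can be sketched on a toy network (invented CPTs; with a single non-evidence variable the chain mixes immediately, unlike the high-dimensional benchmarks above):

```python
# Sketch of Gibbs sampling for a conditional marginal: repeatedly resample
# each non-evidence variable given its Markov blanket. Toy network A -> B
# with evidence B = 1; the exact answer is P(A=1|B=1) = 0.54/0.62.
import random

P_A1 = 0.6
def p_b1_given(a):
    return 0.9 if a else 0.2

def gibbs_p_a1_given_b1(iters, burn_in, rng):
    count = 0
    for t in range(iters):
        # P(A=1 | B=1) is proportional to P(A=1) * P(B=1 | A=1)
        w1 = P_A1 * p_b1_given(1)
        w0 = (1 - P_A1) * p_b1_given(0)
        a = 1 if rng.random() < w1 / (w0 + w1) else 0
        if t >= burn_in:
            count += a
    return count / (iters - burn_in)

est = gibbs_p_a1_given_b1(20000, 1000, random.Random(0))
# est approximates 0.54 / 0.62
```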
The running time of AC2-G is split approximately evenly between computing sufficient statistics and optimizing parameters with L-BFGS.\nGibbs sampling did poorly in almost all of the scenarios, which can be attributed to the fact that it is unable to accurately estimate the probabilities of very infrequent events. Most conjunctions of dozens or hundreds of variables are very improbable, even if conditioned on a large amount of evidence. If a certain configuration is never seen, then its probability is estimated to be very low (non-zero due to smoothing). MF and BP did not have this problem, since they represent the conditional distribution as a product of marginals, each of which can be estimated reasonably well. In follow-up experiments, we found that using Gibbs sampling to compute the marginals yielded slightly better accuracy than BP, but was much slower. AC2-G can be seen as a generalization of using Gibbs sampling to compute marginals, just as AC2-V generalizes MF.\n\n5 Conclusion\n\nArithmetic circuits are an attractive alternative to junction trees due to their ability to exploit determinism and context-specific independence. However, even with ACs, exact inference remains intractable for many networks of interest. In this paper, we introduced the first approximate compilation methods, allowing us to apply ACs to any BN. Our most efficient method, AC2-F, is faster than traditional approximate inference methods and more accurate most of the time. Our most accurate method, AC2-G, is more accurate than the baselines on every domain.\nOne of the key lessons is that combining sampling and learning is a good strategy for accurate approximate inference. Sampling generates a coarse approximation of the desired distribution, which is subsequently smoothed by learning. For structure selection, an AC learning method applied to samples was more effective than exact compilation of a simplified network. 
For parameter selection, maximum likelihood estimation applied to Gibbs samples was both faster and more effective than variational inference in ACs.
For future work, we hope to extend our methods to Markov networks, in which generating samples is a difficult inference problem in itself. Similar methods could be used to select AC structures tuned to particular queries, since a BN conditioned on evidence can be represented as a Markov network. This could lead to more accurate results, especially in cases with a lot of evidence, but the cost would no longer be amortized over all future queries. Comparisons with more sophisticated baselines are another important item for future work.

Acknowledgements

The authors wish to thank Christopher Meek and Jesse Davis for helpful comments. This research was partly funded by ARO grant W911NF-08-1-0242, AFRL contract FA8750-09-C-0181, DARPA contracts FA8750-05-2-0283, FA8750-07-D-0185, HR0011-06-C-0025, HR0011-07-C-0060 and NBCH-D030010, NSF grants IIS-0534881 and IIS-0803481, and ONR grant N00014-08-1-0670. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ARO, DARPA, NSF, ONR, or the United States Government.

References

[1] A. Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM, 50(3):280-305, 2003.

[2] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco, CA, 1988.

[3] C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific independence in Bayesian networks. In Proc. of the 12th Conference on Uncertainty in Artificial Intelligence, pages 115-123, Portland, OR, 1996. Morgan Kaufmann.

[4] N. Friedman and M. Goldszmidt. Learning Bayesian networks with local structure. In Proc.
of the 12th Conference on Uncertainty in Artificial Intelligence, pages 252-262, Portland, OR, 1996. Morgan Kaufmann.

[5] D. Chickering, D. Heckerman, and C. Meek. A Bayesian approach to learning Bayesian networks with local structure. In Proc. of the 13th Conference on Uncertainty in Artificial Intelligence, pages 80-89, Providence, RI, 1997. Morgan Kaufmann.

[6] A. Choi and A. Darwiche. A variational approach for approximating Bayesian networks by edge deletion. In Proc. of the 22nd Conference on Uncertainty in Artificial Intelligence, Arlington, VA, 2006. AUAI Press.

[7] E. P. Xing, M. I. Jordan, and S. Russell. Graph partition strategies for generalized mean field inference. In Proc. of the 20th Conference on Uncertainty in Artificial Intelligence, pages 602-610, Banff, Canada, 2004.

[8] D. Geiger, C. Meek, and Y. Wexler. A variational inference procedure allowing internal structure for overlapping clusters and deterministic constraints. Journal of Artificial Intelligence Research, 27:1-23, 2006.

[9] M. Chavira and A. Darwiche. Compiling Bayesian networks using variable elimination. In Proc. of the 20th International Joint Conference on Artificial Intelligence (IJCAI), pages 2443-2449, 2007.

[10] R. Dechter and R. Mateescu. AND/OR search spaces for graphical models. Artificial Intelligence, 171:73-106, 2007.

[11] R. Dechter. Bucket elimination: a unifying framework for reasoning. Artificial Intelligence, 113:41-85, 1999.

[12] Y. Wexler and C. Meek. MAS: a multiplicative approximation scheme for probabilistic inference. In Advances in Neural Information Processing Systems 22, Cambridge, MA, 2008. MIT Press.

[13] D. Lowd and P. Domingos. Learning arithmetic circuits. In Proc. of the 24th Conference on Uncertainty in Artificial Intelligence, Helsinki, Finland, 2008. AUAI Press.

[14] G. Hulten and P. Domingos.
Mining complex models from arbitrarily large databases in constant time. In Proc. of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 525-531, Edmonton, Canada, 2002. ACM Press.

[15] Y. Wang, N. L. Zhang, and T. Chen. Latent tree models and approximate inference in Bayesian networks. Journal of Artificial Intelligence Research, 32:879-900, 2008.

[16] P. Liang, H. Daumé III, and D. Klein. Structure compilation: trading structure for features. In Proc. of the 25th International Conference on Machine Learning, pages 592-599, Helsinki, Finland, 2008. ACM.

[17] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(3):503-528, 1989.

[18] D. M. Chickering. The WinMine toolkit. Technical Report MSR-TR-2002-103, Microsoft, Redmond, WA, 2002.

[19] J. Davis and P. Domingos. Bottom-up learning of Markov network structure. In Proc. of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. ACM Press.

[20] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14:462-467, 1968.