{"title": "Approximate Knowledge Compilation by Online Collapsed Importance Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 8024, "page_last": 8034, "abstract": "We introduce collapsed compilation, a novel approximate inference algorithm for discrete probabilistic graphical models. It is a collapsed sampling algorithm that incrementally selects which variable to sample next based on the partial compilation obtained so far. This online collapsing, together with knowledge compilation inference on the remaining variables, naturally exploits local structure and context-specific independence in the distribution. These properties are used implicitly in exact inference, but are difficult to harness for approximate inference. Moreover, by having a partially compiled circuit available during sampling, collapsed compilation has access to a highly effective proposal distribution for importance sampling. Our experimental evaluation shows that collapsed compilation performs well on standard benchmarks. In particular, when the amount of exact inference is equally limited, collapsed compilation is competitive with the state of the art, and outperforms it on several benchmarks.", "full_text": "Approximate Knowledge Compilation by\nOnline Collapsed Importance Sampling\n\nTal Friedman\n\nComputer Science Department\n\nUniversity of California\nLos Angeles, CA 90095\n\ntal@cs.ucla.edu\n\nGuy Van den Broeck\n\nComputer Science Department\n\nUniversity of California\nLos Angeles, CA 90095\nguyvdb@cs.ucla.edu\n\nAbstract\n\nWe introduce collapsed compilation, a novel approximate inference algorithm for\ndiscrete probabilistic graphical models. It is a collapsed sampling algorithm that\nincrementally selects which variable to sample next based on the partial compilation\nobtained so far. 
This online collapsing, together with knowledge compilation\ninference on the remaining variables, naturally exploits local structure and context-speci\ufb01c\nindependence in the distribution. These properties are used implicitly\nin exact inference, but are dif\ufb01cult to harness for approximate inference. Moreover,\nby having a partially compiled circuit available during sampling, collapsed\ncompilation has access to a highly effective proposal distribution for importance\nsampling. Our experimental evaluation shows that collapsed compilation performs\nwell on standard benchmarks. In particular, when the amount of exact inference is\nequally limited, collapsed compilation is competitive with the state of the art, and\noutperforms it on several benchmarks.\n\n1 Introduction\n\nModern probabilistic inference algorithms for discrete graphical models are designed to exploit\nkey properties of the distribution. In addition to classical conditional independence, they exploit\nlocal structure in the individual factors, determinism coming from logical constraints (Darwiche,\n2009), and the context-speci\ufb01c independencies that arise in such distributions (Boutilier et al., 1996).\nThe knowledge compilation approach in particular forms the basis for state-of-the-art probabilistic\ninference algorithms in a wide range of models, including Bayesian networks (Chavira & Darwiche,\n2008), factor graphs (Choi et al., 2013), statistical relational models (Chavira et al., 2006; Van den\nBroeck, 2013), probabilistic programs (Fierens et al., 2015), probabilistic databases (Van den Broeck\n& Suciu, 2017), and dynamic Bayesian networks (Vlasselaer et al., 2016). Based on logical reasoning\ntechniques, knowledge compilation algorithms construct an arithmetic circuit representation of\nthe distribution on which inference is guaranteed to be ef\ufb01cient (Darwiche, 2003). 
The inference\nalgorithms listed above have one common limitation: they perform exact inference by compiling a\nworst-case exponentially-sized arithmetic circuit representation. Our goal in this paper is to upgrade\nthese techniques to allow for approximate probabilistic inference, while still naturally exploiting\nthe structure in the distribution. We aim to open up a new direction towards scaling up knowledge\ncompilation to larger distributions.\nWhen knowledge compilation produces circuits that are too large, a natural solution is to sample some\nrandom variables and do exact compilation on the smaller distribution over the remaining variables.\nThis collapsed sampling approach suffers from two problems. First, collapsed sampling assumes that\none can determine a priori which variables need to be sampled to make the distribution amenable to\nexact inference. When dealing with large amounts of context-speci\ufb01c independence, it is dif\ufb01cult to\n\ufb01nd such a set, because the independencies are a function of the particular values that variables get\ninstantiated to. Second, collapsed sampling assumes that one has access to a proposal distribution\nthat determines how to sample each variable, and the success of inference largely depends on the\nquality of this proposal. In practice, the user often needs to specify the proposal distribution manually,\nand it is dif\ufb01cult to automatically construct one that is general purpose.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nAs our \ufb01rst contribution, Section 2 introduces online collapsed importance sampling, where the\nsampler chooses which variable to sample next based on the values sampled for previous variables.\nThis algorithm is a solution to the \ufb01rst problem identi\ufb01ed above: based on the context of each\nindividual sample, it allows the sampler to determine which subset of the variables is amenable to\nexact inference. 
We show that the sampler corresponds to a classical collapsed importance sampler\non an augmented graphical model and prove conditions for it to be asymptotically unbiased.\nSection 3 describes our second contribution: a collapsed compilation algorithm that maintains a\npartially-compiled arithmetic circuit during online collapsed importance sampling. This circuit\nprovides a solution to the second problem identi\ufb01ed above: it serves as a highly-effective proposal\ndistribution at each step of the algorithm. Moreover, by setting a limit on the circuit size as we\ncompile more factors into the model, we are able to sample exactly as many variables as needed to\n\ufb01t the arithmetic circuit into memory. This allows us to maximize the amount of exact inference\nperformed by the algorithm. Crucially, through online collapsing, the set of collapsed variables\nchanges with every sample, exploiting different independencies in each sample\u2019s arithmetic circuit.\nWe provide an open-source Scala implementation of this collapsed compilation algorithm.1\nFinally, we experimentally validate the performance of collapsed compilation on standard benchmarks.\nWe begin by empirically examining properties of collapsed compilation, to show the value of the\nproposal distribution and pick apart where performance improvements are coming from. Then, in a\nsetting where the amount of exact inference is \ufb01xed, we \ufb01nd that collapsed compilation is competitive\nwith state-of-the-art approximate inference algorithms, outperforming them on several benchmarks.\n\n2 Online Collapsed Importance Sampling\n\nWe begin with a brief review of collapsed importance sampling, before motivating the need for\ndynamically selecting which variables to sample. 
We then demonstrate that we can select variables\nin an online fashion while maintaining the desired unbiasedness property of the sampler, using an\nalgorithm we call online collapsed importance sampling.\nWe denote random variables with uppercase letters (X), and their instantiation with lowercase letters\n(x). Bold letters denote sets of variables (X) and their instantiations (x). We refer to Koller &\nFriedman (2009) for notation and formulae related to (collapsed) importance sampling.\n\n2.1 Collapsed Importance Sampling\nThe basic principle behind collapsed sampling is that we can reduce the variance of an estimator by\nmaking part of the inference exact. That is, suppose we partition our variables into two sets: Xp,\nand Xd. In collapsed importance sampling, the distribution of variables in Xp will be estimated via\nimportance sampling, while those in Xd will be estimated by computing exactly P(Xd|xp) for each\nsample xp. In particular, suppose we have some function f(x) where x is a complete instantiation of\nXp \u222a Xd, and a proposal distribution Q over Xp. Then we estimate the expectation of f by\n\n\u02c6E(f) = ( \u03a3_{m=1}^{M} w[m] (E_{P(Xd|xp[m])}[f(xp[m], Xd)]) ) / ( \u03a3_{m=1}^{M} w[m] )    (1)\n\non samples {xp[m]}_{m=1}^{M} drawn from a proposal distribution Q. For each sample, we analytically\ncompute the importance weights w[m] = \u02c6P(xp[m]) / Q(xp[m]), and the exact expectation of f conditioned on\nthe sample, that is, E_{P(Xd|xp[m])}[f(xp[m], Xd)]. Due to the properties of importance samplers, the\nestimator given by (1) is asymptotically unbiased. Moreover, if we compute P(xp[m]) exactly rather\nthan the unnormalized \u02c6P(xp[m]), then the estimator is unbiased (Tokdar & Kass, 2010).\n\n1 The code is available at https://github.com/UCLA-StarAI/Collapsed-Compilation. It uses the\nSDD library for knowledge compilation (Darwiche, 2011) and the Scala interface by Bekker et al. 
(2015).\n\n[Figure 1 panels: (a) Sampled Friendships (F12 = 1, F13 = 0, F23 = 1); (b) Induced Dependencies over V1, V2, V3; (c) Induced Network G1 over V1, ..., V10; (d) Induced Network G2 over V1, ..., V10]\n\nFigure 1: Different samples for F can have a large effect on the resulting dependencies between V.\n\n2.2 Motivation\n\nA critical decision that needs to be made when doing collapsed sampling is selecting a partition \u2013\nwhich variables go in Xp and which go in Xd. The choice of partition can have a large effect on\nthe quality of the resulting estimator, and the process of choosing such a partition requires expert\nknowledge. Furthermore, selecting a partition a priori that works well is not always possible, as we\nwill show in the following example. All of this raises the question whether it is possible to choose the\npartition on the \ufb02y for each sample, which we will discuss in Section 2.3.\nSuppose we have a group of n people, denoted 1, ..., n. For every pair of people (i, j), i < j, there is\na binary variable Fij indicating whether i and j are friends. Additionally, we have features Vi for\neach person i, and Fij = 1 (that is i, j are friends) implies that Vi and Vj are correlated. Suppose\nwe are performing collapsed sampling on the joint distribution over F and V, and that we have\nalready decided to place all friendship indicators Fij in Xp to be sampled. Next, we need to decide\nwhich variables in V to include in Xp for the remaining inference problem over Xd to become\ntractable. Observe that given a sampled F, due to the independence properties of V relying on F, a\ngraphical model G is induced over V (see Figures 1a,1b). Moreover, this graphical model can vary\ngreatly between different samples of F. For example, G1 in Figure 1c densely connects {V1, ..., V6}\nmaking it dif\ufb01cult to perform exact inference. 
Thus, we will need to sample some variables from this\nset. However, exact inference over {V7, ..., V10} is easy. Conversely, G2 in Figure 1d depicts the\nopposite scenario: {V1, ..., V5} forms a tree, which is easy for inference, whereas {V6, ..., V10} is\nnow intractable. It is clearly impossible to choose a small subset of V to sample that \ufb01ts all cases,\nthus demonstrating a need for an online variable selection during collapsed sampling.\n\n2.3 Algorithm\n\nWe now introduce our online collapsed importance sampling algorithm. It decides at sampling time\nwhich variables to sample and which to do exact inference on.\nTo gain an intuition, suppose we are in the standard collapsed importance sampling setting. Rather\nthan sampling an instantiation xp jointly from Q, we can instead \ufb01rst sample xp1 \u223c Q(Xp1), then\nxp2 \u223c Q(Xp2|xp1), and so on using the chain rule of probability. In online collapsed importance\nsampling, rather than deciding Xp1, Xp2, Xp3, . . . a priori, we select which variable will be Xp2 based\non the previous sampled value xp1, we select which will be Xp3 based on xp1 and xp2, and so on.\n\nDe\ufb01nition 1. Let y be an instantiation of Y \u2282 X.\nA variable selection policy \u03c0 takes y and either stops\nsampling or returns a distribution over which variable\nin X \\ Y should be sampled next.\nFor example, a naive policy could be to select a remaining\nvariable uniformly at random. 
Once the policy \u03c0\nstops sampling, we are left with an instantiation xp\nand a set of remaining variables Xd, where both are\nspeci\ufb01c to the choices made for that particular sample.\nAlgorithm 1 shows more precisely how online collapsed\nimportance sampling generates a single sample,\ngiven a full set of variables X, a variable selection\npolicy \u03c0, and proposal distributions QXi|xp for any\nchoice of Xi and xp. This sample consists of a set of variables X^m_d to do exact inference for, an\ninstantiation of the sampled variables x^m_p, and the corresponding importance weights w[m], all\nindexed by the sample number m. Note that xp is a set of variables together with their instantiations,\nwhile Xd is just a set of variables. The global joint proposal Q(xp), denoting the probability that\nAlgorithm 1 returns xp, is left abstract for now (see Section 2.4.2 for a concrete instance). In general,\nit is induced by variable selection policy \u03c0 and the individual local proposals QXi|xp.\n\nAlgorithm 1: Online Collapsed IS\nInput: X: The set of all variables,\n       \u03c0: Variable selection policy,\n       QXi|xp: Proposal distributions\nResult: A sample (X^m_d, x^m_p, w[m])\nxp \u2190 {} ; Xd \u2190 X\nwhile \u03c0 does not stop do\n    Xi \u223c \u03c0(xp)\n    xi \u223c QXi|xp(Xi|xp)\n    xp \u2190 xp \u222a {xi}\n    Xd \u2190 Xd \\ {Xi}\nreturn (Xd, xp, \u02c6P(xp) / Q(xp))\n\nDe\ufb01nition 2. Given M samples {(X^m_d, x^m_p, w[m])}_{m=1}^{M} produced by online collapsed importance\nsampling, the online collapsed importance sampling estimator of f is\n\n\u02c6E(f) = ( \u03a3_{m=1}^{M} w[m] (E_{P(X^m_d|x^m_p)}[f(x^m_p, X^m_d)]) ) / ( \u03a3_{m=1}^{M} w[m] ).    (2)\n\nNote that the only difference compared to Equation 1 is that sets X^m_p and X^m_d vary with each sample.\n\n2.4 Analysis\nOur algorithm for online collapsed importance sampling raises two questions: does Equation 2 yield\nunbiased estimates, and how does one compute the proposal Q(xp)? 
We study both questions next.\n\n2.4.1 Unbiasedness of Estimator\nIf we let \u03c0 be a policy that always returns the same variables in the same order, then we recover\nclassical of\ufb02ine collapsed importance sampling \u2013 and thus retain all of its properties. In order to make\na similar statement for any arbitrary policy \u03c0, we will use the augmented factor graph construction\npresented in Figure 2. Our goal is to reduce online collapsed importance sampling on F to a problem\nof doing of\ufb02ine collapsed importance sampling on FA.\n\n[Figure 2 panels: (a) Original factor graph F, with factor f over X1 and X2:n; (b) Augmented factor graph FA, which adds variables S1 and X*1 and factor fau]\n\nFigure 2: Online collapsed sampling corresponds to collapsed sampling on an augmented graph\n\nIntuitively, we add variable X*i to the factor graph, representing a copy variable of Xi. We design\nour of\ufb02ine collapsed sampler on augmented graph FA such that we are always sampling X*i and\ncomputing Xi exactly. To make this possible without actually inferring the entire distribution exactly,\nwe add variable Si to the model (also always to be sampled). Each Si acts as an indicator for whether\nX*i and Xi are constrained to be equal. Si can also be thought of as indicating whether or not we are\nsampling Xi in our original factor graph F when doing online collapsed importance sampling. These\ndependencies are captured in the new factor fau. We are now ready to state the following results.\nTheorem 1. For any factor graph F and its augmented graph FA, we have \u2200x, PF(x) = PFA(x).\nTheorem 2. Let F be a factor graph and let FA be its augmented factor graph. The collapsed\nimportance sampling estimator (Eq. 1) with Xp = X* \u222a S and Xd = X on FA is equivalent to the\nonline collapsed importance sampling estimator (Eq. 2) on F.\nCorollary 1. The estimator given by Eq. 
2 is asymptotically unbiased.\n\nProofs and the details of this construction can be found in Appendix A.\n\n[Figure 3 panels: factor graph A \u2013 f1 \u2013 B \u2013 f2 \u2013 C; factor tables\nf1(A, B): (0, 0) = 2, (0, 1) = 2, (1, 0) = 2, (1, 1) = 5\nf2(B, C): (0, 0) = 3, (0, 1) = 8, (1, 0) = 8, (1, 1) = 8;\narithmetic circuits for f1(A, B), f2(B, C), and f1(A, B) \u00b7 f2(B, C)]\n\nFigure 3: Multiplying Arithmetic Circuits: Factor graph and ACs for individual factors which multiply\ninto a single AC for the joint distribution. Given an AC, inference is tractable by propagating inputs.\n\n2.4.2 Computing the Proposal Distribution\nOur next question is how to compute the global joint proposal distribution Q(xp), given that we\nhave variable selection policy \u03c0 and each local proposal distribution QXi|xp. Notice that since these\nQXi|xp are unconstrained and unrelated distributions, the computation is not easy in general. In\nparticular, considering |Xp| = n and our previous example of a uniformly random policy \u03c0, then for\nany given instantiation xp, there are n! different ways xp could be sampled by Algorithm 1 \u2013 one for\neach ordering that arrives at xp. In this case, computing Q(xp) requires summing over exponentially\nmany terms, which is undesirable. Instead, we restrict the variable selection policies we use to the\nfollowing class.\nDe\ufb01nition 3. A deterministic variable selection policy \u03c0(xp) is a function with a range of X \\ Xp.\nTheorem 3. For any sample xp and deterministic variable selection policy \u03c0(xp), there is exactly\none order Xp1, Xp2, . . . , Xp|Xp| in which the variables Xp could have been sampled. 
Therefore, the\njoint proposal distribution is given by Q(xp) = \u220f_{i=1}^{|Xp|} QXpi|xp1:i\u22121(xpi|xp1:i\u22121).\n\nHence, computing the joint proposal Q(xp) becomes easy given a deterministic selection policy \u03c0.\n\n3 Collapsed Compilation\n\nOnline collapsed importance sampling presents us with a powerful technique for adapting to problems\ntraditional collapsed importance sampling may struggle with. However, it also demands we solve\nseveral dif\ufb01cult tasks: one needs a good proposal distribution over any subset of variables, an ef\ufb01cient\nway of exactly computing an expectation given a sample, and an ef\ufb01cient way of \ufb01nding the true\nprobability of sampled variables. In this section, we introduce collapsed compilation, which tackles\nall three of these problems at once using techniques from knowledge compilation.\n\n3.1 Knowledge Compilation Background\n\nWe begin with a short review of how to perform exact inference on a probabilistic graphical model\nusing knowledge compilation to arithmetic circuits (ACs).\nSuppose we have a factor graph (Koller & Friedman, 2009) consisting of three binary variables A, B\nand C, and factors f1, f2 as depicted in Figure 3. Each of these factors, as well as their product can be\nrepresented as an arithmetic circuit. These circuits have inputs corresponding to variable assignments\n(e.g., A and \u00acA) or constants (e.g., 5). Internal nodes are sums or products. We can encode a\ncomplete instantiation of the random variables by setting the corresponding variable assignments to 1\nand the opposing assignments to 0. Then, the root of the circuit for a factor evaluates to the value of\nthe factor for that instantiation. However, ACs can also represent products of factors. In that case, the\nAC\u2019s root evaluates to a weight that is the product of factor values. 
Under factor graph semantics, this\nweight represents the unnormalized probability of a possible world.\nThe use of ACs for probabilistic inference stems from two important properties. Product nodes are\ndecomposable, meaning that their inputs are disjoint, having no variable inputs in common. Sum\nnodes are deterministic, meaning that for any given complete input assignment to the circuit, at most\none of the sum\u2019s inputs evaluates to a non-zero value. Because of decomposability, we are able\nto perform marginal inference on ACs: by setting both assignments for the same variable to 1, we\neffectively marginalize out that variable. For example, by setting all inputs to 1, the arithmetic circuit\nevaluates to the sum of weights of all worlds, which is the partition function of the graphical model.\nWe refer to Darwiche (2009) for further details on how to reason with arithmetic circuits.\nIn practice, arithmetic circuits are often compiled from graphical models by encoding graphical\nmodel inference into a logical task called weighted model counting, followed by using Boolean\ncircuit compilation techniques on the weighted model counting problem. We refer to Choi et al.\n(2013) and Chavira & Darwiche (2008) for details. As our Boolean circuit compilation target, we\nwill use the sentential decision diagram (SDD) (Darwiche, 2011). Given any two SDDs representing\nfactors f1, f2, we can ef\ufb01ciently compute the SDD representing the factor multiplication of f1 and f2,\nas well as the result of conditioning the factor graph on any instantiation x. We call such operations\nAPPLY, and they are the key to using knowledge compilation for doing online collapsed importance\nsampling. An example of multiplying two arithmetic circuits is depicted in Figure 3.\nAs a result of SDDs supporting the APPLY operations, we can directly compile graphical models to\ncircuits in a bottom-up manner. 
Concretely, we start out by compiling each factor into a corresponding\nSDD representation using the encoding of Choi et al. (2013). Next, these SDDs are multiplied in order\nto obtain a representation for the entire model. As shown by Choi et al. (2013), this straightforward\napproach can be used to achieve state-of-the-art exact inference on probabilistic graphical models.\n\n3.2 Algorithm\n\nNow that we have proposed online collapsed importance sampling and given background on knowl-\nedge compilation, we are ready to introduce collapsed compilation, an algorithm that uses knowledge\ncompilation to do online collapsed importance sampling.\nCollapsed compilation begins by multiplying factors represented as SDDs. When the resulting SDD\nbecomes too large, we invoke online collapsed importance sampling to instantiate one of the variables.\nOn the arithmetic circuit representation, sampling a variable replaces one input by 1 and the other\nby 0. This conditioning operation allows us to simplify the SDD until it is suf\ufb01ciently small again.\nAt the end, the sampled variables form xp, and the variables remaining in the SDD form Xd.\nConcretely, collapsed compilation repeatedly performs a few simple steps, following Algorithm 1:\n\n1. Choose an order, and begin multiplying compiled factors into the current SDD until the size\n\nlimit is reached.\n\n2. Select a variable X using the given policy \u21e1.\n\n3. Sample X according to its marginal probability in the current SDD, corresponding to the\n\npartially compiled factor graph conditioned on prior instantiations.\n\n4. Condition the SDD on the sampled value for X .\n\nWe are taking advantage of knowledge compilation in a few subtle ways. First, to obtain the\nimportance weights, we compute the partition function on the \ufb01nal resulting circuit, which corresponds\nto the unnormalized probability of all sampled variables, that is, \u02c6P (xp) in Algorithm 1. 
Second,\nStep 3 presents a non-trivial and effective proposal distribution, which due to the properties of SDDs is\nef\ufb01cient to compute in the size of the circuit. Third, all APPLY operations on SDDs can be performed\ntractably (Van den Broeck & Darwiche, 2015), which allows us to multiply factors and condition\nSDDs on sampled instantiations.\nThe full technical description and implementation details can be found in Appendix B and C.\n\n4 Experimental Evaluation\n\nData & Evaluation Criteria To empirically investigate collapsed compilation, we evaluate the\nperformance of estimating a single marginal on a series of commonly used graphical models. Each\nmodel is followed in parentheses by its number of random variable nodes and factors.\nFrom the 2014 UAI inference competition, we evaluate on linkage(1077,1077), Grids(100,300),\nDBN(40, 440), and Segmentation(228,845) problem instances. From the 2008 UAI inference\ncompetition, we use two semi-deterministic grid instances, 50-20(400, 400) and 75-26(676, 676).\nHere the \ufb01rst number indicates the percentage of factor entries that are deterministic, and the second\nindicates the size of the grid. Finally, we generated a randomized frustrated Ising model on a\n16x16 grid, frust16(256, 480). Beyond these seven benchmarks, we experimented on ten additional\nstandard benchmarks. Because those were either too easy (showing no difference between collapsed\ncompilation and the baselines), or similar to other benchmarks, we do not report on them here.\nFor evaluation, we run all sampling-based methods 5 times for 1 hour each. We report the median\nHellinger distance across all runs, which for discrete distributions P and Q is given by\n\nH(P, Q) = (1/\u221a2) \u221a( \u03a3_{i=1}^{k} (\u221a(p_i) \u2212 \u221a(q_i))^2 ).\n\nCompilation Order Once we have compiled an SDD for each factor in the graphical model,\ncollapsed compilation requires us to choose in which order to multiply these SDDs. We look at two\norders: BFS and revBFS. 
The \ufb01rst begins from the marginal query variable, and compiles outwards\nin a breadth-\ufb01rst order. The second does the same, but in exactly the opposite order arriving at the\nquery variable last.\n\nVariable Selection Policies We evaluate three variable selection policies:\nThe \ufb01rst policy RBVar explores the idea of picking the variable that least increases the Rao-Blackwell\nvariance of the query (Darwiche, 2009). For a given query \u03b1, to select our next variable from X, we\nuse argmin_{X\u2208X} \u03a3_x P(\u03b1|X)^2 P(X). This quantity can be computed in time linear in the size of the\ncurrent SDD.\nThe next policy we look at is MinEnt, which selects the variable with the smallest entropy. Intuitively,\nthis is selecting the variable for which sampling assumes the least amount of unknown information.\nFinally, we examine a graph-based policy FD (FrontierDistance). At any given point in our compilation\nwe have some frontier F, which is the set of variables that have some but not all factors included in\nthe current SDD. Then we select the variable in our current SDD that is, on the graph of our model,\nclosest to the \u201ccenter\u201d induced by F. That is, we use argmin_{X\u2208X} max_{F\u2208F} distance(X, F).\nIn our experiments, policy RBVar is used with the compilation order BFS, while policies MinEnt and\nFrontierDist are used with order RevBFS.\n\n4.1 Understanding Collapsed Compilation\nWe begin our evaluation with experiments designed to shed some light on different components involved\nin collapsed compilation. First, we evaluate our choice in proposal distribution by comparison\nto marginal-based proposals. Then, we examine the effects of setting different size thresholds for\ncompilation on the overall performance, as well as the sample count and quality.\n\nEvaluating the Proposal Distribution Selecting an effective proposal distribution is key to successfully\nusing importance sampling estimation (Tokdar & Kass, 2010). 
As discussed in Section 3,\none requirement of online collapsed importance sampling is that we must provide a proposal distribution\nover any subset of variables, which in general is challenging.\nTo evaluate the quality of collapsed compilation\u2019s proposal distribution, we compare it to using\nmarginal-based proposals, and highlight the problem with such proposals. First, we compare to a\ndummy uniform proposal. Second, we compare to a proposal that uses the true marginals for each\nvariable. Experiments on the 50-20 benchmark are shown in Table 1a. Note that these experiments\nwere run for 3 hours rather than 1 hour, so the numbers cannot be compared exactly to other tables.\nParticularly with policies FrontierDist and MinEnt, the results underline the effectiveness of\ncollapsed compilation\u2019s proposal distribution over baselines. This is the effect of conditioning \u2013 even\nsampling from the true posterior marginals does not work very well, due to the missed correlation\nbetween variables. Since we are already conditioning for our partial exact inference, collapsed\ncompilation\u2019s proposal distribution is providing this improvement for very little added cost.\n\nChoosing a Size Threshold A second requirement for collapsed compilation is to set a size\nthreshold for the circuit being maintained. Setting the threshold to be in\ufb01nity leaves us with exact\ninference which is in general intractable, while setting the threshold to zero leaves us with importance\nsampling using what is likely a poor proposal distribution (since we can only consider one factor at a\ntime). Clearly, the optimal choice \ufb01nds a trade-off between these two considerations.\nUsing benchmark 50-20 again, we compare the performance on three different settings for the circuit\nsize threshold: 10,000, 100,000, and 1,000,000. Table 1b shows that generally, 100k gives the best\nperformance, but the results are often similar. To further investigate this, Table 1c and Table 1d show\nperformance with exactly 50 samples for each size, and number of samples per hour respectively.\nThis is more informative as to why 100k gave the best performance \u2013 there is a massive difference in\nperformance for a \ufb01xed number of samples between 10k and 100k or 1m. The gap between 100k and\n1m is quite small, so as a result the increased number of samples for 100k leads to better performance.\nIntuitively, this is due to the nature of exact circuit compilation, where at a certain size point of\ncompilation you enter an exponential regime. Ideally, we would like to stop compiling right before\nwe reach that point. Thus, we proceed with 100k as our size-threshold setting for further experiments.\n\nTable 1: Internal comparisons for collapsed compilation. Values represent Hellinger distances.\n\n(a) Comparison of proposal distributions\nPolicy  Dummy    True     SDD\nFD      2.37e-4  1.77e-4  3.72e-7\nMinEnt  3.29e-4  1.31e-3  2.10e-8\nRBVar   5.81e-3  5.71e-3  7.34e-3\n\n(b) Comparison of size thresholds\nPolicy  10k      100k     1m\nFD      7.33e-5  9.77e-6  7.53e-6\nMinEnt  1.44e-3  1.50e-5  8.07e-4\nRBVar   2.96e-2  2.66e-2  8.81e-3\n\n(c) Comparison of size thresholds (50 samples)\nPolicy  10k      100k     1m\nFD      1.63e-3  5.08e-7  1.27e-6\nMinEnt  1.69e-2  1.84e-6  7.24e-6\nRBVar   1.94e-2  1.52e-1  3.07e-2\n\n(d) Number of samples taken in 1 hour by size\nSize Threshold     10k    100k  1m\nNumber of Samples  561.3  33.5  4.7\n\n4.2 Memory-Constrained Comparison\n\nIn this section, we compare collapsed compilation to two related state-of-the-art methods: edge-deletion\nbelief propagation (EDBP) (Choi & Darwiche, 2006), and IJGP-Samplesearch (SS) (Gogate\n& Dechter, 2011). Generally, for example in past UAI probabilistic inference competitions, comparing\nmethods in this space involves a \ufb01xed amount of time and memory being given to each tool. The\nresults are then directly compared to determine the empirically best performing algorithm. 
While this\nis certainly a useful metric, it is highly dependent on efficiency of implementation, and moreover\ndoes not provide as good an understanding of the effects of being allowed to do more or less exact\ninference. To give more informative results, in addition to a time limit, we restrict our comparison at\nthe algorithmic level, by controlling for the level of exact inference being performed.\n\nEdge-Deletion Belief Propagation EDBP performs approximate inference by running increasingly\nmore exact junction tree inference, and approximating the rest via belief propagation (Choi &\nDarwiche, 2006; Choi et al., 2005). To constrain EDBP, we limit the corresponding circuit size for\nthe junction tree used. In our experiments we set these limits at 100,000 and 1,000,000.\n\nIJGP-Samplesearch IJGP-Samplesearch (SS) is an importance sampler augmented with constraint\nsatisfaction search (Gogate & Dechter, 2011, 2007). It uses iterative join-graph propagation (Dechter\net al., 2002) together with w-cutset sampling (Bidyuk & Dechter, 2007) to form a proposal, and then\nuses search to ensure that no samples are rejected. To constrain SS, we limit the treewidth w to 15,\n12, or 10. For reference, a circuit of size 100,000 corresponds to a treewidth between 10 and 12.\nAppendix D describes both baselines as well as the experimental setup in further detail.\n\n4.2.1 Discussion\nTable 2 shows the experimental results for this setting. Overall, we have found that when restricting all\nmethods to only do a fixed amount of exact inference, collapsed compilation has similar performance\nto both Samplesearch and EDBP. Furthermore, given a good choice of variable selection policy, it can\noften perform better. In particular, we highlight DBN, where we see that collapsed compilation with\nthe RBVar or MinEnt policies is the only method that manages to achieve reasonable approximate\ninference. 
This follows the intuition discussed in Section 2.2: a good choice of a few variables in a\ndensely connected model can lead to relatively easy exact inference for a large chunk of the model.\n\nTable 2: Hellinger distances across methods with internal treewidth and size bounds\nMethod     50-20    75-26    DBN      Grids    Segment  linkage  frust\nEDBP-100k  2.19e-3  3.17e-5  6.39e-1  1.24e-3  1.63e-6  6.54e-8  4.73e-3\nEDBP-1m    7.40e-7  2.21e-4  6.39e-1  1.98e-7  1.93e-7  5.98e-8  4.73e-3\nSS-10      2.51e-2  2.22e-3  6.37e-1  3.10e-1  3.11e-7  4.93e-2  1.05e-2\nSS-12      6.96e-3  1.02e-3  6.27e-1  2.48e-1  3.11e-7  1.10e-3  5.27e-4\nSS-15      9.09e-6  1.09e-4  (Exact)  8.74e-4  3.11e-7  4.06e-6  6.23e-3\nFD         9.77e-6  1.87e-3  1.24e-1  1.98e-4  6.00e-8  5.99e-6  5.96e-6\nMinEnt     1.50e-5  3.29e-2  1.83e-2  3.61e-3  3.40e-7  6.16e-5  3.10e-2\nRBVar      2.66e-2  4.39e-1  6.27e-3  1.20e-1  3.01e-7  2.02e-2  2.30e-3\n\nAnother factor differentiating collapsed compilation from both EDBP and Samplesearch is the lack\nof reliance on some type of belief propagation algorithm. Loopy belief propagation is a cornerstone\nof approximate inference in graphical models, but it is known to have problems converging to a good\napproximation on certain classes of models (Murphy et al., 1999). The problem instance frust16 is\none such example \u2013 it is an Ising model with spins set up such that potentials can form loops, and the\nperformance of both EDBP and Samplesearch highlights these issues.\n\n4.3 Probabilistic Program Inference\n\nAs an additional point of comparison, we introduce a new type of\nbenchmark. We use the probabilistic logic programming language\nProbLog (De Raedt & Kimmig, 2015) to model a graph with probabilistic\nedges, and then query for the probability of two nodes\nbeing connected. This problem presents a unique challenge, as every\nnon-unary factor is deterministic.\nTable 3 shows the results for this benchmark, with the underlying\ngraph being a 12x12 grid. We see that EDBP struggles here due to the\nlarge number of deterministic factors, which stop belief propagation\nfrom converging in the allowed number of iterations. Samplesearch and collapsed compilation show\nsimilarly decent results, but interestingly, not for the same reason. To contextualize\nthis discussion, consider the stability of each method. Collapsed compilation draws far fewer samples\nthan SS \u2013 some of this is made up for by how powerful collapsing is as a variance reduction technique,\nbut it is indeed less stable than SS. For this particular instance, we found that while different runs of\ncollapsed compilation tended to give different marginals fairly near the true value, SS consistently\ngave the same incorrect marginal. This suggests that if we ran each algorithm until convergence,\ncollapsed compilation would tend toward the correct solution, while SS would not; SS appears to\nbe biased on this benchmark.\n\nTable 3: Hellinger distances for ProbLog benchmark\nMethod   Prob12\nEDBP-1m  3.18e-1\nSS-15    3.87e-3\nFD       1.50e-3\n\n5 Related Work and Conclusions\n\nWe have presented online collapsed importance sampling, an asymptotically unbiased estimator\nthat allows for doing collapsed importance sampling without choosing which variables to collapse\na priori. Using techniques from knowledge compilation, we developed collapsed compilation, an\nimplementation of online collapsed importance sampling that draws its proposal distribution from\npartial compilations of the distribution, and naturally exploits structure in the distribution.\nIn related work, Lowd & Domingos (2010) study arithmetic circuits as a variational approximation\nof graphical models. Approximate compilation has been used for inference in probabilistic (logic)\nprograms (Vlasselaer et al., 2015). Other approximate inference algorithms that exploit local structure\ninclude Samplesearch and the family of universal hashing algorithms (Ermon et al., 2013; Chakraborty\net al., 2014). 
Finally, collapsed compilation can be viewed as an approximate knowledge compilation\nmethod: each drawn sample presents a partial knowledge base along with the corresponding correction\nweight. This means that it can be used to approximate any query which can be performed efficiently\non an SDD \u2013 for example the most probable explanation (MPE) query (Chan & Darwiche, 2006;\nChoi & Darwiche, 2017). We leave this as an interesting direction for future work.\n\nAcknowledgements\nWe thank Jonas Vlasselaer and Wannes Meert for initial discussions. Additionally, we thank Arthur\nChoi, Yujia Shen, Steven Holtzen, and YooJung Choi for helpful feedback. This work is partially\nsupported by a gift from Intel, NSF grants #IIS-1657613, #IIS-1633857, #CCF-1837129, and DARPA\nXAI grant #N66001-17-2-4032.\n\nReferences\nBekker, Jessa, Davis, Jesse, Choi, Arthur, Darwiche, Adnan, and Van den Broeck, Guy. Tractable\nlearning for complex probability queries. In Advances in Neural Information Processing Systems,\npp. 2242\u20132250, 2015.\n\nBidyuk, Bozhena and Dechter, Rina. Cutset sampling for Bayesian networks. Journal of Artificial\nIntelligence Research (JAIR), 28:1\u201348, 2007.\n\nBoutilier, Craig, Friedman, Nir, Goldszmidt, Moises, and Koller, Daphne. Context-specific independence\nin Bayesian networks. In Proceedings of UAI, pp. 115\u2013123, 1996.\n\nChakraborty, Supratik, Fremont, Daniel J, Meel, Kuldeep S, Seshia, Sanjit A, and Vardi, Moshe Y.\nDistribution-aware sampling and weighted model counting for SAT. In Proceedings of AAAI,\nvolume 14, pp. 1722\u20131730, 2014.\n\nChan, Hei and Darwiche, Adnan. On the Robustness of Most Probable Explanations. In Proceedings\nof UAI, pp. 63\u201371, Arlington, Virginia, United States, 2006. AUAI Press.\n\nChavira, Mark and Darwiche, Adnan. 
On probabilistic inference by weighted model counting.\nArtificial Intelligence, 172:772\u2013799, 2008.\n\nChavira, Mark, Darwiche, Adnan, and Jaeger, Manfred. Compiling relational Bayesian networks for\nexact inference. IJAR, 42(1-2):4\u201320, 2006.\n\nChoi, Arthur and Darwiche, Adnan. An edge deletion semantics for belief propagation and its\npractical impact on approximation quality. In Proceedings of AAAI, volume 21, pp. 1107, 2006.\n\nChoi, Arthur and Darwiche, Adnan. On relaxing determinism in arithmetic circuits. In ICML, 2017.\n\nChoi, Arthur, Chan, Hei, and Darwiche, Adnan. On Bayesian network approximation by edge\ndeletion. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence,\npp. 128\u2013135, Arlington, Virginia, United States, 2005. AUAI Press. ISBN 0-9749039-1-4.\n\nChoi, Arthur, Kisa, Doga, and Darwiche, Adnan. Compiling Probabilistic Graphical Models Using\nSentential Decision Diagrams. In ECSQARU, 2013.\n\nDarwiche, Adnan. Compiling Knowledge into Decomposable Negation Normal Form. In Proceedings\nof IJCAI, 1999.\n\nDarwiche, Adnan. Decomposable negation normal form. J. ACM, 48:608\u2013647, 2001.\n\nDarwiche, Adnan. A Differential Approach to Inference in Bayesian Networks. J. ACM, 50(3):\n280\u2013305, May 2003. ISSN 0004-5411.\n\nDarwiche, Adnan. Modeling and reasoning with Bayesian networks. Cambridge University Press,\n2009.\n\nDarwiche, Adnan. SDD: A New Canonical Representation of Propositional Knowledge Bases. In\nProceedings of IJCAI, 2011.\n\nDe Raedt, Luc and Kimmig, Angelika. Probabilistic (logic) programming concepts. Machine\nLearning, 100(1):5\u201347, 2015.\n\nDechter, Rina, Kask, Kalev, and Mateescu, Robert. Iterative join-graph propagation. In Proceedings of\nthe Eighteenth conference on Uncertainty in artificial intelligence, pp. 128\u2013136. Morgan Kaufmann\nPublishers Inc., 2002.\n\nErmon, Stefano, Gomes, Carla P, Sabharwal, Ashish, and Selman, Bart. 
Embed and project: Discrete\nsampling with universal hashing. In Advances in Neural Information Processing Systems, pp.\n2085\u20132093, 2013.\n\nFierens, Daan, Van den Broeck, Guy, Renkens, Joris, Shterionov, Dimitar, Gutmann, Bernd, Thon,\nIngo, Janssens, Gerda, and De Raedt, Luc. Inference and learning in probabilistic logic programs\nusing weighted boolean formulas. TPLP, 15(3):358\u2013401, 2015.\n\nGogate, Vibhav and Dechter, Rina. Samplesearch: A scheme that searches for consistent samples. In\nArtificial Intelligence and Statistics, pp. 147\u2013154, 2007.\n\nGogate, Vibhav and Dechter, Rina. Samplesearch: Importance sampling in presence of determinism.\nArtificial Intelligence, 175(2):694\u2013729, 2011.\n\nKoller, Daphne and Friedman, Nir. Probabilistic graphical models: principles and techniques. 2009.\n\nLowd, Daniel and Domingos, Pedro. Approximate inference by compilation to arithmetic circuits. In\nNIPS, pp. 1477\u20131485, 2010.\n\nMurphy, Kevin P, Weiss, Yair, and Jordan, Michael I. Loopy belief propagation for approximate\ninference: An empirical study. In Proceedings of UAI, pp. 467\u2013475. Morgan Kaufmann Publishers\nInc., 1999.\n\nTokdar, Surya T and Kass, Robert E. Importance sampling: a review. Wiley Interdisciplinary Reviews:\nComputational Statistics, 2(1):54\u201360, 2010.\n\nVan den Broeck, Guy. Lifted Inference and Learning in Statistical Relational Models. PhD thesis,\nKU Leuven, January 2013.\n\nVan den Broeck, Guy and Darwiche, Adnan. On the role of canonicity in knowledge compilation. In\nProceedings of the 29th Conference on Artificial Intelligence (AAAI), 2015.\n\nVan den Broeck, Guy and Suciu, Dan. Query Processing on Probabilistic Data: A Survey. Foundations\nand Trends in Databases. Now Publishers, 2017. doi: 10.1561/1900000052.\n\nVlasselaer, Jonas, Van den Broeck, Guy, Kimmig, Angelika, Meert, Wannes, and De Raedt, Luc.\nAnytime inference in probabilistic logic programs with Tp-compilation. 
In Proceedings of IJCAI,\npp. 1852\u20131858, July 2015.\n\nVlasselaer, Jonas, Meert, Wannes, Van den Broeck, Guy, and De Raedt, Luc. Exploiting local and\nrepeated structure in dynamic Bayesian networks. Artificial Intelligence, 232:43\u201353, March 2016.\nISSN 0004-3702. doi: 10.1016/j.artint.2015.12.001.", "award": [], "sourceid": 4952, "authors": [{"given_name": "Tal", "family_name": "Friedman", "institution": "UCLA"}, {"given_name": "Guy", "family_name": "Van den Broeck", "institution": "UCLA"}]}