{"title": "Approximability of Probability Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 377, "page_last": 384, "abstract": "", "full_text": "Approximability of Probability Distributions\n\nIBM T. J. Watson Research Center\n\nIBM T. J. Watson Research Center\n\nHawthorne, NY 10532\nrish@us.ibm.com\n\nAlina Beygelzimer\u2217\n\nIrina Rish\n\nHawthorne, NY 10532\n\nbeygel@cs.rochester.edu\n\nAbstract\n\nWe consider the question of how well a given distribution can be approx-\nimated with probabilistic graphical models. We introduce a new param-\neter, effective treewidth, that captures the degree of approximability as\na tradeoff between the accuracy and the complexity of approximation.\nWe present a simple approach to analyzing achievable tradeoffs that ex-\nploits the threshold behavior of monotone graph properties, and provide\nexperimental results that support the approach.\n\n1\n\nIntroduction\n\nOne of the major concerns in probabilistic reasoning using graphical models, such as\nBayesian networks, is the computational complexity of inference. In general, probabilistic\ninference is NP-hard and a typical approach to handling this complexity is to use an approx-\nimate inference algorithm that trades accuracy for ef\ufb01ciency. This leads to the following\nquestion: How can we distinguish between distributions that are easy to approximate and\nthose that are hard? More generally, how can we characterize the inherent degree of distri-\nbution\u2019s complexity, i.e. its approximability?\n\nThese questions also arise in the context of learning probabilistic graphical models from\ndata. Note that traditional model selection criteria, such as BIC/MDL, aim at \ufb01tting the\ndata well and minimizing the representation complexity of the learned model (i.e., the\ntotal number of parameters). 
However, as demonstrated in [2], such criteria are unable to capture the inference complexity: two models that have similar representation complexity and fit the data equally well may have quite different graph structures, making one model exponentially slower for inference than the other. Thus, our goal is to develop learning algorithms that can find good trade-offs between the accuracy of a model and its inference complexity.

Commonly used exact inference algorithms, such as the junction tree algorithm [12], or closely related variable-elimination techniques [6], essentially triangulate the graph, and their complexity is exponential in the size of the largest clique induced during triangulation (a parameter known as the treewidth). Generally, it can be shown that (in some precise sense) any scheme for belief updating based on local calculations must contain a hidden triangulation [10]. Thus the treewidth arises as a natural measure of inference complexity in graphical models.

∗ The work was done while the author was at the Department of Computer Science, University of Rochester.

Intuitively, a probability distribution is approximable, or easy, if it is close to a distribution represented by an efficient, low-treewidth graphical model. We use the Kullback-Leibler divergence dKL as a measure of closeness.¹ The following example explains our intuition behind approximable vs. nonapproximable distributions.

Motivating Example Consider the parity function on n binary random variables {X1, . . . , Xn}, and let our target distribution P be the uniform distribution on the values to which it assigns 1 (i.e., on n-bit strings with an odd number of 1s).
It is easy to see that any approximation Q that decomposes over a network whose moralized graph misses at least one edge is precisely as inaccurate as the one that assumes all variables to be independent (i.e., has no edges).

[Figure: dKL(P, Q) stays at 1 for every treewidth from 0 (empty graph) through n − 2, dropping to 0 only at treewidth n − 1 (clique).]

This follows from the fact that the probability distribution induced on any proper subset of the variables is uniform, and thus for any subset {Xi1, . . . , Xik} of k < n variables, P(Xi1 | Xi2, . . . , Xik) = P(Xi1), uniform on {0, 1}. It is then readily seen that Σ_x P(x) log Q(x) = 2^−(n−1) Σ_{x : P(x) > 0} log Π_{i=1}^{n} Q(xi | xi1, . . . , xir) = log Π_{i=1}^{n} Q(xi) = log 2^−n = −n,² and dKL(P, Q) = −H(P) + n = 1 since H(P) = n − 1. Thus, unless we can afford the complexity of the complete graph, there is absolutely no sense (i.e., absolutely no gain in accuracy and a potentially exponential loss of efficiency) in using a model more complex than the empty graph (i.e., n isolated nodes with no edges). This intuitively captures what we mean by a nonapproximable distribution.

On the other hand, one can easily construct a distribution with large weak dependencies such that representing this distribution exactly requires a network with large treewidth; however, if we are willing to sacrifice just a bit of accuracy, we get a very simple model. For example, consider a distribution P({X1, . . . , Xn}) in which variables X1, . . . , Xn−1 are independent and uniformly distributed; if all of X1, . . . , Xn−1 are true, Xn is true with probability 1 (and false with probability 0); otherwise Xn is true with probability 1/2 (regardless of the values of X1, . . . , Xn−1). The network yielding zero KL-divergence is the n-node clique (after moralization).
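Both of these divergence claims can be checked by brute-force enumeration for small n. This is a minimal sketch, assuming binary variables and base-2 logarithms; the function names are ours, not the paper's:

```python
import itertools
import math

def parity_dkl(n):
    """dKL between the uniform distribution on odd-parity n-bit strings
    and its empty-graph projection (every marginal is uniform)."""
    support = [x for x in itertools.product((0, 1), repeat=n)
               if sum(x) % 2 == 1]
    p = 1.0 / len(support)          # P is uniform on 2^(n-1) strings
    q = 2.0 ** -n                   # product of uniform marginals
    return sum(p * math.log2(p / q) for _ in support)

def weak_dep_dkl(n):
    """dKL between P (X_n forced to 1 exactly when X_1..X_{n-1} are all 1,
    otherwise a fair coin) and its empty-graph projection."""
    a = 2.0 ** -(n - 1)             # Pr[X_1 = ... = X_{n-1} = 1]
    q1 = 0.5 + a / 2.0              # marginal Q(X_n = 1)
    dkl = 0.0
    for x in itertools.product((0, 1), repeat=n):
        prefix = 2.0 ** -(n - 1)    # X_1..X_{n-1} uniform and independent
        if all(x[:-1]):
            p = prefix if x[-1] == 1 else 0.0
        else:
            p = prefix * 0.5
        q = prefix * (q1 if x[-1] == 1 else 1.0 - q1)
        if p > 0:
            dkl += p * math.log2(p / q)
    return dkl

print(parity_dkl(5))     # exactly 1 bit, for every n
print(weak_dep_dkl(5))   # small, and it shrinks as n grows
```

For the parity distribution the divergence is exactly 1 bit regardless of n, matching dKL(P, Q) = −H(P) + n, while for the weak-dependency distribution the empty-graph divergence vanishes on the order of 2^−(n−1).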
Tolerating KL-divergence 2^−(n−1) (i.e., exponentially vanishing with n) allows us to use an exponentially more efficient model for P (namely, the empty graph).

The following questions naturally arise: If we tolerate a certain inaccuracy, what is the best inference complexity we can hope to achieve? Or, what is the best achievable approximation accuracy given a constraint on the complexity (i.e., a bound on the treewidth)? The tradeoff between the complexity and accuracy is monotonic; however, it may be far from linear. The goal is to exploit these nonlinearities in choosing the best available tradeoff.

Our analysis of accuracy vs. complexity trade-offs is based on results from random graph theory which suggest that graph properties monotone in edge addition (such as graph connectivity) appear rather suddenly: the transition from the property being very unlikely to it being very likely occurs during a small change of the edge probability p (density) in the random graph [7, 8].

This paper makes the following contributions.
First, we show that both important properties of random graphical models, the property of "being efficient" (i.e., having treewidth at most some fixed integer k) and the property of "being accurate" (i.e., being at distance at most some δ from the target distribution), are monotone and demonstrate a threshold behavior, giving us two families of threshold curves parameterized by k and by δ, respectively. Second, we introduce the notion of effective treewidth k(δ), which denotes the smallest achievable treewidth k given a constraint δ on the KL-divergence (error) from the target (we also introduce a notion of ε-achievable k(δ), which requires at least an ε-fraction of models in a given set to achieve treewidth k and error δ). The effective treewidth captures the approximability of the distribution, and is determined by the relative position of the threshold curves, an inherent property of the target distribution. Finally, we provide an efficient sampling-based approach that actually finds a model achieving k(δ) with high probability. We estimate the threshold curves and, using their relative position, identify a class of treewidth-bounded models such that the models in the class are still simple, yet the class already contains (with high probability) a sufficiently good approximation to the target distribution (otherwise, we suggest that the distribution is inherently hard to approximate).

¹ Note that minimizing dKL from the empirical distribution (induced by a given set of samples) also corresponds to maximizing the likelihood of the observed data.

² The second-to-last equality is due to the well-known fact that dKL(P, Q) is minimized by forcing the conditional probabilities of Q to coincide with those computed from P.

2 Preliminaries and Related Work

Let P be a probability distribution on n discrete random variables X1, X2, . . . , Xn.
A Bayesian network B exploits the independences among the Xi to provide a compact representation of P as a product of low-order conditional probability distributions. The independences are encoded by a directed acyclic graph (DAG) G with nodes corresponding to X1, X2, . . . , Xn and edges representing direct dependencies. Each Xi is independent of its non-descendants given its parents in the graph [12]. The dependencies are quantified by associating each node Xi with a local conditional probability distribution PB(Xi | Πi), where Πi is the set of parents of Xi in G. The joint probability distribution encoded by B is given by the product PB(X1, . . . , Xn) = Π_{i=1}^{n} PB(Xi | Πi). We say that a distribution P decomposes over a DAG G if there exist local conditional probability distributions corresponding to G such that P can be written in such a form.

In general, exact probabilistic inference in Bayesian networks is NP-hard. For singly-connected networks (i.e., networks with no undirected cycles), there is a linear-time local belief-propagation algorithm [12]. In order to use this algorithm in the presence of cycles, one typically constructs a junction tree of the network and runs the algorithm on this tree [12]. Constructing a junction tree involves triangulating the graph, i.e., adding edges so that every cycle of length greater than three has a chord (i.e., an edge between a pair of non-adjacent nodes). Each triangulation corresponds to some order of eliminating variables when summing terms out during inference [6]. Exact inference can then be done in time and space linear in the representation of clique marginals in the junction tree, which is exponential in the size of the largest clique induced during triangulation. This number (minus one) is known as the width of a given triangulation. The minimum width over all possible triangulations is called the treewidth of the graph.
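As a concrete instance of the product form above, here is a tiny hand-built network X0 → X2 ← X1; the structure and the CPT entries are illustrative choices of ours, not taken from the paper:

```python
from itertools import product

# Hypothetical 3-node network X0 -> X2 <- X1 with made-up parameters.
parents = {0: (), 1: (), 2: (0, 1)}
cpt = {
    0: {(): 0.6},                    # Pr[X0 = 1]
    1: {(): 0.3},                    # Pr[X1 = 1]
    2: {(0, 0): 0.1, (0, 1): 0.5,    # Pr[X2 = 1 | X0 = a, X1 = b]
        (1, 0): 0.7, (1, 1): 0.9},
}

def joint(x):
    """P_B(x) = prod_i P_B(x_i | pa_i), the product form from the text."""
    p = 1.0
    for i, pa in parents.items():
        p1 = cpt[i][tuple(x[j] for j in pa)]
        p *= p1 if x[i] == 1 else 1.0 - p1
    return p

# The factored form still defines a proper joint distribution.
print(sum(joint(x) for x in product((0, 1), repeat=3)))  # 1.0
```

Summing the product over all 2^3 assignments returns 1, confirming that the factored form is a proper joint distribution; note that moralizing this network would marry the parents X0 and X1 of X2.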
The triangulation procedure is defined for undirected graphs, so we must first make the network undirected while preserving the set of independence assumptions; this can be done by moralizing the network, i.e., connecting ("marrying") the parents of every node by a clique and then dropping the direction of all edges.

Given a set of independent samples from P, the general goal is to learn a model (a Bayesian network) of this distribution that involves dependencies only on limited subsets of the variables. Restricting the size of dependencies controls both overfitting and the complexity of inference in the resulting model. The samples are in the form of tuples ⟨x1, . . . , xn⟩, each corresponding to a particular assignment ⟨X1 = x1, . . . , Xn = xn⟩. Given a target distribution P(X) and an approximation Q(X), the information divergence (or Kullback-Leibler distance) between P and Q is defined as dKL(P, Q) = Σ_x P(x) log (P(x)/Q(x)), where x ranges over all possible assignments to the variables in X. (See [5].) Notice that dKL(P, Q) is not necessarily symmetric.

A natural way of controlling the complexity of the learned model is to limit ourselves to a class of treewidth-bounded networks. Let Dk denote the class of distributions decomposable on graphs with treewidth at most k (0 ≤ k < n), with D1 corresponding to the set of tree-decomposable distributions. The distribution within Dk minimizing the information divergence from the target distribution P is called the projection of P onto Dk. Again, if P is the empirical distribution, then this is also the distribution within Dk maximizing the likelihood of observing the data.

Learning bounded-treewidth models Chow and Liu [4] showed how to find a projection onto the set of tree-decomposable distributions.
For a fixed tree T, the projection of P onto the set of T-decomposable distributions is uniquely given by the distribution in which the conditional probabilities along the edges of T coincide with those computed from P. Hence the tree yielding the closest projection is simply given by any maximum-weight spanning tree, where the edge weight is the mutual information between the corresponding variables. Notice that candidate spanning trees can be compared without any knowledge of P beyond that given by pairwise statistics. The tree can be efficiently found using any of the well-known algorithms. The additive decomposition of dKL used in the proof can be easily extended to "wider" networks. Fix a network structure G, and let Q be a distribution decomposable over G. Then

dKL(P, Q) = Σ_x P(x) log (P(x)/Q(x)) = −Σ_{i=1}^{n} Σ_{xi,πi} P(xi, πi) log Q(xi | πi) − H(P),

where πi ranges over all possible values of Πi. If P is the empirical distribution induced by the given sample of size N (i.e., defined by frequencies of events in the sample), then the first term can be shown to be −LL(Q)/N.³ Thus minimizing dKL(P, Q) is equivalent to maximizing the log-likelihood LL(Q).

Standard arguments (see, for example, [12]) show that the likelihood term is maximized by forcing all conditional probabilities Q(xi | πi) to coincide with those computed from P. If P is the empirical distribution, this means forcing the parameters to be the corresponding relative frequencies in the sample. Hence if G is fixed, the projection onto the set of G-decomposable distributions is uniquely defined, and we will identify G with this projection (ignoring some notational abuse). It remains, of course, to find the G that is closest to P among all DAGs in some treewidth-bounded class Dk.
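The Chow-Liu construction just described (pairwise mutual information as edge weights, followed by a maximum-weight spanning tree) can be sketched as follows; this is our own minimal implementation, not the authors' code:

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(samples, i, j):
    """Empirical mutual information (in bits) between variables i and j."""
    n = len(samples)
    pij = Counter((s[i], s[j]) for s in samples)
    pi = Counter(s[i] for s in samples)
    pj = Counter(s[j] for s in samples)
    mi = 0.0
    for (a, b), c in pij.items():
        mi += (c / n) * math.log2((c / n) / ((pi[a] / n) * (pj[b] / n)))
    return mi

def chow_liu_tree(samples, nvars):
    """Maximum-weight spanning tree under mutual-information edge weights
    (Kruskal's algorithm with a union-find over variable indices)."""
    parent = list(range(nvars))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    edges = sorted(combinations(range(nvars), 2),
                   key=lambda e: -mutual_information(samples, *e))
    tree = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:                        # keep edges that join components
            parent[ru] = rv
            tree.append((u, v))
    return tree
```

On synthetic samples in which X0 and X2 are independent noisy copies of X1, the recovered tree is the chain {(0, 1), (1, 2)}, as only pairwise statistics are needed to rank the candidate trees.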
As observed by Höffgen [9], the problem readily reduces to the minimum-weight hypertree problem. The reverse reduction is not known, so the NP-hardness of the hypertree problem does not imply the hardness of the learning problem. Srebro [13] showed that a similar undirected decomposition holds for bounded-treewidth Markov networks (probabilistic models that use undirected graphs to represent dependencies). He showed that the learning problem is equivalent to finding a minimum-weight undirected hypertree, and so is NP-hard. It is important to note that Srebro [13] considered approximation in the context of density estimation rather than model selection; thus the choice of k is directly driven by the size of the sample space, and the only rationale for limiting the class of hypothesis distributions is to prevent overfitting. With an infinite amount of data, such an approach would learn a clique, since adding edges never increases the divergence. Our goal, on the other hand, is to find the most appropriate treewidth-bounded class onto which to project the distribution.

Threshold behavior of random graphs We use the model of random directed acyclic graphs (DAGs) defined by Barak and Erdős [1]. Consider the probability space G(n, p) of random undirected graphs on n nodes with edge probability p (i.e., every pair of nodes is connected with probability p, independently of every other pair). Let Gn,p stand for a random graph from this probability space. We will also occasionally use Gn,m to denote a graph chosen randomly from among all graphs with n nodes and m edges. When p = m/(n(n − 1)/2), the two models are practically identical.
A random DAG in the Barak-Erdős model is obtained from Gn,p by orienting the edges according to the ordering of vertices, i.e., all edges are directed from higher- to lower-indexed vertices.

³ Since the true distribution P is given only by the sample, we let P also denote the empirical distribution induced by the sample, ignoring some abuse of notation.

A graph property P is naturally associated with the set of graphs having P. A property is monotone increasing if it is preserved under edge addition: if a graph G satisfies the property, then every graph on the same set of nodes containing G as a subgraph must satisfy it as well. It is easy to see (and intuitively clear) that if P is a monotone increasing property, then the probability that Gn,p satisfies P is a non-decreasing function of p. A monotone decreasing property is defined similarly. For example, the property of having treewidth at most some fixed integer k is monotone decreasing: adding edges can only increase the treewidth. The theory of random graphs was initiated by Erdős and Rényi [7], and one of the main observations they made was that many natural monotone properties appear rather suddenly, i.e., as we increase p, there is a sharp transition from a property being very unlikely to it being very likely. Friedgut [8] proved that every monotone graph property of undirected graphs has such a threshold behavior. Random DAGs (corresponding to random partially ordered sets) have received less attention than random undirected graphs, partially because of the additional structure that prevents the completely independent choice of edges. Nonetheless, many properties of random DAGs were also shown to have threshold functions. (See, for example, [3] and references therein.)
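The Barak-Erdős sampler, together with the moralization step used throughout the paper, can be sketched as follows (our own illustrative code):

```python
import random
from itertools import combinations

def random_dag(n, p, rng):
    """Barak-Erdos model: draw G(n, p) and direct every edge from its
    higher-indexed endpoint to its lower-indexed one, so the result is acyclic."""
    return [(j, i) for i, j in combinations(range(n), 2) if rng.random() < p]

def moralize(n, dag_edges):
    """Marry the parents of every node, then drop edge directions."""
    undirected = {frozenset(e) for e in dag_edges}
    for i in range(n):
        parents = [u for (u, v) in dag_edges if v == i]
        for u, w in combinations(parents, 2):
            undirected.add(frozenset((u, w)))
    return undirected

g = random_dag(6, 0.5, random.Random(0))
print(all(u > v for u, v in g))   # True: edges respect the vertex ordering
```

For example, moralizing the two-parent DAG {2 → 0, 1 → 0} adds the marrying edge {1, 2}; moralization can only add edges, which is why the treewidth property remains monotone on the moral graph.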
However, we are not aware of any general result for random DAGs analogous to that of Friedgut [8].

3 Formalization

First we introduce two properties of networks essential for the rest of the paper.

Accuracy Recall that the information divergence of a given DAG G from the target distribution P is given by dKL(P, G) = W(G) − H(P), where W(G) = −Σ_{i=1}^{n} Σ_{xi,πi} P(xi, πi) log P(xi | πi). (In our case, P is the empirical distribution induced by the given sample S of size N. As mentioned before, W(G) = −LL(G)/N ≥ 0.) Fix a distance parameter δ > 0, and consider the property Pδ of n-node DAGs of having W(G) ≤ δ. Notice that Pδ is monotone increasing: adding edges to a graph can only bring the graph closer to the target distribution, since any distribution decomposable on the original graph is also decomposable on the augmented one. Thus if G is a subgraph of G′, then W(G) ≤ δ implies W(G′) ≤ δ.

Complexity Fix an integer k, and consider the property of n-node DAGs of having treewidth of their moralized graph at most k. Call this property Pk and note that it is a structural property of a DAG, which does not depend on the target distribution and its projection onto the DAG. It is also a monotone decreasing property, since if a graph has treewidth at most k, then certainly any of its subgraphs does.

Recall that we identify each graph with the projection of the target distribution onto the graph. We call a pair (k, δ) achievable for a distribution P if there exists a distribution Q decomposable on a graph with treewidth at most k such that dKL(P, Q) ≤ δ. The effective treewidth of P, with respect to a given δ, is defined as the smallest k(δ) such that the pair (k(δ), δ) is achievable, i.e., no distribution at distance at most δ from P is decomposable on a graph with treewidth less than k(δ).
This formulation gives the level of inevitable complexity (i.e., treewidth) k, given the desired level of accuracy δ. We will also be interested in average-case analogs of these definitions. Fix ε > 0. We will say that a pair (k, δ) is ε-achievable for P if at least an ε-fraction of all DAGs in Dk certify that (k, δ) is achievable. Thus we care not only about the existence of an approximation with given δ and k, but also about the number of such approximations.

4 Main Idea

Consider, for each treewidth bound k, the curve given by µk(p) = Pr[width(Gn,p) ≤ k], and let pk be such that µk(pk) = 1/2 + ε, where 0 < ε < 1/2 is some fixed constant. Similarly, for δ > 0, define the curve µδ(p) = Pr[W(Gn,p) ≤ δ], and let pδ be the critical value of p given by µδ(pδ) = 1/2.

For reasons that will become clear in a moment, our goal will be to find, for each feasible treewidth k, the value of δ such that pδ = pk. To find each pk, the algorithm will simply do a binary search on the interval (0, 1): for the current value of the edge probability p, the algorithm estimates µk(p) by random sampling and branches according to the estimate. The search is continued until p gets sufficiently close to satisfying µk(p) = 1/2 + ε. To estimate µk(p) within an additive error ρ with probability at least 1 − γ, the algorithm samples m = ln(2/γ)/(2ρ²) independent copies of Gn,p, and outputs the average value of the 0/1 random variable indicating whether the treewidth of the sampled DAG is at most k. The analysis is just a straightforward application of the Chernoff bound. Note that the values related to treewidth are independent of the target distribution and can be precomputed offline.
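The binary search for pk can be sketched as follows. As a stand-in for exact treewidth we use the min-degree elimination heuristic mentioned later in the paper, applied for simplicity to the undirected graph rather than a moralized DAG; the function names are ours:

```python
import random
from itertools import combinations

def min_degree_width(n, edges):
    """Upper bound on treewidth via min-degree elimination: repeatedly
    remove a minimum-degree vertex after turning its neighborhood into
    a clique; the width is the largest degree seen at removal time."""
    adj = {i: set() for i in range(n)}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)
    width = 0
    while adj:
        v = min(adj, key=lambda x: len(adj[x]))
        nbrs = adj[v]
        width = max(width, len(nbrs))
        for a, b in combinations(nbrs, 2):
            adj[a].add(b)
            adj[b].add(a)
        for u in nbrs:
            adj[u].discard(v)
        del adj[v]
    return width

def find_pk(n, k, trials=200, eps=0.1, rng=None, tol=1e-3):
    """Binary search for p_k with mu_k(p_k) ~ 1/2 + eps, estimating
    mu_k(p) = Pr[width(G_{n,p}) <= k] by sampling random graphs."""
    rng = rng or random.Random(0)
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        p = (lo + hi) / 2
        hits = 0
        for _ in range(trials):
            edges = [e for e in combinations(range(n), 2) if rng.random() < p]
            hits += min_degree_width(n, edges) <= k
        if hits / trials > 0.5 + eps:
            lo = p      # the property is still likely; try denser graphs
        else:
            hi = p
    return (lo + hi) / 2
```

Because `min_degree_width` only upper-bounds the treewidth, the returned pk is a conservative estimate; plugging in a tighter width estimator refines it without changing the search.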
To find δ = δ(k) for a given value of k, the algorithm computes the values of W(Gn,pk) for the m sampled random DAGs in G(n, pk), orders them, and chooses the median. Each pair (k, δ) gives a point on the threshold curve. We know that at least a (1/2 + ε)-fraction of the DAGs in G(n, pk) satisfy Pk. On the other hand, at least half of them satisfy Pδ, and thus at least an ε-fraction satisfies both. Moreover, there is a very simple probabilistic algorithm for finding a model realizing the tradeoff: we just need to sample O(1/ε) DAGs in G(n, pk) and choose the closest one. Clearly we are overcounting, since the same DAGs may contribute to both probabilities; however not absurdly, since intuitively the graphs in G(n, pk) with small treewidth will not fit the distribution better than the ones with larger treewidth.

A small example should help make the goals clear. A distribution is called k-wise independent if any subset of k variables is mutually independent (however, there may exist dependencies on larger subsets). Figure 1 shows the curves for a 3-wise independent distribution on 8 random variables. We can hardly expect graphs with treewidth at most 2 to do well on this distribution, since all triples are independent, and their marginals do not reveal any higher-order structure; as we will see, this is indeed the case. The x-axis in Figure 1 corresponds to the number of edges m; the y-axis denotes the probability that Gn,m satisfies the property corresponding to a given curve. The monotone decreasing curves correspond to the properties Pk for k ∈ {1, . . . , 6} (from left to right, respectively).

[Figure 1: Threshold curves for a 3-wise independent distribution on 8 random variables (using a construction from [11]). x-axis: number of edges (0 to 30); y-axis: probability of satisfying the property (0 to 1).]
For k = 7, the curve is identically 1. The monotone increasing curves correspond to the property of having dKL at most δ. The leftmost curve is for δ = 0.07, and δ decreases by 0.01 with each curve as we go from left to right; the smaller δ, the higher the quality of approximation, and thus the smaller the probability of attaining it. The empty graph (treewidth 0) had divergence 0.073. As m increases, the probability of having small treewidth decreases, while the probability of getting close to the target increases. (Since n is small, we computed the divergence exactly.) As the random graph evolves, we want to capture the moment when the first probability is still high, while the second is already high. As expected, graphs with treewidth at most 2 are as inaccurate as the empty graph, since all triples are independent. Given the desired level of closeness δ, we want to find the smallest treewidth k such that the corresponding curves meet above some cut-off probability. For example, to get within dKL at most 0.07, we may suggest, say, projecting onto graphs with treewidth 4 (cutting at 0.4). The cut-off value determines the efficiency of finding a model with such k and δ (see the discussion above).

Estimating dKL Fix a bounded-treewidth DAG G. Let the target distribution be the empirical distribution P induced by a given sample. Recall that dKL(P, G) decomposes into a sum of conditional entropies induced by G (minus the entropy of P). Höffgen [9] showed how to estimate these conditional entropies with any fixed additive precision ρ using polynomially many samples.
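The procedure from Section 4 (estimate δ(k) as the median W over m DAGs sampled at edge probability pk, then keep the closest one as the model) can be sketched as follows, with W computed as the sum of empirical conditional entropies; this is our own minimal code, not the authors':

```python
import math
import random
from collections import Counter

def w_score(samples, parents):
    """Empirical W(G): the sum over nodes of H(X_i | Pa_i), so that
    dKL(P, G) = W(G) - H(P) and W(G) = -LL(G)/N."""
    n = len(samples)
    w = 0.0
    for i, pa in parents.items():
        joint = Counter((s[i],) + tuple(s[j] for j in pa) for s in samples)
        marg = Counter(tuple(s[j] for j in pa) for s in samples)
        for key, c in joint.items():
            # c / marg[key[1:]] is the empirical conditional P(x_i | pa_i)
            w -= (c / n) * math.log2(c / marg[key[1:]])
    return w

def delta_and_model(samples, nvars, pk, m, rng):
    """Median W over m Barak-Erdos DAGs drawn at edge probability pk
    (the delta(k) estimate), plus the single closest DAG as the model."""
    scored = []
    for _ in range(m):
        parents = {i: tuple(j for j in range(i + 1, nvars)
                            if rng.random() < pk)
                   for i in range(nvars)}
        scored.append((w_score(samples, parents), parents))
    scored.sort(key=lambda t: t[0])
    return scored[len(scored) // 2][0], scored[0][1]
```

Taking the best of O(1/ε) sampled DAGs is exactly the probabilistic tradeoff-realizing step described in Section 4; here `m` plays that role.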
More precisely, he showed that a sample of size m = m(γ, ρ) = O((n/ρ)² log²(n/ρ) log(n^(k+1)/γ)) suffices to obtain good estimates of all induced conditional entropies with probability at least 1 − γ, which in turn suffices to estimate dKL(P, G) with the additive precision ρ.

Estimating Treewidth We, of course, will not attempt to compute the treewidth of the randomly generated graphs exactly; the problem is NP-hard.⁴ In practice, people often use heuristics (based, for example, on eliminating vertices in the order of maximum cardinality, minimum degree, or minimum separating vertex set). There are no theoretical guarantees in general, but the heuristics tend to perform reasonably well: used in combination with various lower-bound techniques, they can often pin down the treewidth to a small range, or even identify it exactly.⁵ We stress that the values related to treewidth are independent of the target distribution and can be precomputed.

5 Experimental Results

We tested the approach presented in the paper on distributions ancestrally sampled from real-life medical networks commonly used for benchmarking. The experiments support the following conclusions: the properties capturing the complexity and accuracy of a model indeed demonstrate a threshold behavior, which can be exploited in determining the best tradeoff for the given distribution; and the simple approach based on generating random graphs and using them to approximate the thresholds is indeed capable of capturing the effective width of a distribution. Due to the page limit, we discuss an application of the method to a single network known as ALARM (originating from anesthesia monitoring).

The network has 37 nodes, 46 directed edges, and 19 additional undirected edges induced by moralization; the treewidth is 4.
A sample of size N = 10^4 was generated using ancestral sampling, inducing the empirical distribution with support on 5570 unique variable assignments. The entropy of the empirical distribution P was 9.6 (the maximum possible entropy for a 5570-point distribution is 12.4). Figure 2 shows the curve illustrating the (estimated) tradeoffs available for P. For each treewidth bound k, the curve gives an estimate of the best achievable value of W = dKL + H(P). (Recall that LL = −N · W.) The estimate is based on generating 400 random DAGs with 37 nodes and m edges, for every possible m. Several points on the curve are worthy of note. The upper-left point (0, 23.4) corresponds to the model that assumes all 37 variables to be independent. At the other extreme, the lower-right point (36, 0) corresponds to the clique on 37 nodes, which of course can model P perfectly, but with exponential complexity. The closer the area under the curve is to zero, the easier the distribution (in the sense discussed in this paper). Here we see that the highest gain in accuracy from allowing the model to be more complex occurs up to treewidth 4, less so at 5 and 6; by further increasing the treewidth we do not gain much in accuracy.

[Figure 2: Tradeoff curve for ALARM. x-axis: complexity (treewidth), 0 to 35; y-axis: accuracy (W), 12 to 24.]

We succeed in reconstructing the width in the sense that the distribution was

⁴ If k is fixed, the problem of determining whether a graph has treewidth k has a linear-time algorithm. As is typical, the bound contains a large hidden constant with k in the exponent, making the algorithm hardly applicable in practice. There are a number of constant-factor approximations with an exponential dependence on k, and a polynomial-time O(log k)-factor approximation.
No polynomial-time constant-factor approximation is known.

⁵ Although one can construct graphs for which the heuristics produce solutions that are arbitrarily far from optimal.

[Figure 3: Threshold curves for ALARM. x-axis: number of edges (0 to 120); y-axis: probability of satisfying the property (0 to 1).]

simulated from a treewidth-4 model.⁶ Such tradeoff curves are similar to commonly used ROC (Receiver Operating Characteristic) curves; the techniques for finding the cutoff value in ROC curves can be used here as well. Instead of plotting the best achievable distance, we can plot the best distance achievable by at least an ε-fraction of models in the class, parameterizing the tradeoff curve by ε. Figure 3 shows the threshold curves. The axes have the same meaning as in Figure 1. Varying the sample size and the number of randomly generated DAGs does not change the behavior of the curves in any meaningful way; not surprisingly, increasing these parameters results in smoother curves.

References

[1] A. Barak and P. Erdős. On the maximal number of strongly independent vertices in a random acyclic directed graph. SIAM J. Algebraic and Discrete Methods, 5:508–514, 1984.

[2] A. Beygelzimer and I. Rish. Inference complexity as a model-selection criterion for learning Bayesian networks. In Proceedings of the Eighth International Conference on Principles of Knowledge Representation and Reasoning (KR2002), Toulouse, France, 2002.

[3] B. Bollobás and G. Brightwell. The structure of random graph orders. SIAM J. Discrete Mathematics, 10(2):318–335, 1997.

[4] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Trans. on Inf. Theory, 14:462–467, 1968.

[5] T. Cover and J. Thomas. Elements of information theory.
John Wiley & Sons Inc., New York, 1991. A Wiley-Interscience Publication.

[6] R. Dechter. Bucket elimination: A unifying framework for probabilistic reasoning. In M. I. Jordan (Ed.), Learning in Graphical Models, Kluwer Academic Press, 1998.

[7] P. Erdős and A. Rényi. On the evolution of random graphs. Bull. Inst. Internat. Statist., 38:343–347, 1961.

[8] E. Friedgut and G. Kalai. Every monotone graph property has a sharp threshold. Proceedings of the American Mathematical Society, 124(10):2993–3002, 1996.

[9] K. Höffgen. Learning and robust learning of product distributions. In Proceedings of the 6th Annual Workshop on Computational Learning Theory, pages 77–83, 1993.

[10] F. V. Jensen and F. Jensen. Optimal junction trees. In Proc. Tenth Conference on Uncertainty in AI (UAI), 1994.

[11] J. Naor and M. Naor. Small-bias probability spaces: Efficient constructions and applications. In Proc. of the 22nd ACM Symposium on Theory of Computing (STOC), pages 213–223, 1990.

[12] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, 1988.

[13] N. Srebro. Maximum likelihood bounded tree-width Markov networks. In Proceedings of the 17th Conference on Uncertainty in AI (UAI), pages 504–511, 2001.

⁶ Note, however, that it does not imply that the empirical distribution itself decomposes on a treewidth-4 model. The simplest example of this is when the true distribution is uniform.
", "award": [], "sourceid": 2498, "authors": [{"given_name": "Alina", "family_name": "Beygelzimer", "institution": null}, {"given_name": "Irina", "family_name": "Rish", "institution": null}]}