{"title": "Tractable Bayesian Network Structure Learning with Bounded Vertex Cover Number", "book": "Advances in Neural Information Processing Systems", "page_first": 622, "page_last": 630, "abstract": "Both learning and inference tasks on Bayesian networks are NP-hard in general. Bounded tree-width Bayesian networks have recently received a lot of attention as a way to circumvent this complexity issue; however, while inference on bounded tree-width networks is tractable, the learning problem remains NP-hard even for tree-width~2. In this paper, we propose bounded vertex cover number Bayesian networks as an alternative to bounded tree-width networks. In particular, we show that both inference and learning can be done in polynomial time for any fixed vertex cover number bound $k$, in contrast to the general and bounded tree-width cases; on the other hand, we also show that the learning problem is W[1]-hard in parameter $k$. Furthermore, we give an alternative way to learn bounded vertex cover number Bayesian networks using integer linear programming (ILP), and show this is feasible in practice.", "full_text": "Tractable Bayesian Network Structure Learning with Bounded Vertex Cover Number\n\nJanne H. Korhonen\n\nHelsinki Institute for Information Technology HIIT\n\nDepartment of Computer Science\n\nUniversity of Helsinki\n\njanne.h.korhonen@helsinki.fi\n\nPekka Parviainen\n\nHelsinki Institute for Information Technology HIIT\n\nDepartment of Computer Science\n\nAalto University\n\npekka.parviainen@aalto.fi\n\nAbstract\n\nBoth learning and inference tasks on Bayesian networks are NP-hard in general.\nBounded tree-width Bayesian networks have recently received a lot of attention as\na way to circumvent this complexity issue; however, while inference on bounded\ntree-width networks is tractable, the learning problem remains NP-hard even for\ntree-width 2. 
In this paper, we propose bounded vertex cover number Bayesian\nnetworks as an alternative to bounded tree-width networks. In particular, we show\nthat both inference and learning can be done in polynomial time for any fixed\nvertex cover number bound k, in contrast to the general and bounded tree-width\ncases; on the other hand, we also show that the learning problem is W[1]-hard in\nparameter k. Furthermore, we give an alternative way to learn bounded vertex\ncover number Bayesian networks using integer linear programming (ILP), and\nshow this is feasible in practice.\n\n1 Introduction\n\nBayesian networks are probabilistic graphical models representing joint probability distributions\nof random variables. They can be used as a model in a variety of prediction tasks, as they enable\ncomputing the conditional probabilities of a set of random variables given another set of random\nvariables; this is called the inference task. However, to use a Bayesian network as a model for\ninference, one must first obtain the network. Typically, this is done by estimating the network based\non observed data; this is called the learning task.\nBoth the inference and learning tasks are NP-hard in general [3, 4, 6]. One approach to deal with\nthis issue has been to investigate special cases where these problems would be tractable. That is,\nthe basic idea is to select models from a restricted class of Bayesian networks that have structural\nproperties enabling fast learning or inference; this way, the computational complexity will not be\nan issue, though possibly at the cost of accuracy if the true distribution is far from the model family.\nMost notably, it is known that the inference task can be solved in polynomial time if the network\nhas bounded tree-width, or more precisely, the inference task is fixed-parameter tractable in the\ntree-width of the network. 
Moreover, this is in a sense optimal, as bounded tree-width is necessary\nfor polynomial-time inference unless the exponential time hypothesis (ETH) fails [17].\nThe possibility of tractable inference has motivated several recent studies also on learning bounded\ntree-width Bayesian networks [2, 12, 16, 19, 22]. However, unlike in the case of inference, learning a\nBayesian network of bounded tree-width is NP-hard for any fixed tree-width bound at least 2 [16].\nFurthermore, it is known that learning many relatively simple classes such as paths [18] and polytrees\n[9] is also NP-hard. Indeed, so far the only class of Bayesian networks for which a polynomial-time\nlearning algorithm is known is trees, i.e., graphs with tree-width 1 [5] \u2013 it appears that our\nknowledge about structure classes allowing tractable learning is quite limited.\n\n1.1 Structure Learning with Bounded Vertex Cover Number\n\nIn this work, we propose bounded vertex cover number Bayesian networks as an alternative to\nthe tree-width paradigm. Roughly speaking, we consider Bayesian networks where all pairwise\ndependencies \u2013 i.e., edges in the moralised graph \u2013 are covered by having at least one node from the\nvertex cover incident to each of them; see Section 2 for technical details. Like bounded tree-width\nBayesian networks, this is a parameterised class, allowing a trade-off between the complexity of\nmodels and the size of the space of possible models by varying the parameter k.\n\nResults: complexity of learning bounded vertex cover networks. Crucially, we show that learning\nan optimal Bayesian network structure with vertex cover number at most k can be done in\npolynomial time for any fixed k. 
Moreover, vertex cover number provides an upper bound for\ntree-width, implying that inference is also tractable; thus, we identify a rare example of a class of\nBayesian networks where both learning and inference are tractable.\nSpecifically, our main theoretical result shows that an optimal Bayesian network structure with\nvertex cover number at most k can be found in time $4^k n^{2k+O(1)}$ (Theorem 5). However, while the\nrunning time of our algorithm is polynomial with respect to the number of nodes, the degree of the\npolynomial depends on k. We show that this is in a sense the best we can hope for; that is, we show that\nthere is no fixed-parameter algorithm with running time f(k) poly(n) for any function f even when\nthe maximum allowed parent set size is restricted to 2, unless the commonly accepted complexity\nassumption FPT \u2260 W[1] fails (Theorem 6).\nResults: ILP formulation and learning in practice. While we prove that learning bounded\nvertex cover Bayesian network structures can be done in polynomial time, the unavoidable dependence\non k in the degree of the polynomial makes the algorithm of our main theorem infeasible for practical\nusage when the vertex cover number k increases. Therefore, we investigate using an integer linear\nprogramming (ILP) formulation as an alternative way to find optimal bounded vertex cover Bayesian\nnetworks in practice (Section 4). Although the running time of an ILP is exponential in the worst\ncase, the actual running time in many practical scenarios is significantly lower; indeed, most of the\nstate-of-the-art algorithms for exact learning of Bayesian networks in general [1, 8] and with bounded\ntree-width [19, 22] are based on ILPs. Our experiments show that bounded vertex cover number\nBayesian networks can, indeed, be learned fast in practice using ILP (Section 5).\n\n2 Preliminaries\n\nDirected graphs. 
A directed graph D = (N, A) consists of a node set N and arc set A \u2286 N \u00d7 N;\nfor a fixed node set, we usually identify a directed graph with its arc set A. A directed graph is called\na directed acyclic graph or DAG if it contains no directed cycles. We write n = |N| and uv for arc\n(u, v) \u2208 A. For u, v \u2208 N with uv \u2208 A, we say that u is a parent of v and v is a child of u. We write\nAv for the parent set of v, that is, Av = {u \u2208 N : uv \u2208 A}.\nBayesian network structure learning. We consider Bayesian network structure learning using\nthe score-based approach [7, 14], where the input consists of the node set N and the local scores\nfv(S) for each node v \u2208 N and S \u2286 N \\ {v}. The task is to find a DAG A \u2013 the network structure \u2013\nthat maximises the score $f(A) = \\sum_{v \\in N} f_v(A_v)$.\nWe assume that the scores fv are computed beforehand, and that we can access each entry fv(S) in\nconstant time. We generally consider a setting where only parent sets belonging to specified sets\n$F_v \\subseteq 2^N$ are permitted. Typically, Fv consists of parent sets of size at most k, in which case we\nassume that the scores fv(S) are given only for |S| \u2264 k; that is, the size of the input is $O\\bigl(n \\binom{n}{k}\\bigr)$.\nMoralised graphs. For a DAG A, the moralised graph of A is an undirected graph MA = (N, EA),\nwhere EA is obtained by (1) adding an undirected edge {u, v} to EA for each arc uv \u2208 A, and (2)\nadding an undirected edge {u, v} to EA if u and v have a common child, that is, {uw, vw} \u2286 A\nfor some w \u2208 N. The edges added to EA due to rule (2) are called moral edges.\nTree-width and vertex cover number. A tree-decomposition of a graph G = (V, E) is a pair\n(X, T), where T is a tree with node set {1, 2, . . . , m} and X = {X1, X2, . . . , Xm} is a collection of\nsubsets of V with $\\bigcup_{i=1}^{m} X_i = V$ such that\n\n(a) for each {u, v} \u2208 E there is i with u, v \u2208 Xi, and\n(b) for each v \u2208 V the graph T[{i : v \u2208 Xi}] is connected.\n\nThe width of a tree-decomposition (X, T) is $\\max_i |X_i| - 1$. The tree-width tw(G) of a graph G is\nthe minimum width of a tree-decomposition of G. For a DAG A, we define the tree-width tw(A) as\nthe tree-width of the moralised graph MA [12].\nFor a graph G = (V, E), a set C \u2286 V is a vertex cover if each edge is incident to at least one vertex\nin C. The vertex cover number \u03c4(G) of a graph G is the size of the smallest vertex cover in G. As with\ntree-width, we define the vertex cover number \u03c4(A) of a DAG A as \u03c4(MA).\nLemma 1. For a DAG A, we have tw(A) \u2264 \u03c4(A).\nProof. By definition, the moralised graph MA has a vertex cover C of size \u03c4(A). We can construct\na star-shaped tree-decomposition for MA with a central node i with Xi = C and a leaf j with\nXj = C \u222a {v} for every v \u2208 N \\ C. Clearly, this tree-decomposition has width \u03c4(A); thus, we have\ntw(A) = tw(MA) \u2264 \u03c4(A).\nStructure learning with parameters. Finally, we give a formal definition for the bounded tree-width\nand bounded vertex cover number Bayesian network structure learning problems. That is, let\np \u2208 {\u03c4, tw}; in bounded-p Bayesian network structure learning, we are given a node set N, local\nscores fv(S) and an integer k, and the task is to find a DAG A maximising the score $\\sum_{v \\in N} f_v(A_v)$\nsubject to p(A) \u2264 k. For both tree-width and vertex cover number, the parameter k also bounds the\nmaximum parent set size, so we will assume that the local scores fv(S) are given only if |S| \u2264 k.\n\n3 Complexity Results\n\n3.1 Polynomial-time Algorithm\n\nWe start by making a few simple observations about the structure of bounded vertex cover number\nBayesian networks. 
In the following, we slightly abuse the terminology and say that N1 \u2286 N is a\nvertex cover for a DAG A if N1 is a vertex cover of MA.\nLemma 2. Let N1 \u2286 N be a set of size k, and let A be a DAG on N. Set N1 is a vertex cover for A\nif and only if\n\n(a) for each node v \u2209 N1, we have Av \u2286 N1, and\n(b) each node v \u2208 N1 has at most one parent outside N1.\n\nProof. (\u21d2) For (a), we have that if there were nodes u, v \u2209 N1 such that u is a child of v, the\nmoralised graph MA would have an edge {u, v} that is not covered by N1. Likewise for (b), we have\nthat if a node u \u2208 N1 had parents v, w \u2209 N1, then MA would have an edge {v, w} not covered by N1.\nThus, both (a) and (b) have to hold if A has vertex cover N1.\n(\u21d0) Since (a) holds, all directed edges in A have at least one endpoint in N1, and thus the corresponding\nundirected edges in MA are covered by N1. Moreover, by (a) and (b), no node has two parents\noutside N1, so all moral edges in MA also have at least one endpoint in N1.\n\nLemma 2 allows us to partition a DAG with vertex cover number k into a core that covers at most 2k\nnodes that are either in a fixed vertex cover or are parents of those nodes (core nodes), and a periphery\ncontaining arcs going into nodes that have no children and all parents in the vertex cover (peripheral\nnodes). This is illustrated in Figure 1(a), and the following lemma formalises the observation.\n\nFigure 1: (a) Example of a DAG with vertex cover number 4, with sets N1 and N2 as in Lemma 3.\n(b) Reduction used in Theorem 6; each edge in the original graph is replaced by a possible v-structure.\n\nLemma 3. Let A be a DAG on N with vertex cover N1 of size k. Then there is a set N2 \u2286 N \\ N1\nof size at most k and arc sets B and C such that A = B \u222a C and\n\n(a) B is a DAG on N1 \u222a N2 with vertex cover N1, and\n(b) C contains only arcs uv with u \u2208 N1 and v \u2209 N1 \u222a N2.\n\nProof. 
First, let $N_2 = \\bigl(\\bigcup_{v \\in N_1} A_v \\setminus N_1\\bigr)$. By Lemma 2, each v \u2208 N1 can have at most one parent\noutside N1, so we have |N2| \u2264 |N1| \u2264 k.\nNow let B = {uv \u2208 A : u, v \u2208 N1 \u222a N2} and C = A \\ B. To see that (a) holds for this choice of B,\nwe observe that the edge set of the moralised graph MB is a subset of the edges in MA, and thus N1\ncovers all edges of MB. For (b), the choice of N2 and Lemma 2 ensure that nodes in N \\ (N1 \u222a N2)\nhave no children, and again by Lemma 2 their parents are all in N1.\n\nDually, if we fix the core and peripheral node sets, we can construct a DAG with bounded vertex cover\nnumber by selecting the core independently from the parents of the peripheral nodes. Formally:\nLemma 4. Let N1, N2 \u2286 N be disjoint. Let B be a DAG on N1 \u222a N2 with vertex cover N1, and let\nC be a DAG on N such that C only contains arcs uv with u \u2208 N1 and v \u2209 N1 \u222a N2. Then\n\n(a) A = B \u222a C is a DAG on N with vertex cover N1, and\n(b) the score of A is $f(A) = \\sum_{v \\in N_1 \\cup N_2} f_v(B_v) + \\sum_{v \\notin N_1 \\cup N_2} f_v(C_v)$.\n\nProof. To see that (a) holds, we observe that B is acyclic by assumption, and addition of arcs from\nC cannot create cycles as there are no outgoing arcs from nodes in N \\ (N1 \u222a N2). Moreover, for\nv \u2208 N1 \u222a N2, there are no arcs ending at v in C, and likewise for v \u2209 N1 \u222a N2, there are no arcs\nending at v in B. Thus, we have Av = Bv if v \u2208 N1 \u222a N2 and Av = Cv otherwise. This implies that\nsince the conditions of Lemma 2 hold for both B and C, they also hold for A, and thus N1 is a vertex\ncover for A. Finally, the preceding observation implies also that fv(Av) = fv(Bv) for v \u2208 N1 \u222a N2\nand fv(Av) = fv(Cv) otherwise, which implies (b).\n\nLemmas 3 and 4 give the basis of our strategy for finding an optimal Bayesian network structure with\nvertex cover number at most k. That is, we iterate over all possible $\\binom{n}{k}\\binom{n-k}{k} = O(n^{2k})$ choices for\nsets N1 and N2; for each choice, we construct the optimal core and periphery as follows, keeping\ntrack of the best found DAG A\u2217:\nStep 1. To find the optimal core B, we construct a Bayesian network structure learning instance on\nN1 \u222a N2 by removing nodes outside N1 \u222a N2 and restricting the possible choices of parent\nsets so that $F_v = 2^{N_1}$ for all v \u2208 N2, and Fv = {S \u2286 N1 \u222a N2 : |S \u2229 N2| \u2264 1} for v \u2208 N1.\nBy Lemma 2, any solution for this instance is a DAG with vertex cover N1. Moreover, this\ninstance has 2k nodes, so it can be solved in time $O(k^2 2^{2k})$ using the dynamic programming\nalgorithm of Silander and Myllym\u00e4ki [23].\nStep 2. To construct the periphery C, we compute the value $\\hat{f}_v(N_1) = \\max_{S \\subseteq N_1} f_v(S)$ and select\nthe corresponding best parent set choice Cv for each v \u2209 N1 \u222a N2; this can be done in\n$O(nk2^k)$ time using the dynamic programming algorithm of Ott and Miyano [21].\nStep 3. We check if f(B \u222a C) > f(A\u2217), and replace A\u2217 with B \u222a C if this holds.\nBy Lemma 4(a), all DAGs considered by the algorithm are valid solutions for Bayesian network\nstructure learning with bounded vertex cover number, and by Lemma 4(b), we can find the optimal\nsolution for fixed N1 and N2 by optimising the choice of the core and the periphery separately.\nMoreover, by Lemma 3 each bounded vertex cover DAG is included in the search space, so we are\nguaranteed to find the optimal one. Thus, we have proven our main theorem:\nTheorem 5. 
Bounded vertex cover number Bayesian network structure learning can be solved in\ntime $4^k n^{2k+O(1)}$.\n\n3.2 Lower Bound\n\nAlthough the algorithm presented in the previous section runs in polynomial time in n, the degree of\nthe polynomial depends on the size k of the vertex cover, which poses a serious barrier to practical use\nwhen k grows. Moreover, the algorithm is essentially optimal in the general case, as the input has\nsize $\\Omega\\bigl(n \\binom{n}{k}\\bigr)$ when parent sets of size at most k are allowed. However, in practice one often assumes\nthat a node can have at most, say, 2 or 3 parents. Thus, it makes sense to consider settings where\nthe input is restricted, by e.g. considering instances where parent set size is bounded from above by\nsome constant w while allowing vertex cover number k to be higher. In this case, we might hope to\ndo better, as the input size is not a restricting factor.\nUnfortunately, we show that it is not possible to obtain an algorithm where the degree of the polynomial\ndoes not depend on k even when the maximum parent set size is limited to 2, that is, there is no\nalgorithm with running time g(k) poly(n) for any function g, unless the widely believed complexity\nassumption FPT \u2260 W[1] fails. Specifically, we show that Bayesian network structure learning\nwith bounded vertex cover number is W[1]-hard when restricted to instances with parent set size 2,\nimplying the above claim. For full technical details on complexity classes FPT and W[1] and the\nrelated theory, we refer the reader to standard texts on the topic [11, 13, 20]; for our result, it suffices\nto note that the assumption FPT \u2260 W[1] implies that finding a k-clique in a graph cannot be done\nin time g(k) poly(n) for any function g.\nTheorem 6. Bayesian network structure learning with bounded vertex cover number is W[1]-hard in\nparameter k, even when restricted to instances with maximum parent set size 2.\n\nProof. 
We prove the result by a parameter-preserving reduction from clique, which is known to\nbe W[1]-hard [10]. We use the same reduction strategy as Korhonen and Parviainen [16] use in\nproving that the bounded tree-width version of the problem is NP-hard. That is, given an instance\n(G = (V, E), k) of clique, we construct a new instance of bounded vertex cover number Bayesian\nnetwork structure learning as follows. The node set of the instance is N = V \u222a E. The parent scores\nare defined by setting fe({u, v}) = 1 for each e = {u, v} \u2208 E, and fv(S) = 0 for all other v and S;\nsee Figure 1(b). Finally, the vertex cover size is required to be at most k. Clearly, the new instance\ncan be constructed in polynomial time.\nIt now suffices to show that the original graph G has a clique of size k if and only if the optimal DAG\non N with vertex cover number at most k has score $\\binom{k}{2}$:\n(\u21d2) Assume G has a k-clique C \u2286 V. Let A be a DAG on N obtained by setting Ae = {u, v} for\neach e = {u, v} \u2286 C, and Av = \u2205 for all other nodes v \u2208 N. All edges in the moralised graph MA\nare now clearly covered by C. Furthermore, since C is a clique in G, there are $\\binom{k}{2}$ nodes with a\nnon-empty parent set, giving $f(A) = \\binom{k}{2}$.\n(\u21d0) Assume now that there is a DAG A on N with vertex cover number k and a score $f(A) \\ge \\binom{k}{2}$.\nThere must be at least $\\binom{k}{2}$ nodes e = {u, v} \u2208 E such that Ae = {u, v}, as these are the only nodes\nthat can contribute to a positive score. Each of these triangles Te = {e, u, v} for e = {u, v} must\ncontain at least two nodes from a minimum vertex cover C; without loss of generality, we may\nassume that these nodes are u and v, as e cannot cover any other edges. However, this means that\nC \u2286 V and there are at least $\\binom{k}{2}$ edges e \u2286 C, implying that C must be a k-clique in G.\n\n4 Integer Linear Programming\n\nTo complement the combinatorial algorithm of Section 3.1, we will formulate the bounded vertex\ncover number Bayesian network structure learning problem as an integer linear program (ILP).\nWithout loss of generality, we may assume that nodes are labeled with integers [n].\nAs a basis for the formulation, let zSv be a binary variable that takes value 1 when S is the parent set\nof v and 0 otherwise. The objective function for the ILP is\n\n$\\max \\sum_{v \\in N} \\sum_{S \\in F_v} f_v(S) z_{Sv}$.\n\nTo ensure that the variables zSv encode a valid DAG, we use the standard constraints introduced by\nJaakkola et al. [15] and Cussens [8]:\n\n$\\sum_{S \\in F_v} z_{Sv} = 1 \\quad \\forall v \\in N$ (1)\n$\\sum_{v \\in W} \\sum_{S \\in F_v : S \\cap W = \\emptyset} z_{Sv} \\ge 1 \\quad \\forall W \\subseteq N : |W| \\ge 1$ (2)\n$z_{Sv} \\in \\{0, 1\\} \\quad \\forall v \\in N, S \\in F_v$ (3)\n\nNow it remains to bound the vertex cover number of the moralised graph. We introduce two sets\nof binary variables. The variable yuv takes value 1 if there is an edge between nodes u and v in\nthe moralised graph and 0 otherwise. The variable cu takes value 1 if the node u is a part of the\nvertex cover and 0 otherwise. 
By combining a construction of the moralised graph and a well-known\nformulation for vertex cover, we get the following:\n\n$\\sum_{S \\in F_v : u \\in S} z_{Sv} + \\sum_{T \\in F_u : v \\in T} z_{Tu} - y_{uv} \\le 0 \\quad \\forall u, v \\in N : u < v$ (4)\n$z_{Sv} - y_{uw} \\le 0 \\quad \\forall v \\in N, S \\in F_v : u, w \\in S, u < w$ (5)\n$y_{uv} - c_u - c_v \\le 0 \\quad \\forall u, v \\in N : u < v$ (6)\n$\\sum_{u \\in N} c_u \\le k$ (7)\n$y_{uv}, c_u \\in \\{0, 1\\} \\quad \\forall u, v \\in N$ (8)\n\nThe constraints (4) and (5) guarantee that y-variables encode the moral graph. The constraint (6)\nguarantees that if there is an edge between u and v in the moral graph then either u or v is included\nin the vertex cover. Finally, the constraint (7) bounds the size of the vertex cover.\n\n5 Experiments\n\nWe implemented both the combinatorial algorithm of Section 3.1 and the ILP formulation of Section 4\nto benchmark the practical performance of the algorithms and test how good approximations bounded\nvertex cover DAGs provide. The combinatorial algorithm was implemented in Matlab and is available\nonline\u00b9. The ILPs were implemented using the CPLEX Python API and solved using CPLEX 12. The\nimplementation is available as part of the TWILP software\u00b2.\n\nCombinatorial algorithm. As the worst- and best-case running times of the combinatorial algorithm\nare the same, we tested it with synthetic data sets varying the number of nodes n and the vertex cover\nbound k, limiting each run to at most 24 hours. The results are shown in Figure 2. 
With reasonable\nvertex cover number bounds the polynomial-time algorithm scales only up to about 15 nodes; this is\nmainly due to the fact that, while the running time is polynomial in n, the degree of the polynomial\ndepends on k and when k grows, the algorithm quickly becomes infeasible.\n\n\u00b9 http://research.cs.aalto.fi/pml/software/VCDP/\n\u00b2 http://bitbucket.org/twilp/twilp\n\nFigure 2: Running times of the polynomial-time algorithm. Number of nodes varies from 13 to 16\nand the vertex cover number from 1 to 5. For n = 15 and n = 16 with k = 5, the algorithm did not\nfinish in 24 hours.\n\nInteger linear program. We ran our experiments using a union of the data sets used by Berg et\nal. [2] and those provided at the GOBNILP homepage\u00b3. We benchmarked the results against other\nILP-based algorithms, namely GOBNILP [8] for learning Bayesian networks without any restrictions\non the structure and TWILP [22] for learning bounded tree-width Bayesian networks. In our tests,\neach algorithm was given 4 hours of CPU time. Figure 3 shows results for selected data sets. Due to\nspace reasons, full results are reported in the supplement.\nThe results show that optimal DAGs with moderate vertex cover number (7 for flag, 6 for carpo10000)\ntend to have higher scores than optimal trees. This suggests that often one can trade speed for\naccuracy by moving from trees to bounded vertex cover number DAGs. We also note that bounded\nvertex cover number DAGs are usually learned quickly, typically at least two orders of magnitude\nfaster than bounded tree-width DAGs. However, bounded tree-width DAGs are a less constrained\nclass, and thus in multiple cases the best found bounded tree-width DAG has better score than the\ncorresponding bounded vertex cover number DAG even when the bounded tree-width DAG is not\nproven to be optimal. 
This seems to be the case also if we have mismatching bounds, say, 5 for\ntree-width and 10 for vertex cover number.\nFinally, we notice that ILP easily solves problem instances with, say, 60 nodes and vertex cover bound\n8; see the results for the carpo10000 data set. Thus, in practice ILP scales up to significantly larger data\nsets and vertex cover number bounds than the combinatorial algorithm of Section 3.1. Presumably,\nthis is due to the fact that ILP solvers tend to use heuristics that can quickly prune out provably\nnon-optimal parts of choices for the vertex cover, while the combinatorial algorithm considers them\nall.\n\n6 Discussion\n\nWe have shown that bounded vertex cover number Bayesian networks both allow tractable inference\nand can be learned in polynomial time. The obvious point of comparison is the class of trees, which\nhas the same properties. Structurally these two classes are quite different. In particular, neither is a\nsubclass of the other \u2013 DAGs with vertex cover number k > 1 can contain dense substructures, while\na path of n nodes (which is also a tree) has a vertex cover number \u230an/2\u230b = \u03a9(n).\nIn contrast with trees, bounded vertex cover number Bayesian networks have a densely connected\n\u201ccore\u201d, and each node outside the core is either connected to the core or it has no connections. Thus,\nwe would expect them to perform better than trees when the \u201creal\u201d network has a few dense areas\nand only few connections between nodes outside these areas. On the other hand, bounding the vertex\ncover number bounds the total size of the core area, which can be problematic especially in large\nnetworks when some parts of the network are not represented in the minimum vertex cover.\n\n\u00b3 http://www.cs.york.ac.uk/aig/sw/gobnilp/\n\nFigure 3: Results for selected data sets. 
We report the score for the optimal DAG without structure\nconstraints, and for the optimal DAGs with bounded tree-width and bounded vertex cover when the\nbound k changes, as well as the running time required for finding the optimal DAG in each case. If\nthe computations were not finished at the time limit of 4 hours, we show the score of the best DAG\nfound so far; the shaded area represents the unexplored part of the search space, that is, the upper\nbound of the shaded area is the best score upper bound proven by the ILP solver.\n\nWe also note that bounded vertex cover Bayesian networks have a close connection to naive Bayes\nclassifiers. That is, variables outside a vertex cover are conditionally independent of each other\ngiven the vertex cover. Thus, we can replace the vertex cover by a single variable whose states are a\nCartesian product of the states of the vertex cover variables; this star-shaped network can then be\nviewed as a naive Bayes classifier.\nFinally, we note some open questions related to our current work. From a theoretical perspective,\nwe would like to classify different graph parameters in terms of complexity of learning. Ideally, we\nwould want to have a graph parameter that has a fixed-parameter learning algorithm when we bound\nthe maximum parent set size, circumventing the barrier of Theorem 6. From a practical perspective,\nthere is clearly room for improvement in efficiency of our ILP-based learning algorithm; for instance,\nGOBNILP uses various optimisations beyond the basic ILP encoding to speed up the search.\n\nAcknowledgments\n\nWe thank James Cussens for fruitful discussions. 
This research was partially funded by the Academy\nof Finland (Finnish Centre of Excellence in Computational Inference Research COIN, 251170).\nThe experiments were performed using computing resources within the Aalto University School of\nScience \u201cScience-IT\u201d project.\n\n[Figure 3 panels: score and running time (s) against the bound k for abalone (n = 9), flag (n = 29) and carpo10000 (n = 60); series: no structure constraints, bounded tree-width, bounded vertex cover.]\n\nReferences\n[1] Mark Bartlett and James Cussens. Advances in Bayesian network learning using integer programming. In 29th Conference on Uncertainty in Artificial Intelligence (UAI), 2013.\n[2] Jeremias Berg, Matti J\u00e4rvisalo, and Brandon Malone. Learning Optimal Bounded Treewidth Bayesian Networks via Maximum Satisfiability. In 17th International Conference on Artificial Intelligence and Statistics (AISTATS), 2014.\n[3] David M. Chickering. Learning Bayesian networks is NP-Complete. In Learning from Data: Artificial Intelligence and Statistics V, pages 121\u2013130. Springer-Verlag, 1996.\n[4] David M. Chickering, David Heckerman, and Chris Meek. Large-sample learning of Bayesian networks is NP-Hard. Journal of Machine Learning Research, 5:1287\u20131330, 2004.\n[5] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462\u2013467, 1968.\n[6] Gregory F. Cooper. 
The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42:393\u2013405, 1990.\n[7] Gregory F. Cooper and Edward Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309\u2013347, 1992.\n[8] James Cussens. Bayesian network learning with cutting planes. In 27th Conference on Uncertainty in Artificial Intelligence (UAI), 2011.\n[9] Sanjoy Dasgupta. Learning polytrees. In 15th Conference on Uncertainty in Artificial Intelligence (UAI), 1999.\n[10] Rodney G. Downey and Michael R. Fellows. Parameterized computational feasibility. In Feasible Mathematics II, pages 219\u2013244. Birkhauser, 1994.\n[11] Rodney G. Downey and Michael R. Fellows. Parameterized complexity. Springer-Verlag, 1999.\n[12] Gal Elidan and Stephen Gould. Learning bounded treewidth Bayesian networks. Journal of Machine Learning Research, 9:2699\u20132731, 2008.\n[13] J\u00f6rg Flum and Martin Grohe. Parameterized complexity theory. Springer-Verlag, 2006.\n[14] David Heckerman, Dan Geiger, and David M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197\u2013243, 1995.\n[15] Tommi Jaakkola, David Sontag, Amir Globerson, and Marina Meila. Learning Bayesian network structure using LP relaxations. In 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.\n[16] Janne H. Korhonen and Pekka Parviainen. Learning bounded tree-width Bayesian networks. In 16th International Conference on Artificial Intelligence and Statistics (AISTATS), 2013.\n[17] Johan H. P. Kwisthout, Hans L. Bodlaender, and L. C. van der Gaag. The necessity of bounded treewidth for efficient inference in Bayesian networks. In 19th European Conference on Artificial Intelligence (ECAI), 2010.\n[18] Chris Meek. Finding a path is harder than finding a tree. 
Journal of Artificial Intelligence Research, 15:383\u2013389, 2001.\n[19] Siqi Nie, Denis Deratani Maua, Cassio Polpo de Campos, and Qiang Ji. Advances in Learning Bayesian Networks of Bounded Treewidth. In Advances in Neural Information Processing Systems 27 (NIPS), 2014.\n[20] Rolf Niedermeier. Invitation to fixed-parameter algorithms. Oxford University Press, 2006.\n[21] Sascha Ott and Satoru Miyano. Finding optimal gene networks using biological constraints. Genome Informatics, 14:124\u2013133, 2003.\n[22] Pekka Parviainen, Hossein Shahrabi Farahani, and Jens Lagergren. Learning Bounded Tree-width Bayesian Networks using Integer Linear Programming. In 17th International Conference on Artificial Intelligence and Statistics (AISTATS), 2014.\n[23] Tomi Silander and Petri Myllym\u00e4ki. A simple approach for finding the globally optimal Bayesian network structure. In 22nd Conference on Uncertainty in Artificial Intelligence (UAI), 2006.", "award": [], "sourceid": 431, "authors": [{"given_name": "Janne", "family_name": "Korhonen", "institution": "University of Helsinki"}, {"given_name": "Pekka", "family_name": "Parviainen", "institution": "Aalto University"}]}