{"title": "Learning Bayesian Networks with Thousands of Variables", "book": "Advances in Neural Information Processing Systems", "page_first": 1864, "page_last": 1872, "abstract": "We present a method for learning Bayesian networks from data sets containing thousands of variables without the need for structure constraints. Our approach is made of two parts. The first is a novel algorithm that effectively explores the space of possible parent sets of a node. It guides the exploration towards the most promising parent sets on the basis of an approximated score function that is computed in constant time. The second part is an improvement of an existing ordering-based algorithm for structure optimization. The new algorithm provably achieves a higher score compared to its original formulation. On very large data sets containing up to ten thousand nodes, our novel approach consistently outperforms the state of the art.", "full_text": "Learning Bayesian Networks with Thousands of Variables\n\nMauro Scanagatta\nIDSIA\u2217, SUPSI\u2020, USI\u2021\nLugano, Switzerland\nmauro@idsia.ch\n\nCassio P. de Campos\nQueen\u2019s University Belfast\nNorthern Ireland, UK\nc.decampos@qub.ac.uk\n\nGiorgio Corani\nIDSIA\u2217, SUPSI\u2020, USI\u2021\nLugano, Switzerland\ngiorgio@idsia.ch\n\nMarco Zaffalon\nIDSIA\u2217\nLugano, Switzerland\nzaffalon@idsia.ch\n\nAbstract\n\nWe present a method for learning Bayesian networks from data sets containing\nthousands of variables without the need for structure constraints. Our approach\nis made of two parts. The first is a novel algorithm that effectively explores the\nspace of possible parent sets of a node. It guides the exploration towards the\nmost promising parent sets on the basis of an approximated score function that\nis computed in constant time. The second part is an improvement of an existing\nordering-based algorithm for structure optimization. 
The new algorithm provably\nachieves a higher score compared to its original formulation. Our novel approach\nconsistently outperforms the state of the art on very large data sets.\n\n1 Introduction\nLearning the structure of a Bayesian network from data is NP-hard [2]. We focus on score-based\nlearning, namely finding the structure which maximizes a score that depends on the data [9]. Several\nexact algorithms have been developed based on dynamic programming [12, 17], branch and bound\n[7], linear and integer programming [4, 10], and shortest-path heuristics [19, 20].\nUsually structural learning is accomplished in two steps: parent set identification and structure\noptimization. Parent set identification produces a list of suitable candidate parent sets for each\nvariable. Structure optimization assigns a parent set to each node, maximizing the score of the\nresulting structure without introducing cycles.\nThe problem of parent set identification is unlikely to admit a polynomial-time algorithm with a\ngood quality guarantee [11]. This motivates the development of effective search heuristics. Usually,\nhowever, one decides the maximum in-degree (number of parents per node) k and then simply\ncomputes the score of all parent sets up to that size. At that point one performs structural optimization. An exception is\nthe greedy search of the K2 algorithm [3], which has however been superseded by the more modern\napproaches mentioned above.\nA higher in-degree implies a larger search space and allows achieving a higher score; however it\nalso requires higher computational time. When choosing the in-degree the user makes a trade-off\nbetween these two objectives. 
However, when the number of variables is large, the in-degree is\ngenerally set to a small value, to allow the optimization to be feasible. The largest data set analyzed\nin [1] with the Gobnilp software (footnote 1) contains 413 variables; it is analyzed setting k = 2. In [5] Gobnilp\nis used for structural learning with 1614 variables, setting k = 2. These are among the largest\nexamples of score-based structural learning in the literature.\n\n\u2217Istituto Dalle Molle di studi sull\u2019Intelligenza Artificiale (IDSIA)\n\u2020Scuola universitaria professionale della Svizzera italiana (SUPSI)\n\u2021Universit\u00e0 della Svizzera italiana (USI)\n\nIn this paper we propose an algorithm that performs approximated structure learning with thousands\nof variables without constraints on the in-degree. It consists of a novel approach for parent set\nidentification and a novel approach for structure optimization.\nAs for parent set identification we propose an anytime algorithm that effectively explores the space\nof possible parent sets. It guides the exploration towards the most promising parent sets, exploiting\nan approximated score function that is computed in constant time. As for structure optimization,\nwe extend the ordering-based algorithm of [18], which provides an effective approach for model\nselection with reduced computational cost. Our algorithm is guaranteed to find a solution better than\nor equal to that of [18].\nWe test our approach on data sets containing up to ten thousand variables. As a performance indicator\nwe consider the score of the network found. Our parent set identification approach consistently\noutperforms the usual approach of setting the maximum in-degree and then computing the score of\nall parent sets. Our structure optimization approach outperforms Gobnilp when learning with\n500 or more nodes. All the software and data sets used in the experiments are available online. 
2.\n2 Structure Learning of Bayesian Networks\nConsider the problem of learning the structure of a Bayesian Network from a complete data set of N\ninstances D = {D1, ..., DN}. The set of n categorical random variables is X = {X1, ..., Xn}. The\ngoal is to find the best DAG G = (V,E), where V is the collection of nodes and E is the collection\nof arcs. E can be defined as the set of parents \u03a01, ..., \u03a0n of each variable. Different scores can be\nused to assess the fit of a DAG. We adopt the BIC, which asymptotically approximates the posterior\nprobability of the DAG. The BIC score is decomposable, namely it is constituted by the sum of the\nscores of the individual variables:\n\nBIC(G) = \u2211_{i=1}^{n} BIC(Xi, \u03a0i) = \u2211_{i=1}^{n} [ \u2211_{\u03c0\u2208|\u03a0i|} \u2211_{x\u2208|Xi|} Nx,\u03c0 log \u02c6\u03b8x|\u03c0 \u2212 (log N / 2) (|Xi| \u2212 1)(|\u03a0i|) ] ,\n\nwhere \u02c6\u03b8x|\u03c0 is the maximum likelihood estimate of the conditional probability P (Xi = x|\u03a0i = \u03c0),\nNx,\u03c0 represents the number of times (Xi = x \u2227 \u03a0i = \u03c0) appears in the data set, and |\u00b7| indicates\nthe size of the Cartesian product space of the variables given as arguments (instead of the number of\nvariables) such that |Xi| is the number of states of Xi and |\u2205| = 1.\nExploiting decomposability, we first identify independently for each variable a list of candidate\nparent sets (parent set identification). Then by structure optimization we select for each node the\nparent set that yields the highest score without introducing cycles.\n3 Parent set identification\nFor parent set identification usually one explores all the possible parent sets, whose number however\nincreases as O(n^k), where k denotes the maximum in-degree. 
Pruning rules [7] do not considerably\nreduce the size of this space.\nUsually the parent sets are explored in sequential order: first all the parent sets of size one, then all\nthe parent sets of size two, and so on, up to size k. We refer to this approach as sequential ordering.\nIf the solver adopted for structural optimization is exact, this strategy allows finding the globally\noptimal graph given the chosen value of k. In order to deal with a large number of variables it\nis however necessary to set a low in-degree k. For instance [1] adopts k = 2 when dealing with\nthe largest data set (diabetes), which contains 413 variables. In [5] Gobnilp is used for structural\nlearning with 1614 variables, again setting k = 2. A higher value of k would make the structural\nlearning not feasible. Yet a low k implies dropping all the parent sets with size larger than k. Some\nof them possibly have a high score.\n\n1 http://www.cs.york.ac.uk/aig/sw/gobnilp/\n2 http://blip.idsia.ch\n\nIn [18] it is proposed to adopt the subset \u03a0corr of the variables most correlated with the child\nvariable. Then [18] considers only parent sets which are subsets of \u03a0corr. However this approach\nis not commonly adopted, possibly because it requires specifying the size of \u03a0corr. Indeed [18]\nacknowledges the need for further innovative approaches in order to effectively explore the space of\nthe parent sets.\nWe propose two anytime algorithms to address this problem. The first is the simplest; we call it\ngreedy selection. It starts by exploring all the parent sets of size one and adding them to a list. Then\nit repeats the following until time is expired: it pops the best scoring parent set \u03a0 from the list, explores\nall the supersets obtained by adding one variable to \u03a0, and adds them to the list. Note that in general\nthe parent sets chosen at two consecutive steps are not related to each other. 
The second approach\n(independence selection) adopts a more sophisticated strategy, as explained in the following.\n3.1 Parent set identification by independence selection\nIndependence selection uses an approximation of the actual BIC score of a parent set \u03a0, which we\ndenote as BIC\u2217, to guide the exploration of the space of the parent sets. The BIC\u2217 of a parent set\nconstituted by the union of two non-empty parent sets \u03a01 and \u03a02 is defined as follows:\n\nBIC\u2217(X, \u03a01, \u03a02) = BIC(X, \u03a01) + BIC(X, \u03a02) + inter(X, \u03a01, \u03a02) ,   (1)\n\nwith \u03a01 \u222a \u03a02 = \u03a0 and inter(X, \u03a01, \u03a02) = (log N / 2) (|X| \u2212 1)(|\u03a01| + |\u03a02| \u2212 |\u03a01||\u03a02| \u2212 1) \u2212 BIC(X, \u2205).\nIf we already know BIC(X, \u03a01) and BIC(X, \u03a02) from previous calculations (and we know\nBIC(X, \u2205)), then BIC\u2217 can be computed in constant time (with respect to data accesses). We\nthus exploit BIC\u2217 to quickly estimate the score of a large number of candidate parent sets and to\ndecide the order to explore them.\nWe provide a bound for the difference between BIC\u2217(X, \u03a01, \u03a02) and BIC(X, \u03a01 \u222a \u03a02). To this\nend, we denote by ii the Interaction Information [14]: ii(X; Y ; Z) = I(X; Y |Z) \u2212 I(X; Y ), namely\nthe difference between the mutual information of X and Y conditional on Z and the unconditional\nmutual information of X and Y .\nTheorem 1. Let X be a node of G and \u03a0 = \u03a01 \u222a \u03a02 be a parent set for X with \u03a01 \u2229 \u03a02 = \u2205\nand \u03a01, \u03a02 non-empty. Then BIC(X, \u03a0) = BIC\u2217(X, \u03a01, \u03a02) + N \u00b7 ii(\u03a01; \u03a02; X), where ii is the\nInteraction Information estimated from data.\nProof. 
BIC(X, \u03a01 \u222a \u03a02) \u2212 BIC\u2217(X, \u03a01, \u03a02)\n= BIC(X, \u03a01 \u222a \u03a02) \u2212 BIC(X, \u03a01) \u2212 BIC(X, \u03a02) \u2212 inter(X, \u03a01, \u03a02)\n= \u2211_{x,\u03c01,\u03c02} Nx,\u03c01,\u03c02 [ log \u02c6\u03b8x|\u03c01,\u03c02 \u2212 log(\u02c6\u03b8x|\u03c01 \u02c6\u03b8x|\u03c02) ] + \u2211_{x} Nx log \u02c6\u03b8x\n= \u2211_{x,\u03c01,\u03c02} Nx,\u03c01,\u03c02 [ log \u02c6\u03b8x|\u03c01,\u03c02 \u2212 log(\u02c6\u03b8x|\u03c01 \u02c6\u03b8x|\u03c02 / \u02c6\u03b8x) ]\n= \u2211_{x,\u03c01,\u03c02} Nx,\u03c01,\u03c02 log( \u02c6\u03b8x|\u03c01,\u03c02 \u02c6\u03b8x / (\u02c6\u03b8x|\u03c01 \u02c6\u03b8x|\u03c02) )\n= \u2211_{x,\u03c01,\u03c02} N \u00b7 \u02c6\u03b8x,\u03c01,\u03c02 log( \u02c6\u03b8\u03c01,\u03c02|x \u02c6\u03b8\u03c01 \u02c6\u03b8\u03c02 / (\u02c6\u03b8\u03c01|x \u02c6\u03b8\u03c02|x \u02c6\u03b8\u03c01,\u03c02) )\n= N ( \u2211_{x,\u03c01,\u03c02} \u02c6\u03b8x,\u03c01,\u03c02 log( \u02c6\u03b8\u03c01,\u03c02|x / (\u02c6\u03b8\u03c01|x \u02c6\u03b8\u03c02|x) ) \u2212 \u2211_{\u03c01,\u03c02} \u02c6\u03b8\u03c01,\u03c02 log( \u02c6\u03b8\u03c01,\u03c02 / (\u02c6\u03b8\u03c01 \u02c6\u03b8\u03c02) ) )\n= N \u00b7 (I(\u03a01; \u03a02|X) \u2212 I(\u03a01; \u03a02)) = N \u00b7 ii(\u03a01; \u03a02; X) ,\n\nwhere I(\u00b7) denotes the (conditional) mutual information estimated from data.\nCorollary 1. Let X be a node of G, and \u03a0 = \u03a01 \u222a \u03a02 be a parent set of X such that \u03a01 \u2229 \u03a02 = \u2205\nand \u03a01, \u03a02 non-empty. Then\n\n|BIC(X, \u03a0) \u2212 BIC\u2217(X, \u03a01, \u03a02)| \u2264 N min{H(X), H(\u03a01), H(\u03a02)} .\n\nProof. Theorem 1 states that BIC(X, \u03a0) = BIC\u2217(X, \u03a01, \u03a02) + N \u00b7 ii(\u03a01; \u03a02; X). 
We now devise\nbounds for interaction information, recalling that mutual information and conditional mutual information\nare always non-negative and are bounded above by the smallest entropy H of their\narguments: \u2212H(\u03a02) \u2264 \u2212I(\u03a01; \u03a02) \u2264 ii(\u03a01; \u03a02; X) \u2264 I(\u03a01; \u03a02|X) \u2264 H(\u03a02). The corollary is\nproven by simply permuting the roles of \u03a01, \u03a02 and X in the ii of this chain of inequalities. Since\nii(\u03a01; \u03a02; X) = I(\u03a01; \u03a02|X) \u2212 I(\u03a01; \u03a02) = I(X; \u03a01|\u03a02) \u2212 I(X; \u03a01) = I(\u03a02; X|\u03a01) \u2212 I(\u03a02; X) ,\nthe bounds for ii are valid.\nWe know that 0 \u2264 H(\u03a0) \u2264 log(|\u03a0|) for any set of nodes \u03a0, hence the result of Corollary 1 could\nbe further manipulated to achieve a bound for the difference between BIC and BIC\u2217 of at most\nN log(min{|X|, |\u03a01|, |\u03a02|}). However, Corollary 1 is stronger and can still be computed efficiently\nas follows. When computing BIC\u2217(X, \u03a01, \u03a02), we assumed that BIC(X, \u03a01) and BIC(X, \u03a02) had\nbeen precomputed. As such, we can also have precomputed the values H(\u03a01) and H(\u03a02) at the\nsame time as the BIC scores were computed, without any significant increase of complexity (when\ncomputing BIC(X, \u03a0) for a given \u03a0, just use the same loop over the data to compute H(\u03a0)).\nCorollary 2. Let X be a node of G, and \u03a0 = \u03a01 \u222a \u03a02 be a parent set for that node with \u03a01 \u2229 \u03a02 = \u2205\nand \u03a01, \u03a02 non-empty. If \u03a01 \u22a5\u22a5 \u03a02, then BIC(X, \u03a01 \u222a \u03a02) \u2265 BIC\u2217(X, \u03a01, \u03a02). If \u03a01 \u22a5\u22a5 \u03a02 | X,\nthen BIC(X, \u03a01 \u222a \u03a02) \u2264 BIC\u2217(X, \u03a01, \u03a02). If the interaction information ii(\u03a01; \u03a02; X) = 0,\nthen BIC(X, \u03a01 \u222a \u03a02) = BIC\u2217(X, \u03a01, \u03a02).\nProof. 
It follows from Theorem 1 considering that mutual information I(\u03a01; \u03a02) = 0 if \u03a01 and \u03a02\nare independent, while I(\u03a01; \u03a02|X) = 0 if \u03a01 and \u03a02 are conditionally independent given X.\nWe now devise a novel pruning strategy for BIC based on the bounds of Corollaries 1 and 2.\nTheorem 2. Let X be a node of G, and \u03a0 = \u03a01 \u222a \u03a02 be a parent set for that node with \u03a01 \u2229\n\u03a02 = \u2205 and \u03a01, \u03a02 non-empty. Let \u03a0\u2032 \u2283 \u03a0. If BIC\u2217(X, \u03a01, \u03a02) + (log N / 2) (|X| \u2212 1)|\u03a0\u2032| >\nN min{H(X), H(\u03a01), H(\u03a02)}, then \u03a0\u2032 and its supersets are not optimal and can be ignored.\nProof. BIC\u2217(X, \u03a01, \u03a02) \u2212 N min{H(X), H(\u03a01), H(\u03a02)} + (log N / 2) (|X| \u2212 1)|\u03a0\u2032| > 0 implies\nBIC(X, \u03a0) + (log N / 2) (|X| \u2212 1)|\u03a0\u2032| > 0, and Theorem 4 of [6] prunes \u03a0\u2032 and all its supersets.\nThus we can efficiently check whether large parts of the search space can be discarded based on\nthese results. We note that Corollary 1 and hence Theorem 2 are very generic in the choice of \u03a01\nand \u03a02, even though usually one of them is taken as a singleton.\n3.2 Independence selection algorithm\nWe now describe the algorithm that exploits the BIC\u2217 score in order to effectively explore the space\nof the parent sets. It uses two lists: (1) open: a list for the parent sets to be explored, ordered by their\nBIC\u2217 score; (2) closed: a list of already explored parent sets, along with their actual BIC score.\nThe algorithm starts with the BIC of the empty set computed. First it explores all the parent sets of\nsize one and saves their BIC score in the closed list. Then it adds to the open list every parent set of\nsize two, computing their BIC\u2217 scores in constant time on the basis of the scores available from the\nclosed list. 
It then proceeds as follows until all elements in open have been processed, or the time is\nexpired. It extracts from open the parent set \u03a0 with the best BIC\u2217 score; it computes its BIC score\nand adds it to the closed list. It then looks for all the possible expansions of \u03a0 obtained by adding\na single variable Y , such that \u03a0 \u222a Y is not present in open or closed. It adds them to open with\ntheir BIC\u2217(X, \u03a0, Y ) scores. Finally it also considers all the explored subsets of \u03a0. It safely [7]\nprunes \u03a0 if any of its subsets yields a higher BIC score than \u03a0. The algorithm returns the content of\nthe closed list, pruned and ordered by the BIC score. This list becomes the content of the so-called\ncache of scores for X. The procedure is repeated for every variable and can be easily parallelized.\nFigure 1 compares sequential ordering and independence selection. It shows that independence\nselection is more effective than sequential ordering because it biases the search towards the\nhighest-scoring parent sets.\n\n[Figure 1, two panels: (a) Sequential ordering; (b) Indep. selection ordering. x-axis: Iteration; y-axis: BIC.]\nFigure 1: Exploration of the parent sets space for a given variable performed by sequential ordering\nand independence selection. Each point refers to a distinct parent set.\n\n4 Structure optimization\nThe goal of structure optimization is to choose the overall highest scoring parent sets (measured\nby the sum of the local scores) without introducing directed cycles in the graph. We start from the\napproach proposed in [18] (which we call ordering-based search or OBS), which exploits the fact\nthat the optimal network can be found in time O(Ck), where C = \u2211_{i=1}^{n} ci and ci is the number of\nelements in the cache of scores of Xi, if an ordering over the variables is given (footnote 3). Time \u0398(k) is needed to\ncheck whether all the variables in a parent set for X come before X in the ordering (a simple array\ncan be used as data structure for this checking). This implies working on the search space of the\npossible orderings, which is convenient as it is smaller than the space of network structures. Multiple\norderings are sampled and evaluated (different techniques can be used for guiding the sampling). For\neach sampled total ordering \u227a over variables X1, . . . , Xn, the network is consistent with the order\nif \u2200Xi : \u2200X \u2208 \u03a0i : X \u227a Xi. A network consistent with a given ordering automatically satisfies\nthe acyclicity constraint. This allows us to choose independently the best parent set of each node.\nMoreover, for a given total ordering V1, . . . , Vn of the variables, the algorithm tries to improve the\nnetwork by a greedy search swapping procedure: if there is a pair Vj, Vj+1 such that the swapped\nordering with Vj in place of Vj+1 (and vice versa) yields better score for the network, then these\nnodes are swapped and the search continues. One advantage of this swapping over extra random\norderings is that searching for it and updating the network (if a good swap is found) only takes time\nO((cj + cj+1) \u00b7 kn) (which can be sped up as cj is only inspected for parent sets containing Vj+1,\nand cj+1 is only processed if Vj+1 has Vj as parent in the current network), while a new sampled\nordering would take O(n + Ck) (the swapping approach is usually favourable if ci is \u2126(n), which is\na plausible assumption). 
We emphasize that k is used here solely for the purpose of analyzing\nthe complexity of the methods, since our parent set identification approach does not rely on a fixed\nvalue for k.\nHowever, the consistency rule of OBS is quite restrictive. While it surely refuses all cyclic structures,\nit also rules out some acyclic ones which could be captured by interpreting the ordering in a slightly\ndifferent manner. We propose a novel consistency rule for a given ordering which processes the\nnodes in V1, . . . , Vn from Vn to V1 (OBS can do it in any order, as the local parent sets can be\nchosen independently) and we define the parent set of Vj such that it does not introduce a cycle in\nthe current partial network. This allows back-arcs in the ordering from a node Vj to its successors, as\nlong as this does not introduce a cycle. We call this idea acyclic selection OBS (or simply ASOBS).\nBecause we need to check for cycles at each step of constructing the network for a given ordering, at\nfirst glance the algorithm seems to be slower (time complexity of O(Cn) against O(Ck) for OBS;\nnote this difference is only relevant as we intend to work with large values of n). Surprisingly, we can\nimplement it in the same overall time complexity of O(Ck) as follows.\n\n1. Build and keep a Boolean square matrix m to mark which are the descendants of nodes\n(m(X, Y ) tells whether Y is a descendant of X). Initialize all its entries to false.\n\n2. For each node Vj in the order, with j = n, . . . 
, 1:\n\n(a) Go through the parent sets and pick the best scoring one for which all contained parents\nare not descendants of Vj (this takes time O(cik) if parent sets are kept as lists).\n(b) Build a todo list with the descendants of Vj from the matrix representation and associate\nan empty todo list to all ancestors of Vj.\n(c) Start the todo lists of the parents of Vj with the descendants of Vj.\n(d) For each ancestor X of Vj (ancestors will be iteratively visited by following a depth-first\ngraph search procedure using the network built so far; we process a node after\nits children with non-empty todo lists have been already processed; the search stops\nwhen all ancestors are visited):\n\ni. For each element Y in the todo list of X, if m(X, Y ) is true, then ignore Y and\nmove on; otherwise set m(X, Y ) to true and add Y to the todo of parents of X.\n\n3 O(\u00b7), \u2126(\u00b7) and \u0398(\u00b7) shall be understood as usual asymptotic notation functions.\n\nLet us analyze the complexity of the method. Step 2a takes overall time O(Ck) (already considering\nthe outer loop). Step 2b takes overall time O(n^2) (already considering the outer loop). Steps 2c\nand 2(d)i will be analyzed based on the number of elements on the todo lists and the time to process\nthem in an amortized way. Note that the time complexity is directly related to the number of elements\nthat are processed from the todo lists (we can simply look at the moment they leave a list, as\ntheir inclusion in the lists will be in equal number). We will now count the number of times we\nprocess an element from a todo list. 
This number is overall bounded (over all external loop cycles)\nby the number of times we can make a cell of matrix m turn from false to true (which is O(n^2)) plus\nthe number of times we ignore an element because the matrix cell was already set to true (which is\nat most O(n) for each Vj, as this is the maximum number of descendants of Vj and each of them\ncan fall into this category only once, so again there are O(n^2) times in total). In other words, each\nelement being removed from a todo list is either ignored (matrix already set to true) or an entry\nin the matrix of descendants is changed from false to true, and this can only happen O(n^2) times.\nHence the total time complexity is O(Ck + n^2), which is O(Ck) for any C greater than n^2/k (a\nvery plausible scenario, as each local cache of a variable usually has more than n/k elements).\nMoreover, we have the following interesting properties of this new method.\nTheorem 3. For a given ordering \u227a, the network obtained by ASOBS has score greater than or\nequal to that obtained by OBS.\nProof. It follows immediately from the fact that the consistency rule of ASOBS generalizes that of\nOBS, that is, for each node Vj with j = n, . . . , 1, ASOBS allows all parent sets allowed by OBS\nand also others (containing back-arcs).\nTheorem 4. For a given ordering \u227a defined by V1, . . . , Vn and a current graph G consistent with\n\u227a, if the OBS consistency rule allows the swapping of Vj, Vj+1 and leads to improving the score of G,\nthen the consistency rule of ASOBS allows the same swapping and achieves the same improvement\nin score.\nProof. 
It follows immediately from the fact that the consistency rule of ASOBS generalizes that of\nOBS, so from a given graph G, if a swapping is possible under OBS rules, then it is also possible\nunder ASOBS rules.\n\n5 Experiments\nWe compare three different approaches for parent set identification (sequential, greedy selection and\nindependence selection) and three different approaches (Gobnilp, OBS and ASOBS) for structure\noptimization. This yields nine different approaches for structural learning, obtained by combining\nall the methods for parent set identification and structure optimization. Note that OBS has been\nshown in [18] to outperform other greedy-tabu searches over structures, such as greedy hill-climbing\nand optimal-reinsertion-search methods [15].\nWe allow one minute per variable to each approach for parent set identification. We set the maximum\nin-degree to k = 6, a high value that allows learning even complex structures. Notice that our novel\napproach does not need a maximum in-degree. We set a maximum in-degree to put our approach\nand its competitors on the same ground. Once the scores of the parent sets have been computed, we run each\nsolver (Gobnilp, OBS, ASOBS) for 24 hours. For a given data set the computation is performed on\nthe same machine.\nThe explicit goal of each approach for both parent set identification and structure optimization is\nto maximize the BIC score. We then measure the BIC score of the Bayesian networks eventually\nobtained as performance indicator. The difference in the BIC score between two alternative networks\nis an asymptotic approximation of the logarithm of the Bayes factor. The Bayes factor is the ratio\nof the posterior probabilities of two competing models. Let us denote by \u2206BIC1,2 = BIC1 \u2212 BIC2 the\ndifference between the BIC score of network 1 and network 2. 
Positive values of \u2206BIC1,2 imply\nevidence in favor of network 1. The evidence in favor of network 1 is respectively [16] {weak,\npositive, strong, very strong} if \u2206BIC1,2 is between {0 and 2; 2 and 6; 6 and 10; beyond 10}.\n5.1 Learning from data sets\n\nWe consider 16 data sets already used in the literature of structure learning, first introduced in [13]\nand [8]. We randomly split each data set into three subsets of instances. This yields 48 data sets.\n\nData set | n | Data set | n | Data set | n | Data set | n\nAudio | 100 | Retail | 135 | MSWeb | 294 | Reuters-52 | 889\nJester | 100 | Pumsb-star | 163 | Book | 500 | C20NG | 910\nNetflix | 100 | DNA | 180 | EachMovie | 500 | BBC | 1058\nAccidents | 111 | Kosarek | 190 | WebKB | 839 | Ad | 1556\n\nTable 1: Data sets sorted according to the number n of variables.\n\nThe approaches for parent set identification are compared in Table 2. For each fixed structure\noptimization approach, we learn the network starting from the list of parent sets computed by\nindependence selection (IS), greedy selection (GS) and sequential selection (SQ). In turn we analyze\n\u2206BICIS,GS and \u2206BICIS,SQ. A positive \u2206BIC means that independence selection yields a network\nwith higher BIC score than the network obtained using an alternative approach for parent set\nidentification; vice versa for negative values of \u2206BIC. In most cases (see Table 2) \u2206BIC > 10, implying\nvery strong support for the network learned using independence selection. We further analyze the\nresults through a sign test. The null hypothesis of the test is that the BIC score of the network\nlearned under independence selection is smaller than or equivalent to the BIC score of the network\nlearned using the alternative approach (greedy selection or sequential selection depending on the\ncase). 
If a data set yields a \u2206BIC which is {very negative, strongly negative, negative, neutral}, it\nsupports the null hypothesis. If a data set yields a \u2206BIC which is {positive, strongly positive,\nvery positive}, it supports the alternative hypothesis. Under any fixed structure solver, the sign\ntest rejects the null hypothesis, providing significant evidence in favor of independence selection.\nIn the following, when we cite the sign test again we refer to the same type of analysis: the sign test\nanalyzes the counts of the \u2206BIC which are in favor of and against a given method.\nAs for structure optimization, ASOBS achieves higher BIC score than OBS in all the 48 data sets,\nunder every chosen approach for parent set identification. These results confirm the improvement of\nASOBS over OBS, theoretically proven in Section 4. In most cases the \u2206BIC in favor of ASOBS\nis larger than 10. The difference in favor of ASOBS is significant (sign test, p < 0.01) under every\nchosen approach for parent set identification.\nWe now compare ASOBS and Gobnilp. On the smaller data sets (27 data sets with n < 500),\nGobnilp significantly outperforms (sign test, p < 0.01) ASOBS under every chosen approach for\nparent set identification. On most of such data sets, the \u2206BIC in favor of the network learned by\nGobnilp is larger than 10. 
This outcome is expected, as Gobnilp is an exact solver and those data\nsets imply a relatively reduced search space. However the focus of this paper is on large data sets.\nOn the 21 data sets with n \u2265 500, ASOBS outperforms Gobnilp (sign test, p < 0.01) under every\nchosen approach for parent set identification (Table 3).\n\n\u2206BIC (K) | Gobnilp: IS vs GS | Gobnilp: IS vs SQ | ASOBS: IS vs GS | ASOBS: IS vs SQ | OBS: IS vs GS | OBS: IS vs SQ\nVery positive (K > 10) | 44 | 38 | 44 | 30 | 44 | 32\nStrongly positive (6 < K < 10) | 0 | 0 | 0 | 4 | 1 | 0\nPositive (2 < K < 6) | 0 | 4 | 2 | 3 | 0 | 2\nNeutral (-2 < K < 2) | 2 | 3 | 0 | 4 | 2 | 4\nNegative (-6 < K < -2) | 0 | 1 | 2 | 1 | 0 | 2\nStrongly negative (-10 < K < -6) | 1 | 1 | 0 | 5 | 0 | 4\nVery negative (K < -10) | 1 | 1 | 0 | 1 | 1 | 4\np-value | <0.01 | <0.01 | <0.01 | <0.01 | <0.01 | <0.01\n\nTable 2: Comparison of the approaches for parent set identification on 48 data sets. Given any fixed\nsolver for structural optimization, IS results in significantly higher BIC scores than both GS and SQ.\n\n\u2206BIC (K) | Independence sel.: AS vs GP | Independence sel.: AS vs OB | Forward sel.: AS vs GP | Forward sel.: AS vs OB | Sequential sel.: AS vs GP | Sequential sel.: AS vs OB\nVery positive (K > 10) | 21 | 21 | 20 | 21 | 19 | 21\nStrongly positive (6 < K < 10) | 0 | 0 | 0 | 0 | 0 | 0\nPositive (2 < K < 6) | 0 | 0 | 0 | 0 | 0 | 0\nNeutral (-2 < K < 2) | 0 | 0 | 0 | 0 | 0 | 0\nNegative (-6 < K < -2) | 0 | 0 | 0 | 0 | 0 | 0\nStrongly negative (-10 < K < -6) | 0 | 0 | 0 | 0 | 0 | 0\nVery negative (K < -10) | 0 | 0 | 1 | 0 | 2 | 0\np-value | <0.01 | <0.01 | <0.01 | <0.01 | <0.01 | <0.01\n\nTable 3: Comparison between the structure optimization approaches on the 21 data sets with n \u2265\n500. ASOBS (AS) outperforms both Gobnilp (GP) and OBS (OB), under any chosen approach for\nparent set identification.\n\n5.2 Learning from data sets sampled from known networks\n\nIn the next experiments we create data sets by sampling from known networks. 
We take the largest\nnetworks available in the literature (see footnote 4): andes (n=223), diabetes (n=413), pigs (n=441),\nlink (n=724), munin (n=1041). Additionally we randomly generate 15 further networks: five networks of size\n2000, five networks of size 4000, five networks of size 10000. Each variable has a number of states\nrandomly drawn from 2 to 4 and a number of parents randomly drawn from 0 to 6. Overall we\nconsider 20 networks. From each network we sample a data set of 5000 instances.\nWe perform experiments and analysis as in the previous section. For the sake of brevity we do not\nadd further tables of results. As for parent set identification, independence selection outperforms\nboth greedy selection and sequential selection. The difference in favor of independence selection\nis significant (sign test, p-value <0.01) under every chosen structure optimization approach. The\n\u2206BIC of the learned network is >10 in most cases. Take for instance Gobnilp for structure\noptimization. Then independence selection yields a \u2206BIC>10 in 18/20 cases when compared to GS\nand \u2206BIC>10 in 19/20 cases when compared to SQ. Similar results are obtained using the other\nsolvers for structure optimization.\nStrong results also support ASOBS against OBS and Gobnilp. Under every approach for parent set\nidentification, \u2206BIC>10 is obtained in 20/20 cases when comparing ASOBS and OBS. The number\nof cases in which ASOBS obtains \u2206BIC>10 when compared against Gobnilp ranges between 17/20\nand 19/20 depending on the approach adopted for parent set selection. The superiority of ASOBS\nover both OBS and Gobnilp is significant (sign test, p < 0.01) under every approach for parent set\nidentification.\nMoreover, we measured the Hamming distance between the moralized true structure and the learned\nstructure. On the 21 data sets with n \u2265 500 ASOBS outperforms Gobnilp and OBS and IS\noutperforms GS and SQ (sign test, p < 0.01). 
The novel framework is thus superior in terms of both score\nand correctness of the retrieved structure.\n6 Conclusion and future work\nOur novel approximated approach for structural learning of Bayesian Networks scales up to thousands\nof nodes without constraints on the maximum in-degree. The current results refer to the BIC\nscore, but in future the methodology could be extended to other scoring functions.\nAcknowledgments\n\nWork partially supported by the Swiss NSF grant n. 200021 146606 / 1.\n\n4 http://www.bnlearn.com/bnrepository/\n\nReferences\n[1] M. Bartlett and J. Cussens. Integer linear programming for the Bayesian network structure\nlearning problem. Artificial Intelligence, 2015. In press.\n\n[2] D. M. Chickering, C. Meek, and D. Heckerman. Large-sample learning of Bayesian networks\nis hard. In Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence, UAI-03,\npages 124\u2013133. Morgan Kaufmann, 2003.\n\n[3] G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks\nfrom data. Machine Learning, 9(4):309\u2013347, 1992.\n\n[4] J. Cussens. Bayesian network learning with cutting planes. In Proceedings of the 27th\nAnnual Conference on Uncertainty in Artificial Intelligence, UAI-11, pages 153\u2013160.\nAUAI Press, 2011.\n\n[5] J. Cussens, B. Malone, and C. Yuan. IJCAI 2013 tutorial on optimal algorithms for learning\nBayesian networks (https://sites.google.com/site/ijcai2013bns/slides), 2013.\n\n[6] C. P. de Campos and Q. Ji. Efficient structure learning of Bayesian networks using constraints.\nJournal of Machine Learning Research, 12:663\u2013689, 2011.\n\n[7] C. P. de Campos, Z. Zeng, and Q. Ji. Structure learning of Bayesian networks using constraints.\nIn Proceedings of the 26th Annual International Conference on Machine Learning, ICML-09,\npages 113\u2013120, 2009.\n\n[8] J. V. Haaren and J. Davis. 
Markov network structure learning: A randomized feature generation\napproach. In Proceedings of the 26th AAAI Conference on Artificial Intelligence, 2012.\n\n[9] D. Heckerman, D. Geiger, and D.M. Chickering. Learning Bayesian networks: The combination\nof knowledge and statistical data. Machine Learning, 20:197\u2013243, 1995.\n\n[10] T. Jaakkola, D. Sontag, A. Globerson, and M. Meila. Learning Bayesian Network Structure\nusing LP Relaxations. In Proceedings of the 13th International Conference on Artificial Intelligence\nand Statistics, AISTATS-10, pages 358\u2013365, 2010.\n\n[11] M. Koivisto. Parent assignment is hard for the MDL, AIC, and NML costs. In Proceedings of\nthe 19th Annual Conference on Learning Theory, pages 289\u2013303. Springer-Verlag, 2006.\n\n[12] M. Koivisto and K. Sood. Exact Bayesian Structure Discovery in Bayesian Networks. Journal\nof Machine Learning Research, 5:549\u2013573, 2004.\n\n[13] D. Lowd and J. Davis. Learning Markov network structure with decision trees. In Geoffrey I.\nWebb, Bing Liu, Chengqi Zhang, Dimitrios Gunopulos, and Xindong Wu, editors, Proceedings\nof the 10th Int. Conference on Data Mining (ICDM2010), pages 334\u2013343, 2010.\n\n[14] W. J. McGill. Multivariate information transmission. Psychometrika, 19(2):97\u2013116, 1954.\n\n[15] A. Moore and W. Wong. Optimal reinsertion: A new search operator for accelerated and\nmore accurate Bayesian network structure learning. In T. Fawcett and N. Mishra, editors,\nProceedings of the 20th International Conference on Machine Learning, ICML-03, pages 552\u2013559,\nMenlo Park, California, August 2003. AAAI Press.\n\n[16] A. E. Raftery. Bayesian model selection in social research. Sociological Methodology,\n25:111\u2013164, 1995.\n\n[17] T. Silander and P. Myllymaki. A simple approach for finding the globally optimal Bayesian\nnetwork structure. 
In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence,\nUAI-06, pages 445\u2013452, 2006.\n\n[18] M. Teyssier and D. Koller. Ordering-based search: A simple and effective algorithm for learning\nBayesian networks. In Proceedings of the 21st Conference on Uncertainty in Artificial\nIntelligence, UAI-05, pages 584\u2013590, 2005.\n\n[19] C. Yuan and B. Malone. An improved admissible heuristic for learning optimal Bayesian\nnetworks. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence,\nUAI-12, 2012.\n\n[20] C. Yuan and B. Malone. Learning optimal Bayesian networks: A shortest path perspective.\nJournal of Artificial Intelligence Research, 48:23\u201365, 2013.\n", "award": [], "sourceid": 1150, "authors": [{"given_name": "Mauro", "family_name": "Scanagatta", "institution": "IDSIA"}, {"given_name": "Cassio", "family_name": "de Campos", "institution": "Queen's University Belfast"}, {"given_name": "Giorgio", "family_name": "Corani", "institution": "IDSIA"}, {"given_name": "Marco", "family_name": "Zaffalon", "institution": "IDSIA"}]}