{"title": "Learning Bayesian Belief Networks with Neural Network Estimators", "book": "Advances in Neural Information Processing Systems", "page_first": 578, "page_last": 584, "abstract": null, "full_text": "Learning Bayesian belief networks with \n\nneural network estimators \n\nStefano Monti* \n\n*Intelligent Systems Program \n\nUniversity of Pittsburgh \n\nGregory F. Cooper*'''' \n\n\"Center for Biomedical Informatics \n\nUniversity of Pittsburgh \n\n901M CL, Pittsburgh, PA - 15260 \n\n8084 Forbes Tower, Pittsburgh, PA - 15261 \n\nsmonti~isp.pitt.edu \n\ngfc~cbmi.upmc.edu \n\nAbstract \n\nIn this paper we propose a method for learning Bayesian belief \nnetworks from data. The method uses artificial neural networks \nas probability estimators, thus avoiding the need for making prior \nassumptions on the nature of the probability distributions govern(cid:173)\ning the relationships among the participating variables. This new \nmethod has the potential for being applied to domains containing \nboth discrete and continuous variables arbitrarily distributed. We \ncompare the learning performance of this new method with the \nperformance of the method proposed by Cooper and Herskovits \nin [7]. The experimental results show that, although the learning \nscheme based on the use of ANN estimators is slower, the learning \naccuracy of the two methods is comparable. \nCategory: Algorithms and Architectures. \n\n1 \n\nIntroduction \n\nBayesian belief networks (BBN) are a powerful formalism for representing and rea(cid:173)\nsoning under uncertainty. This representation has a solid theoretical foundation \n[13], and its practical value is suggested by the rapidly growing number of areas to \nwhich it is being applied. BBNs concisely represent the joint probability distribution \nover a set of random variables, by explicitly identifying the probabilistic dependen(cid:173)\ncies and independencies between these variables. Their clear semantics make BBNs \nparticularly. 
suitable for being used in tasks such as diagnosis, planning, and control. \n\nThe task of learning a BBN from data can usually be formulated as a search \nover the space of network structures, and as the subsequent search for an opti(cid:173)\nmal parametrization of the discovered structure or structures. The task can be \nfurther complicated by extending the search to account for hidden variables and for \n\n\fLearning Bayesian Belief Networks with Neural Network Estimators \n\n579 \n\nthe presence of data points with missing values. Different approaches have been \nsuccessfully applied to the task of learning probabilistic networks from data [5]. \nIn all these approaches, simplifying assumptions are made to circumvent practi(cid:173)\ncal problems in the implementation of the theory. One common assumption that \nis made is that all variables are discrete, or that all variables are continuous and \nnormally distributed. \n\nIn this paper, we propose a novel method for learning BBNs from data that makes \nuse of artificial neural networks (ANN) as probability distribution estimators, thus \navoiding the need for making prior assumptions on the nature of the probability \ndistribution governing the relationships among the participating variables. The use \nof ANNs as probability distribution estimators is not new [3], and its application to \nthe task of learning Bayesian belief networks from data has been recently explored \nin [11] . However, in [11] the ANN estimators were used in the parametrization of the \nBBN structure only, and cross validation was the method of choice for comparing \ndifferent network structures. In our approach, the ANN estimators are an essential \ncomponent of the scoring metric used to search over the BBN structure space. We \nran several simulations to compare the performance of this new method with the \nlearning method described in [7]. 
The results show that, although the learning scheme based on the use of ANN estimators is slower, the learning accuracy of the two methods is comparable. \n\nThe rest of the paper is organized as follows. In Section 2 we briefly introduce the Bayesian belief network formalism and some basics of how to learn BBNs from data. In Section 3, we describe our learning method, and detail the use of artificial neural networks as probability distribution estimators. In Section 4 we present some experimental results comparing the performance of this new method with the one proposed in [7]. We conclude the paper with some suggestions for further research. \n\n2 Background \n\nA Bayesian belief network is defined by a triple (G, Ω, P), where G = (X, E) is a directed acyclic graph with a set of nodes X = {x_1, ..., x_n} representing domain variables, and with a set of arcs E representing probabilistic dependencies among domain variables; Ω is the space of possible instantiations of the domain variables(1); and P is a probability distribution over the instantiations in Ω. Given a node x ∈ X, we use π_x to denote the set of parents of x. The essential property of BBNs is summarized by the Markov condition, which asserts that each variable is independent of its non-descendants given its parents. This property allows for the representation of the multivariate joint probability distribution over X in terms of the univariate conditional distributions P(x_i | π_i, θ_i) of each variable x_i given its parents π_i, with θ_i the set of parameters needed to fully characterize the conditional probability. Application of the chain rule, together with the Markov condition, yields the following factorization of the joint probability of any particular instantiation of all n variables: \n\nP(x'_1, ..., x'_n) = ∏_{i=1}^{n} P(x'_i | π'_i, θ_i) .   (1) \n\n(1) An instantiation ω of all n variables in X is an n-tuple of values {x'_1, ..., x'_n} such that x_i = x'_i for i = 1, ..., n. \n\n2.1 Learning Bayesian belief networks \n\nThe task of learning BBNs involves learning the network structure and learning the parameters of the conditional probability distributions. A well-established set of learning methods is based on the definition of a scoring metric measuring the fitness of a network structure to the data, and on the search for high-scoring network structures based on the defined scoring metric [7, 10]. We focus on these methods, and in particular on the definition of Bayesian scoring metrics. \n\nIn a Bayesian framework, ideally classification and prediction would be performed by taking a weighted average over the inferences of every possible belief network containing the domain variables. Since this approach is in general computationally infeasible, often an attempt has been made to use a high-scoring belief network for classification. We will assume this approach in the remainder of this paper. The basic idea of the Bayesian approach is to maximize the probability P(Bs | D) = P(Bs, D)/P(D) of a network structure Bs given a database of cases D. Because for all network structures the term P(D) is the same, for the purpose of model selection it suffices to calculate P(Bs, D) for all Bs. The Bayesian metrics developed so far all rely on the following assumptions: 1) given a BBN structure, all cases in D are drawn independently from the same distribution (multinomial sample); 2) there are no cases with missing values (complete database); 3) the parameters of the conditional probability distribution of each variable are independent (global parameter independence); and 4) the parameters associated with each instantiation of the parents of a variable are independent (local parameter independence). 
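\n\nAs a concrete numeric illustration of the factorization in Equation (1), consider a toy three-node chain network. This is a sketch for illustration only; the network, its CPT values, and all names below are made up and do not come from the paper:

```python
# Toy chain network x1 -> x2 -> x3, all variables binary.
# CPT values are invented purely for illustration.
cpt = {
    "x1": {(): {0: 0.6, 1: 0.4}},
    "x2": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.3, 1: 0.7}},
    "x3": {(0,): {0: 0.8, 1: 0.2}, (1,): {0: 0.5, 1: 0.5}},
}
parents = {"x1": [], "x2": ["x1"], "x3": ["x2"]}

def joint(instantiation):
    """P(x1', x2', x3') as the product of P(x_i' | pi_i'), as in Equation (1)."""
    p = 1.0
    for x in ["x1", "x2", "x3"]:
        pa = tuple(instantiation[y] for y in parents[x])
        p *= cpt[x][pa][instantiation[x]]
    return p

# e.g. joint({"x1": 0, "x2": 0, "x3": 0}) -> 0.6 * 0.9 * 0.8 = 0.432
```

Summing `joint` over all eight instantiations returns 1, as required of a probability distribution represented this way.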
\n\nThe application of these assumptions allows for the following factorization of the probability P(Bs, D): \n\nP(Bs, D) = P(Bs) P(D | Bs) = P(Bs) ∏_{i=1}^{n} s(x_i, π_i, D) ,   (2) \n\nwhere n is the number of nodes in the network, and each s(x_i, π_i, D) is a term measuring the contribution of x_i and its parents π_i to the overall score of the network Bs. The exact form of the terms s(x_i, π_i, D) differs slightly among the Bayesian scoring metrics defined so far, and for lack of space we refer the interested reader to the relevant literature [7, 10]. \n\nBy looking at Equation (2), it is clear that if we assume a uniform prior distribution over all network structures, the scoring metric is decomposable, in that it is just the product of the s(x_i, π_i, D) over all x_i times a constant P(Bs). Suppose that two network structures Bs and Bs' differ only in the presence or absence of a given arc into x_i. To compare their metrics, it suffices to compute s(x_i, π_i, D) for both structures, since the other terms are the same. Likewise, if we assume a decomposable prior distribution over network structures, of the form P(Bs) = ∏_i p_i, as suggested in [10], the scoring metric is still decomposable, since we can include each p_i in the corresponding s(x_i, π_i, D). \n\nOnce a scoring metric is defined, a search for a high-scoring network structure can be carried out. This search task (in several forms) has been shown to be NP-hard [4, 6]. Various heuristics have been proposed to find network structures with a high score. One such heuristic is known as K2 [7], and it implements a greedy search over the space of network structures. The algorithm assumes a given ordering on the variables. For simplicity, it also assumes that no prior information on the network is available, so the prior probability distribution over the network structures is uniform and can be ignored in comparing network structures. 
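\n\nThe greedy search of K2 just described can be sketched as follows. This is a simplified illustration, not the authors' implementation: the decomposable score s(x, π, D) is assumed to be supplied by the caller, and all names are hypothetical:

```python
def k2_parents(variables, score, max_parents):
    """Greedy K2-style search (sketch). `variables` is the assumed
    ordering; `score(x, parents)` stands for any decomposable metric
    s(x, pi, D). Returns a dict mapping each variable to its parents."""
    parents = {}
    for i, x in enumerate(variables):
        pi = []                        # start with the empty parent set
        best = score(x, pi)
        while len(pi) < max_parents:
            # candidate parents: predecessors of x in the ordering, not yet chosen
            candidates = [y for y in variables[:i] if y not in pi]
            scored = [(score(x, pi + [y]), y) for y in candidates]
            if not scored:
                break
            s_new, y_new = max(scored)
            if s_new <= best:          # no single addition improves the score
                break
            pi.append(y_new)
            best = s_new
        parents[x] = pi
    return parents
```

The sketch reflects the key property exploited above: only the local term for x_i is recomputed when an arc into x_i is added.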
\n\nThe Bayesian scoring metrics developed so far either assume discrete variables [7, 10], or continuous variables normally distributed [9]. In the next section, we propose a possible generalization which allows for the inclusion of both discrete and continuous variables with arbitrary probability distributions. \n\n3 An ANN-based scoring metric \n\nThe main idea of this work is to use artificial neural networks as probability estimators, to define a decomposable scoring metric for which no informative priors on the class, or classes, of the probability distributions of the participating variables are needed. The first three of the four assumptions described in the previous section are still needed, namely, the assumption of a multinomial sample, the assumption of a complete database, and the assumption of global parameter independence. However, the use of ANN estimators allows for the elimination of the assumption of local parameter independence. In fact, the conditional probabilities corresponding to the different instantiations of the parents of a variable are represented by the same ANN, and they share the same network weights and the same training data. \n\nLet us denote with D_l ≡ {C_1, ..., C_{l-1}} the set of the first l − 1 cases in the database, and with x_i^(l) and π_i^(l) the instantiations of x_i and π_i in the l-th case respectively. The joint probability P(Bs, D) can be written as: \n\nP(Bs) P(D | Bs) = P(Bs) ∏_{l=1}^{m} P(C_l | D_l, Bs) = P(Bs) ∏_{l=1}^{m} ∏_{i=1}^{n} P(x_i^(l) | π_i^(l), D_l, Bs) .   (3) \n\nIf we assume uninformative priors, or decomposable priors on network structures, of the form P(Bs) = ∏_i p_i, the probability P(Bs, D) is decomposable. 
In fact, we can interchange the two products in Equation (3), so as to obtain \n\nP(Bs, D) = ∏_{i=1}^{n} [ p_i ∏_{l=1}^{m} P(x_i^(l) | π_i^(l), D_l, Bs) ] = ∏_{i=1}^{n} s(x_i, π_i, D) ,   (4) \n\nwhere s(x_i, π_i, D) is the term between square brackets, and it is only a function of x_i and its parents in the network structure Bs (p_i can be neglected if we assume a uniform prior over the network structures). The computation of Equation (4) corresponds to the application of the prequential method discussed by Dawid [8]. \n\nThe estimation of each term P(x_i | π_i, D_l, Bs) can be done by means of a neural network. Several schemes are available for training a neural network to approximate a given probability distribution, or density. Notice that the calculation of each term s(x_i, π_i, D) can be computationally very expensive. For each node x_i, computing s(x_i, π_i, D) requires the training of m ANNs, where m is the size of the database. To reduce this computational cost, we use the following approximation, which we call the t-invariance approximation: for any l ∈ {1, ..., m − 1}, given the probability P(x_i | π_i, D_l, Bs), at least t (1 ≤ t ≤ m − l) new cases are needed in order to alter such probability. That is, for each positive integer h such that h < t, we assume P(x_i | π_i, D_{l+h}, Bs) = P(x_i | π_i, D_l, Bs). Intuitively, this approximation implies the assumption that, given our present belief about the value of each P(x_i | π_i, D_l, Bs), at least t new cases are needed to revise this belief. By making this approximation, we achieve a t-fold reduction in the computation needed, since we now need to build and train only m/t ANNs for each x_i, instead of the original m. In fact, application of the t-invariance approximation to the computation of a given s(x_i, π_i, D) yields: \n\ns(x_i, π_i, D) = p_i ∏_{j=0}^{m/t − 1} ∏_{h=1}^{t} P(x_i^(jt+h) | π_i^(jt+h), D_{jt+1}, Bs) .   (5) \n\nRather than selecting a constant value for t, we can choose to increment t as the size of the training database D_l increases. 
This approach seems preferable: the estimate of P(x_i | π_i, D_l, Bs) will be very sensitive to the addition of new cases when l is small, but will become increasingly insensitive to the addition of new cases as l grows. A scheme for the incremental updating of t can be summarized in the equation t = ⌈λl⌉, where l is the number of cases already seen (i.e., the cardinality of D_l), and 0 < λ ≤ 1. For example, given a data set of 50 cases, the updating scheme t = ⌈0.5 l⌉ would require the training of the ANN estimators P(x_i | π_i, D_l, Bs) for l = 1, 2, 3, 5, 8, 12, 18, 27, 41. \n\n4 Evaluation \n\nIn this section, we describe the experimental evaluation we conducted to test the feasibility of using the ANN-based scoring metric developed in the previous section. All the experiments are performed on the belief network Alarm, a multiply-connected network originally developed to model anesthesiology problems that may occur during surgery [2]. It contains 37 nodes/variables and 46 arcs. The variables are all discrete, and take between 2 and 4 distinct values. The database used in the experiments was generated from Alarm, and it is the same database used in [7]. \n\nIn the experiments, we use a modification of the algorithm K2 [7]. The modified algorithm, which we call ANN-K2, replaces the closed-form scoring metric developed in [7] with the ANN-based scoring metric of Equation (5). The performance of ANN-K2 is measured in terms of accuracy of the recovered network structure, by counting the number of edges added and omitted with respect to the Alarm network; and in terms of the accuracy of the learned joint probability distribution, by computing its cross entropy with respect to the joint probability distribution of Alarm. The learning performance of ANN-K2 is also compared with the performance of K2. To train the ANNs, we used the conjugate-gradient search algorithm [12]. 
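\n\nThe incremental updating scheme t = ⌈λl⌉ from the end of Section 3 can be sketched as follows; this is a minimal illustration, and the function name is hypothetical:

```python
import math

def retraining_points(m, lam):
    """Return the case indices l at which the ANN estimators are
    retrained under the scheme t = ceil(lam * l): after training on
    D_l, the next t cases reuse the same estimator."""
    points = []
    l = 1
    while l <= m:
        points.append(l)
        l += math.ceil(lam * l)   # wait at least t new cases before retraining
    return points

# For a data set of 50 cases and lam = 0.5, this reproduces the
# schedule given in the text: 1, 2, 3, 5, 8, 12, 18, 27, 41.
```

So with λ = 0.5 only 9 estimators per node are trained on 50 cases, instead of the 50 required without the t-invariance approximation.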
\n\nSince all the variables in the Alarm network are discrete, the ANN estimators are defined based on the softmax model, with normalized exponential output units, and with cross entropy as cost function. As a regularization technique, we augment the training set so as to induce a uniform conditional probability over the unseen instantiations of the ANN input. Given the probability P(x_i | π_i, D_l) to be estimated, and assuming x_i is a k-valued variable, for each instantiation π'_i that does not appear in the database D_l, we generate k new cases, with π_i instantiated to π'_i, and x_i taking each of its k values. As a result, the neural network estimates P(x_i | π'_i, D_l) to be uniform, with P(x_i | π'_i, D_l) = 1/k for each of x_i's values x_{i1}, ..., x_{ik}. \n\nWe ran simulations where we varied the size of the training data set (100, 500, 1000, 2000, and 3000 cases), and the value of λ in the updating scheme t = ⌈λl⌉ described in Section 3. We used the settings λ = 0.35 and λ = 0.5. For each run, we measured the number of arcs added, the number of arcs omitted, the cross entropy, and the computation time, for each variable in the network. That is, we considered each node, together with its parents, as a simple BBN, and collected the measures of interest for each of these BBNs. Table 1 reports mean and standard deviation of each measure over the 37 variables of Alarm, for both ANN-K2 and K2. The results for ANN-K2 shown in Table 1 correspond to the setting λ = 0.5, since their difference from the results corresponding to the setting λ = 0.35 was not statistically significant. \n\nData set | Algo.  | arcs + (m, s.d.) | arcs - (m, s.d.) | cross entropy (m, med, s.d.) | time in secs (m, med, s.d.) \n100      | ANN-K2 | 0.19, 0.40       | 0.62, 0.86       | 0.23, .051, 0.52             | 130, 88, 159 \n100      | K2     | 0.75, 1.28       | 0.22, 0.48       | 0.08, .070, 0.10             | 0.44, .06, 1.48 \n500      | ANN-K2 | 0.19, 0.40       | 0.22, 0.48       | 0.04, .010, 0.11             | 1077, 480, 1312 \n500      | K2     | 0.22, 0.42       | 0.11, 0.31       | 0.02, .010, 0.02             | 0.13, .06, 0.22 \n1000     | ANN-K2 | 0.24, 0.49       | 0.22, 0.48       | 0.05, .005, 0.15             | 6909, 4866, 6718 \n1000     | K2     | 0.11, 0.31       | 0.03, 0.16       | 0.01, .006, 0.01             | 0.34, .23, 0.46 \n2000     | ANN-K2 | 0.19, 0.40       | 0.11, 0.31       | 0.02, .002, 0.06             | 6458, 4155, 7864 \n2000     | K2     | 0.05, 0.23       | 0.03, 0.16       | 0.005, .002, 0.007           | 0.46, .44, 0.65 \n3000     | ANN-K2 | 0.16, 0.37       | 0.05, 0.23       | 0.017, .001, 0.01            | 11155, 4672, 2136 \n3000     | K2     | 0.00, 0.00       | 0.03, 0.16       | 0.004, .001, 0.005           | 1.02, .84, 1.11 \n\nTable 1: Comparison of the performance of ANN-K2 and of K2 in terms of number of arcs wrongly added (+), number of arcs wrongly omitted (-), cross entropy, and computation time. Each column reports the mean m, the median med, and the standard deviation s.d. of the corresponding measure over the 37 nodes/variables of Alarm. The median for the number of arcs added and omitted is always 0, and is not reported. \n\nStandard t-tests were performed to assess the significance of the difference between the measures for K2 and the measures for ANN-K2, for each data set cardinality. No technique to correct for multiple testing was applied. Most measures show no statistically significant difference, either at the 0.05 level or at the 0.01 level (most p-values are well above 0.2). In the simulation with 100 cases, both the difference between the mean number of arcs added and the difference between the mean number of arcs omitted are statistically significant (p ≤ 0.01). However, these differences cancel out, in that ANN-K2 adds fewer extra arcs than K2, and K2 omits fewer arcs than ANN-K2. 
This is reflected in the corresponding cross entropies, whose difference is not statistically significant (p = 0.08). In the simulation with 1000 cases, only the difference in the number of arcs omitted is statistically significant (p ≤ .03). Finally, in the simulation with 3000 cases, only the difference in the number of arcs added is statistically significant (p ≤ .02). K2 misses a single arc, and does not add any extra arc, and this is the best result to date. By comparison, ANN-K2 omits 2 arcs, and adds 5 extra arcs. For the simulation with 3000 cases, we also computed Wilcoxon rank sum tests. The results were consistent with the t-test results, showing a statistically significant difference only in the number of arcs added. Finally, as can be noted in Table 1, the difference in computation time is of several orders of magnitude, thus making a statistical analysis superfluous. \n\nA natural question to ask is how sensitive the learning procedure is to the order of the cases in the training set. Ideally, the procedure would be insensitive to this order. Since we are using ANN estimators, however, which perform a greedy search in the solution space, particular permutations of the training cases might cause the ANN estimators to be more susceptible to getting stuck in local maxima. We performed some preliminary experiments to test the sensitivity of the learning procedure to the order of the cases in the training set. We ran a few simulations in which we randomly changed the order of the cases. The recovered structure was identical in all simulations. Moreover, the difference in cross entropy for different orderings of the cases in the training set was not statistically significant. \n\n5 Conclusions \n\nIn this paper we presented a novel method for learning BBNs from data based on the use of artificial neural networks as probability distribution estimators. As a preliminary evaluation, we have compared the performance of the new algorithm with the performance of K2, a well-established learning algorithm for discrete domains, for which extensive empirical evaluation is available [1, 7]. With regard to the learning accuracy of the new method, the results are encouraging, being comparable to state-of-the-art results for the chosen domain. The next step is the application of this method to domains where current techniques for learning BBNs from data are not applicable, namely domains with continuous variables not normally distributed, and domains with mixtures of continuous and discrete variables. The main drawback of the new algorithm is its time requirements. However, in this preliminary evaluation, our main concern was the learning accuracy of the algorithm, and little effort was spent in trying to optimize its time requirements. We believe there is ample room for improvement in the time performance of the algorithm. More importantly, the scoring metric of Section 3 provides a general framework for experimenting with different classes of probability estimators. In this paper we used ANN estimators, but more efficient estimators can easily be adopted, especially if we assume the availability of prior information on the class of probability distributions to be used. \n\nAcknowledgments \n\nThis work was funded by grant IRI-9509792 from the National Science Foundation. \n\nReferences \n\n[1] C. Aliferis and G. F. Cooper. An evaluation of an algorithm for inductive learning of Bayesian belief networks using simulated data sets. In Proceedings of the 10th Conference on Uncertainty in AI, pages 8-14, San Francisco, California, 1994. \n\n[2] I. Beinlich, H. Suermondt, H. Chavez, and G. Cooper. The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In 2nd Conference of AI in Medicine Europe, pages 247-256, London, England, 1989. 
\n\n[3] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995. \n\n[4] R. Bouckaert. Properties of learning algorithms for Bayesian belief networks. In Proceedings of the 10th Conference on Uncertainty in AI, pages 102-109, 1994. \n\n[5] W. Buntine. A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering, 1996. To appear. \n\n[6] D. Chickering, D. Geiger, and D. Heckerman. Learning Bayesian networks: search methods and experimental results. In Proc. 5th Workshop on AI and Statistics, 1995. \n\n[7] G. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-347, 1992. \n\n[8] A. Dawid. Present position and potential developments: Some personal views. Statistical theory. The prequential approach. Journal of the Royal Statistical Society A, 147:278-292, 1984. \n\n[9] D. Geiger and D. Heckerman. Learning Gaussian networks. Technical Report MSR-TR-94-10, Microsoft Research, One Microsoft Way, Redmond, WA 98052, 1994. \n\n[10] D. Heckerman, D. Geiger, and D. Chickering. Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning, 1995. \n\n[11] R. Hofmann and V. Tresp. Discovering structure in continuous variables using Bayesian networks. In Advances in NIPS 8. MIT Press, 1995. \n\n[12] M. Moller. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6:525-533, 1993. \n\n[13] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, Inc., 1988. \n", "award": [], "sourceid": 1211, "authors": [{"given_name": "Stefano", "family_name": "Monti", "institution": null}, {"given_name": "Gregory", "family_name": "Cooper", "institution": null}]}