{"title": "Learning Bayesian Belief Networks with Neural Network Estimators", "book": "Advances in Neural Information Processing Systems", "page_first": 578, "page_last": 584, "abstract": null, "full_text": "Learning Bayesian belief networks with neural network estimators\n\nStefano Monti*\n\n*Intelligent Systems Program\n\nUniversity of Pittsburgh\n\n901M CL, Pittsburgh, PA 15260\n\nsmonti@isp.pitt.edu\n\nGregory F. Cooper*,**\n\n**Center for Biomedical Informatics\n\nUniversity of Pittsburgh\n\n8084 Forbes Tower, Pittsburgh, PA 15261\n\ngfc@cbmi.upmc.edu\n\nAbstract\n\nIn this paper we propose a method for learning Bayesian belief networks from data. The method uses artificial neural networks as probability estimators, thus avoiding the need for prior assumptions on the nature of the probability distributions governing the relationships among the participating variables. This new method has the potential to be applied to domains containing both discrete and continuous variables with arbitrary distributions. We compare the learning performance of this new method with that of the method proposed by Cooper and Herskovits in [7]. The experimental results show that, although the learning scheme based on ANN estimators is slower, the learning accuracy of the two methods is comparable.\n\nCategory: Algorithms and Architectures.\n\n1 Introduction\n\nBayesian belief networks (BBN) are a powerful formalism for representing and reasoning under uncertainty. This representation has a solid theoretical foundation [13], and its practical value is suggested by the rapidly growing number of areas to which it is being applied. 
BBNs concisely represent the joint probability distribution over a set of random variables by explicitly identifying the probabilistic dependencies and independencies between these variables. Their clear semantics make BBNs particularly suitable for tasks such as diagnosis, planning, and control.\n\nThe task of learning a BBN from data can usually be formulated as a search over the space of network structures, followed by a search for an optimal parametrization of the discovered structure or structures. The task can be further complicated by extending the search to account for hidden variables and for the presence of data points with missing values. Different approaches have been successfully applied to the task of learning probabilistic networks from data [5]. In all these approaches, simplifying assumptions are made to circumvent practical problems in the implementation of the theory. One common assumption is that all variables are discrete, or that all variables are continuous and normally distributed.\n\nIn this paper, we propose a novel method for learning BBNs from data that makes use of artificial neural networks (ANN) as probability distribution estimators, thus avoiding the need for prior assumptions on the nature of the probability distribution governing the relationships among the participating variables. The use of ANNs as probability distribution estimators is not new [3], and its application to the task of learning Bayesian belief networks from data has been recently explored in [11]. 
However, in [11] the ANN estimators were used in the parametrization of the BBN structure only, and cross validation was the method of choice for comparing different network structures. In our approach, the ANN estimators are an essential component of the scoring metric used to search over the BBN structure space. We ran several simulations to compare the performance of this new method with the learning method described in [7]. The results show that, although the learning scheme based on ANN estimators is slower, the learning accuracy of the two methods is comparable.\n\nThe rest of the paper is organized as follows. In Section 2 we briefly introduce the Bayesian belief network formalism and some basics of how to learn BBNs from data. In Section 3, we describe our learning method, and detail the use of artificial neural networks as probability distribution estimators. In Section 4 we present some experimental results comparing the performance of this new method with the one proposed in [7]. We conclude the paper with some suggestions for further research.\n\n2 Background\n\nA Bayesian belief network is defined by a triple (G, Ω, P), where G = (𝒳, E) is a directed acyclic graph with a set of nodes 𝒳 = {X_1, ..., X_n} representing domain variables, and with a set of arcs E representing probabilistic dependencies among domain variables; Ω is the space of possible instantiations of the domain variables¹; and P is a probability distribution over the instantiations in Ω. Given a node X ∈ 𝒳, we use π_X to denote the set of parents of X in 𝒳. The essential property of BBNs is summarized by the Markov condition, which asserts that each variable is independent of its non-descendants given its parents. 
This property allows for the representation of the multivariate joint probability distribution over 𝒳 in terms of the univariate conditional distributions P(X_i | π_i, θ_i) of each variable X_i given its parents π_i, with θ_i the set of parameters needed to fully characterize the conditional probability. Application of the chain rule, together with the Markov condition, yields the following factorization of the joint probability of any particular instantiation of all n variables:\n\nP(x'_1, ..., x'_n) = ∏_{i=1}^{n} P(x'_i | π'_i, θ_i).   (1)\n\n¹ An instantiation ω of all n variables in 𝒳 is an n-tuple of values {x'_1, ..., x'_n} such that X_i = x'_i for i = 1, ..., n.\n\n2.1 Learning Bayesian belief networks\n\nThe task of learning BBNs involves learning the network structure and learning the parameters of the conditional probability distributions. A well established set of learning methods is based on the definition of a scoring metric measuring the fitness of a network structure to the data, and on the search for high-scoring network structures based on the defined scoring metric [7, 10]. We focus on these methods, and in particular on the definition of Bayesian scoring metrics.\n\nIn a Bayesian framework, ideally classification and prediction would be performed by taking a weighted average over the inferences of every possible belief network containing the domain variables. Since this approach is in general computationally infeasible, a common alternative is to use a single high-scoring belief network for classification. We will assume this approach in the remainder of this paper.\n\nThe basic idea of the Bayesian approach is to maximize the probability P(B_S | D) = P(B_S, D) / P(D) of a network structure B_S given a database of cases D. 
Because the term P(D) is the same for all network structures, for the purpose of model selection it suffices to calculate P(B_S, D) for all B_S. The Bayesian metrics developed so far all rely on the following assumptions: 1) given a BBN structure, all cases in D are drawn independently from the same distribution (multinomial sample); 2) there are no cases with missing values (complete database); 3) the parameters of the conditional probability distribution of each variable are independent (global parameter independence); and 4) the parameters associated with each instantiation of the parents of a variable are independent (local parameter independence).\n\nThe application of these assumptions allows for the following factorization of the probability P(B_S, D):\n\nP(B_S, D) = P(B_S) P(D | B_S) = P(B_S) ∏_{i=1}^{n} s(X_i, π_i, D),   (2)\n\nwhere n is the number of nodes in the network, and each s(X_i, π_i, D) is a term measuring the contribution of X_i and its parents π_i to the overall score of the network B_S. The exact form of the terms s(X_i, π_i, D) differs slightly among the Bayesian scoring metrics defined so far, and for lack of space we refer the interested reader to the relevant literature [7, 10].\n\nBy looking at Equation (2), it is clear that if we assume a uniform prior distribution over all network structures, the scoring metric is decomposable, in that it is just the product of the s(X_i, π_i, D) over all X_i times a constant P(B_S). Suppose that two network structures B_S and B_S' differ only in the presence or absence of a given arc into X_i. To compare their metrics, it suffices to compute s(X_i, π_i, D) for both structures, since the other terms are the same. 
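
As a minimal sketch of this decomposability (an illustration, not code from the paper), the comparison of two structures that differ in a single family can be written as below. The names log_score, log_score_difference, toy_family_score and the numeric scores are hypothetical placeholders; any decomposable metric could be plugged in. Scores are kept in log space, so the product in Equation (2) becomes a sum, and a uniform structure prior is assumed.

```python
def log_score(structure, log_family_score):
    # Log of Equation (2) under a uniform structure prior:
    # one additive term per node and its parent set.
    return sum(log_family_score(node, parents)
               for node, parents in structure.items())

def log_score_difference(struct_a, struct_b, log_family_score):
    # Only families whose parent sets differ contribute to the difference,
    # so the remaining terms never need to be recomputed.
    return sum(log_family_score(node, struct_b[node])
               - log_family_score(node, struct_a[node])
               for node in struct_a
               if struct_a[node] != struct_b[node])

# Hypothetical family log-scores, for illustration only.
toy_log_scores = {
    ('x1', ()): -3.0,
    ('x2', ()): -5.0,
    ('x2', ('x1',)): -4.0,
}

def toy_family_score(node, parents):
    return toy_log_scores[(node, parents)]

without_arc = {'x1': (), 'x2': ()}
with_arc = {'x1': (), 'x2': ('x1',)}
delta = log_score_difference(without_arc, with_arc, toy_family_score)
```

Here delta is positive exactly when adding the arc into x2 improves the score, and it equals the difference of the two full scores.
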
Likewise, if we assume a decomposable prior distribution over network structures, of the form P(B_S) = ∏_i p_i, as suggested in [10], the scoring metric is still decomposable, since we can include each p_i into the corresponding s(X_i, π_i, D).\n\nOnce a scoring metric is defined, a search for a high-scoring network structure can be carried out. This search task (in several forms) has been shown to be NP-hard [4, 6]. Various heuristics have been proposed to find network structures with a high score. One such heuristic is known as K2 [7], and it implements a greedy search over the space of network structures. The algorithm assumes a given ordering on the variables. For simplicity, it also assumes that no prior information on the network is available, so the prior probability distribution over the network structures is uniform and can be ignored in comparing network structures.\n\nThe Bayesian scoring metrics developed so far assume either discrete variables [7, 10], or continuous variables that are normally distributed [9]. In the next section, we propose a possible generalization which allows for the inclusion of both discrete and continuous variables with arbitrary probability distributions.\n\n3 An ANN-based scoring metric\n\nThe main idea of this work is to use artificial neural networks as probability estimators, to define a decomposable scoring metric for which no informative priors on the class, or classes, of the probability distributions of the participating variables are needed. The first three of the four assumptions described in the previous section are still needed, namely, the assumption of a multinomial sample, the assumption of a complete database, and the assumption of global parameter independence. 
However, the use of ANN estimators allows for the elimination of the assumption of local parameter independence. In fact, the conditional probabilities corresponding to the different instantiations of the parents of a variable are represented by the same ANN, and they share the same network weights and the same training data.\n\nLet us denote with D_l ≡ {C_1, ..., C_{l-1}} the set of the first l - 1 cases in the database, and with x_i^{(l)} and π_i^{(l)} the instantiations of X_i and π_i in the l-th case respectively. The joint probability P(B_S, D) can be written as:\n\nP(B_S, D) = P(B_S) P(D | B_S) = P(B_S) ∏_{l=1}^{m} P(C_l | D_l, B_S) = P(B_S) ∏_{l=1}^{m} ∏_{i=1}^{n} P(x_i^{(l)} | π_i^{(l)}, D_l, B_S).   (3)\n\nIf we assume uninformative priors, or decomposable priors on network structures, of the form P(B_S) = ∏_i p_i, the probability P(B_S, D) is decomposable. In fact, we can interchange the two products in Equation (3), so as to obtain\n\nP(B_S, D) = ∏_{i=1}^{n} [p_i ∏_{l=1}^{m} P(x_i^{(l)} | π_i^{(l)}, D_l, B_S)] = ∏_{i=1}^{n} s(X_i, π_i, D),   (4)\n\nwhere s(X_i, π_i, D) is the term between square brackets, and it is only a function of X_i and its parents in the network structure B_S (p_i can be neglected if we assume a uniform prior over the network structures). The computation of Equation (4) corresponds to the application of the prequential method discussed by Dawid [8].\n\nThe estimation of each term P(X_i | π_i, D_l, B_S) can be done by means of a neural network. Several schemes are available for training a neural network to approximate a given probability distribution, or density. Notice that the calculation of each term s(X_i, π_i, D) can be computationally very expensive. For each node X_i, computing s(X_i, π_i, D) requires the training of m ANNs, where m is the size of the database. 
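
The prequential computation of a single term s(X_i, π_i, D) can be sketched as follows. This is an illustrative sketch only: the ANN estimator is replaced by an add-one smoothed frequency count so that the example stays self-contained, a uniform structure prior is assumed (so p_i is dropped), and the names smoothed_estimate, log_prequential_score and the dictionary-based case layout are hypothetical. The estimator is rebuilt from the preceding cases for every one of the m cases, which mirrors the cost noted above.

```python
import math

def smoothed_estimate(prior_cases, child, parents, x, pa, k):
    # Stand-in for the ANN estimator (an assumption of this sketch):
    # add-one smoothed frequency of x given parent values pa,
    # computed only from the cases seen so far.
    joint = sum(1 for c in prior_cases
                if c[child] == x and tuple(c[p] for p in parents) == pa)
    marginal = sum(1 for c in prior_cases
                   if tuple(c[p] for p in parents) == pa)
    return (joint + 1) / (marginal + k)

def log_prequential_score(cases, child, parents, k):
    # Log of the bracketed term in Equation (4) with p_i omitted:
    # each case is predicted from the cases that precede it.
    total = 0.0
    for position, case in enumerate(cases):
        pa = tuple(case[p] for p in parents)
        prob = smoothed_estimate(cases[:position], child, parents,
                                 case[child], pa, k)
        total += math.log(prob)
    return total

# Toy binary data in which b always copies a.
toy_cases = [{'a': 0, 'b': 0}, {'a': 1, 'b': 1},
             {'a': 0, 'b': 0}, {'a': 1, 'b': 1}]
score_with_parent = log_prequential_score(toy_cases, 'b', ('a',), 2)
score_without_parent = log_prequential_score(toy_cases, 'b', (), 2)
```

On this toy data the score with a as a parent of b is higher than the score with no parents, as one would expect when b is fully determined by a.
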
To reduce this computational cost, we use the following approximation, which we call the t-invariance approximation: for any l ∈ {1, ..., m - 1}, given the probability P(X_i | π_i, D_l, B_S), at least t (1 ≤ t ≤ m - l) new cases are needed in order to alter such probability. That is, for each positive integer h, such that h < t, we assume P(X_i | π_i, D_{l+h}, B_S) = P(X_i | π_i, D_l, B_S). Intuitively, this approximation implies the assumption that, given our present belief about the value of each P(X_i | π_i, D_l, B_S), at least t new cases are needed to revise this belief. By making this approximation, we achieve a t-fold reduction in the computation needed, since we now need to build and train only m/t ANNs for each X_i, instead of the original m. In fact, application of the t-invariance approximation to the computation of a given s(X_i, π_i, D) yields:\n\ns(X_i, π_i, D) ≈ p_i ∏_{k=0}^{⌈m/t⌉-1} ∏_{l=tk+1}^{min(t(k+1), m)} P(x_i^{(l)} | π_i^{(l)}, D_{tk+1}, B_S).   (5)\n\nRather than selecting a constant value for t, we can choose to increment t as the size of the training database D_l increases. This approach seems preferable. When estimating P(X_i | π_i, D_l, B_S), this estimate will be very sensitive to the addition of new cases when l is small, but will become increasingly insensitive to the addition of new cases as l grows. A scheme for the incremental updating of t can be summarized in the equation t = ⌈λl⌉, where l is the number of cases already seen (i.e., the cardinality of D_l), and 0 < λ ≤ 1. For example, given a data set of 50 cases, the updating scheme t = ⌈0.5l⌉ would require the training of the ANN estimators P(X_i | π_i, D_l, B_S) for l = 1, 2, 3, 5, 8, 12, 18, 27, 41.\n\n4 Evaluation\n\nIn this section, we describe the experimental evaluation we conducted to test the feasibility of the ANN-based scoring metric developed in the previous section. 
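
Before turning to the experiments, the incremental updating scheme t = ⌈λl⌉ of Section 3 can be checked with a few lines of Python. This is only an illustrative sketch; the function name retraining_points and its parameters are assumptions introduced here, not part of the original method description.

```python
import math

def retraining_points(num_cases, lam):
    # Values of l at which an estimator is (re)trained when the gap to the
    # next retraining is t = ceil(lam * l), starting from l = 1.
    points = []
    l = 1
    while l <= num_cases:
        points.append(l)
        l += math.ceil(lam * l)
    return points

schedule = retraining_points(50, 0.5)
```

For 50 cases and λ = 0.5, schedule holds the nine values of l quoted in the example of Section 3, so nine estimators are trained per node instead of fifty.
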
All the experiments are performed on the belief network Alarm, a multiply-connected network originally developed to model anesthesiology problems that may occur during surgery [2]. It contains 37 nodes/variables and 46 arcs. The variables are all discrete, and take between 2 and 4 distinct values. The database used in the experiments was generated from Alarm, and it is the same database used in [7].\n\nIn the experiments, we use a modification of the algorithm K2 [7]. The modified algorithm, which we call ANN-K2, replaces the closed-form scoring metric developed in [7] with the ANN-based scoring metric of Equation (5). The performance of ANN-K2 is measured in terms of accuracy of the recovered network structure, by counting the number of edges added and omitted with respect to the Alarm network; and in terms of the accuracy of the learned joint probability distribution, by computing its cross entropy with respect to the joint probability distribution of Alarm. The learning performance of ANN-K2 is also compared with the performance of K2. To train the ANNs, we used the conjugate-gradient search algorithm [12].\n\nSince all the variables in the Alarm network are discrete, the ANN estimators are defined based on the softmax model, with normalized exponential output units, and with cross entropy as cost function. As a regularization technique, we augment the training set so as to induce a uniform conditional probability over the unseen instantiations of the ANN input. Given the probability P(X_i | π_i, D_l) to be estimated, and assuming X_i is a k-valued variable, for each instantiation π'_i that does not appear in the database D_l, we generate k new cases, with π_i instantiated to π'_i, and X_i taking each of its k values. 
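
One possible reading of this augmentation step is sketched below. It assumes that cases are stored as dictionaries and that the set of values of every variable is known in advance; the function name augment_with_unseen and the data layout are hypothetical and chosen only for illustration.

```python
from itertools import product

def augment_with_unseen(cases, child, parents, values):
    # values maps each variable name to the list of values it can take.
    # For every parent instantiation absent from the data, add one case
    # per value of the child, so the added cases are uniform over the child.
    seen = {tuple(c[p] for p in parents) for c in cases}
    extra = []
    for pa in product(*(values[p] for p in parents)):
        if pa not in seen:
            for x in values[child]:
                new_case = dict(zip(parents, pa))
                new_case[child] = x
                extra.append(new_case)
    return cases + extra

toy_values = {'a': [0, 1], 'b': [0, 1, 2]}
toy_data = [{'a': 0, 'b': 1}]
augmented = augment_with_unseen(toy_data, 'b', ('a',), toy_values)
```

In the toy example only the parent value a = 1 is unseen, so three cases are added, one for each value of b.
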
As a result, the neural network estimates P(X_i | π'_i, D_l) to be uniform, with P(x_i | π'_i, D_l) = 1/k for each of X_i's values x_{i1}, ..., x_{ik}.\n\nWe ran simulations where we varied the size of the training data set (100, 500, 1000, 2000, and 3000 cases), and the value of λ in the updating scheme t = ⌈λl⌉ described in Section 3. We used the settings λ = 0.35 and λ = 0.5. For each run, we measured the number of arcs added, the number of arcs omitted, the cross entropy, and the computation time, for each variable in the network. That is, we considered each node, together with its parents, as a simple BBN, and collected the measures of interest for each of these BBNs. Table 1 reports mean and standard deviation of each measure over the 37 variables of Alarm, for both ANN-K2 and K2. The results for ANN-K2 shown in Table 1 correspond to the setting λ = 0.5, since their difference from the results corresponding to the setting λ = 0.35 was not statistically significant.\n\nData set  Algo.   arcs +          arcs -          cross entropy          time (secs)\n                  m      s.d.     m      s.d.     m      med    s.d.     m      med    s.d.\n100       ANN-K2  0.19   0.40     0.62   0.86     0.23   0.051  0.52     130    88     159\n          K2      0.75   1.28     0.22   0.48     0.08   0.070  0.10     0.44   0.06   1.48\n500       ANN-K2  0.19   0.40     0.22   0.48     0.04   0.010  0.11     1077   480    1312\n          K2      0.22   0.42     0.11   0.31     0.02   0.010  0.02     0.13   0.06   0.22\n1000      ANN-K2  0.24   0.49     0.22   0.48     0.05   0.005  0.15     6909   4866   6718\n          K2      0.11   0.31     0.03   0.16     0.01   0.006  0.01     0.34   0.23   0.46\n2000      ANN-K2  0.19   0.40     0.11   0.31     0.02   0.002  0.06     6458   4155   7864\n          K2      0.05   0.23     0.03   0.16     0.005  0.002  0.007    0.46   0.44   0.65\n3000      ANN-K2  0.16   0.37     0.05   0.23     0.01   0.001  0.017    11155  4672   2136\n          K2      0.00   0.00     0.03   0.16     0.004  0.001  0.005    1.02   0.84   1.11\n\nTable 1: Comparison of the performance of ANN-K2 and of K2 in terms of number of arcs wrongly added (+), number of arcs wrongly omitted (-), cross entropy, and computation time. Each column reports the mean m, the median med, and the standard deviation s.d. of the corresponding measure over the 37 nodes/variables of Alarm. The median for the number of arcs added and omitted is always 0, and is not reported.\n\nStandard t-tests were performed to assess the significance of the difference between the measures for K2 and the measures for ANN-K2, for each data set cardinality. No technique to correct for multiple testing was applied. Most measures show no statistically significant difference, either at the 0.05 level or at the 0.01 level (most p values are well above 0.2). In the simulation with 100 cases, both the difference between the mean number of arcs added and the difference between the mean number of arcs omitted are statistically significant (p ≈ 0.01). However, these differences cancel out, in that ANN-K2 adds fewer extra arcs than K2, and K2 omits fewer arcs than ANN-K2. This is reflected in the corresponding cross entropies, whose difference is not statistically significant (p = 0.08). In the simulation with 1000 cases, only the difference in the number of arcs omitted is statistically significant (p ≈ 0.03). Finally, in the simulation with 3000 cases, only the difference in the number of arcs added is statistically significant (p ≈ 0.02). K2 misses a single arc, and does not add any extra arc, and this is the best result to date. By comparison, ANN-K2 omits 2 arcs, and adds 5 extra arcs. 
For the simulation with 3000 cases, we also computed Wilcoxon rank sum tests. The results were consistent with the t-test results, showing a statistically significant difference only in the number of arcs added. Finally, as can be seen in Table 1, the difference in computation time is of several orders of magnitude, thus making a statistical analysis superfluous.\n\nA natural question to ask is how sensitive the learning procedure is to the order of the cases in the training set. Ideally, the procedure would be insensitive to this order. Since we are using ANN estimators, however, which perform a greedy search in the solution space, particular permutations of the training cases might cause the ANN estimators to be more susceptible to getting stuck in local maxima. We performed some preliminary experiments to test the sensitivity of the learning procedure to the order of the cases in the training set. We ran a few simulations in which we randomly changed the order of the cases. The recovered structure was identical in all simulations. Moreover, the difference in cross entropy for different orderings of the cases in the training set was not statistically significant.\n\n5 Conclusions\n\nIn this paper we presented a novel method for learning BBNs from data based on the use of artificial neural networks as probability distribution estimators. As a preliminary evaluation, we have compared the performance of the new algorithm with the performance of K2, a well established learning algorithm for discrete domains, for which extensive empirical evaluation is available [1, 7]. With regard to the learning accuracy of the new method, the results are encouraging, being comparable to state-of-the-art results for the chosen domain. 
The next step is the application of this method to domains where current techniques for learning BBNs from data are not applicable, namely domains with continuous variables not normally distributed, and domains with mixtures of continuous and discrete variables. The main drawback of the new algorithm is its time requirements. However, in this preliminary evaluation, our main concern was the learning accuracy of the algorithm, and little effort was spent in trying to optimize its time requirements. We believe there is ample room for improvement in the time performance of the algorithm. More importantly, the scoring metric of Section 3 provides a general framework for experimenting with different classes of probability estimators. In this paper we used ANN estimators, but more efficient estimators can easily be adopted, especially if we assume the availability of prior information on the class of probability distributions to be used.\n\nAcknowledgments\n\nThis work was funded by grant IRI-9509792 from the National Science Foundation.\n\nReferences\n\n[1] C. Aliferis and G. F. Cooper. An evaluation of an algorithm for inductive learning of Bayesian belief networks using simulated data sets. In Proceedings of the 10th Conference of Uncertainty in AI, pages 8-14, San Francisco, California, 1994.\n\n[2] I. Beinlich, H. Suermondt, H. Chavez, and G. Cooper. The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In 2nd Conference of AI in Medicine Europe, pages 247-256, London, England, 1989.\n\n[3] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.\n\n[4] R. Bouckaert. Properties of learning algorithms for Bayesian belief networks. In Proceedings of the 10th Conference of Uncertainty in AI, pages 102-109, 1994. 
\n\n[5] W. Buntine. A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering, 1996. To appear.\n\n[6] D. Chickering, D. Geiger, and D. Heckerman. Learning Bayesian networks: search methods and experimental results. Proc. 5th Workshop on AI and Statistics, 1995.\n\n[7] G. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-347, 1992.\n\n[8] A. Dawid. Present position and potential developments: Some personal views. Statistical theory. The prequential approach. Journal of Royal Statistical Society A, 147:278-292, 1984.\n\n[9] D. Geiger and D. Heckerman. Learning Gaussian networks. Technical Report MSR-TR-94-10, Microsoft Research, One Microsoft Way, Redmond, WA 98052, 1994.\n\n[10] D. Heckerman, D. Geiger, and D. Chickering. Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning, 1995.\n\n[11] R. Hofmann and V. Tresp. Discovering structure in continuous variables using Bayesian networks. In Advances in NIPS 8. MIT Press, 1995.\n\n[12] M. Moller. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6:525-533, 1993.\n\n[13] J. Pearl. Probabilistic Reasoning in Intelligent Systems: networks of plausible inference. Morgan Kaufmann Publishers, Inc., 1988.\n\n", "award": [], "sourceid": 1211, "authors": [{"given_name": "Stefano", "family_name": "Monti", "institution": null}, {"given_name": "Gregory", "family_name": "Cooper", "institution": null}]}