{"title": "Inference in Multilayer Networks via Large Deviation Bounds", "book": "Advances in Neural Information Processing Systems", "page_first": 260, "page_last": 266, "abstract": null, "full_text": "Inference in Multilayer Networks \n\n\u2022 VIa \n\nLarge  Deviation Bounds \n\nMichael Kearns and  Lawrence Saul \n\nAT&T  Labs - Research \n\nShannon Laboratory \n\n180  Park A venue A-235 \nFlorham Park,  NJ  07932 \n\n{mkearns ,lsaul}Oresearch.att. com \n\nAbstract \n\nWe  study  probabilistic  inference  in  large,  layered  Bayesian  net(cid:173)\nworks  represented  as  directed  acyclic  graphs.  We  show  that  the \nintractability of exact inference in such  networks does  not preclude \ntheir effective  use.  We  give algorithms for  approximate probabilis(cid:173)\ntic inference  that exploit averaging phenomena occurring at nodes \nwith  large  numbers  of parents.  We  show  that  these  algorithms \ncompute  rigorous  lower  and  upper  bounds  on  marginal probabili(cid:173)\nties  of interest,  prove  that these  bounds  become exact in  the limit \nof large networks,  and provide rates  of convergence. \n\n1 \n\nIntroduction \n\nThe  promise  of neural  computation  lies  in  exploiting  the  information  processing \nabilities of simple computing elements organized into large networks.  Arguably one \nof the  most  important  types  of information  processing  is  the  capacity  for  proba(cid:173)\nbilistic reasoning. \n\nThe properties of undirectedproDabilistic models represented as symmetric networks \nhave  been  studied  extensively  using  methods from  statistical  mechanics  (Hertz  et \naI,  1991).  Detailed  analyses  of these  models  are  possible  by  exploiting  averaging \nphenomena that occur  in the thermodynamic limit of large networks. \n\nIn  this  paper,  we  analyze  the  limit  of large,  multilayer  networks  for  probabilistic \nmodels represented  as  directed acyclic graphs.  These models are known as  Bayesian \nnetworks  (Pearl,  1988;  Neal,  1992), and  they have  different  probabilistic semantics \nthan symmetric neural networks (such  as Hopfield models or Boltzmann machines). \nWe  show  that the intractability of exact inference  in  multilayer Bayesian  networks \n\n\fInference in Multilayer Networks via Large Deviation Bounds \n\n261 \n\ndoes  not  preclude  their effective  use.  Our  work  builds  on  earlier  studies  of varia(cid:173)\ntional  methods  (Jordan  et  aI,  1997).  We  give  algorithms for  approximate proba(cid:173)\nbilistic inference that exploit averaging phenomena occurring  at nodes  with  N  \u00bb  1 \nparents.  We  show  that these  algorithms compute rigorous  lower  and upper bounds \non  marginal probabilities of interest,  prove  that these  bounds  become exact in  the \nlimit N  -+ 00,  and provide rates  of convergence. \n\n2  Definitions and  Preliminaries \n\nA Bayesian network  is  a  directed graphical probabilistic model, in  which  the nodes \nrepresent  random variables,  and the links represent  causal dependencies.  The joint \ndistribution of this model is  obtained by  composing the local conditional probability \ndistributions  (or  tables),  Pr[childlparents],  specified  at  each  node  in  the  network. \nFor  networks  of  binary  random  variables,  so-called  transfer  functions  provide  a \nconvenient  way  to  parameterize conditional probability tables  (CPTs).  
A transfer function is a mapping $f: [-\infty,\infty] \rightarrow [0,1]$ that is everywhere differentiable and satisfies $f'(x) \ge 0$ for all $x$ (thus, $f$ is nondecreasing). If $f'(x) \le \alpha$ for all $x$, we say that $f$ has slope $\alpha$. Common examples of transfer functions of bounded slope include the sigmoid $f(x) = 1/(1+e^{-x})$, the cumulative Gaussian $f(x) = \int_{-\infty}^{x} dt\, e^{-t^2}/\sqrt{\pi}$, and the noisy-OR $f(x) = 1 - e^{-x}$. Because the value of a transfer function $f$ is bounded between 0 and 1, it can be interpreted as the conditional probability that a binary random variable takes on a particular value. One use of transfer functions is to endow multilayer networks of soft-thresholding computing elements with probabilistic semantics. This motivates the following definition:

Definition 1  For a transfer function $f$, a layered probabilistic $f$-network has:

• Nodes representing binary variables $\{X_i^\ell\}$, $\ell = 1, \ldots, L$ and $i = 1, \ldots, N$. Thus, $L$ is the number of layers, and each layer contains $N$ nodes.

• For every pair of nodes $X_j^{\ell-1}$ and $X_i^\ell$ in adjacent layers, a real-valued weight $\theta_{ij}^{\ell-1}$ from $X_j^{\ell-1}$ to $X_i^\ell$.

• For every node $X_i^1$ in the first layer, a bias $p_i$.

We will sometimes refer to nodes in layer 1 as inputs, and to nodes in layer $L$ as outputs. A layered probabilistic $f$-network defines a joint probability distribution over all of the variables $\{X_i^\ell\}$ as follows: each input node $X_i^1$ is independently set to 1 with probability $p_i$, and to 0 with probability $1 - p_i$. Inductively, given binary values $X_j^{\ell-1} = x_j^{\ell-1} \in \{0,1\}$ for all of the nodes in layer $\ell - 1$, the node $X_i^\ell$ is set to 1 with probability $f\left(\sum_{j=1}^{N} \theta_{ij}^{\ell-1} x_j^{\ell-1}\right)$.

Among other uses, multilayer networks of this form have been studied as hierarchical generative models of sensory data (Hinton et al., 1995). In such applications, the fundamental computational problem (known as inference) is that of estimating the marginal probability of evidence at some number of output nodes, say the first $K \le N$. (The computation of conditional probabilities, such as diagnostic queries, can be reduced to marginals via Bayes rule.) More precisely, one wishes to estimate $\Pr[X_1^L = x_1, \ldots, X_K^L = x_K]$ (where $x_i \in \{0,1\}$), a quantity whose exact computation involves an exponential sum over all the possible settings of the uninstantiated nodes in layers 1 through $L - 1$, and is known to be computationally intractable (Cooper, 1990).
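For concreteness, the following Python sketch (ours, not part of the original development) illustrates Definition 1 by drawing joint samples from a layered probabilistic $f$-network with the sigmoid transfer function and forming a naive Monte Carlo estimate of an output marginal. The function names, network sizes, and random weights bounded by $\tau/N$ are illustrative assumptions only.

# Illustrative sketch (not from the paper): sampling a layered probabilistic f-network.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_network(p, thetas, f=sigmoid, rng=None):
    """Draw one joint sample {x_i^l}. p: layer-1 biases (length N); thetas: list of
    L-1 weight matrices, thetas[k][i, j] = weight into node i of layer k+2 from node j
    of layer k+1."""
    rng = np.random.default_rng() if rng is None else rng
    x = (rng.random(len(p)) < p).astype(float)          # layer 1: independent Bernoullis
    layers = [x]
    for theta in thetas:                                 # layers 2..L, one layer at a time
        probs = f(theta @ x)                             # Pr[X_i = 1 | previous layer]
        x = (rng.random(theta.shape[0]) < probs).astype(float)
        layers.append(x)
    return layers

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, L, tau = 20, 3, 1.0                               # illustrative sizes only
    p = rng.uniform(0.2, 0.8, size=N)
    thetas = [rng.uniform(-tau / N, tau / N, size=(N, N)) for _ in range(L - 1)]
    # Naive Monte Carlo estimate of the output marginal Pr[X_1^L = 1].
    draws = [sample_network(p, thetas, rng=rng)[-1][0] for _ in range(5000)]
    print("Monte Carlo estimate of Pr[X_1^L = 1]:", np.mean(draws))

Exact computation of the same marginal requires summing over all $2^{N(L-1)}$ configurations of the uninstantiated nodes, which is precisely the intractability addressed by the bounds developed next.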
3 Large Deviation and Union Bounds

One of our main weapons will be the theory of large deviations. As a first illustration of this theory, consider the input nodes $\{X_j^1\}$ (which are independently set to 0 or 1 according to their biases $p_j$) and the weighted sum $\sum_{j=1}^{N} \theta_{ij}^1 X_j^1$ that feeds into the $i$th node $X_i^2$ in the second layer. A typical large deviation bound (Kearns & Saul, 1998) states that for all $\epsilon > 0$,

  $\Pr\left[\,\left|\sum_{j=1}^{N} \theta_{ij}^1 (X_j^1 - p_j)\right| > \epsilon\,\right] \;\le\; 2 e^{-2\epsilon^2/(N\Theta^2)},$

where $\Theta$ is the largest weight in the network. If we make the scaling assumption that each weight $\theta_{ij}^1$ is bounded by $\tau/N$ for some constant $\tau$ (thus, $\Theta \le \tau/N$), then we see that the probability of large (order 1) deviations of this weighted sum from its mean decays exponentially with $N$. (Our methods can also provide results under the weaker assumption that all weights are bounded by $O(N^{-a})$ for $a > 1/2$.)

How can we apply this observation to the problem of inference? Suppose we are interested in the marginal probability $\Pr[X_i^2 = 1]$. Then the large deviation bound tells us that with probability at least $1 - \delta$ (where we define $\delta = 2e^{-2N\epsilon^2/\tau^2}$), the weighted sum at node $X_i^2$ will be within $\epsilon$ of its mean value $\mu_i = \sum_{j=1}^{N} \theta_{ij}^1 p_j$. Thus, with probability at least $1 - \delta$, we are assured that $\Pr[X_i^2 = 1]$ is at least $f(\mu_i - \epsilon)$ and at most $f(\mu_i + \epsilon)$. Of course, the flip side of the large deviation bound is that with probability at most $\delta$, the weighted sum may fall more than $\epsilon$ away from $\mu_i$. In this case we can make no guarantees on $\Pr[X_i^2 = 1]$ aside from the trivial lower and upper bounds of 0 and 1. Combining both eventualities, however, we obtain the overall bounds:

  $(1 - \delta)\, f(\mu_i - \epsilon) \;\le\; \Pr[X_i^2 = 1] \;\le\; (1 - \delta)\, f(\mu_i + \epsilon) + \delta.$   (1)

Equation (1) is based on a simple two-point approximation to the distribution over the weighted sum of inputs, $\sum_{j=1}^{N} \theta_{ij}^1 X_j^1$. This approximation places one point, with weight $1 - \delta$, at either $\epsilon$ above or below the mean $\mu_i$ (depending on whether we are deriving the upper or lower bound); and the other point, with weight $\delta$, at either $-\infty$ or $+\infty$. The value of $\delta$ depends on the choice of $\epsilon$: in particular, as $\epsilon$ becomes smaller, we give more weight to the $\pm\infty$ point, with the trade-off governed by the large deviation bound. We regard the weight given to the $\pm\infty$ point as a throw-away probability, since with this weight we resort to the trivial bounds of 0 or 1 on the marginal probability $\Pr[X_i^2 = 1]$.

Note that the very simple bounds in Equation (1) already exhibit an interesting trade-off, governed by the choice of the parameter $\epsilon$: as $\epsilon$ becomes smaller, the throw-away probability $\delta$ becomes larger, while the terms $f(\mu_i \pm \epsilon)$ converge to the same value. Since the overall bounds involve products of $f(\mu_i \pm \epsilon)$ and $1 - \delta$, the optimal value of $\epsilon$ is the one that balances this competition between probable explanations of the evidence and improbable deviations from the mean. This trade-off is reminiscent of that encountered between energy and entropy in mean-field approximations for symmetric networks (Hertz et al., 1991).
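The trade-off just described is easy to see numerically. The sketch below (ours; it assumes the sigmoid transfer function and arbitrary illustrative sizes) evaluates the bounds of Equation (1) over a grid of $\epsilon$ and reports the value that maximizes the lower bound; the upper bound can be tightened by a separate search in the same way.

# Illustrative sketch: the two-point bounds of Equation (1) and the trade-off in eps.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def eq1_bounds(theta_i, p, eps, tau, f=sigmoid):
    """Lower and upper bounds on Pr[X_i^2 = 1] from Equation (1)."""
    N = len(p)
    mu_i = theta_i @ p                                    # mean of the incoming weighted sum
    delta = 2.0 * np.exp(-2.0 * N * eps**2 / tau**2)      # throw-away probability
    lower = max(0.0, (1.0 - delta) * f(mu_i - eps))
    upper = min(1.0, (1.0 - delta) * f(mu_i + eps) + delta)
    return lower, upper

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    N, tau = 200, 1.0                                     # illustrative sizes only
    p = rng.uniform(0.2, 0.8, size=N)
    theta_i = rng.uniform(-tau / N, tau / N, size=N)      # weights into node X_i^2
    # Small eps: tight f(mu +/- eps) but large delta; large eps: the reverse.
    grid = np.linspace(0.01, 1.0, 100)
    best = max(grid, key=lambda e: eq1_bounds(theta_i, p, e, tau)[0])
    lo, hi = eq1_bounds(theta_i, p, best, tau)
    print(f"eps = {best:.3f}: {lo:.4f} <= Pr[X_i^2 = 1] <= {hi:.4f}")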
So far we have considered the marginal probability involving a single node in the second layer. We can also compute bounds on the marginal probabilities involving $K > 1$ nodes in this layer (which without loss of generality we take to be the nodes $X_1^2$ through $X_K^2$). This is done by considering the probability that one or more of the weighted sums entering these $K$ nodes in the second layer deviate by more than $\epsilon$ from their means. We can upper bound this probability by $K\delta$ by appealing to the so-called union bound, which simply states that the probability of a union of events is bounded by the sum of their individual probabilities. The union bound allows us to bound marginal probabilities involving multiple variables. For example, consider the marginal probability $\Pr[X_1^2 = 1, \ldots, X_K^2 = 1]$. Combining the large deviation and union bounds, we find:

  $(1 - K\delta) \prod_{i=1}^{K} f(\mu_i - \epsilon) \;\le\; \Pr[X_1^2 = 1, \ldots, X_K^2 = 1] \;\le\; (1 - K\delta) \prod_{i=1}^{K} f(\mu_i + \epsilon) + K\delta.$   (2)

A number of observations are in order here. First, Equation (2) directly leads to efficient algorithms for computing the upper and lower bounds. Second, although for simplicity we have considered $\epsilon$-deviations of the same size at each node in the second layer, the same methods apply to different choices of $\epsilon_i$ (and therefore $\delta_i$) at each node. Indeed, variations in $\epsilon_i$ can lead to significantly tighter bounds, and thus we exploit the freedom to choose different $\epsilon_i$ in the rest of the paper. This results, for example, in bounds of the form:

  $\left(1 - \sum_{i=1}^{K} \delta_i\right) \prod_{i=1}^{K} f(\mu_i - \epsilon_i) \;\le\; \Pr[X_1^2 = 1, \ldots, X_K^2 = 1], \quad \text{where } \delta_i = 2e^{-2N\epsilon_i^2/\tau^2}.$   (3)

The reader is invited to study the small but important differences between this lower bound and the one in Equation (2). Third, the arguments leading to bounds on the marginal probability $\Pr[X_1^2 = 1, \ldots, X_K^2 = 1]$ generalize in a straightforward manner to other patterns of evidence besides all 1's. For instance, again just considering the lower bound, we have:

  $\left(1 - \sum_{i=1}^{K} \delta_i\right) \prod_{i:\, x_i = 0} \left[1 - f(\mu_i + \epsilon_i)\right] \prod_{i:\, x_i = 1} f(\mu_i - \epsilon_i) \;\le\; \Pr[X_1^2 = x_1, \ldots, X_K^2 = x_K]$   (4)

where $x_i \in \{0,1\}$ are arbitrary binary values. Thus together the large deviation and union bounds provide the means to compute upper and lower bounds on the marginal probabilities over nodes in the second layer. Further details and consequences of these bounds for the special case of two-layer networks are given in a companion paper (Kearns & Saul, 1998); our interest here, however, is in the more challenging generalization to multilayer networks.
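A short sketch (ours, not the authors' code) of how Equations (2)-(4) translate into a computation: given the weights into $K$ evidence nodes, it evaluates the lower bound of Equation (4) and, by symmetry with Equation (2), the corresponding upper bound. Per-node $\epsilon_i$ are allowed but are simply set equal here for brevity.

# Illustrative sketch: second-layer bounds for an arbitrary evidence pattern,
# in the spirit of Equation (4) and its upper-bound analogue.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def second_layer_bounds(theta, p, evidence, eps, tau, f=sigmoid):
    """theta: K x N weights into the K evidence nodes; p: layer-1 biases;
    evidence: 0/1 values x_1..x_K; eps: per-node deviations (length K)."""
    N = len(p)
    mu = theta @ p                                     # means of the incoming sums
    delta = 2.0 * np.exp(-2.0 * N * eps**2 / tau**2)   # per-node throw-away probabilities
    miss = min(1.0, delta.sum())                       # union bound over the K nodes
    on = evidence.astype(bool)
    lower = (1.0 - miss) * np.prod(f(mu[on] - eps[on])) * np.prod(1.0 - f(mu[~on] + eps[~on]))
    upper = (1.0 - miss) * np.prod(f(mu[on] + eps[on])) * np.prod(1.0 - f(mu[~on] - eps[~on])) + miss
    return max(0.0, lower), min(1.0, upper)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    N, K, tau = 200, 3, 1.0                            # illustrative sizes only
    p = rng.uniform(0.2, 0.8, size=N)
    theta = rng.uniform(-tau / N, tau / N, size=(K, N))
    evidence = np.array([1, 0, 1])
    eps = np.full(K, 0.25)
    print(second_layer_bounds(theta, p, evidence, eps, tau))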
4 Multilayer Networks: Inference via Induction

In extending the ideas of the previous section to multilayer networks, we face the problem that the nodes in the second layer, unlike those in the first, are not independent. But we can still adopt an inductive strategy to derive bounds on marginal probabilities. The crucial observation is that conditioned on the values of the incoming weighted sums at the nodes in the second layer, the variables $\{X_i^2\}$ do become independent. More generally, conditioned on these weighted sums all falling "near" their means (an event whose probability we quantified in the last section), the nodes $\{X_i^2\}$ become "almost" independent. It is exactly this near-independence that we now formalize and exploit inductively to compute bounds for multilayer networks. The first tool we require is an appropriate generalization of the large deviation bound, which does not rely on precise knowledge of the means of the random variables being summed.

Theorem 1  For all $1 \le j \le N$, let $X_j \in \{0,1\}$ denote independent binary random variables, and let $|\tau_j| \le \tau$. Suppose that the means are bounded by $|\mathbf{E}[X_j] - p_j| \le \Delta_j$, where $0 < \Delta_j \le p_j \le 1 - \Delta_j$. Then for all $\epsilon > \frac{1}{N}\sum_{j=1}^{N} |\tau_j| \Delta_j$,

  $\Pr\left[\,\left|\frac{1}{N}\sum_{j=1}^{N} \tau_j (X_j - p_j)\right| > \epsilon\,\right] \;\le\; 2 \exp\left[-\frac{2N}{\tau^2}\left(\epsilon - \frac{1}{N}\sum_{j=1}^{N} |\tau_j| \Delta_j\right)^{2}\right].$   (5)

The proof of this result is omitted due to space considerations. Now for induction, consider the nodes in the $\ell$th layer of the network. Suppose we are told that for every $i$, the weighted sum $\sum_{j=1}^{N} \theta_{ij}^{\ell-1} X_j^{\ell-1}$ entering into the node $X_i^\ell$ lies in the interval $[\mu_i^\ell - \epsilon_i^\ell, \mu_i^\ell + \epsilon_i^\ell]$, for some choice of the $\mu_i^\ell$ and the $\epsilon_i^\ell$. Then the mean of node $X_i^\ell$ is constrained to lie in the interval $[p_i^\ell - \Delta_i^\ell, p_i^\ell + \Delta_i^\ell]$, where

  $p_i^\ell = \tfrac{1}{2}\left[f(\mu_i^\ell - \epsilon_i^\ell) + f(\mu_i^\ell + \epsilon_i^\ell)\right]$   (6)
  $\Delta_i^\ell = \tfrac{1}{2}\left[f(\mu_i^\ell + \epsilon_i^\ell) - f(\mu_i^\ell - \epsilon_i^\ell)\right].$   (7)

Here we have simply run the leftmost and rightmost allowed values for the incoming weighted sums through the transfer function, and defined the interval around the mean of unit $X_i^\ell$ to be centered around $p_i^\ell$. Thus we have translated uncertainties on the incoming weighted sums to layer $\ell$ into conditional uncertainties on the means of the nodes $X_i^\ell$ in layer $\ell$. To complete the cycle, we now translate these into conditional uncertainties on the incoming weighted sums to layer $\ell + 1$. In particular, conditioned on the original intervals $[\mu_i^\ell - \epsilon_i^\ell, \mu_i^\ell + \epsilon_i^\ell]$, what is the probability that for each $i$, $\sum_{j=1}^{N} \theta_{ij}^\ell X_j^\ell$ lies inside some new interval $[\mu_i^{\ell+1} - \epsilon_i^{\ell+1}, \mu_i^{\ell+1} + \epsilon_i^{\ell+1}]$? In order to make some guarantee on this probability, we set $\mu_i^{\ell+1} = \sum_{j=1}^{N} \theta_{ij}^\ell p_j^\ell$ and assume that $\epsilon_i^{\ell+1} > \sum_{j=1}^{N} |\theta_{ij}^\ell| \Delta_j^\ell$. These conditions suffice to ensure that the new intervals contain the (conditional) expected values of the weighted sums $\sum_{j=1}^{N} \theta_{ij}^\ell X_j^\ell$, and that the new intervals are large enough to encompass the incoming uncertainties. Because these conditions are a minimal requirement for establishing any probabilistic guarantees, we shall say that the $[\mu_i^\ell - \epsilon_i^\ell, \mu_i^\ell + \epsilon_i^\ell]$ define a valid set of $\epsilon$-intervals if they meet these conditions for all $1 \le i \le N$. Given a valid set of $\epsilon$-intervals at the $(\ell+1)$th layer, it follows from Theorem 1 and the union bound that the weighted sums entering nodes in layer $\ell + 1$ obey

  $\Pr\left[\,\left|\sum_{j=1}^{N} \theta_{ij}^\ell X_j^\ell - \mu_i^{\ell+1}\right| > \epsilon_i^{\ell+1} \text{ for some } 1 \le i \le N\,\right] \;\le\; \sum_{i=1}^{N} \delta_i^{\ell+1},$   (8)

where

  $\delta_i^{\ell+1} = 2 \exp\left[-\frac{2N}{\tau^2}\left(\epsilon_i^{\ell+1} - \sum_{j=1}^{N} |\theta_{ij}^\ell| \Delta_j^\ell\right)^{2}\right].$   (9)

In what follows, we shall frequently make use of the fact that the weighted sums $\sum_{j=1}^{N} \theta_{ij}^\ell X_j^\ell$ are bounded by the intervals $[\mu_i^{\ell+1} - \epsilon_i^{\ell+1}, \mu_i^{\ell+1} + \epsilon_i^{\ell+1}]$. This motivates the following definitions.

Definition 2  Given a valid set of $\epsilon$-intervals and binary values $\{X_i^\ell = x_i^\ell\}$ for the nodes in the $\ell$th layer, we say that the $(\ell+1)$st layer of the network satisfies its $\epsilon$-intervals if $\left|\sum_{j=1}^{N} \theta_{ij}^\ell x_j^\ell - \mu_i^{\ell+1}\right| < \epsilon_i^{\ell+1}$ for all $1 \le i \le N$. Otherwise, we say that the $(\ell+1)$st layer violates its $\epsilon$-intervals.
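The inductive step described above is summarized in the following sketch (ours, not the authors' code). Given intervals on the weighted sums entering layer $\ell$, it applies Equations (6)-(7) to obtain intervals on the conditional means, proposes intervals for layer $\ell + 1$ whose widths exceed the minimal valid width by a fixed margin (an arbitrary illustrative choice), and computes the corresponding throw-away probability from Equations (8)-(9).

# Illustrative sketch: one inductive step of the epsilon-interval propagation,
# following Equations (6)-(9); the margin added to the minimal valid width is arbitrary.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def propagate_intervals(theta, mu, eps, tau, margin, f=sigmoid):
    """theta: N x N weights from layer l to layer l+1; mu, eps: current intervals
    [mu_i - eps_i, mu_i + eps_i] on the weighted sums into layer l.
    Returns (mu_next, eps_next, delta_next) for layer l+1."""
    N = theta.shape[1]
    p_l     = 0.5 * (f(mu + eps) + f(mu - eps))      # Eq. (6): centre of the mean interval
    Delta_l = 0.5 * (f(mu + eps) - f(mu - eps))      # Eq. (7): half-width of the mean interval
    mu_next  = theta @ p_l                           # new interval centres
    floor    = np.abs(theta) @ Delta_l               # minimal width required for validity
    eps_next = floor + margin                        # any eps_next > floor is valid
    # Eq. (9): per-node violation probabilities; Eq. (8): union bound over the N nodes.
    delta_i  = 2.0 * np.exp(-(2.0 * N / tau**2) * (eps_next - floor) ** 2)
    return mu_next, eps_next, min(1.0, delta_i.sum())

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    N, L, tau, margin = 200, 4, 1.0, 0.2             # illustrative sizes only
    p1 = rng.uniform(0.2, 0.8, size=N)
    thetas = [rng.uniform(-tau / N, tau / N, size=(N, N)) for _ in range(L - 1)]
    # Layer 2: the incoming means are exactly known (Delta = 0), so start from the Section 3 bound.
    mu, eps = thetas[0] @ p1, np.full(N, margin)
    deltas = [min(1.0, N * 2.0 * np.exp(-2.0 * N * margin**2 / tau**2))]
    for theta in thetas[1:]:
        mu, eps, d = propagate_intervals(theta, mu, eps, tau, margin)
        deltas.append(d)
    print("per-layer violation bounds:", np.round(deltas, 6))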
Suppose that we are given a valid set of $\epsilon$-intervals and that we sample from the joint distribution defined by the probabilistic $f$-network. The right hand side of Equation (8) provides an upper bound on the conditional probability that the $(\ell+1)$st layer violates its $\epsilon$-intervals, given that the $\ell$th layer did not. This upper bound may be vacuous (that is, larger than 1), so let us denote by $\delta^{\ell+1}$ whichever is smaller, the right hand side of Equation (8) or 1; in other words, $\delta^{\ell+1} = \min\{\sum_{i=1}^{N} \delta_i^{\ell+1}, 1\}$. Since at the $\ell$th layer, the probability of violating the $\epsilon$-intervals is at most $\delta^\ell$, we are guaranteed that with probability at least $\prod_{\ell > 1} [1 - \delta^\ell]$, all the layers satisfy their $\epsilon$-intervals. Conversely, we are guaranteed that the probability that any layer violates its $\epsilon$-intervals is at most $1 - \prod_{\ell > 1} [1 - \delta^\ell]$. Treating this as a throw-away probability, we can now compute upper and lower bounds on marginal probabilities involving nodes at the $L$th layer exactly as in the case of nodes at the second layer. This yields the following theorem.

Theorem 2  For any subset $\{X_1^L, \ldots, X_K^L\}$ of the outputs of a probabilistic $f$-network, for any setting $x_1, \ldots, x_K$, and for any valid set of $\epsilon$-intervals, the marginal probability of partial evidence in the output layer obeys:

  $\prod_{\ell > 1} \left[1 - \delta^\ell\right] \prod_{i:\, x_i = 1} f(\mu_i^L - \epsilon_i^L) \prod_{i:\, x_i = 0} \left[1 - f(\mu_i^L + \epsilon_i^L)\right] \;\le\; \Pr[X_1^L = x_1, \ldots, X_K^L = x_K]$   (10)

  $\Pr[X_1^L = x_1, \ldots, X_K^L = x_K] \;\le\; \prod_{\ell > 1} \left[1 - \delta^\ell\right] \prod_{i:\, x_i = 1} f(\mu_i^L + \epsilon_i^L) \prod_{i:\, x_i = 0} \left[1 - f(\mu_i^L - \epsilon_i^L)\right] + \left(1 - \prod_{\ell > 1} \left[1 - \delta^\ell\right]\right).$   (11)

Theorem 2 generalizes our earlier results for marginal probabilities over nodes in the second layer; for example, compare Equation (10) to Equation (4). Again, the upper and lower bounds can be efficiently computed for all common transfer functions.

5 Rates of Convergence

To demonstrate the power of Theorem 2, we consider how the gap (or additive difference) between these upper and lower bounds on $\Pr[X_1^L = x_1, \ldots, X_K^L = x_K]$ behaves for some crude (but informed) choices of the $\{\epsilon_i^\ell\}$. Our goal is to derive the rate at which these upper and lower bounds converge to the same value as we examine larger and larger networks. Suppose we choose the $\epsilon$-intervals inductively by defining $\Delta_j^1 = 0$ and setting

  $\epsilon_i^{\ell+1} = \sum_{j=1}^{N} |\theta_{ij}^\ell| \Delta_j^\ell + \sqrt{\frac{\gamma \tau^2 \ln N}{N}}$   (12)

for some $\gamma > 1$. From Equations (8) and (9), this choice gives $\delta^{\ell+1} \le 2N^{1-2\gamma}$ as an upper bound on the probability that the $(\ell+1)$th layer violates its $\epsilon$-intervals. Moreover, denoting the gap between the upper and lower bounds in Theorem 2 by $G$, it can be shown that:

  $G \;\le\; 2\alpha \sqrt{\frac{\gamma \tau^2 \ln N}{N}} \left[\frac{(\alpha\tau)^{L-1} - 1}{\alpha\tau - 1}\right] \sum_{i=1}^{K} \prod_{j \ne i} \max\left\{f(\mu_j^L + \epsilon_j^L),\; 1 - f(\mu_j^L - \epsilon_j^L)\right\} \;+\; \frac{2L}{N^{2\gamma-1}}.$   (13)

Let us briefly recall the definitions of the parameters on the right hand side of this equation: $\alpha$ is the maximal slope of the transfer function $f$, $N$ is the number of nodes in each layer, $K$ is the number of nodes with evidence, $\tau = N\Theta$ is $N$ times the largest weight in the network, $L$ is the number of layers, and $\gamma > 1$ is a parameter at our disposal. The first term of this bound essentially has a $1/\sqrt{N}$ dependence on $N$, but is multiplied by a damping factor that we might typically expect to decay exponentially with the number $K$ of outputs examined. To see this, simply notice that each of the factors $f(\mu_j^L + \epsilon_j^L)$ and $[1 - f(\mu_j^L - \epsilon_j^L)]$ is bounded by 1; furthermore, since all the means $\mu_j^L$ are bounded, if $N$ is large compared to $\gamma$ then the $\epsilon_j^L$ are small, and each of these factors is in fact bounded by some value $\beta < 1$. Thus the first term in Equation (13) is bounded by a constant times $\beta^{K-1} K \sqrt{\ln(N)/N}$. Since it is natural to expect the marginal probability of interest itself to decrease exponentially with $K$, this is desirable and natural behavior.

Of course, in the case of large $K$, the behavior of the resulting overall bound can be dominated by the second term $2L/N^{2\gamma-1}$ of Equation (13). In such situations, however, we can consider larger values of $\gamma$, possibly even of order $K$; indeed, for sufficiently large $\gamma$, the first term (which scales like $\sqrt{\gamma}$) must necessarily overtake the second one. Thus there is a clear trade-off between the two terms, as well as an optimal value of $\gamma$ that sets them to be (roughly) the same magnitude. Generally speaking, for fixed $K$ and large $N$, we observe that the difference between our upper and lower bounds on $\Pr[X_1^L = x_1, \ldots, X_K^L = x_K]$ vanishes as $O(\sqrt{\ln(N)/N})$.
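A small numerical illustration (ours) of the prescription in Equation (12): using the facts that $\Delta_j^\ell \le \alpha\, \epsilon_j^\ell$ (since $f$ has slope $\alpha$) and $\sum_j |\theta_{ij}^\ell| \le \tau$, the recursion below tracks an upper bound on the final interval widths and on the total throw-away probability as $N$ grows. The particular values of $\alpha$, $\tau$, $L$, and $\gamma$ are illustrative choices.

# Illustrative sketch: scaling of the Equation (12) prescription with network size N.
# alpha = 1/4 is the slope of the sigmoid; other constants are arbitrary.
import numpy as np

alpha, tau, L, gamma = 0.25, 1.0, 4, 1.5

for N in [100, 1000, 10000, 100000]:
    step = np.sqrt(gamma * tau**2 * np.log(N) / N)   # width added per layer by Eq. (12)
    eps = 0.0                                        # Delta_j^1 = 0 at the input layer
    for _ in range(L - 1):
        eps = alpha * tau * eps + step               # eps^{l+1} <= alpha*tau*eps^l + step
    throwaway = 1.0 - (1.0 - min(1.0, 2.0 * N ** (1 - 2 * gamma))) ** (L - 1)
    print(f"N = {N:6d}:  eps^L <= {eps:.4f},  throw-away <= {throwaway:.2e}")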
6 An Algorithm for Fixed Multilayer Networks

We conclude by noting that the specific choices made for the parameters $\epsilon_i^\ell$ in Section 5 to derive rates of convergence may be far from the optimal choices for a fixed network of interest. However, Theorem 2 directly suggests a natural algorithm for approximate probabilistic inference. In particular, regarding the upper and lower bounds on $\Pr[X_1^L = x_1, \ldots, X_K^L = x_K]$ as functions of $\{\epsilon_i^\ell\}$, we can optimize these bounds by standard numerical methods. For the upper bound, we may perform gradient descent in the $\{\epsilon_i^\ell\}$ to find a local minimum, while for the lower bound, we may perform gradient ascent to find a local maximum. The components of these gradients in both cases are easily computable for all the commonly studied transfer functions. Moreover, the constraint of maintaining valid $\epsilon$-intervals can be enforced by maintaining a floor on the $\epsilon$-intervals in one layer in terms of those at the previous one. The practical application of this algorithm to interesting Bayesian networks will be studied in future work.
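As a rough illustration of this idea (ours, and deliberately simplified): rather than per-node gradient ascent, the sketch below optimizes a single margin per layer by coordinate search, enforcing validity exactly as described by adding the margin to the floor $\sum_j |\theta_{ij}^\ell| \Delta_j^\ell$. It maximizes only the Theorem 2 lower bound; minimizing the upper bound is analogous.

# Illustrative sketch: numerical optimization of the Theorem 2 lower bound over
# the interval margins. One margin per layer and a crude coordinate search are
# simplifications of the gradient-based scheme described above.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lower_bound(margins, p1, thetas, evidence, tau, f=sigmoid):
    """Theorem 2 lower bound with per-layer margins added to the minimal valid widths."""
    N = len(p1)
    mu, eps = thetas[0] @ p1, np.full(N, margins[0])
    one_minus = 1.0 - min(1.0, N * 2.0 * np.exp(-2.0 * N * margins[0]**2 / tau**2))
    for theta, m in zip(thetas[1:], margins[1:]):
        p_l = 0.5 * (f(mu + eps) + f(mu - eps))          # Eq. (6)
        Delta = 0.5 * (f(mu + eps) - f(mu - eps))        # Eq. (7)
        floor = np.abs(theta) @ Delta                    # validity floor for the next widths
        mu, eps = theta @ p_l, floor + m
        one_minus *= 1.0 - min(1.0, N * 2.0 * np.exp(-(2.0 * N / tau**2) * m**2))
    on = evidence.astype(bool)
    K = len(evidence)
    factors = np.where(on, f(mu[:K] - eps[:K]), 1.0 - f(mu[:K] + eps[:K]))
    return max(0.0, one_minus * np.prod(factors))

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    N, L, tau = 200, 3, 1.0                              # illustrative sizes only
    p1 = rng.uniform(0.2, 0.8, size=N)
    thetas = [rng.uniform(-tau / N, tau / N, size=(N, N)) for _ in range(L - 1)]
    evidence = np.array([1, 1, 0])
    margins = np.full(L - 1, 0.3)
    for _ in range(3):                                   # a few sweeps of coordinate search
        for l in range(L - 1):
            grid = np.linspace(0.05, 1.0, 40)
            scores = [lower_bound(np.r_[margins[:l], g, margins[l+1:]],
                                  p1, thetas, evidence, tau) for g in grid]
            margins[l] = grid[int(np.argmax(scores))]
    print("optimized margins:", margins,
          "lower bound:", lower_bound(margins, p1, thetas, evidence, tau))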
References

Cooper, G. (1990). Computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence 42:393-405.

Hertz, J., Krogh, A., & Palmer, R. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.

Hinton, G., Dayan, P., Frey, B., & Neal, R. (1995). The wake-sleep algorithm for unsupervised neural networks. Science 268:1158-1161.

Jordan, M., Ghahramani, Z., Jaakkola, T., & Saul, L. (1997). An introduction to variational methods for graphical models. In M. Jordan, ed., Learning in Graphical Models. Kluwer Academic.

Kearns, M., & Saul, L. (1998). Large deviation methods for approximate probabilistic inference. In Proceedings of the 14th Annual Conference on Uncertainty in Artificial Intelligence.

Neal, R. (1992). Connectionist learning of belief networks. Artificial Intelligence 56:71-113.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.