{"title": "Interpreting Images by Propagating Bayesian Beliefs", "book": "Advances in Neural Information Processing Systems", "page_first": 908, "page_last": 914, "abstract": null, "full_text": "Interpreting images by propagating \n\nBayesian beliefs \n\nYair Weiss \n\nDept.  of Brain  and Cognitive Sciences \nMassachusetts  Institute of Technology \nE10-120,  Cambridge, MA  02139,  USA \n\nyweiss<opsyche.mit.edu \n\nAbstract \n\nA central theme of computational vision research  has been  the  re(cid:173)\nalization that reliable estimation of local scene  properties requires \npropagating  measurements  across  the  image.  Many  authors  have \ntherefore  suggested  solving  vision  problems using  architectures  of \nlocally  connected  units updating their  activity in  parallel.  Unfor(cid:173)\ntunately, the convergence of traditional relaxation methods on such \narchitectures  has  proven  to  be  excruciatingly  slow  and  in  general \nthey  do  not  guarantee that the stable  point will  be  a  global mini(cid:173)\nmum. \nIn this paper we  show  that an  architecture  in  which  Bayesian  Be(cid:173)\nliefs  about  image  properties  are  propagated  between  neighboring \nunits  yields  convergence  times  which  are  several  orders  of magni(cid:173)\ntude faster  than traditional methods and  avoids  local  minima.  In \nparticular our architecture is  non-iterative in  the sense of Marr [5]: \nat every  time step,  the  local estimates  at  a  given  location are  op(cid:173)\ntimal given the information which  has already been  propagated to \nthat  location.  We  illustrate  the  algorithm's  performance  on  real \nimages and  compare it to several existing methods. \n\n1  Theory \n\nThe essence  of our approach is shown  in figure  1.  Figure 1a shows  the prototypical \nill-posed problem:  interpolation of a function from sparse  data.  Figure  1b  shows a \ntraditional relaxation  approach  to  the  problem:  a  dense  array of units  represents \nthe value of the interpolated function at discretely sampled points.  The activity of a \nunit is updated based on the local data (in those points where  data is available) and \nthe activity of the neighboring points.  As  discussed  below, the local update rule can \n\n\fInterpreting Images by Propagating Bayesian Beliefs \n\n909 \n\n/. \n\nr\u00b7\u00b7\u00b7_\u00b7 .. \u00b7_-_\u00b7 __ \u00b7 __ \u00b7 __ \u00b7\u00b7\u00b7_\u00b7 .. \u00b7\u00b7\u00b7_-_\u00b7_\u00b7\u00b7\u00b7_-_\u00b7\u00b7\u00b7\u00b7\u00b7 __ \u00b7\u00b7_-_\u00b7\u00b7\" '''--'''1 \n9 I \ni  y*? \nI  y  0-0-6--0-0-0  ! \n\ni .. ___ ._. __ . __ . __ ._. __ ..... _. __ . __ . _____ ..... _. ___ ._. ________ j \n\n0 \n\nr\u00b7\u00b7\u00b7--\u00b7-- _ \u00b7\u00b7\u00b7\u00b7_\u00b7\u00b7\u00b7_-_\u00b7\u00b7\u00b7\u00b7 __ \u00b7_\u00b7\u00b7\u00b7_\u00b7_\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7_\u00b7\u00b7\u00b7\u00b7_\u00b7\u00b7\u00b7\u00b7\u00b7-\u00b7\u00b7\u00b7_\u00b7\u00b7\u00b7\u00b7_\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7 .. \u00b7\u00b7_\u00b7\u00b7_\u00b7\u00b7\u00b7'1 \n\nI  6 \n\n!  y*O \n! \nI1U,cfX \n!  ll,crO .. O..  ..0 .. 0 .. 0  i \n. . . . . . . . . .   ! \n\n0 \n\n! \nI \ni \n\ni \n: \n;, ___  \u2022 __ \u2022 _ _ __ ._  ......... __ _  \u2022 \u2022\u2022 \u2022\u2022\u2022 __ __  \u2022 __ ............. _ \n\nI1P,~ \n\n! \n! \n...... ....... __  ._1 \n\n91 \n. \n\na \n\nb \n\nc \n\nFigure  1:  a.  a  prototypical  ill-posed  problem  h.  
As discussed below, the local update rule can be defined such that the network converges to a state in which the activity of each unit corresponds to the value of the globally optimal interpolating function. Figure 1c shows the Bayesian Belief Propagation (BBP) approach to the problem. As in the traditional approach, the function is represented by the activity of a dense array of units. However, the units transmit probabilities rather than single estimates to their neighbors and combine the probabilities according to the probability calculus.

Figure 1: a. A prototypical ill-posed problem. b. Traditional relaxation approach: a dense array of units represents the value of the interpolated function. Units update their activity based on local information and the activity of neighboring units. c. The Bayesian Belief Propagation (BBP) approach. Units transmit probabilities and combine them according to the probability calculus in two non-interacting streams.

To formalize the above discussion, let y_k represent the activity of a unit at location k, and let y^*_k be noisy samples from the true function. A typical interpolation problem would be to minimize:

J(Y) = \sum_k w_k (y_k - y^*_k)^2 + \lambda \sum_i (y_i - y_{i+1})^2    (1)

where we have defined w_k = 0 for grid points with no data, and w_k = 1 for points with data. Since J is quadratic, any local update in the direction of the gradient will converge to the optimal estimate. This yields updates of the sort:

y_k \leftarrow y_k + \eta_k \left[ \lambda \left( \frac{y_{k-1} + y_{k+1}}{2} - y_k \right) + w_k (y^*_k - y_k) \right]    (2)

Relaxation algorithms differ in their choice of \eta_k: \eta_k = 1/(\lambda + w_k) corresponds to Gauss-Seidel relaxation and \eta_k = 1.9/(\lambda + w_k) corresponds to successive over-relaxation (SOR), which is the method of choice for such problems [10].
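As an illustration, a minimal NumPy sketch of the relaxation update in equation 2 might look as follows. It is illustrative only: the function name, the treatment of the two boundary points (left untouched) and the default parameter values are arbitrary choices made for this sketch and are not part of the formulation above.

```python
import numpy as np

def relax(y_star, w, lam=1.0, n_iter=500, sor=True):
    """Sketch of the relaxation update of eq. 2 for the interpolation
    cost of eq. 1.  sor=False uses the Gauss-Seidel step 1/(lam + w_k);
    sor=True uses the over-relaxed step 1.9/(lam + w_k)."""
    y_star = np.asarray(y_star, float)
    w = np.asarray(w, float)
    y = y_star.copy()                      # initialize at the (noisy) data
    for _ in range(n_iter):
        for k in range(1, len(y) - 1):     # boundary points left untouched
            eta = (1.9 if sor else 1.0) / (lam + w[k])
            grad = lam * (0.5 * (y[k - 1] + y[k + 1]) - y[k]) \
                   + w[k] * (y_star[k] - y[k])
            y[k] += eta * grad             # eq. 2
    return y
```

Because each unit sees only its immediate neighbors, many sweeps are needed before distant data can influence a given estimate; this is the slowness that the BBP scheme below avoids.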
To derive a BBP update rule for this problem, note that minimizing J(Y) is equivalent to maximizing the posterior probability of Y given Y^* assuming the following generative model:

y^*_i = w_i y_i + \eta    (3)
y_{i+1} = y_i + \nu    (4)

where \nu \sim N(0, \sigma_R) and \eta \sim N(0, \sigma_D). The ratio of \sigma_D to \sigma_R plays a role similar to that of \lambda in the original cost functional.

The advantage of considering the cost functional as a posterior is that it enables us to use the methods of Hidden Markov Models, Bayesian Belief Nets and Optimal Estimation to derive local update rules (cf. [6, 7, 1]). Denote the posterior by P_i(u) = P(y_i = u | Y^*). The Markovian property allows us to factor P_i(u) into three terms: one depending on the local data, another depending on data to the left of i, and a third depending on data to the right of i. Thus:

P_i(u) = c \, \alpha_i(u) L_i(u) \beta_i(u)    (5)

where \alpha_i(u) = P(y_i = u | Y^*_{1..i-1}), \beta_i(u) = P(y_i = u | Y^*_{i+1..N}), L_i(u) = P(y^*_i | y_i = u) and c denotes a normalizing constant. Now, denoting the conditional C_i(u, v) = P(y_i = u | y_{i-1} = v), \alpha_i(u) can be written in terms of \alpha_{i-1}(v):

\alpha_i(u) = c \int_v \alpha_{i-1}(v) C_i(u, v) L_{i-1}(v) \, dv    (6)

where c denotes another normalizing constant. A symmetric equation can be written for \beta_i(u).

This suggests a propagation scheme where units represent the probabilities given in the left hand side of equations 5-6 and updates are based on the right hand side, i.e. on the activities of neighboring units. Specifically, for a Gaussian generating process the probabilities can be represented by their mean and variance. Thus denote P_i \sim N(\mu_i, \sigma_i), and similarly \alpha_i \sim N(\mu^\alpha_i, \sigma^\alpha_i) and \beta_i \sim N(\mu^\beta_i, \sigma^\beta_i). Performing the integration in 6 gives a Kalman-filter-like update for the parameters:

\mu_i \leftarrow \frac{(w_i/\sigma_D)\, y^*_i + \mu^\alpha_i/\sigma^\alpha_i + \mu^\beta_i/\sigma^\beta_i}{w_i/\sigma_D + 1/\sigma^\alpha_i + 1/\sigma^\beta_i}    (7)

\mu^\alpha_i \leftarrow \frac{(w_{i-1}/\sigma_D)\, y^*_{i-1} + \mu^\alpha_{i-1}/\sigma^\alpha_{i-1}}{w_{i-1}/\sigma_D + 1/\sigma^\alpha_{i-1}}    (8)

\sigma^\alpha_i \leftarrow \sigma_R + \left( \frac{1}{\sigma^\alpha_{i-1}} + \frac{w_{i-1}}{\sigma_D} \right)^{-1}    (9)

(the update rules for the parameters of \beta are analogous)
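To make the scheme concrete, here is a minimal NumPy sketch of the recursions of equations 8-9 run as two independent sweeps, one for \alpha and one for \beta, followed by the point-wise combination of equation 7. It is illustrative only: the function name, the flat boundary messages and the default variances are assumptions made for this sketch, not a prescribed implementation.

```python
import numpy as np

def gaussian_bbp(y_star, w, sigma_D=1.0, sigma_R=0.1):
    """Sketch of the BBP updates of eqs. 7-9 for the 1-D interpolation
    problem.  y_star: noisy samples; w: 1 where data is present, 0 elsewhere."""
    y_star = np.asarray(y_star, float)
    w = np.asarray(w, float)
    N = len(y_star)
    FLAT = 1e12                      # (almost) flat message at the boundary

    def sweep(order):
        mu, sig = np.zeros(N), np.full(N, FLAT)
        for j in range(1, N):
            k, p = order[j], order[j - 1]   # p is the neighbour sending the message
            prec = 1.0 / sig[p] + w[p] / sigma_D
            # eq. 8: fold the data at p into the incoming message ...
            mu[k] = (mu[p] / sig[p] + w[p] * y_star[p] / sigma_D) / prec
            # eq. 9: ... then add the process noise of the transition
            sig[k] = sigma_R + 1.0 / prec
        return mu, sig

    mu_a, sig_a = sweep(list(range(N)))              # alpha: left to right
    mu_b, sig_b = sweep(list(range(N - 1, -1, -1)))  # beta: right to left

    # eq. 7: combine the local data with the alpha and beta messages
    prec = w / sigma_D + 1.0 / sig_a + 1.0 / sig_b
    mu = (w * y_star / sigma_D + mu_a / sig_a + mu_b / sig_b) / prec
    return mu, 1.0 / prec
```

A single left-to-right and a single right-to-left sweep are enough for every unit to have seen all of the data, which is exactly the behavior described by the convergence argument of section 1.1 below.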
So far we have considered continuous estimation problems, but identical issues arise in labeling problems, where the task is to estimate a label L_k which can take on M discrete values. We will denote L_k(m) = 1 if the label takes on value m and zero otherwise. Typically one minimizes functionals of the form:

J(L) = \sum_k \sum_m V_k(m) L_k(m) - \lambda \sum_k \sum_m L_k(m) L_{k+1}(m)    (10)

Traditional relaxation labeling algorithms minimize this cost functional with updates of the form:

L_k(m) \leftarrow f\left( \lambda (L_{k-1}(m) + L_{k+1}(m)) - V_k(m) \right)    (11)

Again, different relaxation labeling algorithms differ in their choice of f. A linear sum followed by a threshold gives the discrete Hopfield network updates, a linear sum followed by a "soft" threshold gives the continuous or mean-field Hopfield updates, and yet another form gives the relaxation labeling algorithm of Rosenfeld et al. [8] (see [3] for a review of relaxation labeling methods).

To derive a BBP algorithm for this case one can again rewrite J as the posterior of a Markov generating process and calculate P(L_k(m) = 1) for this process.¹ This gives the same expressions as in equations 5-6 with the integral replaced by a linear sum. Since the probabilities here are not Gaussian, the \alpha_i, \beta_i, P_i will not be represented by their means and variances, but rather by a vector of length M. Thus the update rule for \alpha_i will be:

\alpha_i(m) = c \sum_v \alpha_{i-1}(v) C_i(m, v) L_{i-1}(v)    (12)

(and similarly for \beta_i)

¹For certain special cases, knowing P(L_k(m) = 1) is not sufficient for choosing the sequence of labels that minimizes J. In those cases one should do belief revision rather than propagation [6].

Figure 2: a. The first frame of a sequence. The hand is translated to the left. b. Contour extracted using standard methods.

1.1 Convergence

Equations 5-6 are mathematical identities. Hence, it is possible to show [6] that after N iterations the activity of units P_i will converge to the correct posteriors, where N is the maximal distance between any two units in the architecture, and an iteration refers to one update of all units. Furthermore, we have been able to show that after n < N iterations, the activity of unit P_i is guaranteed to represent the probability of the hidden state at location i given all data within distance n.

This guarantee is significant in the light of a distinction made by Marr [5] regarding local propagation rules. In a scheme where units only communicate with their neighbors, there is an obvious limit on how fast information can reach a given unit: after n iterations the unit can only know about information within distance n. Thus there is a minimal number of iterations required for all data to reach all units. Marr distinguished between two types of iterations - those that are needed to allow the information to reach the units, versus those that are used to refine an estimate based on information that has already arrived. The significance of the guarantee on P_i is that it shows that BBP only uses the first type of iteration - iterations are used only to allow more information to reach the units. Once the information has arrived, P_i represents the correct posterior given that information and no further iterations are needed to refine the estimate. Moreover, we have been able to show that propagation schemes that do not propagate probabilities (such as those in equation 2) will in general not represent the optimal estimate given information that has already arrived.

To summarize, both traditional relaxation updates as in equation 2 and BBP updates as in equations 7-9 give simple rules for updating a unit's activity based on local data and the activities of neighboring units. However, the fact that BBP updates are based on the probability calculus guarantees that a unit's activity will be optimal given information that has already arrived, and gives rise to a qualitative difference between the convergence of these two types of schemes. In the next section, we will demonstrate this difference in image interpretation problems.

2 Results

Figure 2a shows the first frame of a sequence in which the hand is translated to the left. Figure 2b shows the bounding contour of the hand extracted using standard techniques.

2.1 Motion propagation along contours

Local measurements along the contour are insufficient to determine the motion. Hildreth [2] suggested overcoming the local ambiguity by minimizing the following cost functional:

J(V) = \sum_k (dx_k \cdot v_k + dt_k)^2 + \lambda \sum_k \| v_{k+1} - v_k \|^2    (13)

where dx_k and dt_k denote the spatial and temporal image derivatives and v_k denotes the velocity at point k along the contour. This functional is analogous to the interpolation functional (eq. 1) and the derivations of the relaxation and BBP updates are also analogous.
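Although the vector-valued updates are not spelled out here, one natural reading of that analogy is to keep each message as a two-dimensional Gaussian in information form, so that the rank-one data term of equation 13 (the aperture problem) and the propagated messages combine simply by adding precisions. The sketch below is only an illustration of this reading; every name, the open-chain boundary handling and the default parameter values are assumptions made for the sketch.

```python
import numpy as np

def contour_motion_bbp(G, t, sigma_D=1.0, sigma_R=0.1):
    """Sketch of a BBP scheme for eq. 13.  G: (N, 2) spatial gradients
    at the contour points; t: (N,) temporal derivatives.  Messages are
    2-D Gaussians in information form (precision Lam, vector eta); the
    local data term contributes the rank-one precision g g^T / sigma_D."""
    N, I2, eps = len(t), np.eye(2), 1e-9
    L_lam = np.array([np.outer(g, g) / sigma_D for g in G])
    L_eta = np.array([-ti * g / sigma_D for g, ti in zip(G, t)])

    def sweep(order):
        lam, eta = np.zeros((N, 2, 2)), np.zeros((N, 2))
        for j in range(1, N):
            k, p = order[j], order[j - 1]
            # fold the data at p into the incoming message (add precisions) ...
            P = np.linalg.inv(lam[p] + L_lam[p] + eps * I2)
            m = P @ (eta[p] + L_eta[p])
            # ... then add the process noise of the transition v_k = v_p + noise
            lam[k] = np.linalg.inv(P + sigma_R * I2)
            eta[k] = lam[k] @ m
        return lam, eta

    a_lam, a_eta = sweep(list(range(N)))              # alpha messages
    b_lam, b_eta = sweep(list(range(N - 1, -1, -1)))  # beta messages

    # combine local data with alpha and beta (the vector analogue of eq. 7)
    post_lam = L_lam + a_lam + b_lam
    post_eta = L_eta + a_eta + b_eta
    return np.array([np.linalg.solve(Pl + eps * I2, e)
                     for Pl, e in zip(post_lam, post_eta)])
```

For simplicity the sketch treats the contour as an open chain; a closed contour would need either a cut or a few extra sweeps around the loop.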
Figure 3: a. Local estimate of velocity along the contour. b. Performance of SOR, gradient descent and BBP as a function of time. BBP converges orders of magnitude faster than SOR. c. Motion estimate of SOR after 500 iterations. d. Motion estimate of BBP after 3 iterations.

Figure 3a shows the estimate of motion based solely on local information. The estimates are wrong due to the aperture problem. Figure 3b shows the performance of three propagation schemes: gradient descent, SOR and BBP. Gradient descent converges so slowly that the improvement in its estimate cannot be discerned in the plot. SOR converges much faster than gradient descent but still has significant error after 500 iterations. BBP gets the correct estimate after 3 iterations! (Here and in all subsequent plots an iteration refers to one update of all units in the network.) This is due to the fact that after 3 iterations, the estimate at location k is the optimal one given data in the interval [k - 3, k + 3]. In this case, there is enough data in every such interval along the contour to correctly estimate the motion. Figure 3c shows the estimate produced by SOR after 500 iterations. Even with simple visual inspection it is evident that the estimate is quite wrong. Figure 3d shows the (correct) estimate produced by BBP after 3 iterations.

2.2 Direction of figure propagation

The extracted contour in figure 2 bounds a dark and a light region. Direction of figure (DOF) (e.g. [9]) refers to which of these two regions is figure and which is ground. A local cue for DOF is convexity - given three neighboring points along the contour we prefer the DOF that makes the angle defined by those points acute rather than obtuse. Figure 4a shows the results of using this local cue on the hand contour. The local cue is not sufficient.

Figure 4: a. Local estimate of DOF along the contour. b. Performance of Hopfield, gradient descent, relaxation labeling and BBP as a function of time. BBP is the only method that converges to the global minimum. c. DOF estimate of Hopfield net after convergence. d. DOF estimate of BBP after convergence.

We can overcome the local ambiguity by minimizing a cost functional that takes into account the DOF at neighboring points in addition to the local convexity. Denote by L_k(m) the DOF at point k along the contour and define

J(L) = \sum_k \sum_m V_k(m) L_k(m) - \lambda \sum_k \sum_m L_k(m) L_{k+1}(m)    (14)

with V_k(m) determined by the acuteness of the angle at location k.

Figure 4b shows the performance of four propagation algorithms on this task: three traditional relaxation labeling algorithms (MF Hopfield, Rosenfeld et al., constrained gradient descent) and BBP. All three traditional algorithms converge to a local minimum, while BBP converges to the global minimum. Figure 4c shows the local minimum reached by the Hopfield network and figure 4d shows the correct solution reached by the BBP algorithm. Recall (section 1.1) that BBP is guaranteed to converge to the correct posterior given all the data.
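For completeness, here is a minimal sketch of the discrete recursion of equation 12 applied to a labeling cost of the form of equations 10 and 14. The way the pairwise weight \lambda is turned into a label-transition matrix, as well as all names and defaults, are illustrative choices made for this sketch.

```python
import numpy as np

def discrete_bbp(V, lam=1.0):
    """Sketch of discrete BBP (eq. 12) for a labeling cost of the form of
    eqs. 10/14.  V: (N, M) array of local costs V_k(m).  Returns the (N, M)
    posterior label probabilities P_k(m)."""
    N, M = V.shape
    L = np.exp(-V)                       # local evidence, one row per point
    L /= L.sum(axis=1, keepdims=True)
    C = np.exp(lam * np.eye(M))          # transition matrix favouring agreement
    C /= C.sum(axis=1, keepdims=True)

    def sweep(order):
        msg = np.full((N, M), 1.0 / M)   # flat message at the boundary
        for j in range(1, N):
            k, p = order[j], order[j - 1]
            # eq. 12: alpha_k(m) = c * sum_v alpha_p(v) C(m, v) L_p(v)
            m = C @ (msg[p] * L[p])
            msg[k] = m / m.sum()
        return msg

    alpha = sweep(list(range(N)))
    beta = sweep(list(range(N - 1, -1, -1)))
    post = alpha * L * beta              # eq. 5, with the sum replacing the integral
    return post / post.sum(axis=1, keepdims=True)
```

For the DOF problem M = 2 and V_k(m) encodes the convexity cue at point k; as noted in the footnote in section 1, when point-wise posteriors are not enough one would switch to belief revision [6].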
2.3 Extensions to 2D

In the previous two examples ambiguity was reduced by combining information from other points on the same contour. There exist, however, cases when information should be propagated to all points in the image. Unfortunately, such propagation problems correspond to Markov Random Field (MRF) generative models, for which calculation of the posterior cannot be done efficiently. However, Willsky and his colleagues [4] have recently shown that MRFs can be approximated with hierarchical or multi-resolution models. In current work, we have been using the multi-resolution generative model to derive local BBP rules. In this case, the Bayesian beliefs are propagated between neighboring units in a pyramidal representation of the image. Although this work is still in preliminary stages, we find encouraging results in comparison with traditional 2D relaxation schemes.

3 Discussion

The update rules in equations 5-6 differ slightly from those derived by Pearl [6] in that the quantities \alpha, \beta are conditional probabilities and hence are constantly normalized to sum to unity. Using Pearl's original algorithm for sequences as long as the ones we are considering will lead to messages that become vanishingly small. Likewise, our update rules differ slightly from the forward-backward algorithm for HMMs [7] in that ours are based on the assumption that all states are equally likely a priori and hence the updates are symmetric in \alpha and \beta. Finally, equation 9 can be seen as a variant of a Riccati equation [1].

In addition to these minor notational differences, the context in which we use the update rules is different. While in HMMs and Kalman Filters the updates are seen as interim calculations toward calculating the posterior, we use these updates in a parallel network of local units and are interested in how the estimates of units in this network improve as a function of iteration. We have shown that an architecture that propagates Bayesian beliefs according to the probability calculus yields orders of magnitude improvements in convergence over traditional schemes that do not propagate probabilities. Thus image interpretation provides an important example of a task where it pays to be a Bayesian.

Acknowledgments

I thank E. Adelson, P. Dayan, J. Tenenbaum and G. Galperin for comments on versions of this manuscript, and M. I. Jordan for stimulating discussions and for introducing me to Bayesian nets. Supported by a training grant from NIGMS.

References

[1] Arthur Gelb, editor. Applied Optimal Estimation. MIT Press, 1974.
[2] E. C. Hildreth. The Measurement of Visual Motion. MIT Press, 1983.
[3] S. Z. Li. Markov Random Field Modeling in Computer Vision. Springer-Verlag, 1995.
[4] Mark R. Luettgen, W. Clem Karl, and Alan S. Willsky. Efficient multiscale regularization with application to the computation of optical flow. IEEE Transactions on Image Processing, 3(1):41-64, 1994.
[5] D. Marr. Vision. W. H. Freeman and Co., 1982.
[6] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[7] Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of Speech Recognition. PTR Prentice Hall, 1993.
[8] A. Rosenfeld, R. Hummel, and S. Zucker. Scene labeling by relaxation operations. IEEE Transactions on Systems, Man and Cybernetics, 6:420-433, 1976.
[9] P. Sajda and L. H. Finkel. Intermediate-level visual representations and the construction of surface perception. Journal of Cognitive Neuroscience, 1994.
[10] Gilbert Strang. Introduction to Applied Mathematics. Wellesley-Cambridge Press, 1986.
\n\n\f", "award": [], "sourceid": 1309, "authors": [{"given_name": "Yair", "family_name": "Weiss", "institution": null}]}