{"title": "Phase Transitions and the Perceptual Organization of Video Sequences", "book": "Advances in Neural Information Processing Systems", "page_first": 850, "page_last": 856, "abstract": null, "full_text": "Phase transitions and the perceptual \n\norganization of video sequences \n\nYair  Weiss \n\nDept.  of Brain and Cognitive Sciences \nMassachusetts Institute of Technology \n\nElO-120,  Cambridge, MA  02139 \nhttp://www-bcs.mit.edu;-yweiss \n\nAbstract \n\nEstimating  motion  in  scenes  containing  multiple  moving  objects \nremains  a  difficult  problem  in  computer  vision.  A  promising  ap(cid:173)\nproach  to this  problem  involves  using  mixture  models,  where  the \nmotion of each object is a component in the mixture.  However, ex(cid:173)\nisting methods  typically  require specifying in advance the number \nof components  in  the  mixture,  i.e.  the  number  of objects  in  the \nscene. \n\nHere  we  show  that  the  number  of objects can  be  estimated auto(cid:173)\nmatically in a maximum likelihood framework, given an assumption \nabout the level of noise in the video sequence.  We derive analytical \nresults  showing  the  number  of models  which  maximize  the  likeli(cid:173)\nhood for a given noise level in a given sequence.  We illustrate these \nresults on a real video sequence, showing how the phase transitions \ncorrespond to different  perceptual organizations of the scene. \n\n\u2022 \n\nFigure  la depicts  a  scene  where motion  estimation is  difficult  for  many  computer \nvision  systems.  A  semi-transparent  surface  partially  occludes  a  second  surface, \nand  the  camera is  translating  horizontally.  Figure  1 b  shows  a  slice  through  the \nhorizontal  component  of  the  motion  generated  by  the  camera  - points  that  are \ncloser  to  the  camera move  faster  than  those  further  away.  
In practice, the local motion information would be noisy, as shown in figure 1c, and this imposes conflicting demands on a motion analysis system: reliable estimates require pooling together many measurements while avoiding mixing together measurements derived from the two different surfaces.\n\nFigure 1: a: A simple scene that can cause problems for motion estimation. One surface partially occludes another surface. b: A cross section through the horizontal motion field generated when the camera translates horizontally. Points closer to the camera move faster. c: Noisy motion field. In practice each local measurement will be somewhat noisy and pooling of information is required. d: A cross section through the output of a multiple motion analysis system. Points are assigned to surfaces (denoted by different plot symbols) and the motion of each surface is estimated.\n\nFigure 2: The \"correct\" number of surfaces in a given scene is often ambiguous. Was the motion here generated by one or two surfaces? 
\n\nSignificant progress in the analysis of such scenes has been achieved by multiple motion analyzers: systems that simultaneously segment the scene into surfaces and estimate the motion of each surface [9]. Mixture models are a commonly used framework for performing multiple motion estimation [5, 1, 10]. Figure 1d shows a slice through the output of a multiple motion analyzer on this scene: pixels are assigned to one of two surfaces and motion information is only combined for pixels belonging to the same surface.\n\nThe output shown in figure 1d was obtained by assuming the scene contains two surfaces. In general, of course, one does not know the number of surfaces in the scene in advance. Figure 2 shows the difficulty in estimating this number. It is not clear whether this is very noisy data generated by a single surface, or less noisy data generated by two surfaces. There seems to be no reason to prefer one description over another. Indeed, the description where there are as many surfaces as pixels is also a valid interpretation of this data.\n\nHere we take the approach that there is no single \"correct\" number of surfaces for a given scene in the absence of any additional assumptions. However, given an assumption about the noise in the sequence, there are more likely and less likely interpretations. Intuitively, if we know that the data in figure 2a was taken with a very noisy camera, we would tend to prefer the one-surface solution: adding additional surfaces would cause us to fit the noise rather than the data. However, if we know that there is little noise in the sequence, we would prefer solutions that use many surfaces, since there is a lot less danger of \"overfitting\". 
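The ambiguity of figure 2 can be made concrete with a small numerical sketch (illustrative only; the data, constants and variable names below are our assumptions, not taken from the paper's experiments). The same one dimensional "velocities" can be explained either by one surface with a large residual or by two surfaces with small residuals:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1D "velocity" measurements from two surfaces translating at
# u = -0.3 and u = +0.3, observed with sensor noise of std 0.05
# (all values illustrative).
n = 200
labels = rng.integers(0, 2, n)
data = np.where(labels == 0, -0.3, 0.3) + 0.05 * rng.normal(size=n)

# One-surface explanation: a single velocity, all spread blamed on noise.
E1 = np.mean((data - data.mean()) ** 2)

# Two-surface explanation: split at zero; only the sensor noise remains.
left, right = data[data < 0], data[data >= 0]
E2 = (np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)) / n
# E1 is dominated by the gap between the two surfaces; E2 is only sensor noise.
```

Which explanation is preferred then depends entirely on the assumed noise variance: roughly speaking, if the assumed variance exceeds E1 a single surface already fits "within the noise", while for smaller assumed variance the two-surface split pays for itself.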
In this paper1 we show, following [6, 8], that this intuition regarding the dependence of the number of surfaces on the assumed noise level is captured in the maximum likelihood framework. We derive analytical results for the critical values of the noise level where the likelihood function undergoes a \"phase transition\": from being maximized by a single model to being maximized by multiple models. We illustrate these transitions on synthetic and real video data.\n\n1 A longer version of this paper is available on the author's web page.\n\n1 Theory\n\n1.1 Mixture Models for optical flow\n\nIn mixture models for optical flow (cf. [5, 1]) the scene is modeled as composed of K surfaces, with the velocity of the k-th surface at location (x, y) given by (u_k(x, y), v_k(x, y)). The velocity field is parameterized by a vector θ^k. A typical choice [9] is the affine representation:\n\nu_k(x, y) = θ_1^k + θ_2^k x + θ_3^k y   (1)\nv_k(x, y) = θ_4^k + θ_5^k x + θ_6^k y   (2)\n\nThe affine family of motions includes rotations, translations, scalings and shears. It corresponds to the 2D projection of a plane undergoing rigid motion in depth.\n\nCorresponding pixels in subsequent frames are assumed to have identical intensity values, up to imaging noise which is modeled as a Gaussian with variance σ². The task of multiple motion estimation is to find the most likely motion parameter values given the image data. A standard derivation (see e.g. [1]) gives the following log likelihood function for the parameters θ:\n\nl(θ) = Σ_{x,y} log( Σ_{k=1}^{K} e^{-R_k^2(x,y)/2σ²} )   (3)\n\nwith R_k(x, y) the residual intensity at pixel (x, y) for velocity k:\n\nR_k(x, y) = I_x(x, y) u_k(x, y) + I_y(x, y) v_k(x, y) + I_t(x, y)   (4)\n\nwhere I_x, I_y, I_t denote the spatial and temporal derivatives of the image sequence. 
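For illustration, the likelihood of equations 1-4 takes only a few lines of NumPy (a minimal sketch; the array layout and function names are our assumptions, not from the paper):

```python
import numpy as np

def affine_velocity(theta, x, y):
    # Equations 1-2: theta = (theta_1, ..., theta_6) for one surface.
    u = theta[0] + theta[1] * x + theta[2] * y
    v = theta[3] + theta[4] * x + theta[5] * y
    return u, v

def log_likelihood(thetas, Ix, Iy, It, x, y, sigma):
    # Equation 3: per pixel, log of the sum over models of
    # exp(-R_k^2 / 2 sigma^2), with R_k the brightness-constancy
    # residual of equation 4.  Ix, Iy, It are flattened image
    # derivatives at pixel coordinates x, y; thetas has shape (K, 6).
    mix = np.zeros_like(Ix, dtype=float)
    for theta in thetas:
        u, v = affine_velocity(theta, x, y)
        R = Ix * u + Iy * v + It
        mix += np.exp(-R ** 2 / (2 * sigma ** 2))
    return np.sum(np.log(mix))
```

A pixel that is perfectly explained by some component contributes a residual of zero for that component; increasing σ flattens the per-pixel terms, which is what drives the phase transitions analyzed in the remainder of the paper.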
\nAlthough our notation does not make it explicit, R_k(x, y) is a function of θ^k through equations 1-2. As in most mixture estimation applications, equation 3 is not maximized directly; rather, an Expectation-Maximization (EM) algorithm is used to iteratively increase the likelihood [3].\n\n1.2 Maximum Likelihood not necessarily with maximum number of models\n\nIt may seem that since K is fixed in the likelihood function (equation 3) there is no way that the number of surfaces can be found by maximizing the likelihood. However, maximizing the likelihood may lead to a solution in which some of the θ parameters are identical [6, 5, 8]. In this case, although the number of surfaces is still K, the number of distinct surfaces may be any number less than K.\n\nConsider a very simple case where K = 2 and the motion of each surface is restricted to horizontal translation: u(x, y) = u, v(x, y) = 0. The advantage of this simplified case is that the likelihood function is a function of two variables and can be easily visualized.\n\nFigure 3: The log likelihood for the data in figure 2 undergoes a phase transition when σ is varied. For small values of σ the likelihood has two maxima, and at both these maxima the two motions are distinct. For large σ² the likelihood function has a single maximum at the origin, corresponding to the solution where both velocities are equal to zero, or only one unique surface.\n\nFigure 3 shows the likelihood function for the data in figure 2 as σ is varied. Observe that for small values of σ² the likelihood has two maxima, and at both these maxima the two motions are distinct. 
For large σ² the likelihood function has a single maximum at the origin, corresponding to the solution where both velocities are equal to zero, or only one unique surface. This is a simple example where the ML solution corresponds to a small number of unique surfaces.\n\nCan we predict the range of values of σ for which the likelihood function has a maximum at the origin? This happens when the gradient of the likelihood at the origin is zero and the Hessian has two negative eigenvalues. It is easy to show that if the data has zero mean, the gradient is zero regardless of σ. As for the Hessian, H, direct calculation gives:\n\nH = c ( E/(2σ²) - 1    -E/(2σ²) ; -E/(2σ²)    E/(2σ²) - 1 )   (5)\n\nwhere E is the mean squared residual of a single motion and c is a positive constant. The two eigenvalues are proportional to -1 and E/σ² - 1. So the likelihood function has a local maximum at the origin if and only if E < σ². (See [6, 4, 8] for a similar analysis in other contexts.)\n\nThis result makes intuitive sense. Recall that σ² is the expected noise variance. Thus if the mean squared residual is less than σ² with a single surface, there is no need to add additional surfaces. The result on the Hessian shows that this intuition is captured in the likelihood function. There is no need to introduce additional \"complexity costs\" to avoid overfitting in this case.\n\nMore generally, if we assume the velocity fields are of a general parametric form, the Hessian evaluated at the point where both surfaces are identical has the form:\n\nH = c ( E/(2σ²) - F    -E/(2σ²) ; -E/(2σ²)    E/(2σ²) - F )   (6)\n\nwhere E and F are matrices:\n\nE = Σ_{x,y} R²(x, y) d(x, y) d(x, y)^T   (7)\n\nFigure 4: a: data generated by two lines. 
b: the predicted phase diagram for the likelihood of this dataset in a four component mixture. The phase transitions are at σ = 0.084, 0.112, 0.8088.\n\nF = Σ_{x,y} d(x, y) d(x, y)^T   (8)\n\nwith d(x, y) = ∂R(x, y)/∂θ, and R(x, y) the residual as before.\n\nA necessary and sufficient condition for the Hessian to have only negative eigenvalues is:\n\nλ_max(F⁻¹E) < σ²   (9)\n\nThus when the maximal eigenvalue of F⁻¹E is less than σ² the fit with a single model is a local maximum of the likelihood. Note that F⁻¹E is very similar to a weighted mean squared error, with every residual weighted by a positive definite matrix (E sums all the residuals times their weight, and F sums all the weights, so F⁻¹E is similar to a weighted average).\n\nThe above analysis predicts the phase transition of a two component mixture likelihood, i.e. the critical value of σ² such that above this critical value, the maximum likelihood solution will have identical motion parameters for both surfaces. This analysis can be straightforwardly generalized to finding the first phase transition of a K component mixture, although the subsequent transitions are harder to analyze.\n\n2 Results\n\nThe fact that the likelihood function undergoes a phase transition as σ is varied predicts that a ML technique will converge to a different number of distinct models as σ is varied. We first illustrate these phase transitions on a 1D line fitting problem which shares some of the structure of multiple motion analysis and is easily visualized.\n\nFigure 4a shows data generated by two lines with additive noise, and figure 4b shows a phase diagram calculated using repeated application of equation 9; i.e. 
by solving equation 9 for all the data, taking the two line solution obtained after the transition, and repeating the calculation separately for points assigned to each of the two lines.\n\nFigure 5 shows the output of an EM algorithm on this data set. Initial conditions are identical in all runs, and the algorithm converges to one, two, three or four distinct lines depending on σ.\n\nWe now illustrate the phase transitions on a real video sequence. Figures 6-8 show the output of an EM motion segmentation algorithm with four components on the MPEG flower garden sequence (cf. [9, 10]). The camera is translating in the scene, and objects move with different velocities due to parallax.\n\nFigure 5: The data in figure 4 are fit with one, two, three or four models depending on σ (the panels correspond to σ = .078, .089, 0.1183, 1.0). The results of EM with identical initial conditions are shown; only σ is varied. The transitions are consistent with the theoretical predictions.\n\nFigure 6: The first phase transition. The algorithm finds two segments corresponding to the tree and the rest of the scene. The critical value of σ² for which this transition happens is consistent with the theoretical prediction.\n\nThe phase transitions correspond to different perceptual organizations of the scene: first the tree is segmented from the background, then branches are split from the tree, and finally the background splits into the flower bed and the house. 
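The collapse of redundant components under a large assumed noise level can be reproduced in a scalar analogue of the motion mixture (a hypothetical sketch, not the paper's implementation): two translation-like components are fit to bimodal 1D data by EM with σ held fixed.

```python
import numpy as np

def em_two_means(data, sigma, iters=200):
    # EM for an equal-weight mixture of two Gaussians with known, fixed
    # sigma; only the two means (the "velocities") are estimated.
    m = np.array([data.mean() - 1e-3, data.mean() + 1e-3])  # tiny symmetry break
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        w = np.exp(-(data[None, :] - m[:, None]) ** 2 / (2 * sigma ** 2))
        w /= w.sum(axis=0, keepdims=True)
        # M-step: responsibility-weighted means.
        m = (w * data).sum(axis=1) / w.sum(axis=1)
    return m

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-1, 0.1, 100), rng.normal(1, 0.1, 100)])
E = np.mean((data - data.mean()) ** 2)  # single-model mean squared residual

m_low = em_two_means(data, sigma=0.3)   # sigma^2 < E: two distinct means survive
m_high = em_two_means(data, sigma=2.0)  # sigma^2 > E: the means coalesce
```

In line with the criterion derived in section 1.2, the assumed σ² = 0.09 lies below the single-model residual E (about 1 for this data), so the two means stay distinct, while σ² = 4 lies above it and the maximum likelihood solution collapses to a single distinct model.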
\n\n3 Discussion\n\nEstimating the number of components in a Gaussian mixture is a well researched topic in statistics and data mining [7]. Most approaches involve some tradeoff parameter to balance the benefit of an additional component versus the added complexity [2]. Here we have shown how this tradeoff parameter can be implicitly specified by the assumed level of noise in the image sequence.\n\nWhile making an assumption regarding σ may seem rather arbitrary in the abstract Gaussian mixture problem, we find it quite reasonable in the context of motion estimation, where the noise is often a property of the imaging system, not of the underlying surfaces. Furthermore, as the phase diagram in figure 4 shows, a wide range of assumed σ values will give similar answers, suggesting that an exact specification of σ is not needed. In current work we are exploring the use of weak priors on σ as well as comparing our method to those based on cross validation [7].\n\nFigure 7: The second phase transition. The algorithm finds three segments: branches which are closer to the camera than the rest of the tree are segmented from it. Since the segmentation is based solely on motion, portions of the flower bed that move consistently with the branches are erroneously grouped with them.\n\nFigure 8: The third phase transition. The algorithm finds four segments: the flower bed and the house are segregated.\n\nOur analytical and simulation results show that an assumption of the noise level in the sequence enables automatic determination of the number of moving objects using well understood maximum likelihood techniques. 
Furthermore, for a given scene, varying the assumed noise level gives rise to different perceptually meaningful segmentations. Thus mixture models may be a first step towards a well founded probabilistic framework for perceptual organization.\n\nAcknowledgments\n\nI thank D. Fleet, E. Adelson, J. Tenenbaum and G. Hinton for stimulating discussions. Supported by a training grant from NIGMS.\n\nReferences\n\n[1] Serge Ayer and Harpreet S. Sawhney. Layered representation of motion video using robust maximum likelihood estimation of mixture models and MDL encoding. In Proc. Int'l Conf. Comput. Vision, pages 777-784, 1995.\n\n[2] J. Buhmann. Data clustering and learning. In M. Arbib, editor, Handbook of Brain Theory and Neural Networks. MIT Press, 1995.\n\n[3] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B, 39:1-38, 1977.\n\n[4] R. Durbin, R. Szeliski, and A. Yuille. An analysis of the elastic net approach to the travelling salesman problem. Neural Computation, 1(3):348-358, 1989.\n\n[5] A. Jepson and M. J. Black. Mixture models for optical flow computation. In Proc. IEEE Conf. Comput. Vision Pattern Recog., pages 760-761, New York, June 1993.\n\n[6] K. Rose, E. Gurewitz, and G. Fox. Statistical mechanics and phase transitions in clustering. Physical Review Letters, 65:945-948, 1990.\n\n[7] P. Smyth. Clustering using Monte-Carlo cross-validation. In KDD-96, pages 126-133, 1996.\n\n[8] J. B. Tenenbaum and E. V. Todorov. Factorial learning by clustering features. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems 7, 1995.\n\n[9] J. Y. A. Wang and E. H. Adelson. Representing moving images with layers. 
\nIEEE Transactions on Image Processing Special Issue: Image Sequence Compression, 3(5):625-638, September 1994.\n\n[10] Y. Weiss and E. H. Adelson. A unified mixture framework for motion segmentation: incorporating spatial coherence and estimating the number of models. In Proc. IEEE Conf. Comput. Vision Pattern Recog., pages 321-326, 1996.\n", "award": [], "sourceid": 1354, "authors": [{"given_name": "Yair", "family_name": "Weiss", "institution": null}]}