{"title": "The Manhattan World Assumption: Regularities in Scene Statistics which Enable Bayesian Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 845, "page_last": 851, "abstract": null, "full_text": "The Manhattan World Assumption: \nRegularities  in scene statistics which \n\nenable  Bayesian inference \n\nJames M.  Coughlan \n\nA.L.  Yuille \n\nSmith-Kettlewell Eye Research Inst. \n\nSmith-Kettlewell Eye  Research Inst. \n\n2318 Fillmore St. \n\nSan Francisco,  CA  94115 \n\n2318  Fillmore St. \n\nSan Francisco,  CA  94115 \n\ncoughlan@ski.org \n\nyuille@ski.org \n\nAbstract \n\nPreliminary work by the authors made use of the so-called  \"Man(cid:173)\nhattan  world\"  assumption  about  the  scene  statistics  of  city  and \nindoor scenes.  This assumption stated that such  scenes were built \non a  cartesian grid which led to regularities in the image edge gra(cid:173)\ndient  statistics.  In this paper we  explore the general applicability \nof this  assumption  and show  that,  surprisingly,  it holds  in a  large \nvariety of less structured environments including rural scenes.  This \nenables us, from  a single image, to determine the orientation of the \nviewer relative to the scene structure and also to detect target ob(cid:173)\njects  which  are  not  aligned  with  the  grid.  These  inferences  are \nperformed  using  a  Bayesian  model  with  probability  distributions \n(e.g.  on the image gradient statistics)  learnt from  real data. \n\n1 \n\nIntroduction \n\nIn recent  years,  there has been growing interest in the statistics of natural images \n(see  Huang  and  Mumford  [4]  for  a  recent  review).  Our focus,  however,  is  on  the \ndiscovery of scene  statistics which  are useful  for  solving visual  inference  problems. \nFor example,  in  related  work  [5]  we  have  analyzed the statistics of filter  responses \non and off edges  and hence derived effective edge detectors. \n\nIn  this  paper  we  present  results  on  statistical  regularities  of the  image  gradient \nresponses  as  a  function  of the  global  scene  structure.  This  builds  on preliminary \nwork  [2]  on  city and indoor scenes.  This work observed that such scenes  are based \non a cartesian coordinate system which puts (probabilistic) constraints on the image \ngradient statistics. \nOur current work  shows  that this  so-called  \"Manhattan world\"  assumption about \nthe scene statistics applies far more generally than urban scenes.  Many rural scenes \ncontain sufficient structure on the distribution of edges to provide a natural cartesian \nreference frame for  the viewer.  The  viewers'  orientation relative to this frame  can \nbe  determined  by  Bayesian inference.  In  addition,  certain  structures in  the scene \nstand out  by  being  unaligned  to this  natural reference  frame.  In  our theory  such \n\n\fstructures appear as  \"outlier\"  edges which makes it easier to detect them.  Informal \nevidence that  human  observers use  a  form  of the  Manhattan world  assumption  is \nprovided  by  the  Ames  room  illusion,  see  figure  (6),  where  the  observers  appear \nto  erroneously  make  this  assumption,  thereby  grotesquely  distorting  the  sizes  of \nobjects in the room. \n\n2  Previous  Work and  Three- Dimensional Geometry \n\nOur preliminary work on  city scenes was  presented in [2].  
There is related work in computer vision on the detection of vanishing points in 3-d scenes [1], [6], which proceeds through the stages of edge detection, grouping by Hough transforms, and finally the estimation of the geometry.

We refer the reader to [3] for details on the geometry of the Manhattan world and report only the main results here. Briefly, we calculate expressions for the orientations of x, y, z lines imaged under perspective projection in terms of the orientation of the camera relative to the x, y, z axes. The camera orientation relative to the xyz axis system may be specified by three Euler angles: the azimuth (or compass angle) α, corresponding to rotation about the z axis, the elevation β above the xy plane, and the twist γ about the camera's line of sight. We use Ψ = (α, β, γ) to denote all three Euler angles of the camera orientation. Our previous work [2] assumed that the elevation and twist were both zero, which turned out to be invalid for many of the images presented in this paper.

We can then compute the normal orientation of lines parallel to the x, y, z axes, measured in the image plane, as a function of film coordinates (u, v) and the camera orientation Ψ. We express the results in terms of orthogonal unit camera axes a, b and c, which are aligned to the body of the camera and are determined by Ψ. For x lines (see Figure 1, left panel) we have tan θ_x = -(u c_x + f a_x)/(v c_x + f b_x), where θ_x is the normal orientation of the x line at film coordinates (u, v) and f is the focal length of the camera. Similarly, tan θ_y = -(u c_y + f a_y)/(v c_y + f b_y) for y lines and tan θ_z = -(u c_z + f a_z)/(v c_z + f b_z) for z lines. In the next section we will see how to relate the normal orientation of an object boundary (such as x, y, z lines) at a point (u, v) to the magnitude and direction of the image gradient at that location.

Figure 1: (Left) Geometry of an x line projected onto the (u, v) image plane. θ is the normal orientation of the line in the image. (Right) Histogram of edge orientation error (displayed modulo 180°). Observe the strong peak at 0°, indicating that the image gradient direction at an edge is usually very close to the true normal orientation of the edge.
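
To make these projection equations concrete, the following Python sketch evaluates the predicted normal orientations numerically. It is illustrative only: the exact Euler-angle conventions are specified in [3], so the conventions assumed here (line of sight along the world y axis at zero angles, and rotation order azimuth, then elevation, then twist) are our guesses, and all function names are ours.

import numpy as np

def rodrigues(axis, t):
    """Rotation matrix for angle t (radians) about the unit vector axis."""
    x, y, z = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])
    return np.eye(3) + np.sin(t) * K + (1.0 - np.cos(t)) * (K @ K)

def camera_axes(alpha, beta, gamma):
    """Unit camera axes a (film u axis), b (film v axis) and c (line of
    sight), expressed in world coordinates. Assumed convention: at zero
    angles a, b, c point along world x, z, y; alpha then rotates about
    the world z axis, beta elevates c out of the xy plane, and gamma
    twists about the line of sight."""
    a = np.array([1.0, 0.0, 0.0])
    b = np.array([0.0, 0.0, 1.0])
    c = np.array([0.0, 1.0, 0.0])
    Rz = rodrigues(np.array([0.0, 0.0, 1.0]), alpha)   # azimuth
    a, b, c = Rz @ a, Rz @ b, Rz @ c
    Re = rodrigues(a, beta)                            # elevation
    b, c = Re @ b, Re @ c
    Rt = rodrigues(c, gamma)                           # twist
    a, b = Rt @ a, Rt @ b
    return a, b, c

def predicted_normal(u, v, f, axes, j):
    """Normal orientation (mod pi) at film point (u, v) of the image of a
    line parallel to world axis j (0 = x, 1 = y, 2 = z), focal length f:
    tan(theta) = -(u*c_j + f*a_j) / (v*c_j + f*b_j)."""
    a, b, c = axes
    return np.arctan2(-(u * c[j] + f * a[j]), v * c[j] + f * b[j]) % np.pi

As a sanity check under the conventions assumed here, with zero Euler angles predicted_normal returns 0 for z lines at every film point: vertical scene lines image to vertical lines whose normals are horizontal.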

3 P_on and P_off: Characterizing Edges Statistically

Since we do not know where the x, y, z lines are in the image, we have to infer their locations and orientations from image gradient information. This inference is done using a purely local statistical model of edges. A key element of our approach is that it allows the model to infer camera orientation without having to group pixels into x, y, z lines. Most grouping procedures rely on the use of binary edge maps, which often make premature decisions based on too little information. The poor quality of some of the images - underexposed and overexposed - makes edge detection particularly difficult, as does the fact that some of the images lack x, y, z lines that are long enough to group reliably.

Following work by Konishi et al [5], we determine the probabilities P_on(E_u) and P_off(E_u) of the image gradient magnitude E_u at position u in the image, conditioned on whether we are on or off an edge. These distributions quantify the tendency for the image gradient to be high on object boundaries and low off them, see Figure 2. They were learned by Konishi et al for the Sowerby image database, which contains one hundred presegmented images.

Figure 2: P_off(y) (left) and P_on(y) (right), the empirical histograms of edge responses off and on edges, respectively. Here the response y = |∇I| is quantized to take 20 values and is shown on the horizontal axis. Note that the peak of P_off(y) occurs at a lower edge response than the peak of P_on(y).

We extend the work of Konishi et al by putting probability distributions on how accurately the image gradient direction estimates the true normal direction of the edge. These were learned for this dataset by measuring the true orientations of the edges and comparing them to those estimated from the image gradients.

This gives us distributions on the magnitude and direction of the intensity gradient, P_on(E_u, φ_u | θ) and P_off(E_u, φ_u), where θ is the true normal orientation of the edge and φ_u is the gradient direction measured at point u = (u, v). We make a factorization assumption that P_on(E_u, φ_u | θ) = P_on(E_u) P_ang(φ_u - θ) and P_off(E_u, φ_u) = P_off(E_u) U(φ_u). P_ang(·) (with argument evaluated modulo 2π and normalized to 1 over the range 0 to 2π) is based on experimental data, see Figure 1 (right), and is peaked about 0 and π. In practice, we use a simple box-shaped function to model the distribution: P_ang(δθ) = (1 - ε)/(4τ) if δθ is within angle τ of 0 or π, and ε/(2π - 4τ) otherwise (i.e. the chance of an angular error greater than ±τ is ε). In our experiments ε = 0.1, and τ = 4° for indoors and 6° outdoors. By contrast, U(·) = 1/(2π) is the uniform distribution.
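
For concreteness, here is a minimal Python sketch of the box-shaped angular model as defined above (the function names are ours, not from the paper), with ε = 0.1 and τ = 4° or 6°:

import numpy as np

def p_ang(delta, eps=0.1, tau=np.radians(4.0)):
    """Box-shaped density on the angular error delta = phi - theta,
    normalized to 1 over [0, 2*pi) and peaked about 0 and pi."""
    d = np.mod(delta, np.pi)                      # fold onto [0, pi)
    near_peak = np.minimum(d, np.pi - d) <= tau   # within tau of 0 or pi
    return np.where(near_peak, (1.0 - eps) / (4.0 * tau),
                    eps / (2.0 * np.pi - 4.0 * tau))

def p_uniform(delta):
    """Uniform direction density U for outlier and off-edge models."""
    return np.full_like(np.asarray(delta, dtype=float), 1.0 / (2.0 * np.pi))

For example, with τ = 4° the on-peak density is p_ang(0.0) = 0.9/(4 × 0.0698) ≈ 3.22, compared to the uniform density 1/(2π) ≈ 0.159, so a well-aligned gradient direction gives roughly a factor-of-20 likelihood boost.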

4 Bayesian Model

We devised a Bayesian model which combines knowledge of the three-dimensional geometry of the Manhattan world with statistical knowledge of edges in images. The model assumes that, while the majority of pixels in the image convey no information about camera orientation, most of the pixels with high edge responses arise from the presence of x, y, z lines in the three-dimensional scene. An important feature of the Bayesian model is that it does not force us to decide prematurely which pixels are on and off an object boundary (or whether an on pixel is due to x, y, or z), but allows us to sum over all possible interpretations of each pixel.

The image data (E_u, φ_u) at a single pixel u is explained by one of five models m_u: m_u = 1, 2, 3 mean the data is generated by an edge due to an x, y, z line, respectively, in the scene; m_u = 4 means the data is generated by an outlier edge (not due to an x, y, z line); and m_u = 5 means the pixel is off-edge. The prior probability P(m_u) of each of the edge models was estimated empirically to be 0.02, 0.02, 0.02, 0.04, 0.9 for m_u = 1, 2, ..., 5.

Using the factorization assumption mentioned before, we assume the probability of the image data has two factors, one for the magnitude of the edge strength and another for the edge direction:

P(E_u, φ_u | m_u, Ψ, u) = P(E_u | m_u) P(φ_u | m_u, Ψ, u)    (1)

where P(E_u | m_u) equals P_off(E_u) if m_u = 5 or P_on(E_u) if m_u ≠ 5. Also, P(φ_u | m_u, Ψ, u) equals P_ang(φ_u - θ(Ψ, m_u, u)) if m_u = 1, 2, 3 or U(φ_u) if m_u = 4, 5. Here θ(Ψ, m_u, u) is the predicted normal orientation of lines determined by the equations of section 2: tan θ_x = -(u c_x + f a_x)/(v c_x + f b_x) for x lines, tan θ_y = -(u c_y + f a_y)/(v c_y + f b_y) for y lines, and tan θ_z = -(u c_z + f a_z)/(v c_z + f b_z) for z lines.

In summary, the edge strength probability is modeled by P_on for models 1 through 4 and by P_off for model 5. For models 1, 2 and 3 the edge orientation is modeled by a distribution which is peaked about the appropriate orientation of an x, y, z line predicted by the camera orientation at pixel location u; for models 4 and 5 the edge orientation is assumed to be uniformly distributed from 0 through 2π.

Rather than decide on a particular model at each pixel, we marginalize over all five possible models (i.e. creating a mixture model):

P(E_u, φ_u | Ψ, u) = Σ_{m_u=1}^{5} P(E_u, φ_u | m_u, Ψ, u) P(m_u)    (2)

Now, to combine evidence over all pixels in the image, denoted by {E_u, φ_u}, we assume that the image data is conditionally independent across all pixels, given the camera orientation Ψ:

P({E_u, φ_u} | Ψ) = Π_u P(E_u, φ_u | Ψ, u)    (3)

(Although the conditional independence assumption neglects the coupling of gradients at neighboring pixels, it is a useful approximation that makes the model computationally tractable.) Thus the posterior distribution on the camera orientation is given by Π_u P(E_u, φ_u | Ψ, u) P(Ψ)/Z, where Z is a normalization factor and P(Ψ) is a uniform prior on the camera orientation.

To find the MAP (maximum a posteriori) estimate, our algorithm maximizes the log posterior term log[P({E_u, φ_u} | Ψ) P(Ψ)] = log P(Ψ) + Σ_u log[Σ_{m_u} P(E_u, φ_u | m_u, Ψ, u) P(m_u)] numerically by searching over a quantized set of compass directions Ψ in a certain range. For details on this procedure, as well as coarse-to-fine techniques for speeding up the search, see [3].
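
The following sketch shows one way the marginalization (2) and the pixelwise sum of log-evidence (3) could be evaluated for a candidate orientation Ψ. The callables p_on, p_off, p_ang and theta_pred stand in for the learned distributions and the projection equations of section 2 (theta_pred could wrap predicted_normal from the earlier sketch); these names are our own notation, not the paper's code.

import numpy as np

PRIOR = np.array([0.02, 0.02, 0.02, 0.04, 0.90])   # P(m) for m = 1..5

def log_posterior(psi, pixels, p_on, p_off, p_ang, theta_pred):
    """log P({E, phi} | psi) from eqs. (2)-(3): at each pixel, marginalize
    the likelihood over the five models, then sum the log-evidence over
    pixels. pixels is an iterable of (u, v, E, phi) tuples."""
    logp = 0.0
    for (u, v, E, phi) in pixels:
        like = np.empty(5)
        for m in (1, 2, 3):                        # x, y, z line models
            like[m - 1] = p_on(E) * p_ang(phi - theta_pred(psi, m, u, v))
        like[3] = p_on(E) / (2.0 * np.pi)          # outlier edge, uniform angle
        like[4] = p_off(E) / (2.0 * np.pi)         # off-edge, uniform angle
        logp += np.log(np.dot(like, PRIOR))
    return logp                                    # + log P(psi), constant for a uniform prior

# MAP estimate by exhaustive search over a quantized set of orientations
# (the coarse-to-fine speedups of [3] are omitted from this sketch):
# psi_star = max(psi_grid, key=lambda p: log_posterior(p, pixels,
#                                                      p_on, p_off, p_ang, theta_pred))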

5 Experimental Results

This section presents results on the domains for which the viewer orientation relative to the scene can be detected using the Manhattan world assumption. In particular, we demonstrate results for: (I) indoor and outdoor scenes (as reported in [2]), (II) rural English road scenes, (III) rural English fields, (IV) a painting of the French countryside, (V) a field of broccoli in the American mid-west, (VI) the Ames room, and (VII) ruins of the Parthenon (in Athens). The results show strong success for inference using the Manhattan world assumption even for domains in which it might seem unlikely to apply. (Some examples of failure are given in [3]: for example, a helicopter in a hilly scene where the algorithm mistakenly interprets the hill silhouettes as horizontal lines.)

The first set of images were of city and indoor scenes in San Francisco, with images taken by the second author [2]. We include four typical results, see figure 3, for comparison with the results on other domains.

Figure 3: Estimates of the camera orientation obtained by our algorithm for two indoor scenes (left) and two outdoor scenes (right). The estimated orientations of the x, y lines, derived for the estimated camera orientation Ψ*, are indicated by the black line segments drawn on the input image. (The z line orientations have been omitted for clarity.) At each point on a subgrid two such segments are drawn - one for x and one for y. In the image on the far left, observe how the x directions align with the wall on the right hand side and with features parallel to this wall. The y lines align with the wall on the left (and objects parallel to it).

We now extend this work to less structured scenes in the English countryside. Figure (4) shows two images of roads in rural scenes and two fields. These images come from the Sowerby database. The next three images, see figure (5), were either downloaded from the web or digitized (the painting). These are the mid-west broccoli field, the Parthenon ruins, and the painting of the French countryside.

Figure 4: Results on rural images in England without strong Manhattan structure. Same conventions as before. Two images of roads in the countryside (left panels) and two images of fields (right panels).

Figure 5: Results on an American mid-west broccoli field, the ruins of the Parthenon, and a digitized painting of the French countryside.

6 Detecting Objects in Manhattan world

We now consider applying the Manhattan assumption to the alternative problem of detecting target objects in background clutter. To perform such a task effectively requires modelling the properties of the background clutter in addition to those of the target object. It has recently been appreciated that good statistical modelling of the image background can improve the performance of target recognition [7].

The Manhattan world assumption gives an alternative way of probabilistically modelling background clutter. The background clutter will correspond to the regular structure of buildings and roads, and its edges will be aligned to the Manhattan grid. The target object, however, is assumed to be unaligned (at least in part) to this grid. Therefore many of the edges of the target object will be assigned to model 4 by the algorithm. (Note the algorithm first finds the MAP estimate Ψ* of the compass orientation, see section (4), and then estimates the model at each pixel u by maximizing the posterior P(m_u | E_u, φ_u, Ψ*, u); a sketch of this step is given below.) This enables us to significantly simplify the detection task by removing all edges in the images except those assigned to model 4.
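
A minimal sketch of this per-pixel assignment step (our own illustration, reusing the hypothetical helper names from the earlier sketches): given the MAP orientation Ψ*, each pixel is labeled with the model of highest posterior probability, and only the model-4 (outlier) pixels are kept.

import numpy as np

PRIOR = np.array([0.02, 0.02, 0.02, 0.04, 0.90])   # P(m), as in section 4

def classify_pixel(u, v, E, phi, psi_star, p_on, p_off, p_ang, theta_pred):
    """MAP model label for one pixel given the estimated orientation
    psi*: returns 1, 2, 3 (x, y, z line), 4 (outlier edge) or 5 (off-edge)."""
    like = [p_on(E) * p_ang(phi - theta_pred(psi_star, m, u, v))
            for m in (1, 2, 3)]
    like.append(p_on(E) / (2.0 * np.pi))            # model 4: outlier edge
    like.append(p_off(E) / (2.0 * np.pi))           # model 5: off-edge
    post = np.array(like).ravel() * PRIOR           # unnormalized posterior
    return int(np.argmax(post)) + 1

# Keeping only the pixels labeled 4 removes the grid-aligned background
# clutter and leaves the unaligned (candidate target) edges.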

The Ames room, see figure (6), is a geometrically distorted room which is constructed so as to give the false impression that it is built on a cartesian coordinate frame when viewed from a special viewpoint. Human observers assume that the room is indeed cartesian despite all other visual cues to the contrary. This distorts the apparent size of objects so that, for example, humans in different parts of the room appear to have very different sizes. In fact, a human walking across the room will appear to change size dramatically. Our algorithm, like human observers, interprets the room as being cartesian, and this helps identify the humans in the room as outlier edges which are unaligned to the cartesian reference system.

7 Summary and Conclusions

We have demonstrated that the Manhattan world assumption applies to a range of images, rural and otherwise, in addition to urban scenes. We demonstrated a Bayesian model which used this assumption to infer the orientation of the viewer relative to this reference frame and which could also detect outlier edges which are unaligned to the reference frame. A key element of this approach is the use of image gradient statistics, learned from image datasets, which quantify the distribution of the image gradient magnitude and direction on and off object boundaries. We expect that there are many further image regularities of this type which can be used for building effective artificial vision systems and which are possibly made use of by biological vision systems.

Figure 6: Detecting people in Manhattan world. The left images (top and bottom) show the estimated scene structure. The right images show that people stand out as residual edges which are unaligned to the Manhattan grid. The Ames room (top panel) violates the Manhattan assumption, but human observers, and our algorithm, interpret it as if it satisfied the assumption. In fact, despite appearances, the two people in the Ames room are really the same size.

Acknowledgments

We want to acknowledge funding from NSF with award number IRI-9700446, support from the Smith-Kettlewell core grant, and from the Center for Imaging Sciences with Army grant ARO DAAH049510494. This work was also supported by the National Institute of Health (NEI) with grant number R01-EY 12691-01. It is a pleasure to acknowledge email conversations with Song Chun Zhu about scene clutter. We gratefully acknowledge the use of the Sowerby image dataset from Sowerby Research Centre, British Aerospace.

References

[1] B. Brillault-O'Mahony. \"New Method for Vanishing Point Detection\". Computer Vision, Graphics, and Image Processing, 54(2), pp. 289-300. 1991.

[2] J. Coughlan and A.L. Yuille. \"Manhattan World: Compass Direction from a Single Image by Bayesian Inference\". Proceedings International Conference on Computer Vision ICCV'99. Corfu, Greece. 1999.

[3] J. Coughlan and A.L. Yuille. \"Manhattan World: Orientation and Outlier Detection by Bayesian Inference\". Submitted to International Journal of Computer Vision. 2000.

[4] J. Huang and D. Mumford. \"Statistics of Natural Images and Models\". In Proceedings Computer Vision and Pattern Recognition CVPR'99. Fort Collins, Colorado. 1999.

[5] S. Konishi, A.L. Yuille, J.M. Coughlan, and S.C. Zhu. \"Fundamental Bounds on Edge Detection: An Information Theoretic Evaluation of Different Edge Cues\". Proc. Int'l Conf. on Computer Vision and Pattern Recognition, 1999.

[6] E. Lutton, H. Maitre, and J. Lopez-Krahe. \"Contribution to the determination of vanishing points using Hough transform\". IEEE Trans. on Pattern Analysis and Machine Intelligence, 16(4), pp. 430-438. 1994.

[7] S.C. Zhu, A. Lanterman, and M.I. Miller. \"Clutter Modeling and Performance Analysis in Automatic Target Recognition\". In Proceedings Workshop on Detection and Classification of Difficult Targets. Redstone Arsenal, Alabama. 1998.", "award": [], "sourceid": 1804, "authors": [{"given_name": "James", "family_name": "Coughlan", "institution": null}, {"given_name": "Alan", "family_name": "Yuille", "institution": null}]}