{"title": "Hierarchical Image Probability (HIP) Models", "book": "Advances in Neural Information Processing Systems", "page_first": 848, "page_last": 854, "abstract": null, "full_text": "Hierarchical Image Probability (HIP) Models\n\nClay D. Spence and Lucas Parra\n\nSarnoff Corporation\n\nCN5300\n\nPrinceton, NJ 08543-5300\n\n{cspence, lparra}@sarnoff.com\n\nAbstract\n\nWe formulate a model for probability distributions on image spaces. We show that any distribution of images can be factored exactly into conditional distributions of feature vectors at one resolution (pyramid level) conditioned on the image information at lower resolutions. We would like to factor this over positions in the pyramid levels to make it tractable, but such factoring may miss long-range dependencies. To fix this, we introduce hidden class labels at each pixel in the pyramid. The result is a hierarchical mixture of conditional probabilities, similar to a hidden Markov model on a tree. The model parameters can be found with maximum likelihood estimation using the EM algorithm. We have obtained encouraging preliminary results on the problems of detecting various objects in SAR images and target recognition in optical aerial images.\n\n1  Introduction\n\nMany approaches to object recognition in images estimate $\\Pr(\\mathrm{class} \\mid \\mathrm{image})$. By contrast, a model of the probability distribution of images, $\\Pr(\\mathrm{image})$, has many attractive features. We could use this for object recognition in the usual way by training a distribution for each object class and using Bayes' rule to get $\\Pr(\\mathrm{class} \\mid \\mathrm{image}) = \\Pr(\\mathrm{image} \\mid \\mathrm{class}) \\Pr(\\mathrm{class}) / \\Pr(\\mathrm{image})$. Clearly there are many other benefits of having a model of the distribution of images, since any kind of data analysis task can be approached using knowledge of the distribution of the data. 
For classification we could attempt to detect unusual examples and reject them, rather than trusting the classifier's output. We could also compress, interpolate, suppress noise, extend resolution, fuse multiple images, etc.\n\nMany image analysis algorithms use probability concepts, but few treat the distribution of images. Zhu, Wu and Mumford [9] do this by computing the maximum entropy distribution given a set of statistics for some features. This seems to work well for textures but it is not clear how well it will model the appearance of more structured objects.\n\nThere are several algorithms for modeling the distributions of features extracted from the image, instead of the image itself. The Markov Random Field (MRF) models are an example of this line of development; see, e.g., [5, 4]. Unfortunately they tend to be very expensive computationally.\n\nFigure 1: Pyramids and feature notation.\n\nIn De Bonet and Viola's flexible histogram approach [2, 1], features are extracted at multiple image scales, and the resulting feature vectors are treated as a set of independent samples drawn from a distribution. They then model this distribution of feature vectors with Parzen windows. This has given good results, but the feature vectors from neighboring pixels are treated as independent when in fact they share exactly the same components from lower resolutions. To fix this we might want to build a model in which the features at one pixel of one pyramid level condition the features at each of several child pixels at the next higher-resolution pyramid level. The multiscale stochastic process (MSP) methods do exactly that. Luettgen and Willsky [7], for example, applied a scale-space auto-regression (AR) model to texture discrimination. 
They use a quadtree or quadtree-like organization of the pixels in an image pyramid, and model the features in the pyramid as a stochastic process from coarse-to-fine levels along the tree. The variables in the process are hidden, and the observations are sums of these hidden variables plus noise. The Gaussian distributions are a limitation of MSP models. The result is also a model of the probability of the observations on the tree, not of the image.\n\nAll of these methods seem well-suited for modeling texture, but it is unclear how we might build the models to capture the appearance of more structured objects. We will argue below that the presence of objects in images can make local conditioning like that of the flexible histogram and MSP approaches inappropriate. In the following we present a model for probability distributions of images, in which we try to move beyond texture modeling. This hierarchical image probability (HIP) model is similar to a hidden Markov model on a tree, and can be learned with the EM algorithm. In preliminary tests of the model on classification tasks the performance was comparable to that of other algorithms.\n\n2  Coarse-to-fine factoring of image distributions\n\nOur goal will be to write the image distribution in a form similar to $\\Pr(I) \\sim \\Pr(F_0 \\mid F_1) \\Pr(F_1 \\mid F_2) \\cdots$, where $F_l$ is the set of feature images at pyramid level $l$. We expect that the short-range dependencies can be captured by the model's distribution of individual feature vectors, while the long-range dependencies can be captured somehow at low resolution. The large-scale structures affect finer scales by the conditioning.\n\nIn fact we can prove that a coarse-to-fine factoring like this is correct. From an image $I$ we build a Gaussian pyramid (repeatedly blur and subsample, with a Gaussian filter). 
Call the $l$-th level $I_l$, e.g., the original image is $I_0$ (Figure 1). From each Gaussian level $I_l$ we extract some set of feature images $F_l$. Sub-sample these to get feature images $G_l$. Note that the images in $G_l$ have the same dimensions as $I_{l+1}$. We denote by $\\tilde{G}_l$ the set of images containing $I_{l+1}$ and the images in $G_l$. We further denote the mapping from $I_l$ to $\\tilde{G}_l$ by $\\tilde{g}_l$.\n\nSuppose now that $\\tilde{g}_0 : I_0 \\mapsto \\tilde{G}_0$ is invertible. Then we can think of $\\tilde{g}_0$ as a change of variables. If we have a distribution on a space, its expressions in two different coordinate systems are related by multiplying by the Jacobian. In this case we get $\\Pr(I_0) = |\\tilde{g}_0| \\Pr(\\tilde{G}_0)$, where $|\\tilde{g}_0|$ denotes the Jacobian determinant of $\\tilde{g}_0$. Since $\\tilde{G}_0 = (G_0, I_1)$, we can factor $\\Pr(\\tilde{G}_0)$ to get $\\Pr(I_0) = |\\tilde{g}_0| \\Pr(G_0 \\mid I_1) \\Pr(I_1)$. If $\\tilde{g}_l$ is invertible for all $l \\in \\{0, \\ldots, L-1\\}$ then we can simply repeat this change of variable and factoring procedure to get\n\n$$\\Pr(I) = \\left[ \\prod_{l=0}^{L-1} |\\tilde{g}_l| \\Pr(G_l \\mid I_{l+1}) \\right] \\Pr(I_L) \\qquad (1)$$\n\nThis is a very general result, valid for all $\\Pr(I)$, no doubt with some rather mild restrictions to make the change of variables valid. The restriction that $\\tilde{g}_l$ be invertible is strong, but many such feature sets are known to exist, e.g., most wavelet transforms on images. We know of a few ways that this condition can be relaxed, but further work is needed here.\n\n3  The need for hidden variables\n\nFor the sake of tractability we want to factor $\\Pr(G_l \\mid I_{l+1})$ over positions, something like $\\Pr(I) \\sim \\prod_l \\prod_{x \\in I_{l+1}} \\Pr(g_l(x) \\mid f_{l+1}(x))$, where $g_l(x)$ and $f_{l+1}(x)$ are the feature vectors at position $x$. The dependence of $g_l$ on $f_{l+1}$ expresses the persistence of image structures across scale, e.g., an edge is usually detectable as such in several neighboring pyramid levels. The flexible histogram and MSP methods share this structure. 
While it may be plausible that $f_{l+1}(x)$ has a strong influence on $g_l(x)$, we argue now that this factorization and conditioning is not enough to capture some properties of real images.\n\nObjects in the world cause correlations and non-local dependencies in images. For example, the presence of a particular object might cause a certain kind of texture to be visible at level $l$. Usually local features $f_{l+1}$ by themselves will not contain enough information to infer the object's presence, but the entire image $I_{l+1}$ at that layer might. Thus $g_l(x)$ is influenced by more of $I_{l+1}$ than the local feature vector.\n\nSimilarly, objects create long-range dependencies. For example, an object class might result in a kind of texture across a large area of the image. If an object of this class is always present, the distribution may factor, but if such objects aren't always present and can't be inferred from lower-resolution information, the presence of the texture at one location affects the probability of its presence elsewhere.\n\nWe introduce hidden variables to represent the non-local information that is not captured by local features. They should also constrain the variability of features at the next finer scale. Denoting them collectively by $A$, we assume that conditioning on $A$ allows the distributions over feature vectors to factor. In general, the distribution over images becomes\n\n$$\\Pr(I) \\propto \\sum_A \\left\\{ \\prod_{l=0}^{L-1} \\prod_{x \\in I_{l+1}} \\Pr(g_l(x) \\mid f_{l+1}(x), A) \\Pr(A \\mid I_L) \\right\\} \\Pr(I_L). \\qquad (2)$$\n\nAs written this is absolutely general, so we need to be more specific. In particular we would like to preserve the conditioning of higher-resolution information on coarser-resolution information, and the ability to factor over positions.\n\nFigure 2: Tree structure of the conditional dependency between hidden variables in the HIP model. 
With subsampling by two, this is sometimes called a quadtree structure.\n\nAs a first model we have chosen the following structure for our HIP model:$^1$\n\n$$\\Pr(I) \\propto \\sum_{A_0, \\ldots, A_{L-1}} \\prod_{l=0}^{L} \\prod_{x \\in I_{l+1}} \\left[ \\Pr(g_l(x) \\mid f_{l+1}(x), a_l(x)) \\Pr(a_l(x) \\mid a_{l+1}(x)) \\right]. \\qquad (3)$$\n\nTo each position $x$ at each level $l$ we attach a hidden discrete index or label $a_l(x)$. The resulting label image $A_l$ for level $l$ has the same dimensions as the images in $G_l$.\n\nSince $a_l(x)$ codes non-local information we can think of the labels $A_l$ as a segmentation or classification at the $l$-th pyramid level. By conditioning $a_l(x)$ on $a_{l+1}(x)$, we mean that $a_l(x)$ is conditioned on $a_{l+1}$ at the parent pixel of $x$. This parent-child relationship follows from the sub-sampling operation. For example, if we sub-sample by two in each direction to get $G_l$ from $F_l$, we condition the variable $a_l$ at $(x, y)$ in level $l$ on $a_{l+1}$ at location $(\\lfloor x/2 \\rfloor, \\lfloor y/2 \\rfloor)$ in level $l+1$ (Figure 2). This gives the dependency graph of the hidden variables a tree structure. Such a probabilistic tree of discrete variables is sometimes referred to as a belief network. By conditioning child labels on their parents, information propagates through the layers to other areas of the image while accumulating information along the way.\n\nFor the sake of simplicity we've chosen $\\Pr(g_l \\mid f_{l+1}, a_l)$ to be normal with mean $\\bar{g}_{a_l} + M_{a_l} f_{l+1}$ and covariance $\\Sigma_{a_l}$. We also constrain $M_{a_l}$ and $\\Sigma_{a_l}$ to be diagonal.\n\n4  EM algorithm\n\nThanks to the tree structure, the belief network for the hidden variables is relatively easy to train with an EM algorithm. The expectation step (summing over $a_l$'s) can be performed directly. If we had chosen a more densely-connected structure with each child having several parents, we would need either an approximate algorithm or Monte Carlo techniques. 
The expectation is weighted by the probability of a label or a parent-child pair of labels given the image. This can be computed in a fine-to-coarse-to-fine procedure, i.e., working from leaves to the root and then back out to the leaves. The method is based on belief propagation [6]. With some care an efficient algorithm can be worked out, but we omit the details due to space constraints.\n\nOnce we can compute the expectations, the normal distribution makes the M-step tractable; we simply compute the updated $\\bar{g}_{a_l}$, $\\Sigma_{a_l}$, $M_{a_l}$, and $\\Pr(a_l \\mid a_{l+1})$ as combinations of various expectation values.\n\n$^1$ The proportionality factor includes $\\Pr(A_L, I_L)$, which we model as $\\prod_x \\Pr(g_L(x) \\mid a_L(x)) \\Pr(a_L(x))$. This is the $l = L$ factor of Equation 3, which should be read as having no quantities $f_{L+1}$ or $a_{L+1}$.\n\n[Figure 3 plot: Plane ROC areas; horizontal axis $A_z$; series HIP and HPNN]\n\nFigure 3: Examples of aircraft ROIs. On the right are $A_z$ values from a jack-knife study of detection performance of HIP and HPNN models.\n\nFigure 4: SAR images of three types of vehicles to be detected.\n\n5  Experiments\n\nWe applied HIP to the problem of detecting aircraft in an aerial photograph of Logan airport. A simple template-matching algorithm was used to select forty candidate aircraft, twenty of which were false positives (Figure 3). Ten of the plane examples were used for training one HIP model and ten negative examples were used to train another. Because of the small number of examples, we performed a jack-knife study with ten random splits of the data. For features we used filter kernels that were polynomials of up to third order multiplying Gaussians. The HIP pyramid used subsampling by three in each direction. 
The test set ROC area for HIP had a mean of $A_z = 0.94$, while our HPNN algorithm [8] gave a mean $A_z$ of 0.65. The individual values are shown in Figure 3. (We compared with the HPNN because it had given $A_z = 0.86$ on a larger set of aircraft images including these, with a different set of features and subsampling by two.)\n\nWe also performed an experiment with the three target classes in the MSTAR public targets data set, to compare with the results of the flexible histogram approach of De Bonet et al. [1]. We trained three HIP models, one for each of the target vehicles BMP-2, BTR-70 and T-72 (Figure 4). As in [1] we trained each model on ten images of its class, one image for each of ten aspect angles, spaced approximately 36\u00b0 apart. We trained one model for all ten images of a target, whereas De Bonet et al. trained one model per image.\n\n[Figure 5 plots: panel (b) title: ROC using $\\Pr(I \\mid \\mathrm{target1}) / \\Pr(I \\mid \\mathrm{target2})$; horizontal axis: false positives; legend: BMP-2 vs T-72, $A_z = 0.79$; BMP-2 vs BTR-70, $A_z = 0.82$; T-72 vs BTR-70, $A_z = 0.89$]\n\nFigure 5: ROC curves for vehicle detection in SAR imagery. (a) ROC curves by thresholding HIP likelihood of desired class. (b) ROC curves for inter-class discrimination using ratios of likelihoods as given by HIP models.\n\nWe first tried discriminating between vehicles of one class and other objects by thresholding $\\log \\Pr(I \\mid \\mathrm{class})$, i.e., no model of other objects is used. For the tests, the other objects were taken from the test data for the two other vehicle classes, plus seven other vehicle classes. There were 1,838 images from these seven other classes, 391 BMP-2 test images, 196 BTR-70 test images, and 386 T-72 test images. The resulting ROC curves are shown in Figure 5a.\n\nWe then tried discriminating between pairs of target classes using HIP model likelihood ratios, i.e., $\\log \\Pr(I \\mid \\mathrm{class1}) - \\log \\Pr(I \\mid \\mathrm{class2})$. 
Here we could not use the extra seven vehicle classes. The resulting ROC curves are shown in Figure 5b. The performance is comparable to that of the flexible histogram approach.\n\n6  Conditional distributions of features\n\nTo further test the HIP model's fit to the image distribution, we computed several distributions of features $g_l(x)$ conditioned on the parent feature $f_{l+1}(x)$.$^2$ The empirical and computed distributions for a particular parent-child pair of features are shown in Figure 6. The conditional distributions we examined all had similar appearance, and all fit the empirical distributions well. Buccigrossi and Simoncelli [3] have reported such \"bow-tie\" shape conditional distributions for a variety of features. We want to point out that such conditional distributions are naturally obtained for any mixture of Gaussian distributions with varying scales and zero means. The present HIP model learns such conditionals, in effect describing the features as non-stationary Gaussian variables.\n\n$^2$ This is somewhat involved; $\\Pr(g_l \\mid f_{l+1})$ is not just $\\Pr(g_l \\mid f_{l+1}, a_l) \\Pr(a_l)$ summed over $a_l$, but $\\sum_{a_l} \\Pr(g_l, a_l \\mid f_{l+1}) = \\sum_{a_l} \\Pr(g_l \\mid f_{l+1}, a_l) \\Pr(a_l \\mid f_{l+1})$.\n\n[Figure 6 panels: Conditional distribution of data; Conditional distribution of HIP model; axis label: f, feature 9, layer 1]\n\nFigure 6: Empirical and HIP estimates of the distribution of a feature $g_l(x)$ conditioned on its parent feature $f_{l+1}(x)$.\n\n7  Conclusion\n\nWe have developed a class of image probability models we call hierarchical image probability or HIP models. To justify these, we showed that image distributions can be exactly represented as products over pyramid levels of distributions of sub-sampled feature images conditioned on coarser-scale image information. We argued that hidden variables are needed to capture long-range dependencies while allowing us to further factor the distributions over position. In our current model the hidden variables act as indices of mixture components. The resulting model is somewhat like a hidden Markov model on a tree. Our early results on classification problems showed good performance.\n\nAcknowledgements\n\nWe thank Jeremy De Bonet and John Fisher for kindly answering questions about their work and experiments. Supported by the United States Government.\n\nReferences\n\n[1] J. S. De Bonet, P. Viola, and J. W. Fisher III. Flexible histograms: A multiresolution target discrimination model. In E. G. Zelnio, editor, Proceedings of SPIE, volume 3370, 1998.\n\n[2] Jeremy S. De Bonet and Paul Viola. Texture recognition using a non-parametric multi-scale statistical model. In Conference on Computer Vision and Pattern Recognition. IEEE, 1998.\n\n[3] Robert W. Buccigrossi and Eero P. Simoncelli. Image compression via joint statistical characterization in the wavelet domain. Technical Report 414, U. Penn. GRASP Laboratory, 1998. Available at ftp://ftp.cis.upenn.edu/pub/eero/buccigrossi97.ps.gz.\n\n[4] Rama Chellappa and S. Chatterjee. Classification of textures using Gaussian Markov random fields. IEEE Trans. ASSP, 33:959-963, 1985.\n\n[5] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. PAMI, PAMI-6(6):721-741, November 1984.\n\n[6] Michael I. Jordan, editor. Learning in Graphical Models, volume 89 of NATO Science Series D: Behavioral and Brain Sciences. Kluwer Academic, 1998.\n\n[7] Mark R. Luettgen and Alan S. Willsky. Likelihood calculation for a class of multiscale stochastic models, with application to texture discrimination. IEEE Trans. 
Image Proc., 4(2):194-207, 1995.\n\n[8] Clay D. Spence and Paul Sajda. Applications of multi-resolution neural networks to mammography. In Michael S. Kearns, Sara A. Solla, and David A. Cohn, editors, NIPS 11, pages 981-988, Cambridge, MA, 1998. MIT Press.\n\n[9] Song Chun Zhu, Ying Nian Wu, and David Mumford. Minimax entropy principle and its application to texture modeling. Neural Computation, 9(8):1627-1660, 1997.\n", "award": [], "sourceid": 1784, "authors": [{"given_name": "Clay", "family_name": "Spence", "institution": null}, {"given_name": "Lucas", "family_name": "Parra", "institution": null}]}