{"title": "Saliency Based on Information Maximization", "book": "Advances in Neural Information Processing Systems", "page_first": 155, "page_last": 162, "abstract": null, "full_text": "Saliency Based on Information Maximization \n\nNeil D.B. Bruce and John K. Tsotsos \n\nDepartment of Computer Science and Centre for Vision Research \n\nYork University, Toronto, ON, M2N 5X8 \n\n{neil,tsotsos}@cs.yorku.ca \n\nAbstract \n\nA model of bottom-up overt attention is proposed based on the principle of maximizing information sampled from a scene. The proposed operation is based on Shannon's self-information measure and is achieved in a neural circuit, which is demonstrated as having close ties with the circuitry existent in the primate visual cortex. It is further shown that the proposed saliency measure may be extended to address issues that currently elude explanation in the domain of saliency based models. Results on natural images are compared with experimental eye tracking data revealing the efficacy of the model in predicting the deployment of overt attention as compared with existing efforts. \n\n1 Introduction \n\nThere has long been interest in the nature of eye movements and fixation behavior following early studies by Buswell [1] and Yarbus [2]. However, a complete description of the mechanisms underlying these peculiar fixation patterns remains elusive. This is further complicated by the fact that task demands and contextual knowledge factor heavily in how sampling of visual content proceeds. \n\nCurrent bottom-up models of attention posit that saliency is the impetus for selection of fixation points. Each model differs in its definition of saliency. 
In perhaps the most popular model of bottom-up attention, saliency is based on centre-surround contrast of units modeled on known properties of primary visual cortical cells [3]. In other efforts, saliency is defined by more ad hoc quantities having less connection to biology [4]. In this paper, we explore the notion that information is the driving force behind attentive sampling. \n\nThe application of information theory in this context is not in itself novel. There exist several previous efforts that define saliency based on Shannon entropy of image content defined on a local neighborhood [5, 6, 7, 8]. The model presented in this work is based on the closely related quantity of self-information [9]. In section 2.2 we discuss differences between entropy and self-information in this context, including why self-information may present a more appropriate metric than entropy in this domain. That said, the contributions of this paper are as follows: \n\n1. A bottom-up model of overt attention with selection based on the self-information of local image content. \n\n2. A qualitative and quantitative comparison of predictions of the model with human eye tracking data, contrasted against the model of Itti and Koch [3]. \n\n3. Demonstration that the model is neurally plausible via implementation based on a neural circuit resembling circuitry involved in early visual processing in primates. \n\n4. Discussion of how the proposal generalizes to address issues that deny explanation by existing saliency based attention models. \n\n2 The Proposed Saliency Measure \n\nThere exists much evidence indicating that the primate visual system is built on the principle of establishing a sparse representation of image statistics. 
In the most prominent of such studies, it was demonstrated that learning a sparse code for natural image statistics results in the emergence of simple-cell receptive fields similar to those appearing in the primary visual cortex of primates [10, 11]. The apparent benefit of such a representation comes from the fact that a sparse representation allows certain independence assumptions with regard to neural firing. This issue becomes important in evaluating the likelihood of a set of local image statistics and is elaborated on later in this section. \n\nIn this paper, saliency is determined by quantifying the self-information of each local image patch. Even for a very small image patch, the probability distribution resides in a very high dimensional space. There is insufficient data in a single image to produce a reasonable estimate of the probability distribution. For this reason, a representation based on independent components is employed for the independence assumption it affords. ICA is performed on a large sample of 7x7 RGB patches drawn from natural images to determine a suitable basis. For a given image, an estimate of the distribution of each basis coefficient is learned across the entire image through non-parametric density estimation. The probability of observing the RGB values corresponding to a patch centred at any image location may then be evaluated by independently considering the likelihood of each corresponding basis coefficient. The product of such likelihoods yields the joint likelihood of the entire set of basis coefficients. Given the basis determined by ICA, the preceding computation may be realized entirely in the context of a biologically plausible neural circuit. The overall architecture is depicted in figure 1. 
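The pipeline just described can be sketched in a few lines of Python. This is an illustrative approximation, not the paper's implementation: it substitutes scikit-learn's FastICA for the extended infomax algorithm used by the authors, uses random data in place of natural image patches, and estimates each coefficient's distribution with a simple histogram; all sizes and names below are placeholder choices.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

# Stand-in for a large sample of patches drawn from natural images
# (the paper uses 360,000 7x7 RGB patches; a small random surrogate here).
patch_size, n_train = 7, 2000
train = rng.standard_normal((n_train, patch_size * patch_size))

# Learn a basis whose coefficients are as independent as possible.
ica = FastICA(n_components=16, random_state=0, max_iter=500)
ica.fit(train)

# "Image": a grid of 30x30 patches; project each onto the learned basis.
img_patches = rng.standard_normal((30 * 30, patch_size * patch_size))
coeffs = ica.transform(img_patches)            # shape (n_patches, n_components)

def self_information(coeffs, n_bins=32):
    """Score each patch by -log p(w) = -sum_i log p(w_i), estimating each
    coefficient's distribution across the whole image with a histogram."""
    n_patches, n_comp = coeffs.shape
    info = np.zeros(n_patches)
    for i in range(n_comp):
        hist, edges = np.histogram(coeffs[:, i], bins=n_bins, density=True)
        idx = np.clip(np.digitize(coeffs[:, i], edges[1:-1]), 0, n_bins - 1)
        p = np.maximum(hist[idx] * np.diff(edges)[idx], 1e-12)  # bin mass
        info += -np.log(p)
    return info

saliency = self_information(coeffs).reshape(30, 30)
```

Patches whose coefficients are rare under the image-wide distributions receive high values, which is the sense in which the map measures information.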
Details of each of the aforesaid model components, including the details of the neural circuit, are as follows: \n\nProjection into independent component space provides, for each local neighborhood of the image, a vector w consisting of N variables w_i with values v_i. Each w_i specifies the contribution of a particular basis function to the representation of the local neighborhood. As mentioned, these basis functions, learned from statistical regularities observed in a large set of natural images, show remarkable similarity to V1 cells [10, 11]. The ICA projection then allows a representation w in which the components w_i are as independent as possible. For further details on the ICA projection of local image statistics see [12]. In this paper, we propose that salience may be defined based on a strategy for maximum information sampling. In particular, Shannon's self-information measure [9], -log(p(x)), applied to the joint likelihood of statistics in a local neighborhood described by w, provides an appropriate transformation between probability and the degree of information inherent in the local statistics. It is in computing the observation likelihood that a sparse representation is instrumental: Consider the probability density function p(w_1 = v_1, w_2 = v_2, ..., w_n = v_n), which quantifies the likelihood of observing the local statistics with values v_1, ..., v_n within a particular context. An appropriate context may include a larger area encompassing the local neighbourhood described by w, or the entire scene in question. The presumed independence of the ICA decomposition means that p(w_1 = v_1, w_2 = v_2, ..., w_n = v_n) = ∏_{i=1}^{n} p(w_i = v_i). Thus, a sparse representation allows the estimation of the n-dimensional space described by w to be derived from n one-dimensional probability density functions. 
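The factorization above, and the fact that the final logarithm turns the product into a per-coefficient sum, can be checked directly; the coefficient likelihoods here are arbitrary illustrative values, not quantities learned from images.

```python
import numpy as np

rng = np.random.default_rng(1)

# Marginal likelihoods p(w_i = v_i) for a hypothetical set of 8 coefficients.
p_marginals = rng.uniform(0.05, 0.9, size=8)

# Independence of the sparse components: the joint likelihood is the
# product of the marginals ...
p_joint = np.prod(p_marginals)

# ... so the self-information -log p(w) decomposes into a sum over
# coefficients, one term per one-dimensional density.
self_info_joint = -np.log(p_joint)
self_info_sum = np.sum(-np.log(p_marginals))
```

The sum form is the reason a circuit can accumulate per-coefficient contributions rather than forming an explicit n-dimensional product.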
Evaluating p(w_1 = v_1, w_2 = v_2, ..., w_n = v_n) requires considering the distribution of values taken on by each w_i in a more global context. In practice, this might be derived on the basis of a nonparametric or histogram density estimate. In the section that follows, we demonstrate that an operation equivalent to a non-parametric density estimate may be achieved using a suitable neural circuit. \n\n2.1 Likelihood Estimation in a Neural Circuit \n\nIn the following formulation, we assume an estimate of the likelihood of the components of w based on a Gaussian kernel density estimate. Any other choice of kernel may be substituted, with a Gaussian window chosen only for its common use in density estimation and without loss of generality. \n\nLet w_{i,j,k} denote the set of independent coefficients based on the neighborhood centered at j,k. An estimate of p(w_{i,j,k} = v_{i,j,k}) based on a Gaussian window is given by: \n\np(w_{i,j,k} = v_{i,j,k}) = ∑_{s,t ∈ Ψ} ω(s,t) K(v_{i,j,k} - v_{i,s,t})   (1) \n\nwith ∑_{s,t} ω(s,t) = 1, where Ψ is the context on which the probability estimate of the coefficients of w is based. ω(s,t) describes the degree to which the coefficient w at coordinates s,t contributes to the probability estimate. On the basis of the form given in equation 1 it is evident that this operation may equivalently be implemented by the neural circuit depicted in figure 2. Figure 2 demonstrates only coefficients derived from a horizontal cross-section. The two dimensional case is analogous with parameters varying in i, j, and k dimensions. K consists of the kernel function employed for density estimation. In our case this is a Gaussian of the form (1/(σ√(2π)))e^{-x²/2σ²}. ω(s,t) is encoded based on the weight of connections to K. As x = v_{i,j,k} - v_{i,s,t}, the output of this operation encodes the impact of the kernel function with mean v_{i,s,t} on the value of p(w_{i,j,k} = v_{i,j,k}). 
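The Gaussian-window estimate of equation (1) can be sketched numerically; the helper function, the value of sigma, and the synthetic context values below are all illustrative assumptions rather than the paper's code.

```python
import numpy as np

def kernel_density_likelihood(v_query, v_context, weights=None, sigma=1.0):
    """Estimate p(w = v_query) from the coefficient values v_context observed
    over the surrounding context, in the Gaussian-window form of eq. (1).

    `weights` plays the role of omega(s, t) and should sum to 1; uniform
    weighting over the context is used by default.
    """
    v_context = np.asarray(v_context, dtype=float)
    if weights is None:
        weights = np.full(v_context.shape, 1.0 / v_context.size)
    x = v_query - v_context                       # one kernel per context value
    k = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return float(np.sum(weights * k))

# A query value near the bulk of the context is likely (low self-information);
# an outlying value is unlikely (high self-information).
context = np.concatenate([np.random.default_rng(2).normal(0, 1, 500), [6.0]])
p_typical = kernel_density_likelihood(0.0, context)
p_outlier = kernel_density_likelihood(6.0, context)
```

Each weighted kernel term corresponds to one connection into K in the circuit, so summing over the context is exactly the density estimate.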
Coefficients at the input layer correspond to coefficients of v. The logarithmic operator at the final stage might also be placed before the product on each incoming connection, with the product then becoming a summation. It is interesting to note that the structure of this circuit at the level of within-feature spatial competition is remarkably similar to the standard feedforward model of lateral inhibition, a ubiquitous operation along the visual pathways thought to play a chief role in attentional processing [14]. The similarity between independent components and V1 cells, in conjunction with the aforementioned consideration, lends credibility to the proposal that information may contribute to driving overt attentional selection. \n\nOne aspect lacking from the preceding description is that the saliency map fails to take into account the dropoff in visual acuity moving peripherally from the fovea. In some instances the maximum information accommodating for visual acuity may correspond to the center of a cluster of salient items, rather than centered on one such item. For this reason, the resulting saliency map is convolved with a Gaussian with parameters chosen to correspond approximately to the drop off in visual acuity observed in the human visual system. \n\n2.2 Self-Information versus Entropy \n\nIt is important to distinguish between self-information and entropy since these terms are often confused. The difference is subtle but important on two fronts. The first consideration lies in the expected behavior in popout paradigms and the second in the neural circuitry involved. \n\nLet X = [x_1, x_2, ..., x_n] denote a vector of RGB values corresponding to image patch X, and D a probability density function describing the distribution of some feature set over X. 
For example, D might correspond to a histogram estimate of intensity values within X or the relative contribution of different orientations within a local neighborhood situated on the boundary of an object silhouette [6]. \n\nFigure 1: The framework that achieves the desired information measure. Shown is the computation corresponding to three horizontally adjacent neighbourhoods with flow through the network indicated by the orange, purple, and cyan windows and connections. The connections shown facilitate computation of the information measure corresponding to the pixel centered in the purple window. The network architecture produces this measure on the basis of evaluating the probability of these coefficients with consideration to the values of such coefficients in neighbouring regions. \n\nAssuming an estimate of D based on N bins, the entropy of D is given by: -∑_{i=1}^{N} D_i log(D_i). In this example, entropy characterizes the extent to which the feature(s) characterized by D are uniformly distributed on X. Self-information in the proposed saliency measure is given by -log(p(X)). That is, self-information characterizes the raw likelihood of the specific n-dimensional vector of RGB values given by X. 
p(X) in this case is based on observing a number of n-dimensional feature vectors based on patches drawn from the area surrounding X. Thus, p(X) characterizes the raw likelihood of observing X based on its surround, and -log(p(X)) becomes closer to a measure of local contrast, whereas entropy as defined in the usual manner is closer to a measure of local activity. The importance of this distinction is evident in considering figure 3. Figure 3 depicts a variety of candles of varying orientation and color. There is a tendency to fixate the empty region on the left, which is the location of lowest entropy in the image. In contrast, this region receives the highest confidence from the algorithm proposed in this paper as it is highly informative in the context of this image. In classic popout experiments, a vertical line among horizontal lines presents a highly salient target. The same vertical line among many lines of random orientations is not, although the entropy associated with the second scenario is much greater. \n\nWith regard to the neural circuitry involved, we have demonstrated that self-information may be computed using a neural circuit in the absence of a representation of the entire probability distribution. Whether an equivalent operation may be achieved in a biologically plausible manner for the computation of entropy remains to be established. \n\nFigure 2: A 1D depiction of the neural architecture that computes the self-information of a set of local statistics. The operation is equivalent to a kernel density estimate. Coefficients correspond to subscripts of v_{i,j,k}. 
The small black circles indicate an inhibitory relationship and the small white circles an excitatory relationship. \n\nFigure 3: An image that highlights the difference between entropy and self-information. Fixation invariably falls on the empty patch, the locus of minimum entropy in orientation and color but maximum in self-information when the surrounding context is considered. \n\n3 Experimental Validation \n\nThe following section evaluates the output of the proposed algorithm as compared with the bottom-up model of Itti and Koch [3]. The model of Itti and Koch is perhaps the most popular model of saliency based attention and currently appears to be the yardstick against which other models are measured. \n\n3.1 Experimental eye tracking data \n\nThe data that forms the basis for performance evaluation is derived from eye tracking experiments performed while subjects observed 120 different color images. Images were presented in random order for 4 seconds each with a mask between each pair of images. Subjects were positioned 0.75m from a 21 inch CRT monitor and given no particular instructions except to observe the images. Images consist of a variety of indoor and outdoor scenes, some with very salient items, others with no particular regions of interest. The eye tracking apparatus consisted of a standard non head-mounted device. The parameters of the setup are intended to quantify salience in a general sense based on stimuli that one might expect to encounter in a typical urban environment. Data was collected from 20 different subjects for the full set of 120 images. \n\nThe issue of comparing between the output of a particular algorithm and the eye tracking data is non-trivial. 
Previous efforts have selected a number of fixation points based on the saliency map, and compared these with the experimental fixation points derived from a small number of subjects and images (7 subjects and 15 images in a recent effort [4]). There are a variety of methodological issues associated with such a representation. The most important such consideration is that the representation of perceptual importance is typically based on a saliency map. Observing the output of an algorithm that selects fixation points based on the underlying saliency map obscures observation of the degree to which the saliency maps predict important and unimportant content and, in particular, ignores confidence away from highly salient regions. Secondly, it is not clear how many fixation points should be selected. Choosing this value based on the experimental data will bias output based on information pertaining to the content of the image and may produce artificially good results. \n\nThe preceding discussion is intended to motivate the fact that selecting discrete fixation coordinates based on the saliency map for comparison may not present the most appropriate representation to use for performance evaluation. In this effort, we consider two different measures of performance. Qualitative comparison is based on the representation proposed in [16]. In this representation, a fixation density map is produced for each image based on all fixation points and subjects. Given a fixation point, one might consider how the image under consideration is sampled by the human visual system, as photoreceptor density drops steeply moving peripherally from the centre of the fovea. This dropoff may be modeled based on a 2D Gaussian distribution with appropriately chosen parameters, centred on the measured fixation point. 
A continuous fixation density map may be derived for a particular image based on the sum of all 2D Gaussians corresponding to each fixation point, from each subject. The density map then comprises a measure of the extent to which each pixel of the image is sampled on average by a human observer based on observed fixations. This affords a representation for which similarity to a saliency map may be considered at a glance. Quantitative performance evaluation is achieved based on the measure proposed in [15]. The saliency maps produced by each algorithm are treated as binary classifiers for fixation versus non-fixation points. The choice of several different thresholds and assessment of performance in predicting fixated versus not fixated pixel locations allows an ROC curve to be produced for each algorithm. \n\n3.2 Experimental Results \n\nFigure 4 affords a qualitative comparison of the output of the proposed model with the experimental eye tracking data for a variety of images. Also depicted is the output of the Itti and Koch algorithm for comparison. \n\nIn the implementation results shown, the ICA basis set was learned from a set of 360,000 7x7x3 image patches from 3600 natural images using the Lee et al. extended infomax algorithm [17]. Processed images are 340 by 255 pixels. Ψ consists of the entire extent of the image and ω(s,t) = 1/p ∀ s,t, with p the number of pixels in the image. One might make a variety of selections for these variables based on arguments related to the human visual system, or based on performance. In our case, the values have been chosen on the basis of simplicity and do not appear to dramatically affect the predictive capacity of the model in the simulation results. In particular, we wished to avoid tuning these parameters to the available data set. 
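The threshold-sweep ROC evaluation described above can be sketched as follows. The saliency map and fixation mask are synthetic stand-ins, and the function is a hypothetical helper rather than the evaluation code used for the paper.

```python
import numpy as np

def roc_curve_points(saliency, fixated, thresholds):
    """Treat a saliency map as a binary classifier of fixated vs. not-fixated
    pixels: sweep thresholds, recording (false alarm rate, hit rate) pairs."""
    s, f = saliency.ravel(), fixated.ravel().astype(bool)
    points = []
    for t in thresholds:
        pred = s >= t
        hit = np.mean(pred[f]) if f.any() else 0.0        # true positive rate
        fa = np.mean(pred[~f]) if (~f).any() else 0.0     # false alarm rate
        points.append((fa, hit))
    return points

# Synthetic example: fixations concentrated where saliency is high.
rng = np.random.default_rng(3)
sal = rng.random((40, 40))
fix = sal + 0.3 * rng.standard_normal((40, 40)) > 0.8
pts = roc_curve_points(sal, fix, np.linspace(0, 1, 21))

# Area under the curve by the trapezoidal rule; the points run from
# (1, 1) at the lowest threshold down to (0, 0) at the highest.
auc = sum(0.5 * (h0 + h1) * (f0 - f1)
          for (f0, h0), (f1, h1) in zip(pts[:-1], pts[1:]))
```

An AUC of 0.5 corresponds to chance; values approaching 1 indicate the map ranks fixated pixels above non-fixated ones.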
Future work may include a closer look at some of the parameters involved in order to determine the most appropriate choices. The ROC curves appearing in figure 5 give some sense of the efficacy of the model in predicting which regions of a scene human observers tend to fixate. As may be observed, the predictive capacity of the model is on par with the approach of Itti and Koch. Encouraging is the fact that similar performance is achieved using a method derived from first principles, and with no parameter tuning or ad hoc design choices. \n\nFigure 4: Results for qualitative comparison. Within each boxed region defined by solid lines: (Top Left) Original image. (Top Right) Saliency map produced by the Itti and Koch algorithm. (Bottom Left) Saliency map based on information maximization. (Bottom Right) Fixation density map based on experimental human eye tracking data. \n\n4 On Biological Plausibility \n\nAlthough the proposed approach, along with the model of Itti and Koch, describes saliency on the basis of a single topographical saliency map, there is mounting evidence that saliency in the primate brain is represented at several levels based on a hierarchical representation [18] of visual content. The proposed approach may accommodate such a configuration with the single necessary condition being a sparse representation at each layer. \n\nAs we have described in section 2, there is evidence that suggests the possibility that the primate visual system may consist of a multi-layer sparse coding architecture [10, 11]. The proposed algorithm quantifies information on the basis of a neural circuit, on units with response properties corresponding to neurons appearing in the primary visual cortex. However, given an analogous representation corresponding to higher visual areas that encode form, depth, convexity, etc., 
the proposed method may be employed without any modification. Since the popout of features can occur on the basis of more complex properties, such as a convex surface among concave surfaces [19], this is perhaps the next stage in a system that encodes saliency in the same manner as primates. Given a multi-layer architecture, the mechanism for selecting the locus of attention becomes less clear. In the model of Itti and Koch, a multi-layer winner-take-all network acts directly on the saliency map and there is no hierarchical representation of image content. There are, however, attention models that subscribe to a distributed representation of saliency (e.g. [20]) that may implement attentional selection with the proposed neural circuit encoding saliency at each layer. \n\nFigure 5: ROC curves for self-information (blue) and Itti and Koch (red) saliency maps. Area under curves is 0.7288 and 0.7277 respectively. \n\n5 Conclusion \n\nWe have described a strategy that predicts human attentional deployment on the principle of maximizing information sampled from a scene. Although no computational machinery is included strictly on the basis of biological plausibility, nevertheless the formulation results in an implementation based on a neurally plausible circuit acting on units that resemble those that facilitate early visual processing in primates. Comparison with an existing attention model reveals the efficacy of the proposed model in predicting salient image content. Finally, we demonstrate that the proposal might be generalized to facilitate selection based on high-level features provided an appropriate sparse representation is available. \n\nReferences \n\n[1] G.T. Buswell, How people look at pictures. 
Chicago: The University of Chicago Press. \n[2] A. Yarbus, Eye movements and vision. New York: Plenum Press. \n[3] L. Itti, C. Koch, E. Niebur, IEEE T PAMI, 20(11):1254-1259, 1998. \n[4] C.M. Privitera and L.W. Stark, IEEE T PAMI 22:970-981, 2000. \n[5] G. Fritz, C. Seifert, L. Paletta, H. Bischof, Proc. WAPCV, Graz, Austria, 2004. \n[6] L.W. Renninger, J. Coughlan, P. Verghese, J. Malik, Proceedings NIPS 17, Vancouver, 2004. \n[7] T. Kadir, M. Brady, IJCV 45(2):83-105, 2001. \n[8] T.S. Lee, S. Yu, Advances in NIPS 12:834-840, Ed. S.A. Solla, T.K. Leen, K. Muller, MIT Press. \n[9] C.E. Shannon, The Bell System Technical Journal, 27:93-154, 1948. \n[10] D.J. Field and B.A. Olshausen, Nature 381:607-609, 1996. \n[11] A.J. Bell, T.J. Sejnowski, Vision Research 37:3327-3338, 1997. \n[12] N. Bruce, Neurocomputing, 65-66:125-133, 2005. \n[13] P. Comon, Signal Processing 36(3):287-314, 1994. \n[14] M.W. Cannon and S.C. Fullenkamp, Vision Research 36(8):1115-1125, 1996. \n[15] B.W. Tatler, R.J. Baddeley, I.D. Gilchrist, Vision Research 45(5):643-659, 2005. \n[16] H. Koesling, E. Carbone, H. Ritter, University of Bielefeld, Technical Report, 2002. \n[17] T.W. Lee, M. Girolami, T.J. Sejnowski, Neural Computation 11:417-441, 1999. \n[18] J. Braun, C. Koch, D.K. Lee, L. Itti, In: Visual Attention and Cortical Circuits, (J. Braun, C. Koch, J. Davis, Ed.), 215-242, Cambridge, MA: MIT Press, 2001. \n[19] J. Hulleman, W. te Winkel, F. Boselie, Perception and Psychophysics 62:162-174, 2000. \n[20] J.K. Tsotsos, S. Culhane, W. Wai, Y. Lai, N. Davis, F. Nuflo, Artificial Intelligence 78(1-2):507-547, 1995. \n", "award": [], "sourceid": 2830, "authors": [{"given_name": "Neil", "family_name": "Bruce", "institution": null}, {"given_name": "John", "family_name": "Tsotsos", "institution": null}]}