{"title": "Dynamic visual attention: searching for coding length increments", "book": "Advances in Neural Information Processing Systems", "page_first": 681, "page_last": 688, "abstract": "A visual attention system should respond placidly when common stimuli are presented, while at the same time keep alert to anomalous visual inputs. In this paper, a dynamic visual attention model based on the rarity of features is proposed. We introduce the Incremental Coding Length (ICL) to measure the perspective entropy gain of each feature. The objective of our model is to maximize the entropy of the sampled visual features. In order to optimize energy consumption, the limited amount of energy of the system is re-distributed amongst features according to their Incremental Coding Length. By selecting features with large coding length increments, the computational system can achieve attention selectivity in both static and dynamic scenes. We demonstrate that the proposed model achieves superior accuracy in comparison to mainstream approaches in static saliency map generation. Moreover, we also show that our model captures several less-reported dynamic visual search behaviors, such as attentional swing and inhibition of return.", "full_text": "Dynamic Visual Attention: Searching for coding length increments\n\nXiaodi Hou1,2 and Liqing Zhang1 \u2217\n\n1Department of Computer Science and Engineering, Shanghai Jiao Tong University\n\nNo. 800 Dongchuan Road, 200240, China\n\n2Department of Computation and Neural Systems, California Institute of Technology\n\nMC 136-93, Pasadena, CA, 91125, USA\n\nxhou@caltech.edu, zhang-lq@sjtu.edu.cn\n\nAbstract\n\nA visual attention system should respond placidly when common stimuli are presented,\nwhile at the same time keep alert to anomalous visual inputs. In this paper,\na dynamic visual attention model based on the rarity of features is proposed. 
We\nintroduce the Incremental Coding Length (ICL) to measure the perspective entropy\ngain of each feature. The objective of our model is to maximize the entropy\nof the sampled visual features. In order to optimize energy consumption, the\nlimited amount of energy of the system is re-distributed amongst features according\nto their Incremental Coding Length. By selecting features with large coding\nlength increments, the computational system can achieve attention selectivity in\nboth static and dynamic scenes. We demonstrate that the proposed model achieves\nsuperior accuracy in comparison to mainstream approaches in static saliency map\ngeneration. Moreover, we also show that our model captures several less-reported\ndynamic visual search behaviors, such as attentional swing and inhibition of return.\n\n1 Introduction\n\nVisual attention plays an important role in the human visual system. This voluntary mechanism\nallows us to allocate our sensory and computational resources to the most valuable information\nembedded in the vast amount of incoming visual data. In the past decade, we have witnessed the\nsuccess of a number of computational models on visual attention (see [6] for a review). Many of\nthese models analyze static images, and output \u201csaliency maps\u201d, which indicate the probability of\neye fixations. Models such as [3] and [4] have tremendously boosted the correlation between eye\nfixation data and saliency maps.\n\nHowever, during the actual continuous perception process, important dynamic behaviors such as the\nsequential order of attended targets, shifts of attention by saccades, and the inhibitory mechanism\nthat precludes us from looking at previously observed targets, are not thoroughly discussed in the\nresearch on visual attention. 
Rather than contributing to the accuracy of saliency map generation,\nwe instead consider alternative approaches to understand visual attention:\nis there a model that characterizes the ebbs and flows of visual attention?\n\nTo date, this question has not been comprehensively answered by existing models. Algorithms\nsimulating saccades in some attention systems [23, 7] are designed for engineering expediency rather\nthan scientific investigation. These algorithms are not intended to cover the full spectrum of dynamic\nproperties of attention, nor to provide a convincing explanation of the continuous nature of attention\nbehaviors.\n\n\u2217http://www.its.caltech.edu/~xhou\n\nhttp://bcmi.sjtu.edu.cn/~zhangliqing\n\nIn this paper, we present a novel attention model that is intrinsically continuous. Unlike space-based\nmodels that take discrete frames of images as the elementary units, our framework is based on\ncontinuous sampling of features. Inspired by the principle of predictive coding [9], we use the concept\nof energy to explain saliency, feature response intensity, and the appropriation of computational\nresources in one unified framework. The appropriation of energy is based on the Incremental Coding\nLength, which indicates the rarity of a feature. As a result, stimuli that correlate to rarely activated\nfeatures will receive the highest energy, and become salient. Since the proposed model is temporally\ncontinuous, we can demonstrate a series of simulations of dynamic attention, and provide plausible\nexplanations of previously unexamined behaviors.\n\n1.1 Space and Feature Based Attention\n\nMany of the bottom-up visual attention models follow the Koch and Ullman framework [10]. By\nanalyzing feature maps that topographically encode the spatial homogeneity of features, an\nalgorithm can detect the local irregularities of the visual input. 
This paradigm explains the generation of\nattention from a one-shot observation of an image. However, several critical issues may be raised\nwhen this framework is applied to continuous observations (e.g. video). First, space-based\nattention itself cannot interpret ego-motion. Additional coordinate transformation models are required\nto translate spatial cues between two different frames. Second, there are attention mechanisms that\noperate after the generation of saliency, such as attentional modulation [19], and Inhibition of Return\n(IOR) [8]. The initial space-based framework is not likely to provide a convincing explanation of\nthese mechanisms.\n\nIn addition to saliency based on local irregularity, recent investigations in V4 and MT cortical\nareas demonstrate that attention can also be elicited by particular features [13, 18]. In the field of\ncomputational models, explorations that are biased by features are also used in task-dependent\nspatial saliency analysis [16]. The emerging evidence for feature-driven attention has encouraged us to\npropose a pure feature-based attention model in parallel with the space-based feature map paradigm.\n\n1.2 On the Cause of Attention\n\nFinding \u201cirregular patterns\u201d as a criterion for attention is widely used in computational models. In a\nmore rigorous form, saliency can be defined by the residuals of Difference of Gaussian filter banks [7],\nregions with maximal self-information [3], or most discriminant center-surround composition [4].\nHowever, all of these principles do little to address the cause of saliency mechanisms in the brain.\n\nAt the level of computation, we cannot attribute the formation of attention to functional advantages\nsuch as foraging for food [6]. 
In this paper, we hypothesize that visual attention is driven by the\npredictive coding principle, that is, the optimization of metabolic energy consumption in the brain.\nIn our framework, the behavior of attention is explained as a consequence of an actively-searching\nobserver who seeks a more economical neural code to represent the surrounding visual environment.\n\n2 The Theory\n\nMotivated by the sparse coding strategy [15] discovered in primary visual cortex, we represent\nan image patch as a linear combination of sparse coding basis functions, which are referred to as\nfeatures. The activity ratio of a feature is its average response to image patches over time and\nspace. The activity of the feature ensemble is considered as a probability function. We evaluate\neach feature with respect to its Incremental Coding Length (ICL). The ICL of the ith feature is defined\nas the ensemble\u2019s entropy gain during the activity increment of the ith feature. In accordance with\nthe general principle of predictive coding [17], we redistribute energy to features according to their\nICL contribution: frequently activated features receive less energy than rarer features. Finally, the\nsaliency of a region is obtained by summing up the activity of all features at that region.\n\n2.1 Sparse Feature Representation\n\nExperimental studies [15] have shown that the receptive fields of simple-cells in the primary visual\ncortex produce a sparse representation. With standard methods [2], we learn a set of basis functions\nthat yields a sparse representation of natural image patches. These basis functions are used as\nfeatures in the analysis of attention. Specifically, we use 120,000 RGB image patches of size 8 \u00d7 8 from\nnatural scenes for training. A set of 8 \u00d7 8 \u00d7 3 = 192 basis functions is obtained (see Fig. 1).\nLet A be the sparse basis, where $a_i$ is the ith basis function. Let $W = A^{-1}$ be the bank of filter\nfunctions, where $W = [w_1, w_2, \\ldots, w_{192}]^\\top$. 
Each row vector $w_j$ of W can be considered as a\nlinear filter to the image patch.\n\nThe sparse representation s of an image patch is its response to all filter functions. Given a vectorized\nimage x, we have s = Wx. Since each basis function represents a structural primitive, in the\ncortex representation of natural images, only a small population of neurons are activated at one\ntime. Considering the energy consumed by neural activity in the brain, this sparse coding strategy is\nadvantageous [11].\n\nFigure 1: First 30 components of the basis functions A and the corresponding filter functions W\nare shown in this figure.\n\n2.2 The Incremental Coding Length\n\nIn contrast to the long-term evolution of sparse representation, which reflects the general statistics\nof nature, short-term habituations, such as potentiation of synaptic strengths, occur during brief\nobservations in a particular environment. In order to evaluate the immediate energy changes in\nthe cortex, some previous work has analyzed the information representation and coding in the early\nvisual system [20, 21, 1]. Guided by the insights behind predictive coding [17], we propose the\nIncremental Coding Length (ICL) as a computational principle based on features. This principle\naims to optimize the immediate energy distribution in the system in order to achieve an\nenergy-economic representation of its environment.\n\nThe activity ratio $p_i$ for the ith feature is defined as its relative response level over a sequence of\nsamples. Given the sample matrix $X = [x_1, x_2, \\ldots, x_k, \\ldots]$, where $x_k$ is a vectorized image patch,\nwe can compute the activity ratio $p_i$ as:\n\n$$p_i = \\frac{\\sum_k |w_i x_k|}{\\sum_i \\sum_k |w_i x_k|}. \\tag{1}$$\n\nFurthermore, we denote $p = [p_1, p_2, \\ldots]^\\top$ as the probability function of feature activities. Note\nthat the activity ratio and the energy are abstract values that reflect the statistics of features. 
Wiring\nthis structure at the neuronal level goes beyond the scope of this paper. However, studies [13] have\nsuggested evidence of a population of neurons that is capable of generating a representation for\nintermodal features. In our implementation, the distribution p addresses the computational properties\nof this putative center.\n\nSince the visual information is jointly encoded by all features, the most efficient coding strategy\nshould make equal use of all possible feature response levels. To achieve this optimality, the model\nneeds to maximize the entropy H(p). Since p is determined by the samples X, it is possible for a\nsystem to actively bias the sampling process in favor of maximizing information transmission.\n\nAt a certain point of time, the activity ratio distribution is p. We consider a new excitation to feature\ni, which will add a variation $\\varepsilon$ to $p_i$, and change the whole distribution. The new distribution $\\hat{p}$ is:\n\n$$\\hat{p}_j = \\begin{cases} \\dfrac{p_j + \\varepsilon}{1 + \\varepsilon}, \\quad j = i \\\\ \\dfrac{p_j}{1 + \\varepsilon}, \\quad j \\neq i \\end{cases}$$\n\n[Figure 2 panels: Image; Feature distribution and Incremental Coding Length (horizontal axis: Basis); Saliency map]\n\nFigure 2: The framework of feature-based selective attention.\n\nThis variation therefore changes the entropy of feature activities. 
The change of entropy with respect\nto the feature activity probability increment is:\n\n$$\\frac{\\partial H(p)}{\\partial p_i} = -\\frac{\\partial\\, p_i \\log p_i}{\\partial p_i} - \\frac{\\partial \\sum_{j \\neq i} p_j \\log p_j}{\\partial p_i} = -1 - \\log p_i - \\frac{\\partial \\sum_{j \\neq i} p_j \\log p_j}{\\partial p_i},$$\n\nwhere:\n\n$$\\frac{\\partial \\sum_{j \\neq i} p_j \\log p_j}{\\partial p_i} = H(p) - 1 + p_i + p_i \\log p_i.$$\n\nAccordingly, we define the Incremental Coding Length (ICL) to be:\n\n$$\\mathrm{ICL}(p_i) = \\frac{\\partial H(p)}{\\partial p_i} = -H(p) - p_i - \\log p_i - p_i \\log p_i \\tag{2}$$\n\n2.3 Energy Redistribution\n\nWe define the salient feature set S as: $S = \\{i \\mid \\mathrm{ICL}(p_i) > 0\\}$. The partition $\\{S, \\bar{S}\\}$ tells us whether\nsuccessive observations of feature i would increase H(p). In the context of visual attention, the\nintuition behind the salient feature set is straightforward: A feature is salient only when succeeding\nactivations of that feature can offer entropy gain to the system.\n\nWithin this general framework of feature-level optimization, we can redistribute the energy among\nfeatures. The amount of energy received by each feature is denoted $d_i$. Non-salient features are\nautomatically neglected by setting $d_k = 0$ ($k \\in \\bar{S}$). For features in the salient feature set, let:\n\n$$d_i = \\frac{\\mathrm{ICL}(p_i)}{\\sum_{j \\in S} \\mathrm{ICL}(p_j)}, \\quad (\\text{if } i \\in S). \\tag{3}$$\n\nFinally, given an image $X = [x_1, x_2, \\ldots, x_n]$, we can quantify the saliency map $M = [m_1, m_2, \\ldots, m_n]$ as:\n\n$$m_k = \\sum_{i \\in S} d_i w_i x_k. \\tag{4}$$\n\nIn Eq. 4, we notice that the saliency of a patch is not constant. It is determined by the distribution\nof p, which can be obtained by sampling the environment over space and time.\n\nAccording to Eq. 4, we notice that the saliency of a patch may vary over time and space. 
An\nintuitive explanation of this property is the contextual influence: under different circumstances,\n\u201csalient features\u201d are defined in different manners to represent the statistical characteristics of the\nimmediate environment.\n\n3 The Experiment\n\nWe proposed a framework that explains dynamic visual attention as a process that spends limited\navailable energy preferentially on rarely-seen features. In this section, we examine experimentally\nthe behavior of our attention model.\n\n3.1 Static Saliency Map Generation\n\nBy sequentially sampling over all possible image patches, we calculate the feature distribution of\na static image and generate the corresponding saliency map. These maps are then compared with\nrecords of eye fixations of human subjects. The accuracy of an algorithm is judged by the area under\nits ROC curve.\n\nWe use the fixation data collected by Bruce et al. [3] as the benchmark for comparison. This data\nset contains the eye fixation records from 20 subjects for the full set of 120 images. The images\nare down-sampled to an appropriate scale (86 \u00d7 64, 1/4 of the original size). The results for several\nmodels are indicated below. Due to a difference in the sampling density used in drawing the ROC\ncurve, the listed performance is slightly different (about 0.003) from that given in [3] and [4]. The\nalgorithms, however, are all evaluated using the same benchmark and their relative performance\nshould be unaffected. Even though it is not designed for static saliency map generation, our model\nachieves the best performance among mainstream approaches.\n\nTable 1: Performances on static image saliency\n\nModel | Area under ROC curve\nItti et al. [7] | 0.7271\nBruce et al. [3] | 0.7697\nGao et al. [4] | 0.7729\nOur model | 0.7928\n\n[Figure 3 columns, repeated for each example: input image; our approach; human fixations]\n\nFigure 3: Some examples of our experimental images.\n\n3.2 Dynamic Saliency on Videos\n\nA distinctive property of our model is that it is updated online. As proposed in Eq. 2, ICL is\ndefined by the feature activity ratio distribution. This distribution can be defined over space (when\nsampling within one 2-D image) as well as over time (when sampling over a sequence of images).\nThe temporal correlation among frames can be considered as a Laplacian distribution. Accordingly,\nat the tth frame, the cumulative activity ratio distribution $p^t$ yields:\n\n$$p^t = \\frac{1}{Z} \\sum_{\\tau=0}^{t-1} \\exp\\left(\\frac{\\tau - t}{\\lambda}\\right) \\cdot \\hat{p}^\\tau, \\tag{5}$$\n\nwhere $\\lambda$ is the half life, $\\hat{p}^\\tau$ is the feature distribution of the $\\tau$th image, and $Z = \\int p^t(x)\\,dx$ is the\nnormalization factor that ensures $p^t$ is a probability distribution.\n\nIn video saliency analysis, one of the potential challenges comes from simultaneous movements of\nthe targets and self-movements of the observer. Since our model is feature-based, spatial movements\nof an object or changing perspectives will not dramatically affect the generation of saliency maps. In\norder to evaluate the detection accuracy of our approach under a changing environment, we compare\nthe dynamic visual attention model with models proposed in [7] and [5].\n\nIn this experiment, we use a similar criterion to that described in [5]. The efficacy of the saliency\nmaps for a video clip is determined by comparing the response intensities at saccadic locations and\nrandom locations. 
Ideally, an effective saliency algorithm would have high output at locations gazed at\nby observers, and tend not to respond at most of the randomly chosen locations.\n\nTo quantify this tendency of selectivity, we first compute the distribution of saliency values at human\nsaccadic locations $q_s$ and the distribution at random locations $q_r$. Then, the KL divergence is used to\nmeasure their dissimilarity. The higher the KL divergence, the more easily a model can discriminate\nhuman saccadic locations in the image.\n\n[Figure 4 panels: A: input sample; B: model in [7]; C: model in [5]; D: our model. KL values printed above the three histograms, in order: 0.2493, 0.3403, 0.5432. Histogram horizontal axes run from 0 to 1.]\n\nFigure 4: The eye-track records and the video are obtained from [5]. This video contains both target\nmovements and self-movements. In this video, 137 saccades (yellow dots in figure A) are collected.\nGiven the sequence of generated saliency maps, we can obtain the saliency distribution at human\nsaccade locations (narrow blue bars), and random locations (wide green bars). The KL divergence\nof these two distributions indicates the performance of each model.\n\n3.3 Dynamic Visual Search\n\nWe are particularly interested in the dynamic behaviors of attention. Neurobiological experiments\nhave reported that an inhibitory effect arises after sustained attention [12]. This\nmechanism is referred to as Inhibition of Return (IOR) [8]. 
Research on the cumulative effects of\nattention [24] has suggested that the dynamics of visual search have broad implications for scene\nperception, perceptual learning, automaticity, and short-term memory. In addition, as a mechanism\nthat prevents an autonomous system from being permanently attracted to certain salient spots\nand thereby facilitates productive exploration, the computational modeling of IOR is of practical\nvalue in AI and robotics. Previous computational models such as [22, 7] implemented the IOR in\na spatially-organized, top-down manner, whereas our model samples the environment online and is\ndriven by data in a bottom-up manner. Spontaneous shifts of attention to new visual cues, as well\nas the \u201crefusal of perception\u201d behavior, arise naturally as consequences of our active search model.\nMoreover, unlike the spatial \u201cinhibitory masking\u201d approach in [7], our model is feature-based and\nis therefore free from problems caused by spatial coordinate transformations.\n\n3.3.1 Modeling Sensory Input\n\nThe sensory structure of the human retina is not uniform. The resolution of perception decreases\nwhen eccentricity increases. In order to overcome the physical limitations of the retina, an overt eye\nmovement is made so that the desired visual stimuli can be mapped onto the foveal region. Similar\nto the computational approximations in [14], we consider the fovea sampling bias as a weighted\nmask W over the reconstructed saliency map. Let the fovea be located at $(x_0, y_0)$; the saliency at\n(x, y) is weighted by W(x, y):\n\n$$W(x, y) = e^{-\\frac{1}{2}\\left[(x - x_0)^2 + (y - y_0)^2\\right]} + \\xi. \\tag{6}$$\n\nIn the experiments, we choose $\\xi = 1$.\n\n3.3.2 Overt Eye Movements towards Saliency Targets with Inhibition of Return\n\nIn the incremental perception of one static image, our dynamic visual system is guided by two factors. 
The first factor is the non-homogeneous composition of features in the observed data that\nfosters feature preferences in the system. The second factor is a foveal structure that allows the\nsystem to bias its sampling via overt eye movements. The interplay of these two factors leads to an\nactive visual search behavior that moves towards a maximum entropy equilibrium in the feature\ndistribution. It is also worth noting that these two factors achieve a hysteresis effect that is responsible\nfor Inhibition Of Return (IOR). A recently attended visual region is not likely to regain eye fixation\nwithin a short interval because of the foveated weighting. This property of IOR is demonstrated by\nour experiments.\n\nAn implementation of our dynamic visual search is shown in the algorithm box.\n\nDynamic Visual Attention\n\n1. At time t, calculate feature ICL based on $p^t$.\n2. Given the current eye fixation, generate a saliency map with foveal bias.\n3. By a saccade, move the eye to the global maximum of the saliency map.\n4. Sample the top N \u201cinformative\u201d (largest ICL) features in the fixation neighborhood. (In our experiment, N = 10.)\n5. Calculate $\\hat{p}^t$, update $p^{t+1}$, and go to Step 1.\n\nIt is also worth noting that, when run on the images provided by [3], our dynamic visual attention\nalgorithm demonstrates especially pronounced saccades when multiple salient regions are presented\nin the same image. Although we have not yet validated these saccades against human retinal data,\nto our knowledge this sort of \u201cattentional swing\u201d has never been reported in other computational\nsystems.\n\nFigure 5: Results on dynamic visual search\n\n4 Discussions\n\nA novel dynamic model of visual attention is described in this paper. We have proposed Incremental\nCoding Length as a general principle by which to distribute energy in the attention system. 
In this\nprinciple, the salient visual cues correspond to unexpected features - according to the de\ufb01nition of\nICL, these features may elicit entropy gain in the perception state and are therefore assigned high\nenergy.\n\nTo validate this theoretical framework, we have examined experimentally various aspects of visual\nattention. In experiments comparing with static saliency maps, our model more accurately predicted\nsaccades than did other mainstream models. Because the model updates its state in an online manner,\nwe can consider the statistics of a temporal sequence and our model achieved strong results in video\nsaliency generation. Finally, when feature-based ICL is combined with foveated sampling, our\nmodel provides a coherent mechanism for dynamic visual search with inhibition of return.\n\nIn expectation of further endeavors, we have presented the following original ideas. 1) In addition\nto spatial continuity cues, which are demonstrated in other literature, saliency can also be measured\nusing features. 2) By incorporating temporal dynamics, a visual attention system can capture a broad\nrange of novel behaviors that have not successfully been explained by saliency map analysis. And\n3) dynamic attention behaviors might quantitatively be explained and simulated by the pursuit of a\nmaximum entropy equilibrium in the state of perception.\n\n\f5 Acknowledgements\n\nWe thank Neil Bruce, John Tsotsos, and Laurent Itti for sharing their experimental data. The \ufb01rst\nauthor would like to thank Charles Frogner, Yang Cao, Shengping Zhang and Libo Ma for their\ninsightful discussions on the paper. The reviewers\u2019 pertinent comments and suggestions also helped\nto improve the quality of the paper. The work was supported by the National High-Tech Research\nProgram of China (Grant No. 2006AA01Z125) and the National Basic Research Program of China\n(Grant No. 2005CB724301)\n\nReferences\n\n[1] V. Balasubramanian, D. Kimber, and M. Berry. 
Metabolically Ef\ufb01cient Information Processing. Neural\n\nComputation, 13(4):799\u2013815, 2001.\n\n[2] A. Bell and T. Sejnowski. The independent components of natural scenes are edge \ufb01lters. Vision Research,\n\n37(23):3327\u20133338, 1997.\n\n[3] N. Bruce and J. Tsotsos. Saliency Based on Information Maximization. Advances in Neural Information\n\nProcessing Systems, 18, 2006.\n\n[4] D. Gao, V. Mahadevan, and N. Vasconcelos. The discriminant center-surround hypothesis for bottom-up\n\nsaliency. pages 497\u2013504, 2007.\n\n[5] L. Itti and P. Baldi. Bayesian Surprise Attracts Human Attention. Advances in Neural Information\n\nProcessing Systems, 18:547, 2006.\n\n[6] L. Itti and C. Koch. Computational modeling of visual attention. Nature Reviews Neuroscience, 2(3):194\u2013\n\n203, 2001.\n\n[7] L. Itti, C. Koch, E. Niebur, et al. A model of saliency-based visual attention for rapid scene analysis.\n\nIEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254\u20131259, 1998.\n\n[8] R. Klein. Inhibition of return. Trends in Cognitive Sciences, 4(4):138\u2013147, 2000.\n[9] C. Koch and T. Poggio. Predicting the visual world: silence is golden. Nature Neuroscience, 2:9\u201310,\n\n1999.\n\n[10] C. Koch and S. Ullman. Shifts in selective visual attention: towards the underlying neural circuitry. Hum\n\nNeurobiol, 4(4):219\u201327, 1985.\n\n[11] W. Levy and R. Baxter. Energy Ef\ufb01cient Neural Codes. Neural Codes and Distributed Representations:\n\nFoundations of Neural Computation, 1999.\n\n[12] S. Ling and M. Carrasco. When sustained attention impairs perception. Nature neuroscience, 9(10):1243,\n\n2006.\n\n[13] J. Maunsell and S. Treue. Feature-based attention in visual cortex. Trends in Neurosciences, 29(6):317\u2013\n\n322, 2006.\n\n[14] J. Najemnik and W. Geisler. Optimal eye movement strategies in visual search. Nature, 434(7031):387\u2013\n\n391, 2005.\n\n[15] B. Olshausen et al. 
Emergence of simple-cell receptive \ufb01eld properties by learning a sparse code for\n\nnatural images. Nature, 381(6583):607\u2013609, 1996.\n\n[16] R. Peters and L. Itti. Beyond bottom-up: Incorporating task-dependent in\ufb02uences into a computational\nmodel of spatial attention. IEEE Computer Society Conference on Computer Vision and Pattern Recog-\nnition, 2007.\n\n[17] R. Rao and D. Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-\n\nclassical receptive-\ufb01eld effects. Nature Neuroscience, 2:79\u201387, 1999.\n\n[18] J. Reynolds, T. Pasternak, and R. Desimone. Attention Increases Sensitivity of V4 Neurons. Neuron,\n\n26(3):703\u2013714, 2000.\n\n[19] S. Treue and J. Maunsell. Attentional modulation of visual motion processing in cortical areas MT and\n\nMST. Nature, 382(6591):539\u2013541, 1996.\n\n[20] J. van Hateren. Real and optimal neural images in early vision. Nature, 360(6399):68\u201370, 1992.\n[21] M. Wainwright. Visual adaptation as optimal information transmission. Vision Research, 39(23):3960\u2013\n\n3974, 1999.\n\n[22] D. Walther, D. Edgington, and C. Koch. Detection and tracking of objects in underwater video. Computer\nVision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society\nConference on, 1.\n\n[23] D. Walther, U. Rutishauser, C. Koch, and P. Perona. Selective visual attention enables learning and\nrecognition of multiple objects in cluttered scenes. Computer Vision and Image Understanding, 100(1-\n2):41\u201363, 2005.\n\n[24] J. Wolfe, N. Klempen, and K. Dahlen. Post-attentive vision. Journal of Experimental Psychology: Human\n\nPerception and Performance, 26(2):693\u2013716, 2000.\n\n\f", "award": [], "sourceid": 142, "authors": [{"given_name": "Xiaodi", "family_name": "Hou", "institution": null}, {"given_name": "Liqing", "family_name": "Zhang", "institution": null}]}