Part of Advances in Neural Information Processing Systems 8 (NIPS 1995)
Ernst Niebur, Christof Koch
Intermediate and higher vision processes require selection of a subset of the available sensory information before further processing. Usually, this selection is implemented in the form of a spatially circumscribed region of the visual field, the so-called "focus of attention," which scans the visual scene depending on the input and on the attentional state of the subject. We here present a model for the control of the focus of attention in primates, based on a saliency map. This mechanism is not only expected to model the functionality of biological vision but also to be essential for the understanding of complex scenes in machine vision.
1 Introduction: "What" and "Where" in Vision
It is generally accepted that the computations of early vision are massively parallel operations, i.e., applied in parallel to all parts of the visual field. This high degree of parallelism cannot be sustained in intermediate and higher vision because of the astronomical number of different possible combinations of features. Therefore, it becomes necessary to select only a part of the instantaneous sensory input for more detailed processing and to discard the rest. This is the mechanism of visual selective attention.
* Present address: Zanvyl Krieger Mind/Brain Institute and Department of Neuroscience, 3400 N. Charles Street, The Johns Hopkins University, Baltimore, MD 21218.
Control of Selective Visual Attention: Modeling the "Where" Pathway
It is clear that similar selection mechanisms are also required in machine vision for the analysis of all but the simplest visual scenes. Attentional mechanisms are slowly being introduced in this field; e.g., Yamada and Cottrell (1995) used sequential scanning by a "focus of attention" in the context of face recognition. Another model for eye scan path generation, which is characterized by a strong top-down influence, is presented by Rao and Ballard (this volume). Sequential scanning can also be applied to more abstract spaces, like the dynamics of complex systems in optimization problems with large numbers of minima (Tsioutsias and Mjolsness, this volume).
Primate vision is organized along two major anatomical pathways. One of them is concerned mainly with object recognition. For this reason, it has been called the What pathway; for anatomical reasons, it is also known as the ventral pathway. The principal task of the other major pathway is the determination of the location of objects, and therefore it is called the Where pathway or, again for anatomical reasons, the dorsal pathway.
In previous work (Niebur & Koch, 1994), we presented a model for the implementation of the What pathway. The underlying mechanism is "temporal tagging": it is assumed that the attended region of the visual field is distinguished from the unattended parts by the temporal fine-structure of the neuronal spike trains. We have shown that temporal tagging can be achieved by introducing moderate levels of correlation¹ between those neurons which respond to attended stimuli.
How can such synchronicity be obtained? We have suggested a simple, neurally plausible mechanism, namely common input to all cells which respond to attended stimuli. Such (excitatory) input will increase the propensity of postsynaptic cells to fire for a short time after receiving this input, and thereby increase the correlation between spike trains without necessarily increasing the average firing rate.
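The effect of common input on correlation can be illustrated with a toy simulation. The following is a minimal sketch and not the model of Niebur & Koch (1994): it assumes two cells that fire as independent Bernoulli processes in discrete time bins, and a shared input that, in a fraction of bins, raises the firing probability of both cells. The function names and all parameters (`rate`, `common_prob`, `boost`) are hypothetical, and the probability outside the boosted bins is lowered so that the mean rate is held fixed by construction.

```python
import random


def spike_trains(n_bins, rate, common_prob, boost, rng):
    """Simulate two binary spike trains driven by a shared input (toy model).

    In each bin the shared input is active with probability `common_prob`.
    While it is active both cells fire with probability `rate * boost`;
    otherwise they fire with a lower probability chosen so that the mean
    firing probability per bin stays equal to `rate` (an assumption of
    this sketch, made to isolate the effect on correlation).
    """
    p_hi = min(1.0, rate * boost)
    p_lo = (rate - common_prob * p_hi) / (1.0 - common_prob)
    if p_lo < 0.0:
        raise ValueError("boost too large to keep the mean rate fixed")
    a, b = [], []
    for _ in range(n_bins):
        p = p_hi if rng.random() < common_prob else p_lo
        # Given the shared input, the two cells fire independently.
        a.append(1 if rng.random() < p else 0)
        b.append(1 if rng.random() < p else 0)
    return a, b


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    vx = sum((xi - mx) ** 2 for xi in x) / n
    vy = sum((yi - my) ** 2 for yi in y) / n
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    rng = random.Random(0)
    tagged = spike_trains(200_000, 0.05, 0.1, 5.0, rng)    # with common input
    untagged = spike_trains(200_000, 0.05, 0.1, 1.0, rng)  # boost 1: no effect
    for name, (a, b) in (("tagged", tagged), ("untagged", untagged)):
        print(name, "rate:", sum(a) / len(a), "correlation:", pearson(a, b))
```

With `boost = 1.0` the shared input has no effect and the two trains are independent, while a larger `boost` yields a positive correlation at the same mean rate. This mirrors the qualitative claim above, under the stated simplifications.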
The aim of the present study is to provide a model of the control system which generates such modulating input. We will show that it is possible to construct an integrated system of attentional control which is based on neurally plausible elements and which is compatible with the anatomy and physiology of the primate visual system. The system scans a visual scene and identifies its most salient parts. A possible task would be "Find all faces in this image." We are confident that this model will not only further our understanding of the function of biological vision but that it will also be relevant for technical applications.
2 A Simple Model of The Dorsal Pathway
2.1 Overall Structure
Figure 1 shows an overview of the model Where pathway. Input is provided in the form of digitized images from an NTSC camera, which are then analyzed in various feature maps. These maps are organized around the known operations in early visual cortices. They are implemented at different spatial scales and in a center-surround structure akin to visual receptive fields. Different spatial scales are implemented as Gaussian pyramids (Adelson, Anderson, Bergen, Burt, & Ogden, 1984). The center of the receptive field corresponds to the value of a pixel at level n in the pyramid and the surround to the corresponding pixels at level n + 2, level 0 being the image in normal size. The features implemented so far are the three principal components of primate color vision (intensity, red-green, blue-yellow), four orientations, and temporal change. Short descriptions of the different feature maps are presented in the next section (2.2).
¹ In (Niebur, Koch, & Rosin, 1993), a similar model was developed using periodic "40Hz" modulation. The present model can be adapted mutatis mutandis to this type of modulation.
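The center-surround arrangement on a Gaussian pyramid (center at level n, surround at level n + 2) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. The 5-tap binomial kernel, the replicated borders, the nearest-neighbor lookup of the "corresponding" coarse pixel, and the use of a plain difference as the response are all choices made here. All function names are hypothetical.

```python
# 5-tap binomial kernel, a common approximation to a Gaussian (assumption).
KERNEL = (1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16)


def blur_rows(img):
    """Convolve every row with KERNEL, replicating the border pixels."""
    width = len(img[0])
    out = []
    for row in img:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, weight in enumerate(KERNEL):
                xx = min(max(x + k - 2, 0), width - 1)
                acc += weight * row[xx]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(img):
    return [list(col) for col in zip(*img)]


def downsample(img):
    """Blur separably (rows, then columns) and keep every second pixel."""
    blurred = transpose(blur_rows(transpose(blur_rows(img))))
    return [row[::2] for row in blurred[::2]]


def gaussian_pyramid(img, levels):
    """Return [level 0, ..., level `levels`]; level 0 is the input image."""
    pyramid = [img]
    for _ in range(levels):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


def center_surround(pyramid, n):
    """Center (level n) minus surround (corresponding pixel at level n + 2).

    Two pyramid levels differ by a factor of 4 in resolution, so pixel
    (y, x) at level n is matched to pixel (y // 4, x // 4) at level n + 2.
    """
    center, surround = pyramid[n], pyramid[n + 2]
    return [
        [center[y][x] - surround[y // 4][x // 4] for x in range(len(center[0]))]
        for y in range(len(center))
    ]


if __name__ == "__main__":
    image = [[0.0] * 16 for _ in range(16)]
    image[8][8] = 1.0  # a single bright pixel on a dark background
    response = center_surround(gaussian_pyramid(image, 2), 0)
    peak = max((v, y, x) for y, row in enumerate(response) for x, v in enumerate(row))
    print("strongest response (value, y, x):", peak)
```

A uniform image produces zero response everywhere, because the kernel weights sum to one. An isolated bright pixel produces its strongest response at its own location, which is the behavior expected of a center-surround operator.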
We then (section 2.3) address the question of the integration of the input in the "saliency map," a topographically organized map which codes for the instantaneous conspicuity of the different parts of the visual field.