{"title": "VISIT: A Neural Model of Covert Visual Attention", "book": "Advances in Neural Information Processing Systems", "page_first": 420, "page_last": 427, "abstract": null, "full_text": "VISIT: A Neural Model of Covert Visual Attention \n\nSubutai Ahmad* \n\nSiemens Research and Development, \n\nZFE ST SN6, Otto-Hahn Ring 6, \n\n8000 Munich 83, Germany. \n\nahmad~bsUD4Gztivax.siemens.eom \n\nAbstract \n\nVisual attention is the ability to dynamically restrict processing to a subset of the visual field. Researchers have long argued that such a mechanism is necessary to efficiently perform many intermediate level visual tasks. This paper describes VISIT, a novel neural network model of visual attention. The current system models the search for target objects in scenes containing multiple distractors. This is a natural task for people, it is studied extensively by psychologists, and it requires attention. The network's behavior closely matches the known psychophysical data on visual search and visual attention. VISIT also matches much of the physiological data on attention and provides a novel view of the functionality of a number of visual areas. This paper concentrates on the biological plausibility of the model and its relationship to the primary visual cortex, pulvinar, superior colliculus and posterior parietal areas. \n\n1 INTRODUCTION \n\nVisual attention is perhaps best understood in the context of visual search, i.e. the detection of a target object in images containing multiple distractor objects. This task requires solving the binding problem and has been extensively studied in psychology (see [16] for a review). The basic experimental finding is that a target object containing a single distinguishing feature can be detected in constant time, independent of the number of distractors. 
Detection based on a conjunction of features, however, takes time linear in the number of objects, implying a sequential search process (there are exceptions to this general rule). It is generally accepted that some form of covert attention¹ is necessary to accomplish this task. The following sections describe VISIT, a connectionist model of this process. The current paper concentrates on the relationships to the physiology of attention, although the psychological studies are briefly touched on. For further details on the psychological aspects see [1, 2]. \n\n*Thanks to Steve Omohundro, Anne Treisman, Joe Malpeli, and Bill Baird for enlightening discussions. Much of this research was conducted at the International Computer Science Institute, Berkeley, CA. \n\nFigure 1: Overview of VISIT. (Diagram labels: High Level Recognition, top-down information, Working Memory, Feature Maps, Image.) \n\n2 OVERVIEW OF VISIT \n\nWe first outline the essential characteristics of VISIT. Figure 1 shows the basic architecture. A set of features is first computed from the image. These features are analogous to the topographic maps computed early in the visual system. There is one unit per location per feature, with each unit computing some local property of the image. Our current implementation uses four feature maps: red, blue, horizontal, and vertical. A parallel global sum of each feature map's activity is computed and is used to detect the presence of activity in individual maps. \n\nThe feature information is fed through two different systems: a gating network and a priority network. 
The gating network implements the focus: its function is to restrict higher level processing to a single circular region. Each gate unit receives the coordinates of a circle as input. If it is outside the circle, it turns on and inhibits corresponding locations in the gated feature maps. Thus the network can filter image properties based on an external control signal. The required computation is a simple second order weighted sum and takes two time steps [1]. \n\n¹ Covert attention refers to the ability to concentrate processing on a single image region without any overt actions such as eye movements. \n\nThe priority network ranks image locations in parallel and encodes the information in a manner suited to the updating of the focus of attention. There are three units per location in the priority map. The activity of the first unit represents the location's relevance to the current task. It receives activation from the feature maps in a local neighborhood of the image. The value of the i'th such unit is calculated as: \n\n$$A_i = G\\left( \\sum_{(x,y) \\in RF_i} \\sum_{f \\in F} P_f A_{fxy} \\right) \\qquad (1)$$ \n\nwhere $A_{fxy}$ is the activation of the unit computing feature $f$ at location $(x,y)$, $RF_i$ denotes the receptive field of unit $i$, $P_f$ is the priority given to feature map $f$, and $G$ is a monotonically increasing function such as the sigmoid. $P_f$ is represented as the real valued activation of individual units and can be dynamically adjusted according to the task. Thus by setting $P_f$ for a particular feature to 1 and all others to 0, only objects containing that feature will influence the priority map. Section 2.1 describes a good strategy for setting $P_f$. The other two units at each location encode an \"error vector\", i.e. the vector difference between the unit's location and the center of the focus. 
These vectors are continually updated as the focus of attention moves around. To shift the focus to the most relevant location, the network simply adds the error vector corresponding to the highest priority unit to the activations of the units representing the center of the focus. Once a location has been visited, the corresponding relevance unit is inhibited, preventing the network from continually attending to the highest priority location. \n\nThe control networks are responsible for mediating the information flow between the gating and priority networks, as well as incorporating top-down knowledge. The following section describes the part which sets the priority values for the feature maps. The rest of the networks are described in detail in [1]. Note that the control functions are fully implemented as networks of simple units and thus require no \"homunculus\" to oversee the process. \n\n2.1 SWIFT: A FAST SEARCH STRATEGY \n\nThe main function of SWIFT is to integrate top-down and bottom-up knowledge to efficiently guide the search process. Top-down information about the target features is stored in a set of units. Let T be this set of features. Since the desired object must contain all the features of T, any of the corresponding feature maps may be searched. Using the ability to weight feature maps differently, the network removes the influence of all but one of the features in T. By setting this map's priority to 1, and all others to 0, the system will effectively prune objects which do not contain this feature.² To minimize search time, it should choose the feature corresponding to the smallest number of objects. 
Since it is difficult to count the number of objects in parallel, the network chooses the map with the minimal total activity as the one likely to contain the minimal number of objects. (If the target features are not known in advance, SWIFT chooses the minimal feature map over all features. The net effect is to always pick the most distinctive feature.) \n\n² Hence the name SWIFT: Search WIth Features Thrown out. \n\n2.2 RELATIONSHIP TO PSYCHOPHYSICAL DATA \n\nThe run time behavior of the system closely matches the data on human visual search. Visual attention in people is known to be very quick, taking as little as 40-80 msecs to engage. Given that cortical neurons can fire about once every 10 msecs, this leaves time for at most 8 sequential steps. In VISIT, unlike other implementations of attention [10], the calculation of the next location is separated from the gating process. This allows the gating to be extremely fast, requiring only 2 time steps. Iterative models, which select the most active object through lateral inhibition, require time proportional to the distance in pixels between maximally separated objects. These models are not consistent with the 80 msecs time requirement. \n\nDuring visual search, SWIFT always searches the minimal feature map. The critical variable that determines search time is M, the number of objects in the minimal feature map. Search time will be linear in M. It can be shown that VISIT plus SWIFT is consistent with all of Treisman's original experiments including single feature search, conjunctive search, 2:1 slope ratios, search asymmetries, and illusory conjunctions [16], as well as the exceptions reported in [5, 14]. 
With an assumption about the features that are coded (consistent with current physiological knowledge), the results in [7, 11] can also be modeled. (This is described in more detail in [2].) \n\n3 PHYSIOLOGY OF VISUAL ATTENTION \n\nThe above sections have described the general architecture of VISIT. There is a fairly strong correspondence between the modules in VISIT and the various visual areas involved in attention. The rest of the paper discusses these relationships. \n\n3.1 TOPOGRAPHIC FEATURE MAPS \n\nEach of the early visual areas, LGN, V1, and V2, forms several topographic maps of retinal activity. In V1 alone there are a thousand times as many neurons as there are fibers in the optic nerve, enough to form several hundred feature maps. There is a diverse list of features thought to be computed in these areas, including orientations, colors, spatial frequencies, motion, etc. [6]. These areas are analogous to the set of early feature maps computed in VISIT. \n\nIn VISIT there are actually two separate sets of feature maps: early features computed directly from the image and gated feature maps. It might seem inefficient to have two copies of the same features. An alternate possibility is to directly inhibit the early feature maps themselves, and so eliminate the need for two sets. However, in a focused state, such a network would be unable to make global decisions based on the features. With the configuration described above, at some hardware cost, the network can efficiently access both local and global information simultaneously. SWIFT relies on this ability to efficiently carry out visual search. \n\nThere is evidence for a similar setup in the human visual system. 
Although researchers have actively looked for them, no local attentional effects have been found in the early feature maps. (Only global effects, such as an overall increase in firing rate, have been noticed.) The above reasoning provides a possible computational explanation of this phenomenon. \n\nA natural question to ask is: what is the best set of features? For fast visual search, if SWIFT is used as a constraint, then we want the set of features that minimizes M over all possible images and target objects, i.e. the features that best discriminate objects. It is easy to see that the optimal set of features should be maximally uncorrelated with a near uniform distribution of feature values. Extracting the principal components of the distribution of images gives us exactly those features. It is well known that a single Hebb neuron extracts the largest principal component; sets of such neurons can be connected to select successively smaller components. Moreover, as some researchers have demonstrated, simple Hebbian learning can lead to features that look very similar to the features in visual cortex (see [3] for a review). If the early features in visual cortex do in fact represent principal components, then SWIFT is a simple strategy that takes advantage of it. \n\n3.2 THE PULVINAR \n\nIn contrast to the early visual system, local attentional effects have been discovered in the pulvinar. Recordings of cells in the lateral pulvinar of awake, behaving monkeys have demonstrated a spatially localized enhancement effect tied to selective attention [17]. Given this property it is tempting to pinpoint the pulvinar as the locus of the gated feature maps. \n\nThe general connectivity patterns provide some support for this hypothesis. 
The pulvinar is located in the dorsal part of the thalamus and is strongly connected to just about every visual area including LGN, V1, V2, superior colliculus, the frontal eye fields, and posterior parietal cortex. The projections are topography preserving and non-overlapping. As a result, the pulvinar contains several high-resolution maps of visual space, possibly one map for each one in primary visual cortex. In addition, there is a thin sheet of neurons around the pulvinar, the reticular complex, with exclusively inhibitory connections to the neurons within [4]. This is exactly the structure necessary to implement VISIT's gating system. \n\nThere are other clues which also point to the thalamus as the gating system. Human patients with thalamic lesions have difficulty engaging attention and inhibiting crosstalk from other locations. Lesioned monkeys give slower responses when competing events are present in the visual field [12]. \n\nThe hypothesis can be tested by further experiments. In particular, if a map in the pulvinar corresponding to a particular cortical area is damaged, then there should be a corresponding deficit in the ability to bind those specific features in the presence of distractors. In the absence of distractors, the performance should remain unchanged. \n\n3.3 SUPERIOR COLLICULUS \n\nThe superior colliculus (SC) is involved in the generation of eye saccades [15] and possibly in covert attention [12]. It is probably also involved in the integration of location information from various different modalities. Like the pulvinar, the SC is a structure with converging inputs from several different modalities including visual, auditory, and somatosensory [15]. 
The superior colliculus contains a representation similar to VISIT's error maps for eye saccades [15]. At each location, groups of neurons represent the vector in motor coordinates required to shift the eye to that spot. In [13] the authors studied patients with a particular form of Parkinson's disease where the SC is damaged. These patients are able to make horizontal, but not vertical eye saccades. The experiments showed that although the patients were still able to move their covert attention in both the horizontal and vertical directions, the speed of orienting in the vertical direction was much slower. In addition, [12] mentions that patients with this damage shift attention to previously attended locations as readily as new ones, suggesting a deficit in the mechanism that inhibits previously attended locations. \n\nThese findings are consistent with the priority map in VISIT. A first guess would identify the superior colliculus as the priority map; however, this is probably inaccurate. More recent evidence suggests that the SC might be involved only in bottom-up shifts of attention (induced by exogenous stimuli as opposed to endogenous control signals) (Rafal, personal communication). There is also evidence that the frontal eye fields (FEF) are involved in saccade generation in a manner similar to the superior colliculus, particularly for saccades to complex stimuli [17]. The role of the FEF in covert attention is currently unknown. \n\n3.4 POSTERIOR PARIETAL AREAS \n\nThe posterior parietal cortex (PP) may provide an answer. One hypothesis that is consistent with the data is that there are several different priority maps, for bottom-up and top-down stimuli. 
The top-down maps exist within PP, whereas the bottom-up maps exist in SC and possibly FEF. PP receives a significant projection from superior colliculus and may be involved in the production of voluntary eye saccades [17]. Experiments suggest that it is also involved in covert shifts of attention. There is evidence that neurons in PP increase their firing rate when in a state of attentive fixation [9]. Damage to PP leads to deficits in the ability to disengage covert attention away from a target [12]. In the context of eye saccades, there exist neurons in PP that fire about 55 msecs before an actual saccade. These results suggest that the control structure and the aspects of the network that integrate priority information from the various modules might also reside within PP. \n\n4 DISCUSSION AND CONCLUSIONS \n\nThe above relationships between VISIT and the brain provide a coherent picture of the functionality of the visual areas. The literature is consistent with having the LGN, V1, and V2 as the early feature maps, the pulvinar as a gating system, the superior colliculus and frontal eye fields as a bottom-up priority map, and posterior parietal cortex as the locus of a higher level priority map as well as the control networks. Figure 2 displays the various visual areas together with their proposed functional relationships. \n\nIn [12] the authors suggest that neurons in parietal lobe disengage attention from the present focus, those in superior colliculus shift attention to the target, and neurons in pulvinar engage attention on it. 
This hypothesis looks at the time course of an attentional shift (disengage, move, engage) and assigns three different areas to the three different intervals within that temporal sequence. In VISIT, these three correspond to a single operation (add a new update vector to the current location) and a single module (the control network). Instead, the emphasis is on assigning different computational responsibilities to the various modules. Each module operates continuously but is involved in a different computation. While the gating network is being updated to a new location, the priority network and portions of the control network are continuously updating the priorities. \n\nFigure 2: Proposed functionality of various visual areas. Lines denote major pathways. Those connections without arrows are known to be bi-directional. \n\nThe model doesn't yet explain the findings in [8] where neurons in V4 exhibited a localized attentional response, but only if the stimuli were within the receptive fields. However, these neurons have relatively large receptive fields and are known to code for fairly high-level features. It is possible that this corresponds to a different form of attention working at a much higher level. \n\nBy no means is VISIT intended to be a detailed physiological model of attention. Precise modeling of even a single neuron can require significant computational resources. There are many physiological details that are not incorporated. However, at the macro level there are interesting relationships between the individual modules in VISIT and the known functionality of the different areas. 
The advantage of an implemented computational model such as VISIT is that it allows us to examine the underlying computations involved and hopefully better understand the underlying processes. \n\nReferences \n\n[1] S. Ahmad. VISIT: An Efficient Computational Model of Human Visual Attention. PhD thesis, University of Illinois at Urbana-Champaign, Champaign, IL, September 1991. Also TR-91-049, International Computer Science Institute, Berkeley, CA. \n\n[2] S. Ahmad and S. Omohundro. Efficient visual search: A connectionist solution. In 13th Annual Conference of the Cognitive Science Society, Chicago, IL, August 1991. \n\n[3] S. Becker. Unsupervised learning procedures for neural networks. International Journal of Neural Systems, 12, 1991. \n\n[4] F. Crick. Function of the thalamic reticular complex: the searchlight hypothesis. In National Academy of Sciences, volume 81, pages 4586-4590, 1984. \n\n[5] H.E. Egeth, R.A. Virzi, and H. Garbart. Searching for conjunctively defined targets. Journal of Experimental Psychology: Human Perception and Performance, 10(1):32-39, 1984. \n\n[6] D. Van Essen and C.H. Anderson. Information processing strategies and pathways in the primate retina and visual cortex. In S.F. Zornetzer, J.L. Davis, and C. Lau, editors, An Introduction to Neural and Electronic Networks. Academic Press, 1990. \n\n[7] P. McLeod, J. Driver, and J. Crisp. Visual search for a conjunction of movement and form is parallel. Nature, 332:154-155, 1988. \n\n[8] J. Moran and R. Desimone. Selective attention gates visual processing in the extrastriate cortex. Science, 229, March 1985. \n\n[9] V.B. Mountcastle, R.A. Anderson, and B.C. Motter. 
The influence of attention fixation upon the excitability of the light-sensitive neurons of the posterior parietal cortex. The Journal of Neuroscience, 1(11):1218-1235, 1981. \n\n[10] M. Mozer. The Perception of Multiple Objects: A Connectionist Approach. MIT Press, Cambridge, MA, 1991. \n\n[11] K. Nakayama and G. Silverman. Serial and parallel processing of visual feature conjunctions. Nature, 320:264-265, 1986. \n\n[12] M.I. Posner and S.E. Petersen. The attention system of the human brain. Annual Review of Neuroscience, 13:25-42, 1990. \n\n[13] M.I. Posner, J.A. Walker, and R.D. Rafal. Effects of parietal injury on covert orienting of attention. The Journal of Neuroscience, 4(7):1863-1874, 1982. \n\n[14] P.T. Quinlan and G.W. Humphreys. Visual search for targets defined by combinations of color, shape, and size: An examination of the task constraints of feature and conjunction searches. Perception & Psychophysics, 41:455-472, 1987. \n\n[15] D.L. Sparks. Translation of sensory signals into commands for control of saccadic eye movements: Role of primate superior colliculus. Physiological Reviews, 66(1), 1986. \n\n[16] A. Treisman. Features and objects: The Fourteenth Bartlett Memorial Lecture. The Quarterly Journal of Experimental Psychology, 40A(2), 1988. \n\n[17] R.H. Wurtz and M.E. Goldberg, editors. The Neurobiology of Saccadic Eye Movements. Elsevier, New York, 1989. \n", "award": [], "sourceid": 551, "authors": [{"given_name": "Subutai", "family_name": "Ahmad", "institution": null}]}