{"title": "Toward a Single-Cell Account for Binocular Disparity Tuning: An Energy Model May Be Hiding in Your Dendrites", "book": "Advances in Neural Information Processing Systems", "page_first": 208, "page_last": 214}

2D Observers for Human 3D Object Recognition?

Zili Liu
NEC Research Institute

Daniel Kersten
University of Minnesota

Abstract

Converging evidence has shown that human object recognition depends on familiarity with the images of an object. Further, the greater the similarity between objects, the stronger is the dependence on object appearance, and the more important two-dimensional (2D) image information becomes. These findings, however, do not rule out the use of 3D structural information in recognition, and the degree to which 3D information is used in visual memory is an important issue. Liu, Knill, & Kersten (1995) showed that any model that is restricted to rotations in the image plane of independent 2D templates could not account for human performance in discriminating novel object views. We now present results from models of generalized radial basis functions (GRBF), 2D nearest neighbor matching that allows 2D affine transformations, and a Bayesian statistical estimator that integrates over all possible 2D affine transformations. The performance of the human observers relative to each of the models is better for the novel views than for the familiar template views, suggesting that humans generalize better to novel views from template views. The Bayesian estimator yields the optimal performance with 2D affine transformations and independent 2D templates.
Therefore, models of 2D affine matching operations with independent 2D templates are unlikely to account for human recognition performance.

1 Introduction

Object recognition is one of the most important functions in human vision. To understand human object recognition, it is essential to understand how objects are represented in human visual memory. A central component in object recognition is the matching of the stored object representation with that derived from the image input. But the nature of the object representation has to be inferred from recognition performance, by taking into account the contribution from the image information. When evaluating human performance, how can one separate the contributions to performance of the image information from the representation? Ideal observer analysis provides a precise computational tool to answer this question. An ideal observer's recognition performance is restricted only by the available image information and is otherwise optimal, in the sense of statistical decision theory, irrespective of how the model is implemented. A comparison of human to ideal performance (often in terms of efficiency) serves to normalize performance with respect to the image information for the task. We consider the problem of viewpoint dependence in human recognition.

A recent debate in human object recognition has focused on the dependence of recognition performance on viewpoint [1, 6]. Depending on the experimental conditions, an observer's ability to recognize a familiar object from novel viewpoints is impaired to varying degrees. A central assumption in the debate is the equivalence in viewpoint dependence and recognition performance.
In other words, the assumption is that viewpoint dependent performance implies a viewpoint dependent representation, and that viewpoint independent performance implies a viewpoint independent representation. However, given that any recognition performance depends on the input image information, which is necessarily viewpoint dependent, the viewpoint dependence of the performance is neither necessary nor sufficient for the viewpoint dependence of the representation. Image information has to be factored out first, and the ideal observer provides the means to do this.

The second aspect of an ideal observer is that it is implementation free. Consider the GRBF model [5], as compared with human object recognition (see below). The model stores a number of 2D templates {T_i} of a 3D object O, and recognizes or rejects a stimulus image S by the following similarity measure: \sum_i c_i \exp(-\|T_i - S\|^2 / 2\sigma^2), where c_i and \sigma are constants. The model's performance as a function of viewpoint parallels that of human observers. This observation has led to the conclusion that the human visual system may indeed, as does the model, use 2D stored views with GRBF interpolation to recognize 3D objects [2]. Such a conclusion, however, overlooks implementational constraints in the model, because the model's performance also depends on its implementations. Conceivably, a model with some 3D information of the objects can also mimic human performance, so long as it is appropriately implemented. There are typically too many possible models that can produce the same pattern of results.

In contrast, an ideal observer computes the optimal performance that is only limited by the stimulus information and the task.
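The GRBF similarity measure above can be sketched numerically. This is a minimal illustration only: the template and stimulus vertex coordinates, the weights c_i, and \sigma are arbitrary placeholders, not values fitted in the experiment.

```python
import numpy as np

def grbf_similarity(S, templates, c, sigma):
    """Sum of Gaussian radial basis functions centered on stored 2D templates.

    S         : flattened (x, y) vertex coordinates of the stimulus image
    templates : list of flattened template coordinate vectors T_i
    c         : per-template weights c_i
    sigma     : Gaussian width
    """
    return sum(ci * np.exp(-np.sum((Ti - S) ** 2) / (2 * sigma ** 2))
               for ci, Ti in zip(c, templates))

# Toy example: a stimulus near a stored template scores higher than one far away.
T1 = np.array([0.0, 0.0, 1.0, 0.0, 0.5, 1.0])
T2 = T1 + 5.0
score_near = grbf_similarity(T1, [T1, T2], c=[1.0, 1.0], sigma=1.0)
score_far = grbf_similarity(T1 + 3.0, [T1, T2], c=[1.0, 1.0], sigma=1.0)
```

The measure degrades smoothly with distance from every stored view, which is what allows the model's performance to fall off gradually with viewpoint change.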
We can define constrained ideals that are also limited by explicitly specified assumptions (e.g., a class of matching operations). Such a model observer therefore yields the best possible performance among the class of models with the same stimulus input and assumptions. In this paper, we are particularly interested in constrained ideal observers that are restricted in functionally significant aspects (e.g., a 2D ideal observer that stores independent 2D templates and has access only to 2D affine transformations). The key idea is that a constrained ideal observer is the best in its class. So if humans outperform this ideal observer, they must have used more than what is available to the ideal. The conclusion that follows is strong: not only does the constrained ideal fail to account for human performance, but the whole class of its implementations is also falsified.

A crucial question in object recognition is the extent to which human observers model the geometric variation in images due to the projection of a 3D object onto a 2D image. At one extreme, we have shown that any model that compares the image to independent views (even if we allow for 2D rigid transformations of the input image) is insufficient to account for human performance. At the other extreme, it is unlikely that variation is modeled in terms of rigid transformation of a 3D object template in memory. A possible intermediate solution is to match the input image to stored views, subject to 2D affine deformations. This is reasonable because 2D affine transformations approximate 3D variation over a limited range of viewpoint change.
In this study, we test whether any model limited to the independent comparison of 2D views, but with 2D affine flexibility, is sufficient to account for viewpoint dependence in human recognition. In the following section, we first define our experimental task, in which the computational models yield the provably best possible performance under their specified conditions. We then review the 2D ideal observer and GRBF model derived in [4], and the 2D affine nearest neighbor model in [8]. Our principal theoretical result is a closed-form solution of a Bayesian 2D affine ideal observer. We then compare human performance with the 2D affine ideal model, as well as the other three models. In particular, if humans can classify novel views of an object better than the 2D affine ideal, then our human observers must have used more information than that embodied by that ideal.

2 The observers

Let us first define the task. An observer looks at the 2D images of a 3D wire frame object from a number of viewpoints. These images will be called templates {T_i}. Then two distorted copies of the original 3D object are displayed. They are obtained by adding 3D Gaussian positional noise (i.i.d.) to the vertices of the original object. One distorted object is called the target, whose Gaussian noise has a constant variance. The other is the distractor, whose noise has a larger variance that can be adjusted to achieve a criterion level of performance. The two objects are displayed from the same viewpoint in parallel projection, which is either from one of the template views, or a novel view due to 3D rotation. The task is to choose the one that is more similar to the original object.
The observer's performance is measured by the variance (threshold) that gives rise to 75% correct performance. The optimal strategy is to choose the stimulus S with the larger probability p(O|S). From Bayes' rule, this is to choose the larger of p(S|O).

Assume that the models are restricted to 2D transformations of the image, and cannot reconstruct the 3D structure of the object from its independent templates {T_i}. Assume also that the prior probability p(T_i) is constant. Let us represent S and T_i by their (x, y) vertex coordinates: (X Y)^T, where X = (x^1, x^2, ..., x^n), Y = (y^1, y^2, ..., y^n). We assume that the correspondence between S and T_i is solved up to a reflection ambiguity, which is equivalent to an additional template (X^r Y^r)^T, where X^r = (x^n, ..., x^2, x^1), Y^r = (y^n, ..., y^2, y^1). We still denote the template set as {T_i}. Therefore,

    p(S|O) \propto \sum_i p(S|T_i) p(T_i).    (1)

In what follows, we will compute p(S|T_i)p(T_i), with the assumption that S = F(T_i) + N(0, \sigma^2 I_{2n}), where N is the Gaussian distribution, I_{2n} the 2n x 2n identity matrix, and F a 2D transformation. For the 2D ideal observer, F is a rigid 2D rotation. For the GRBF model, F assigns a linear coefficient to each template T_i, in addition to a 2D rotation. For the 2D affine nearest neighbor model, F represents the 2D affine transformation that minimizes \|S - T_i\|^2, after S and T_i are normalized in size. For the 2D affine ideal observer, F represents all possible 2D affine transformations applicable to T_i.

2.1 The 2D ideal observer

The templates are the original 2D images, their mirror reflections, and 2D rotations (in angle \phi) in the image plane.
Assume that the stimulus S is generated by adding Gaussian noise to a template; the probability p(S|O) is then an integration over all templates and their reflections and rotations. The detailed derivation for the 2D ideal and the GRBF model can be found in [4].

    \sum_i p(S|T_i) p(T_i) \propto \sum_i \int d\phi \, \exp(-\|S - T_i(\phi)\|^2 / 2\sigma^2).    (2)

2.2 The GRBF model

The model has the same template set as the 2D ideal observer does. Its training requires that \sum_i \int_0^{2\pi} d\phi \, c_i(\phi) N(\|T_j - T_i(\phi)\|, \sigma) = 1, j = 1, 2, ..., with which {c_i} can be obtained optimally using singular value decomposition. When a pair of new stimuli {S} are presented, the optimal decision is to choose the one that is closer to the learned prototype, in other words, the one with the smaller value of

    \| 1 - \sum_i \int_0^{2\pi} d\phi \, c_i(\phi) \exp(-\|S - T_i(\phi)\|^2 / 2\sigma^2) \|.    (3)

2.3 The 2D affine nearest neighbor model

It has been proved in [8] that the smallest Euclidean distance D(S, T) between S and T, when T is allowed a 2D affine transformation and S \to S/\|S\|, T \to T/\|T\|, is

    D^2(S, T) = 1 - tr(S^+ S \cdot T^T T) / \|T\|^2,    (4)

where tr stands for trace, and S^+ = S^T (S S^T)^{-1}. The optimal strategy, therefore, is to choose the S that gives rise to the larger of \sum_i \exp(-D^2(S, T_i) / 2\sigma^2), or the smaller of \sum_i D^2(S, T_i). (Since no probability is defined in this model, both measures will be used and the results from the better one will be reported.)

2.4 The 2D affine ideal observer

We now calculate the Bayesian probability by assuming that the prior probability distribution of the 2D affine transformation, which is applied to the template T_i,

    A T_i + T_r = (a b; c d) T_i + (t_x ... t_x; t_y ... t_y),

obeys a Gaussian distribution N(X_0, \gamma I_6), where X_0 is the identity transformation X_0 = (a, b, c, d, t_x, t_y) = (1, 0, 0, 1, 0, 0).
We have

    \sum_i p(S|T_i) = \sum_i \int_{-\infty}^{\infty} dX \, \exp(-\|A T_i + T_r - S\|^2 / 2\sigma^2)    (5)
                    = \sum_i C(n, \sigma, \gamma) \det^{-1}(Q'_i) \exp(tr(K_i^T Q_i (Q'_i)^{-1} Q_i K_i) / 2\sigma^2),    (6)

where C(n, \sigma, \gamma) is a function of n, \sigma, \gamma; Q' = Q + \gamma^{-2} I_2; and

    Q = (X_T \cdot X_T   X_T \cdot Y_T; Y_T \cdot X_T   Y_T \cdot Y_T),    Q K = (X_T \cdot X_S   Y_T \cdot X_S; X_T \cdot Y_S   Y_T \cdot Y_S).    (7)

The free parameters are \gamma and the number of 2D rotated copies for each T_i (since a 2D affine transformation implicitly includes 2D rotations, and since a specific prior probability distribution N(X_0, \gamma I) is assumed, both free parameters should be explored together to search for the optimal results).

[Figure 1 appears here.]

Figure 1: Stimulus classes with increasing structural regularity: Balls, Irregular, Symmetric, and V-Shaped. There were three objects in each class in the experiment.

2.5 The human observers

Three naive subjects were tested with four classes of objects: Balls, Irregular, Symmetric, and V-Shaped (Fig. 1). There were three objects in each class. For each object, 11 template views were learned by rotating the object 60°/step, around the X- and Y-axis, respectively. The 2D images were generated by orthographic projection, and viewed monocularly. The viewing distance was 1.5 m. During the test, the standard deviation of the Gaussian noise added to the target object was \sigma_t = 0.254 cm. No feedback was provided.

Because the image information available to the humans was more than what was available to the models (shading and occlusion in addition to the (x, y) positions of the vertices), both learned and novel views were tested in a randomly interleaved fashion.
Therefore, the strategy that humans used in the task for the learned and novel views should be the same. The number of self-occlusions, which in principle provided relative depth information, was counted and was about equal in both learned and novel view conditions. The shading information was also likely to be equal for the learned and novel views. Therefore, this additional information was about equal for the learned and novel views, and should not affect the comparison of the performance (humans relative to a model) between learned and novel views.

We predict that if the humans used a 2D affine strategy, then their performance relative to the 2D affine ideal observer should not be higher for the novel views than for the learned views. One reason to use the four classes of objects with increasing structural regularity is that structural regularity is a 3D property (e.g., 3D Symmetric vs. Irregular), which the 2D models cannot capture. The exception is the planar V-Shaped objects, for which the 2D affine models completely capture 3D rotations, and are therefore the "correct" models. The V-Shaped objects were used in the 2D affine case as a benchmark. If human performance increases with increasing structural regularity of the objects, this would lend support to the hypothesis that humans have used 3D information in the task.

2.6 Measuring performance

A staircase procedure [7] was used to track the observers' performance at the 75% correct level for the learned and novel views, respectively. There were 120 trials for the humans, and 2000 trials for each of the models.
For the GRBF model, the standard deviation of the Gaussian function was also sampled to search for the best result for the novel views for each of the 12 objects, and the result for the learned views was obtained accordingly. This resulted in a conservative test of the hypothesis of a GRBF model for human vision for the following reasons: (1) Since no feedback was provided in the human experiment and the learned and novel views were randomly intermixed, it is not straightforward for the model to find the best standard deviation for the novel views, particularly because the best standard deviation for the novel views was not the same as that for the learned ones. The performance for the novel views is therefore the upper limit of the model's performance. (2) The subjects' performance relative to the model will be defined as statistical efficiency (see below). The above method will yield the lowest possible efficiency for the novel views, and a higher efficiency for the learned views, since the best standard deviation for the novel views is different from that for the learned views. Because our hypothesis depends on a higher statistical efficiency for the novel views than for the learned views, this method will make such a putative difference even smaller. Likewise, for the 2D affine ideal, the number of 2D rotated copies of each template T_i and the value of \gamma were both extensively sampled, and the best performance for the novel views was selected accordingly. The result for the learned views corresponding to the same parameters was selected. This choice also makes it a conservative hypothesis test.
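The adaptive threshold tracking described in Section 2.6 can be illustrated with a generic up-down staircase over the distractor noise level. This is a simplified sketch only: a 2-down/1-up rule converges near 70.7% correct, not the 75% level tracked by the QUEST procedure of [7], and the simulated observer below is an arbitrary stand-in psychometric function.

```python
import random

def staircase(simulate_trial, start=1.0, step=0.05, n_trials=120, seed=0):
    """Generic 2-down/1-up staircase over the distractor noise s.d. `sigma`.

    simulate_trial(sigma) -> True if the observer answers correctly.
    The threshold is estimated as the mean sigma at the reversal points.
    """
    random.seed(seed)
    sigma, streak, last_dir, reversals = start, 0, 0, []
    for _ in range(n_trials):
        if simulate_trial(sigma):
            streak += 1
            direction = 0
            if streak == 2:          # two correct in a row: less noise (harder)
                streak, direction = 0, -1
        else:
            streak, direction = 0, +1  # an error: more noise (easier)
        if direction:
            if last_dir and direction != last_dir:
                reversals.append(sigma)
            sigma = max(0.0, sigma + direction * step)
            last_dir = direction
    return sum(reversals) / len(reversals) if reversals else sigma

# Toy observer: probability correct rises with the distractor noise level.
def observer(sigma):
    p = min(0.95, 0.5 + 0.4 * sigma)   # arbitrary psychometric stand-in
    return random.random() < p

threshold = staircase(observer)
```

The staircase concentrates trials near the threshold, which is why 120 human trials suffice while the noiseless models were run for 2000 trials.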
3 Results

[Figure 2: two panels, Learned Views and Novel Views, plotting threshold (cm) vs. object type for the Human observers, 2D Ideal, GRBF, 2D Affine Nearest Neighbor, and 2D Affine Ideal models.]

Figure 2: The threshold standard deviation of the Gaussian noise, added to the distractor in the test pair, that keeps an observer's performance at the 75% correct level, for the learned and novel views, respectively. The dotted line is the standard deviation of the Gaussian noise added to the target in the test pair.

Fig. 2 shows the threshold performance. We use statistical efficiency E to compare human to model performance. E is defined as the information used by humans relative to the ideal observer [3]: E = (d'_{human} / d'_{ideal})^2, where d' is the discrimination index. We have shown in [4] that, in our task,

    E = ((\sigma^{human}_{distractor})^2 - (\sigma_{target})^2) / ((\sigma^{ideal}_{distractor})^2 - (\sigma_{target})^2),

where \sigma is the threshold. Fig. 3 shows the statistical efficiency of the human observers relative to each of the four models.

We note in Fig. 3 that the efficiency for the novel views is higher than that for the learned views (several of them even exceeded 100%), except for the planar V-Shaped objects. We are particularly interested in the Irregular and Symmetric objects in the 2D affine ideal case, in which the pairwise comparison between the learned and novel views across the six objects and three observers yielded a significant difference (binomial, p < 0.05).
This suggests that the 2D affine ideal observer cannot account for the human performance, because if the humans used a 2D affine template matching strategy, their relative performance for the novel views cannot be better than for the learned views. We suggest therefore that 3D information was used by the human observers (e.g., 3D symmetry). This is supported in addition by the increasing efficiencies as the structural regularity increased from the Balls, Irregular, to Symmetric objects (except for the V-Shaped objects with 2D affine models).

[Figure 3: four panels, one per model, each plotting efficiency (%) vs. object type for the learned and novel views.]

Figure 3: Statistical efficiencies of human observers relative to the 2D ideal observer, the GRBF model, the 2D affine nearest neighbor model, and the 2D affine ideal observer.

4 Conclusions

Computational models of visual cognition are subject to information theoretic as well as implementational constraints. When a model's performance mimics that of human observers, it is difficult to interpret which aspects of the model characterize the human visual system. For example, human object recognition could be simulated by both a GRBF model and a model with partial 3D information of the object.
The approach we advocate here is that, instead of trying to mimic human performance by a computational model, one designs an implementation-free model for a specific recognition task that yields the best possible performance under explicitly specified computational constraints. This model provides a well-defined benchmark for performance, and if human observers outperform it, we can conclude firmly that the humans must have used better computational strategies than the model. We showed that models of independent 2D templates with 2D linear operations cannot account for human performance. This suggests that our human observers may have used the templates to reconstruct a representation of the object with some (possibly crude) 3D structural information.

References

[1] Biederman I and Gerhardstein P C. Viewpoint dependent mechanisms in visual object recognition: a critical analysis. J. Exp. Psych.: HPP, 21:1506-1514, 1995.

[2] Bülthoff H H and Edelman S. Psychophysical support for a 2D view interpolation theory of object recognition. Proc. Natl. Acad. Sci., 89:60-64, 1992.

[3] Fisher R A. Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh, 1925.

[4] Liu Z, Knill D C, and Kersten D. Object classification for human and ideal observers. Vision Research, 35:549-568, 1995.

[5] Poggio T and Edelman S. A network that learns to recognize three-dimensional objects. Nature, 343:263-266, 1990.

[6] Tarr M J and Bülthoff H H. Is human object recognition better described by geon-structural-descriptions or by multiple-views? J. Exp. Psych.: HPP, 21:1494-1505, 1995.

[7] Watson A B and Pelli D G. QUEST: A Bayesian adaptive psychometric method. Perception and Psychophysics, 33:113-120, 1983.

[8] Werman M and Weinshall D.
Similarity and affine invariant distances between 2D point sets. IEEE PAMI, 17:810-814, 1995.

Toward a Single-Cell Account for Binocular Disparity Tuning: An Energy Model May Be Hiding in Your Dendrites

Bartlett W. Mel
Department of Biomedical Engineering
University of Southern California, MC 1451
Los Angeles, CA 90089
mel@quake.usc.edu

Daniel L. Ruderman
The Salk Institute
10010 N. Torrey Pines Road
La Jolla, CA 92037
ruderman@salk.edu

Kevin A. Archie
Neuroscience Program
University of Southern California
Los Angeles, CA 90089
karchie@quake.usc.edu

Abstract

Hubel and Wiesel (1962) proposed that complex cells in visual cortex are driven by a pool of simple cells with the same preferred orientation but different spatial phases. However, a wide variety of experimental results over the past two decades have challenged the pure hierarchical model, primarily by demonstrating that many complex cells receive monosynaptic input from unoriented LGN cells, or do not depend on simple cell input. We recently showed using a detailed biophysical model that nonlinear interactions among synaptic inputs to an excitable dendritic tree could provide the nonlinear subunit computations that underlie complex cell responses (Mel, Ruderman, & Archie, 1997). This work extends the result to the case of complex cell binocular disparity tuning, by demonstrating in an isolated model pyramidal cell (1) disparity tuning at a resolution much finer than the overall dimensions of the cell's receptive field, and (2) systematically shifted optimal disparity values for rivalrous pairs of light and dark bars, both in good agreement with published reports (Ohzawa, DeAngelis, & Freeman, 1997).
Our results reemphasize the potential importance of intradendritic computation for binocular visual processing in particular, and for cortical neurophysiology in general.

1 Introduction

Binocular disparity is a powerful cue for depth in vision. The neurophysiological basis for binocular disparity processing has been of interest for decades, spawned by the early studies of Hubel and Wiesel (1962) showing neurons in primary visual cortex which could be driven by both eyes. Early qualitative models for disparity tuning held that a binocularly driven neuron could represent a particular disparity (zero, near, or far) via a relative shift of receptive field (RF) centers in the right and left eyes. According to this model, a binocular cell fires maximally when an optimal stimulus, e.g. an edge of a particular orientation, is simultaneously centered in the left and right eye receptive fields, corresponding to a stimulus at a specific depth relative to the fixation point. An account of this kind is most relevant to the case of a cortical "simple" cell, whose phase-sensitivity enforces a preference for a particular absolute location and contrast polarity of a stimulus within its monocular receptive fields.

This global receptive field shift account leads to a conceptual puzzle, however, when binocular complex cell receptive fields are considered instead, since a complex cell can respond to an oriented feature nearly independent of position within its monocular receptive field.
Since complex cell receptive field diameters in the cat lie in the range of 1-3 degrees, the excessive "play" in their monocular receptive fields would seem to render complex cells incapable of signaling disparity on the much finer scale needed for depth perception (measured in minutes).

Intriguingly, various authors have reported that a substantial fraction of complex cells in cat visual cortex are in fact tuned to left-right disparities much finer than that suggested by the size of the monocular RF's. For such cells, a stimulus delivered at the proper disparity, regardless of absolute position in either eye, produces a neural response in excess of that predicted by the sum of the monocular responses (Pettigrew, Nikara, & Bishop, 1968; Ohzawa, DeAngelis, & Freeman, 1990; Ohzawa et al., 1997). Binocular responses of this type suggest that for these cells, the left and right RF's are combined via a correlation operation rather than a simple sum (Nishihara & Poggio, 1984; Koch & Poggio, 1987). This computation has also been formalized in terms of an "energy" model (Ohzawa et al., 1990, 1997), building on the earlier use of energy models to account for complex cell orientation tuning (Pollen & Ronner, 1983) and direction selectivity (Adelson & Bergen, 1985). In an energy model for binocular disparity tuning, sums of linear Gabor filter outputs representing left and right receptive fields are squared to produce the crucial multiplicative cross terms (Ohzawa et al., 1990, 1997).
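A minimal one-dimensional sketch of such a binocular energy unit follows. The Gabor parameters, the bar stimuli, and the preferred shift are arbitrary illustrations, not the filters of Ohzawa et al.; the point is only that summing left and right linear filter outputs and squaring yields the left-right cross terms that make the response disparity selective.

```python
import numpy as np

def gabor(x, x0, sigma=1.0, freq=1.0, phase=0.0):
    """1D Gabor filter: a Gaussian-windowed sinusoid centered at x0."""
    return (np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))
            * np.cos(2 * np.pi * freq * (x - x0) + phase))

def binocular_energy(left_img, right_img, x, x0, shift):
    """Binocular energy response: left and right Gabor outputs are summed,
    then squared, over a quadrature pair (phases 0 and pi/2). `shift` is the
    position offset of the right-eye RF, i.e. the unit's preferred disparity."""
    energy = 0.0
    for phase in (0.0, np.pi / 2):
        l = np.dot(left_img, gabor(x, x0, phase=phase))
        r = np.dot(right_img, gabor(x, x0 + shift, phase=phase))
        energy += (l + r) ** 2   # squaring produces the l*r cross term
    return energy

# Toy check: a bar at the preferred disparity drives the unit more strongly
# than the same bar at the opposite disparity.
x = np.linspace(-5, 5, 201)
def bar(center):
    return np.exp(-(x - center) ** 2 / 0.1)

preferred_shift = 0.5
resp_matched = binocular_energy(bar(0.0), bar(preferred_shift), x, 0.0, preferred_shift)
resp_opposite = binocular_energy(bar(0.0), bar(-preferred_shift), x, 0.0, preferred_shift)
```

Expanding (l + r)^2 = l^2 + r^2 + 2lr makes explicit that the selectivity comes from the 2lr correlation term, not from either monocular response alone.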
Our previous biophysical modeling work has shown that the dendritic tree of a cortical pyramidal cell is well suited to support an approximate high-dimensional quadratic input-output relation, where the second-order multiplicative cross terms arise from local interactions among synaptic inputs carried out in quasi-isolated dendritic "subunits" (Mel, 1992b, 1992a, 1993). We recently applied these ideas to show that the position-invariant orientation tuning of a monocular complex cell could be computed within the dendrites of a single cortical cell, based exclusively upon excitatory inputs from a uniform, overlapping population of unoriented ON and OFF cells (Mel et al., 1997). Given the similarity of the "energy" formulations previously proposed to account for orientation tuning and binocular disparity tuning, we hypothesized that a similar type of dendritic subunit computation could underlie disparity tuning in a binocularly driven complex cell.

Parameter                  Value
R_m                        10 kOhm cm^2
R_a                        200 Ohm cm
C_m                        1.0 uF/cm^2
V_rest                     -70 mV
Compartments               615
Somatic g_Na, g_DR         0.20, 0.12 S/cm^2
Dendritic g_Na, g_DR       0.05, 0.03 S/cm^2
Input frequency            0-100 Hz
g_AMPA                     0.027 nS - 0.295 nS
tau_AMPA (on, off)         0.5 ms, 3 ms
g_NMDA                     0.27 nS - 2.95 nS
tau_NMDA (on, off)         0.5 ms, 50 ms
E_syn                      0 mV

Table 1: Biophysical simulation parameters. Details of HH channel implementation are given elsewhere (Mel, 1993); original HH channel implementation courtesy Öjvind Bernander and Rodney Douglas.
In order that local EPSP size be held approximately constant across the dendritic arbor, peak synaptic conductance at dendritic location x was approximately scaled (inversely) to the local input resistance, given by g_syn(x) = c / max(R_in(x), 200 MOhm), where c was a constant. Input resistance R_in(x) was measured for a passive cell. Thus g_syn was identical for all dendritic sites with input resistance below 200 MOhm, and was given by the larger conductance value shown; roughly 50% of the tree fell within a factor of 2 of this value. Peak conductances at the finest distal tips were smaller by roughly a factor of 10 (smaller number shown). Somatic input resistance was near 24 MOhm. The peak synaptic conductance values used were such that the ratio of steady state current injection through NMDA vs. AMPA channels was 1.2 +/- 0.4. Both AMPA and NMDA-type synaptic conductances were modeled using the kinetic scheme of Destexhe et al. (1994); synaptic activation and inactivation time constants are shown for each.

2 Methods

Compartmental simulations of a pyramidal cell from cat visual cortex (morphology courtesy of Rodney Douglas and Kevan Martin) were carried out in NEURON (Hines, 1989); simulation parameters are summarized in Table 1. The soma and dendritic membrane contained Hodgkin-Huxley-type (HH) voltage-dependent sodium and potassium channels. Following evidence for higher spike thresholds and decremental propagation in dendrites (Stuart & Sakmann, 1994), HH channel density was set to a uniform, 4-fold lower value in the dendritic membrane relative to that of the cell body. Excitatory synapses from LGN cells included both NMDA and AMPA-type synaptic conductances.
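The conductance-scaling rule in the Table 1 caption amounts to a simple clamp on local input resistance. A sketch, where the constant c is an illustrative choice (picked here so the clamped value matches the table's 0.295 nS AMPA maximum), not a value stated in the paper:

```python
def peak_syn_conductance(r_in_mohm, c=59.0):
    """Peak synaptic conductance scaled inversely to local input resistance,
    clamped so all sites below 200 MOhm receive the same (maximal) value.
    c is a placeholder constant in nS * MOhm, not the paper's fitted value."""
    return c / max(r_in_mohm, 200.0)

g_proximal = peak_syn_conductance(50.0)    # low-R_in site: clamped to the max
g_boundary = peak_syn_conductance(200.0)   # at the clamp boundary: same value
g_distal = peak_syn_conductance(2000.0)    # fine distal tip: ~10x smaller
```

The clamp reproduces the roughly 10-fold range between proximal and fine distal conductances reported in Table 1.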
Since the cell was considered to be isolated from the cortical network, inhibitory input was not modeled. Cortical cell responses were reported as the average spike rate recorded at the cell body over the 500 ms stimulus period, excluding the 50 ms initial transient.

The binocular LGN consisted of two copies of the monocular LGN model used previously (Mel et al., 1997), each consisting of a superimposed pair of 64x64 ON and OFF subfields. LGN cells were modeled as linear, half-rectified center-surround filters with centers 7 pixels in width. We randomly subsampled the left and right LGN arrays by a factor of 16 to yield 1,024 total LGN inputs to the pyramidal cell.

A Single-Cell Account for Binocular Disparity Tuning    211

A developmental principle was used to determine the spatial arrangement of these 1,024 synaptic contacts onto the dendritic branches of the cortical cell, as follows. A virtual stimulus ensemble was defined for the cell, consisting of the complete set of single vertical light or dark bars presented binocularly at zero disparity within the cell's receptive field. Within this ensemble, strong pairwise correlations existed among cells falling into vertically aligned groups of the same (ON or OFF) type, and cells in the vertical column at zero horizontal disparity in the other eye. These binocular cohorts of highly correlated LGN cells were labeled mutual "friends". Progressing through the dendritic tree in depth-first order, a randomly chosen LGN cell was assigned to the first dendritic site. A randomly chosen "friend" of hers was assigned to the second site, the third site was assigned to a friend of the site-2 input, etc., until all friends in the available subsample were assigned (4 from each eye, on average).
If the friends of the connection at site i were exhausted, a new LGN cell was chosen at random for site i+1. In earlier work, this type of synaptic arrangement was shown to be the outcome of a Hebb-type correlational learning rule, in which random, activity-independent formation of synaptic contacts acted to slowly randomize the axo-dendritic interface, shaped by Hebbian stabilization of synaptic contacts based on their short-range correlations with other synapses.

3  Results

Model pyramidal cells configured in this way exhibited prominent phase-invariant orientation tuning, the hallmark response property of the visual complex cell. Multiple orientation tuning curves are shown, for example, for a monocular complex cell, giving rise to strong tuning for light and dark bars across the receptive field (fig. 1). The bold curve shows the average of all tuning curves for this cell; the half-width at half max is 25°, in the normal range for complex cells in cat visual cortex (Orban, 1984). When the spatial arrangement of LGN synaptic contacts onto the pyramidal cell dendrites was randomly scrambled, leaving all other model parameters unchanged, orientation tuning was abolished in this cell (right frame), confirming the crucial role of spatially-mediated nonlinear synaptic interactions (average curve from left frame is reproduced for comparison).

Disparity tuning in an orientation-tuned binocular model cell is shown in fig. 2, compared to data from a complex cell in cat visual cortex (adapted from Ohzawa et al. (1997)). Responses to contrast-matched (light-light) and contrast-non-matched (light-dark) bar pairs were subtracted to produce these plots.
The strong diagonal structure indicates that both the model and real cells responded most vigorously when contrast-matched bars were presented at the same horizontal position in the left- and right-eye RFs (i.e. at zero disparity), whereas peak responses to contrast-non-matched bars occurred at symmetric near and far, non-zero disparities.

[Figure 1: two panels, "Orientation Tuning" (left) and "Ordered vs. Scrambled" (right), plotting response (spikes/sec) against orientation (degrees, -90° to 90°); left-panel curves: average, light 0, dark 4, light 8, light 16, dark 16; right-panel curves: ordered, scrambled.]

Figure 1: Orientation tuning curves are shown in the left frame for light and dark bars at 3 arbitrary positions. Essentially similar responses were seen at other receptive field positions, and for other complex cells. The bold trace indicates the average of tuning curves at positions 0, 1, 2, 4, 8, and 16 for light and dark bars. The similar form of the 6 curves shown reflects the translation invariance of the cell's response to oriented stimuli, and its symmetry with respect to ON and OFF input. Orientation tuning is eliminated when the spatial arrangement of LGN synapses onto the model cell dendrites is randomly scrambled (right frame).

[Figure 2: two panels, "Complex Cell Model" (left) and "Complex Cell in Cat V1" (right, from Ohzawa, DeAngelis, & Freeman, 1997), each plotted against right eye position.]

Figure 2: Comparison of disparity tuning in the model complex cell to that of a binocular complex cell from cat visual cortex. Light or dark bars were presented simultaneously to the left and right eyes. Bars could be of the same polarity in both eyes (light, light) or different polarity (light, dark); cell responses for these two cases were subtracted to produce the plot shown in the left frame. The right frame shows data similarly displayed for a binocular complex cell in cat visual cortex (adapted from Ohzawa et al. (1997)).

4  Discussion

The response pattern illustrated in fig. 2A is highly similar to the response generated by an analytical binocular energy model for a complex cell (Ohzawa et al., 1997):

    {exp(-k x_L^2) cos(2π f x_L) + exp(-k x_R^2) cos(2π f x_R)}^2 +
    {exp(-k x_L^2) sin(2π f x_L) + exp(-k x_R^2) sin(2π f x_R)}^2,        (1)

where x_L and x_R are the horizontal bar positions presented to the two eyes, k is the factor that determines the width of the subunit RFs, and f is the spatial frequency.

In lieu of literal simple cell "subunits", the present results indicate that the subunit computations associated with the terms of an energy model could derive largely from synaptic interactions within the dendrites of the individual cortical cell, driven exclusively by excitatory inputs from unoriented, monocular ON and OFF cells drawn from a uniform overlapping spatial distribution. While lateral inhibition and excitation play numerous important roles in cortical computation, the present results suggest they are not essential for the basic features of the nonlinear disparity-tuned responses of cortical complex cells.
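To make the comparison concrete, Eq. (1) can be evaluated directly. The sketch below is an illustrative implementation of the analytical energy model only, not of the biophysical simulation; the values of k and f are assumptions, since specific values are not quoted here.

```python
import math

def energy_response(x_left, x_right, k=0.5, f=0.25):
    """Binocular energy-model response of Eq. (1) for a bar at position
    x_left in the left eye and x_right in the right eye. k sets the
    Gaussian subunit-RF width and f the spatial frequency; both values
    here are illustrative."""
    env_l = math.exp(-k * x_left ** 2)   # left-eye Gaussian envelope
    env_r = math.exp(-k * x_right ** 2)  # right-eye Gaussian envelope
    cos_term = (env_l * math.cos(2 * math.pi * f * x_left) +
                env_r * math.cos(2 * math.pi * f * x_right))
    sin_term = (env_l * math.sin(2 * math.pi * f * x_left) +
                env_r * math.sin(2 * math.pi * f * x_right))
    return cos_term ** 2 + sin_term ** 2

# Responses are largest along the diagonal x_left == x_right (zero disparity):
matched = energy_response(0.0, 0.0)    # contrast-matched bars, zero disparity
disparate = energy_response(0.0, 2.0)  # same bars at a non-zero disparity
print(matched > disparate)             # -> True
```

Expanding the squares shows why the diagonal appears: the response equals env_L² + env_R² + 2·env_L·env_R·cos(2π f (x_L - x_R)), so for bars near the RF center it is maximal at zero disparity, x_L = x_R.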
Further, these results address the paradox as to how inputs from both unoriented LGN cells and oriented simple cells can coexist without conflict within the dendrites of a single complex cell.

A number of controls from previous work suggest that this type of subunit processing is very robustly computed in the dendrites of an individual neuron, with little sensitivity to biophysical parameters and modeling assumptions, including details of the algorithm used to spatially organize the geniculo-cortical projection, specifics of cell morphology, synaptic activation density across the dendritic tree, passive membrane and cytoplasmic parameters, and details of the kinetics, voltage dependence, or spatial distribution of the voltage-dependent dendritic channels.

One important difference between a standard energy model and the intradendritic responses generated in the present simulation experiments is that the energy model has oriented RF structure at the linear (simple-cell-like) stage, giving rise to oriented, antagonistic ON-OFF subregions (Movshon, Thompson, & Tolhurst, 1978), whereas the linear stage in our model gives rise to center-surround antagonism only within individual LGN receptive fields. Put another way, the LGN-derived subunits in the present model cannot provide all the negative cross-terms that appear in the energy model equations, specifically for pairs of pixels that fall outside the range of a single LGN receptive field.

While the present simulations involve numerous simplifications relative to the full complexity of the cortical microcircuit, the results nonetheless emphasize the potential importance of intradendritic computation in visual cortex.

Acknowledgements

Thanks to Ken Miller, Allan Dobbins, and Christof Koch for many helpful comments on this work.
This work was funded by the National Science Foundation and the Office of Naval Research, and by a Sloan Foundation Fellowship (D.R.).

References

Adelson, E., & Bergen, J. (1985). Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Amer., A 2, 284-299.

Hines, M. (1989). A program for simulation of nerve equations with branching geometries. Int. J. Biomed. Comput., 24, 55-68.

Hubel, D., & Wiesel, T. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol., 160, 106-154.

Koch, C., & Poggio, T. (1987). Biophysics of computation: Neurons, synapses, and membranes. In Edelman, G., Gall, W., & Cowan, W. (Eds.), Synaptic function, pp. 637-697. Wiley, New York.

Mel, B. (1992a). The clusteron: Toward a simple abstraction for a complex neuron. In Moody, J., Hanson, S., & Lippmann, R. (Eds.), Advances in Neural Information Processing Systems, vol. 4, pp. 35-42. Morgan Kaufmann, San Mateo, CA.

Mel, B. (1992b). NMDA-based pattern discrimination in a modeled cortical neuron. Neural Computation, 4, 502-516.

Mel, B. (1993). Synaptic integration in an excitable dendritic tree. J. Neurophysiol., 70(3), 1086-1101.

Mel, B., Ruderman, D., & Archie, K. (1997). Complex-cell responses derived from center-surround inputs: the surprising power of intradendritic computation. In Mozer, M., Jordan, M., & Petsche, T. (Eds.), Advances in Neural Information Processing Systems, vol. 9, pp. 83-89. MIT Press, Cambridge, MA.

Movshon, J., Thompson, I., & Tolhurst, D. (1978). Receptive field organization of complex cells in the cat's striate cortex. J. Physiol., 283, 79-99.
Nishihara, H., & Poggio, T. (1984). Stereo vision for robotics. In Brady, & Paul (Eds.), Proceedings of the First International Symposium of Robotics Research, pp. 489-505. MIT Press, Cambridge, MA.

Ohzawa, I., DeAngelis, G., & Freeman, R. (1990). Stereoscopic depth discrimination in the visual cortex: Neurons ideally suited as disparity detectors. Science, 249, 1037-1041.

Ohzawa, I., DeAngelis, G., & Freeman, R. (1997). Encoding of binocular disparity by complex cells in the cat's visual cortex. J. Neurophysiol., June.

Orban, G. (1984). Neuronal operations in the visual cortex. Springer Verlag, New York.

Pettigrew, J., Nikara, T., & Bishop, P. (1968). Responses to moving slits by single units in cat striate cortex. Exp. Brain Res., 6, 373-390.

Pollen, D., & Ronner, S. (1983). Visual cortical neurons as localized spatial frequency filters. IEEE Trans. Sys. Man Cybern., 13, 907-916.

Stuart, G., & Sakmann, B. (1994). Active propagation of somatic action potentials into neocortical pyramidal cell dendrites. Nature, 367, 69-72.