{"title": "Statistically Efficient Estimations Using Cortical Lateral Connections", "book": "Advances in Neural Information Processing Systems", "page_first": 97, "page_last": 103, "abstract": null, "full_text": "Statistically Efficient  Estimation Using \n\nCortical Lateral  Connections \n\nAlexandre Pouget \n\nalex@salk.edu \n\nKechen Zhang \nzhang@salk.edu \n\nAbstract \n\nCoarse  codes  are  widely  used  throughout  the  brain to encode  sen(cid:173)\nsory  and  motor  variables.  Methods  designed  to  interpret  these \ncodes,  such  as population vector analysis, are either inefficient, i.e., \nthe variance of the estimate is much larger than the smallest possi(cid:173)\nble  variance,  or biologically implausible, like  maximum likelihood. \nMoreover,  these  methods  attempt  to  compute  a  scalar  or  vector \nestimate  of the  encoded  variable.  Neurons  are  faced  with  a  simi(cid:173)\nlar  estimation problem .  They  must  read  out  the  responses  of the \npresynaptic  neurons,  but,  by  contrast,  they  typically  encode  the \nvariable  with  a  further  population  code  rather  than  as  a  scalar. \nWe  show  how  a  non-linear  recurrent  network  can  be  used  to  per(cid:173)\nform these estimation in an optimal way while keeping the estimate \nin  a  coarse  code  format.  This  work  suggests  that  lateral  connec(cid:173)\ntions  in  the  cortex  may  be  involved  in  cleaning  up  uncorrelated \nnoise  among neurons  representing  similar variables. \n\n1 \n\nIntroduction \n\nMost  sensory  and motor variables in  the  brain  are encoded  with coarse  codes,  i.e., \nthrough the activity of large populations of neurons  with broad tuning to the  vari(cid:173)\nables.  For  instance,  direction  of visual  motion is  believed  to  be  encoded  in  visual \narea  MT  by  the  responses  of a  large  number  of cells  with  bell-shaped  tuning,  as \nillustrated in figure  I-A. \n\nNeurophysiological  recordings  have  shown  that,  in  response  to  an  object  moving \nalong a  particular direction,  the  pattern of activity across such  a  population would \nlook  like  a  noisy  hill  of activity  (figure  I-B) .  On  the  basis  of this  activity,  A,  the \nbest  that  can  be  done  is  to  recover  the  conditional  probability of the  direction  of \nmotion , (),  given  the  activity,  p( (}IA).  A  slightly  less  ambitious goal  is  to  come  up \nwith a good  \"guess\",  or estimate, 0,  of the direction,  (),  given  the  activity.  Because \nof the  stochastic  nature  of the  noise,  the  estimator  is  a  random  variable,  i.e,  for \n\no AP is  at the Institute for  Computational  and Cognitive  Sciences,  Georgetown  Univer(cid:173)\n\nsity,  Washington,  DC  20007  and  KZ  is  at  The  Salk  Institute,  La  Jolla,  CA  92037  .  This \nwork  was  funded  by  McDonnell-Pew  and  Howard  Hughes  Medical  Institute. \n\n\f98 \n\nA \n\nA.  Pouget and K. Zhang \n\ni \\  \n\\ \ni \n\n1 \n\n\\ \n\nB \n\n3 \n2.5 \n.~  2 \n.\"5 \u00ab 1.5 \n\nOL-----------------------J \n\n100 \n\n200 \n\nDirection (deg) \n\n300 \n\n100 \n\n200 \n\n300 \n\nPreferred Direction (deg) \n\nFigure  1:  A- Tuning  curves  for  16  direction  tuned  neurons.  
These decoding techniques are valuable for a neurophysiologist interested in reading out the population code, but they are not directly relevant for understanding how neural circuits perform estimation. In particular, they all provide the estimate in a format which is incompatible with what we know of sensory representations in the cortex. For example, cells in V4 estimate orientation from the noisy responses of orientation tuned V1 cells but, unlike ML or OLE, which provide a scalar estimate, V4 neurons retain orientation in a coarse code format, as demonstrated by the fact that V4 cells are just as broadly tuned to orientation as V1 neurons.

Therefore, it seems that a theory of estimation in biological networks should have two critical characteristics: 1- it should preserve the estimate in a coarse code, and 2- it should be efficient, i.e., the variance should be close to the Cramer-Rao bound. We explore in this paper various network architectures for performing estimation with coarse codes using lateral connections. We start by briefly describing several classical estimators, such as OLE and ML. Then, we consider linear and non-linear recurrent networks and compare their performances with the classical estimators.

2 Classical Methods

The simplest estimators are linear, of the form θ̂_OLE = w^T A. Better performance can be obtained with a center of mass estimator (COM), θ̂_COM = Σ_i θ_i a_i / Σ_i a_i; however, in the case of a periodic variable, such as direction of motion, the best one-shot method known is the complex estimator (COMP), θ̂_COMP = phase(z), where z = Σ_{k=1}^{N} a_k e^{iθ_k} [5]. This estimator consists in fitting a cosine through the pattern of activity, like the one shown in figure 1-B, and using the phase of the best cosine fit as the estimate of direction. This method is suboptimal if the data were not generated by cosine tuning functions, as in the case illustrated in figure 1-A. It is possible to obtain optimum performance by fitting the curve that was actually used to generate the data, i.e., the actual tuning curves of the units. A maximum likelihood estimate, defined as the direction maximizing p(A|θ), involves exactly this type of curve fitting, a process illustrated in figure 1-B [5]. The estimate is computed by first finding the "expected" hill (the hill that would be obtained in a noise free system) which minimizes the distance to the data. In the case of gaussian noise, the appropriate distance measure to minimize is the Euclidean squared distance. The final position of the peak of the hill corresponds to the maximum likelihood estimate, θ̂_ML.
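
The COM, COMP and ML estimators can be sketched in a few lines each, reusing theta_pref and tuning from the encoding sketch above (the OLE weight vector, which must be fit to data, is omitted, and the grid resolution in ml_estimate is an arbitrary choice):

import numpy as np

def com_estimate(a, theta_pref):
    # center of mass of the activity; only sensible for non-periodic variables
    return np.sum(theta_pref * a) / np.sum(a)

def comp_estimate(a, theta_pref):
    # phase of z = sum_k a_k exp(i theta_k), i.e. the phase of the
    # best-fitting cosine through the activity pattern
    z = np.sum(a * np.exp(1j * theta_pref))
    return np.angle(z) % (2.0 * np.pi)

def ml_estimate(a, tuning, n_grid=3600):
    # ML under gaussian noise: slide the expected hill f(theta) along a fine
    # grid and keep the position minimizing the squared distance to the data
    grid = np.linspace(0.0, 2.0 * np.pi, n_grid, endpoint=False)
    errors = [np.sum((a - tuning(th)) ** 2) for th in grid]
    return grid[int(np.argmin(errors))]

Applied to a_gaussian from the first sketch, all three should return values near np.pi; their trial-to-trial spread is what figure 3 compares.
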
3 Recurrent Networks

Consider a circular network of 64 units, fully connected, like the one depicted in figure 2-A. With an appropriate choice of weights and activation function, this network will develop a hill-shaped pattern of activity in response to a transient input, as illustrated in figure 2-B. If we initialize this network with an activity pattern A = {a_i} corresponding to the responses of 64 direction tuned units (figure 1), we can use the final position of the hill across the neuronal array after relaxation as an estimate of the direction, θ̂. The variance of this estimator will depend on the exact choice of activation function and weights.

Figure 2: A- Circular network of 64 units. Only the connections originating from one unit are shown. B- Activity over time in the non-linear network when initialized with a random pattern at t = 0. The activity of the units is plotted as a function of their position along the circle, which is equivalent to their preferred direction of motion with an appropriate choice of weights.

3.1 Linear Network

We first consider a network of 64 units whose dynamics is governed by the following difference equation:

o_i(t+1) = \sum_{j=1}^{64} w_{ij} o_j(t)    (1)

The dynamics of such networks is well understood [3]. If each unit receives the same weight vector w, then the weight matrix W is symmetric. In this case, the network dynamics amplifies or suppresses each Fourier component of the initial input pattern, {a_i}, independently, by a factor equal to the corresponding component of the Fourier transform, w̃, of w. For example, if the first component of w̃ is more than one (resp. less than one), the first Fourier component of the initial pattern of activity will be amplified (resp. suppressed).

Thus, we can choose W such that the network selectively amplifies the first Fourier component of the data while suppressing the others. The network would be unstable, but if we stop after a large, yet fixed, number of iterations, the activity pattern would look like a cosine function of direction with a phase corresponding to the phase of the first Fourier component of the data. In other words, the network would end up fitting a cosine function to the data, which is equivalent to the COMP method described above. A network for orientation selectivity proposed by Ben-Yishai et al. [1] is closely related to this linear network.
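
The following sketch illustrates this mechanism with hypothetical gain values: a circulant weight matrix whose Fourier transform exceeds one only at the first spatial frequency, iterated a fixed number of times, after which the phase of the surviving cosine is read out exactly as in the COMP estimator.

import numpy as np

N = 64
w_hat = np.full(N, 0.5)          # suppress every Fourier component...
w_hat[1] = w_hat[N - 1] = 1.2    # ...except the first (+/- frequency pair)
w = np.real(np.fft.ifft(w_hat))  # one row of the circulant weight matrix
W_lin = np.stack([np.roll(w, i) for i in range(N)])

def linear_net_estimate(a, n_iter=30):
    theta_pos = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)
    o = a.copy()
    for _ in range(n_iter):
        o = W_lin @ o            # each step rescales every Fourier component
    # the activity is now close to a cosine; its phase is the direction estimate
    return np.angle(np.sum(o * np.exp(1j * theta_pos))) % (2.0 * np.pi)

On a noisy hill generated as in the first sketch, linear_net_estimate agrees closely with comp_estimate, which is the equivalence noted above.
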
\n\nAlthough  this  method  keeps  the  estimate  in  a  coarse  code  format,  it  suffers  two \nproblems:  it is  unclear  how  it could be extended  to non  periodic  variables,  such  as \ndisparity,  and it is suboptimal since  it is  equivalent to the  CaMP estimator. \n\n3.2  Non-Linear Network \n\nWe  consider  next a network of 64  units fully connected  whose  dynamics is  governed \nby  the following difference  equations: \n\nOi(t)  =  g( Ui(t\u00bb =  6.3  (log ( 1 + e5+1 0U,(t\u00bb) ) 0.8 \n\nu,( t Ht) = u, (t) Ht ( -u,( t) + t, W'jOj (t) ) \n\n(2) \n\n(3) \n\nZhang  (1996)  has  demonstrated  that  with  appropriate  symmetric  weights,  {Wij}, \nthis  network  develops  a  stable  hill of activity in  response  to  an  arbitrary  transient \ninput pattern {Id(figure 2-B). The shape of the hill is  fully specified  by  the weights \nand  activation function  whereas,  by  contrast,  the  final  position  of the  hill  on  the \nneuronal  array  depends  only on  the initial input.  Therefore,  like  ML,  the  network \nfits  an  \"expected\"  function  through  the  data.  We  first  present  a  set of simulations \nin  which  we  investigated  whether  ML  and  the  network  place  the  hill  at  the  same \nlocation. \n\nMethods:  The  simulations  consisted  estimating  the  value  of  the  direction  of a \nmoving  bar  based  on  the  activity,  A =  {ad,  of 64  input  units  with  hill-shaped \ntuning to direction  corrupted  by  noise.  We  used  circular  normal functions  like the \nones showed  in figure  I-A to model the  mean activities,  fiCO): \nfiCO)  =  3exp(7(cos(O - Od  - 1\u00bb + 0.3 \n\n(4) \nThe value 0.3 corresponds to the mean spontaneous activity of each  unit.  The peak, \nOJ,  of the circular normal functions were uniformly spread over the interval [0\u00b0,360\u00b0]. \nThe activities, {ad, depended on the noise distribution.  We used two types of noise, \nnormally distributed  with fixed  variance,  O'~ = 1 and Poisson  distributed: \n\nP(ai = ale) = \n\n1 \n\nJ27r0'2 \nn \n\nexp \n\n( \n\n-\n\n(a  - f'(e\u00bb2) \n\n' \n20'2 \nn \n\n, \n\nP(ai = kle)  =  J, \n\n1.(O)k  -f,(9) \n\n(5) \n\n:! \n\nOur results  compare the standard deviation offour estimators, OLE,  COM, CaMP \nand ML  to the  non-linear recurrent  network  (RN)  with  transient inputs  (the input \npatterns  are  shown  on  the  first  iteration  only).  In  the  case  of ML,  we  used  the \n\n\fStatistically Efficient Estimations Using Cortical Lateral Connections \n\n101 \n\nNoise with Normal Distribution \n\nNoise with Poisson Distribution \n\nOLE  COM  COMP  ML \n\nRN \n\nFigure 3:  Histogram of the standard deviations of the estimate for  all five  methods \n\nCramer-Rao bound to  compute the  standard  deviation  as  described  in Seung  and \nSompolinsky  (1993).  The  weights  in  the  recurrent  network  were  chosen  such  that \nthe final  pattern of activity in the network  have a  profile very similar to the tuning \nfunction  fi(O). \n\nResults:  Since  the  preferred  direction  of two  consecutive  units  in  the  network \nare  more  than  50  apart,  we  first  wonder  whether  RN  estimates  exhibit  a  bias(cid:173)\na  difference  between  the  mean  estimate and  the  true  direction- in  particular for \ndirections  between  the  peaks  of two  consecutive  units.  Our simulations showed  no \nsignificant  bias for  any  of the orientations tested  (not shown).  
Methods: The simulations consisted of estimating the direction of a moving bar based on the activity, A = {a_i}, of 64 input units with hill-shaped tuning to direction, corrupted by noise. We used circular normal functions like the ones shown in figure 1-A to model the mean activities, f_i(θ):

f_i(\theta) = 3 \exp\left(7\left(\cos(\theta - \theta_i) - 1\right)\right) + 0.3    (4)

The value 0.3 corresponds to the mean spontaneous activity of each unit. The peaks, θ_i, of the circular normal functions were uniformly spread over the interval [0°, 360°]. The activities, {a_i}, depended on the noise distribution. We used two types of noise, normally distributed with fixed variance, σ_n^2 = 1, and Poisson distributed:

P(a_i = a | \theta) = \frac{1}{\sqrt{2\pi\sigma_n^2}} \exp\left( -\frac{(a - f_i(\theta))^2}{2\sigma_n^2} \right),  \qquad  P(a_i = k | \theta) = \frac{f_i(\theta)^k e^{-f_i(\theta)}}{k!}    (5)

Our results compare the standard deviations of four estimators, OLE, COM, COMP and ML, to that of the non-linear recurrent network (RN) with transient inputs (the input patterns are shown on the first iteration only). In the case of ML, we used the Cramer-Rao bound to compute the standard deviation, as described in Seung and Sompolinsky (1993). The weights in the recurrent network were chosen such that the final pattern of activity in the network has a profile very similar to the tuning function f_i(θ).

Figure 3: Histograms of the standard deviations of the estimate for all five methods (OLE, COM, COMP, ML and RN), for noise with normal distribution and for noise with Poisson distribution.

Results: Since the preferred directions of two consecutive units in the network are more than 5° apart, we first wondered whether RN estimates exhibit a bias (a difference between the mean estimate and the true direction), in particular for directions between the peaks of two consecutive units. Our simulations showed no significant bias for any of the orientations tested (not shown). Next, we compared the standard deviations of the estimates for all five methods and for the two types of noise. The RN method was found to outperform the OLE, COM and COMP estimators in both cases and to match the Cramer-Rao bound for gaussian noise (figure 3), as suggested by our analysis. For noise with Poisson distribution, the standard deviation for RN was only 0.344° above ML (figure 3).

We also estimated numerically -∂θ̂_RN/∂a_i |_{θ=170°}, the derivative of the RN estimate with respect to the initial activity of each of the 64 units, for an orientation of 170°. In the case of ML, this derivative closely matches the derivative of the cell tuning curve, f_i'(θ). In other words, in ML, units contribute to the estimate according to the amplitude of the derivative of their tuning curve. As shown in figure 4-A, the same is true for RN: -∂θ̂_RN/∂a_i |_{θ=170°} closely matches the derivative of the unit tuning curves. In contrast, the same derivatives for the COMP estimate (dotted line) or the COM estimate (dash-dotted line) do not match the profile of f_i'(θ). In particular, units with preferred direction far away from 170°, i.e., units whose activity is just noise, end up contributing to the final estimate, hindering the performance of the estimator.

We also looked at the standard deviation of the RN estimate as a function of time, i.e., the number of iterations. Reaching a stable state can take up to several hundred iterations, which could make the RN method too slow for any practical purpose. We found, however, that the standard deviation decreases very rapidly over the first 5-6 iterations and reaches its asymptotic value after around 20 iterations (figure 4-B). Therefore, there is no need to wait for a perfectly stable pattern of activity to obtain minimum standard deviation.

Figure 4: A- Comparison of f'(θ) (solid line) with -∂θ̂/∂a_i |_{θ=170°} for RN, COMP and COM. All functions have been normalized to one. B- Standard deviation as a function of the number of iterations for RN.
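
The derivative profile in figure 4-A is straightforward to approximate by centered finite differences. A sketch, assuming the tuning and rn_estimate functions from the earlier sketches (the step size eps is an arbitrary choice):

import numpy as np

def rn_derivatives(theta0_deg=170.0, eps=1e-3):
    a0 = tuning(np.deg2rad(theta0_deg))   # noise-free activity pattern
    deriv = np.empty(N)
    for i in range(N):
        a_hi = a0.copy()
        a_hi[i] += eps
        a_lo = a0.copy()
        a_lo[i] -= eps
        # wrap the angular difference of the two estimates into (-pi, pi]
        d = np.angle(np.exp(1j * (rn_estimate(a_hi) - rn_estimate(a_lo))))
        deriv[i] = -d / (2.0 * eps)
    return deriv                          # to be compared with f'(theta)
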
Analysis: One way to determine which factors control the final position of the hill is to find a function, called a Lyapunov function, which is minimized over time by the network dynamics. Cohen and Grossberg (1983) have shown that a network characterized by the dynamical equation above, in which the input pattern {sI_i} is clamped, minimizes a Lyapunov function of the form:

L = -\frac{1}{2} \sum_{i,j} w_{ij} g(u_i) g(u_j) + \sum_i \int_0^{u_i} u g'(u) \, du - s \sum_i I_i g(u_i)    (6)

The last term is the dot product between the input pattern, {sI_i}, and the current activity pattern, {g(u_i)}, on the neuronal array. Here, s is simply a scaling factor for the input pattern. The dynamics of the network will therefore tend to minimize -Σ_i I_i g(u_i), or equivalently, to maximize the overlap between the stable pattern and the input pattern. The other terms, however, also depend on I_i, because the shape of the final stable activity profile depends on the input pattern. Therefore, the network will settle into a compromise between maximizing overlap and getting the right profile given the clamped input.

We can show, however, that for small input (i.e., as the scaling factor s → 0), the dominant term in the Lyapunov function is the dot product. To see this, we consider the Taylor expansion of the Lyapunov function L with respect to s. First, let {U_i} denote the profile of the stable activity {u_i} in the limit of zero input (s → 0), and write the corresponding value of the Lyapunov function at zero input as L_0. Keeping only the first-order terms in s of the Taylor expansion, we obtain:

L = L_0 - s \sum_i I_i g(U_i) + O(s^2)    (7)

This means that the dot product is the only first-order term in s; disturbances to the shape of the final activity profile contribute only to higher-order terms in s, which are negligible when s is small. Notice that, in the limit of zero input, the shape of the activity profile {U_i} is fixed, and the only unknown is its peak position. Because L_0 is a constant, the global minimum of the Lyapunov function here should correspond to a peak position which maximizes the dot product. The difference between u_i and U_i is negligible for sufficiently small input because, by definition, u_i → U_i as s → 0. Consequently, for small input, the network will converge to a solution maximizing primarily Σ_i I_i g(U_i), which is mathematically equivalent to minimizing the squared distance between the input and the output pattern.

Therefore, if we use an activity pattern, A = {a_i}, as the input to this network, the stable hill should have its peak at a position very close to the direction corresponding to the maximum likelihood estimate (under the assumption of gaussian noise), provided the network is not attracted into a local minimum of the Lyapunov function. This result is valid when using a small clamped input, but our simulations show that a transient input is sufficient to reach the Cramer-Rao bound.
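
As a numerical check of this analysis, the Lyapunov function (6) can be evaluated along a simulated trajectory. A sketch, under our assumed weights and activation function: with a clamped input term s·I added to the Euler step of equation (3), the value returned below should decrease over time, up to discretization error.

import numpy as np

def lyapunov(u, I, s, n_grid=400):
    # Lyapunov function of equation (6) for the sketched network (W, g above)
    o = g(u)
    value = -0.5 * o @ (W @ o) - s * np.dot(I, o)
    for ui in u:
        if ui == 0.0:
            continue                    # the integral from 0 to 0 vanishes
        xs = np.linspace(0.0, ui, n_grid)
        gp = np.gradient(g(xs), xs)     # numerical derivative of g
        value += np.trapz(xs * gp, xs)  # int_0^{u_i} x g'(x) dx
    return value

The monotonic decrease requires symmetric weights and a non-decreasing g, both of which hold in the sketched network by construction.
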
4 Discussion

Our results demonstrate that it is possible to perform efficient, unbiased estimation with coarse coding using a neurally plausible architecture. Our model relies on lateral connections to implement a prior expectation on the profile of the activity patterns. As a consequence, units determine their activation according to their own input and the activity of their neighbors. This approach shows that one of the advantages of a coarse code is to provide a representation which simplifies the problem of cleaning up uncorrelated noise within a neuronal population.

Unlike OLE, COM and COMP, the RN estimate is not the result of a voting process in which units vote for their preferred direction, θ_i. Instead, units turn out to contribute according to the derivatives of their tuning curves, f_i'(θ), as in the case of ML. This feature allows the network to ignore background noise, that is to say, responses due to factors other than the variable of interest. This property also predicts that discrimination of directions around the vertical (90°) would be most affected by shutting off the units tuned to 60° and 120°. This prediction is consistent with psychophysical experiments showing that discrimination around the vertical in humans is affected by prior adaptation to orientations displaced from the vertical by ±30° [4].

Our approach can be readily extended to any other periodic sensory or motor variable. For non-periodic variables, such as the disparity of a line in an image, our network needs to be adapted, since it currently relies on circularly symmetric weights. Simply unfolding the network will be sufficient to deal with values around the center of the interval under consideration, but more work is needed to deal with boundary values. We can also generalize this approach to an arbitrary mapping between two coarse codes for variables x and y, where y is a function of x. Indeed, a coarse code for x provides a set of radial basis functions of x, which can subsequently be used to approximate arbitrary functions. It is even conceivable to use a similar approach for one-to-many mappings, a common situation in vision and robotics, by adapting our network so that several hills can coexist simultaneously.

References

[1] R. Ben-Yishai, R. L. Bar-Or, and H. Sompolinsky. Proc. Natl. Acad. Sci. USA, 92:3844-3848, 1995.
[2] M. Cohen and S. Grossberg. IEEE Trans. SMC, 13:815-826, 1983.
[3] M. Hirsch and S. Smale. Differential Equations, Dynamical Systems and Linear Algebra. Academic Press, New York, 1974.
[4] D. M. Regan and K. I. Beverley. J. Opt. Soc. Am., 2:147-155, 1985.
[5] H. S. Seung and H. Sompolinsky. Proc. Natl. Acad. Sci. USA, 90:10749-10753, 1993.
", "award": [], "sourceid": 1312, "authors": [{"given_name": "Alexandre", "family_name": "Pouget", "institution": null}, {"given_name": "Kechen", "family_name": "Zhang", "institution": null}]}