{"title": "Deriving Receptive Fields Using an Optimal Encoding Criterion", "book": "Advances in Neural Information Processing Systems", "page_first": 953, "page_last": 960, "abstract": null, "full_text": "Deriving Receptive Fields Using An Optimal Encoding Criterion\n\nRalph Linsker\n\nIBM T. J. Watson Research Center\n\nP. O. Box 218, Yorktown Heights, NY 10598\n\nAbstract\n\nAn information-theoretic optimization principle ('infomax') has\npreviously been used for unsupervised learning of statistical\nregularities in an input ensemble. The principle states that the\ninput-output mapping implemented by a processing stage should be\nchosen so as to maximize the average mutual information between\ninput and output patterns, subject to constraints and in the\npresence of processing noise. In the present work I show how infomax,\nwhen applied to a class of nonlinear input-output mappings, can\nunder certain conditions generate optimal filters that have\nadditional useful properties: (1) Output activity (for each input\npattern) tends to be concentrated among a relatively small number\nof nodes. (2) The filters are sensitive to higher-order statistical\nstructure (beyond pairwise correlations). If the input features are\nlocalized, the filters' receptive fields tend to be localized as well.\n(3) Multiresolution sets of filters with subsampling at low spatial\nfrequencies, related to pyramid coding and wavelet representations,\nemerge as favored solutions for certain types of input ensembles. 
\n\n1 INTRODUCTION\n\nIn unsupervised network learning, the development of the connection weights is\ninfluenced by statistical properties of the ensemble of input vectors, rather than by\nthe degree of mismatch between the network's output and some 'desired' output.\nAn implicit goal of such learning is that the network should transform the input\nso that salient features present in the input are represented at the output in a\nmore useful form. This is often done by reducing the input dimensionality in a way\nthat preserves the high-variance components of the input (e.g., principal component\nanalysis, Kohonen feature maps).\n\nThe principle of maximum information preservation ('infomax') is an unsupervised\nlearning strategy that states (Linsker 1988): From a set of allowed input-output\nmappings (e.g., parametrized by the connection weights), choose a mapping that\nmaximizes the (ensemble-averaged) Shannon information that the output vector\nconveys about the input vector, in the presence of noise. Such a mapping maximizes\nthe ensemble-averaged mutual information (MI) between input and output.\n\nThis paper (a) summarizes earlier results on infomax solutions for linear networks,\n(b) identifies some limitations of these solutions (ways in which very different filter\nsets are equally optimal from the infomax standpoint), and (c) shows how, by adding\na small nonlinearity to the network, one can remove these limitations and at the\nsame time improve the utility of the output representations. We show that infomax,\nacting on the modified network, tends to favor sparsely coded representations and\n(depending on the input ensemble) sets of filters that span multiple resolution scales\n(related to wavelets and 'pyramid coding'). 
\n\n2 INFOMAX IN LINEAR NETWORKS\n\nFor definiteness and brevity, we consider a linear network having a particular type\nof noise model and input statistical properties. For a more detailed discussion of\nrelated models see (Linsker 1989).\n\nSince the computation of the MI (which involves the output entropy) is in general\nintractable for continuous-valued output vectors, previous work (and the present\npaper) makes use of a surrogate MI, which we will call the 'as-if-Gaussian' MI. This\nquantity is, by definition, computed as though the output vectors comprised a\nmultivariate Gaussian distribution having the same mean and covariance as the actual\ndistribution of output vectors. Although expedient, this substitution has lacked a\nprincipled justification. The Appendix shows that, under certain conditions, using\nthis 'surrogate MI' (and not the full MI) is indeed appropriate and justified.\n\nDenote the input vector by $S = \\{S_i\\}$ ($S_i$ is the activity at input node $i$), the output\nvector by $Z = \\{Z_n\\}$, the matrix of connection weights by $C = \\{C_{ni}\\}$, noise at the\ninput nodes by $N = \\{N_i\\}$, and noise at the output nodes by $\\nu = \\{\\nu_n\\}$. Then our\nprocessing model is, in matrix form, $Z = C(S + N) + \\nu$. Assume that $N$ and $\\nu$ are\nGaussian random variables, $\\langle S \\rangle = \\langle N \\rangle = \\langle \\nu \\rangle = 0$,\n$\\langle S N^T \\rangle = \\langle S \\nu^T \\rangle = \\langle N \\nu^T \\rangle = 0$,\nand, for the covariance matrices, $\\langle S S^T \\rangle = Q$, $\\langle N N^T \\rangle = \\eta I$, $\\langle \\nu \\nu^T \\rangle = \\beta I'$. (Angle\nbrackets denote an ensemble average, superscript $T$ denotes transpose, and $I$ and\n$I'$ denote unit matrices on the input and output spaces, respectively.) In general,\n$\\mathrm{MI} = H_Z - \\langle H_{Z|S} \\rangle$, where $H_Z$ is the output entropy and $H_{Z|S}$ is the entropy of the\noutput for given $S$. 
Replacing MI by the 'as-if-Gaussian' MI means replacing $H_Z$\nby the expression for the entropy of a multivariate Gaussian distribution, which is\n(apart from an irrelevant constant term) $H_Z^G = (1/2) \\ln \\det Q'$, where\n$Q' = \\langle Z Z^T \\rangle = C Q C^T + \\eta C C^T + \\beta I'$ is the output covariance. Note that, when $S$ is fixed,\n$Z = C S + (C N + \\nu)$ has a Gaussian distribution centered on $C S$, so that we have\n$\\langle H_{Z|S} \\rangle = (1/2) \\ln \\det Q''$, where $Q'' = \\langle (C N + \\nu)(C N + \\nu)^T \\rangle = \\eta C C^T + \\beta I'$. Therefore the\n'as-if-Gaussian' MI is\n\n$$\\mathrm{MI}' = (1/2) [\\ln \\det Q' - \\ln \\det Q'']. \\qquad (1)$$\n\nThe variance of the output at node $n$ (prior to adding noise $\\nu_n$) is\n$V_n = \\langle [C(S + N)]_n^2 \\rangle = (C Q C^T + \\eta C C^T)_{nn}$. We will constrain the dynamic range of each output\nnode (limiting the number of output values that can be discriminated from one\nanother in the presence of output noise) by requiring that $V_n = 1$ for each $n$.\nSubject to this constraint, we are to find a matrix $C$ that maximizes MI'. For a\nlocal Hebbian algorithm that accomplishes this maximization, see (Linsker 1992).\nHere, in order to proceed analytically, we consider a special case of interest.\n\nSuppose that the input statistics are shift-invariant, so that the covariance $\\langle S_i S_j \\rangle$\nis a function of $(j - i)$. We then use a shift-invariant filter Ansatz, $C_{ni} = C(i - n)$.\nInfomax then determines the optimal filter gain as a function of spatial frequency;\ni.e., the magnitude of the Fourier components $c(k)$ of $C(i - n)$. The derivation is\nsummarized below.\n\nDenote by $q(k)$, $q'(k)$, and $q''(k)$ the Fourier transforms of $Q(j - i)$, $Q'(m - n)$,\nand $Q''(m - n)$ respectively. Since $Q' = C Q C^T + \\eta C C^T + \\beta I'$, we have\n$q'(k) = [q(k) + \\eta] |c(k)|^2 + \\beta$. Similarly, $q''(k) = \\eta |c(k)|^2 + \\beta$. We obtain\n$\\mathrm{MI}' = (1/2) \\sum_k [\\ln q'(k) - \\ln q''(k)]$. 
Each node's output variance $V_n$ is equal to\n$V = (1/K) \\sum_k [q(k) + \\eta] |c(k)|^2$, where $K$ is the number of terms in the sum over $k$.\nTo maximize MI' subject to the constraint on $V$ we use the Lagrange multiplier\nmethod; that is, we maximize $\\mathrm{MI}'' = \\mathrm{MI}' + \\mu (V - 1)$ with respect to each $|c(k)|^2$.\nThis yields an equation for each $k$ that is quadratic in $|c(k)|^2$. The unique solution\nis\n\n$$(\\eta/\\beta) |c(k)|^2 = -1 + \\frac{q(k)}{2[q(k) + \\eta]} \\left\\{ 1 + \\left[ 1 - \\frac{2 \\eta K}{\\mu \\beta q(k)} \\right]^{1/2} \\right\\} \\qquad (2)$$\n\nif the RHS is positive, and zero otherwise. The Lagrange multiplier $\\mu$ ($< 0$) is chosen\nso that the $\\{|c(k)|\\}$ satisfy $V = 1$.\n\nStarting from a differently-stated goal (that of reducing redundancy subject to a\nlimit on information loss), which turns out to be closely related to infomax, (Atick\n& Redlich 1990a) found an expression for the optimal filter gain that is the same\nas that of Eq. 2 except for the choice of constraint.\n\nFilter properties found using this approach are related to those found in early stages\nof biological sensory processing. Smoothing and bandpass (contrast-enhancing)\nfilters emerge as infomax solutions (Linsker 1989, Atick & Redlich 1990a) in certain\ncases, and good agreement with retinal contrast sensitivity measurements has been\nfound (Atick & Redlich 1990b).\n\nNonetheless, the value of the infomax solution Eq. 2 is limited in two important\nways. First, the phases of the $\\{c(k)\\}$ are left undetermined. Any choice of phases is\nequally good at maximizing MI' in a linear network. Thus the real-space response\nfunction $C(i - n)$, which determines the receptive field properties of the output\nnodes, is nonunique (and indeed may be highly nonlocalized in space). 
\n\nSecond, it is useful to extend the solution Ansatz to allow a number of different filter\ntypes $a = 1, \\ldots, A$ at each output site, while continuing to require that each type\nsatisfy the shift-invariance condition $C_{ni}(a) \\equiv C(i - n; a)$. For example, one may\nwant to model a topographic 'retinocortical' mapping in which each patch of cortex\n(each 'site') contains multiple filter types, yet each patch carries out the same set of\nprocessing functions on its input. For this Ansatz, one again obtains Eq. 2 (derivation\nomitted here), but with $|c(k)|^2$ on the LHS replaced by $\\sum_a p(a) |c(k; a)|^2$,\nwhere $c(k; a)$ is the Fourier transform (F.T.) of $C(i - n; a)$, and $p(a)$ is the fraction of the total number\nof filters (at each site) that are of type $a$. The partitioning of the overall (sum-squared)\ngain among the multiple filter types is thus left undetermined.\n\nThe higher-order statistical structure of the input (beyond covariance) is not being\nexploited by infomax in the above analysis, because (1) the network is linear and (2)\nonly pairwise correlations among the output activities enter into MI'. We shall show\nthat if we make the network even mildly nonlinear, MI' is no longer independent of\nthe choice of phases or of the partitioning of gain among multiple filter types.\n\n3 NETWORK WITH WEAK NONLINEARITY\n\nWe consider the weakly nonlinear input-output relation\n$Z_n = U_n + \\epsilon U_n^3 + \\sum_i C_{ni} N_i + \\nu_n$, where $U_n = \\sum_i C_{ni} S_i$, for small $\\epsilon$.\nThis differs from the linear network analyzed\nabove by the term in $U_n^3$. (For simplicity, terms nonlinear in the noise are not\nincluded.) The cubic term increases the signal-to-noise ratio selectively when $U_n$ is\nlarge in absolute value. We maximize MI' as defined in Eq. 1. 
\n\nHeuristically, the new term will cause infomax to favor solutions in which some\noutput nodes have large (absolute) activity values, over solutions in which all output\nnodes have moderate activities. The output layer can thus encode information\nabout the input vector (e.g., signal the presence of a feature) via the high activity\nof a small number of nodes, rather than via the particular activity values of many\nnodes. This has several (interrelated) potential advantages. (1) The concentration\nof activity among fewer nodes is a type of sparse coding. (2) The resulting output\nrepresentation may be more resistant to noise. (3) The presence of a feature can be\nsignaled to a later processing stage using fewer connections. (4) Since the particular\nnodes that have high activity depend upon the input vector, this type of mapping\ntransforms a set of continuous-valued inputs at each site into a partially place-coded\nrepresentation. A model of this sort may thus be useful for understanding better\nthe formation of place-coded representations in biological systems.\n\n3.1 MATHEMATICAL DETAILS\n\nThis section may be skipped without loss of continuity. In matrix form, $U \\equiv C S$,\n$W_n \\equiv U_n^3$ for each $n$, and $Z = U + \\epsilon W + C N + \\nu$. Keeping terms through first\norder in $\\epsilon$, the output covariance is $Q' = \\langle Z Z^T \\rangle = C Q C^T + \\eta C C^T + \\beta I' + \\epsilon F$,\nwhere $F = \\langle W U^T \\rangle + \\langle U W^T \\rangle$. [As an aside, $F_{nm} = \\langle U_n U_m (U_n^2 + U_m^2) \\rangle$ resembles the\ncovariance $\\langle U_n U_m \\rangle$, except that presentations having large $U_n^2 + U_m^2$ are given greater\nweight in the ensemble average.] For shift-invariant input statistics and one filter\ntype $C_{ni} = C(i - n)$, taking the Fourier transform yields\n$q'(k) = [q(k) + \\eta] |c(k)|^2 + \\beta + \\epsilon f(k)$, where $f(k)$ is the F.T. of $F(m - n) = F_{nm}$. 
So $\\ln \\det Q' = \\sum_k \\ln q'(k) = \\sum_k \\ln \\{ [q(k) + \\eta] |c(k)|^2 + \\beta \\} + \\epsilon \\sum_k g(k)$,\nwhere $g(k) \\equiv f(k) / \\{ [q(k) + \\eta] |c(k)|^2 + \\beta \\}$.\nUsing a Lagrange multiplier as before, the quantity to be maximized is\n$\\mathrm{MI}'' = \\mathrm{MI}''(\\epsilon = 0) + (\\epsilon/2) \\sum_k g(k)$.\n\nFigure 1: Breaking of phase degeneracy. See text for discussion.\n\nNow suppose there are multiple filter types $a = 1, \\ldots, A$ at each output site. For\neach $k$ define $d(k)$ to be the $A \\times A$ matrix whose elements are:\n$d(k)_{ab} = [q(k) + \\eta] c(k; a) c^*(k; b) + [\\beta / p(a)] \\delta_{ab}$, where $\\delta_{ab}$ is the Kronecker delta. Also define $f(k)$ to\nbe the $A \\times A$ matrix each of whose elements $f(k)_{ab}$ is the F.T. of $F(m - n; a, b)$,\nwhere $F(m - n; a, b) = \\langle U_n(a) W_m(b) \\rangle + \\langle W_n(a) U_m(b) \\rangle$. Then the $O(\\epsilon)$ part of $\\mathrm{MI}''$\nis: $(\\epsilon/2) \\sum_k \\mathrm{Tr} \\{ [d(k)]^{-1} f(k) \\}$. Note that $[d(k)]^{-1}$ is the inverse of the matrix $d(k)$,\nand that 'Tr' denotes the trace. [Outline of derivation: In the basis defined by the\nFourier harmonics, $Q'$ is block diagonal (one $A \\times A$ block for each $k$). So\n$\\ln \\det Q' = \\sum_k \\ln \\det q'(k)$, where each $q'(k)$ is an $A \\times A$ matrix of the form $q'_0(k) + \\epsilon q'_1(k)$.\nExpanding $\\ln \\det q'(k)$ through $O(\\epsilon)$ yields the stated result.]\n\nThe infomax calculation to lowest order in $\\epsilon$ [i.e., $O(\\epsilon^0)$] is the same as for the linear\nnetwork. Here, for simplicity, we determine the sum-squared gain, $\\sum_a p(a) |c(k; a)|^2$,\nas in the linear case; then seek to maximize the new term, of $O(\\epsilon)$, subject to\nthis constraint on the value of the sum-squared gain. How the nonlinear term\nbreaks phase and gain-apportionment degeneracies is of interest here; a small $O(\\epsilon)$\ncorrection to the sum-squared gain is not. 
\n\n4 ILLUSTRATIVE RESULTS\n\nTwo examples will show how adding the nonlinear perturbative term to the network's\noutput breaks a degeneracy among different filter solutions. In each case the\ninput space is a one-dimensional 'retina' with wraparound.\n\n4.1 BREAKING THE PHASE DEGENERACY\n\nIn this example (see Figure 1) there is one filter type at each output site. We consider\ntwo types of input ensembles: (1) Each input vector (Fig. 1a shows one example)\nis drawn from a multivariate Gaussian distribution (so there is no higher-order\nstatistical structure beyond pairwise correlations). The input covariance matrix\n$Q(j - i)$ is a Gaussian function of the distance between the sites. (2) Each input\nvector is a random sum of Gaussian 'bumps': $S_i = \\sum_j a_j [s(i - j) - s_0]$, where $s(i - j)$\nis a Gaussian (shown in Fig. 1b for $j = 20$; there are 64 nodes in all); $s_0$ is the mean\nvalue of $s(i - j)$; and each $a_j$ is independently and randomly chosen (with constant\nprobability) to be 1 or 0. This ensemble does have higher-order structure, with\neach input presentation being characterized by the presence of localized features\n(the bumps) at particular locations.\n\nThe infomax solution for $|c(k)|^2$ is plotted versus spatial frequency $k$ in Fig. 1c\nfor a particular choice of noise parameters $(\\eta, \\beta)$. As stated earlier, MI' for a linear\nnetwork is indifferent to the phases of the Fourier components $\\{c(k)\\}$. A particular\nrandom choice of phases produces the real-space filter $C(i - n)$ shown in Fig. 1d,\nwhich spans the entire 'retina.' Setting all phases to zero produces the localized\nfilter shown in Fig. 1f. If the Gaussian 'bump' of Fig. 1b is presented as input to a\nnetwork of filters each of which is a shifted version of Fig. 
1d, the linear response of\nthe network (i.e., the convolution of the 'bump' with the filter) is shown in Fig. 1e.\nReplacing the filter of Fig. 1d by that of Fig. 1f, but keeping the input the same,\nproduces the output response shown in Fig. 1g.\n\nThe cubic nonlinearity causes MI' to be larger for the filter of Fig. 1f than for that of\nFig. 1d. Heuristically, if we focus on the diagonal elements of the output covariance\n$Q'$, the nonlinear term is $2 \\epsilon \\langle U_n^4 \\rangle$. Maximizing MI' favors increasing this term (subject\nto a constraint on output variance) and hence favors filter solutions for which the\n$U_n$ distribution is non-Gaussian with a preponderance of large values. Projection\npursuit methods also use a measure of the non-Gaussianity of the output distribution\nto construct filters that extract 'interesting' features from high-dimensional\ndata (cf. Intrator 1992).\n\n4.2 BREAKING THE PARTITIONING DEGENERACY FOR MULTIPLE FILTER TYPES\n\nIn this example (see Fig. 2), the input ensemble comprises a set of self-similar\npatterns (each is a sine-Gabor 'ripple' as in Fig. 2a) that are related by translation\nand dilation (scale change over a factor of 80). Figure 2b shows the input power\nspectrum vs. $k$; the scaling region goes as $1/k$. Figure 2c shows the infomax solution\nfor the gain $|c(k)|$ vs. $k$ when there is just one filter type. When the input SNR\nis large (as in the scaling region) the infomax filters 'whiten' the output; note the\nflat portion of the output power spectrum (Fig. 2d). [We modify the infomax\nsolution by extending the power-law form of $|c(k)|$ to low $k$ (dotted line in Figs.\n2c,d). 
This avoids artifacts resulting from the rapid increase in $|c(k)|$, which is\nin turn caused by our having omitted low-$k$ patterns from the input ensemble for\nreasons of numerical efficiency.] The dotted envelope curve in Figure 2e shows the\nsum-squared gain $\\sum_a p(a) |c(k; a)|^2$ when multiple filter types $a$ are allowed. The\nquantity plotted is just the square of that shown in Fig. 2c, but on a linear rather\nthan log-log plot (note values greater than 5 are cut off to save space).\n\nThe network nonlinearity has the following effect. We first allow two filter types\nto share the overall gain. Optimizing MI' over various partitionings, we find that\ninfomax favors a crossover between filter types at $k \\approx 400$. Allowing three, then four,\nfilter types produces additional crossovers at lower $k$. For an Ansatz in which each\nfilter's share of the sum-squared gain is tapered linearly near its cutoff frequencies,\nthe best solution found for each $p(a) |c(k; a)|^2$ is shown in Fig. 2e (semilog plot vs.\n$k$). Figure 2f plots the corresponding $|c(k; a)|$ vs. $k$ on a linear scale. Note that the\nthree lower-$k$ filters appear roughly self-similar. (The peak in the highest-$k$ filter is\nan artifact due to the cutoff of the input ensemble at high $k$.) The four real-space\nfilters $C(i - n; a)$ are plotted vs. $(i - n)$ in Fig. 2g [phases chosen to make $C(i - n; a)$\nantisymmetric].\n\nThe resulting filters span multiple resolution scales. The density $p(a)$ is less for the\nlower-frequency filters (spatial subsampling). When more filter types are allowed,\nthe increase in MI' becomes progressively less. 
Although in our model the filters are\npresent with density $p$ at each output site, a similar MI' is obtained if one spaces\nadjacent filters of type $a$ by a distance $\\propto 1/p(a)$. The resulting arrangement of\nfilters resembles the 'tiling' of the joint space and spatial-frequency domain that is\nused in wavelet and 'pyramid coding' approaches to image processing. [The infomax\nfilters overlap, rather than disjointly tiling the $(x, k)$ domain.]\n\nUsing the infomax method, the region of $(x, k)$ space spanned by an optimal filter\nhas an aspect ratio that depends upon the relative distances (along the $x$ and $k$ axes)\nover which the input feature is 'coherent' (possesses higher-order correlations).\nOne may thus be able to use infomax to predict relationships between statistical\nmeasures of coherence in natural scenes and observed $(x, k)$ aspect ratios for, e.g.,\norientation-selective cells. See (Field 1989) for a discussion of this issue that is not\nbased on infomax.\n\n5 APPENDIX: HEURISTIC JUSTIFICATION FOR USING A SURROGATE, 'AS-IF-GAUSSIAN,' MUTUAL INFORMATION\n\nThe mutual information between input $S$ and output $Z$ is\n$\\mathrm{MI} = \\int dS dZ P_{SZ} \\ln (P_{SZ} / P_S P_Z) = \\int dS P_S \\mathrm{KD}(P_{Z|S}; P_Z)$, where\n$\\mathrm{KD}(P_{Z|S}; P_Z) = \\int dZ P_{Z|S} \\ln (P_{Z|S} / P_Z)$ is a Kullback divergence. So, maximizing MI means\nmaximizing the average (over $S$) of $\\mathrm{KD}(P_{Z|S}; P_Z)$.\n\nWhat does the KD represent? Suppose that the network has somehow learned the\ndistribution $P_Z$. Before being presented with a particular input $S$, the network\n'expects' an output vector drawn from $P_Z$. The actual output response to $S$, however,\nis a vector drawn from $P_{Z|S}$. The KD measures the 'surprise' (i.e., the amount of\ninformation gained) upon seeing the actual distribution $P_{Z|S}$ when one expected\n$P_Z$. Infomax maximizes this average 'surprise.' 
\n\nHowever, the network cannot in general have access to the full distribution $P_Z$,\nwhich contains far too much information (including all higher-order statistics) to\nbe stored in the connections and nodes of the network. Let us suppose for definiteness\nthat the system remembers only the mean and the covariance matrix of\n$Z$. Define $P_Z^G$ to be the multivariate Gaussian distribution that has the same mean\nand covariance as $P_Z$. Then we may think of the system as a priori 'expecting' the\noutput vector to be drawn from the distribution $P_Z^G$.\n\nFigure 2: Partitioning among multiple filter types. See text.\n\nWe accordingly modify the principle so that we maximize the average\n(over $S$) of $\\mathrm{KD}(P_{Z|S}; P_Z^G)$ (note the superscript $G$). This equals\n$\\int dS P_S \\int dZ P_{Z|S} \\ln (P_{Z|S} / P_Z^G) = \\langle -H_{Z|S} \\rangle_S - \\int dZ P_Z \\ln P_Z^G$ (where $H$ denotes entropy).\nUsing a property of the Gaussian distribution, we have\n$-\\int dZ P_Z \\ln P_Z^G = -\\int dZ P_Z^G \\ln P_Z^G = H_Z^G$. We conclude that the average of KD equals $H_Z^G - \\langle H_{Z|S} \\rangle_S$,\nwhich is exactly equal to the surrogate 'as-if-Gaussian' MI defined preceding Eq. 1.\nThis argument provides a principled justification for using the surrogate MI, when\nthe system has stored information about the output vectors' mean and covariance,\nbut not about higher-order statistics.\n\nReferences\n\nJ. J. Atick & A. N. Redlich. (1990a) Towards a theory of early visual processing.\nNeural Computation 2:308-320.\n\nJ. J. Atick & A. N. Redlich. (1990b) Quantitative tests of a theory of retinal\nprocessing: contrast sensitivity curves. Inst. Adv. Study IASSNS-HEP-90/51.\n\nD. J. Field. 
(1989) What the statistics of natural images tell us about visual coding.\nIn Proc. SPIE 1077:269-276.\n\nN. Intrator. (1992) Feature extraction using an unsupervised neural network. Neural\nComputation 4:98-107.\n\nR. Linsker. (1988) Self-organization in a perceptual network. Computer 21(3):105-117.\n\nR. Linsker. (1989) An application of the principle of maximum information preservation\nto linear systems. In D. S. Touretzky (ed.), Advances in Neural Information\nProcessing Systems 1, 186-194. San Mateo, CA: Morgan Kaufmann.\n\nR. Linsker. (1992) Local synaptic learning rules suffice to maximize mutual information\nin a linear network. Neural Computation 4(5):691-702.\n", "award": [], "sourceid": 667, "authors": [{"given_name": "Ralph", "family_name": "Linsker", "institution": null}]}