{"title": "Towards an Organizing Principle for a Layered Perceptual Network", "book": "Neural Information Processing Systems", "page_first": 485, "page_last": 494, "abstract": null, "full_text": "TOWARDS AN ORGANIZING PRINCIPLE FOR A LAYERED PERCEPTUAL NETWORK\n\nRalph Linsker\n\nIBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598\n\nAbstract\n\nAn information-theoretic optimization principle is proposed for the development of each processing stage of a multilayered perceptual network. This principle of \"maximum information preservation\" states that the signal transformation that is to be realized at each stage is one that maximizes the information that the output signal values (from that stage) convey about the input signal values (to that stage), subject to certain constraints and in the presence of processing noise. The quantity being maximized is a Shannon information rate. I provide motivation for this principle and -- for some simple model cases -- derive some of its consequences, discuss an algorithmic implementation, and show how the principle may lead to biologically relevant neural architectural features such as topographic maps, map distortions, orientation selectivity, and extraction of spatial and temporal signal correlations. A possible connection between this information-theoretic principle and a principle of minimum entropy production in nonequilibrium thermodynamics is suggested.\n\nIntroduction\n\nThis paper describes some properties of a proposed information-theoretic organizing principle for the development of a layered perceptual network. The purpose of this paper is to provide an intuitive and qualitative understanding of how the principle leads to specific feature-analyzing properties and signal transformations in some simple model cases. 
More detailed analysis is required in order to apply the principle to cases involving more realistic patterns of signaling activity as well as specific constraints on network connectivity.\n\nThis section gives a brief summary of the results that motivated the formulation of the organizing principle, which I call the principle of \"maximum information preservation.\" In later sections the principle is stated and its consequences studied.\n\nIn previous work [1] I analyzed the development of a layered network of model cells with feedforward connections whose strengths change in accordance with a Hebb-type synaptic modification rule. I found that this development process can produce cells that are selectively responsive to certain input features, and that these feature-analyzing properties become progressively more sophisticated as one proceeds to deeper cell layers. These properties include the analysis of contrast and of edge orientation, and are qualitatively similar to properties observed in the first several layers of the mammalian visual pathway [2].\n\nWhy does this happen? Does a Hebb-type algorithm (which adjusts synaptic strengths depending upon correlations among signaling activities [3]) cause a developing perceptual network to optimize some property that is deeply connected with the mature network's functioning as an information processing system?\n\n\u00a9 American Institute of Physics 1988\n\nFurther analysis [4,5] has shown that a suitable Hebb-type rule causes a linear-response cell in a layered feedforward network (without lateral connections) to develop so that the statistical variance of its output activity (in response to an ensemble of inputs from the previous layer) is maximized, subject to certain constraints. 
The mature cell thus performs an operation similar to principal component analysis (PCA), an approach used in statistics to expose regularities (e.g., clustering) present in high-dimensional input data. (Oja [6] had earlier demonstrated a particular form of Hebb-type rule that produces a model cell that implements PCA exactly.)\n\nFurthermore, given a linear device that transforms inputs into an output, and given any particular output value, one can use optimal estimation theory to make a \"best estimate\" of the input values that gave rise to that output. Of all such devices, I have found that an appropriate Hebb-type rule generates that device for which this \"best estimate\" comes closest to matching the input values [4,5]. Under certain conditions, such a cell has the property that its output preserves the maximum amount of information about its input values [5].\n\nMaximum Information Preservation\n\nThe above results have suggested a possible organizing principle for the development of each layer of a multilayered perceptual network [5]. The principle can be applied even if the cells of the network respond to their inputs in a nonlinear fashion, and even if lateral as well as feedforward connections are present. (Feedback from later to earlier layers, however, is absent from this formulation.) 
This principle of \"maximum information preservation\" states that for a layer of cells L that is connected to and provides input to another layer M, the connections should develop so that the transformation of signals from L to M (in the presence of processing noise) has the property that the set of output values M conveys the maximum amount of information about the input values L, subject to various constraints on, e.g., the range of lateral connections and the processing power of each cell. The statistical properties of the ensemble of inputs L are assumed stationary, and the particular L-to-M transformation that achieves this maximization depends on those statistical properties. The quantity being maximized is a Shannon information rate [7].\n\nAn equivalent statement of this principle is: The L-to-M transformation is chosen so as to minimize the amount of information that would be conveyed by the input values L to someone who already knows the output values M.\n\nWe shall regard the set of input signal values L (at a given time) as an input \"message\"; the message is processed to give an output message M. Each message is in general a set of real-valued signal activities. Because noise is introduced during the processing, a given input message may generate any of a range of different output messages when processed by the same set of connections.\n\nThe Shannon information rate (i.e., the average information transmitted from L to M per message) is [7]\n\n$R = \\sum_L \\sum_M P(L,M) \\log \\frac{P(L,M)}{P(L) P(M)}$. (1)\n\nFor a discrete message space, P(L) [resp. P(M)] is the probability of the input (resp. output) message being L (resp. M), and P(L,M) is the joint probability of the input being L and the output being M. 
[For a continuous message space, probabilities are replaced by probability densities, and sums (over states) by integrals.] This rate can be written as\n\n$R = I_L - I_{L|M}$, (2)\n\nwhere\n\n$I_L \\equiv -\\sum_L P(L) \\log P(L)$ (3)\n\nis the average information conveyed by message L and\n\n$I_{L|M} \\equiv -\\sum_M P(M) \\sum_L P(L|M) \\log P(L|M)$ (4)\n\nis the average information conveyed by message L to someone who already knows M. Since $I_L$ is fixed by the properties of the input ensemble, maximizing R means minimizing $I_{L|M}$, as stated above.\n\nThe information rate R can also be written as\n\n$R = I_M - I_{M|L}$, (5)\n\nwhere $I_M$ and $I_{M|L}$ are defined by interchanging L and M in Eqns. 3 and 4. This form is heuristically useful, since it suggests that one can attempt to make R large by (if possible) simultaneously making $I_M$ large and $I_{M|L}$ small. The term $I_M$ is largest when each message M occurs with equal probability. The term $I_{M|L}$ is smallest when each L is transformed into a unique M, and more generally is made small by \"sharpening\" the P(M|L) distribution, so that for each L, P(M|L) is near zero except for a small set of messages M.\n\nHow can one gain insight into biologically relevant properties of the L \u2192 M transformation that may follow from the principle of maximum information preservation (which we also call the \"infomax\" principle)? In a network, this L \u2192 M transformation may be a function of the values of one or a few variables (such as a connection strength) for each of the allowed connections between and within layers, and for each cell. The search space is quite large, particularly from the standpoint of gaining an intuitive or qualitative understanding of network behavior. 
We shall therefore consider a simple model in which the dimensionalities of the L and M signal spaces are greatly reduced, yet one for which the infomax analysis exhibits features that may also be important under more general conditions relevant to biological and synthetic network development.\n\nThe next four sections are organized as follows. (i) A model is introduced in which the L and M messages, and the L-to-M transformation, have simple forms. The infomax principle is found to be satisfied when some simple geometric conditions (on the transformation) are met. (ii) I relate this model to the analysis of signal processing and noise in an interconnection network. The formation of topographic maps is discussed. (iii) The model is applied to simplified versions of biologically relevant problems, such as the emergence of orientation selectivity. (iv) I show that the main properties of the infomax principle for this model can be realized by certain local algorithms that have been proposed to generate topographic maps using lateral interactions.\n\nA Simple Geometric Model\n\nIn this model, each input message L is described by a point in a low-dimensional vector space, and the output message M is one of a number of discrete states. For definiteness, we will take the L space to be two-dimensional (the extension to higher dimensionality is straightforward). The L \u2192 M transformation consists of two steps. (i) A noise process alters L to a message L' lying within a neighborhood of radius v centered on L. (ii) The altered message L' is mapped deterministically onto one of the output messages M.\n\nA given L' \u2192 M mapping corresponds to a partitioning of the L space into regions labeled by the output states M. 
(We do not exclude a priori the possibility that multiple disjoint regions may be labeled by the same M.) Let A denote the total area of the L state space. For each M, let A(M) denote the area of L space that is labeled by M. Let s(M) denote the total border length that the region(s) labeled M share with regions of unlike M-label. A point L lying within distance v of a border can be mapped onto either M-value (because of the noise process L \u2192 L'). Call this a \"borderline\" L. A point L that is more than a distance v from every border can only be mapped onto the M-value of the region containing it.\n\nSuppose v is sufficiently small that (for the partitionings of interest) the area occupied by borderline L states is small compared to the total area of the L space. Consider first the case in which P(L) is uniform over L. Then the information rate R (using Eqn. 5) is given approximately (through terms of order v) by\n\n$R = -\\sum_M [A(M)/A] \\log [A(M)/A] - (\\gamma v/A) \\sum_M s(M)$. (6)\n\nTo see this, note that P(M) = A(M)/A and that P(M|L) log P(M|L) is zero except for borderline L (since 0 log 0 = 1 log 1 = 0). Here $\\gamma$ is a positive number whose value depends upon the details of the noise process, which determines P(M|L) for borderline L as a function of distance from the border.\n\nFor small v (low noise) the first term ($I_M$) on the RHS of Eqn. 6 dominates. It is maximized when the A(M) [and hence the P(M)] values are equal for all M. The second term (with its minus sign), which equals $(-I_{M|L})$, is maximized when the sum of the border lengths of all M regions is minimized. This corresponds to \"sharpening\" the P(M|L) distribution in our earlier, more general, discussion. 
This suggests that the infomax solution is obtained by partitioning the L space into M-regions (one for each M value) that are of substantially equal area, with each M-region tending to have near-minimum border length.\n\nAlthough this simple analysis applies to the low-noise case, it is plausible that even when v is comparable to the spatial scale of the M regions, infomax will favor making the M regions have approximately the same extent in all directions (rather than be elongated), in order to \"sharpen\" P(M|L) and reduce the probability of the noise process mapping L onto many different M states.\n\nWhat if P(L) is nonuniform? Then the same result (equal areas, minimum border) is obtained except that both the area and border-length elements must now be weighted by the local value of P(L). Therefore the infomax principle tends to produce maps in which greater representation in the output space is given to regions of the input signal space that are activated more frequently.\n\nTo see how lateral interactions within the M layer can affect these results, let us suppose that the L \u2192 M mapping has three, not two, process steps: L \u2192 L' \u2192 M' \u2192 M, where the first two steps are as above, and the third step changes the output M' into any of a number of states M (which by definition comprise the \"M-neighborhood\" of M'). We consider the case in which this M-neighborhood relation is symmetric.\n\nThis type of \"lateral interaction\" between M states causes the infomax principle to favor solutions for which M regions sharing a border in L space are M-neighbors in the sense defined. 
For a simple example in which each state M has n M-neighbors (including itself), and each M-neighbor has an equal chance of being the final state (given M), infomax tends to favor each M-neighborhood having similar extent in all directions (in L space).\n\nRelation Between the Geometric Model and Network Properties\n\nThe previous section dealt with certain classes of transformations from one message space to another, and made no specific reference to the implementation of these transformations by an interconnected network of processor cells. Here we show how some of the features discussed in the previous section are related to network properties.\n\nFor simplicity suppose that we have a two-dimensional layer of uniformly distributed cells, and that the signal activity of each cell at any given time is either 1 (active) or 0 (quiet). We need to specify the ensemble of input patterns. Let us first consider a simple case in which each pattern consists of a disk of activity of fixed radius, but arbitrary center position, against a quiet background. In this case the pattern is fully defined by specifying the coordinates of the disk center. In a two-dimensional L state space (previous section), each pattern would be represented by a point having those coordinates.\n\nNow suppose that each input pattern consists not of a sharply defined disk of activity, but of a \"fuzzy\" disk whose boundary (and center position) are not sharply defined. [Such a pattern could be generated by choosing (from a specified distribution) a position $x_c$ as the nominal disk center, then setting the activity of the cell at position x to 1 with a probability that decreases with distance $|x - x_c|$.
] Any such pattern can be described by giving the coordinates of the \"center of activity\" along with many other values describing (for example) various moments of the activity pattern relative to the center.\n\nFor the noise process L \u2192 L' we suppose that the activity of an L cell can be \"misread\" (by the cells of the M layer) with some probability. This set of distorted activity values is the \"message\" L'. We then suppose that the set of output activities M is a deterministic function of L'.\n\nWe have constructed a situation in which (for an appropriate choice of noise level) two of the dimensions of the L state space -- namely, those defined by the disk center coordinates -- have large variance compared to the variance induced by the noise process, while the other dimensions have variance comparable to that induced by noise. In other words, the center position of a pattern is changed only a small amount by the noise process (compared to the typical difference between the center positions of two patterns), whereas the values of the other attributes of an input pattern differ as much from their noise-altered values as two typical input patterns differ from each other. (Those attributes are \"lost in the noise.\")\n\nSince the distance between L states in our geometric model (previous section) corresponds to the likelihood of one L state being changed into the other by the noise process, we can heuristically regard the L state space (for the present example) as a \"slab\" that is elongated in two dimensions and very thin in all other dimensions. (In general this space could have a much more complicated topology, and the noise process which we here treat as defining a simple metric structure on the L state space need not do so. 
These complications are beyond the scope of the present discussion.)\n\nThis example, while simple, illustrates a feature that is key to understanding the operation of the infomax principle: The character of the ensemble statistics and of the noise process jointly determine which attributes of the input pattern are statistically most significant; that is, have largest variance relative to the variance induced by noise. We shall see that the infomax principle selects a number of these most significant attributes to be encoded by the L \u2192 M transformation.\n\nWe turn now to a description of the output state space M. We shall assume that this space is also of low dimensionality. For example, each M pattern may also be a disk of activity having a center defined within some tolerance. A discrete set of discriminable center-coordinate values can then be used as the M-region \"labels\" in our geometric model.\n\nRestricting the form of the output activity in this particular way restricts us to considering positional encodings L \u2192 M, rather than encodings that make use of the shape of the output pattern, its detailed activity values, etc. However, this restriction on the form of the output does not determine which features of the input patterns are to be encoded, nor whether or not a topographic (neighbor-preserving) mapping is to be used. These properties will be seen to emerge from the operation of the infomax principle.\n\nIn the previous section we saw that the infomax principle will tend to lead to a partitioning of the L space into M regions having equal areas [if P(L) is uniform in the coordinates of the L disk center] and minimum border length. 
For the present case this means that the M regions will tend to \"tile\" the two long dimensions of the L state space \"slab,\" and that a single M value will represent all points in L space that differ only in their low-variance coordinates. If P(L) is nonuniform, then the area of the M region at L will tend to be inversely proportional to P(L). Furthermore, if there are local lateral connections between M cells, then (depending upon the particular form of such interaction) M states corresponding to nearby localized regions of layer-M activity can be M-neighbors in the sense of the previous section. In this case the mapping from the two high-variance coordinates of L space to M space will tend to be topographic.\n\nExamples: Orientation Selectivity and Temporal Feature Maps\n\nThe simple example in the previous section illustrates how infomax can lead to topographic maps, and to map distortions [which provide greater M-space representation for regions of L having large P(L)]. Let us now consider a case in which information about input features is positionally encoded in the output layer as a result of the infomax principle.\n\nConsider a model case in which an ensemble of patterns is presented to the input layer L. Each pattern consists of a rectangular bar of activity (of fixed length and width) against a quiet background. The bar's center position and orientation are chosen for each pattern from uniform distributions over some spatial interval for the position, and over all orientation angles (i.e., from 0\u00b0 to 180\u00b0). The bar need not be sharply defined, but can be \"fuzzy\" in the sense described above. We assume, however, that all properties that distinguish different patterns of the ensemble -- except for center position and orientation -- are \"lost in the noise\" in the sense we discussed. 
\n\nTo simplify the representation of the solution, we further assume that only one coordinate is needed to describe the center position of the bar for the given ensemble. For example, the ensemble could consist of bar patterns all of which have the same y coordinate of center position, but differ in their x coordinate and in orientation $\\theta$.\n\nWe can then represent each input state by a point in a rectangle (the L state space defined in a previous section) whose abscissa is the center-position coordinate x and whose ordinate is the angle $\\theta$. The horizontal sides of this rectangle are identified with each other, since orientations of 0\u00b0 and 180\u00b0 are identical. (The interior of the rectangle can thus be thought of as the surface of a horizontal cylinder.)\n\nThe number $N_x$ of different x positions that are discriminable is given by the range of x values in the input ensemble divided by the tolerance with which x can be measured (given the noise process L \u2192 L'); similarly for $N_\\theta$. The relative lengths $\\Delta x$ and $\\Delta\\theta$ of the sides of the L state space rectangle are given by $\\Delta x / \\Delta\\theta = N_x / N_\\theta$. We discuss below the case in which $N_x \\gg N_\\theta$; if $N_\\theta$ were $\\gg N_x$ the roles of x and $\\theta$ in the resulting mappings would be reversed.\n\nThere is one complicating feature that should be noted, although in the interest of clarity we will not include it in the present analysis. Two horizontal bar patterns that are displaced by a horizontal distance that is small compared with the bar length are more likely to be rendered indiscriminable by the noise process than are two vertical bar patterns that are displaced by the same horizontal distance (which may be large compared with the bar's width). 
The Hamming distance, or number of binary activity values that need to be altered to change one such pattern into the other, is greater in the latter case than in the former. Therefore, the distance in L state space between the two states should be greater in the latter case. This leads to a \"warped\" rather than simple rectangular state space. We ignore this effect here, but it must be taken into account in a fuller treatment of the emergence of orientation selectivity.\n\nFigure 1. Orientation Selectivity in a Simple Model: As the input domain size (see text) is reduced [from (a) upper left, to (b) upper right, to (c) lower left figure], infomax favors the emergence of an orientation-selective L \u2192 M mapping. (d) Lower right figure shows a solution obtained by applying Kohonen's relaxation algorithm with 50 M-points (shown as dots) to this mapping problem. (Panel label: UNORIENTED RECEPTIVE FIELDS.)\n\nConsider now an L \u2192 M transformation that consists of the three-step process (discussed above) (i) noise-induced L \u2192 L'; (ii) deterministic L' \u2192 M'; (iii) lateral-interaction-induced M' \u2192 M. Step (ii) maps the two-dimensional L state space of points $(x, \\theta)$ onto a one-dimensional M state space. For the present discussion, we consider L' \u2192 M' maps satisfying the following Ansatz: Points corresponding to the M states are spaced uniformly, and in topographic order, along a helical line in L state space (which we recall is represented by the surface of a horizontal cylinder). The pitch of the helix (or the slope $d\\theta/dx$) remains to be determined by the infomax principle. Each M-neighborhood of M states (previous section) then corresponds to an interval on such a helix. 
A state L' is mapped onto a state in a particular M-neighborhood if L' is closer (in L space) to the corresponding interval of the helix than to any other portion of the helix. We call this set of L states (for an M-neighborhood centered on M) the \"input domain\" of M. It has rectangular shape and lies on the cylindrical surface of the L space.\n\nWe have seen (previous sections) that infomax tends to produce maps having (i) equal M-region areas, (ii) topographic organization, and (iii) an input domain (for each M-neighborhood) that has similar extent in all directions (in L space). Our choice of Ansatz enforces (i) and (ii) explicitly. Criterion (iii) is satisfied by choosing $d\\theta/dx$ such that the input domain is square (for a given M-neighborhood size).\n\nFigure 1a (having $d\\theta/dx = 0$) shows a map in which the output M encodes only information about bar center position x, and is independent of bar orientation $\\theta$. The size of the M-neighborhood is relatively large in this case. The input domain of the state M denoted by the 'x' is shown enclosed by dotted lines. (The particular $\\theta$ value at which we chose to draw the M line in Fig. 1a is irrelevant.) For this M-neighborhood size, the length of the border of the input domain is as small as it can be.\n\nAs the M-neighborhood size is reduced, the dotted lines move closer together. A vertically oblong input domain (which would result if we kept $d\\theta/dx = 0$) would not satisfy the infomax criterion. The helix for which the input domain is square (for this smaller choice of M-neighborhood size) is shown in Fig. 1b. The M states for this solution encode information about bar orientation as well as center position. 
If each M state corresponds to a localized output activity pattern centered at some position in a one-dimensional array of M cells, then this solution corresponds to orientation-selective cells organized in \"orientation columns\" (really \"orientation intervals\" in this one-dimensional model). A \"labeling\" of the linear array of cells according to whether their orientation preferences lie between 0 and 60, 60 and 120, or 120 and 180 degrees is indicated by the bold, light, and dotted line segments beneath the rectangle in Fig. 1b (and 1c).\n\nAs the M-neighborhood size is decreased still further, the mapping shown in Fig. 1c becomes favored over that of either Fig. 1a or 1b. The \"orientation columns\" shown in the lower portion of Fig. 1c are narrower than in Fig. 1b.\n\nA more detailed analysis of the information rate function for various mappings confirms the main features we have here obtained by a simple geometric argument.\n\nThe same type of analysis can be applied to different types of input pattern ensembles. To give just one other example, consider a network that receives an ensemble of simple patterns of acoustic input. Each such pattern consists of a tone of some frequency that is sensed by two \"ears\" with some interaural time delay. Suppose that the initial network layers organize the information from each ear (separately) into tonotopic maps, and that (by means of connections having a range of different time delays) the signals received by both ears over some time interval appear as patterns of cell activity at some intermediate layer L. We can then apply the infomax principle to the signal transformation from layer L to the next layer M. 
The L state space can (as before) be represented as a rectangle, whose axes are now frequency and interaural delay (rather than spatial position and bar orientation). Apart from certain differences (the density of L states may be nonuniform, and states at the top and bottom of the rectangle are no longer identical), the infomax analysis can be carried out as it was for the simplified case of orientation selectivity.\n\nLocal Algorithms\n\nThe information rate (Eqn. 1), which the infomax principle states is to be maximized subject to constraints (and possibly as part of an optimization function containing other cost terms not discussed here), has a very complicated mathematical form. How might this optimization process, or an approximation to it, be implemented by a network of cells and connections each of which has limited computational power? The geometric form in which we have cast the infomax principle for some very simple model cases suggests how this might be accomplished.\n\nAn algorithm due to Kohonen [8] demonstrates how topographic maps can emerge as a result of lateral interactions within the output layer. I applied this algorithm to a one-dimensional M layer and a two-dimensional L layer, using a Euclidean metric and imposing periodic boundary conditions on the short dimension of the L layer. A resulting map is shown in Fig. 1d. This map is very similar to those of Figs. 1b and 1c, except for one reversal of direction. The reversal is not surprising, since the algorithm involves only local moves (of the M-points) while the infomax principle calls for a globally optimal solution. 
\n\nMore generally, Kohonen's algorithm tends empirically [8] to produce maps having the property that if one constructs the Voronoi diagram corresponding to the positions of the M-points (that is, assigns each point L to an M region based on which M-point L is closest to), one obtains a set of M regions that tend to have areas inversely proportional to P(L), and neighborhoods (corresponding to our input domains) that tend to have similar extent in all directions rather than being elongated.\n\nThe Kohonen algorithm makes no reference to noise, to information content, or even to an optimization principle. Nevertheless, it appears to implement, at least in a qualitative way, the geometric conditions that infomax imposes in some simple cases. This suggests that local algorithms along similar lines may be capable of implementing the infomax principle in more general situations.\n\nOur geometric formulation of the infomax principle also suggests a connection with an algorithm proposed by von der Malsburg and Willshaw [9] to generate topographic maps. In their \"tea trade\" model, neighborhood relationships are postulated within the source and the target spaces, and the algorithm's operation leads to the establishment of a neighborhood-preserving mapping from source to target space. Such neighborhood relationships arise naturally in our analysis when the infomax principle is applied to our three-step L \u2192 L' \u2192 M' \u2192 M transformation. The noise process induces a neighborhood relation on the L space, and lateral connections in the M cell layer can induce a neighborhood relation on the M space. 
\n\nMore recently, Durbin and Willshaw [10] have devised an approach to solving certain\ngeometric optimization problems (such as the traveling salesman problem) by a gradient\ndescent method bearing some similarity to Kohonen's algorithm.\n\nThere is a complementary relationship between the infomax principle and a local\nalgorithm that may be found to implement it. On the one hand, the principle may\nexplain what the algorithm is \"for\" -- that is, how the algorithm may contribute to the\ngeneration of a useful perceptual system. This in turn can shed light on the system-level\nrole of lateral connections and synaptic modification mechanisms in biological networks.\nOn the other hand, the existence of such a local algorithm is important for demonstrating\nthat a network of relatively simple processors -- biological or synthetic -- can in fact find\nglobal near-maxima of the Shannon information rate.\n\nA Possible Connection Between Infomax and a Thermodynamic Principle\n\nThe principle of \"maximum preservation of information\" can be viewed\nequivalently as a principle of \"minimum dissipation of information.\" When the principle\nis satisfied, the loss of information from layer to layer is minimized, and the flow of\ninformation is in this sense as \"nearly reversible\" as the constraints allow. There is a\nresemblance between this principle and the principle of \"minimum entropy production\" [11]\nin nonequilibrium thermodynamics. It has been suggested by Prigogine and others\nthat the latter principle is important for understanding self-organization in complex\nsystems. There is also a resemblance, at the algorithmic level, between a Hebb-type\nmodification rule and the autocatalytic processes [12] considered in certain models of\nevolution and natural selection. 
This raises the possibility that the connection I have\ndrawn between synaptic modification rules and an information-theoretic optimization\nprinciple may be an example of a more general relationship that is important for the\nemergence of complex and apparently \"goal-oriented\" structures and behaviors from\nrelatively simple local interactions, in both neural and non-neural systems.\n\nReferences\n\n[1] R. Linsker, Proc. Natl. Acad. Sci. USA 83, 7508, 8390, 8779 (1986).\n[2] D. H. Hubel and T. N. Wiesel, Proc. Roy. Soc. London B198, 1 (1977).\n[3] D. O. Hebb, The Organization of Behavior (Wiley, N. Y., 1949).\n[4] R. Linsker, in: R. Cotterill (ed.), Computer Simulation in Brain Science (Copenhagen,\n20-22 August 1986; Cambridge Univ. Press, in press), p. 416.\n[5] R. Linsker, Computer (March 1988, in press).\n[6] E. Oja, J. Math. Biol. 15, 267 (1982).\n[7] C. E. Shannon, Bell Syst. Tech. J. 27, 623 (1948).\n[8] T. Kohonen, Self-Organization and Associative Memory (Springer-Verlag, N. Y., 1984).\n[9] C. von der Malsburg and D. J. Willshaw, Proc. Natl. Acad. Sci. USA 74, 5176 (1977).\n[10] R. Durbin and D. J. Willshaw, Nature 326, 689 (1987).\n[11] P. Glansdorff and I. Prigogine, Thermodynamic Theory of Structure, Stability, and\nFluctuations (Wiley-Interscience, N. Y., 1971).\n[12] M. Eigen and P. Schuster, Die Naturwissenschaften 64, 541 (1977).\n", "award": [], "sourceid": 5, "authors": [{"given_name": "Ralph", "family_name": "Linsker", "institution": null}]}