{"title": "Learning Continuous Attractors in Recurrent Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 654, "page_last": 660, "abstract": "", "full_text": "Learning Continuous Attractors in \n\nRecurrent Networks \n\nH. Sebastian Seung \n\nBell Labs, Lucent Technologies \n\nMurray Hill, NJ 07974 \nseung~bell-labs.com \n\nAbstract \n\nOne approach to invariant object recognition employs a recurrent neu(cid:173)\nral network as an associative memory. In the standard depiction of the \nnetwork's state space, memories of objects are stored as attractive fixed \npoints of the dynamics. I argue for a modification of this picture: if an \nobject has a continuous family of instantiations, it should be represented \nby a continuous attractor. This idea is illustrated with a network that \nlearns to complete patterns. To perform the task of filling in missing in(cid:173)\nformation, the network develops a continuous attractor that models the \nmanifold from which the patterns are drawn. From a statistical view(cid:173)\npoint, the pattern completion task allows a formulation of unsupervised \nlearning in terms of regression rather than density estimation. \n\nA classic approach to invariant object recognition is to use a recurrent neural net(cid:173)\nwork as an associative memory[l]. In spite of the intuitive appeal and biological \nplausibility of this approach, it has largely been abandoned in practical applications. \nThis paper introduces two new concepts that could help resurrect it: object repre(cid:173)\nsentation by continuous attractors, and learning attractors by pattern completion. \nIn most models of associative memory, memories are stored as attractive fixed points \nat discrete locations in state space[l]. Discrete attractors may not be appropriate for \npatterns with continuous variability, like the images of a three-dimensional object \nfrom different viewpoints. 
When the instantiations of an object lie on a continuous pattern manifold, it is more appropriate to represent objects by attractive manifolds of fixed points, or continuous attractors. \nTo make this idea practical, it is important to find methods for learning attractors from examples. A naive method is to train the network to retain examples in short-term memory. This method is deficient because it does not prevent the network from storing spurious fixed points that are unrelated to the examples. A superior method is to train the network to restore examples that have been corrupted, so that it learns to complete patterns by filling in missing information. \n\nFigure 1: Representing objects by dynamical attractors. (a) Discrete attractors. (b) Continuous attractors. \n\nLearning by pattern completion can be understood from both dynamical and statistical perspectives. Since the completion task requires a large basin of attraction around each memory, spurious fixed points are suppressed. The completion task also leads to a formulation of unsupervised learning as the regression problem of estimating functional dependences between variables in the sensory input. \nDensity estimation, rather than regression, is the dominant formulation of unsupervised learning in stochastic neural networks like the Boltzmann machine[2]. Density estimation has the virtue of suppressing spurious fixed points automatically, but it also has the serious drawback of being intractable for many network architectures. Regression is a more tractable, but nonetheless powerful, alternative to density estimation. \nIn a number of recent neurobiological models, continuous attractors have been used to represent continuous quantities like eye position[3], direction of reaching[4], head direction[5], and orientation of a visual stimulus[6]. 
Along with these models, the present work is part of a new paradigm for neural computation based on continuous attractors. \n\n1 DISCRETE VERSUS CONTINUOUS ATTRACTORS \n\nFigure 1 depicts two ways of representing objects as attractors of a recurrent neural network dynamics. The standard way is to represent each object by an attractive fixed point[1], as in Figure 1a. Recall of a memory is triggered by a sensory input, which sets the initial conditions. The network dynamics converges to a fixed point, thus retrieving a memory. If different instantiations of one object lie in the same basin of attraction, they all trigger retrieval of the same memory, resulting in the many-to-one map required for invariant recognition. \nIn Figure 1b, each object is represented by a continuous manifold of fixed points. A one-dimensional manifold is shown, but generally the attractor should be multidimensional, and is parametrized by the instantiation or pose parameters of the object. For example, in visual object recognition, the coordinates would include the viewpoint from which the object is seen. \nThe reader should be cautioned that the term \"continuous attractor\" is an idealization and should not be taken too literally. In real networks, a continuous attractor is only approximated by a manifold in state space along which drift is very slow. This is illustrated by a simple example, a descent dynamics on a trough-shaped energy landscape[3]. If the bottom of the trough is perfectly level, it is a line of fixed points and an ideal continuous attractor of the dynamics. However, any slight imperfections cause slow drift along the line. This sort of approximate continuous attractor is what is found in real networks, including those trained by the learning algorithms to be discussed below. \n\nFigure 2: (a) Recurrent network. (b) Feedforward autoencoder. \n\n2 DYNAMICS OF MEMORY RETRIEVAL \n\nThe preceding discussion has motivated the idea of representing pattern manifolds by continuous attractors. This idea will be further developed with the simple network shown in Figure 2a, which consists of a visible layer x1 in R^n1 and a hidden layer x2 in R^n2. The architecture is recurrent, containing both bottom-up connections (the n2 x n1 matrix W21) and top-down connections (the n1 x n2 matrix W12). The vectors b1 and b2 represent the biases of the neurons. The neurons have a rectification nonlinearity [x]+ = max{x, 0}, which acts on vectors component by component. \nThere are many variants of recurrent network dynamics; a convenient choice is the following discrete-time version, in which updates of the hidden and visible layers alternate in time. After the visible layer is initialized with the input vector x1(0), the dynamics evolves as \n\nx2(t) = [b2 + W21 x1(t-1)]+ , \nx1(t) = [b1 + W12 x2(t)]+ .  (1) \n\nIf memories are stored as attractors, iteration of this dynamics can be regarded as memory retrieval. \nActivity circulates around the feedback loop between the two layers. One iteration of this loop is the map x1(t-1) -> x2(t) -> x1(t). This single iteration is equivalent to the feedforward architecture of Figure 2b. In the case where the hidden layer is smaller than the visible layer, this architecture is known as an autoencoder network[7]. Therefore the recurrent network dynamics (1) is equivalent to repeated iterations of the feedforward autoencoder. This is just the standard trick of unfolding the dynamics of a recurrent network in time, to yield an equivalent feedforward network with many layers[7]. 
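The alternating update (1) is simple to state concretely. The following numpy sketch implements the retrieval dynamics; the function names and the toy weight values are illustrative assumptions, not the trained network of the experiments below. \n\n```python
import numpy as np

def relu(x):
    # rectification nonlinearity [x]+ = max{x, 0}, applied componentwise
    return np.maximum(x, 0.0)

def retrieve(x1_init, W21, W12, b1, b2, T):
    # Iterate the alternating dynamics (1) for T time steps.
    # W21 is the n2 x n1 bottom-up matrix, W12 the n1 x n2 top-down matrix.
    x1 = x1_init
    for _ in range(T):
        x2 = relu(b2 + W21 @ x1)  # hidden layer update
        x1 = relu(b1 + W12 @ x2)  # visible layer update
    return x1, x2

# toy usage with small random weights (illustrative only)
rng = np.random.default_rng(0)
n1, n2 = 8, 4
W21 = 0.1 * rng.standard_normal((n2, n1))
W12 = 0.1 * rng.standard_normal((n1, n2))
b1, b2 = np.zeros(n1), np.zeros(n2)
x1, x2 = retrieve(rng.random(n1), W21, W12, b1, b2, T=2)
```
\nNote that because of the rectification, both layer states remain componentwise nonnegative throughout the iteration. 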
Because of the close relationship between the recurrent network of Figure 2a and the autoencoder of Figure 2b, it should not be surprising that learning algorithms for these two networks are also related, as will be explained below. \n\n3 LEARNING TO RETAIN PATTERNS \n\nLittle trace of an arbitrary input vector x1(0) remains after a few time steps of the dynamics (1). However, the network can retain some input vectors in short-term memory as \"reverberating\" patterns of activity. These correspond to fixed points of the dynamics (1); they are patterns that do not change as activity circulates around the feedback loop. \nThis suggests a formulation of learning as the optimization of the network's ability to retain examples in short-term memory. Then a suitable cost function is the squared difference |x1(T) - x1(0)|^2 between the example pattern x1(0) and the network's short-term memory x1(T) of it after T time steps. Gradient descent on this cost function can be done via backpropagation through time[7]. \nIf the network is trained with patterns drawn from a continuous family, then it can learn to perform the short-term memory task by developing a continuous attractor that lies near the examples it is trained on. When the hidden layer is smaller than the visible layer, the dimensionality of the attractor is limited by the size of the hidden layer. \nFor the case of a single time step (T = 1), training the recurrent network of Figure 2a to retain patterns is equivalent to training the autoencoder of Figure 2b by minimizing the squared difference between its input and output layers, averaged over the examples[8]. From the information-theoretic perspective, the small hidden layer in Figure 2b acts as a bottleneck between the input and output layers, forcing the autoencoder to learn an efficient encoding of the input. 
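As a concrete illustration of the retention cost, the numpy sketch below computes the mean squared difference between each example and its one-step short-term memory, which for T = 1 coincides with the reconstruction error of the autoencoder in Figure 2b. The helper names are assumptions for illustration; the paper minimizes this cost by backpropagation through time rather than by direct evaluation. \n\n```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def retention_cost(examples, W21, W12, b1, b2):
    # Mean squared retention error |x1(1) - x1(0)|^2 over the examples.
    # For a single time step (T = 1) this is exactly the reconstruction
    # error of the feedforward autoencoder in Figure 2b.
    total = 0.0
    for x0 in examples:
        x2 = relu(b2 + W21 @ x0)  # encode: visible -> hidden
        x1 = relu(b1 + W12 @ x2)  # decode: hidden -> visible
        total += np.sum((x1 - x0) ** 2)
    return total / len(examples)

# With equal-sized layers, identity weights, and nonnegative examples,
# the cost vanishes: this is the trivial solution discussed in Section 4.
n = 6
rng = np.random.default_rng(1)
examples = [rng.random(n) for _ in range(3)]
cost = retention_cost(examples, np.eye(n), np.eye(n), np.zeros(n), np.zeros(n))
```
\nThe identity-weight check in the usage line makes explicit why a bottleneck is needed: without one, zero cost is achieved by a network that has learned nothing. 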
\nFor the special case of a linear network, the nature of the learned encoding is understood completely. Then the input and output vectors are related by a simple matrix multiplication. The rank of the matrix is equal to the number of hidden units. The average distortion is minimized when this matrix becomes a projection operator onto the subspace spanned by the principal components of the examples[9]. From the dynamical perspective, the principal subspace is a continuous attractor of the dynamics (1). The linear network dynamics converges to this attractor in a single iteration, starting from any initial condition. Therefore we can interpret principal component analysis and its variants as methods of learning continuous attractors[10]. \n\n4 LEARNING TO COMPLETE PATTERNS \n\nLearning to retain patterns in short-term memory only works properly for architectures with a small hidden layer. The problem with a large hidden layer is evident when the hidden and visible layers are the same size, and the neurons are linear. Then the cost function for learning can be minimized by setting the weight matrices equal to the identity, W21 = W12 = I. For this trivial minimum, every input vector is a fixed point of the recurrent network (Figure 2a), and the equivalent feedforward network (Figure 2b) exactly realizes the identity map. Clearly these networks have not learned anything. \nTherefore in the case of a large hidden layer, learning to retain patterns is inadequate. Without the bottleneck in the architecture, there is no pressure on the feedforward network to learn an efficient encoding. Without constraints on the dimension of the attractor, the recurrent network develops spurious fixed points that have nothing to do with the examples. \nThese problems can be solved by a different formulation of learning based on the task of pattern completion. 
In the completion task of Figure 3a, the network is initialized with a corrupted version of an example. Learning is done by minimizing the completion error, which is the squared difference |x1(T) - d|^2 between the uncorrupted pattern d and the final visible vector x1(T). Gradient descent on completion error can be done with backpropagation through time[11]. \nThis new formulation of learning eliminates the trivial identity map solution mentioned above: while the identity network can retain any example, it cannot restore corrupted examples to their pristine form. The completion task forces the network to enlarge the basins of attraction of the stored memories, which suppresses spurious fixed points. It also forces the network to learn associations between variables in the sensory input. \n\nFigure 3: (a) Pattern retention versus completion. (b) Dynamics of pattern completion. \n\nFigure 4: (a) Locally connected architecture. (b) Receptive fields of hidden neurons. \n\n5 LOCALLY CONNECTED ARCHITECTURE \n\nExperiments were conducted with images of handwritten digits from the USPS database described in [12]. The example images were 16 x 16, with a gray scale ranging from 0 to 1. The network was trained on a specific digit class, with the goal of learning a single pattern manifold. Both the network architecture and the nature of the completion task were chosen to suit the topographic structure present in visual images. \nThe network architecture was given a topographic organization by constraining the synaptic connectivity to be local, as shown in Figure 4a. Both the visible and hidden layers of the network were 16 x 16. 
The visible layer represented an image, while the hidden layer was a topographic feature map. Each neuron had 5 x 5 receptive and projective fields, except for neurons near the edges, which had more restricted connectivity. \nIn the pattern completion task, example images were corrupted by zeroing the pixels inside a 9 x 9 patch chosen at a random location, as shown in Figure 3a. The location of the patch was randomized for each presentation of an example. The size of the patch was a substantial fraction of the 16 x 16 image, and much larger than the 5 x 5 receptive field size. This method of corrupting the examples gave the completion task a topographic nature, because it involved a set of spatially contiguous pixels. This topographic nature would have been lacking if the examples had been corrupted by, for example, the addition of spatially uncorrelated noise. \nFigure 3b illustrates the dynamics of pattern completion performed by a network trained on examples of the digit class \"two.\" The network is initialized with a corrupted example of a \"two.\" After the first iteration of the dynamics, the image is partially restored. The second iteration leads to superior restoration, with further sharpening of the image. The \"filling in\" phenomenon is also evident in the hidden layer. \nThe network was first trained on a retrieval dynamics of one iteration. The resulting biases and synaptic weights were then used as initial conditions for training on a retrieval dynamics of two iterations. The hidden layer developed into a topographic feature map suitable for representing images of the digit \"two.\" Figure 4b depicts the bottom-up receptive fields of the 256 hidden neurons. The top-down projective fields of these neurons were similar, but are not shown. 
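The corruption procedure described above can be sketched directly. In the numpy fragment below, the function names are illustrative; only the 16 x 16 image size and the 9 x 9 zeroed patch follow the text. \n\n```python
import numpy as np

def corrupt(image, patch=9, rng=None):
    # Zero the pixels inside a patch x patch block at a random location,
    # as in the completion task of Figure 3a.
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape
    r = rng.integers(0, h - patch + 1)  # top-left corner, kept in bounds
    c = rng.integers(0, w - patch + 1)
    out = image.copy()
    out[r:r + patch, c:c + patch] = 0.0
    return out

def completion_error(restored, uncorrupted):
    # squared difference |x1(T) - d|^2 used as the learning cost
    return np.sum((restored - uncorrupted) ** 2)

rng = np.random.default_rng(0)
img = rng.random((16, 16))       # stands in for a 16 x 16 digit image
corrupted = corrupt(img, rng=rng)
```
\nBecause the patch location is resampled on each call, every presentation of an example removes a different spatially contiguous block, which is what gives the task its topographic character. 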
\nThis feature map is distinct from others[13] because of its use of top-down and bottom-up connections in a feedback loop. The bottom-up connections analyze images into their constituent features, while the top-down connections synthesize images by composing features. The features in the top-down connections can be regarded as a \"vocabulary\" for synthesis of images. Since not all combinations of features are proper patterns, there must be some \"grammatical\" constraints on their combination. The network's ability to complete patterns suggests that some of these constraints are embedded in the dynamical equations of the network. Therefore the relaxation dynamics (1) can be regarded as a process of massively parallel constraint satisfaction. \n\n6 CONCLUSION \n\nI have argued that continuous attractors are a natural representation for pattern manifolds. One method of learning attractors is to train the network to retain examples in short-term memory. This method is equivalent to autoencoder learning, and does not work if the number of hidden units is large. A better method is to train the network to complete patterns. For a locally connected network, this method was demonstrated to learn a topographic feature map. The trained network is able to complete patterns, indicating that syntactic constraints on the combination of features are embedded in the network dynamics. \nEmpirical evidence that the network has indeed learned a continuous attractor is obtained by local linearization of the network dynamics (1). The linearized dynamics has many eigenvalues close to unity, indicating the existence of an approximate continuous attractor. Learning with an increased number of iterations in the retrieval dynamics should improve the quality of the approximation. \nThere is only one aspect of the learning algorithm that is specifically tailored for continuous attractors. 
This aspect is the limitation of the retrieval dynamics (1) to a few iterations, rather than iterating it all the way to a true fixed point. As mentioned earlier, a continuous attractor is only an idealization; in a real network it does not consist of true fixed points, but is just a manifold to which relaxation is fast and along which drift is slow. Adjusting the shape of this manifold is the goal of learning; the exact locations of the true fixed points are not relevant. \nThe use of a fast retrieval dynamics removes one long-standing objection to attractor neural networks, which is that true convergence to a fixed point takes too long. If all that is desired is fast relaxation to an approximate continuous attractor, attractor neural networks are not much slower than feedforward networks. \nIn the experiments discussed here, learning was done with backpropagation through time. Contrastive Hebbian learning[14] is a simpler alternative. Part of the image is held clamped, the missing values are filled in by convergence to a fixed point, and an anti-Hebbian update is made. Then the missing values are clamped at their correct values, the network converges to a new fixed point, and a Hebbian update is made. This procedure has the disadvantage of requiring true convergence to a fixed point, which can take many iterations. It also requires symmetric connections, which may be a representational handicap. \nThis paper addressed only the learning of a single attractor to represent a single pattern manifold. The problem of learning multiple attractors to represent multiple pattern classes will be discussed elsewhere, along with the extension to network architectures with many layers. \n\nAcknowledgments This work was supported by Bell Laboratories. I thank J. J. Hopfield, D. D. Lee, L. K. Saul, N. D. Socci, H. Sompolinsky, and D. W. Tank for helpful discussions. \n\nReferences \n[1] J. J. Hopfield. 
Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79:2554-2558, 1982. \n[2] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9:147-169, 1985. \n[3] H. S. Seung. How the brain keeps the eyes still. Proc. Natl. Acad. Sci. USA, 93:13339-13344, 1996. \n[4] A. P. Georgopoulos, M. Taira, and A. Lukashin. Cognitive neurophysiology of the motor cortex. Science, 260:47-52, 1993. \n[5] K. Zhang. Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: a theory. J. Neurosci., 16:2112-2126, 1996. \n[6] R. Ben-Yishai, R. L. Bar-Or, and H. Sompolinsky. Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. USA, 92:3844-3848, 1995. \n[7] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1, chapter 8, pages 318-362. MIT Press, Cambridge, 1986. \n[8] G. W. Cottrell, P. Munro, and D. Zipser. Image compression by back propagation: an example of extensional programming. In N. E. Sharkey, editor, Models of cognition: a review of cognitive science. Ablex, Norwood, NJ, 1989. \n[9] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2:53-58, 1989. \n[10] H. S. Seung. Pattern analysis and synthesis in attractor neural networks. In K.-Y. M. Wong, I. King, and D.-Y. Yeung, editors, Theoretical Aspects of Neural Computation: A Multidisciplinary Perspective, Singapore, 1997. Springer-Verlag. \n[11] F.-S. Tsung and G. W. Cottrell. Phase-space learning. Adv. Neural Info. Proc. Syst., 7:481-488, 1995. \n[12] Y. LeCun et al. 
Learning algorithms for classification: a comparison on handwritten digit recognition. In J.-H. Oh, C. Kwon, and S. Cho, editors, Neural networks: the statistical mechanics perspective, pages 261-276, Singapore, 1995. World Scientific. \n[13] T. Kohonen. The self-organizing map. Proc. IEEE, 78:1464-1480, 1990. \n[14] J. J. Hopfield, D. I. Feinstein, and R. G. Palmer. \"Unlearning\" has a stabilizing effect in collective memories. Nature, 304:158-159, 1983. \n", "award": [], "sourceid": 1369, "authors": [{"given_name": "H. Sebastian", "family_name": "Seung", "institution": null}]}