{"title": "Performance Measures for Associative Memories that Learn and Forget", "book": "Neural Information Processing Systems", "page_first": 432, "page_last": 441, "abstract": null, "full_text": "Performance Measures for Associative Memories that Learn and Forget\n\nAnthony Kuh\n\nDepartment of Electrical Engineering\n\nUniversity of Hawaii at Manoa\n\nHonolulu HI, 96822\n\nABSTRACT\n\nRecently, many modifications to the McCulloch/Pitts model have been proposed where both learning and forgetting occur. Given that the network never saturates (ceases to function effectively due to an overload of information), the learning updates can continue indefinitely. For these networks, we need to introduce performance measures in addition to the information capacity to evaluate the different networks. We mathematically define quantities such as the plasticity of a network, the efficacy of an information vector, and the probability of network saturation. From these quantities we analytically compare different networks.\n\n1. Introduction\n\nWork has recently been undertaken to quantitatively measure the computational aspects of network models that exhibit some of the attributes of neural networks. The McCulloch/Pitts model discussed in [1] was one of the earliest neural network models to be analyzed. Some computational properties of what we call a Hopfield Associative Memory Network (HAMN), similar to the McCulloch/Pitts model, were discussed by Hopfield in [2]. The HAMN can be measured quantitatively by defining and evaluating the information capacity as [2-6] have shown, but this network fails to exhibit the more complex computational capabilities that neural networks have, due to its simplified structure.  
The HAMN belongs to a class of networks which we call static. In static networks the learning and recall procedures are separate. The network first learns a set of data and after learning is complete, recall occurs. In dynamic networks, as opposed to static networks, updated learning and associative recall are intermingled and continual. In many applications, such as adaptive communications systems, image processing, and speech recognition, dynamic networks are needed to adaptively learn the changing information data. This paper formally develops and analyzes some dynamic models for neural networks. Some existing models [7-10] are analyzed, new models are developed, and measures are formulated for evaluating the performance of different dynamic networks.\n\nIn [2-6], the asymptotic information capacity of the HAMN is defined and evaluated. In [4-5], this capacity is found by first assuming that the information vectors (IVs) to be stored have components that are chosen randomly and independently of all other components in all IVs. The information capacity then gives the maximum number of IVs that can be stored in the HAMN such that IVs can be recovered with high probability during retrieval. At or below capacity, the network, with high probability, successfully recovers the desired IVs. Above capacity, the network quickly degrades and eventually fails to recover any of the desired IVs. This phenomenon is sometimes referred to as the \"forgetting catastrophe\" [10]. In this paper we will refer to this phenomenon as network saturation.\n\nThere are two ways to avoid this phenomenon. The first method involves learning a limited number of IVs such that this number is below capacity.  
After this learning takes place, no more learning is allowed. Once learning has stopped, the network does not change (defined as static) and therefore lacks many of the interesting computational capabilities that adaptive learning and neural network models have. The second method is to incorporate some type of forgetting mechanism in the learning structure so that the information stored in the network can never exceed capacity. This type of network would be able to adapt to the changing statistics of the IVs and the network would only be able to recall the most recently learned IVs. This paper focuses on analyzing dynamic networks that adaptively learn new information and do not exhibit network saturation phenomena by selectively forgetting old data. The emphasis is on developing simple models and much of the analysis is performed on a dynamic network that uses a modified Hebbian learning rule.\n\n\u00a9 American Institute of Physics 1988\n\nSection 2 introduces and qualitatively discusses a number of network models that are classified as dynamic networks. This section also defines some pertinent measures for evaluating dynamic network models. These measures include the plasticity of a network, the probability of network saturation, and the efficacy of stored IVs. A network with no plasticity cannot learn and a network with high plasticity has interconnection weights that exhibit large changes. The efficacy of a stored IV as a function of time is another important parameter as it is used in determining the rate at which a network forgets information. 
\n\nIn section 3, we mathematically analyze a simple dynamic network referred to as the Attenuated Linear Updated Learning (ALUL) network that uses linear updating and a modified Hebbian rule. Quantities introduced in section 2 are analytically determined for the ALUL network. By adjusting the attenuation parameter of the ALUL network, the forgetting factor is adjusted. It is shown that the optimal capacity for a large ALUL network in steady state defined by (2.13, 3.1) is a factor of e less than the capacity of a HAMN. This is the tradeoff that must be paid for having dynamic capabilities. We also conjecture that no other network can perform better than this network when a worst case criterion is used. Finally, section 4 discusses further directions for this work along with possible applications in adaptive signal processing.\n\n2. Dynamic Associative Memory Networks\n\nThe network models discussed in this paper are based on the concept of associative memory. Associative memories are composed of a collection of interconnected elements that have data storage capabilities. Like other memory structures, there are two operations that occur in associative memories. In the learning operation (referred to as a write operation for conventional memories), information is stored in the network structure. In the recall operation (referred to as a read operation for conventional memories), information is retrieved from the memory structure. Associative memories recall information on the basis of data content rather than by a specific address.  
The models that we consider will have learning and recall operations that are updated in discrete time with the activation state X(j) consisting of N cells that take on the values {-1,1}.\n\n2.1. Dynamic Network Measures\n\nGeneral associative memory networks are described by two sets of equations. If we let X(j) represent the activation state at time j and W(k) represent the weight matrix or interconnection state at time k then the activation or recall equation is described by\n\n$$X(j+1) = f(X(j), W(k)), \\quad j \\ge 0, \\; k \\ge 0, \\; X(0) = X$$ (2.1)\n\nwhere X is the data probe vector used for recall. The learning algorithm or interconnection equation is described by\n\n$$W(k+1) = g(V(i), \\; 0 \\le i < k, \\; W(0))$$ (2.2)\n\nwhere {V(i)} are the information vectors (IVs) to be stored and W(0) is the initial state of the interconnection matrix. Usually the learning algorithm time scale is much longer than the recall equation time scale so that W in (2.1) can be considered time invariant. Often (2.1) is viewed as the equation governing short term memory and (2.2) is the equation governing long term memory. From the Hebbian hypothesis we note that the data probe vectors should have an effect on the interconnection matrix W. If a number of data probe vectors recall an IV V(i), the strength of recall of the IV V(i) should be increased by appropriate modification of W. If another IV is never recalled, it should gradually be forgotten by again adjusting terms of W. Following the analysis in [4,5] we assume that all components of IVs introduced are independent and identically distributed Bernoulli random variables with the probability of a 1 or -1 being chosen equal to 1/2. 
\n\nOur analysis focuses on learning algorithms. Before describing some dynamic learning algorithms we present some definitions. A network is defined as dynamic if, given some period of time, the rate of change of W is never zero. In addition we will primarily discuss networks where learning is gradual and updated at discrete times as shown in (2.2). By gradual, we mean networks where each update usually consists of one IV being learned and/or forgotten. IVs that have been introduced recently should have a high probability of recovery. The probability of recall for one IV should also be a monotonic decreasing function of time, given that the IV is not repeated. The networks that we consider should also have a relatively low probability of network saturation.\n\nQuantitatively, we let e(k,l,i) be the event that an IV introduced at time l can be recovered at time k with a data probe vector which is of Hamming distance i from the desired IV. The efficacy of network recovery is then given as p(k,l,i) = Pr(e(k,l,i)). In the analysis performed we say a vector V can recover V(l) if V(l) = \u0394(V), where \u0394(.) is a synchronous activation update of all cells in the network. The capacity for dynamic networks is then given by\n\n$$C(k,i,\\epsilon) = \\max \\{ m : \\Pr( r(e(k,l,i), \\; 0 \\le l < k) = m ) > 1 - \\epsilon \\}, \\quad 0 \\le i < N/2$$ (2.3)\n\nwhere r(X) gives the cardinality of the number of events that occur in the set X. Closely related to the capacity of a network is network saturation. Saturation occurs when the network is overloaded with IVs such that few or none of the IVs can be successfully recovered. When a network at time 0 starts to learn IVs, at some time l < j we have that C(l,i,\u03b5) > C(j,i,\u03b5).  
For k > l the network saturation probability is defined by S(k,m) where S describes the probability that the network cannot recover m IVs.\n\nAnother important measure in analyzing the performance of dynamic networks is the plasticity of the interconnections of the weight matrix W. Following definitions that are similar to [10], define\n\n$$h(k) = \\frac{1}{N(N-1)} \\sum_{j=1}^{N} \\sum_{i \\ne j} \\mathrm{VAR}\\{W_{i,j}(k) - W_{i,j}(k-1)\\}$$ (2.4)\n\nas the incremental synaptic intensity and\n\n$$H(k) = \\frac{1}{N(N-1)} \\sum_{j=1}^{N} \\sum_{i \\ne j} \\mathrm{VAR}\\{W_{i,j}(k)\\}$$ (2.5)\n\nas the cumulative synaptic intensity. From these definitions we can define the plasticity of the network as\n\n$$P(k) = \\frac{h(k)}{H(k)}$$ (2.6)\n\nWhen network plasticity is zero, the network does not change and no learning takes place. When plasticity is high, the network interconnections exhibit large changes.\n\nWhen analyzing dynamic networks we are often interested in whether the network reaches a steady state. We say a dynamic network reaches steady state if\n\n$$\\lim_{k \\to \\infty} H(k) = H$$ (2.7)\n\nwhere H is a finite nonzero constant. If the IVs have stationary statistics and given that the learning operations are time invariant, then if a network reaches steady state, we have that\n\n$$\\lim_{k \\to \\infty} P(k) = P$$ (2.8)\n\nwhere P is a finite constant. It is also easily verified from (2.6) that if the plasticity converges to a nonzero constant in a dynamic network, then given the above conditions on the IVs and the learning operations the network will eventually reach steady state. 
\n\nLet us also define the synaptic state at time k for activation state V as\n\n$$s(k,V) = W(k)V$$ (2.9)\n\nFrom the synaptic state, we can define the SNR of V, which we show in section 3 is closely related to the efficacy of an IV and the capacity of the network.\n\n$$\\mathrm{SNR}(k,V,i) = \\frac{(E(s_i(k,V)))^2}{\\mathrm{VAR}(s_i(k,V))}$$ (2.10)\n\nAnother quantity that is important in measuring dynamic networks is the complexity of implementation. Quantities dealing with network complexity are discussed in [12] and this paper focuses on networks that are memoryless. A network is memoryless if (2.2) can be expressed in the following form:\n\n$$W(k+1) = g^{\\#}(W(k), V(k))$$ (2.11)\n\nNetworks that are not memoryless have the disadvantage that all IVs need to be saved during all learning updates. The complexity of implementation is greatly increased in terms of space complexity and very likely increased in terms of time complexity.\n\n2.2. Examples of Dynamic Associative Memory Networks\n\nThe previous subsection discussed some quantities to measure dynamic networks. This subsection discusses some examples of dynamic associative memory networks and qualitatively discusses advantages and disadvantages of different networks. All the networks considered have the memoryless property.\n\nThe first network that we discuss is described by the following difference equation\n\n$$W(k+1) = a(k)W(k) + b(k)L(V(k))$$ (2.12)\n\nwith W(0) being the initial value of weights before any learning has taken place. Networks with these learning rules will be labeled as Linear Updated Learning (LUL) networks and in addition if 0 < a(k) < 1 for k \u2265 0 the network is labeled as an Attenuated Linear Updated Learning (ALUL) network.  
We will primarily deal with ALUL where 0 < a(k) < 1 and b(k) do not depend on the position in W. This model is a specialized version of Grossberg's Passive Decay LTM equation discussed in [11]. If the learning algorithm is of the correlation type then\n\n$$L(V(k)) = V(k)V(k)^T - I$$ (2.13)\n\nThis learning scheme has similarities to the marginalist learning schemes introduced in [10]. One of the key parameters in the ALUL network is the value of the attenuation coefficient a. From simulations and intuition we know that if the attenuation coefficient is too high, the network will saturate and if the attenuation parameter is too low, the network will forget all but the most recently introduced IVs. Fig. 1 uses Monte Carlo methods to show a plot of the number of IVs recoverable in a 64 cell network when a = 1 (the HAMN) as a function of the learning time scale. From this figure we clearly see that network saturation is exhibited and for the time k \u2265 25 no IVs are recoverable with high probability. Section 3 further analyzes the ALUL network and derives the value of different measures introduced in section 2.1.\n\nAnother learning scheme called bounded learning (BL) can be described by\n\n$$L(V(k)) = \\begin{cases} V(k)V(k)^T - I, & F(W(k)) \\le A \\\\ 0, & F(W(k)) > A \\end{cases}$$ (2.14)\n\nBy setting the attenuation parameter a = 1 and letting\n\n$$F(W(k)) = \\max_{i,j} W_{i,j}(k)$$ (2.15)\n\nthis is identical to the learning with bounds scheme discussed in [10]. Unfortunately there is a serious drawback to this model. If A is too large the network will saturate with high probability. If A is set such that the probability of network saturation is low then the network has the characteristic of not learning for almost all values of k > k(A) = min l : F(W(l)) \u2265 A.  
Therefore we have that the efficacy of network recovery p(k,l,0) \u2248 0 for all k \u2265 l \u2265 k(A).\n\nIn order for the BL scheme to be classified as dynamic learning, the attenuation parameter a must have values between 0 and 1. This learning scheme is just a more complex version of the learning scheme derived from (2.12, 2.13). Let us qualitatively analyze the learning scheme when a and b are constant. There are two cases to consider. When A > H, then the network is not affected by the bounds and the network behaves as the ALUL network. When A < H, then the network accepts IVs until the bound is reached. When the bound is reached, the network waits until the values of the interconnection matrix have attenuated to the prescribed levels where learning can continue. If A is judiciously chosen, BL with a < 1 provides a means for a network to avoid saturation. By holding an IV until H(k) < A, it is not too difficult to show that this learning scheme is equivalent to an ALUL network with b(k) time varying.\n\nA third learning scheme called refresh learning (RL) can be described by (2.12) with b(k) = 1, W(0) = 0, and\n\n$$a(k) = 1 - \\delta(k \\bmod l)$$ (2.16)\n\nThis learning scheme learns a set of IVs and periodically refreshes the weighting matrix so that all interconnections are 0. RL can be classified as dynamic learning, but learning is not gradual during the periodic refresh cycle. Another problem with this learning scheme is that the efficacy of the IVs depends on where during the period they were learned.  
IVs learned late in a period are quickly forgotten whereas IVs learned early in a period have a longer time in which they are recoverable.\n\nIn all the learning schemes introduced, the network has both learning and forgetting capabilities. A network introduced in [7,8] separates the learning and forgetting tasks by using the standard HAMN algorithm to learn IVs and a random selective forgetting algorithm to unlearn excess information. The algorithm, which we call random selective forgetting (RSF), can be described formally as follows.\n\n$$W(k+1) = Y(k) + L(V(k))$$ (2.17)\n\nwhere\n\n$$Y(k) = W(k) - \\mu(k) \\sum_{i=1}^{n(F(W(k)))} \\left( V(k,i)V(k,i)^T - n(F(W(k))) I \\right)$$ (2.18)\n\nEach of the vectors V(k,i) is obtained by choosing a random vector V in the same manner IVs are chosen and letting V be the initial state of the HAMN with interconnection matrix W(k). The recall operation described by (2.1) is repeated until the activation has settled into a local minimum state. V(k,i) is then assigned this state. \u03bc(k) is the rate at which the randomly selected local minimum energy states are forgotten, F(W(k)) is given by (2.15), and n(X) is a nonnegative integer valued function that is a monotonically increasing function of X.\n\nThe analysis of the RSF algorithm is difficult, because the energy manifold that describes the energy of each activation state and the updates allowable for (2.1) must be well understood. There is a simple transformation between the weighting matrix and the energy of an activation state given below,\n\n$$E(X(k)) = -\\frac{1}{2} \\sum_i \\sum_j W_{i,j} X_i(k) X_j(k), \\quad k > 0$$ (2.19)\n\nbut aggregately analyzing all local minimum energy activation states is complex.  
Through computer simulations and simplified assumptions [7,8] have come up with a qualitative explanation of the RSF algorithm based on an eigenvalue approach.\n\n3. Analysis of the ALUL Network\n\nSection 2 focused on defining properties and analytical measures for dynamic AMNs along with presenting some examples of learning algorithms for dynamic AMNs. This section will focus on the analysis of one of the simpler algorithms, the ALUL network. From (2.12) we have that the time invariant ALUL network can be described by the following interconnection state equation.\n\n$$W(k+1) = aW(k) + bL(V(k))$$ (3.1)\n\nwhere a and b are nonnegative real numbers. Many of the measures introduced in section 2 can easily be determined for the ALUL network.\n\nTo calculate the incremental synaptic intensity h(k) and the cumulative synaptic intensity H(k) let the initial condition of the interconnection state W_{i,j}(0) be independent of all other interconnection states and independent of all IVs. If E W_{i,j}(0) = 0 and VAR W_{i,j}(0) = \u03b3 then\n\n(3.2)\n\nand\n\n(3.3)\n\nIn steady state when a < 1 we have that\n\n$$P = 2(1-a)$$ (3.4)\n\nFrom this simple relationship between the attenuation parameter a and the plasticity measure P, we can directly relate plasticity to other measures such as the capacity of the network.\n\nWe define the steady state capacity as $C(i,\\epsilon) = \\lim_{k \\to \\infty} C(k,i,\\epsilon)$ for networks where steady state exists. To analytically determine the capacity first assume that s(k,V(j)) = s(k-j) is a jointly Gaussian random vector. Further assume that s_i(l) for 1 \u2264 i < N, 1 \u2264 l < m are all independent and identically distributed.  
Then for N sufficiently large, $f(a) = a^{2(k-j-1)}(1-a^2)$, and we have that\n\n$$\\mathrm{SNR}(k,V(j)) = \\mathrm{SNR}(k-j) = \\frac{(N-1) f(a)}{1 - f(a)} = c(a) \\log N \\gg 1, \\quad j < k$$ (3.5)\n\n$$p(k,j,0) \\approx 1 - \\frac{N^{1 - c(a)/2}}{\\sqrt{2 \\pi c(a) \\log N}}, \\quad j < k$$ (3.6)\n\nGiven a we first find the largest m = k-j > 0 where $\\lim_{N \\to \\infty} p(k,j,0) \\approx 1$. Note that $\\lim_{N \\to \\infty} p(k,j,0) = 1$ when c(a) > 2. By letting c(a) = 2 the maximum m is given when\n\n$$\\frac{f(a)}{1 - f(a)} = \\frac{2 \\log N}{N}$$ (3.7)\n\nSolving for m we get that\n\n$$m = \\left[ \\frac{1}{2 \\log a} \\log \\frac{2 \\log N}{(N + 2 \\log N)(1 - a^2)} \\right] + 1$$ (3.8)\n\nIt is also possible to find the value of a that maximizes m. If we let \u03b5 = 1 - a^2, then\n\n$$m \\approx -\\frac{1}{\\epsilon} \\log \\frac{2 \\log N}{(N + 2 \\log N)\\epsilon}$$ (3.9)\n\nm is at a maximum value when \u03b5 \u2248 2e log N / N or when m \u2248 N / (2e log N). This corresponds to a \u2248 (2m-1)/(2m). Note that this is a factor of e less than the maximum number of IVs allowable in a static HAMN [4,5], such that one of the IVs is recoverable. By following the analysis in [5], the independence assumption and the Gaussian assumptions used earlier can be removed. The arguments involve using results from exchangeability theory and normal approximation theory.\n\nA similar and somewhat more cumbersome analysis can be performed to show that in steady state the maximum capacity achievable is when a \u2248 (2m-1)/(2m) and given by\n\n$$\\lim_{N \\to \\infty} C(k,0,\\epsilon) = \\frac{N}{4e \\log N}$$ (3.10)\n\nThis again is a factor of e less than the maximum number of IVs allowable in a static HAMN [4,5], such that all IVs are recoverable. Fig.  
2 shows a Monte Carlo simulation of the number of IVs recoverable in a 64 cell network versus the learning time scale for a varying between .5 and .99. We can see that the network reaches approximate steady state when k \u2265 35. The maximum capacity achievable is when a \u2248 .9 and the capacity is around 5. This is slightly more than the theoretical value predicted by the analysis just shown when we compare to Fig. 1. For smaller simulations conducted with larger networks the simulated capacity was closer to the predicted value. From the simulations and the analysis we observe that when a is too small IVs are forgotten at too high a rate and when a is too high network saturation occurs.\n\nUsing the same arguments, it is possible to analyze the capacity of the network and efficacy of IVs when k is small. Assuming zero initial conditions and a \u2248 (2m-1)/(2m) we can summarize the learning behavior of the ALUL network. The learning behavior can be divided into three phases. In the first phase for k < N / (4e log N) all IVs are remembered and the characteristics of the network are similar to the HAMN below saturation. In the second phase some IVs are forgotten as the rate of forgetting becomes nonzero. During this phase the maximum capacity is reached as shown in Fig. 2. At this capacity the network cannot dynamically recall all IVs so the network starts to forget more information than it receives. This continues until steady state is reached where the learning and forgetting rates are equal. If initial conditions are nonzero the network starts in phase 1 or the beginning of phase 2 if H(k) is below the value corresponding to the maximum capacity and at the end of phase 2 for larger H(k). 
\n\nThe calculation of the network saturation probabilities S(k,m) is trivial for large networks when the capacity curves have been found. When m \u2264 C(k,0,\u03b5) then S(k,m) \u2248 0, otherwise S(k,m) \u2248 1.\n\nBefore leaving this section let us briefly examine ALUL networks where a(k) and b(k) are time varying. An example of a time varying network is the marginalist learning scheme introduced in [10]. The network is defined by fixing the value of the SNR(k,k-1,i) = D(N) for all k. This value is fixed by setting a = 1 and varying b. Since VAR s_i(k,V(k-1)) is a monotonic increasing function of k, b(k) must also be a monotonic increasing function of k. It is not too difficult to show that when k is large, the marginalist learning scheme is equivalent to the steady state ALUL defined by (3.1). The argument is based on noting that the steady state SNR depends not on the update time, but on the difference between the update time and when the IV was stored, as is the case with the marginalist learning scheme. The optimal value of D(N) giving the highest capacity is when D(N) = 4e log N and\n\n$$b(k+1) = \\frac{2m}{2m-1} b(k)$$ (3.11)\n\nwhere m = N / (4e log N).\n\nIf performance is defined by a worst case criterion with the criterion being\n\n$$J(l,N) = \\min(C(k,0,\\epsilon), \\; k \\ge l)$$ (3.12)\n\nthen we conjecture that for l large, no ALUL as defined in (2.12, 2.13) can have larger J(l,N) than the optimal ALUL defined by (3.1). If we consider average capacity, we note that the RL network has an average capacity of N / (8 log N) which is larger than the optimal ALUL network defined in (3.1).  
However, for most envisioned applications a worst case criterion is a more accurate measure of performance than a criterion based on average capacity.\n\n4. Summary\n\nThis paper has introduced a number of simple dynamic neural network models and defined several measures to evaluate the performance of these models. All parameters for the steady state ALUL network described by (3.1) were evaluated and the attenuation parameter a giving the largest capacity was found. This capacity was found to be a factor of e less than the static HAMN capacity. Furthermore we conjectured that if we consider a worst case performance criterion, no ALUL network could perform better than the optimal ALUL network defined by (3.1). Finally, a number of other dynamic models including BL, RL, and marginalist learning were stated to be equivalent to ALUL networks under certain conditions.\n\nThe network models that were considered in this paper all have binary vector valued activation states and may be too simplistic to be considered in many signal processing applications. By generalizing the analysis to more complicated models with analog vector valued activation states and continuous time updating it may be possible to use these generalized models in speech and image processing. A specific example would be a controller for a moving robot. The generalized network models would learn the input data by adaptively changing the interconnections of the network. Old data would be forgotten and data that was repeatedly being recalled would be reinforced. These network models could also be used when the input data statistics are nonstationary.\n\nReferences\n\n[1] W. S. McCulloch and W.  
Pitts, \"A Logical Calculus of the Ideas Immanent in Nervous Activity\", Bulletin of Mathematical Biophysics, 5, 115-133, 1943.\n\n[2] J. J. Hopfield, \"Neural Networks and Physical Systems with Emergent Collective Computational Abilities\", Proc. Natl. Acad. Sci. USA 79, 2554-2558, 1982.\n\n[3] Y. S. Abu-Mostafa and J. M. St. Jacques, \"The Information Capacity of the Hopfield Model\", IEEE Trans. Inform. Theory, vol. IT-31, 461-464, 1985.\n\n[4] R. J. McEliece, E. C. Posner, E. R. Rodemich and S. S. Venkatesh, \"The Capacity of the Hopfield Associative Memory\", IEEE Trans. Inform. Theory, vol. IT-33, 461-482, 1987.\n\n[5] A. Kuh and B. W. Dickinson, \"Information Capacity of Associative Memories\", to be published IEEE Trans. Inform. Theory.\n\n[6] D. J. Amit, H. Gutfreund, and H. Sompolinsky, \"Spin-Glass Models of Neural Networks\", Phys. Rev. A, vol. 32, 1007-1018, 1985.\n\n[7] J. J. Hopfield, D. I. Feinstein, and R. G. Palmer, \"'Unlearning' has a Stabilizing Effect in Collective Memories\", Nature, vol. 304, 158-159, 1983.\n\n[8] R. J. Sasiela, \"Forgetting as a way to Improve Neural-Net Behavior\", AIP Conference Proceedings 151, 386-392, 1986.\n\n[9] J. D. Keeler, \"Basins of Attraction of Neural Network Models\", AIP Conference Proceedings 151, 259-265, 1986.\n\n[10] J. P. Nadal, G. Toulouse, J. P. Changeux, and S. Dehaene, \"Networks of Formal Neurons and Memory Palimpsests\", Europhysics Lett., Vol. 1, 535-542, 1986.\n\n[11] S. Grossberg, \"Nonlinear Neural Networks: Principles, Mechanisms, and Architectures\", Neural Networks, in press.\n\n[12] S. S. Venkatesh and D.  
Psaltis, \"Information Storage and Retrieval in Two Associative Nets\", California Institute of Technology Pasadena, Dept. of Elect. Eng., preprint, 1986.\n\nFig. 1. \"HAMN Capacity\": average number of IVs recoverable versus update time (N=64, 1024 trials).\n\nFig. 2. \"ALUL Capacity\": average number of IVs recoverable versus update time for a = .5, .7, .90, .95, .99 (N=64, 1024 trials).", "award": [], "sourceid": 46, "authors": [{"given_name": "Anthony", "family_name": "Kuh", "institution": null}]}