{"title": "Learning on a General Network", "book": "Neural Information Processing Systems", "page_first": 22, "page_last": 30, "abstract": null, "full_text": "LEARNING ON A GENERAL NETWORK \n\nAmir F. Atiya \n\nDepartment of Electrical Engineering \n\nCalifornia Institute of Technology \n\nCA 91125 \n\nAbstract \n\nThis paper generalizes the backpropagation method to a general network containing feedback connections. The network model considered consists of interconnected groups of neurons, where each group could be fully interconnected (it could have feedback connections, with possibly asymmetric weights), but no loops between the groups are allowed. A stochastic descent algorithm is applied, under a certain inequality constraint on each intra-group weight matrix which ensures that the network possesses a unique equilibrium state for every input. \n\nIntroduction \n\nIt has been shown in the last few years that large networks of interconnected \"neuron\"-like elements are quite suitable for performing a variety of computational and pattern recognition tasks. One of the well-known neural network models is the backpropagation model [1]-[4]. It is an elegant way of teaching a layered feedforward network with a set of given input/output examples. Neural network models having feedback connections, on the other hand, have also been devised (for example the Hopfield network [5]), and have been shown to be quite successful in performing some computational tasks. It is important, though, to have a method for learning by examples for a feedback network, since this is a general way of design, and thus one can avoid using an ad hoc design method for each different computational task. The existence of feedback is expected to improve the computational abilities of a given network. 
This is because in feedback networks the state iterates until a stable state is reached. Thus processing is performed in several steps or recursions. This, in general, allows more processing abilities than the \"single step\" feedforward case (note also that a feedforward network is a special case of a feedback network). Therefore, in this work we consider the problem of developing a general learning algorithm for feedback networks. \n\n\u00a9 American Institute of Physics 1988 \n\nIn developing a learning algorithm for feedback networks, one has to pay attention to the following (see Fig. 1 for an example of a configuration of a feedback network). The state of the network evolves in time until it goes to equilibrium, or possibly other types of behavior such as periodic or chaotic motion could occur. However, we are interested in having a steady and fixed output for every input applied to the network. Therefore, we have the following two important requirements for the network. Beginning in any initial condition, the state should ultimately go to equilibrium. The other requirement is that we have to have a unique equilibrium state. It is in fact that equilibrium state that determines the final output. The objective of the learning algorithm is to adjust the parameters (weights) of the network in small steps, so as to move the unique equilibrium state in a way that will finally result in an output as close as possible to the required one (for each given input). The existence of more than one equilibrium state for a given input causes the following problems. 
In some iterations one might be updating the weights so as to move one of the equilibrium states in a sought direction, while in other iterations (especially with different input examples) a different equilibrium state is moved. Another important point is that when implementing the network (after the completion of learning), for a fixed input there can be more than one possible output. Independently, other work appeared recently on training a feedback network [6],[7],[8]. Learning algorithms were developed, but solving the problem of ensuring a unique equilibrium was not considered. This problem is addressed in this paper, and an appropriate network and a learning algorithm are proposed. \n\nFig. 1. A recurrent network. \n\nThe Feedback Network \n\nConsider a group of n neurons which could be fully interconnected (see Fig. 1 for an example). The weight matrix W can be asymmetric (as opposed to the Hopfield network). The inputs are also weighted before entering into the network (let V be the weight matrix). Let x and y be the input and output vectors respectively. In our model y is governed by the following set of differential equations, proposed by Hopfield [5]: \n\ntau du/dt = W f(u) - u + V x,    y = f(u)    (1) \n\nwhere f(u) = (f(u_1), ..., f(u_n))^T, T denotes the transpose operator, f is a bounded and differentiable function, and tau is a positive constant. \n\nFor a given input, we would like the network, after a short transient period, to give a steady and fixed output, no matter what the initial network state was. This means that beginning from any initial condition, the state is to be attracted towards a unique equilibrium. This leads to looking for a condition on the matrix W. 
\n\nTheorem: A network (not necessarily symmetric) satisfying \n\nsum_i sum_j w_ij^2 < 1/max(f')^2 \n\nexhibits no other behavior except going to a unique equilibrium for a given input. \n\nProof: Let u_1(t) and u_2(t) be two solutions of (1). Let \n\nJ(t) = (1/2) ||u_1(t) - u_2(t)||^2, \n\nwhere || || is the two-norm. Differentiating J with respect to time, one obtains \n\ndJ(t)/dt = (u_1(t) - u_2(t))^T (du_1/dt - du_2/dt). \n\nUsing (1), the expression becomes \n\ndJ(t)/dt = -(1/tau) ||u_1(t) - u_2(t)||^2 + (1/tau) (u_1(t) - u_2(t))^T W [f(u_1(t)) - f(u_2(t))]. \n\nUsing Schwarz's inequality, we obtain \n\ndJ(t)/dt <= -(1/tau) ||u_1(t) - u_2(t)||^2 + (1/tau) ||u_1(t) - u_2(t)|| ||W [f(u_1(t)) - f(u_2(t))]||. \n\nAgain, by Schwarz's inequality, \n\n|w_i [f(u_1(t)) - f(u_2(t))]| <= ||w_i|| ||f(u_1(t)) - f(u_2(t))||,    i = 1, ..., n    (2) \n\nwhere w_i denotes the ith row of W. Using the mean value theorem, we get \n\n||f(u_1(t)) - f(u_2(t))|| <= (max|f'|) ||u_1(t) - u_2(t)||.    (3) \n\nUsing (2), (3), and the expression for J(t), we get \n\ndJ(t)/dt <= -a J(t)    (4) \n\nwhere \n\na = (2/tau) [1 - (max|f'|) (sum_i sum_j w_ij^2)^(1/2)]. \n\nBy the hypothesis of the Theorem, a is strictly positive. Multiplying both sides of (4) by exp(at), the inequality \n\nd[J(t) e^(at)]/dt <= 0 \n\nresults, from which we obtain \n\nJ(t) <= J(0) e^(-at). \n\nFrom that and from the fact that J is non-negative, it follows that J(t) goes to zero as t -> infinity. Therefore, any two solutions corresponding to any two initial conditions ultimately approach each other. To show that this asymptotic solution is in fact an equilibrium, one simply takes u_2(t) = u_1(t + T), where T is a constant, and applies the above argument (that J(t) -> 0 as t -> infinity); hence u_1(t + T) -> u_1(t) as t -> infinity for any T, and this completes the proof. \n\nFor example, if the function f is of the following widely used sigmoid-shaped form, \n\nf(u) = 1/(1 + e^(-u)), \n\nthen max f' = 1/4, and the sum of the squares of the weights should be less than 16. Note that for any function f, scaling does not have an effect on the overall results. 
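The Theorem lends itself to a direct numerical check. The sketch below is illustrative only (the weight values, Euler step size, and integration horizon are arbitrary choices, not from the paper): it integrates equation (1) for several random initial conditions and verifies that, with the sigmoid f and the sum of squared weights below 16, all trajectories settle to the same equilibrium.

```python
import numpy as np

def f(u):
    # Sigmoid nonlinearity; max f' = 1/4, so the Theorem requires
    # the sum of squared weights to be less than 16.
    return 1.0 / (1.0 + np.exp(-u))

def settle(W, V, x, u0, tau=1.0, dt=0.01, steps=5000):
    # Euler integration of tau du/dt = W f(u) - u + V x.
    u = u0.copy()
    for _ in range(steps):
        u += (dt / tau) * (W @ f(u) - u + V @ x)
    return u

rng = np.random.default_rng(0)
n = 4
W = rng.normal(size=(n, n))           # asymmetric weight matrix
W *= 2.0 / np.sqrt((W ** 2).sum())    # scale so sum of squares = 4 < 16
V = np.eye(n)
x = rng.normal(size=n)

# Different initial conditions should all converge to one equilibrium.
finals = [settle(W, V, x, rng.normal(size=n)) for _ in range(5)]
```

With this scaling the contraction rate of the proof is a = (2/tau)(1 - max|f'| sqrt(sum of squares)) = 2 (1 - 0.25 * 2) = 1, so the spread between trajectories shrinks like e^(-t) over the horizon t = 50.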
We have to work in our updating scheme subject to the constraint given in the Theorem. In many cases where a large network is necessary, this constraint might be too restrictive. Therefore we propose a general network, which is explained in the next Section. \n\nThe General Network \n\nWe propose the following network (for an example refer to Fig. 2). The neurons are partitioned into several groups. Within each group there are no restrictions on the connections, and therefore the group could be fully interconnected (i.e. it could have feedback connections). The groups are connected to each other, but in a way that there are no loops. The inputs to the whole network can be connected to the inputs of any of the groups (each input can have several connections to several groups). The outputs of the whole network are taken to be the outputs (or part of the outputs) of a certain group, say group f. The constraint given in the Theorem is applied to each intra-group weight matrix separately. Let (q^a, s^a), a = 1, ..., N be the input/output vector pairs of the function to be implemented. We would like to minimize the sum of the square error, given by \n\nE = sum_{a=1}^{N} e^a, \n\nwhere \n\ne^a = sum_{i=1}^{M} (y_i^f - s_i^a)^2, \n\nand y^f is the output vector of group f upon giving input q^a, and M is the dimension of vector s^a. The learning process is performed by feeding the input examples q^a sequentially to the network, each time updating the weights in an attempt to minimize the error. \n\nFig. 2. An example of a general network (each group represents a recurrent network). \n\nNow, consider a single group l. Let W^l be the intra-group weight matrix of group l, V^{rl} be the matrix of weights between the outputs of group r and the inputs of group l, and y^l be the output vector of group l. Let the respective elements be w_ij^l, v_ij^{rl}, and y_i^l. Furthermore, let n_l be the number of neurons of group l. Assume that the time constant tau is sufficiently small so as to allow the network to settle quickly to the equilibrium state, which is given by the solution of the equation \n\ny^l = f(W^l y^l + sum_{r in A_l} V^{rl} y^r),    (5) \n\nwhere A_l is the set of the indices of the groups whose outputs are connected to the inputs of group l. We would like each iteration to update the weight matrices W^l and V^{rl} so as to move the equilibrium in a direction that decreases the error. We need therefore to know the change in the error produced by a small change in the weight matrices. Let de^a/dW^l and de^a/dV^{rl} denote the matrices whose (i,j)th elements are de^a/dw_ij^l and de^a/dv_ij^{rl} respectively. Let de^a/dy^l be the column vector whose ith element is de^a/dy_i^l. We obtain the following relations: \n\nde^a/dW^l = [A^l - (W^l)^T]^{-1} (de^a/dy^l) (y^l)^T, \n\nde^a/dV^{rl} = [A^l - (W^l)^T]^{-1} (de^a/dy^l) (y^r)^T, \n\nwhere A^l is the diagonal matrix whose ith diagonal element is 1/f'(sum_k w_ik^l y_k^l + sum_r sum_k v_ik^{rl} y_k^r) (for a derivation refer to the Appendix). The vector de^a/dy^l associated with group l can be obtained in terms of the vectors de^a/dy^j, j in B_l, where B_l is the set of the indices of the groups whose inputs are connected to the outputs of group l. We get (refer to the Appendix) \n\nde^a/dy^l = sum_{j in B_l} (V^{lj})^T [A^j - (W^j)^T]^{-1} de^a/dy^j.    (6) \n\nThe matrix A^l - (W^l)^T for any group l can never be singular, so we will not face any problem in the updating process. To prove that, let z be a vector satisfying \n\n[A^l - (W^l)^T] z = 0. \n\nWe can write \n\n|z_i| / max|f'| <= |sum_k w_ki^l z_k|,    i = 1, ..., n_l, \n\nwhere z_i is the ith element of z. 
Using Schwarz's inequality, we obtain \n\n|z_i| / max|f'| <= (sum_k (w_ki^l)^2)^(1/2) ||z||,    i = 1, ..., n_l. \n\nSquaring both sides and adding the inequalities for i = 1, ..., n_l, we get \n\nsum_i z_i^2 <= max(f')^2 (sum_k z_k^2) sum_i sum_k (w_ki^l)^2.    (7) \n\nSince the condition \n\nsum_i sum_k (w_ik^l)^2 < 1/max(f')^2 \n\nis enforced, it follows that (7) cannot be satisfied unless z is the zero vector. Thus, the matrix A^l - (W^l)^T cannot be singular. \n\nFor each iteration we begin by updating the weights of group f (the group containing the final outputs). For that group, de^a/dy^f equals simply 2(y_1^f - s_1^a, ..., y_M^f - s_M^a, 0, ..., 0)^T. Then we move backwards to the groups connected to that group, obtain their corresponding de^a/dy vectors using (6), update the weights, and proceed in the same manner until we complete updating all the groups. Updating the weights is performed using the following stochastic descent algorithm for each group: \n\nDelta W = -a_1 de^a/dW + a_2 e^a R, \n\nDelta V = -a_3 de^a/dV + a_4 e^a R, \n\nwhere R is a noise matrix whose elements are characterized by independent zero-mean unit-variance Gaussian densities, and the a's are parameters. The purpose of adding noise is to allow escaping local minima if one gets stuck in any of them. Note that the control parameter is taken to be e^a. Hence the variance of the added noise tends to decrease the more we approach the ideal zero-error solution. This makes sense because for a large error, i.e. for an unsatisfactory solution, it pays more to add noise to the weight matrices in order to escape local minima. On the other hand, if the error is small, then we are possibly near the global minimum or near an acceptable solution, and hence we do not want too much noise, in order not to be thrown out of that basin. 
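A minimal sketch of one such update follows. The step sizes a1 and a2, the bound value, and the function names are illustrative assumptions, and the gradient is assumed to be supplied by the formulas above:

```python
import numpy as np

def noisy_descent_step(W, grad_W, err, a1=0.05, a2=0.01, rng=None):
    # One stochastic descent update: Delta W = -a1 * de/dW + a2 * e * R,
    # where R has independent zero-mean, unit-variance Gaussian entries.
    # The noise is scaled by the error e, so it vanishes as the network
    # approaches a zero-error solution.
    if rng is None:
        rng = np.random.default_rng()
    R = rng.standard_normal(W.shape)
    return W + (-a1 * grad_W + a2 * err * R)

def project_to_ball(W, bound):
    # If the sum of squared weights exceeds the bound, scale W back onto
    # the hypersphere of radius sqrt(bound), restoring the constraint
    # sum_ij w_ij^2 <= bound < 1/max(f')^2.
    s = (W ** 2).sum()
    return W * np.sqrt(bound / s) if s > bound else W
```

In a full implementation, a step of this form would be applied to the W and V matrices of each group after every training example, followed by the projection when the constraint is violated.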
Note that once we reach the ideal zero-error solution, the added noise as well as the gradient of e^a become zero for all a, and hence the increments of the weight matrices become zero. If after a certain iteration W happens to violate the constraint sum_i sum_j w_ij^2 <= constant < 1/max(f')^2, then its elements are scaled so as to project it back onto the surface of the hypersphere. \n\nImplementation Example \n\nA pattern recognition example is considered. Fig. 3 shows a set of two-dimensional training patterns from three classes. It is required to design a neural network recognizer with three output neurons. Each of the neurons should be on if a sample of the corresponding class is presented, and off otherwise; i.e., we would like to design a \"winner-take-all\" network. A single-layer three-neuron feedback network is implemented. We obtained 3.3% error. Performing the same experiment on a feedforward single-layer network with three neurons, we obtained 20% error. For satisfactory results, a feedforward network should be two-layer. With one neuron in the first layer and three in the second layer, we got 36.7% error. Finally, with two neurons in the first layer and three in the second layer, we got a match with the feedback case, with 3.3% error. \n\nFig. 3. A pattern recognition example. \n\nConclusion \n\nA way to extend the backpropagation method to feedback networks has been proposed. A condition on the weight matrix is obtained, to ensure having only one fixed point, so as to prevent having more than one possible output for a fixed input. A general structure for 
A  general  structure for \nnetworks is presented, in which  the network consists of a  number of feedback  groups connected \nto each other in a feedforward  manner.  A stochastic descent rule is used to update the weights. \nThe lJ!ethod is applied to a  pattern recognition  example.  With a single-layer feedback  network \nit obtained good results.  On the other hand, the feedforward backpropagation method achieved \ngood resuls only for the case of more than one layer, hence also with a larger number of neurons \nthan the feedback  case. \n\n\f29 \n\nAcknow ledgement \n\nThe  author  would  like  to  gratefully  acknowledge  Dr.  Y.  Abu-Mostafa  for  the  useful \ndiscussions.  This  work  is  supported  by  Air  Force  Office  of  Scientific  Research  under  Grant \nAFOSR-86-0296. \n\nAppendix \n\nDifferentiating  (5),  one obtains \n\na I \na I \nYm \nYj \n- a  I  =  f  Zj  L..,Wjm-a \nwkp \nw kp \n\n'(')(,,\",  I \n\nm \n\n'6) \nI  +Yp  jk  , \n\nk,p =  1, ... ,n, \n\nwhere \n\nand \n\nWe  can write \n\nif j  = k \notherwise, \n\na~'  =  (A'  _  Wi) -lbkz> \nawkp \n\n(A  - 1) \n\nwhere b kp  is  the  nt-dimensional vector whose  ith  component  is  given  by \n\nBy  the chain  rule, \n\nb~l> =  {y~ \n0 \n\u2022 \n\nifi =  k \notherwise. \n\naea  _  \"\"' aea  ay; \n-a  I  -L..,-a I -a  I '  \nw kp \nYj  w kp \n\nj \n\nwhich,  upon  substituting from  (A - 1),  can be  put  in  the  form  y!,gk~' where  gk  is  the  kth \ncolumn  of (A'  - Wt)-l.  Finally,  we  obtain  the  required expression,  which  is \n\nae\"  =  [At  _  (WI)T] -1 ae\" (  ,)T \naw' \n. \n\nayl  y \n\nRegarding  a()~~I'  it  is  obtained  by  differentiating  (5)  with  respect  to vr~,.  We  get  similarly \n\nwhere  C kl'  is  the  nt-dimensional vector whose  ith  component  is  given  by \n\nif i  =  k \notherwise. \n\n\f30 \n\nA derivation very similar to  the case of  :~l results in the following  required expression: \n\nBea  =  [A'  _ (w,)T] -1 Bea (  r)T. 
\nBVrl \n\nBy'  y \n\n8 \n\n8 \n\nj \n\nNow,  finally  consider  ~. Let  ~, jf.B,  be  the  matrix  whose  (k,p)th  element  is  ~. The \nelements  of  ~ can be obtained  by  differentiating  the  equation  for  the  fixed  point for  group \n. \nJ,  as follows, \n\nuy \n\n8 y J \n\nHence, \n\n:~~.  =  (Ai - Wi) -IV'i. \n\n(A - 2) \n\nUsing  the chain rule, one  can write \n\nBea  = ~(ByJ)  Bea \nBy'  ~ Byl  By;' \n\n\u00b7T \n\nJEEr \n\nWe substitute from  (A - 2)  into the previous equation to complete the derivation by obtaining \n\nReferences \n\n111  P.  Werbos,  \"Beyond  regression:  New  tools  for  prediction  and  analysis  in  behavioral sci(cid:173)\n\nences\",  Harvard University  dissertation,  1974. \n\n[21  D. Parker, \"Learning logic\",  MIT Tech Report TR-47, Center for Computational Research \n\nin Economics and  Management Science,  1985. \n\n[31  Y.  Le  Cun, \"A learning scheme  for  asymmetric  threshold  network\",  Proceedings of Cog(cid:173)\n\nnitiva,  Paris,  June  1985. \n\n[41  D.  Rumelhart,  G.Hinton,  and  R.  Williams,  \"Learning  internal  representations  by  error \npropagation\", in D. Rumelhart, J. McLelland and the PDP research group (Eds.), Parallel \ndistributed processing:  Explorations in  the  microstructure  of cognition,  Vol.  1,  MIT Press, \nCambridge,  MA,  1986. \n\n151  J.  Hopfield,  \"Neurons with  graded  response  have  collective  computational properties like \n\nthose  of two-state neurons\",  Proc.  N atl.  Acad.  Sci.  USA,  May  1984. \n\n[61  L.  Ahneida, \" A learning rule for  asynchronous perceptrons with feedback in a combinato(cid:173)\n\nrial environment\",  Proc.  of the  First  Int.  Annual Conf.  on  Neural  Networks, San  Diego, \nJune  1987. \n\n[71  R.  Rohwer,  and  B.  Forrest, \"Training time-dependence in  neural networks\",  Proc.  of the \n\nFirst  Int.  Annual Conf.  on  Neural  Networks,  San  Diego,  June  1987. \n\n[81  F. 
Pineda, \"Generalization of back-propagation to recurrent neural networks\", Phys.  Rev. \n\nLett.,  vol.  59,  no.  19, 9  Nov.  1987. \n\n\f", "award": [], "sourceid": 9, "authors": [{"given_name": "Amir", "family_name": "Atiya", "institution": null}]}