{"title": "A Back-Propagation Algorithm with Optimal Use of Hidden Units", "book": "Advances in Neural Information Processing Systems", "page_first": 519, "page_last": 526, "abstract": null, "full_text": "A BACK-PROPAGATION ALGORITHM WITH OPTIMAL USE OF HIDDEN UNITS \n\nYves Chauvin \n\nThomson-CSF, Inc \n\n(and Psychology Department, Stanford University) \n\n630, Hansen Way (Suite 250) \n\nPalo Alto, CA 94306 \n\nABSTRACT \n\nThis paper presents a variation of the back-propagation algorithm that makes optimal use of a network's hidden units by decreasing an \"energy\" term written as a function of the squared activations of these hidden units. The algorithm can automatically find optimal or nearly optimal architectures necessary to solve known Boolean functions, facilitate the interpretation of the activation of the remaining hidden units, and automatically estimate the complexity of architectures appropriate for phonetic labeling problems. The general principle of the algorithm can also be adapted to different tasks: for example, it can be used to eliminate the [0, 0] local minimum of the [-1, +1] logistic activation function while preserving a much faster convergence and forcing binary activations over the set of hidden units. \n\nPRINCIPLE \n\nThis paper describes an algorithm which makes optimal use of the hidden units in a network using the standard back-propagation algorithm (Rumelhart, Hinton & Williams, 1986). Optimality is defined as the minimization of a function of the \"energy\" spent by the hidden units throughout the network, independently of the chosen architecture, and where the energy is written as a function of the squared activations of the hidden units. 
\n\nThe standard back-propagation algorithm is a gradient descent algorithm on the following cost function: \n\n$$C = \\sum_{j}^{P} \\sum_{i}^{O} (d_{ij} - o_{ij})^2 \\qquad [1]$$ \n\nwhere d is the desired output of an output unit, o the actual output, and where the sum is taken over the set of output units O for the set of training patterns P. \n\nThe algorithm presented here implements a gradient descent on the following cost function: \n\n$$C = \\mu_{er} \\sum_{j}^{P} \\sum_{i}^{O} (d_{ij} - o_{ij})^2 + \\mu_{en} \\sum_{j}^{P} \\sum_{i}^{H} e(o_{ij}^2) \\qquad [2]$$ \n\nwhere e is a positive monotonic function and where the sum of the second term is now taken over a set or subset of the hidden units H. The first term of this cost function will be called the error term, the second, the energy term. \n\nIn principle, the theoretical minimum of this function is found when the desired activations are equal to the actual activations for all output units and all presented patterns and when the hidden units do not \"spend any energy\". In practical cases, such a minimum cannot be reached and the hidden units have to \"spend some energy\" to solve a given problem. The quantity of energy will be in part determined by the relative importance given to the error and energy terms during gradient descent. In principle, if a hidden unit has a constant activation whatever the pattern presented to the network, it contributes to the energy term only and will be \"suppressed\" by the algorithm. The precise energy distribution among the hidden units will depend on the actual energy function e. \n\nANALYSIS \n\nALGORITHM IMPLEMENTATION \n\nWe can write the total cost function that the algorithm tries to minimize as a weighted sum of an error and energy term: \n\n$$C = \\mu_{er} E_{er} + \\mu_{en} E_{en} \\qquad [3]$$ \n\nThe first term is the error term used with the standard back-propagation algorithm in Rumelhart et al. 
If we have h hidden layers, we can write the total energy term as a sum of all the energy terms corresponding to each hidden layer: \n\n$$E_{en} = \\sum_{i}^{h} \\sum_{j}^{H_i} e(o_j^2) \\qquad [4]$$ \n\nTo decrease the energy of the uppermost hidden layer H_h, we can compute the derivative of the energy function with respect to the weights. This derivative will be null for any weight \"above\" the considered hidden layer. For any weight just below the considered hidden layer, we have (using Rumelhart et al. notation): \n\n$$\\delta_i^{en} = \\frac{\\partial e(o_i^2)}{\\partial net_i} \\qquad [5]$$ \n\n$$\\delta_i^{en} = \\frac{\\partial e(o_i^2)}{\\partial o_i^2} \\frac{\\partial o_i^2}{\\partial o_i} \\frac{\\partial o_i}{\\partial net_i} = 2 e'(o_i^2) \\, o_i \\, f_i'(net_i) \\qquad [6]$$ \n\nwhere the derivative of e is taken with respect to the \"energy\" of the unit i and where f corresponds to the logistic function. For any hidden layer below the considered layer h, the chain rule yields: \n\n$$\\delta_k^{en} = f_k'(net_k) \\sum_j \\delta_j^{en} w_{jk} \\qquad [7]$$ \n\nThis is just standard back-propagation with a different back-propagated term. If we minimize both the error at the output layer and the energy of the hidden layer h, we can compute the complete weight change for any connection below layer h: \n\n$$\\Delta w_{kl} = -\\alpha \\mu_{er} \\delta_k^{er} o_l - \\alpha \\mu_{en} \\delta_k^{en} o_l = -\\alpha o_l (\\mu_{er} \\delta_k^{er} + \\mu_{en} \\delta_k^{en}) = -\\alpha o_l \\delta_k^{ac} \\qquad [8]$$ \n\nwhere $\\delta_k^{ac}$ is now the delta accumulated for error and energy that we can write as a function of the deltas of the upper layer: \n\n$$\\delta_k^{ac} = f_k'(net_k) \\sum_j \\delta_j^{ac} w_{jk} \\qquad [9]$$ \n\nThis means that instead of propagating the delta for both energy and error, we can compute an accumulated delta for hidden layer h and propagate it back throughout the network. If we minimize the energy of the layers h and h-1, the new accumulated delta will equal the previously accumulated delta added to a new delta energy on layer h-1. 
The procedure can be repeated throughout the complete network. In short, the back-propagated error signal used to change the weights of each layer is simply equal to the back-propagated signal used in the previous layer augmented with the delta energy of the current hidden layer. (The algorithm is local and easy to implement.) \n\nENERGY FUNCTION \n\nThe algorithm is sensitive to the energy function e being minimized. The functions used in the simulations described below have the following derivative with respect to the squared activations/energy (only this derivative is necessary to implement the algorithm, see Equation [6]): \n\n$$e'(o^2) = \\frac{\\partial e}{\\partial o^2} = \\frac{1}{(1 + o^2)^n} \\qquad [10]$$ \n\nwhere n is an integer that determines the precise shape of the energy function (see Table 1) and modulates the behavior of the algorithm in the following way. For n = 0, e is a linear function of the energy: \"high and low energy\" units are equally penalized. For n = 1, e is a logarithmic function and \"low energy\" units become more penalized than \"high energy\" units, in proportion to the linear case. For n = 2, the energy penalty may reach an asymptote as the energy increases: \"high energy\" units are not penalized more than \"middle energy\" units. In the simulations, as expected, it appears that higher values of n tend to suppress \"low energy\" units. (For n > 2, the behavior of the algorithm was not significantly different from n = 2 for the tests described below.) \n\nTABLE 1: Energy Functions. \n\nn        e \n0        o^2 \n1        log(1 + o^2) \n2        o^2 / (1 + o^2) \nn > 2    ? \n\nBOOLEAN EXAMPLES \n\nThe algorithm was tested with a set of Boolean problems. 
In typical tasks, the energy of the network significantly decreases during early learning. Later on, the network finds a better minimum of the total cost function by decreasing the error and by \"spending\" energy to solve the problem. Figure 1 shows energy and error as a function of the number of learning cycles during a typical task (XOR) for 4 different runs. For a broad range of the energy learning rate, the algorithm is quite stable and finds the solution to the given problem. This stable behavior is also quite independent of variations in the onset of energy minimization. \n\nEXCLUSIVE OR AND PARITY \n\nThe algorithm was tested with XOR for various network architectures. Figure 2 shows an example of the activation of the hidden units after learning. The algorithm finds a minimal solution (2 hidden units, \"minimum logic\") to solve the XOR problem when the energy is being minimized. This minimal solution is actually found whatever the starting number of hidden units. If several layers are used, the algorithm finds an optimal or nearly-optimal size for each layer. \n\nFigure 1. Energy and error curves as a function of the number of pattern presentations for different values of the \"energy\" rate (0, .1, .2, .4). Each energy curve (\"e\" label) is associated with an error curve (\"+\" label). During learning, the units \"spend\" some energy to solve the given task. \n\nWith parity 3, for a [-1, +1] activation range of the sigmoid function, the algorithm does not find the 2 hidden units optimal solution but has no problem finding a 3 hidden units solution, independently of the starting architecture. 
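The energy-related quantities used in these tests can be illustrated with a short Python sketch. It is not the simulator used for the experiments. It assumes the energy functions of Table 1, whose derivatives with respect to the squared activation all take the form 1 / (1 + o^2)^n, and a logistic unit with outputs in (0, 1) rather than [-1, +1]. The function names (energy, energy_prime, logistic, delta_energy) are hypothetical and chosen for illustration only.

```python
import math


def energy(o, n):
    # Energy functions of Table 1, written in terms of the activation o.
    # Only n = 0, 1, 2 are tabulated; other values are rejected here.
    s = o * o
    if n == 0:
        return s
    if n == 1:
        return math.log(1.0 + s)
    if n == 2:
        return s / (1.0 + s)
    raise ValueError('Table 1 gives no closed form for this n')


def energy_prime(o, n):
    # Derivative of e with respect to the squared activation o**2.
    return 1.0 / (1.0 + o * o) ** n


def logistic(net):
    # Standard logistic function with range (0, 1) (an assumption of this sketch).
    return 1.0 / (1.0 + math.exp(-net))


def delta_energy(net, n):
    # Delta energy of Equation [6]: 2 * e'(o^2) * o * f'(net),
    # where f'(net) = o * (1 - o) for the logistic function.
    o = logistic(net)
    f_prime = o * (1.0 - o)
    return 2.0 * energy_prime(o, n) * o * f_prime
```

For example, delta_energy(0.0, 0) evaluates to 0.25, since o = 0.5, f'(net) = 0.25 and e' = 1 in the linear case.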
\n\nSYMMETRY \n\nThe algorithm was tested with the symmetry problem, described in Rumelhart et al. The minimal solution for this task uses 2 hidden units. The simplest form of the algorithm does not actually find this minimal solution because some weights from the hidden units to the output unit can actually grow enough to compensate for the low activations of the hidden units. However, a simple weight decay can prevent these weights from growing too much and allows the network to find the minimal solution. In this case, the total cost function being minimized simply becomes: \n\n$$C = \\mu_{er} \\sum_{j}^{P} \\sum_{i}^{O} (d_{ij} - o_{ij})^2 + \\mu_{en} \\sum_{j}^{P} \\sum_{i}^{H} e(o_{ij}^2) + \\mu_{w} \\sum_{ij}^{W} w_{ij}^2 \\qquad [11]$$ \n\nFigure 2. Hidden unit activations of a 4 hidden unit network over the 4 XOR patterns when (left) standard back-propagation and (right) energy minimization are being used during learning. The network is reduced to minimal size (2 hidden units) when the energy is being minimized. \n\nPHONETIC LABELING \n\nThe algorithm was tested with a phonetic labeling task. The input patterns consisted of spectrograms (single speaker, 10x3.2ms spaced time frames, centered, 16 frequencies) corresponding to 9 syllables [ba], [da], [ga], [bi], [di], [gi], and [bu], [du], [gu]. The task of the network was to classify these spectrograms (7 tokens per syllable) into three categories corresponding to the three consonants [b], [d], and [g]. Starting with 12 hidden units, the algorithm reduced the network to 3 hidden units. 
(A hidden unit is considered unused when its activation over the entire range of patterns contributes very little to the activations of the output units.) With standard back-propagation, all of the 12 hidden units are usually being used. The resulting network is consistent with the sizes of the hidden layers used by Elman and Zipser (1987) for similar tasks. \n\nEXTENSION OF THE ALGORITHM \n\nEquation [2] represents a constraint over the set of possible LMS solutions found by the back-propagation algorithm. With such a constraint, the \"zero-energy\" level of the hidden units can be (informally) considered as an attractor in the solution space. However, by changing the sign of the energy gradient, such a point now constitutes a repellor in this space. Having such repellors might be useful when a set of activation values are to be avoided during learning. For example, if the activation range of the sigmoid transfer function is [-1, +1], the learning speed of the back-propagation algorithm can be greatly improved but the [0, 0] unit activation point (zero-input, zero-output) often behaves as a local minimum. By inverting the sign of the energy gradient during early learning, it is possible to have the point [0, 0] act as a repellor, forcing the network to make \"maximal use\" of its resources (hidden units). This principle was tested on the parity-3 problem with a network of 7 hidden units. For a given set of coefficients, standard back-propagation can solve parity-3 in about 15 cycles but yields about 65% of local minima in [0, 0]. By using the \"repulsion\" constraint, parity-3 can be solved in about 20 cycles with 0% of local minima. 
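The effect of reversing the sign of the energy gradient can be illustrated on a single unit. The Python sketch below is only a toy illustration under stated assumptions, not the procedure used for the parity-3 experiments: it assumes a logistic function rescaled to the range (-1, +1), the linear energy e = o^2, a single weight w with a fixed input x, and no error term at all. The names activation, activation_prime, energy_step and run, and the rate value, are hypothetical.

```python
import math


def activation(net):
    # Logistic function rescaled to the range (-1, +1); the exact
    # rescaling is an assumption of this sketch.
    return 2.0 / (1.0 + math.exp(-net)) - 1.0


def activation_prime(o):
    # Derivative of the rescaled logistic, expressed through its output o.
    return 0.5 * (1.0 - o * o)


def energy_step(w, x, rate, repel):
    # One gradient step on the energy term alone, with e = o^2 so that
    # the delta energy is 2 * o * f'(net).
    o = activation(w * x)
    delta_en = 2.0 * o * activation_prime(o)
    # repel=True reverses the sign of the energy gradient, so the
    # zero-activation point is pushed away from instead of approached.
    sign = -1.0 if repel else 1.0
    return w - sign * rate * delta_en * x


def run(w, steps, repel, x=1.0, rate=0.1):
    # Apply energy_step repeatedly and return the final weight.
    for _ in range(steps):
        w = energy_step(w, x, rate, repel)
    return w
```

Starting from a small weight such as w = 0.1, run(0.1, 20, False) drives the weight (and hence the activation) toward zero, whereas run(0.1, 20, True) moves it away from zero.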
\n\nInterestingly, it is also possible to design a \"trajectory\" of such constraints during learning. For example, the [0, 0] activation point can be built as a repellor during early learning in order to avoid the corresponding local minimum, then as an attractor during middle learning to reduce the size of the hidden layer, and as a repellor during late learning, to force the hidden units to have binary activations. This type of trajectory was tested on the parity-3 problem with 7 hidden units. In this case, the algorithm always avoids the [0, 0] local minimum. Moreover, the network can be reduced to 3 or 4 hidden units taking binary values over the set of input patterns. In contrast, standard back-propagation often gets stuck in local minima and uses the initial 7 hidden units with analog activation values. \n\nCONCLUSION \n\nThe present algorithm simply imposes a constraint over the LMS solution space. It can be argued that limiting such a solution space can in some cases increase the generalizing properties of the network (curve-fitting analogy). Although a complete theory of generalization has yet to be formalized, the present algorithm presents a step toward the automatic design of \"minimal\" networks by imposing constraints on the activations of the hidden units. (Similar constraints on weights can be imposed and have been tested with success by D. E. Rumelhart, Personal Communication. Combinations of constraints on weights and activations are being tested.) What is simply shown here is that this energy minimization principle is easy to implement, is robust to a broad range of parameter values, can find minimal or nearly optimal network sizes when tested with a variety of tasks and can be used to \"bend\" trajectories of activations during learning. 
\n\nAcknowledgments \n\nThis research was conducted at Thomson-CSF, Inc. in Palo Alto. I would like to thank the Thomson neural net team for useful discussions. Dave Rumelhart and the PDP team at Stanford University were also very helpful. I am especially grateful to Yoshiro Miyata, from Bellcore, for letting me use his neural net simulator (SunNet) and to Jeff Elman, from UCSD, for letting me use the speech data that he collected. \n\nReferences \n\nJ. L. Elman & D. Zipser. Learning the hidden structure of speech. ICS Technical Report 8701. Institute for Cognitive Science. University of California, San Diego (1987). \n\nD. E. Rumelhart, G. E. Hinton & R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1. Cambridge, MA: MIT Press/Bradford Books (1986). \n\n", "award": [], "sourceid": 133, "authors": [{"given_name": "Yves", "family_name": "Chauvin", "institution": null}]}