{"title": "The CHIR Algorithm for Feed Forward Networks with Binary Weights", "book": "Advances in Neural Information Processing Systems", "page_first": 516, "page_last": 523, "abstract": null, "full_text": "516 \n\nGrossman \n\nThe CHIR Algorithm for  Feed Forward \n\nNetworks with Binary Weights \n\nTal Grossman \n\nDepartment of Electronics \n\nWeizmann Institute of Science \n\nRehovot 76100 Israel \n\nABSTRACT \n\nA new learning algorithm, Learning by Choice of Internal Rep(cid:173)\nresetations  (CHIR), was  recently introduced.  Whereas many algo(cid:173)\nrithms  reduce  the  learning  process  to  minimizing a  cost  function \nover the  weights, our method treats the  internal representations as \nthe fundamental entities to  be determined.  The algorithm applies \na  search  procedure  in the  space  of internal representations,  and  a \ncooperative adaptation of the weights (e.g.  by using the perceptron \nlearning rule).  Since the introduction of its basic, single output ver(cid:173)\nsion, the CHIR algorithm was generalized to train any feed  forward \nnetwork of binary neurons.  Here we present the generalised version \nof the  CHIR algorithm,  and further  demonstrate  its  versatility by \ndescribing  how it can  be  modified  in order  to  train networks with \nbinary  (\u00b11)  weights.  Preliminary  tests  of this  binary  version  on \nthe random teacher  problem are  also  reported. \n\nI.  INTRODUCTION \n\nLearning by Choice oflnternal Representations (CHIR) was recently introduced \n\n[1,11]  as a  training method for  feed  forward  networks of binary  units. \n\nInternal  Representations  are  defined  as  the  states  taken  by  the  hidden  units \nof a  network when  patterns (e.g.  from the training set)  are  presented  to the input \nlayer of the network.  The CHIR algorithm views the internal representations associ(cid:173)\nated with various inputs as the basic independent  variables of the learning process. \nOnce such representations are formed,  the weights can be found by simple and local \nlearning  procedures  such  as  the  Percept ron  Learning  Rule  (PLR)  [2].  Hence  the \nproblem of learning  becomes  one  of searching  for  proper internal  representations, \n\n\fThe CHIR Algorithm for Feed Forward Networks with Binary Weights \n\n517 \n\nrather  than of minimizing a  cost  function  by varying the  values  of weights, which \nis  the  approach used  by  back propagation  (see,  however  [3],[4]  where  \"back  prop(cid:173)\nagation  of desired  states\"  is  described).  This  basic  idea,  of viewing  the  internal \nrepresentations as  the fundamental entities, has been used since by other groups [5-\n7].  Some of these works,  and the main differences  between them and our approach, \nare briefly disscussed  in [11].  One important difference  is  that the CHIR algorithm, \nas well as another similar algorithm, the MRII [8],  try to solve the learning problem \nfor  a  fixed  architecture,  and are not guaranteed to converge.  Two other algorithms \n[5,6]  always find  a  solution,  but  at  the  price of increasing  the  network size  during \nlearning in a manner that resembles similar algorithms developed earlier [9,10].  An(cid:173)\nother approach [7]  is to use  an error minimizing algorithm which treat~ the internal \nrepresentations  as  well as  the weights as  the  relevant variables of the search space. 
To be more specific, consider first the single layer perceptron with its Perceptron Learning Rule (PLR) [2]. This simple network consists of N input (source) units j, and a single target unit i. This unit is a binary linear threshold unit, i.e. when the source units are set in any one of the μ = 1, ..., M patterns, i.e. S_j = ξ_j^μ, the state of unit i, S_i = ±1, is determined according to the rule

    S_i = sign( Σ_j W_ij S_j + θ_i ) .   (1)

Here W_ij is the (unidirectional) weight assigned to the connection from unit j to unit i; θ_i is a local bias. For each of the M input patterns, we require that the target unit (determined using (1)) take a preassigned value ξ_i^μ. Learning takes place in the course of a training session. Starting from any arbitrary initial guess for the weights, an input ν is presented, resulting in the output taking some value S_i^ν. Now modify every weight according to the rule

    ΔW_ij = η ( ξ_i^ν − S_i^ν ) ξ_j^ν ,   (2)

where η > 0 is a step size parameter (ξ_j = 1 is used to modify the bias θ). Another input pattern is presented, and so on, until all inputs draw the correct output. The perceptron convergence theorem states [2] that the PLR will find a solution (if one exists) in a finite number of steps. Nevertheless, one needs, for each unit, both the desired input and output states in order to apply the PLR.

Consider now a two layer perceptron, with N input, H hidden and K output units (see Fig. 1). The elements of the network are binary linear threshold units i, whose states S_i = ±1 are determined according to (1). In a typical task for such a network, M specified output patterns, S_i^{out,μ} = ξ_i^{out,μ}, are required in response to the μ = 1, ..., M input patterns. If a solution is found, it first maps each input onto an internal representation generated on the hidden layer, which, in turn, produces the correct output. Now imagine that we are not supplied with the weights that solve the problem; however, the correct internal representations are revealed. That is, we are given a table with M rows, one for each input. Every row has H bits ξ_i^{h,μ}, for i = 1, ..., H, specifying the state of the hidden layer obtained in response to input pattern μ. One can now view each hidden-layer cell i as the target of the PLR, with the N inputs viewed as source. Given sufficient time, the PLR will converge to a set of weights W_ij, connecting input unit j to hidden unit i, so that indeed the input-hidden association that appears in column i of our table will be realized. In order to obtain the correct output, we apply the PLR in a learning process that uses the hidden layer as source and each output unit as a target, so as to realize the correct output. In general, however, one is not supplied with a correct table of internal representations. Finding such a table is the goal of our approach.

Figure 1. A typical three layered feed forward network (two layered perceptron) with N input, H hidden and K output units. The unidirectional weight W_ij connects unit j to unit i. A layer index is implicitly included in each unit's index.
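As an illustration, the following minimal sketch implements Eqs. (1) and (2) for a single target unit. It is a sketch only; the function names and array conventions are ours and do not come from the original paper:

    import numpy as np

    def unit_state(w, theta, s):
        # Eq. (1): S_i = sign(sum_j W_ij S_j + theta_i), with sign(0) taken as +1.
        return 1 if np.dot(w, s) + theta >= 0 else -1

    def plr_train(w, theta, patterns, targets, eta=0.5, max_sweeps=1000):
        # Repeated sweeps of the PLR update, Eq. (2), until every pattern
        # gives the correct output (guaranteed if a solution exists [2]).
        # w: float weight vector; patterns and targets take values +-1.
        for _ in range(max_sweeps):
            errors = 0
            for xi, target in zip(patterns, targets):
                s = unit_state(w, theta, xi)
                if s != target:
                    errors += 1
                    # Eq. (2): Delta W_ij = eta (xi_i - S_i) xi_j.
                    w = w + eta * (target - s) * xi
                    theta = theta + eta * (target - s)   # bias: xi_j = 1
            if errors == 0:
                return w, theta, True
        return w, theta, False

Note that for ±1 units the choice η = 1/2 gives the fixed increment ΔW_ij = ±1, so weights initialized to integers stay integer valued; this is the variant used later in the paper.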
During learning, the CHIR algorithm alternates between two phases: in one it generates the internal representations, and in the other it uses the updated representations in order to search for weights, using some single layer learning rule. This general scheme describes a large family of possible algorithms that use different ways to change the internal representations and update the weights.

A simple algorithm based on this general scheme was introduced recently [1,11]. In section II we describe the multiple output version of CHIR [11]. In section III we present a way to modify the algorithm so it can train networks with binary weights, and the preliminary results of a few tests done on this new version. In the last section we briefly discuss our results and describe some future directions.

II. THE CHIR ALGORITHM

The CHIR algorithm that we describe here implements the basic idea of learning by choice of internal representations by breaking the learning process into four distinct procedures that are repeated in a cyclic order:

1. SETINREP: Generate a table of internal representations {ξ_i^{h,μ}} by presenting each input pattern from the training set and recording the states of the hidden units, using Eq. (1), with the existing couplings W_ij and θ_i.

2. LEARN23: The current table of internal representations is used as the training set, the hidden layer cells are used as source, and each output as the target unit of the PLR. If weights W_ij and θ_i that produce the desired outputs are found, the problem has been solved. Otherwise stop after I23 learning sweeps, and keep the current weights, to use in CHANGE INREP.

3. CHANGE INREP: Generate a new table of internal representations which reduces the error in the output: we present the table sequentially, row by row (pattern by pattern), to the hidden layer. If for pattern ν the wrong output is obtained, the internal representation ξ^{h,ν} is changed.

This is done simply by choosing (at random) a hidden unit i, and checking the effect of flipping the sign of ξ_i^{h,ν} on the total output error, i.e. the number of wrong bits. If the output error is not increased, the flip is accepted and the table of internal representations is changed accordingly. Otherwise the flip is rejected and we try another unit. When we have more than one output unit, it might happen that an error in one output unit cannot be corrected without introducing an error in another unit. Therefore we allow only a pre-specified number of attempted flips, Iin, and go on to the next pattern even if the output error was not eliminated completely. This procedure ends with a modified, "improved" table which is our next guess of internal representations. Note that this new table does not necessarily yield a totally correct output for all the patterns. In such a case, the learning process will go on even if this new table is perfectly realized by the next stage - LEARN12.

4. LEARN12: Present an input pattern; if the output is wrong, apply the PLR with the first layer serving as source, treating every hidden layer site separately as target.
If input ν does yield the correct output, we insert the current state of the hidden layer as the internal representation associated with pattern ν, and no learning steps are taken. We sweep in this manner through the training set, modifying weights W_ij (between input and hidden layer), hidden-layer thresholds θ_i, and, as explained above, internal representations. If the network has achieved error-free performance for the entire training set, learning is completed. Otherwise, after I12 training sweeps (or if the current internal representation is perfectly realized), abort the PLR stage, keeping the present values of W_ij, θ_i, and start SETINREP again.

The idea in trying to learn the current internal representation even if it does not yield the perfect output is that it can serve as a better input for the next LEARN23 stage. That way, in each learning cycle the algorithm tries to improve the overall performance of the network.

This algorithm can be further generalized for multi-layered feed forward networks by applying the CHANGE INREP and LEARN12 procedures to each of the hidden layers, one by one, from the last to the first hidden layer.

There are a few details that need to be added.

a) The "impatience" parameters: I12 and I23, which are rather arbitrary, are introduced to guarantee that the PLR stage is aborted if no solution is found, but they have to be large enough to allow the PLR to find a solution (if one exists) with sufficiently high probability. Similar considerations are valid for the Iin parameter, the number of flip attempts allowed in the CHANGE INREP procedure. If this number is too small, the updated internal representations may not improve. If it is too large, the new internal representations might be too different from the previous ones, and therefore hard to learn.

The optimal values depend, in general, on the problem and the network size. Our experience indicates, however, that once a "reasonable" range of values is found, performance is fairly insensitive to the precise choice. In addition, a simple rule of thumb can always be applied: "Whenever learning is getting hard, increase the parameters". A detailed study of this issue is reported in [11].

b) The internal representations updating scheme: The CHANGE INREP procedure that is presented here (and studied in [11]) is probably the simplest and "most primitive" way to update the InRep table. The choice of the hidden units to be flipped is completely blind and relies only on the single bit of information about the improvement of the total output error. It may even happen that no change in the internal representation is made, although such a change is needed. This procedure can certainly be made more efficient, e.g. by probing the fields induced on all the hidden units to be flipped and then choosing one (or more) of them by applying a "minimal disturbance" principle as in [8]. Nevertheless it was shown [11] that even this simple algorithm works quite well.

c) The weights updating schemes: In our experiments we have used the simple PLR with a fixed increment (η = 1/2, ΔW_ij = ±1) for weight learning. It has the advantage of allowing the use of discrete (or integer) weights. Nevertheless, it is just a component that can be replaced by other, perhaps more sophisticated methods, in order to achieve, for example, better stability [12], or to take into account various constraints on the weights, e.g. binary weights [13]. In the following section we demonstrate how this can be done; first, a schematic sketch of one complete CHIR cycle is given below.
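The following sketch makes the cyclic structure of the four procedures concrete for a single hidden layer network. It is a schematic illustration under our own conventions (hypothetical function names, float weight arrays, and a simplified LEARN12 that omits adopting the current hidden state when an input already yields the correct output), not the code used in the experiments:

    import numpy as np

    rng = np.random.default_rng(0)

    def forward(W, theta, s):
        # Eq. (1) applied to a whole layer: +-1 states of the target layer
        # given the source layer state vector s.
        return np.where(W @ s + theta >= 0, 1, -1)

    def plr_layer_sweep(W, theta, sources, desired, eta=0.5):
        # One sweep of Eq. (2), treating every unit of the target layer
        # separately; returns the number of wrong bits encountered.
        errors = 0
        for s, t in zip(sources, desired):
            out = forward(W, theta, s)
            wrong = out != t
            errors += int(wrong.sum())
            if wrong.any():
                delta = eta * (t[wrong] - out[wrong])   # = +-1 for eta = 1/2
                W[wrong] += np.outer(delta, s)
                theta[wrong] += delta
        return errors

    def chir_cycle(W1, th1, W2, th2, inputs, targets, I12, I23, Iin):
        # One cycle of the four procedures for an N:H:K network.
        # W1: H x N, W2: K x H float arrays; inputs: M x N; targets: M x K.

        # 1. SETINREP: hidden states produced by the current weights.
        inrep = np.array([forward(W1, th1, x) for x in inputs])

        # 2. LEARN23: hidden layer as source, output units as PLR targets.
        for _ in range(I23):
            if plr_layer_sweep(W2, th2, inrep, targets) == 0:
                return True                    # problem solved

        # 3. CHANGE INREP: accept random hidden-bit flips that do not
        #    increase the number of wrong output bits.
        for mu in range(len(inputs)):
            for _ in range(Iin):
                err = int((forward(W2, th2, inrep[mu]) != targets[mu]).sum())
                if err == 0:
                    break
                i = rng.integers(inrep.shape[1])       # random hidden unit
                inrep[mu, i] *= -1                     # tentative flip
                new_err = int((forward(W2, th2, inrep[mu]) != targets[mu]).sum())
                if new_err > err:
                    inrep[mu, i] *= -1                 # reject the flip

        # 4. LEARN12: input layer as source, the new table as PLR targets.
        for _ in range(I12):
            if plr_layer_sweep(W1, th1, inputs, inrep) == 0:
                break
        return False

A full training run would simply repeat chir_cycle up to Imax times, declaring success as soon as LEARN23 realizes the current table.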
III. THE CHIR ALGORITHM FOR BINARY WEIGHTS

In this section we describe how the CHIR algorithm can be used in order to train feed forward networks with binary weights. According to this strong constraint, all the weights in the system (including the thresholds) can be either +1 or -1. The way to do it within the CHIR framework is simple: instead of applying the PLR (or any other single layer, real weights algorithm) for the updating of the weights, we can use a binary perceptron learning rule. Several ways to solve the learning problem in the binary weight perceptron were suggested recently [13]. The one that we used in the experiments reported here is a modified version of the directed drift algorithm introduced by Venkatesh [13]. Like the standard PLR, the directed drift algorithm works on-line, namely, the patterns are presented one by one, the state of a unit i is calculated according to (1), and whenever an error occurs the incoming weights are updated. When there is an error it means that

    ξ_i^ν h_i^ν < 0 .

Namely, the field h_i^ν = Σ_j W_ij ξ_j^ν (induced by the current pattern ξ^ν) is "wrong". If so, there must be some weights that pull it in the wrong direction. These are the weights for which

    ξ_i^ν W_ij ξ_j^ν < 0 .

Here ξ_i^ν is the desired output of unit i for pattern ν. The updating of the weights is done simply by flipping (i.e. W_ij → −W_ij) at random k of these weights.

The number of weights to be changed in each learning step, k, can be a pre-fixed parameter of the algorithm, or, as suggested by Venkatesh, can be decreased gradually during the learning process in a way similar to a cooling schedule (as in simulated annealing). What we do is to take k = |h|/2 + 1, making sure, like in relaxation algorithms, that just enough weights are flipped in order to obtain the desired target for the current pattern. This simple and local rule is now "plugged" into the LEARN12 and LEARN23 procedures instead of (2), and the initial weights are chosen to be +1 or -1 at random.

We tested the binary version of CHIR on the "random teacher" problem. In this problem a "teacher network" is created by choosing a random set of +1/-1 weights for the given architecture. The training set is then created by presenting M input patterns to the network and recording the resulting output as the desired output patterns. In what follows we took M = 2^N (exhaustive learning), and an N:N:1 architecture.

The "time" parameter that we use for measuring performance is the number of sweeps through the training set of M patterns ("epochs") needed in order to find the solution, namely, how many times each pattern was presented to the network.
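Before turning to the results, here is a sketch of the modified directed drift update and of the construction of a random teacher training set. It is an illustration under our own assumptions (hypothetical function names; the unit bias is omitted from the update for brevity, since it can be treated as an extra weight with a clamped +1 input, as in Eq. (2)):

    import numpy as np

    rng = np.random.default_rng(0)

    def directed_drift_update(w, xi, target):
        # If the field h = sum_j w_j xi_j gives the wrong output, flip
        # k = |h|/2 + 1 of the weights that pull it the wrong way (those
        # with target * w_j * xi_j < 0), chosen at random.
        h = int(w @ xi)
        if (1 if h >= 0 else -1) == target:
            return                              # output already correct
        bad = np.flatnonzero(target * w * xi < 0)
        k = abs(h) // 2 + 1                     # just enough flips to fix the sign
        flip = rng.choice(bad, size=min(k, bad.size), replace=False)
        w[flip] *= -1

    def random_teacher_set(N):
        # A random +-1 teacher network with N:N:1 architecture and the
        # exhaustive M = 2**N training set; thresholds are +-1 as well.
        W1 = rng.choice([-1, 1], size=(N, N))
        th1 = rng.choice([-1, 1], size=N)
        w2 = rng.choice([-1, 1], size=N)
        th2 = rng.choice([-1, 1])
        inputs = np.array([[1 if (p >> j) & 1 else -1 for j in range(N)]
                           for p in range(2 ** N)])
        hidden = np.where(inputs @ W1.T + th1 >= 0, 1, -1)
        targets = np.where(hidden @ w2 + th2 >= 0, 1, -1)
        return inputs, targets

In the binary version, an update of this kind replaces the Eq. (2) step inside LEARN12 and LEARN23, with all weights initialized at random to +1 or -1.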
In the experiments presented here, all possible input patterns were presented sequentially in a fixed order (within the perceptron learning sweeps). Therefore in each cycle of the algorithm there are I12 + I23 + 1 such sweeps. Note that according to our definition, a single sweep involves the updating of only one layer of weights or internal representations. For each network size, N, we created an ensemble of 50 independent runs, with different random teachers and starting with a different random choice of initial weights.

We calculate, as a performance measure, the following quantities:

a. The median number of sweeps, t_m.

b. The "inverse average rate", T, as defined by Tesauro and Janssen in [14].

c. The success rate, S, i.e. the fraction of runs in which the algorithm finds a solution in less than the maximal number of training cycles Imax specified.

The results, with the typical parameters, for N = 3, 4, 5, 6, are given in Table 1.

Table 1. The random teacher problem with N:N:1 architecture.

    N    I12   I23   Iin   Imax    t_m      T      S
    3    20    10    5     20      14       9      1.00
    4    25    10    7     60      87       37     1.00
    5    40    15    9     300     430      60     1.00
    6    70    40    11    900     15000    1100   0.71

As mentioned before, these are only preliminary results. No attempt was made to optimize the learning parameters.

IV. DISCUSSION

We presented a generalized version of the CHIR algorithm that is capable of training networks with multiple outputs and hidden layers. A way to modify the basic algorithm so it can be applied to networks with binary weights was also explained and tested. The potential importance of such networks, e.g. in hardware implementation, makes this modified version particularly interesting.

An appealing feature of the CHIR algorithm is the fact that it does not use any kind of "global control" that manipulates the internal representations (as is used, for example, in [5,6]). The mechanism by which the internal representations are changed is local in the sense that the change is done for each unit and each pattern without conveying any information from other units or patterns (representations). Moreover, the feedback from the "teacher" to the system is only a single bit quantity, namely, whether the output is getting worse or not (in contrast to BP, for example, where one informs each and every output unit about its individual error).

Other advantages of our algorithm are the simplicity of the calculations, the need for only integer, or even binary, weights and binary units, and the good performance. It should be mentioned again that the CHIR training sweep involves much less computation than that of back-propagation. The price is the extra memory of MH bits that is needed during the learning process in order to store the internal representations of all M training patterns. This feature is biologically implausible and may be practically limiting. We are developing a method that does not require such memory. The learning method that is currently studied for that purpose [15] is related to the MRII rule, which was recently presented by Widrow and Winter in [8].
It seems that further research will be needed in order to study the practical differences and the relative advantages of the CHIR and the MRII algorithms.

Acknowledgements: I am grateful to Prof. Eytan Domany for many useful suggestions and comments. This research was partially supported by a grant from Minerva.

References

[1] Grossman T., Meir R. and Domany E., Complex Systems 2, 555 (1989). See also in D. Touretzky (ed.), Advances in Neural Information Processing Systems 1 (Morgan Kaufmann, San Mateo, 1989).
[2] Minsky M. and Papert S., Perceptrons (MIT, Cambridge, 1988); Rosenblatt F., Principles of Neurodynamics (Spartan, New York, 1962).
[3] Plaut D.C., Nowlan S.J. and Hinton G.E., Tech. Report CMU-CS-86-126, Carnegie-Mellon University (1986).
[4] Le Cun Y., Proc. Cognitiva 85, 593 (1985).
[5] Rujan P. and Marchand M., in the Proc. of the First International Joint Conference on Neural Networks - Washington D.C. 1989, Vol. II, pp. 105, and to appear in Complex Systems.
[6] Mezard M. and Nadal J.P., J. Phys. A 22, 2191 (1989).
[7] Krogh A., Thorbergsson G.I. and Hertz J.A., in these Proceedings; Rohwer R., to appear in the Proc. of DANIP, GMD Bonn, April 1989, J. Kinderman and A. Linden (eds.); Saad D. and Merom E., preprint (1989).
[8] Widrow B. and Winter R., Computer 21, No. 3, 25 (1988).
[9] See e.g. Cameron S.H., IEEE TEC EC-13, 299 (1964); Hopcroft J.E. and Mattson R.L., IEEE TEC EC-14, 552 (1965).
[10] Honavar V. and Uhr L., in the Proc. of the 1988 Connectionist Models Summer School, Touretzky D., Hinton G. and Sejnowski T. (eds.) (Morgan Kaufmann, San Mateo, 1988).
[11] Grossman T., to be published in Complex Systems (1990).
[12] Krauth W. and Mezard M., J. Phys. A 20, L745 (1988).
[13] Venkatesh S., preprint (1989); Amaldi E. and Nicolis S., J. Phys. France 50, 2333 (1989); Kohler H., Diederich S., Kinzel W. and Opper M., preprint (1989).
[14] Tesauro G. and Janssen H., Complex Systems 2, 39 (1988).
[15] Nabutovski D., unpublished.
", "award": [], "sourceid": 205, "authors": [{"given_name": "Tal", "family_name": "Grossman", "institution": null}]}