{"title": "Stochastic Learning Networks and their Electronic Implementation", "book": "Neural Information Processing Systems", "page_first": 9, "page_last": 21, "abstract": null, "full_text": "9 \n\nStochastic Learning Networks and their Electronic Implementation \n\nJoshua Alspector*. Robert B. Allen. Victor Hut. and Srinagesh Satyanarayanat \n\nBell Communications Research. Morristown. NJ  01960 \n\nWe describe a family  of learning algorithms that operate on a recurrent, symmetrically \nconnected.  neuromorphic  network  that.  like  the  Boltzmann  machine,  settles  in  the \npresence of noise.  These networks learn by modifying synaptic connection strengths on \nthe  basis  of correlations  seen  locally  by  each  synapse.  We describe  a  version  of the \nsupervised learning algorithm for a network  with  analog activation functions.  We also \ndemonstrate  unsupervised  competitive  learning  with  this  approach.  where  weight \nsaturation  and  decay  play  an  important  role.  and  describe  preliminary  experiments  in \nreinforcement  learning.  where noise  is used  in  the  search  procedure.  We  identify  the \nabove described phenomena as elements that can unify learning techniques at a physical \nmicroscopic level. \nThese algorithms were chosen for ease of implementation in vlsi.  We have designed a \nCMOS  test  chip  in  2 micron rules  that  can  speed  up  the  learning  about  a  millionfold \nover an equivalent simulation on a VAX lln80.  The speedup is due to parallel analog \ncomputation  for  snmming  and  multiplying  weights  and  activations.  and  the  use  of \nphysical processes for generating random noise.  The components of the  test chip are a \nnoise amplifier. a neuron amplifier. and a 300 transistor adaptive synapse. each of which \nis  separately  testable.  These  components  are  also  integrated  into  a  6  neuron  and  15 \nsynapse  network.  Finally.  we  point  out  techniques  for  reducing  the  area  of  the \nelectronic  correlational  synapse  both  in  technology  and  design  and  show  how  the \nalgorithms we study can be implemented naturally in electronic systems. \n\n1.  INTRODUCTION \n\nIbere has been significant progress. in recent years. in modeling  brain function  as  the collective \nbehavior of highly  interconnected networks of simple model neurons.  This paper focuses  on the \nissue of learning in these networks especially with regard to their implementation in an electronic \nsystem.  Learning  phenomena that have  been  studied  include  associative memoryllJ.  supervised \nleaming by error correction(2) and by stochastic search(3). competitive learning(4) lS)  reinforcement \nleamingI6).  and  other  forms  of unsupervised  leaming(7).  From  the  point  of  view  of  neural \nplausibility  as  well  as  electronic  implementation.  we  particularly  like  learning  algorithms  that \nchange  synaptic  connection  strengths  asynchronously  and  are  based  only  on  information \navailable locally at the synapse.  This is illustrated in Fig.  1. where a model synapse uses only the \ncorrelations  of the  neurons  it  connects  and  perhaps  some  weak  global  evaluation  signal  not \nspecific to individual neurons to decide how to adjust its conductance. \n\n\u2022  Address for correspondence: J. 
We believe that a stochastic search procedure is most compatible with this viewpoint. Statistical procedures based on noise form the communication pathways by which global optimization can take place based only on the interaction of neurons. Search is a necessary part of any learning procedure as the network attempts to find a connection strength matrix that solves a particular problem. Some learning procedures attack the search directly by gradient following through error correction[8][9], but electronic implementation requires specifying which neurons are input, hidden, and output in advance and necessitates global control of the error correction[2] procedure in a way that requires specific connectivity and synchrony at the neural level. There is also the question of how such procedures would work with unsupervised methods and whether they might get stuck in local minima. Stochastic processes can also do gradient following but they are better at avoiding minima, are compatible with asynchronous updates and local weight adjustments, and, as we show in this paper, can generalize well to less supervised learning.

The phenomena we studied are 1) analog activation, 2) noise, 3) semi-local Hebbian synaptic modification, and 4) weight decay and saturation. These techniques were applied to problems in supervised, unsupervised, and reinforcement learning. The goal of the study was to see if these diverse learning styles can be unified at the microscopic level with a small set of physically plausible and electronically implementable phenomena. The hope is to point the way for powerful electronic learning systems in the future by elucidating the conditions and the types of circuits that may be necessary. It may also be true that the conditions for electronic learning may have some bearing on the general principles of biological learning.

2. LOCAL LEARNING AND STOCHASTIC SEARCH

2.1 Supervised Learning in Recurrent Networks with Analog Activations

We have previously shown[10] how the supervised learning procedure of the Boltzmann machine[3] can be implemented in an electronic system. This system works on a recurrent, symmetrically connected network which can be characterized as settling to a minimum in its Liapunov function[1][11]. While this architecture may stretch our criterion of neural plausibility, it does provide for stability and analyzability. The feedback connectivity provides a way for a supervised learning procedure to propagate information back through the network as the stochastic search proceeds.
More plausible would be a randomly connected network where symmetry is a statistical approximation and inhibition damps oscillations, but symmetry is more efficient and well matched to our choice of learning rule and search procedure.

We have extended our electronic model of the Boltzmann machine to include analog activations. Fig. 2 shows the model of the neuron we used and its tanh or sigmoid transfer function. The net input consists of the usual weighted sum of activations from other neurons but, in the case of Boltzmann machine learning, these are added to a noise signal chosen from a variety of distributions so that the neuron performs the physical computation:

activation = f(net_i) = f( Σ_j w_ij s_j + noise ) ≈ tanh(gain · net_i)

Instead of counting the number of on-on and off-off cooccurrences of neurons which a synapse connects, the correlation rule now defines the value of a cooccurrence as:

C_ij = f_i · f_j

where f_i is the activation of neuron i, which is a real value from -1 to 1. Note that this rule effectively counts both on-on and off-off cooccurrences in the high gain limit. In this limit, for Gaussian noise, the cumulative probability distribution for the neuron to have activation +1 (on) is close to sigmoidal. The effect of noise "jitter" is illustrated at the bottom of the figure. The weight change rule is still:

if C_ij+ > C_ij- then increment w_ij, else decrement

where the plus phase clamps the output neurons in their desired states while the minus phase allows them to run free.

As mentioned, we have studied a variety of noise distributions other than those based on the Boltzmann distribution. The 2-2-1 XOR problem was selected as a test case since it has been shown[10] to be easily caught in local minima. The gain was manipulated in conditions with no noise or with noise sampled from one of three distributions. The Gaussian distribution is closest to true electronic thermal noise such as used in our implementation, but we also considered a cut-off uniform distribution and a Cauchy distribution with long noise tails for comparison. The inset to Fig. 3 shows a histogram of samples from the noise distributions used. The noise was multiplied by the temperature to "jitter" the transfer function. Hence, the jitter decreased as the annealing schedule proceeded.

[Fig. 2 shows the electronic analog neuron: the net input Σ_j w_ij s_j + noise (ΔV_in + ΔV_noise) drives a high-gain transfer function to produce V_out = f(Σ_j w_ij s_j + noise); the noise "jitters" the transfer function.]

Fig. 2. Electronic analog neuron.

Fig. 3 shows average performance across 100 runs for the last 100 patterns of 2000 training pattern presentations. It can be seen that reducing the gain from a sharp step can improve learning in a small region of gain, even without noise. There seems to be an optimal gain level. However, the addition of noise for any distribution can substantially improve learning at all levels of gain.

[Fig. 3 plots proportion correct (0.5 to 1.0) against inverse gain (10^-3 to 10) for the Gaussian, uniform, and Cauchy noise conditions and for the no-noise condition.]

Fig. 3. Proportion correct vs. inverse gain.
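For concreteness, a minimal simulation sketch of this settling and correlation procedure follows. It is our own illustration rather than the chip or the exact simulator used here: the sequential unit updates, the annealing schedule, and the fixed weight step are assumptions, and the surrounding training loop (clamping inputs in both phases, clamping outputs only in the plus phase, and averaging correlations over patterns) is omitted.

import numpy as np

def settle(W, s, clamped=(), gain=5.0, temperatures=(2.0, 1.0, 0.5, 0.1), sweeps=10, rng=None):
    # Anneal the recurrent network: every free unit repeatedly recomputes
    # s_i = tanh(gain * (sum_j W_ij s_j + noise)), with Gaussian noise
    # scaled by the (decreasing) temperature.
    rng = np.random.default_rng() if rng is None else rng
    for T in temperatures:
        for _ in range(sweeps):
            for i in range(len(s)):
                if i in clamped:
                    continue
                net = W[i] @ s + T * rng.normal()
                s[i] = np.tanh(gain * net)
    return s

def phase_correlations(s):
    # C_ij = f_i * f_j for the current settled state (one cooccurrence sample).
    return np.outer(s, s)

def weight_step(W, corr_plus, corr_minus, step=0.02):
    # Increment w_ij if C_ij+ > C_ij-, else decrement, by one fixed step.
    delta = step * np.sign(corr_plus - corr_minus)
    np.fill_diagonal(delta, 0.0)
    return W + delta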
2.2 Stochastic Competitive Learning

We have studied how competitive learning[4][5] can be accomplished with stochastic local units. After the presentation of the input pattern, the network is annealed and the weight is increased between the winning cluster unit and the input units which are on. As shown in Fig. 4, this approach was applied to the dipole problem of Rumelhart and Zipser. A 4x4 pixel array input layer connects to a 2 unit competitive layer with recurrent inhibitory connections that are not adjusted. The inhibitory connections provide the competition by means of a winner-take-all process as the network settles. The input patterns are dipoles: only two input units are turned on at each pattern presentation and they must be physically adjacent, either vertically or horizontally. In this way, the network learns about the connectedness of the space and eventually divides it into two equal spatial regions with each of the cluster units responding only to dipoles from one of the halves. Rumelhart and Zipser renormalized the weights after each pattern and picked the winning unit as the one with the highest activation. Instead of explicit normalization of the weights, we include a decay term proportional to the weight. The weights between the input layer and cluster layer are incremented for on-on correlations, but here there are no alternating phases so that even this gross synchrony is not necessary. Indeed, if small time constants are introduced to the weight updates, no external timing should be needed.

[Fig. 4 shows a 4x4 input layer fully connected to a 2-unit winner-take-all cluster layer.]

Fig. 4. Competitive learning network for the dipole problem.

Fig. 5 shows the results of several runs. A 1 at the position of an input unit means that unit 1 of the cluster layer has the larger weight leading to it from that position. A + between two units means the dipole from these two units excites unit 1. A 0 and - mean that unit 0 is the winner in the complementary case. Note that adjacent 1's should always have a + between them since both weights to unit 1 are stronger. If, however, there is a 1 next to a 0, then there is a tension in the dipole and a competition for dominance in the cluster layer. We define a figure of merit called "surface tension" which is the number of such dipoles in dispute. The smaller the number, the better. Note that in Runs A and B the number is reduced to 4, the minimum possible value, after 2000 pattern presentations. The space is divided vertically and horizontally, respectively. Run C has adopted a less favorable diagonal division with a surface tension of 6.
[Fig. 5 shows, in the notation described above, the 4x4 cluster-assignment maps for Runs A, B, and C after 0, 200, 800, 1400, and 2000 dipole pattern presentations.]

Fig. 5. Results of competitive learning runs on the dipole problem.

Table 1 shows the result of several competitive algorithms compared when averaged over 100 such runs. The deterministic algorithm of Rumelhart and Zipser gives an average surface tension of 4.6 while the stochastic procedure is almost as good. Note that noise is essential in helping the competitive layer settle. Without noise the surface tension is 9.8, showing that the winner-take-all procedure is not working properly.

Competitive learning algorithm                              "surface tension"
Stochastic net with decay
  - anneal: T = 3 -> 1.0                                     4.8
  - no anneal: 70 @ T = 1.0                                  9.8
Stochastic net with renormalization                          5.6
Deterministic, winner-take-all (Rumelhart & Zipser)          4.6

Table 1. Performance of competitive learning algorithms across 100 runs.

We also tried a procedure where, instead of decay, weights were renormalized. The model is that each neuron can support a maximum amount of weight leading into it. Biologically, this might be the area that other neurons can form synapses on, so that one synapse cannot increase its strength except at the expense of some of the others. Electronically, this can be implemented as current emanating from a fixed current source per neuron. As shown in Table 1, this works nearly as well as decay. Moreover, preliminary results show that renormalization is especially effective when more than two cluster units are employed.

Both of the stochastic algorithms, which can be implemented in an electronic synapse in nearly the same way as the supervised learning algorithm, divide the space just as the deterministic normalization procedure[4] does. This suggests that our chip can do both styles of learning, supervised if one includes both phases and unsupervised if only the procedure of the minus phase is used.
2.3 Reinforcement Learning

We have tried several approaches to reinforcement learning using the synaptic model of Fig. 1, where the evaluation signal is a scalar value available globally that represents how well the system performed on each trial. We applied this model to an xor problem with only one output unit. The reinforcement was r = 1 for the correct output and r = -1 otherwise. To the network, this was similar to supervised learning since, for a single unit, the output state is fully specified by a scalar value. A major difference, however, is that we do not clamp the output unit in the desired state in order to compare plus and minus phases. This feature of supervised learning has the effect of adjusting weights to follow a gradient to the desired state. In the reinforcement learning described here, there is no plus phase. This has a satisfying aspect in that no overall synchrony is necessary to compare phases, but it is also much slower at converging to a solution because the network has to search the solution space without the guidance of a teacher clamping the output units. This situation becomes much worse when there is more than one output unit. In that case, the probability of reinforcement goes down exponentially with the number of outputs. To test multiple outputs, we chose the simple replication problem whereby the output simply has to replicate the input. We chose the number of hidden units equal to the input (or output).

In the absence of a teacher to clamp the outputs, the network has to find the answer by chance, guided only by a "critic" which rates its effort as "better" or "worse". This means the units must somehow search the space. We use the same stochastic units as in the supervised or unsupervised techniques, but now it is important to have the noise or the annealing temperature set to a proper level. If it is too high, the reinforcement received is random rather than directed by the weights in the network. If it is too low, the available states searched become too small and the probability of finding the right solution decreases. We tuned our annealing schedule by looking at a volatility measure defined at each neuron, which is simply the fraction of the time the neuron activation is above zero. We then adjust the final anneal temperature so that this number is neither 0 nor 1 (noise too low) nor 0.5 (noise too high). We used both a fixed annealing schedule for all neurons and a unit-specific schedule where the noise was proportional to the sum of weight magnitudes into the unit. A characteristic of reinforcement learning is that the percent correct initially increases but then decreases and often oscillates widely. To avoid this, we added a factor of (1 - <r>) multiplying the final temperature. This helped to stabilize the learning.

In keeping with our simple model of the synapse, we chose a weight adjustment technique that consisted of correlating the states of the connected neurons with the global reinforcement signal. Each synapse measured the quantity R = r·s_i·s_j for each pattern presented. If R > 0, then w_ij is incremented, and it is decremented if R < 0. We later refined this procedure by insisting that the reinforcement be greater than a recent average, so that R = (r - <r>)·s_i·s_j.
This type of procedure appears in previous work in a number of forms.[12][13] For r = ±1 only, this "excess reinforcement" is the same as our previous algorithm but differs if we make a comparison between short term and long term averages or use a graded reinforcement such as the negative of the sum squared error. Following a suggestion by G. Hinton, we also investigated a more complex technique whereby each synapse must store a time average of three quantities: <r>, <s_i·s_j>, and <r·s_i·s_j>. The definition now is R = <r·s_i·s_j> - <r>·<s_i·s_j> and the rule is the same as before. Statistically, this is the same as "excess reinforcement" if the latter is averaged over trials. For the results reported below the values were collected across 10 pattern presentations. A variation, which employed a continuous moving average, gave similar results.

Table 2 summarizes the performance on the xor and the replication task of these reinforcement learning techniques. As the table shows, a variety of increasingly sophisticated weight adjustment rules were explored; nevertheless we were unable to obtain good results with the techniques described for more than 5 output units. In the third column, a small threshold had to be exceeded prior to weight adjustment. In the fourth column, unit-specific temperatures, dependent on the sum of weights, were employed. The last column in the table refers to frequency dependent learning where we trained on a single pattern until the network produced a correct answer and then moved on to another pattern. This final procedure is one of several possible techniques related to 'shaping' in operant learning theory in which difficult patterns are presented more often to the network.

network       t=1           time-averaged   +e=0.1         +T prop. to sum(W)   +freq
xor
  2-4-1       (0.60) 0.64   (0.70) 0.88     (0.76) 0.88    (0.92) 0.99          (0.98) 1.00
  2-2-1       (0.58) 0.57   (0.69) 0.74     (0.96) 1.00    (0.85) 1.00          (0.78) 0.88
replication
  2-2-2       (0.94) 0.94   (0.46) 0.46     (0.91) 0.97    (0.87) 0.99          (0.97) 1.00
  3-3-3       (0.15) 0.21   (0.31) 0.33     (0.31) 0.62    (0.37) 0.37          (0.97) 1.00
  4-4-4       -             -               -              -                    (0.75) 1.00
  5-5-5       -             -               -              -                    (0.13) 0.87
  6-6-6       -             -               -              -                    (0.02) 0.03

Table 2. Proportion correct performance of reinforcement learning after (2K) and 10K patterns.

Our experiments, while incomplete, hint that reinforcement learning can also be implemented by the same type of local-global synapse that characterizes the other learning paradigms. Noise is also necessary here for the random search procedure.
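The "excess reinforcement" rule itself is compact. The following sketch, with illustrative names and step sizes of our own choosing, shows one synapse's update and the moving average it compares against; it is an illustration of the rule above, not the implementation used in our experiments.

def excess_reinforcement_step(w_ij, s_i, s_j, r, r_avg, step=0.02):
    # R = (r - <r>) * s_i * s_j: step the weight toward correlations that
    # coincide with better-than-average reinforcement.
    R = (r - r_avg) * s_i * s_j
    if R > 0:
        return w_ij + step
    if R < 0:
        return w_ij - step
    return w_ij

def update_average(avg, value, rate=0.1):
    # Continuous moving average used for <r> (and, in the time-averaged
    # variant, for <s_i s_j> and <r s_i s_j> as well).
    return (1.0 - rate) * avg + rate * value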
2.4 Summary of Study of Fundamental Learning Parameters

In summary, we see that the use of noise and our model of a local correlational synapse with a non-specific global evaluation signal are two important features in all the learning paradigms. Graded activation is somewhat less important. Weight decay seems to be quite important, although saturation can substitute for it in unsupervised learning. Most interesting from our point of view is that all these phenomena are electronically implementable and therefore physically plausible. Hopefully this means they are also related to true neural phenomena and therefore provide a basis for unifying the various approaches of learning at a microscopic level.

3. ELECTRONIC IMPLEMENTATION

3.1 The Supervised Learning Chip

We have completed the design of the chip previously proposed.[10] Its physical style of computation speeds up learning a millionfold over a computer simulation. Fig. 6 shows a block diagram of the neuron. It is a double differential amplifier. One branch forms a sum of the inputs from the differential outputs of all other neurons with connections to it. The other adds noise from the noise amplifier. This first stage has low gain to preserve dynamic range at the summing nodes. The second stage has high gain and converts to a single ended output. This is fed to a switching arrangement whereby either this output state or some externally applied desired state is fed into the final set of inverter stages which provide for more gain and guaranteed digital complementarity.

Fig. 6. Block diagram of neuron.

The noise amplifier is shown schematically in Fig. 7. Thermal noise, with an rms level of tens of microvolts, from the channel of an FET is fed into a 3 stage amplifier. Each stage provides a potential gain of 100 over the noise bandwidth. Low pass feedback in each stage stabilizes the DC output as well as controls gain and bandwidth by means of an externally controlled variable resistance for tuning the annealing cycle.

Fig. 7. Block diagram of noise amplifier.

Fig. 8 shows a block diagram of the synapse. The weight is stored in 5 flip-flops as a sign and magnitude binary number. These flip-flops control the conductance from the outputs of neuron i to the inputs of neuron j and vice-versa as shown in the figure. The conductances of the FETs are in the ratio 1:2:4:8 to correspond to the value of the binary number, while the sign bit determines whether the true or complementary lines connect. The flip-flops are arranged in a counter which is controlled by the correlation logic. If the plus phase correlations are greater than the minus phase, then the counter is incremented by a single unit. If less, it is decremented.

Fig. 8. Block diagram of synapse.

Fig. 9 shows the layout of a test chip. A 6 neuron, 15 synapse network may be seen in the lower left corner. Each neuron has attached to it a noise amplifier to assure that the noise is uncorrelated. The network occupies an area about 2.5 mm on a side in 2 micron design rules. Each 300 transistor synapse occupies 400 by 600 microns. In contrast, a biological synapse occupies only about one square micron. The real miracle of biological learning is in the synapse, where plasticity operates on a molecular level, not in the neuron. We can't hope to compete using transistors, however small, especially in the digital domain. Aside from this small network, the rest of the chip is occupied with test structures of the various components.

Fig. 9. Chip layout.
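A behavioral model of the digital synapse just described may help fix ideas. The class below is our own sketch, not the chip netlist; the names, the weight range, and the unit conductance scaling are illustrative assumptions.

class DigitalSynapse:
    # Weight held as a sign-and-magnitude number in 5 flip-flops
    # (sign bit + 4 magnitude bits, FET conductances in ratio 1:2:4:8).
    def __init__(self):
        self.weight = 0                      # integer in [-15, +15]

    def adjust(self, plus_phase_corr, minus_phase_corr):
        # The correlation logic steps the counter by one unit per comparison.
        if plus_phase_corr > minus_phase_corr:
            self.weight = min(self.weight + 1, 15)
        elif plus_phase_corr < minus_phase_corr:
            self.weight = max(self.weight - 1, -15)

    def conductance(self, unit_conductance=1.0):
        # Magnitude bits select parallel FETs; the sign bit selects the true
        # or complementary lines, modeled here simply as the sign of the value.
        return unit_conductance * self.weight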
3.2 Analog Synapse

Analog circuit techniques can reduce the size of the synapse and increase its functionality. Several recent papers[14][15] have shown how to make a voltage controlled resistor in MOS technology. The voltage controlling the conductance representing the synaptic weight can be obtained by an analog charge integrator from the correlated activation of the neurons which the synapse in question connects. A charge integrator with a "leaky capacitor" has a time constant which can be used to make comparisons as a continuous time average over the last several trials, thereby adding temporal information. One can envision this time constant as being adaptive as well. The charge integrator directly implements the analog Hebb-type[16] correlation rules of section 2.
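As a rough behavioral sketch of such a leaky charge integrator (our own illustration; the time constant, gain, and forward-Euler step are assumptions):

def leaky_integrator_step(v, f_i, f_j, dt=1.0, tau=20.0, gain=1.0):
    # One step of dV/dt = -V/tau + gain * f_i * f_j: the capacitor voltage is
    # a decaying running average of recent correlations and would set the
    # control voltage of the MOS resistor acting as the synaptic weight.
    return v + dt * (-v / tau + gain * f_i * f_j)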
3.3 Technological Improvements for Electronic Neural Networks

It is still necessary to store the voltage which controls the analog conductance, and we propose the EPROM[17] or EEPROM device for this. Such a device can hold the value of the weight in the same way that flip-flops do in the digital implementation of the synapse[10]. The process which creates this device has two polysilicon layers which are useful for making high valued capacitances in analog circuitry. In addition, the second polysilicon layer could be used to make CCD devices for charge storage and transport. Coupled with the charge storage on a floating gate[18], this forms a compact, low power representation for weight values that approaches biological values. Another useful addition would be a high valued stable resistive layer[19]. One could thereby avoid space-wasting long-channel MOSFETs which are currently the only reasonable way to achieve high resistance in MOS technology. Lastly, the addition of a diffusion step or two creates a Bi-CMOS process which adds high quality bipolar transistors useful in analog design. Furthermore, one gets the logarithmic dependence of voltage on current in bipolar technology in a natural, robust way that is not subject to the variations inherent in using MOSFETs in the subthreshold region. This is especially useful in compressing the dynamic range in sensory processing[20].

4. CONCLUSION

We have shown how a simple adaptive synapse which measures correlations can account for a variety of learning styles in stochastic networks. By embellishing the standard CMOS process and using analog design techniques, a technology suitable for implementing such a synapse electronically can be developed. Noise is an important element in our formulation of learning. It can help a network settle, interpolate between discrete values of conductance during learning, and search a large solution space. Weight decay ("forgetting") and saturation are also important for stability. These phenomena not only unify diverse learning styles but are electronically implementable.

ACKNOWLEDGMENT

This work has been influenced by many researchers. We would especially like to thank Andy Barto and Geoffrey Hinton for valuable discussions on reinforcement learning, Yannis Tsividis for contributing many ideas in analog circuit design, and Joel Gannett for timely releases of his vlsi verification software.

References

1. J.J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities", Proc. Natl. Acad. Sci. USA 79, 2554-2558 (1982).
2. D.E. Rumelhart, G.E. Hinton, and R.J. Williams, "Learning internal representations by error propagation", in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, edited by D.E. Rumelhart and J.L. McClelland (MIT Press, Cambridge, MA, 1986), p. 318.
3. D.H. Ackley, G.E. Hinton, and T.J. Sejnowski, "A learning algorithm for Boltzmann machines", Cognitive Science 9, 147-169 (1985).
4. D.E. Rumelhart and D. Zipser, "Feature discovery by competitive learning", Cognitive Science 9, 75-112 (1985).
5. S. Grossberg, "Adaptive pattern classification and universal recoding: Part I. Parallel development and coding of neural feature detectors", Biological Cybernetics 23, 121-134 (1976).
6. A.G. Barto, R.S. Sutton, and C.W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems", IEEE Trans. Sys. Man Cyber. 13, 835 (1983).
7. B.A. Pearlmutter and G.E. Hinton, "G-Maximization: An unsupervised learning procedure for discovering regularities", in Neural Networks for Computing, edited by J.S. Denker, AIP Conference Proceedings 151, American Inst. of Physics, New York (1986), p. 333.
8. F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms (Spartan Books, Washington, D.C., 1961).
9. B. Widrow and M.E. Hoff, "Adaptive switching circuits", Inst. of Radio Engineers, Western Electronic Show and Convention, Convention Record, Part 4, 96-104 (1960).
10. J. Alspector and R.B. Allen, "A neuromorphic vlsi learning system", in Advanced Research in VLSI: Proceedings of the 1987 Stanford Conference, edited by P. Losleben (MIT Press, Cambridge, MA, 1987), pp. 313-349.
11. M.A. Cohen and S. Grossberg, "Absolute stability of global pattern formation and parallel memory storage by competitive neural networks", IEEE Trans. Sys. Man Cyber. 13, 815 (1983).
Grossberg,  \"Absolute  stability  of global pattern  formation  and parallel  memory  storage  by \n\ncompetitive neural networks\", Trans. IEEE 13,815, (1983). \n\n12. B.  Widrow.  N.K.  Gupta,  and  S.  Maitra,  \"Punish,IReward:  Learning with  a critic  in  adaptive  threshold  systems\", \n\nIEEE Trans. on Sys.  Man &  Cyber., SMC-3, 455 (1973). \n\n13. R.S. Sutton, \"Temporal credit assignment in reinforcement learning\",  unpublished doctoral dissertation, U.  Mass. \n\nAmherst, technical report COINS 84-02 (1984). \n\n]4. Z.  Czamul,  \"Design  of  voltage-controlled  linear  ttansconductance  elements  with  a  muched  pair  of  FET \n\ntransistors\", IEEE Trans. Cire. Sys.  33, 1012, (1986). \n\n15. M. Banu and Y. Tsividis, \"Flouing voltage-controUed resistors in CMOS technology\", Electron. Lett. 18,678-679 \n\n(1982). \n\n16. D.O. Hebb, Th~ OrganizotiOlf ofBtMV;oT (Wiley, NY, 19(9). \n17. D.  Frohman-Bentchkowsky. HFAMOS  - \u2022  new semiconductor charge  storage device\", Solid-State Electronics 17, \n\n517 (1974). \n\n18. J.P.  Sage,  K..  Thompson, and R.S.  Withers, \"An  artificial  neural  network integrued circuit based on  MNOS/CCD \nprinciples\",  in  Nrural  Networks  for  Computing.  edited  by  J.S.  Denker.  AIP  Conference  Proceedings  151, \nAmerican lost. of Physics, New York (1986), p.38 1. \n\n19. A.P. ThaJcoor, J.L.  Lamb.  A.  Moopenn, and  J. Lambe,  \"Binary synaptic connections ba!ICd  on  memory  switching \nin a-Si:H\". in Neural N~\"\"\"orks for Computing. edited by J.S. Denker, AIP Conference Proceedings  151. American \nInst. of Physics, New York (1986), p.426. \n\n20. M.A.  Sivilotti,  M.A.  Mahowald, and C.A. Mead, ~ReaJ-Time visual computations using analog CMOS processing \narrays\",  in  Advanud R~S('arch in  VLSl: Prou~dings of thr  1987 Stanford  Corrf~r~nu.  edited  by  P.  Losleben \n(MIT Press, Cambridge, MA,  1987), pp. 295-312. \n\n\f", "award": [], "sourceid": 80, "authors": [{"given_name": "Joshua", "family_name": "Alspector", "institution": null}, {"given_name": "Robert", "family_name": "Allen", "institution": null}, {"given_name": "Victor", "family_name": "Hu", "institution": null}, {"given_name": "Srinagesh", "family_name": "Satyanarayana", "institution": null}]}