{"title": "A Neurocomputer Board Based on the ANNA Neural Network Chip", "book": "Advances in Neural Information Processing Systems", "page_first": 773, "page_last": 780, "abstract": null, "full_text": "A  Neurocomputer Board  Based on the ANNA \n\nNeural Network  Chip \n\nEduard Sackinger, Bernhard E.  Boser, and Lawrence D. Jackel \n\nAT&T  Bell  Laboratories \n\nCrawfords Corner  Road,  Holmdel,  NJ  07733 \n\nAbstract \n\nA  board is described  that contains the ANN A  neural-network  chip,  and a \nDSP32C  digital  signal  processor.  The  ANNA  (Analog  Neural  Network \nArithmetic  unit)  chip  performs  mixed  analog/digital  processing.  The \ncombination  of ANNA  with  the  DSP  allows  high-speed,  end-to-end  ex(cid:173)\necution  of numerous signal-processing  applications,  including the  prepro(cid:173)\ncessing,  the  neural-net  calculations,  and  the  postprocessing  steps.  The \nANNA  board  evaluates  neural  networks  10  to  100  times  faster  than  the \nDSP  alone.  The  board  is  suitable  for  implementing  large  (million  con(cid:173)\nnections)  networks  with sparse  weight  matrices.  Three  applications  have \nbeen  implemented on  the  board:  a  convolver  network  for  slant  detection \nof text  blocks,  a  handwritten  digit  recognizer,  and  a  neural  network  for \nrecognition-based  segmentation. \n\n1 \n\nINTRODUCTION \n\nMany researchers  have built neural-network chips, but few  chips have been  installed \nin  board-level systems,  even  though  this next  level  of integration provides  insights \nand  advantages  that  can't be  attained  on  a  chip  testing station.  Building a  board \ndemonstrates  whether  or not  the  chip  can  be effectively  integrated  into  the larger \nsystems  required  for  real  applications.  A  board  also  exposes  bottlenecks  in  the \nsystem  data paths.  
Most importantly, a working board moves the neural-network chip from the realm of a research exercise to that of a practical system, readily available to users whose primary interest is actual applications. An additional bonus of carrying the integration to the board level is that the chip designer can gain the user feedback that will assist in designing new chips with greater utility.\n\nFigure 1: Block Diagram of the ANNA Board\n\n2 ARCHITECTURE\n\nThe neurocomputer board contains a special-purpose chip called ANNA (Boser et al., 1991) for the parallel evaluation of neuron functions (a squashing function applied to a weighted sum), and a general-purpose digital signal processor, the DSP32C. The board also contains interface and clock-synchronization logic as well as 1 MByte of static memory, SRAM (see Fig. 1). Two versions of this board with two different bus interfaces have been built: a double-height VME board (see Fig. 2) and a PC/AT board (see Fig. 3).\nThe ANNA neural network chip is an ALU (Arithmetic and Logic Unit) specialized for neural network functions. It contains a 12-bit wide state-data input, a 12-bit wide state-data output, a 12-bit wide weight-data input, and a 37-bit micro-instruction input. The instructions that can be executed by the chip are the following (parameters are not shown):\n\nRFSH Write weight values from the weight-data input into the dynamic on-chip weight storage. 
\n\nSHIFT Shift the on-chip barrel shifter to the left and load up to four new state values from the state-data input into the right end of the shifter.\n\nSTORE Transfer the state vector from the shifter into the on-chip state storage and/or into the state-data latches of the arithmetic unit.\n\nCALC Calculate eight dot products between on-chip weight vectors and the contents of the above-mentioned data latches; subsequently evaluate the squashing function.\n\nOUT Transfer the results of the calculation to the state-data output.\n\nFigure 2: ANNA Board with VME Bus Interface\n\nFigure 3: ANNA Board with PC/AT Bus Interface\n\nFigure 4: Photo Micrograph of the ANNA Chip\n\nSome of the instructions (like SHIFT and CALC) can be executed in parallel. The barrel shifter at the input as well as the on-chip state storage make the ANNA chip very effective for evaluating locally connected, weight-sharing networks such as feature-extraction and time-delay neural networks (TDNN).\nThe ANNA neural network chip, implemented in a 0.9 µm CMOS technology, contains 180,000 transistors on a 4.5 × 7 mm² die (see Fig. 4). The chip implements 4,096 physical synapses which can be time-multiplexed in order to realize networks with many more than 4,096 connections. The resolution of the synaptic weights is 6 bits and that of the states (input/output of the neurons) is 3 bits. Additionally, a 4-bit scaling factor can be programmed for each neuron to extend the dynamic range of the weights. The weight values are stored as charge packets on capacitors and are periodically refreshed by two on-chip 6-bit D/A converters. 
The synapses are realized by multiplying 3-bit D/A converters (analog weight times digital state). The analog results of this multiplication are added by means of current summing and then converted back to digital by a saturating 3-bit A/D converter. Although the chip uses analog computing internally, all input/output is digital. This combines the high synaptic density, high speed, and low power of analog computation with the ease of interfacing to a digital system like a digital signal processor (DSP).\n\nThe 32-bit floating-point digital signal processor (DSP32C) on the same board runs at 40 MHz without wait states (100 ns per instruction) and is connected to 1 MByte of static RAM. The DSP has several functions: (1) It generates the micro-instructions for the ANNA chip. (2) It is responsible for accessing the pixel, feature, and weight data from the memory and for storing the results of the chip in the memory. (3) If the precision of the ANNA chip is not sufficient, the DSP can do the calculations with 32-bit floating-point precision. (4) Learning algorithms can be run on the DSP. (5) The DSP is useful as a pre- and postprocessor for neural networks. In this way a whole task can be carried out on the board without exchanging intermediate results with the host.\n\nAs shown in Fig. 1, ANNA instructions are supplied over the DSP address bus, while state and weight data are transferred over the data bus. This arrangement makes it possible to supply or store ANNA data and execute a micro-instruction simultaneously, i.e., using only one DSP instruction. The ANNA clock is automatically generated whenever the DSP issues a micro-instruction to the ANNA chip. 
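The arithmetic path just described can be illustrated with a short sketch. The following Python fragment is a rough, hypothetical model of a single CALC dot product (6-bit signed weights, 3-bit states, a power-of-two scaling shift, and a saturating 3-bit output); the chip's exact number formats, rounding, and squashing function are not specified in this paper, so those details are assumptions:

```python
def anna_calc(weights, states, scale_exp=0):
    """Rough model of one ANNA CALC dot product (details are assumptions):
    6-bit signed weights times 3-bit states, accumulated, shifted by a
    power-of-two scale factor, then squashed to a saturating 3-bit output."""
    assert all(-32 <= w <= 31 for w in weights)  # 6-bit signed weight range
    assert all(0 <= s <= 7 for s in states)      # 3-bit state range (assumed unsigned)
    acc = sum(w * s for w, s in zip(weights, states))
    acc >>= scale_exp                            # per-neuron 4-bit scale factor
    return max(0, min(7, acc))                   # stand-in saturating squashing
```

On the real chip, eight such dot products are computed in parallel per CALC instruction and the squashing is done in analog; the clipping above is only a digital stand-in.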
\n\n3 PERFORMANCE\n\nUsing a DSP for supplying micro-instructions as well as for accessing the data from the memory makes the board very flexible and fairly simple. Both data and instruction flow to and from the ANNA chip are under software control and can be programmed in C or DSP32C assembly language.\n\nBecause of DSP32C features such as one-instruction 32-bit memory-to-memory transfer with auto-increment and overhead-free looping, ANNA instruction sequences can be generated at a rate of approximately 5 MIPS. A similar rate of 5 MByte/s is achieved for reading and writing ANNA data from and to the memory.\n\nThe speed of the board depends on the application and on how well it makes use of the chip's parallelism; it ranges between 30 MC/s and 400 MC/s (million connections per second). For concrete examples see the section on Applications. Compared to the DSP32C, which performs at about 3 MC/s (for sparsely connected networks), the board with the ANNA chip is 10 to 100 times faster.\n\nThe speed of the board is not limited by the ANNA chip but by the above-mentioned data rates. The use of a dedicated hardware sequencer would improve the speed by up to ten times. The board can thus be used for prototyping an application before building more specialized hardware.\n\n4 SOFTWARE\n\nTo make the board easily usable we implemented a LISP interpreter on the host computer (a SUN workstation) which allows us to make remote procedure calls (RPC) to the ANNA board. After the LISP interpreter is started on the host, it downloads the DSP object code to the board and starts the main program on the DSP. Then the DSP transfers to the LISP interpreter the addresses of all procedures that are available to the user. 
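The procedure-announcement mechanism just described can be mimicked in a few lines. The Python sketch below is an invented stand-in (the class, names, and dummy recognizer are not the actual DSP/LISP protocol) showing the dispatch pattern: the board publishes a table of procedures, and the host calls them by name.

```python
# Hypothetical stand-in for the host<->board RPC scheme: the DSP-side
# program announces its procedures, and the host invokes them by name,
# analogous to the LISP form (==> anna procedure parameters).
class AnnaBoard:
    def __init__(self):
        self.procedures = {}  # name -> callable, announced at start-up
        self.weights = None

    def register(self, name, fn):
        self.procedures[name] = fn

    def call(self, name, *args):
        # host-side analogue of (==> anna name args)
        return self.procedures[name](*args)

board = AnnaBoard()
board.register("down-weight", lambda w: setattr(board, "weights", w))
# dummy "recognizer": a placeholder classifier, not the real network
board.register("down-rec-up", lambda pattern: sum(pattern) % 10)

board.call("down-weight", [[1, 2], [3, 4]])
cls = board.call("down-rec-up", [3, 4, 5])  # some class label 0..9
```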
From then on, all these procedures can be called as LISP functions of the form (==> anna procedure parameters) from the host. Parameters and return values are handled automatically by the LISP interpreter.\n\nThree ways of using the ANNA board are described. The first two methods do not require DSP programming; everything is controlled from the LISP interpreter. The third method requires DSP programming and results in maximum speed for any application.\n\n1. The simplest way to use the board together with this LISP interpreter is to call existing library functions on the board. For example, a neural network for recognizing handwritten digits can be called as follows:\n\n(==> anna down-weight weight-matrix)\n(setq class (==> anna down-rec-up digit-pattern))\n\nThe first LISP function activates the down-weight function on the ANNA board that transfers the LISP matrix, weight-matrix, to the board. This function defines all the weights of the network and has to be called only once. The second LISP function calls the down-rec-up function, which takes the digit-pattern (pixel image) as input, downloads this pattern, runs the recognizer, and uploads the class number (0 ... 9).\nThis method requires no knowledge of the ANNA or DSP instruction set. The library functions are fast since they have been optimized by the implementer. At the moment, library functions for nonlinear convolution, character recognition, and testing are available.\n2. If a function which is not part of the library has to be implemented, an ANNA program must be written. A collection of LISP functions (ANNANAS) supports the translation of symbolic ANNA programs into microcode. 
The microcode is then run on the ANNA chip by means of a software sequencer implemented on the DSP. Assembling and running a simple ANNA program using ANNANAS looks like this:\n\n(anna-repeat 16)      REPEAT 16        start of loop\n(anna-shift 4 0)      SHIFT 4,R0;      ANNA shift instruction\n(anna-store 0 'a 2)   STORE R0,A.L2;   ANNA store instruction\n(anna-endrep)         ENDREP           end of loop\n(anna-stop)           STOP             end of program\n(anna-run 0)                           start sequencer\n\nIn this way, all the features of the ANNA chip and board can be used without DSP programming. This mode is also helpful for testing and debugging ANNA programs. Besides the assembler, ANNANAS also provides several monitoring and debugging tools.\n3. If maximum speed is imperative, an application-specific sequencer has to be written (as opposed to using the slower generic sequencer described above). To do this, a DSP assembler and a C compiler are required. A toolbox of assembly macros and C functions helps in implementing this sequencer. Besides the sequencer, pre- and postprocessing software can also be implemented on the fast DSP hardware. After the program has been tested successfully, it can be added to the library as a new function.\n\n5 APPLICATIONS\n\n5.1 CONVOLVER NETWORK\n\nIn this application the ANNA chip is configured for 16 neurons with 256 synapses each. First, each of these neurons connects to the upper left 16 × 16 field of a\n\nTable 1: Performance of the Recognizer.\n\nIMPLEMENTATION        ERROR RATE    REJECT RATE FOR 1% ERROR\nFull Precision        4.9%          9.1%\nANNA/DSP              5.3 ± 0.2%    13.5 ± 0.8%\nANNA/DSP/Retraining   4.9 ± 0.2%    11.5 ± 0.8%\n\n5.3 RECOGNITION-BASED SEGMENTATION\n\nBefore individual digits can be passed to a recognizer as described in the previous section, they typically have to be isolated (segmented) from a string of characters (e.g., a ZIP code). When characters overlap, segmentation is a difficult problem, and simple algorithms which look for connected components or histograms fail.\n\nA promising solution to this problem is to combine recognition and segmentation (Keeler et al., 1992; Matan et al., 1992). For instance, recognizers like the one described above can be replicated horizontally and vertically over the region of interest. This guarantees that there is a recognizer centered over each character. It is crucial, however, to train the recognizer such that it rejects partial characters. Such a replicated version of the recognizer (at 31 × 6 locations) with approximately 2 million connections has been implemented on the ANNA board and was used to segment ZIP codes.\n\n6 CONCLUSION\n\nA board with a neural-network chip and a digital signal processor (DSP) has been built. Large pattern-recognition applications have been implemented on the board, giving a speed advantage of 10 to 100 over the DSP alone.\n\nAcknowledgements\n\nThe authors would like to thank Steve Deiss for his excellent job in building the boards and Yann LeCun and Jane Bromley for their help with the digit recognizer.\n\nReferences\n\nBernhard Boser, Eduard Säckinger, Jane Bromley, Yann LeCun, and Lawrence D. Jackel. An analog neural network processor with programmable network topology. IEEE J. 
Solid-State Circuits, 26(12):2017-2025, December 1991.\n\nYann Le Cun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. Handwritten digit recognition with a back-propagation network. In David S. Touretzky, editor, Neural Information Processing Systems, volume 2, pages 396-404. Morgan Kaufmann Publishers, San Mateo, CA, 1990.\n\nEduard Säckinger, Bernhard Boser, Jane Bromley, Yann LeCun, and Lawrence D. Jackel. Application of the ANNA neural network chip to high-speed character recognition. IEEE Trans. Neural Networks, 3(2), March 1992.\n\nJ. D. Keeler and D. E. Rumelhart. Self-organizing segmentation and recognition neural network. In J. M. Moody, S. J. Hanson, and R. P. Lippmann, editors, Neural Information Processing Systems, volume 4. Morgan Kaufmann Publishers, San Mateo, CA, 1992.\n\nOfer Matan, Christopher J. C. Burges, Yann LeCun, and John S. Denker. Multi-digit recognition using a space delay neural network. In J. M. Moody, S. J. Hanson, and R. P. Lippmann, editors, Neural Information Processing Systems, volume 4. Morgan Kaufmann Publishers, San Mateo, CA, 1992.\n", "award": [], "sourceid": 554, "authors": [{"given_name": "Eduard", "family_name": "S\u00e4ckinger", "institution": null}, {"given_name": "Bernhard", "family_name": "Boser", "institution": null}, {"given_name": "Lawrence", "family_name": "Jackel", "institution": null}]}