{"title": "Back Propagation Implementation on the Adaptive Solutions CNAPS Neurocomputer Chip", "book": "Advances in Neural Information Processing Systems", "page_first": 1028, "page_last": 1031, "abstract": null, "full_text": "Back Propagation Implementation on the \n\nAdaptive Solutions CNAPS Neurocomputer Chip \n\nHal McCartor \nAdaptive Solutions Inc. \n1400 N.W. Compton Drive \nSuite 340 \nBeaverton, OR 97006 \n\nAbstract \n\nThe Adaptive Solutions CNAPS architecture chip is a general purpose \nneurocomputer chip. It has 64 processors, each with 4 Kbytes of local \nmemory, running at 25 megahertz. It is capable of implementing most \ncurrent neural network algorithms with on-chip learning. This paper discusses the implementation of the Back Propagation algorithm on an array \nof these chips and shows performance figures from a clock-accurate hardware simulator. An eight-chip configuration on one board can update 2.3 \nbillion connections per second in learning mode and process 9.6 billion \nconnections per second in feed forward mode. \n\n1 Introduction \n\nThe huge computational requirements of neural networks and their natural parallelism have led to a number of interesting hardware innovations for executing such \nnetworks. Most investigators have created large parallel computers or special-purpose \nchips limited to a small subset of algorithms. The Adaptive Solutions CNAPS \narchitecture describes a general-purpose 64-processor chip which supports on-chip \nlearning and is capable of implementing most current algorithms. Implementation \nof the popular Back Propagation (BP) algorithm will demonstrate the speed and \nversatility of this new chip. \n\n2 The Hardware Resources \n\nThe Adaptive Solutions CNAPS architecture is embodied in a single-chip digital \nneurocomputer with 64 processors running at 25 megahertz. 
All processors receive \nthe same instruction, which they conditionally execute. Multiplication and addition \nare performed in parallel, allowing 1.6 billion inner product steps per second per \nchip. Each processor has a 32-bit adder, a 9-bit by 16-bit multiplier (16 by 16 in two \nclock cycles), a shifter, a logic unit, 32 16-bit registers, and 4096 bytes of local memory. \nInput and output are accomplished over 8-bit input and output buses common \nto all processors. The output bus is tied to the input bus so that the output of one \nprocessor can be broadcast to all others. When multiple chips are used, they appear \nto the user as one chip with more processors. Special circuits support finding the \nmaximum of the values held in each processor and conserving weight space for sparsely \nconnected networks. An accompanying sequencer chip controls instruction flow, \ninput and output. \n\n3 The Back Propagation Algorithm Implementation \n\nThree critical issues must be addressed in a parallel implementation of BP on efficient hardware: the availability of weight values for back propagating the \nerror, the scaling and precision of computations, and the efficient implementation \nof the output transfer function. \n\nBP requires weight values at different nodes during the feed forward and back \npropagation phases of computation. This problem is solved by keeping a second set \nof weights which is the transpose of the output layer weights. These are located on \nhidden node processors. The two matrices are updated identically. The input-to-hidden-layer weight matrix is not used for error propagation and is not duplicated. \n\nBP implementations typically use 32-bit floating point math, which largely eliminates \nscaling, precision and dynamic range issues. Efficient hardware implementation, however, \ndictates integer arithmetic units with precision no greater than required. 
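The style of reduced-precision arithmetic this calls for can be sketched in ordinary code. The following is an illustrative model only, not the chip's instruction set: the quantization helpers are this sketch's own, but the formats follow the ones detailed below (16-bit weights with 12 fractional bits, 8-bit unsigned fractional inputs, a wide integer accumulator):

```python
W_FRAC = 12   # weight fractional bits; 16-bit weights then span [-8, +8)
IN_FRAC = 8   # inputs are 8-bit unsigned fractions in [0, 1)

def to_weight(x: float) -> int:
    """Quantize a real weight to a signed 16-bit fixed-point integer."""
    w = round(x * (1 << W_FRAC))
    return max(-(1 << 15), min((1 << 15) - 1, w))

def to_input(x: float) -> int:
    """Quantize an activation in [0, 1) to an 8-bit unsigned integer."""
    return max(0, min(255, round(x * (1 << IN_FRAC))))

def dot(weights, inputs) -> float:
    """Multiply-accumulate in integers; convert back to a real once at the end."""
    acc = 0                                   # models a wide (32-bit) accumulator
    for w, x in zip(weights, inputs):
        acc += w * x                          # 16-bit by 8-bit integer product
    return acc / (1 << (W_FRAC + IN_FRAC))    # shift the binary point back out

ws = [to_weight(v) for v in (0.5, -1.25, 3.0)]
xs = [to_input(v) for v in (0.25, 0.5, 0.75)]
print(dot(ws, xs))   # prints 1.75, matching 0.5*0.25 - 1.25*0.5 + 3.0*0.75
```

Because every value in the example is an exact power-of-two fraction, the integer result here is exact; in general the quantization steps introduce the rounding error that the precision studies cited below quantify.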
Baker \n[Bak90] has shown that 16-bit integer weights are sufficient for BP training and that much \nlower precision is adequate for use after training. \n\nWith fixed point integer math, the position of the binary point must be chosen. In \nthis implementation weights are 16 bits, with 12 bits to the right of the binary \npoint and four to the left, including a sign bit; they range from -8 to +8. Input \nand output are represented as 8-bit unsigned integers with the binary point at the left. \nThe learning rate is represented as an 8-bit integer with two bits to the left of the \nbinary point, giving values ranging from .016 to 3.98. Error is represented as an 8-bit \nsigned integer at the output layer and with the same representation as the weights \nat the hidden layer. \n\nThis data representation has been used to train benchmark BP applications with \nresults comparable to the floating point versions [HB91]. \n\nThe BP sigmoid output function is implemented as a 256-entry lookup table of 8-bit values. \n\nDuring the forward pass, input values are broadcast to all processors from off chip \nvia the input bus or from hidden nodes via the output bus to the input bus. During \nthe backward error propagation, error values are broadcast from the output nodes \nto the hidden nodes. \n\nThe typical BP network has two computational layers, the hidden and output layers. \nThey can be assigned to the same or different processor nodes (PNs) depending on \navailable memory for weights. PNs used for the hidden layer contain the transposed \nweights of the output layer for back propagating error. If momentum or periodic \nweight update is used, additional storage space is allocated with each weight. \n\nIn this implementation BP can be mapped to any set of contiguous processors, \nallowing multiple networks to reside in CNAPS memory simultaneously. Thus, the output \nof one algorithm can be directly used as input to another. 
For instance, in speech \nrecognition, a Fourier transform performed on the PN array could be input to a \nseries of matched BP networks whose hidden layers run concurrently. Their output \ncould be directed to an LVQ2 network for final classification. This can all be \naccomplished without any intermediate results leaving the chip array. \n\n4 Results \n\nBP networks have been successfully run on a hardware clock-accurate simulator, \nwhich gives the following timing results. In this example an eight-chip implementation (512 processors) was used. The network had 1900 inputs, 500 hidden nodes \nand 12 outputs. Weights were updated after each input and no momentum was \nused. The following calculations show BP performance: \n\nTRAINING PHASE \n\nOverhead clock cycles per input vector = 360 \nCycles per input vector element = 4 \nCycles per hidden node = 4 \nCycles per output node = 7 \nCycles per vector = 360+(1900*4)+(500*4)+(12*7) = 10,044 \nVectors per second = 25,000,000 / 10,044 = 2,489 \nTotal forward weights = (1900*500)+(500*12) = 956,000 \nWeight updates per second = 956,000*2,489 = 2,379,484,000 \n\nFEED FORWARD ONLY \n\nOverhead cycles per input vector = 59 \nCycles per input vector element = 1 \nCycles per hidden node = 1 \nCycles per output node = 1 (for output of data) \nCycles per vector = 59+1900+500+12 = 2,471 \nVectors per second = 25,000,000 / 2,471 = 10,117 \nConnections per second = 956,000*10,117 = 9,671,852,000 \n\n5 Comparative Performance \n\nAn array of eight Adaptive Solutions CNAPS chips would execute the preceding BP \nnetwork at 2.3 billion training weight updates per second or 9.6 billion feed forward \nconnections per second. These results can be compared with the results on other \ncomputers shown in Table 1. 
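As a check, the Section 4 cycle counts and throughput figures can be reproduced with a few lines of arithmetic (a sketch reproducing the paper's calculations, not the clock-accurate simulator):

```python
# Reproduce the Section 4 throughput arithmetic for the 1900-500-12 network
# on the 25 MHz, eight-chip (512-processor) configuration.
CLOCK_HZ = 25_000_000
INPUTS, HIDDEN, OUTPUTS = 1900, 500, 12
WEIGHTS = INPUTS * HIDDEN + HIDDEN * OUTPUTS                 # 956,000 forward weights

# Training phase: per-vector cycle budget from the text.
train_cycles = 360 + 4 * INPUTS + 4 * HIDDEN + 7 * OUTPUTS   # 10,044 cycles
train_vecs_per_s = CLOCK_HZ // train_cycles                  # 2,489 vectors/s
updates_per_s = WEIGHTS * train_vecs_per_s                   # 2,379,484,000

# Feed-forward only.
ff_cycles = 59 + INPUTS + HIDDEN + OUTPUTS                   # 2,471 cycles
ff_vecs_per_s = CLOCK_HZ // ff_cycles                        # 10,117 vectors/s
conns_per_s = WEIGHTS * ff_vecs_per_s                        # 9,671,852,000

print(updates_per_s, conns_per_s)
```

This reproduces the roughly 2.4 billion weight updates and 9.7 billion feed-forward connections per second quoted above.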
\n\nMACHINE                   MCUPS    MCPS    WTS \nSUN 3 [D88]               .034     0.25    fp \nSAIC SIGMA-1 [D88]                 8       fp \nWARP [PGTK88]             17               fp \nCRAY 2 [PGTK88]           7                fp \nCRAY X-MP [D88]                    50      fp \nCM-2 (65,536) [ZMMW90]    40       182     fp \nGF-11 (566) [WZ89]        901              fp \n8 ADAPTIVE CNAPS chips    2,379    9,671   16 bit int \n\nTable 1. Comparison of BP performance for various computers and 8 Adaptive \nSolutions CNAPS chips on one board. MCUPS is millions of BP connection updates \nper second in training mode. MCPS is millions of connections processed per second \nin feed forward mode. WTS is the representation used for weights. \n\n6 Summary \n\nThe Adaptive Solutions CNAPS chip is a very fast general purpose digital neurocomputer chip. It is capable of executing the Back Propagation algorithm quite \nefficiently. An 8-chip configuration can train 2.3 billion connections per second and \nevaluate 9.6 billion BP feed forward connections per second. \n\nReferences \n\n[Bak90] T. Baker. Implementation Limits for Artificial Neural Networks. Master's \nthesis, Oregon Graduate Institute, 1990. \n\n[D88] DARPA Neural Network Study, pp. 309-310. AFCEA International Press, Fairfax, Virginia, 1988. \n\n[HB91] J. Holt and T. Baker. Back Propagation Simulations using Limited Precision \nCalculations. Submitted to IJCNN, Seattle, WA, 1991. \n\n[RM86] D. Rumelhart and J. McClelland. Parallel Distributed Processing. MIT Press, Cambridge, MA, 1986. \n\n[WZ89] M. Witbrock and M. Zagha. An Implementation of Back-Propagation Learning on GF11, a Large SIMD Parallel Computer. Tech report CMU-CS-89-208, \nCarnegie Mellon University, 1989. \n\n[ZMMW90] X. Zhang, M. McKenna, J. Mesirov and D. Waltz. An Efficient Implementation of the Back-propagation Algorithm on the Connection Machine CM-2. \nIn Advances in Neural Information Processing Systems 2, ed. D. Touretzky. Morgan \nKaufmann, San Mateo, CA, 1990. 
", "award": [], "sourceid": 383, "authors": [{"given_name": "Hal", "family_name": "McCartor", "institution": null}]}