{"title": "A Massively-Parallel SIMD Processor for Neural Network and Machine Vision Applications", "book": "Advances in Neural Information Processing Systems", "page_first": 843, "page_last": 849, "abstract": null, "full_text": "A Massively-Parallel SIMD Processor for \n\nNeural Network and Machine Vision \n\nApplications \n\nMichael A. Glover \n\nCurrent Technology, Inc. \n\n99 Madbury Road \nDurham, NH 03824 \n\nW. Thomas Miller, III \n\nDepartment of Electrical and Computer Engineering \n\nThe University of New Hampshire \n\nDurham, NH 03824 \n\nAbstract \n\nThis paper describes the MM32k, a massively-parallel SIMD com(cid:173)\nputer which is easy to program, high in performance, low in cost \nand effective for implementing highly parallel neural network ar(cid:173)\nchitectures. The MM32k has 32768 bit serial processing elements, \neach of which has 512 bits of memory, and all of which are inter(cid:173)\nconnected by a switching network. The entire system resides on \na single PC-AT compatible card. It is programmed from the host \ncomputer using a C++ language class library which abstracts the \nparallel processor in terms of fast arithmetic operators for vectors \nof variable precision integers. \n\n1 \n\nINTRODUCTION \n\nMany well known neural network techniques for adaptive pattern classification and \nfunction approximation are inherently highly parallel, and thus have proven dif(cid:173)\nficult to implement for real-time applications at a reasonable cost. This includes \n\n843 \n\n\f844 \n\nGlover and Miller \n\na variety of learning systems such as radial basis function networks [Moody 1989], \nKohonen self-organizing networks [Kohonen 1982], ART family networks [Carpenter \n1988], and nearest-neighbor interpolators [Duda 1973], among others. 
This paper describes the MM32k, a massively-parallel SIMD computer which is easy to program, high in performance, low in cost, and effective for implementing highly parallel neural network architectures. The MM32k acts as a coprocessor to accelerate vector arithmetic operations on PC-AT class computers, and can achieve giga-operation-per-second performance on suitable problems. It is programmed from the host computer using a C++ language class library, which overloads typical arithmetic operators and supports variable-precision arithmetic. The MM32k has 32768 bit-serial PEs, or processing elements, each of which has 512 bits of memory, and all of which are interconnected by a switching network. The PEs are combined with their memory on a single DRAM memory chip, giving 2048 processors per chip. The entire 32768-processor system resides on a single ISA bus compatible card. It is much more cost effective than other SIMD processors [Hammerstrom 1990; Hillis 1985; Nickolls 1990; Potter 1985] and more flexible than fixed-purpose chips [Holler 1991]. \n\n2 SIMD ARCHITECTURE \n\nThe SIMD PE array contains 32768 one-bit processors, each with 512 bits of memory and a connection to the interconnection network. The PE array design is unique in that 2048 PEs, including their PE memory, are realized on a single chip. The total PE array memory is 2 megabytes and has a peak memory bandwidth of 25 gigabytes per second. The PE array can add 8-bit integers at 2.5 gigaoperations per second. It also dissipates less than 10 watts of power and is shown in Figure 1. \n\nEach PE has three one-bit registers, a 512-bit memory, and a one-bit ALU. It performs bit-serial arithmetic and can therefore vary the number of bits of precision to fit the problem at hand, saving SIMD instruction cycles and SIMD memory. There are 17 instructions in the PE instruction set, all of which execute at a 6.25 MIPS rate. 
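The bit-serial style of computation can be sketched in host-side C++ (a hypothetical simulation for illustration only, with invented names — not MM32k microcode): every PE applies the same one-bit full-adder step on each SIMD instruction cycle, so a k-bit add costs about k cycles, and reducing the precision directly saves both cycles and PE memory.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of one PE's bit-serial addition: operands are bit
// vectors stored least-significant bit first, one bit consumed per SIMD
// instruction cycle. All 32768 PEs would execute this same step in
// lockstep on their own local operands.
std::vector<int> bit_serial_add(const std::vector<int>& a,
                                const std::vector<int>& b) {
    std::vector<int> sum(a.size() + 1);
    int carry = 0;                              // one-bit register, as in the PE
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum[i] = a[i] ^ b[i] ^ carry;           // one-bit full adder
        carry  = (a[i] & b[i]) | (carry & (a[i] | b[i]));
    }
    sum[a.size()] = carry;                      // k-bit inputs give a (k+1)-bit sum
    return sum;
}
```

Adding two 12-bit operands this way takes 12 cycles and yields a 13-bit result, which is the trade-off between precision and speed mentioned above.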
The PE instruction set is functionally complete in that it can perform the boolean NOT and OR functions, and can therefore perform any operation, including arithmetic and conditional operations. A single PE is shown in Figure 2. \n\nThe interconnection network allows data to be sent from one PE to another. It is implemented by a 64x64 full crossbar switch with 512 PEs connected to each port of the switch, and allows data to be sent from any PE to any other PE, an arbitrary distance away, in constant time. The peak switch bandwidth is 280 megabytes per second. The switch also allows the PE array to perform data reduction operations, such as taking the sum or maximum over data elements distributed across all PEs. \n\n\fA Massively-Parallel SIMD Processor for Neural Network and Machine Vision Applications \n\n845 \n\n[Figure 1 shows the host computer (PC-AT) passing vector instructions and data to the controller, which broadcasts PE instructions and data to PEs 0 through 32767, all of which are connected to the switch.] \n\nFigure 1: A block diagram of the MM32k. \n\n[Figure 2 shows a single PE: a 512-bit memory (bits 0 through 511) addressed by a 9-bit address bus from the controller, a one-bit ALU driven by the PE ALU opcode from the controller, the one-bit A, M, and B registers, and data paths to and from the switch.] \n\nFigure 2: A block diagram of a single processing element (PE). \n\n3 C++ PROGRAMMING ENVIRONMENT \n\nThe purpose of the C++ programming environment is to allow a programmer to declare and manipulate vectors on the MM32k as if they were variables in a program running on the host computer. Programming is performed entirely on the host, using standard MS-DOS or Windows compatible C++ compilers. The C++ programming environment for the MM32k is built around a C++ class, named 
MM_VECTOR, which represents a vector of integers. Most of the standard C arithmetic operators, such as +, -, *, /, =, and >, have been overloaded to work with this class. Some basic functions, such as absolute value, square root, minimum, maximum, align, and sum, have also been overloaded or defined to work with the class. \n\nThe significance of the class MM_VECTOR is that instances of it look and act like ordinary variables in a C++ program. A programmer may add, subtract, assign, and manipulate these vector variables from a program running on the host computer, but the storage associated with them is in the SIMD memory and the vector operations are performed in parallel by the SIMD PEs. MM_VECTORs can be longer than 32768 elements. This is managed (transparently to the host program) by placing two or more vector elements in the SIMD memory of each PE; the class library keeps track of the number of words per PE. MM_VECTORs can also be represented by different numbers of bits. The class library automatically keeps track of the number of bits needed to represent each MM_VECTOR without overflow. For example, if two 12-bit integers were added together, then 13 bits would be needed to represent the sum without overflow, so the resulting MM_VECTOR would have 13 bits. This saves SIMD memory space and SIMD PE instruction cycles. The performance of the MM32k on simple operators running under the class library is listed in Table 1. \n\nTable 1: 8-Bit Operations With 32768 and 262144 Elements \n\n8-bit operation       | Actual MOPS, length 32768 | Actual MOPS, length 262144 \ncopy                  | 1796 | 9429 \nvector+vector         | 1455 | 2074 \nvector+scalar         | 1864 | 3457 \nvector*vector         |  206 |  215 \nvector*scalar         |  426 |  450 \nvector>scalar         | 1903 | 6223 \nalign(vector,scalar)  |  186 |  213 \nsum(vector)           |   52 |  306 \nmaximum(vector)       |  114 |  754 
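The automatic bit-width bookkeeping can be sketched with a toy host-only class (hypothetical names; the real MM_VECTOR stores its elements in SIMD PE memory and overloads many more operators):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy sketch of MM_VECTOR-style precision tracking (hypothetical class,
// host-only). Each vector records the number of bits needed to represent
// its elements, and operator+ widens the result by one bit, so the sum
// can never overflow its declared precision.
struct ToyVector {
    std::vector<long> data;
    int bits;                                // bits needed to hold any element
};

ToyVector operator+(const ToyVector& x, const ToyVector& y) {
    ToyVector r;
    r.bits = std::max(x.bits, y.bits) + 1;   // e.g. 12-bit + 12-bit -> 13-bit
    r.data.resize(x.data.size());
    for (std::size_t i = 0; i < x.data.size(); ++i)
        r.data[i] = x.data[i] + y.data[i];
    return r;
}
```

On the real hardware the narrower representation pays off twice: fewer bit-serial instruction cycles per operation, and fewer of each PE's 512 memory bits consumed.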
\n\n4 NEURAL NETWORK EXAMPLES \n\nA common operation found in neural network classifiers (Kohonen, ART, etc.) is the multi-dimensional nearest-neighbor match. If the network has a large number of nodes, this operation is particularly inefficient on single-processor systems, which must compute the distance metric for each node sequentially. Using the MM32k, the distance metrics for all nodes (up to 32768 nodes) can be computed simultaneously, and the identification of the minimum distance can be made using an efficient tree compare included in the system microcode. \n\nTable 2: Speedup on Nearest Neighbor Search \n\nProcessor | Time, 32768 nodes | Time, 65536 nodes | MM32k speedup, 32768 nodes | MM32k speedup, 65536 nodes \nMM32k |  2.2 msec |  3.1 msec |   1:1 |   1:1 \ni486  |  350 msec |  700 msec | 159:1 | 226:1 \nMIPS  |  970 msec | 1860 msec | 441:1 | 600:1 \nAlpha |   81 msec |  177 msec |  37:1 |  57:1 \nSPARC |  410 msec |  820 msec | 186:1 | 265:1 \n\nFigure 3 shows a C++ code example for performing a 16-dimensional nearest-neighbor search over 32768 nodes. The global MM_VECTOR variable state[16] defines the 16-dimensional location of each node. Each logical element of state[] (state[0], state[1], etc.) is actually a vector with 32768 elements distributed across all processors. The routine find_best_match() computes the Euclidean distance between each node's state and the current test vector test_input[], which resides on the host processor. Note that the equations appear to be scalar in nature, but in fact direct vector operations to be performed by all processors simultaneously. \n\nThe performance of the nearest neighbor search shown in Figure 3 is listed in Table 2. 
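For contrast with Figure 3, the node-by-node loop that a serial machine must execute to produce the timings in Table 2 can be sketched as follows (a hypothetical host-only reference with invented names, using the same squared-distance form as Figure 3):

```cpp
// Hypothetical serial reference for the nearest-neighbor search: a single
// processor walks all nodes sequentially, whereas the MM32k computes every
// node's distance at once (one node per PE) and reduces with its tree compare.
long serial_best_match(const long state[][16], long n_nodes,
                       const long test_input[16]) {
    long best = 0, best_dist = -1;
    for (long n = 0; n < n_nodes; ++n) {      // O(n_nodes) sequential work
        long dist = 0;
        for (int i = 0; i < 16; ++i) {        // 16-D squared distance
            long d = state[n][i] - test_input[i];
            dist += d * d;
        }
        if (best_dist < 0 || dist < best_dist) { best = n; best_dist = dist; }
    }
    return best;                              // index of the closest node
}
```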
Performance on the same task is also listed for four comparison processors: a Gateway 2000 model 4DX2-66V with 66 MHz 80486 processor (i486), a DECstation 5000 Model 200 with 25 MHz MIPS R3000A processor (MIPS), a DECstation 3000 Model 500AXP with 150 MHz Alpha AXP processor (Alpha), and a Sun SPARCstation 10 Model 30 with 33 MHz SuperSPARC processor (SPARC). There are 16 subtractions, 16 additions, 16 absolute values, one global minimum, and one global first operation performed. The MM32k is tested on problems with 32768 and 65536 exemplars and compared against the four serial machines performing equivalent searches. The MM32k requires 3.1 milliseconds to search 65536 exemplars, which is 265 times faster than a SPARC 10. \n\nThe flexibility of the MM32k for neural network applications was demonstrated by implementing complete fixed-point neural network paradigms on the MM32k and on the four comparison processors (Table 3). Three different neural network examples were evaluated. The first was a radial basis function network with 32,768 basis functions (rational function approximations to gaussian functions). Each basis function had nine 8-bit inputs, three 16-bit outputs (a vector basis function magnitude), and independent width parameters for each of the nine inputs. The performances listed in the table (RBF) are for feedforward response only. The second example was a Kohonen self-organizing network with a two-dimensional sheet of Kohonen nodes of dimension 200x150 (30,000 nodes). The problem was to map a nonlinear robotics forward kinematics transformation with eight degrees of freedom (8-bit parameters) onto the two-dimensional Kohonen layer. The performances listed in the table (Kohonen) are for self-organizing training. The third example problem was a neocognitron for target localization in a 256x256 8-bit input image. 
The first hidden layer of the neocognitron had 8 256x256 sheets of linear convolution units with 16x16 receptive fields in the input image. The second hidden layer of the neocognitron had 8 256x256 sheets of sigmoidal units (fixed-point rational function approximations to sigmoid functions) with 3x3x8 receptive fields in the first hidden layer. The output layer of the neocognitron had 256x256 sigmoidal units with 3x3x8 receptive fields in the second hidden layer. \n\n/* declare 16-D MM32k exemplars */ \nMM_VECTOR state[16] = { \n    MM_VECTOR(32768), MM_VECTOR(32768), \n    MM_VECTOR(32768), MM_VECTOR(32768), \n    MM_VECTOR(32768), MM_VECTOR(32768), \n    MM_VECTOR(32768), MM_VECTOR(32768), \n    MM_VECTOR(32768), MM_VECTOR(32768), \n    MM_VECTOR(32768), MM_VECTOR(32768), \n    MM_VECTOR(32768), MM_VECTOR(32768), \n    MM_VECTOR(32768), MM_VECTOR(32768) \n}; \n\n/* return PE number of processor with closest match */ \nlong find_best_match(long test_input[16]) \n{ \n    int i; \n    MM_VECTOR difference(32768);  /* differences */ \n    MM_VECTOR distance(32768);    /* distances   */ \n\n    /* compute the 16-D distance scores */ \n    distance = 0; \n    for (i = 0; i < 16; ++i) { \n        difference = state[i] - test_input[i]; \n        distance = distance + (difference * difference); \n    } \n\n    /* return the PE number for minimum distance */ \n    return first(distance == minimum(distance)); \n} \n\nFigure 3: A C++ code example implementing a nearest neighbor search. \n\nTable 3: MM32k Speedup for Select Neural Network Paradigms \n\nProcessor | RBF   | Kohonen | NCGTRN \nMM32k     |   1:1 |    1:1  |   1:1 \ni486      | 161:1 |   76:1  | 336:1 \nMIPS      | 180:1 |   69:1  | 207:1 \nAlpha     |  31:1 |   11:1  |  35:1 \nSPARC     |  94:1 |   49:1  | 378:1 \n\n
The performances listed in the table (NCGTRN) correspond to feedforward response followed by backpropagation training. The absolute computation times for the MM32k were 5.1 msec, 10 msec, and 1.3 sec for the RBF, Kohonen, and NCGTRN neural networks, respectively. \n\nAcknowledgements \n\nThis work was supported in part by a grant from the Advanced Research Projects Agency (ARPA/ONR Grant #N00014-92-J-1858). \n\nReferences \n\nJ. L. Potter. (1985) The Massively Parallel Processor. Cambridge, MA: MIT Press. \n\nG. A. Carpenter and S. Grossberg. (1988) The ART of adaptive pattern recognition by a self-organizing neural network. Computer, vol. 21, pp. 77-88. \n\nR. O. Duda and P. E. Hart. (1973) Pattern Classification and Scene Analysis. New York: Wiley. \n\nD. Hammerstrom. (1990) A VLSI architecture for high-performance, low-cost, on-chip learning. In Proc. IJCNN, San Diego, CA, June 17-21, vol. II, pp. 537-544. \n\nW. D. Hillis. (1985) The Connection Machine. Cambridge, MA: MIT Press. \n\nM. Holler. (1991) VLSI implementations of learning and memory systems: A review. In Advances in Neural Information Processing Systems 3, ed. by R. P. Lippmann, J. E. Moody, and D. S. Touretzky. San Francisco, CA: Morgan Kaufmann. \n\nT. Kohonen. (1982) Self-organized formation of topologically correct feature maps. Biological Cybernetics, vol. 43, pp. 59-69. \n\nJ. Moody and C. Darken. (1989) Fast learning in networks of locally-tuned processing units. Neural Computation, vol. 1, pp. 281-294. \n\nJ. R. Nickolls. (1990) The design of the MasPar MP-1: A cost-effective massively parallel computer. In Proc. COMPCON Spring '90, San Francisco, CA, pp. 25-28. \n", "award": [], "sourceid": 718, "authors": [{"given_name": "Michael", "family_name": "Glover", "institution": null}, {"given_name": "W.", "family_name": "Miller", "institution": null}]}