{"title": "A Parallel Gradient Descent Method for Learning in Analog VLSI Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 836, "page_last": 844, "abstract": null, "full_text": "A Parallel Gradient Descent Method for Learning in Analog VLSI Neural Networks \n\nJ. Alspector R. Meir* B. Yuhas A. Jayakumar D. Lippe† \n\nBellcore \n\nMorristown, NJ 07962-1910 \n\nAbstract \n\nTypical methods for gradient descent in neural network learning involve calculation of derivatives based on a detailed knowledge of the network model. This requires extensive, time-consuming calculations for each pattern presentation and high precision that makes it difficult to implement in VLSI. We present here a perturbation technique that measures, not calculates, the gradient. Since the technique uses the actual network as a measuring device, errors in modeling neuron activation and synaptic weights do not cause errors in gradient descent. The method is parallel in nature and easy to implement in VLSI. We describe the theory of such an algorithm, an analysis of its domain of applicability, some simulations using it and an outline of a hardware implementation. \n\n1 Introduction \n\nThe most popular method for neural network learning is back-propagation (Rumelhart, 1986) and related algorithms that calculate gradients based on detailed knowledge of the neural network model. These methods involve calculating exact values of the derivative of the activation function. For analog VLSI implementations, such techniques require impossibly high precision in the synaptic weights and precise modeling of the activation functions. It is much more appealing to measure rather than calculate the gradient for analog VLSI implementation by perturbing either a \n\n*Present address: Dept. of EE; Technion; Haifa, Israel \n†Present address: Dept. 
of EE; MIT; Cambridge, MA \n\nsingle weight (Jabri, 1991) or a single neuron (Widrow, 1990) and measuring the resulting change in the output error. However, perturbing only a single weight or neuron at a time loses one of the main advantages of implementing neural networks in analog VLSI, namely, that of computing weight changes in parallel. The one-weight-at-a-time perturbation method has the same order of time complexity as a serial computer simulation of learning. A mathematical analysis of the possibility of model-free learning using parallel weight perturbations followed by local correlations suggests that random perturbations by additive, zero-mean, independent noise sources may provide a means of parallel learning (Dembo, 1990). We have previously used such a noise source (Alspector, 1991) in a different implementable learning model. \n\n2 Gradient Estimation by Parallel Weight Perturbation \n\n2.1 A Brownian Motion Algorithm \n\nOne can estimate the gradient of the error E(w) with respect to any weight w_1 by perturbing w_1 by δw_1 and measuring the change in the output error δE as the entire weight vector w except for component w_1 is held constant. \n\nδE/δw_1 = [E(w + δw_1) - E(w)] / δw_1   (1) \n\nThis leads to an approximation to the true gradient: \n\nδE/δw_1 = ∂E/∂w_1 + O(|δw_1|)   (2) \n\nFor small perturbations, the second (and higher order) term can be ignored. This method of perturbing weights one at a time has the advantage of using the correct physical neurons and synapses in a VLSI implementation but has time complexity of O(W), where W is the number of weights. \n\nFollowing (Dembo, 1990), let us now consider perturbing all weights simultaneously. However, we wish to have the perturbation vector δw chosen uniformly on a hypercube. 
Note that this requires only a random sign multiplying a fixed perturbation and is natural for VLSI. Dividing the resulting change in error by any single weight change, say δw_1, gives \n\nδE/δw_1 = [E(w + δw) - E(w)] / δw_1   (3) \n\nwhich by a Taylor expansion is \n\nδE/δw_1 = Σ_i (∂E/∂w_i)(δw_i/δw_1) + O(|δw|)   (4) \n\nleading to the approximation (ignoring higher order terms) \n\nδE/δw_1 ≈ ∂E/∂w_1 + Σ_{i≠1} (∂E/∂w_i)(δw_i/δw_1)   (5) \n\nAn important point of this paper, emphasized by (Dembo, 1990) and embodied in Eq. (5), is that the last term has expectation value zero for random and independently distributed δw_i, since the last expression in parentheses is equally likely to be +1 as -1. Thus, one can approximately follow the gradient by perturbing all weights at the same time. If each synapse has access to information about the resulting change in error, it can adjust its weight by assuming it was the only weight perturbed. The weight change rule \n\nΔw_1 = -η (δE/δw_1)   (6) \n\nwhere η is a learning rate, will follow the gradient on the average but with the considerable noise implied by the second term in Eq. (5). This type of stochastic gradient descent is similar to the random-direction Kiefer-Wolfowitz method (Kushner, 1978), which can be shown to converge under suitable conditions on η and δw_i. This is also reminiscent of Brownian motion where, although particles may be subject to considerable random motion, there is a general drift of the ensemble of particles in the direction of even a weak external force. In this respect, there is some similarity to the directed drift algorithm of (Venkatesh, 1991), although that work applies to binary weights and single-layer perceptrons whereas this algorithm should work for any level of weight quantization or precision - an important advantage for VLSI implementations - as well as any number of layers and even for recurrent networks. 
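In software, the scheme of Eqs. (3)-(6) amounts to a single extra error evaluation per update. The sketch below is our own minimal illustration, not the simulator used later in the paper; the function name, the parameter values, and the quadratic test surface mentioned afterward are all assumptions for demonstration.

```python
import numpy as np

def parallel_perturbation_step(w, error_fn, delta=0.005, eta=0.05, rng=None):
    # Perturb ALL weights at once by a fixed magnitude with random signs
    # (a random corner of a hypercube), measure the resulting change in
    # output error with one extra evaluation, and let every weight adjust
    # as if it alone had been perturbed.
    if rng is None:
        rng = np.random.default_rng()
    dw = delta * rng.choice([-1.0, 1.0], size=w.shape)  # random sign vector
    dE = error_fn(w + dw) - error_fn(w)  # one scalar measurement, Eq. (3)
    grad_est = dE / dw                   # per-weight gradient estimate
    return w - eta * grad_est            # weight change rule, Eq. (6)
```

On a toy quadratic surface such as E(w) = Σ w_i², repeated calls drift downhill on average even though any single estimate is dominated by the noise term of Eq. (5).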
\n\n2.2 Improving the Estimate by Multiple Perturbations \n\nAs was pointed out by (Dembo, 1990), for each pattern, one can reduce the variance of the noise term in Eq. (5) by repeating the random parallel perturbation many times to improve the statistical estimate. If we average over P perturbations, we have \n\nδE/δw_1 = (1/P) Σ_{p=1}^{P} δE^p/δw_1^p = ∂E/∂w_1 + (1/P) Σ_{p=1}^{P} Σ_{i≠1} (∂E/∂w_i)(δw_i^p/δw_1^p)   (7) \n\nwhere p indexes the perturbation number. The variance of the second term, which is a noise, ν, is \n\n⟨ν²⟩ = (1/P²) Σ_{p,p'=1}^{P} Σ_{i,i'≠1} (∂E/∂w_i)(∂E/∂w_i') ⟨(δw_i^p/δw_1^p)(δw_i'^{p'}/δw_1^{p'})⟩   (8) \n\nwhere the expectation value, ⟨⟩, leads to the Kronecker delta functions δ_{pp'} and δ_{ii'}. This reduces Eq. (8) to \n\n⟨ν²⟩ = (1/P²) Σ_{p=1}^{P} Σ_{i≠1} (∂E/∂w_i)²   (9) \n\nThe double sum over perturbations and weights (assuming the gradient is bounded and all gradient directions have the same order of magnitude) has magnitude O(PW), so that the variance is O(W/P) and the standard deviation is \n\nσ_ν ~ O(√(W/P))   (10) \n\nTherefore, for a fixed variance in the noise term, it may be necessary to have a number of perturbations of the same order as the number of weights. So, if a high-precision estimate of the gradient is needed throughout learning, it seems as though the time complexity will still be O(W), giving no advantage over single perturbations. However, one or a few of the gradient derivatives may dominate the noise and reduce the effective number of parameters. One can also make a qualitative argument that early in learning, one does not need a precise estimate of the gradient since a general direction in weight space will suffice. Later, it will be necessary to make a more precise estimate for learning to converge. \n\n2.3 The Gibbs Distribution and the Learning Problem \n\nNote that the noise of Eq. 
(7) is gaussian since it is composed of a sum of random sign terms, which leads to a binomial distribution and is gaussian distributed for large P. Thus, in the continuous time limit, the learning problem has Langevin dynamics such that the time rate of change of a weight w_k is \n\ndw_k/dt = -η ∂E/∂w_k + ν_k   (11) \n\nand the learning problem converges in probability (Zinn-Justin, 1989), so that asymptotically Pr(w) ∝ exp[-βE(w)], where β is inversely proportional to the noise variance. \n\nTherefore, even though the gradient is noisy, one can still get a useful learning algorithm. Note that we can \"anneal\" ν_k by a variable perturbation method. Depending on the annealing schedule, this can result in a substantial speedup in learning over the one-weight-at-a-time perturbation technique. \n\n2.4 Similar Work in these Proceedings \n\nCoincidentally, there were three other papers with similar work at NIPS*92. This algorithm was presented with different approaches by both (Flower, 1993) and (Cauwenberghs, 1993).[1] A continuous time version was implemented in VLSI but not on a neural network by (Kirk, 1993). \n\n[1] We note that (Cauwenberghs, 1993) shows that multiple perturbations are not needed for learning if Δw is small enough, and he does not study them. This does not agree with our simulations (following), \n\n3 Simulations \n\n3.1 Learning with Various Perturbation Iterations \n\nWe tried some simple problems using this technique in software. We used a standard sigmoid activation function with unit gain, a fixed-size perturbation of 0.005 and random sign. The learning rate, η, was 0.1 and momentum, α, was 0. We varied the number of perturbation iterations per pattern presentation from 1 to 128 (2^l, where l varies from 0 to 7). We performed 10 runs for each condition and averaged the results. Fig. 
1a shows the average learning curves for a 6-input, 12-hidden, 1-output-unit parity problem as the number of perturbations per pattern presentation is varied. The symbol plotted is l. \n\n[Figure 1 appears here: learning curves for parity 6 (average of 10 runs) and replication 6 (average of 10 runs).] \n\nFigure 1. Learning curves for 6-12-1 parity and 6-6-6 replication. \n\nThere seems to be a critical number of perturbations, P_c, about 16 (l = 4) in this case, below which learning slows dramatically. \n\nWe repeated the measurements of Fig. 1a for different sizes of the parity problem using an N-2N-1 network. We also did these measurements on a different problem, replication or identity, where the task is to replicate the bit pattern of the input on the output. We used an N-N-N network for this task so that we have a comparison with the parity problem as N varies for roughly the same number of weights (2N² + 2N) in each network. The learning curves for the 6-6-6 problem are plotted in Fig. 1b. The critical value also seems to be 16 (l = 4). \n\n[Footnote 1, continued] perhaps because we do not decrease δw and η as learning proceeds. He did not check this for large problems as we did. In an implementation, one will not be able to reduce δw too much, so that the effect on the output error can be measured. It is also likely that multiple perturbations can be done more quickly than multiple pattern presentations, if learning speed is an issue. He also notes the importance of correlating with the change in error rather than the error alone as in (Dembo, 1990). 
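The role of the number of perturbation iterations can also be seen directly in the variance formula: averaging P parallel perturbations as in Eq. (7) shrinks the noise in the gradient estimate like the square root of W/P, per Eq. (10). The following sketch is our own illustrative code (the function name and the quadratic surface are assumptions, not the paper's simulator):

```python
import numpy as np

def perturbative_gradient(w, error_fn, delta=0.005, P=1, rng=None):
    # Average P parallel-perturbation measurements as in Eq. (7); the
    # noise in each gradient component then has standard deviation of
    # order sqrt(W/P), per Eq. (10).
    if rng is None:
        rng = np.random.default_rng()
    E0 = error_fn(w)
    est = np.zeros_like(w)
    for _ in range(P):
        dw = delta * rng.choice([-1.0, 1.0], size=w.shape)
        est += (error_fn(w + dw) - E0) / dw
    return est / P
```

For E(w) = Σ w_i² with W = 32 weights, the P = 100 estimate lands roughly ten times closer to the true gradient 2w than the P = 1 estimate, consistent with the √(W/P) scaling.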
\n\n3.2 Scaling of the Critical Value with Problem Size \n\nTo determine how the critical value of perturbation iterations scales, we tried a variety of problems besides the N-N-N replication and N-2N-1 parity. We added N-2N-N replication and N-N-1 parity to see how more weights affect the same problem. We also did N-N-N/2 edge counting, where the output is the number of sign changes in an ordered row of N inputs. Finally we did N-2N-N and N-N-N Hamming, where the output is the closest Hamming code for N inputs. We varied the number of perturbation iterations so that P = 1, 2, 5, 10, 20, 50, 100, 200, 400. \n\n[Figure 2 appears here: four panels - Edge N-N-N/2, Parity N-2N-1, Hamming N-2N-N, and Replication N-2N-N - plotting the critical number of perturbation iterations against the number of weights.] \n\nFigure 2. Critical value scaling for different problems. \n\nFig. 2 gives a feel for the effective scale of the problem by plotting the critical value of the number of perturbation iterations as a function of the number of weights for some of the problems we looked at. Note that the required number of iterations is not a steep function of the network size except for the parity problem. We speculate that the scaling properties are dependent on the shape of the error surface. If the derivatives in Eq. (9) are large in all dimensions (learning on a bowl-shaped surface), then the effective number of parameters is large and the variance of the noise term will be on the order of the number of weights, leading to a steep dependence in Fig. 2. 
If, however, there are only a few weight directions with significantly large error derivatives (learning on a taco shell), then the noise will scale at a slower rate than the number of weights, leading to a weak dependence of the critical value on problem size. This is actually a nice feature of parallel perturbative learning because it means learning will be noisy and slow in a bowl, where it's easy, but precise and fast in a taco shell, where it's hard. \n\nThe critical value is required for convergence at the end of learning but not at the start. This means it should be possible to anneal the number of perturbation iterations to achieve an additional speedup over the one-weight-at-a-time perturbation technique. We would also like to understand how to vary δw and η as learning proceeds. The stochastic approximation literature is likely to serve as a useful guide. \n\n3.3 Computational Geometry of Stochastic Gradient Descent \n\n[Figure 3 appears here: panel (a) sketches the relevant gradient vectors and angles on an error surface over the weights; panel (b) plots direction cosines for all nine perturbation values.] \n\nFigure 3. Computational Geometry of Stochastic Gradient Descent. \n\nFig. 3a shows some relevant gradient vectors and angles in the learning problem. For a particular pattern presentation, the true gradient, g_b, from a back-propagation calculation is compared with the one-weight-at-a-time gradient, g_o, from a perturbation, δw_i, in one weight direction. The gradient from perturbing all weights, g_m, adds a noise vector to g_o. By taking the normalized dot product between g_m and g_b, one obtains the direction cosine between the estimated and the true gradient direction. This is plotted in Fig. 3b for the 10-input N-N-1 parity problem for all nine perturbation values. The shaded bands increase in cosine (decrease in angle) as the number of perturbations goes from 1 to 400. 
Note that the angles are large but that learning still takes place. Note also that the dot product is almost always positive except for a few points at low perturbation numbers. Incidentally, by looking at plots of the true to one-weight-at-a-time angles (not shown), we see that the large angles are due almost entirely to the parallel perturbative noise term and not to the step size, δw. \n\n4 Outline of an Analog Implementation \n\nFig. 4 shows a diagram of a learning synapse using this perturbation technique. Note that its only inputs are a single bit representing the sign of the perturbation and a broadcast signal representing the change in the output error. Multiple perturbations can be averaged by the summing buffer, and the weight is stored as charge on a capacitor or floating-gate device. \n\nAn estimate of the power and area of an analog chip implementation gives the following: Using a standard 1.2 μm, double-poly technology, the synapse, with about 7 to 8 bits of resolution and which includes a 0.5 pF storage capacitor, weight refresh (Hochet, 1989) and update circuitry, can be fabricated with an area of about 1600 μm² and with a power dissipation of about 100 μW with continuous self-refresh. This translates into a chip of about 22,000 synapses at 2.2 watts on a 36 mm² die core. It is likely that the power requirements can be greatly reduced with a more relaxed refresh technique or with a suitable non-volatile analog storage technology. \n\n[Figure 4 appears here: circuit schematic of the learning synapse, showing the perturbation-sign input, the summing and integrating buffer, and the weight storage node.] \n\nFigure 4. Diagram of perturbative learning synapse. 
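The arithmetic each synapse of Fig. 4 must carry out locally is very small. The behavioral sketch below is our own software caricature of such a synapse, not the chip circuitry; the class and method names are invented for illustration, and the accumulator merely stands in for the summing/integrating buffer.

```python
class PerturbativeSynapse:
    # Behavioral model of one learning synapse: local state is a stored
    # weight plus an accumulator; its only inputs are the local
    # perturbation sign and the broadcast change in output error.
    def __init__(self, w0=0.0, delta=0.005, eta=0.1):
        self.w, self.delta, self.eta = w0, delta, eta
        self.sign = 1.0
        self.acc = 0.0

    def perturb(self, sign):
        # Apply the single-bit perturbation; the returned value is the
        # weight the network actually uses during the measurement.
        self.sign = sign
        return self.w + sign * self.delta

    def accumulate(self, dE):
        # Correlate the broadcast error change with the local sign.
        self.acc += dE / (self.sign * self.delta)

    def update(self, P):
        # Average over P accumulated perturbations, then apply Eq. (6).
        self.w -= self.eta * self.acc / P
        self.acc = 0.0
```

Driving a single synapse against a toy one-weight error E(w) = (w - 1)² converges to w ≈ 1, since with one weight the estimate of Eq. (3) is exact up to O(δw).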
\n\nWe intend to use our noise generation technique (Alspector, 1991) to provide uncorrelated perturbations potentially to thousands of synapses. Note also that the error signal can be generated by a simple resistor or a comparator followed by a summer. The difference signal can be generated by a simple differentiator. \n\n5 Conclusion \n\nWe have analyzed a parallel perturbative learning technique and shown that it should converge under the proper conditions. We have performed simulations on a variety of test problems to demonstrate the scaling behavior of this learning algorithm. We are continuing work to understand speedups possible in an analog VLSI implementation. Finally, we describe such an implementation. Future work will involve applying this technique to learning in recurrent networks. \n\nAcknowledgment \n\nWe thank Barak Pearlmutter for valuable and insightful discussions and Gert Cauwenberghs for making an advance copy of his paper available. This work has been partially supported by AFOSR contract F49620-90-C-0042, DEF. \n\nReferences \n\nJ. Alspector, J. W. Gannett, S. Haber, M. B. Parker, and R. Chu, \"A VLSI-Efficient Technique for Generating Multiple Uncorrelated Noise Sources and Its Application to Stochastic Neural Networks\", IEEE Trans. Circuits and Systems, 38, 109, (Jan., 1991). \n\nJ. Alspector, A. Jayakumar, and S. Luna, \"Experimental Evaluation of Learning in a Neural Microsystem\", in Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann (eds.), San Mateo, CA: Morgan Kaufmann Publishers (1992), pp. 871-878. \n\nG. Cauwenberghs, \"A Fast Stochastic Error-Descent Algorithm for Supervised Learning and Optimization\", in Advances in Neural Information Processing Systems, San Mateo, CA: Morgan Kaufmann Publishers, vol. 5, 1993. \n\nA. Dembo and T. 
Kailath, \"Model-Free Distributed Learning\", IEEE Trans. Neural Networks, 1, (1990) pp. 58-70. \n\nB. Flower and M. Jabri, \"Summed Weight Neuron Perturbation: An O(n) Improvement over Weight Perturbation\", in Advances in Neural Information Processing Systems, San Mateo, CA: Morgan Kaufmann Publishers, vol. 5, 1993. \n\nB. Hochet, \"Multivalued MOS Memory for Variable Synapse Neural Network\", Electronics Letters, vol. 25, no. 10, (May 11, 1989) pp. 669-670. \n\nM. Jabri and B. Flower, \"Weight Perturbation: An Optimal Architecture and Learning Technique for Analog VLSI Feedforward and Recurrent Multilayer Networks\", Neural Computation 3 (1991) pp. 546-565. \n\nD. Kirk, D. Kerns, K. Fleischer, and A. Barr, \"Analog VLSI Implementation of Gradient Descent\", in Advances in Neural Information Processing Systems, San Mateo, CA: Morgan Kaufmann Publishers, vol. 5, 1993. \n\nH. J. Kushner and D. S. Clark, \"Stochastic Approximation Methods for Constrained and Unconstrained Systems\", p. 58 ff., Springer-Verlag, New York, (1978). \n\nD. E. Rumelhart, G. E. Hinton, and R. J. Williams, \"Learning Internal Representations by Error Propagation\", in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations, D. E. Rumelhart and J. L. McClelland (eds.), MIT Press, Cambridge, MA (1986), p. 318. \n\nS. Venkatesh, \"Directed Drift: A New Linear Threshold Algorithm for Learning Binary Weights On-Line\", Journal of Computer and System Sciences, (1993), in press. \n\nB. Widrow and M. A. Lehr, \"30 Years of Adaptive Neural Networks: Perceptron, Madaline, and Backpropagation\", Proc. IEEE 78 (1990) pp. 1415-1442. \n\nJ. Zinn-Justin, \"Quantum Field Theory and Critical Phenomena\", p. 57 ff., Oxford University Press, New York, (1989). 
", "award": [], "sourceid": 681, "authors": [{"given_name": "J.", "family_name": "Alspector", "institution": null}, {"given_name": "R.", "family_name": "Meir", "institution": null}, {"given_name": "B.", "family_name": "Yuhas", "institution": null}, {"given_name": "A.", "family_name": "Jayakumar", "institution": null}, {"given_name": "D.", "family_name": "Lippe", "institution": null}]}