{"title": "The \"Softmax\" Nonlinearity: Derivation Using Statistical Mechanics and Useful Properties as a Multiterminal Analog Circuit Element", "book": "Advances in Neural Information Processing Systems", "page_first": 882, "page_last": 887, "abstract": "", "full_text": "The \"Softmax\" Nonlinearity: Derivation Using Statistical Mechanics and Useful Properties as a Multiterminal Analog Circuit Element\n\nI. M. Elfadel\nResearch Laboratory of Electronics\nMassachusetts Institute of Technology\nCambridge, MA 02139\n\nJ. L. Wyatt, Jr.\nResearch Laboratory of Electronics\nMassachusetts Institute of Technology\nCambridge, MA 02139\n\nAbstract\n\nWe use mean-field theory methods from Statistical Mechanics to derive the \"softmax\" nonlinearity from the discontinuous winner-take-all (WTA) mapping. We give two simple ways of implementing \"softmax\" as a multiterminal network element. One of these has a number of important network-theoretic properties. It is a reciprocal, passive, incrementally passive, nonlinear, resistive multiterminal element with a content function having the form of information-theoretic entropy. These properties should enable one to use this element in nonlinear RC networks with such other reciprocal elements as resistive fuses and constraint boxes to implement very high speed analog optimization algorithms using a minimum of hardware.\n\n1 Introduction\n\nIn order to efficiently implement nonlinear optimization algorithms in analog VLSI hardware, maximum use should be made of the natural properties of the silicon medium. Reciprocal circuit elements facilitate such an implementation since they can be combined with other reciprocal elements to form an analog network having Lyapunov-like functions: the network content or co-content. 
In this paper, we show a reciprocal implementation of the \"softmax\" nonlinearity that is usually used to enforce local competition between neurons [Peterson, 1989]. We show that the circuit is passive and incrementally passive, and we explicitly compute its content and co-content functions. This circuit adds a new element to the library of the analog circuit designer that can be combined with reciprocal constraint boxes [Harris, 1988] and nonlinear resistive fuses [Harris, 1989] to form fast, analog VLSI optimization networks.\n\n2 Derivation of the Softmax Nonlinearity\n\nTo a vector $y \\in \\mathbb{R}^n$ of distinct real numbers, the discrete winner-take-all (WTA) mapping $W$ assigns a vector of binary numbers by giving the value 1 to the component of $y$ corresponding to $\\max_{1 \\leq i \\leq n} y_i$ and the value 0 to the remaining components. Formally, $W$ is defined as\n\n$$W(y) = (W_1(y), \\ldots, W_n(y))^T$$\n\nwhere for every $1 \\leq j \\leq n$,\n\n$$W_j(y) = \\begin{cases} 1 & \\text{if } y_j > y_i \\text{ for all } 1 \\leq i \\leq n,\\ i \\neq j \\\\ 0 & \\text{otherwise} \\end{cases} \\qquad (1)$$\n\nFollowing [Geiger, 1991], we assign to the vector $y$ the \"energy\" function\n\n$$E_y(z) = -\\sum_{k=1}^{n} z_k y_k = -z^T y, \\quad z \\in V_n, \\qquad (2)$$\n\nwhere $V_n$ is the set of vertices of the unit simplex $S_n = \\{z \\in \\mathbb{R}^n : z_i \\geq 0,\\ 1 \\leq i \\leq n, \\text{ and } \\sum_{k=1}^{n} z_k = 1\\}$. Every vertex of the simplex encodes one possible winner. It is then easy to show that $W(y)$ is the solution to the linear programming problem\n\n$$\\max_{z \\in V_n} \\sum_{k=1}^{n} z_k y_k.$$\n\nMoreover, we can assign to the energy $E_y(z)$ the Gibbs distribution\n\n$$P_y(z) = P_y(z_1, \\ldots, z_n) = \\frac{e^{-E_y(z)/T}}{Z_T},$$\n\nwhere $T$ is the temperature of the heat bath and $Z_T$ is a normalizing constant. 
Then one can show that the mean of $z_j$, considered as a random variable, is given by [Geiger, 1991]\n\n$$F_j(y/T) = \\bar{z}_j = \\frac{e^{y_j/T}}{Z_T} = \\frac{e^{y_j/T}}{\\sum_{i=1}^{n} e^{y_i/T}}.$$\n\nThe mapping $F : \\mathbb{R}^n \\to \\mathbb{R}^n$ whose components are the $F_j$'s, $1 \\leq j \\leq n$, is the generalized sigmoid mapping [Peterson, 1989] or \"softmax\". It plays, in WTA networks, a role similar to that of the sigmoidal function in Hopfield and backpropagation networks [Hopfield, 1984, Rumelhart, 1986] and is usually used for enforcing competitive behavior among the neurons of a single cluster in networks of interacting clusters [Peterson, 1989, Waugh, 1993].\n\nFigure 1: A circuit implementation of softmax with 5 inputs and 5 outputs. This circuit is operated in subthreshold mode, takes the gate voltages as inputs and gives the drain currents as outputs. This circuit is not a reciprocal multiterminal element.\n\nFor $y \\in \\mathbb{R}^n$, we write $F_T(y) = F(y/T)$. The softmax mapping satisfies the following properties:\n\n1. The mapping $F_T$ converges pointwise to $W$ over $\\mathbb{R}^n$ as $T \\to 0$ and to the center of mass of $S_n$, $\\frac{1}{n} e = \\frac{1}{n}(1, 1, \\ldots, 1)^T \\in \\mathbb{R}^n$, as $T \\to +\\infty$.\n\n2. The Jacobian $DF$ of the softmax mapping is a symmetric $n \\times n$ matrix that satisfies\n\n$$DF(y) = \\mathrm{diag}(F_k(y)) - F(y) F(y)^T. \\qquad (3)$$\n\nIt is always singular, with the vector $e$ being the only eigenvector corresponding to the zero eigenvalue. Moreover, all its eigenvalues are upper-bounded by $\\max_{1 \\leq k \\leq n} F_k(y) < 1$.\n\n3. The softmax mapping is a gradient map, i.e., there exists a \"potential\" function $\\mathcal{P} : \\mathbb{R}^n \\to \\mathbb{R}$ such that $F = \\nabla \\mathcal{P}$. Moreover, $\\mathcal{P}$ is convex.\n\nThe symbol $\\mathcal{P}$ was chosen to indicate that it is a potential function. It should be noted that if $F$ is the gradient map of $\\mathcal{P}$, then $F_T$ is the gradient map of $T \\mathcal{P}_T$, where $\\mathcal{P}_T(y) = \\mathcal{P}(y/T)$. 
In a related paper [Elfadel, 1993], we have found that the convexity of $\\mathcal{P}$ is essential in the study of the global dynamics of analog WTA networks. Another instance where the convexity of $\\mathcal{P}$ was found important is reported in [Kosowsky, 1991], where a mean-field algorithm was proposed to solve the linear assignment problem.\n\nFigure 2: Modified circuit implementation of softmax. In this circuit all the transistors are diode-connected, and all the drain currents are well within the saturation region. Note that for every transistor, both the voltage input and the current output are on the same wire, the drain. This circuit is a reciprocal multiterminal element.\n\n3 Circuit Implementations and Properties\n\nWe now propose two simple CMOS circuit implementations of the generalized sigmoid mapping. See Figures 1 and 2. When the transistors are operated in the subthreshold region, the drain currents $i_1, \\ldots, i_n$ are the outputs of a softmax mapping whose inputs are the gate voltages $v_1, \\ldots, v_n$. The explicit $v$-$i$ characteristics are given by\n\n$$i_k = I_c \\frac{\\exp(\\kappa v_k / V_0)}{\\sum_{m=1}^{n} \\exp(\\kappa v_m / V_0)}, \\quad 1 \\leq k \\leq n, \\qquad (4)$$\n\nwhere $\\kappa$ is a process-dependent parameter and $V_0$ is the thermal voltage ([Mead, 1989], p. 36). These circuits have the interesting properties of being unclocked and parallel. Moreover, the competition constraint is imposed naturally through the KCL equation and the control current source $I_c$. From a complexity point of view, this circuit is most striking since it computes $n$ exponentials, $n$ ratios, and $n - 1$ sums in one time constant! A derivation similar to the above was independently carried out in [Waugh, 1993] for the circuit of Figure 1. Although the first circuit implements softmax, it has two shortcomings. The first is practical: the separation between inputs and outputs implies additional wiring. 
The second is theoretical: this circuit is not a reciprocal multiterminal element, and therefore it cannot be combined with other reciprocal elements like resistive fuses or constraint boxes to design analog, reciprocal optimization networks.\n\nTherefore, we only consider the circuit of Figure 2 and let $v$ and $i$ be the $n$-dimensional vectors representing the input voltages and the output currents, respectively.\u00b9 The softmax mapping $i = F(v)$ represents a voltage-controlled, nonlinear, resistive multiterminal element. The main result of our paper is the following:\u00b2\n\n\u00b9 Compare with Lazzaro et al.'s WTA circuit [Lazzaro, 1989], whose inputs are currents and outputs are voltages.\n\nTheorem 1 The softmax multiterminal element $F$ is reciprocal, passive, locally passive and has a co-content function given by\n\n$$\\Phi(v) = \\frac{I_c V_0}{\\kappa} \\ln \\sum_{m=1}^{n} \\exp(\\kappa v_m / V_0) \\qquad (5)$$\n\nand a content function given by\n\n$$\\Phi^*(i) = \\frac{I_c V_0}{\\kappa} \\sum_{m=1}^{n} \\frac{i_m}{I_c} \\ln \\frac{i_m}{I_c}. \\qquad (6)$$\n\nThus, with this reciprocal, locally passive implementation of the softmax mapping, we have added a new circuit element to the library of the circuit designer. Note that this circuit element implements in an analog way the constraint $\\sum_{k=1}^{n} y_k = 1$ defining the unit simplex $S_n$. Therefore, it can be considered a nonlinear constraint box [Harris, 1988] that can be used in reciprocal networks to implement analog optimization algorithms.\n\nThe expression of $\\Phi^*$ is a strong reminder of the information-theoretic definition of entropy. We suggest the name \"entropic resistor\" for the circuit of Figure 2.\n\n4 Conclusions\n\nIn this paper, we have discussed another instance of convergence between the statistical physics paradigm of Gibbs distributions and analog circuit implementation in the context of the winner-take-all function. 
The problem of using the simple, reciprocal circuit implementation of softmax to design analog networks for finding near-optimal solutions of the linear assignment problem [Kosowsky, 1991] or the quadratic assignment problem [Simic, 1991] is still open and should prove a challenging task for analog circuit designers.\n\n\u00b2 The concepts of reciprocity, passivity, content, and co-content are fundamental to nonlinear circuit theory. They are carefully developed in [Wyatt, 1992].\n\nAcknowledgements\n\nI. M. Elfadel would like to thank Alan Yuille for many helpful discussions and Fred Waugh for helpful discussions and for communicating the preprint of [Waugh, 1993]. This work was supported by the National Science Foundation under Grant No. MIP-91-17724.\n\nReferences\n\n[Peterson, 1989] C. Peterson and B. Soderberg. A new method for mapping optimization problems onto neural networks. International Journal of Neural Systems, 1(1):3-22, 1989.\n\n[Harris, 1988] J. G. Harris. Solving early vision problems with VLSI constraint networks. In Neural Architectures for Computer Vision Workshop, AAAI-88, Minneapolis, MN, 1988.\n\n[Harris, 1989] J. G. Harris, C. Koch, J. Luo, and J. Wyatt. Resistive fuses: Analog hardware for detecting discontinuities in early vision. In C. Mead and M. Ismail, editors, Analog VLSI Implementation of Neural Systems. Kluwer Academic Publishers, 1989.\n\n[Geiger, 1991] D. Geiger and A. Yuille. A common framework for image segmentation. Int. J. Computer Vision, 6:227-253, 1991.\n\n[Hopfield, 1984] J. J. Hopfield. Neurons with graded responses have collective computational properties like those of two-state neurons. Proc. Nat'l Acad. Sci., USA, 81:3088-3092, 1984.\n\n[Rumelhart, 1986] D. E. Rumelhart et al. Parallel Distributed Processing, volume 1. MIT Press, 1986.\n\n[Waugh, 1993] F. R. Waugh and R. 
M. Westervelt. Analog neural networks with local competition. I. Dynamics and stability. Physical Review E, 1993. In press.\n\n[Elfadel, 1993] I. M. Elfadel. Global dynamics of winner-take-all networks. In SPIE Proceedings, Stochastic and Neural Methods in Image Processing, volume 2032, pages 127-137, San Diego, CA, 1993.\n\n[Kosowsky, 1991] J. J. Kosowsky and A. L. Yuille. The invisible hand algorithm: Solving the assignment problem with statistical physics. TR # 91-1, Harvard Robotics Laboratory, 1991.\n\n[Mead, 1989] Carver Mead. Analog VLSI and Neural Systems. Addison-Wesley, 1989.\n\n[Lazzaro, 1989] J. Lazzaro, S. Ryckebusch, M. Mahowald, and C. Mead. Winner-take-all circuits of O(n) complexity. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems I, pages 703-711. Morgan Kaufmann, 1989.\n\n[Wyatt, 1992] J. L. Wyatt. Lectures on Nonlinear Circuit Theory. MIT VLSI memo # 92-685, 1992.\n\n[Simic, 1991] P. D. Simic. Constrained nets for graph matching and other quadratic assignment problems. Neural Computation, 3:169-281, 1991.", "award": [], "sourceid": 877, "authors": [{"given_name": "I. M.", "family_name": "Elfadel", "institution": null}, {"given_name": "J. L.", "family_name": "Wyatt, Jr.", "institution": null}]}