{"title": "How to Choose an Activation Function", "book": "Advances in Neural Information Processing Systems", "page_first": 319, "page_last": 326, "abstract": null, "full_text": "How to Choose an Activation Function \n\nH. N. Mhaskar \nDepartment of Mathematics \nCalifornia State University \nLos Angeles, CA 90032 \nhmhaska@calstatela.edu \n\nC. A. Micchelli \nIBM Watson Research Center \nP. O. Box 218 \nYorktown Heights, NY 10598 \ncam@watson.ibm.com \n\nAbstract \n\nWe study the complexity problem in artificial feedforward neural networks designed to approximate real-valued functions of several real variables; i.e., we estimate the number of neurons in a network required to ensure a given degree of approximation to every function in a given function class. We indicate how to construct networks with the indicated number of neurons evaluating standard activation functions. Our general theorem shows that the smoother the activation function, the better the rate of approximation. \n\n1 INTRODUCTION \n\nThe approximation capabilities of feedforward neural networks with a single hidden layer have been studied by many authors, e.g., [1, 2, 5]. In [10], we have shown that such a network using practically any nonlinear activation function can approximate any continuous function of any number of real variables on any compact set to any desired degree of accuracy. \n\nA central question in this theory is the following. If one needs to approximate a function from a known class of functions to a prescribed accuracy, how many neurons will be necessary to accomplish this approximation for all functions in the class? For example, Barron shows in [1] that it is possible to approximate any function satisfying certain conditions on its Fourier transform within an L^2 error of O(1/n) using a feedforward neural network with one hidden layer comprising n^2 neurons, each with a sigmoidal activation function.
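As a purely numerical illustration of the kind of network in question (a single hidden layer of sigmoidal units combined linearly), the following sketch fits such a network to a smooth target by least squares over randomly drawn, frozen hidden weights. This is not Barron's construction, nor any of the constructions analyzed below; the target sin(πx), the weight ranges, and the evaluation grid are arbitrary illustrative choices.

```python
import numpy as np

def fit_sigmoidal_network(f, n_neurons, n_grid=200, seed=0):
    """Least-squares fit of a one-hidden-layer tanh network to f on [-1, 1].

    Inner weights and biases are drawn at random and frozen; only the outer
    (linear) coefficients are solved for.  A toy stand-in for the networks
    discussed in the text, not a construction from the paper.
    """
    rng = np.random.default_rng(seed)
    x = np.linspace(-1.0, 1.0, n_grid)
    w = rng.uniform(-10.0, 10.0, n_neurons)   # hidden-layer weights
    b = rng.uniform(-10.0, 10.0, n_neurons)   # hidden-layer biases
    phi = np.tanh(np.outer(x, w) + b)         # n_grid x n_neurons feature matrix
    coef, *_ = np.linalg.lstsq(phi, f(x), rcond=None)
    residual = f(x) - phi @ coef
    return np.max(np.abs(residual))           # uniform error on the grid

f = lambda x: np.sin(np.pi * x)
err_small = fit_sigmoidal_network(f, 5)
err_large = fit_sigmoidal_network(f, 50)
print(err_small, err_large)   # error typically shrinks as neurons are added
```

With the seed fixed the experiment is deterministic; increasing the number of hidden units enlarges the feature space available to the least-squares problem, so the uniform error on the grid decreases.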
On the contrary, if one is interested in a class of functions of s variables with a bounded gradient on [-1,1]^s, then in order to accomplish this order of approximation, it is necessary to use at least O(n^s) neurons, regardless of the activation function (cf. [3]). \n\nIn this paper, our main interest is to consider the problem of approximating a function which is known only to have a certain number of smooth derivatives. We investigate the question of deciding which activation function will require how many neurons to achieve a given order of approximation for all such functions. We will describe a very general theorem and explain how to construct networks with various activation functions, such as the Gaussian and other radial basis functions advocated by Girosi and Poggio [13], as well as the classical squashing function and other sigmoidal functions. \n\nIn the next section, we develop some notation and briefly review some known facts about approximation order with a sigmoidal-type activation function. In Section 3, we discuss our general theorem. This theorem is applied in Section 4 to yield the approximation bounds for various special functions which are commonly in use. In Section 5, we briefly describe certain dimension-independent bounds similar to those due to Barron [1], but applicable with a general activation function. Section 6 summarizes our results. \n\n2 SIGMOIDAL-TYPE ACTIVATION FUNCTIONS \n\nIn this section, we develop some notation and review certain known facts. For the sake of concreteness, we consider only uniform approximation, but our results are valid also for other L^p norms with minor modifications, if any. Let s ≥ 1 be the number of input variables. The class of all continuous functions on [-1,1]^s will be denoted by C^s. The class of all 2π-periodic continuous functions will be denoted by C^{s*}.
The uniform norm in either case will be denoted by || · ||. Let \Pi_{n,l,s,\sigma} denote the set of all possible outputs of feedforward neural networks consisting of n neurons arranged in l hidden layers, each neuron evaluating an activation function σ, where the inputs to the network are from R^s. It is customary to assume more a priori knowledge about the target function than the fact that it belongs to C^s or C^{s*}. For example, one may assume that it has continuous derivatives of order r ≥ 1 and that the sum of the norms of all the partial derivatives up to (and including) order r is bounded. Since we are interested mainly in the relative error in approximation, we may assume that the target function is normalized so that this sum of the norms is bounded above by 1. The class of all functions satisfying this condition will be denoted by W_r^s (or W_r^{s*} if the functions are periodic). In this paper, we are interested in the universal approximation of the classes W_r^s (and their periodic versions). Specifically, we are interested in estimating the quantity \n\n(2.1) \sup_{f \in W_r^s} E_{n,l,s,\sigma}(f), \n\nwhere \n\n(2.2) E_{n,l,s,\sigma}(f) := \min_{P \in \Pi_{n,l,s,\sigma}} ||f - P||. \n\nThe quantity E_{n,l,s,\sigma}(f) measures the theoretically best possible order of approximation of an individual function f by networks with n neurons. We are interested in determining the order that such a network can possibly achieve for all functions in the given class. An equivalent dual formulation is to estimate \n\n(2.3) E_{n,l,s,\sigma}(W_r^s) := \min\{ m \in Z : \sup_{f \in W_r^s} E_{m,l,s,\sigma}(f) \le 1/n \}. \n\nThis quantity measures the minimum number of neurons required to obtain accuracy 1/n for all functions in the class W_r^s. An analogous definition is assumed for W_r^{s*} in place of W_r^s.
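The dual quantity in (2.3), the smallest approximant size achieving accuracy 1/n, can be mimicked numerically in a simple univariate setting. The sketch below uses a hypothetical periodic target whose Fourier cosine coefficients decay like k^(-4) and counts the smallest trigonometric order m whose truncation tail is at most 1/n; the target and the decay exponent are illustrative choices, not taken from the paper, and trigonometric orders stand in for neuron counts.

```python
import numpy as np

# Hypothetical univariate periodic target:  f(x) = sum_{k>=1} k^(-4) cos(kx).
# Truncating the series at order m is one admissible trigonometric polynomial,
# so the best uniform error at order m is bounded by the coefficient tail
#     tail(m) = sum_{k>m} k^(-4)   (~ 1/(3 m^3) for large m).
K = 200_000                                   # finite cutoff standing in for infinity
coeffs = np.arange(1, K + 1, dtype=float) ** -4.0
tail = coeffs.sum() - np.cumsum(coeffs)       # tail[i] = sum of coeffs beyond order i+1

def min_order(n: int) -> int:
    """Smallest trig order m with tail(m) <= 1/n: an upper-bound proxy for the
    minimum approximant size in the spirit of (2.3), with s = 1."""
    m = 1
    while tail[m - 1] > 1.0 / n:
        m += 1
    return m

for n in (10, 100, 1000, 10000):
    print(n, min_order(n))                    # m grows roughly like n^(1/3)
```

The observed sublinear growth of m with n illustrates how smoother targets (faster coefficient decay) need fewer terms for a prescribed accuracy, the theme quantified by the n^{s/r} rates discussed next.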
\nLet H_n^s denote the class of all s-variable trigonometric polynomials of order at most n and, for a continuous function f, 2π-periodic in each of its s variables, let \n\n(2.4) E_n^s(f) := \min_{P \in H_n^s} ||f - P||. \n\nWe observe that H_n^s can be thought of as a subclass of all outputs of networks with a single hidden layer comprising at most (2n+1)^s neurons, each evaluating the activation function sin x. It is then well known that \n\n(2.5) \sup_{f \in W_r^{s*}} E_n^s(f) \le c n^{-r}. \n\nHere and in the sequel, c, c_1, ... will denote positive constants independent of the functions and the number of neurons involved, but generally dependent on the other parameters of the problem, such as r, s and σ. Moreover, several constructions for the approximating trigonometric polynomials involved in (2.5) are also well known. In the dual formulation, (2.5) states that if σ(x) := sin x then \n\n(2.6) E_{n,1,s,\sigma}(W_r^{s*}) = O(n^{s/r}). \n\nIt can be proved [3] that any \"reasonable\" approximation process that aims to approximate all functions in W_r^{s*} up to an order of accuracy 1/n must necessarily depend upon at least O(n^{s/r}) parameters. Thus, the activation function sin x provides optimal convergence rates for the class W_r^{s*}. \n\nThe problem of approximating an r times continuously differentiable function f : [-1,1]^s → R can be reduced to that of approximating another function from the corresponding periodic class as follows. We take an infinitely many times differentiable function ψ which is equal to 1 on [-2,2]^s and 0 outside of [-π,π]^s. The function fψ (with f extended smoothly beyond [-1,1]^s if necessary) can then be extended as a 2π-periodic function. This function is r times continuously differentiable and its derivatives can be bounded by the derivatives of f using the Leibniz formula. A function that approximates this 2π-periodic function also approximates f on [-1,1]^s with the same order of approximation. In contrast, it is not customary to choose the activation function to be periodic.
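The infinitely differentiable cutoff ψ used in this periodization argument is not given explicitly in the text; one standard textbook construction uses the bump ingredient g(t) = exp(-1/t) for t > 0 and 0 otherwise. The univariate version below equals 1 on [-2, 2] and vanishes outside (-π, π); a product ψ(x_1)···ψ(x_s) yields the multivariate cutoff.

```python
import math

def g(t: float) -> float:
    """C-infinity bump ingredient: exp(-1/t) for t > 0, else 0."""
    return math.exp(-1.0 / t) if t > 0 else 0.0

def psi(x: float) -> float:
    """Infinitely differentiable cutoff: 1 on [-2, 2], 0 outside (-pi, pi).

    A standard smooth-step construction, shown only to illustrate the
    periodization step; the paper does not specify a formula for psi.
    """
    a, b = 2.0, math.pi            # psi == 1 for |x| <= a, psi == 0 for |x| >= b
    t = abs(x)
    num = g(b - t)                 # positive exactly when |x| < b
    return num / (num + g(t - a))  # denominator is always positive

# Identically 1 on [-2, 2], strictly between 0 and 1 on (2, pi), 0 beyond:
print(psi(0.0), psi(2.0), psi(3.0), psi(math.pi))
```

For |x| ≤ 2 the term g(t - a) vanishes, so the quotient is exactly 1; for |x| ≥ π the numerator vanishes, so ψ is exactly 0; smoothness follows because g is C^∞ with all derivatives vanishing at 0.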
\n\nIn [10] we introduced the notion of a higher order sigmoidal function as follows. Let k ≥ 0. We say that a function σ : R → R is sigmoidal of order k if \n\n(2.7) \lim_{x \to \infty} \sigma(x)/x^k = 1, \qquad \lim_{x \to -\infty} \sigma(x)/x^k = 0, \n\nand \n\n(2.8) |\sigma(x)| \le c(1 + |x|)^k, \qquad x \in R. \n\nA sigmoidal function of order 0 is thus the customary bounded sigmoidal function. \n\nWe proved in [10] that for any integer r ≥ 1 and a sigmoidal function σ of order r - 1, we have \n\n(2.9) E_{n,1,s,\sigma}(W_r^s) = O(n^{s/r}) if s = 1, and E_{n,1,s,\sigma}(W_r^s) = O(n^{s/r} \log n) if s ≥ 2. \n\nSubsequently, Mhaskar showed in [6] that if σ is a sigmoidal function of order k ≥ 2 and r ≥ 1 then, with l = O(\log r / \log k) hidden layers, \n\n(2.10) E_{n,l,s,\sigma}(W_r^s) = O(n^{s/r}). \n\nThus, an optimal network can be constructed using a sigmoidal function of higher order. During the course of the proofs in [10] and [6], we actually constructed the networks explicitly. The various features of these constructions from the connectionist point of view are discussed in [7, 8, 9]. \n\nIn this paper, we take a different viewpoint. We wish to determine which activation function leads to what approximation order. As remarked above, for the approximation of periodic functions, the periodic activation function sin x provides an optimal network. Therefore, we will investigate the degree of approximation by neural networks first in terms of a general periodic activation function and then apply these results to the case when the activation function is not periodic. \n\n3 A GENERAL THEOREM \n\nIn this section, we discuss the degree of approximation of periodic functions using periodic activation functions. It is our objective to include the case of radial basis functions as well as the usual \"first order\" neural networks in our discussion. To encompass both of these cases, we discuss the following general formulation. Let s ≥ d ≥ 1 be integers and φ ∈ C^{d*}. We will consider the approximation of functions in C^{s*}
by linear combinations of quantities of the form φ(Ax + t), where A is a d × s matrix and t ∈ R^d. (In general, both A and t are parameters of the network.) When d = s, A is the identity matrix and φ is a radial function, then a linear combination of n such quantities represents the output of a radial basis function network with n neurons. When d = 1, we have the usual neural network with one hidden layer and periodic activation function φ. \n\nWe define the Fourier coefficients of φ by the formula \n\n(3.1) \hat{\phi}(m) := \frac{1}{(2\pi)^d} \int_{[-\pi,\pi]^d} \phi(t) e^{-i m \cdot t} \, dt, \qquad m \in Z^d. \n\nLet \n\n(3.2) S_\phi := \{ m \in Z^d : \hat{\phi}(m) \ne 0 \} \n\nand assume that there is a set J containing d × s matrices with integer entries such that \n\n(3.3) Z^s = \{ A^T k : k \in S_\phi, A \in J \}, \n\nwhere A^T denotes the transpose of A. If d = 1 and \hat{\phi}(1) ≠ 0 (the neural network case) then we may choose S_φ = {1} and J to be Z^s (considered as row vectors). If d = s and φ is a function with none of its Fourier coefficients equal to zero (the radial basis case) then we may choose S_φ = Z^s and J = {I_{s×s}}. For m ∈ Z^s, we let k_m be the multi-integer with minimum magnitude such that m = A^T k_m for some A = A_m ∈ J. Our estimates will need the quantities \n\n(3.4) m_n := \min\{ |\hat{\phi}(k_m)| : -2n \le m \le 2n \} \n\nand \n\n(3.5) N_n := \max\{ |k_m| : -2n \le m \le 2n \}, \n\nwhere |k_m| is the maximum absolute value of the components of k_m. In the neural network case, we have m_n = |\hat{\phi}(1)| and N_n = 1. In the radial basis case, N_n = 2n. Our main theorem can be formulated as follows. \n\nTHEOREM 3.1. Let s ≥ d ≥ 1, n ≥ 1 and N ≥ N_n be integers, f ∈ C^{s*}, φ ∈ C^{d*}. It is possible to construct a network \n\n(3.6) G(x) := \sum_j d_j \phi(A_j x + t_j) \n\nsuch that \n\n(3.7) \n\nIn (3.6), the sum contains at most O(n^s N^d) terms, A_j ∈ J, t_j ∈ R^d, and d_j are linear functionals of f, depending upon n, N,