{"title": "Computing with Arrays of Bell-Shaped and Sigmoid Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 735, "page_last": 742, "abstract": null, "full_text": "Computing with Arrays of Bell-Shaped and \n\nSigmoid Functions \n\nJet Propulsion Laboratory \n\nCalifornia Institute of Technology \n\nPierre Baldi\u00b7 \n\nPasadena, CA 91109 \n\nAbstract \n\nWe consider feed-forward neural networks with one non-linear hidden layer \nand linear output units. The transfer function in the hidden layer are ei(cid:173)\nther bell-shaped or sigmoid. In the bell-shaped case, we show how Bern(cid:173)\nstein polynomials on one hand and the theory of the heat equation on the \nother are relevant for understanding the properties of the corresponding \nnetworks. In particular, these techniques yield simple proofs of universal \napproximation properties, i.e. of the fact that any reasonable function can \nbe approximated to any degree of precision by a linear combination of bell(cid:173)\nshaped functions. In addition, in this framework the problem of learning \nis equivalent to the problem of reversing the time course of a diffusion pro(cid:173)\ncess. The results obtained in the bell-shaped case can then be applied to \nthe case of sigmoid transfer functions in the hidden layer, yielding similar \nuniversality results. A conjecture related to the problem of generalization \nis briefly examined. \n\n1 \n\nINTRODUCTION \n\nBell-shaped response curves are commonly found in biological neurons whenever a \nnatural metric exist on the corresponding relevant stimulus variable (orientation, \nposition in space, frequency, time delay, ... ). As a result, they are often used in \nneural models in different context ranging from resolution enhancement and inter(cid:173)\npolation to learning (see, for instance, Baldi et al. (1988), Moody et al. (1989) \n\n*and Division of Biology, California Institute of Technology. 
The complete title of this paper should read: "Computing with arrays of bell-shaped and sigmoid functions. Bernstein polynomials, the heat equation and universal approximation properties".

and Poggio et al. (1990)). Consider then the problem of approximating a function y = f(x) by a weighted sum of bell-shaped functions B(k, x), i.e. of finding a suitably good set of weights H(k) satisfying

f(x) \approx \sum_k H(k) B(k, x).   (1)

In neural network terminology, this corresponds to using a feed-forward network with a unique hidden layer of bell-shaped units and a linear output layer. In this note, we first briefly point out how this question is related to two different mathematical concepts: Bernstein polynomials on one hand and the heat equation on the other. The former shows how such an approximation is always possible for any reasonable function, whereas through the latter the problem of learning, that is of finding H(k), is equivalent to reversing the time course of a diffusion process. For simplicity, the relevant ideas are presented in one dimension. However, the extension to the general setting is straightforward and will be sketched in each case. We then indicate how these ideas can be applied to similar neural networks with sigmoid transfer functions in the hidden layer. A conjecture related to the problem of generalization is briefly examined.

2 BERNSTEIN POLYNOMIALS

In this section, without any loss of generality, we assume that all the functions to be considered are defined over the interval [0,1]. For a fixed integer n, there are n+1 Bernstein polynomials of degree n (see, for instance, Feller (1971)) given by

B_n(k, x) = \binom{n}{k} x^k (1-x)^{n-k}.   (2)

B_n(k, x) can be interpreted as being the probability of having k successes in a coin flipping experiment of duration n, where x represents the probability of a single success.
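As a brief illustration, the basis functions of (2) are easy to compute directly; the helper name `bernstein` below is ours, not the paper's:

```python
from math import comb

def bernstein(n, k, x):
    """B_n(k, x) = C(n, k) x^k (1 - x)^(n - k): the probability of
    k successes in n coin flips, each with success probability x."""
    return comb(n, k) * x**k * (1 - x)**(n - k)

# The basis functions sum to 1 at every x (partition of unity).
assert abs(sum(bernstein(10, k, 0.3) for k in range(11)) - 1.0) < 1e-12

# Each B_n(k, .) is bell-shaped with its peak at x = k/n.
xs = [i / 1000 for i in range(1001)]
peak = max(xs, key=lambda x: bernstein(10, 3, x))
assert abs(peak - 3 / 10) < 1e-3
```

The peak location k/n is exactly the property used in the next paragraph.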
It is easy to see that B_n(k, x) is bell-shaped and reaches its maximum for x = k/n. Can we then approximate a function f using linear combinations of Bernstein polynomials of degree n? Let us first consider, as an example, the simple case of the identity function f(x) = x (x \in [0,1]). If we interpret x as the probability of success on a single coin toss, then the expected number of successes in n trials is given by

\sum_{k=0}^{n} k \binom{n}{k} x^k (1-x)^{n-k} = nx   (3)

or equivalently

\sum_{k=0}^{n} \frac{k}{n} \binom{n}{k} x^k (1-x)^{n-k} = x.   (4)

The remarkable theorem of Bernstein is that (4) remains approximately true for a general function f. More precisely:

Theorem: Assume f is a bounded function defined over the interval [0,1]. Then

\lim_{n \to \infty} \sum_{k=0}^{n} f\!\left(\frac{k}{n}\right) \binom{n}{k} x^k (1-x)^{n-k} = f(x)   (5)

at any point x where f is continuous. Moreover, if f is continuous everywhere, the sequence in (5) approaches f uniformly.

Proof: The proof is beautiful and elementary. It is easy to see that

\left| \sum_{k=0}^{n} f\!\left(\frac{k}{n}\right) \binom{n}{k} x^k (1-x)^{n-k} - f(x) \right| \le \sum_{|x - k/n| < \delta} \left| f\!\left(\frac{k}{n}\right) - f(x) \right| \binom{n}{k} x^k (1-x)^{n-k} + \sum_{|x - k/n| \ge \delta} \left| f\!\left(\frac{k}{n}\right) - f(x) \right| \binom{n}{k} x^k (1-x)^{n-k}

for any 0 < \delta < 1. To bound the first term in the right hand side of this inequality, we use the fact that for fixed \epsilon and for n large enough, at a point of continuity x, we can find a \delta such that |f(x) - f(k/n)| < \epsilon as soon as |x - k/n| < \delta. Thus the first term is bounded by \epsilon. If f is continuous everywhere, then it is uniformly continuous and \delta can be found independently of x. For the second term, since f is bounded (|f(x)| \le M), we have |f(x) - f(k/n)| \le 2M. Now we use Tchebycheff's inequality (P(|X - E(X)| \ge a) \le (\mathrm{Var}\,X)/a^2) to bound the tail of the binomial series

\sum_{|x - k/n| \ge \delta} \binom{n}{k} x^k (1-x)^{n-k} \le \frac{nx(1-x)}{\delta^2 n^2} \le \frac{1}{4n\delta^2}.

Collecting terms, we finally get

\left| \sum_{k=0}^{n} f\!\left(\frac{k}{n}\right) \binom{n}{k} x^k (1-x)^{n-k} - f(x) \right| \le \epsilon + \frac{M}{2n\delta^2}

which completes the proof.

Bernstein's theorem provides a probabilistic constructive proof of Weierstrass' theorem, which asserts that every continuous function over a compact set can be uniformly approximated by a sequence of polynomials.
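The sum in (5) is directly computable. The sketch below approximates an illustrative target f(x) = sin(\pi x) (our choice, not the paper's) and checks that the maximal error shrinks as n grows, consistent with the theorem:

```python
from math import comb, sin, pi

def bernstein_approx(f, n, x):
    """Evaluate the Bernstein sum of eq. (5): sum_k f(k/n) B_n(k, x)."""
    return sum(f(k / n) * comb(n, k) * x**k * (1 - x)**(n - k)
               for k in range(n + 1))

f = lambda x: sin(pi * x)

def err(n):
    """Maximum approximation error over a grid of [0, 1]."""
    return max(abs(bernstein_approx(f, n, i / 100) - f(i / 100))
               for i in range(101))

# Convergence holds, but is slow (roughly O(1/n) for smooth f),
# as the text notes when discussing the price paid for smoothness.
assert err(200) < err(20) < err(5)
assert err(200) < 0.01
```

Note that the weights used are simply H(k) = f(k/n), the fact emphasized in the connectionist interpretation that follows.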
Its \"connectionist\" interpre(cid:173)\ntation is that every reasonable function can be computed by a two layer network \nconsisting of one array of equally spaced bell-shaped detectors feeding into one lin(cid:173)\near output unit. In addition, the weighting function H{k) is the function 1 itself \n(see also Baldi et al. (1988)). Notice that the shape of the functions Bn ( k, x) in the \narray depends on k: in the center (k :::::::: n/2) they are very symmetric and similar to \ngaussians, as one moves towards the periphery the shape becomes less symmetric. \nTwo additional significant properties of Bernstein polynomials are that, for fixed \nn, they form a partition of unity: Ek Bn(k, x) = (x + (1 - x))n = 1 and that they \nhave constant energy f01 Bn(k, x) = 1/(n + 1). One important advantage of the \napproximation defined by (5) is its great smoothness. If 1 is differentiable, then not \nonly (5) holds but also \n\nd (~ (k) (n) k ( \n\n. \nhm -d ~ 1 -\nn-oo x \nn \n\nk x 1 - x \n\nk=O \n\n)n k) \n\n-\n\ndl \n-+-d \nx \n\n(6) \n\n\f738 \n\nBaldi \n\nuniformly on [0,1] and the same is true for higher order derivatives (see, for instance, \nDavis (1963\u00bb. Thus the Bernstein polynomials provide simultaneous approxima(cid:173)\ntion of a function and of its derivatives. In particular, they preserve the convexity \nproperties of the function f being approximated and mimic extremely well its qual(cid:173)\nitative behavior. The price to be paid is in precision, for the convergence in (5) \ncan sometimes be slow. Good qualitative properties of the approximation may \nbe relevant for biological systems, whereas precision there may not be a problem, \nespecially in light of the fact that n is often large. \n\nFinally, this approach can be extended to the general case of an input space with d \ndimensions by defining the generalized Bernstein polynomials \n\nIf f(Xl, ... 
,Xd) is a continuous function over the hypercube [0, l]d, then \n\n(8) \n\napproaches uniformly f on [0, l]d as min ni -+ 00. \n\n3 LEARNING AND THE HEAT EQUATION \n\nConsider again the general problem of approximating a function f by a linear com(cid:173)\nbination of bell-shaped functions, but where now the bell-shaped functions are gaus(cid:173)\nsians B( w, x), of the form \n\nB( \n\nW,X = ~ \n\n) \n\n1 e-(x-w)2/2 q 2 \n\nV 27rU \n\n(9) \n\nThe fixed centers w of the gaussians are distributed in space according to a density \np( w) (this enables one to treat the continuous and discrete case together and also \nto include the case where the centers are not evenly distributed). This idea was \ndirectly suggested by a presentation of R. Durbin (1990), where the limiting case of \nan infinite number oflogistic hidden units in a connectionist network was considered. \nIn this setting, we are trying to express f as \n\nor \n\nf(x)::::::: \n\n1+00 \nf(x) ::::::: 1+00 \n\n-00 \n\nh(w) \n\n2 \n\n2 \n\ne-(x-w) /217 p(w)dw \n\n1 \n\n..J2;u \n\n1 H(w)e-(x-w)2/ 2q 2 dw \n\n-00 V27ru \n\n(10) \n\n(11) \n\nwhere H = hp. Now, diffusion processes or propagation of heat are usually modeled \nby a partial differential equation of the type \n\n(12) \n\n\fComputing with Arrays of Bell-Shaped and Sigmoid functions \n\n739 \n\n(the heat equation) where u(x, t) represents the temperature (or the concentration) \nat position x at time t. Given a set of initial conditions of the form u( x, 0) = g( x), \nthen the distribution of temperatures at time t is given by \n\nu(x, t) = \n\n1+00 \n\n-00 \n\ng(w) __ e-(x-w) / 4tdw. \n\n2 \n\n1 \n\nv47rt \n\n(13) \n\ntime t provided 9 is continuous, Ig(x)1 = O(exp(hx2\u00bb and 0 ~ t < 1/4h. 
Under \nTechnically, (13) can be shown to give the correct distribution of temperatures at \nthese conditions, it can be seen that u(x, t) = O(exp(kx2\u00bb for some constant k > 0 \n(depending on h) and is the unique solution satisfying this property (see Friedman \n(1964) and John (1975) for more details). \nThe connection to our problem now becomes obvious. If the initial set of temper(cid:173)\natures is equal to the weights in the network (H(w) = g(w\u00bb, then the function \ncomputed by the network is equal to the temperature at x at time t = 0'2/2. Given \na function f(x) we can view it as a description of temperature values at time 0'2/2; \nthe problem of learning, i. e. of determining the optimal h( w) (or H( w\u00bb) consists in \nfinding a distribution of initial temperatures at time t = 0 from which f could have \nevolved. In this sense, learning is equivalent to reversing time in a diffusion process. \nIf the continuous case is viewed as a limiting case where units with bell-shaped \ntuning curves are very densely packed, then it is reasonable to consider that, as \nthe density is increased, the width 0' of the curves tends to O. As 0' ~ 0, the final \ndistribution of temperatures approaches the initial one and this is another heuristic \nway of seeing why the weighting function H (w) is identical to the function being \nlearnt. \nIn the course of a diffusion or heat propagation process, the integral of the concen(cid:173)\ntration (or of the temperature) remains constant in time. Thus the temperature \ndistribution is similar to a probability distribution and we can define its entropy \n\nE(u(x, t\u00bb = -1:00 u(x, t) In u(x, t)dx. \n\n(14) \n\nIt is easy to see that the heat equation tends to increase E. Therefore learning can \nalso be viewed as a process by which E is minimized (within certain time boundaries \nconstraints). 
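The claim that the heat flow increases the entropy (14) can be checked numerically. The sketch below uses the standard fact that a gaussian initial condition of standard deviation s evolves under (12) into a gaussian of standard deviation \sqrt{s^2 + 2t}; the grid sizes are illustrative choices:

```python
from math import exp, log, pi, sqrt

def gaussian(x, s):
    """Normalized gaussian of standard deviation s, centered at 0."""
    return exp(-x**2 / (2 * s**2)) / (s * sqrt(2 * pi))

def entropy(s, lo=-30.0, hi=30.0, n=6000):
    """Trapezoid-rule estimate of E = -integral u ln u for u gaussian(.; s)."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        u = gaussian(lo + i * h, s)
        if u > 0.0:
            total += (0.5 if i in (0, n) else 1.0) * (-u * log(u))
    return total * h

# Start from std 1 and let the heat flow widen it: entropy only grows.
E = [entropy(sqrt(1.0 + 2 * t)) for t in (0.0, 0.5, 2.0)]
assert E[0] < E[1] < E[2]

# Sanity check against the closed form (1/2) ln(2 pi e s^2) at s = 1.
assert abs(E[0] - 0.5 * log(2 * pi * exp(1))) < 1e-6
```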
This is intuitively clear if we think of learning as an attempt to evolve an initially random distribution of connection weights and concentrate it in one or a few restricted regions of space.

In general, the problem of solving the heat equation backwards in time is difficult: physically it is an irreversible process and mathematically the problem is ill-posed in the sense of Hadamard. The solution does not always exist (for instance, the final set of temperatures must be an analytic function), or exists only over a limited period of time, and, most of all, small changes in the final set of temperatures can lead to large changes in the initial set of temperatures (see, for instance, John (1955)). However, the problem becomes well-posed if the final set of temperatures has a compact Fourier spectrum (see Miranker (1961); alternatively, one could use a regularization approach as in Franklin (1974)). In a connectionist framework, one usually seeks a least squares approximation to a given function. The corresponding error functional is convex (the heat equation is linear) and therefore a solution always exists. In addition, the problem is usually not ill-posed because the functions to be learnt have a bounded spectrum and are often known only through a finite sample. Thus learning from examples in networks consisting of one hidden layer of gaussian units and a linear output unit is relatively straightforward, for the landscape of the usual error function has no local minima and the optimal set of weights can be found by gradient descent or directly, essentially by linear regression. To be more precise, we can write the error function in the most general case in the form:

E(h(w)) = \int \left[ f(x) - \int h(u) e^{-(x-u)^2/2\sigma^2} \mu(u) du \right]^2 \nu(x) dx   (15)

where \mu and \nu are the measures defined on the weights and the examples respectively.
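Since (15) is quadratic in h, minimizing a discretized version of it is plain linear least squares. The sketch below is a minimal illustration under our own assumptions (uniform measures, a hypothetical target sin(2\pi x), 15 fixed gaussian centers of width 0.1):

```python
import numpy as np

# Fixed gaussian centers w_i; only the output weights are learned.
sigma = 0.1
centers = np.linspace(0.0, 1.0, 15)       # the w_i
xs = np.linspace(0.0, 1.0, 200)           # example points (measure nu uniform)
f = np.sin(2 * np.pi * xs)                # illustrative target function

# Phi[x, i] = exp(-(x - w_i)^2 / (2 sigma^2)): hidden-unit activities.
Phi = np.exp(-(xs[:, None] - centers[None, :])**2 / (2 * sigma**2))

# Normal equations of the discretized least squares problem.
A = Phi.T @ Phi                           # gram matrix of gaussian units
B = Phi.T @ f
H = np.linalg.solve(A, B)                 # optimal output weights

fit = Phi @ H
assert float(np.max(np.abs(fit - f))) < 1e-2   # a good fit, no local minima
```

This is exactly "learning by linear regression": the error landscape is a convex quadratic in the output weights.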
The gradient, as in the usual back-propagation of errors, is given by:

\frac{\partial E}{\partial h(w)} = -2 \int \left[ f(x) - \int H(u) e^{-(x-u)^2/2\sigma^2} du \right] e^{-(x-w)^2/2\sigma^2} \mu(w) \nu(x) dx.   (16)

Thus the critical weights of (15) where \mu(w) \ne 0 are characterized by the relation

\int f(x) e^{-(x-w)^2/2\sigma^2} \nu(x) dx = \int \int H(u) e^{-(x-w)^2/2\sigma^2} e^{-(x-u)^2/2\sigma^2} \nu(x) du dx.   (17)

If now we assume that the centers of the gaussians in the hidden layer occupy a (finite or infinite) set of isolated points w_i, (17) can be rewritten in matrix form as

B = AH   (18)

where B_i = \int f(x) e^{-(x-w_i)^2/2\sigma^2} \nu(x) dx, H_i = h(w_i)\mu(w_i), and A is the real symmetric matrix with entries

A_{ij} = \int e^{-(x-w_i)^2/2\sigma^2} e^{-(x-w_j)^2/2\sigma^2} \nu(x) dx.   (19)

Usually A is invertible, so that H = A^{-1}B which, in turn, yields h(w_i) = H_i / \mu(w_i).

Finally, everything can be extended without any difficulty to d dimensions, where the typical solution of \nabla^2 u = \partial u / \partial t is given by

u(x_1, ..., x_d, t) = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} g(w) \frac{1}{(4\pi t)^{d/2}} e^{-\sum_i (x_i - w_i)^2 / 4t} dw_1 \cdots dw_d   (20)

with, under some smoothness assumptions, u(x, t) \to g(x) as t \to 0.

Remark

(1) For an application to a discrete setting consider, as in Baldi et al. (1988), the sum f(x) = \sum_k k \frac{1}{\sqrt{2\pi}\sigma} e^{-(x-k)^2/2\sigma^2}. For an initial gaussian distribution of temperatures u(x, 0) of the form (1/\sqrt{2\pi}\eta) e^{-x^2/2\eta^2}, the distribution u(x, t) of temperatures at time t is also gaussian, centered at the origin, but with a larger standard deviation which, using (13), is given by (\eta^2 + 2t)^{1/2}. Thus, if we imagine that at time 0 a temperature equal to k has been injected (with a very small \eta) at each integer location along the real axis, then f(x) represents the distribution of temperatures at time t = (\sigma^2 - \eta^2)/2. Intuitively, it is clear that as \sigma is increased (i.e. as we wait longer) the distribution of temperatures becomes more and more linear.
(2) It is aesthetically pleasing that the theory of the heat equation can also be used to give a proof of Weierstrass' theorem. For this purpose, it is sufficient to observe that, for a given continuous function g defined over a closed interval [a, b], the function u(x, t) given by (13) is an analytic function of x at any fixed time t. By letting t \to 0 and truncating the corresponding series, one can get polynomial approximations converging uniformly to g.

4 THE SIGMOID CASE

We now consider the case of a neural network with one hidden layer of sigmoids and one linear output unit. The output of the network can be written as a transform

out(x) = \int \sigma(w \cdot x) h(w) \mu(w) dw   (21)

where x is the input vector and w is a weight vector which is characteristic of each hidden unit (i.e. each hidden unit is characterized by the vector of weights on its incoming input lines rather than, for instance, its spatial location). Assume that the inputs and the weights are normalized, i.e. ||x|| = ||w|| = 1, and that the weight vectors cover the n-dimensional sphere uniformly (or, in the limit, that there is a vector for each point on the sphere). Then for a given input x, the scalar products w \cdot x are maximal and close to 1 in the region of the sphere corresponding to hidden units where w and x are colinear, and decay as we move away until they reach negative values close to -1 in the antipodal region. When these scalar products are passed through an appropriate sigmoid, a bell-shaped pattern of activity is created on the surface of the sphere, and from then on we are reduced to the previous case. Thus the previous results can be extended and in particular we have a simple heuristic proof that the corresponding networks have universal approximation properties (see, for instance, Hornik et al. (1989)).
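The geometric argument above is easy to visualize in two dimensions: unit weight vectors w spread around the circle, one unit input x, and the activities \sigma(w \cdot x) form a single bump centered on the unit whose w is colinear with x. A minimal sketch (the sigmoid gain and grid size are illustrative choices of ours):

```python
from math import cos, sin, pi, exp

def sigmoid(s, gain=10.0):
    """Logistic sigmoid with an adjustable gain (illustrative choice)."""
    return 1.0 / (1.0 + exp(-gain * s))

# Input x on the unit circle, at angle 1.0 radian.
x = (cos(1.0), sin(1.0))

# One hidden unit per weight vector w on a uniform angular grid.
angles = [2 * pi * i / 100 for i in range(100)]
acts = [sigmoid(cos(a) * x[0] + sin(a) * x[1]) for a in angles]

# The pattern of activity peaks where w and x are colinear (angle ~ 1.0)
# and is smallest in the antipodal region, i.e. a bell-shaped bump.
peak = max(range(100), key=lambda i: acts[i])
assert abs(angles[peak] - 1.0) < 0.07
assert min(acts) == min(acts[(peak + 50) % 100], min(acts))
```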
Notice that intuitively the reason is simple, for we end up with something like a grand-mother cell per pattern or cluster of patterns. If we assume that initially \mu(w) \ne 0 everywhere, then it is clear that for learning via LMS optimization we can take \mu to be fixed and adjust only the output weights h. But the problem then is convex and without local minima. This suggests that in the limit of an extremely large number of hidden units, the landscape of the error function is devoid of local minima and learning becomes very smooth. This result is consistent with the conjecture that, under reasonable assumptions, as we progressively increase the number of hidden units, learning goes from being impossible, to being possible but difficult and lengthy, to being relatively easy and quick, to being trivial. And if so, what is the nature of these transitions? This picture is also consistent with certain simulation results reported by several authors, whereby optimal performance and generalization is not best obtained by training for a very long time a minimal size, highly constrained network, but rather by training for a shorter time (until the validation error begins to go up; see Baldi and Chauvin (1991)) a larger network with extra hidden units.

Acknowledgements

This work is supported by NSF grant DMS-8914302 and ONR contract NAS7-100/918. We would like to thank Y. Rinott for useful discussions.

References

Baldi, P. and Heiligenberg, W. (1988) How sensory maps could enhance resolution through ordered arrangements of broadly tuned receivers. Biological Cybernetics, 59, 313-318.

Baldi, P. and Chauvin, Y. (1991) A study of generalization in simple networks. Submitted for publication.

Davis, P. J. (1963) Interpolation and Approximation. Blaisdell.

Durbin, R. (1990) Presented at the Neural Networks for Computing Conference, Snowbird, Utah.

Feller, W.
(1971) An Introduction to Probability Theory and Its Applications. John Wiley & Sons.

Franklin, J. N. (1974) On Tikhonov's method for ill-posed problems. Mathematics of Computation, 28, 128, 889-907.

Friedman, A. (1964) Partial Differential Equations of Parabolic Type. Prentice-Hall.

Hornik, K., Stinchcombe, M. and White, H. (1989) Multilayer feedforward networks are universal approximators. Neural Networks, 2, 5, 359-366.

John, F. (1955) Numerical solutions of the equation of heat conduction for preceding times. Ann. Mat. Pura Appl., ser. IV, vol. 40, 129-142.

John, F. (1975) Partial Differential Equations. Springer Verlag.

Miranker, W. L. (1961) A well posed problem for the backward heat equation. Proceedings American Mathematical Society, 12, 243-247.

Moody, J. and Darken, C. J. (1989) Fast learning in networks of locally-tuned processing units. Neural Computation, 1, 2, 281-294.

Poggio, T. and Girosi, F. (1990) Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247, 978-982.
", "award": [], "sourceid": 412, "authors": [{"given_name": "Pierre", "family_name": "Baldi", "institution": null}]}