{"title": "The Capacity of a Bump", "book": "Advances in Neural Information Processing Systems", "page_first": 556, "page_last": 562, "abstract": null, "full_text": "The Capacity of a Bump \n\nGary William Flake\u00b7 \n\nInstitute for Advance Computer Studies \n\nUniversity of Maryland \nCollege Park, MD 20742 \n\nAbstract \n\nRecently, several researchers have reported encouraging experimental re(cid:173)\nsults when using Gaussian or bump-like activation functions in multilayer \nperceptrons. Networks of this type usually require fewer hidden layers \nand units and often learn much faster than typical sigmoidal networks. \nTo explain these results we consider a hyper-ridge network, which is a \nsimple perceptron with no hidden units and a rid\u00a5e activation function. If \nwe are interested in partitioningp points in d dimensions into two classes \nthen in the limit as d approaches infinity the capacity of a hyper-ridge and \na perceptron is identical. However, we show that for p ~ d, which is the \nusual case in practice, the ratio of hyper-ridge to perceptron dichotomies \napproaches pl2(d + 1). \n\n1 Introduction \n\nA hyper-ridge network is a simple perceptron with no hidden units and a ridge activation \nfunction. With one output this is conveniently described as y = g(h) = g(w . x - b) \nwhere g(h) = sgn(1 - h2). Instead of dividing an input-space into two classes with a \nsingle hyperplane, a hyper-ridge network uses two parallel hyperplanes. All points in the \ninterior of the hyperplanes form one class, while all exterior points form another. For more \ninformation on hyper-ridges, learning algorithms, and convergence issues the curious reader \nshould consult [3]. \n\nWe wouldn't go so far as to suggest that anyone actually use a hyper-ridge for a real-world \nproblem, but it is interesting to note that a hyper-ridge can represent linear inseparable \nmappings such as XOR, NEGATE, SYMMETRY, and COUNT(m) [2, 3]. 
Moreover, hyper-ridges are very similar to multilayer perceptrons with bump-like activation functions, such as a Gaussian, in the way the input space is partitioned. Several researchers [6, 2, 3, 5] have independently found that Gaussian units offer many advantages over sigmoidal units.

·Current address: Adaptive Information and Signal Processing Department, Siemens Corporate Research, 755 College Road East, Princeton, NJ 08540. Email: flake@scr.siemens.com

In this paper we derive the capacity of a hyper-ridge network. Our first result is that hyper-ridges and simple perceptrons are equivalent in the limit as the input dimension approaches infinity. However, when the number of patterns is far greater than the input dimension (as is the usual case) the ratio of hyper-ridge to perceptron dichotomies approaches p/2(d + 1), giving some evidence that bump-like activation functions offer an advantage over the more traditional sigmoid.

The rest of this paper is divided into three more sections. In Section 2 we derive the number of dichotomies for a hyper-ridge network. The capacities of hyper-ridges and simple perceptrons are compared in Section 3. Finally, in Section 4 we give our conclusions.

2 The Representation Power of a Hyper-Ridge

Suppose we have p patterns in the pattern space, R^d, where d is the number of inputs of our neural network. A dichotomy is a classification of all of the points into two distinct sets. Clearly, there are at most 2^p dichotomies. We are concerned with the number of dichotomies that a single hyper-ridge node can represent. Let the number of dichotomies of p patterns in d dimensions be denoted as D(p, d).

For the case of D(1, d), when p = 1 there are always two and only two dichotomies, since one can trivially include the single point or no points. Thus, D(1, d) = 2.
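The notion of counting dichotomies can be made concrete with a small brute-force sketch. The helper names and the parameter grid below are our own; a finite grid can only ever undercount the reachable labelings, so it is used here merely to confirm the trivial case D(1, d) = 2.

```python
def hyperridge_labels(points, w, b):
    """Label each point +1 if it lies strictly between the parallel
    hyperplanes w . x = b - 1 and w . x = b + 1, and -1 otherwise."""
    return tuple(
        1 if abs(sum(wi * xi for wi, xi in zip(w, x)) - b) < 1 else -1
        for x in points
    )

def count_dichotomies(points, params):
    """Count the distinct labelings reachable over a finite set of (w, b)."""
    return len({hyperridge_labels(points, w, b) for w, b in params})

# A single point in d = 2 dimensions: exactly the two trivial dichotomies
# (include the point, or include no points) are reachable.
point = [(0.3, 0.7)]
grid = [((wx, wy), b)
        for wx in (-2, -1, 1, 2)
        for wy in (-2, -1, 1, 2)
        for b in (-3, -1, 0, 1, 3)]
assert count_dichotomies(point, grid) == 2
```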
\nFor the case of D(p, 1), all of the points are constrained to fallon a line. From this set \npick two points, say Xa and Xb. It is always possible to place a ridge function such that \nall points between Xa and Xb (inclusive of the end points) are included in one set, and all \nother points are excluded. Thus, there are p dichotomies consisting of a single point, p - 1 \ndichotomies consisting of two points, p - 2 dichotomies consisting of three points, and \nso on. No other dichotomies besides the empty set are possible. The number of possible \nhyper-ridge dichotomies in one dimension can now be expressed as \n\nD(p, 1)= 2: i + 1 = 2P(P + 1)+ 1, \n\nP \n\n1 \n\n(1) \n\nwith the extra dichotomy coming from the empty set. \n\ni=1 \n\nTo derive the general form of the recurrence relationship, we would have to resort to \ntechniques similar to those used by Cover [1], Nilsson [7], and Gardner [4] . Because of \nspace considerations, we do not give the full derivation of the general form of the recurrence \nrelationship in this paper, but instead cite the complete derivation given in [3] . The short \nversion of the story is that the general form of the recurrence relationship for hyper-ridge \ndichotomies is identical to the equivalent expression for simple perceptrons: \n\nD(p, d) = D(P - 1, d) + D(P - 1, d - 1). \n\n(2) \n\nAll differences between the capacity of hyper-ridges and simple perceptrons are, therefore, \na consequence of the different base cases for the recurrence expression. \n\nTo get Equation 2 into closed form, we first expand D(p, d) a total of p times, yielding \n\nD(P,d)=~ i \n\np-l ( \n\np-l \n) \n\n. \nD(I, d-z). \n\n(3) \n\nFor Equation 3 it is possible for the second term of D( 1, d - 1) to become zero or negative. \nTaking the two identities D(P,O) = p + 1 and D(p, -1) = 1 are the only choices that are \nconsistent with the recurrent relationship expressed in Equation 2. 
With this in mind, there are three separate cases that we need to be concerned with: p < d + 2, p = d + 2, and p > d + 2. When p < d + 2, every term D(1, d - i) in Equation 3 has d - i >= 0 and is therefore equal to 2, so

    D(p, d) = 2 Σ_{i=0}^{p-1} C(p-1, i) = 2^p.    (4)

When p = d + 2, the final term is D(1, -1) = 1, giving

    D(p, d) = 2 Σ_{i=0}^{d} C(p-1, i) + C(p-1, d+1) = 2^p - 1.    (5)

When p > d + 2, some of the last terms D(1, d - i) have a negative second argument. We can disregard all d - i < -1, taking D(1, d - i) equal to zero in these cases (which is consistent with the recurrence relationship),

    D(p, d) = Σ_{i=0}^{p-1} C(p-1, i) D(1, d - i) = Σ_{i=0}^{d+1} C(p-1, i) D(1, d - i)
            = Σ_{i=0}^{d} C(p-1, i) D(1, d - i) + C(p-1, d+1) = 2 Σ_{i=0}^{d} C(p-1, i) + C(p-1, d+1).    (6)

Combining Equations 4, 5, and 6 gives

    Dh(p, d) = { 2 Σ_{i=0}^{d} C(p-1, i) + C(p-1, d+1)   for p > d + 2
               { 2^p - 1                                  for p = d + 2
               { 2^p                                      for p < d + 2.    (7)

The corresponding expression for a simple perceptron [1] is

    Dp(p, d) = { 2 Σ_{i=0}^{d} C(p-1, i)   for p > d + 2
               { 2^p - 2                    for p = d + 2
               { 2^p                        for p < d + 2.    (8)

3 Comparing Capacities

We are most interested in the case p > d + 2, since that is where Equations 7 and 8 differ the most. Moreover, problems are more difficult when the number of training patterns greatly exceeds the number of trainable weights in a neural network.

Let Dh(p, d) and Dp(p, d) denote the number of dichotomies possible for hyper-ridge networks and simple perceptrons, respectively. Additionally, let Ch and Cp denote the respective capacities. We should expect both Dh(p, d)/2^p and Dp(p, d)/2^p to be at or around 1 for small values of p/(d + 1). At some point, for large p/(d + 1), the 2^p term should dominate, making the ratio go to zero. The capacity of a network can loosely be defined as the value p/(d + 1) such that D(p, d)/2^p = 1/2. This is more rigorously defined as

    C = { c : lim_{d→∞} D(c(d + 1), d) / 2^{c(d+1)} = 1/2 },

which is the point at which the transition occurs in the limit as the input dimension goes to infinity.

Figures 1, 2, and 3 illustrate and compare Cp and Ch at different stages. In Figure 1 the capacities are illustrated for perceptrons and hyper-ridges, respectively, by plotting D(p, d)/2^p versus p/(d + 1) for various values of d.
On par with our intuition, the ratio D(p, d)/2^p equals 1 for small values of p/(d + 1) but decreases to zero as p/(d + 1) increases. Figure 2 and the left diagram of Figure 3 plot D(p, d)/2^p versus p/(d + 1) for perceptrons and hyper-ridges, side by side, with values of d = 5, 20, and 100. As d increases, the two curves become more similar. This fact is further illustrated in the right diagram of Figure 3, where Dh(p, d)/Dp(p, d) is plotted versus p for various values of d. The ratio clearly approaches 1 as d increases, but there is a significant difference for smaller values of d.

The differences between Dp and Dh can be more explicitly quantified by noting that

    Dh(p, d) = Dp(p, d) + C(p-1, d+1)

for p > d + 2. This difference clearly shows up in the plots comparing the two capacities. We will now show that the capacities are identical in the limit as d approaches infinity. To do this, we will prove that the capacity curves for both hyper-ridges and perceptrons cross 1/2 at p/(d + 1) = 2. This fact is already widely known for perceptrons. Because of space limitations we will handwave our way through the lemma and corollary proofs. The curious reader should consult [3] for the complete proofs.

Lemma 3.1

    lim_{n→∞} (1/2^{2n}) C(2n, n) = 0.

Short Proof Since n approaches infinity, we can use Stirling's formula as an approximation of the factorials. □

Corollary 3.2 For all positive integer constants a, b, and c,

    lim_{n→∞} (1/2^{2n+a}) C(2n + b, n + c) = 0.

Short Proof When adding the constants b and c to the combination, the whole combination can always be represented as C(2n, n) · γ, where γ is some multiplicative constant. Such a constant can always be factored out of the limit. Additionally, large values of a only increase the growth rate of the denominator. □

Lemma 3.3 For p/(d + 1) = 2, lim_{d→∞} Dp(p, d)/2^p = 1/2.
\nShort Proof Consult any of Cover [1], Nilsson [7], or Gardner [4] for full proof. \n\no \n\n\f560 \n\n\" \" \n\n. \n'\\ \n\\ : \n\\ ' \n.~ \n\nd = S (cid:173)\nd =20 ---(cid:173)\nd= 100 .. . \n\n0.' \n\n~ - - ~ --r-- -_ . -\n\nI \n\n.. \u2022 \n\nos \n\n- -\n\n-\n\nG. W.FLAKE \n\nd = 5 -\nd= 20 - (cid:173)\nd=l00 . \n\nFigure 1: On the left, Dp(P, tf)12P versus pl(d + 1), and on the right, Dh(p, d)/2P versus \npl(d + 1) for various values of d . Notice that for perceptrons the curve always passes \nthrough! at pl(d + 1) = 2. For hyper-ridges, the point where the curve passes through! \ndecreases as d increases. \n\nI perceptmn(cid:173)\nihyper-ridge ---\n\no.s \n\n--\n\n-\n\n---'r-'--\n\n\\ \n\n-- - \\ . . \n\n2 \n\np/(d+ I) \n\n\u00b00L-----~----~2 --~~------J \n\npl(d+ J) \n\nFigure 2: On the left, capacity comparison for d = 5. There is considerable difference for \nsmall values of d, especially when one considers that the capacities are normalized by 2P. \nOn the right, comparison for d = 20. The difference between the two capacities is much \nmore subtle now that d is fairly large. \n\nos \n\n-\n\n-\n\nt perecpuon -\n:hyper.ridge _. \n! \n\nd: 1 -\nd\", 2 ---(cid:173)\nd= 5 \nd\", I O (cid:173)\nd= 100 - _. -\n\n20 \n\n10 \n\no L-----~----~~--~----~ \n\no \n\n10 \n\n20 \n\n30 \n\n40 \n\nso \nP \n\n60 \n\n70 \n\n80 \n\n90 \n\n100 \n\nFigure 3: On the left, capacity comparison for d = 100. For this value of d, the capacities \nare visibly indistinguishable. On the right, Dh(P, d)1 Dp(P, tf) versus p for various values of \nd. For small values of d the capacity of a hyper-ridge is much greater than a perceptron. \nAs d grows, the ratio asymptotically approaches 1. \n\n\fThe Capacity of a Bump \n\nTheorem 3.4 For pl(d + 1) = 2, \n\nlim Dh(p, d) = !. 
\n2 \nd-oo \n\n2P \n\n561 \n\nProof Taking advantage of the relationship between perceptron dichotomies and hyper(cid:173)\nridge dichotomies allows us to expand Dh(p, d), \n\n1\u00b7 Dh(P, d) \n1m \nd-oo \n\n2P \n\nl' Dp(P, d) \n= 1m \nd-oo \n\n2P \n\n+ 1m \n\nl' \nd-oo d + 1 \n\n(p - 1) \n. \n\nBy Lemma 3.3, and substituting 2(d + 1) for p, we get: \n\n1 l' \n\n- + 1m \n2 d-oo \n\n(2d + 1) \n\nd + 1 \n\n. \n\nFinally, by Corollary 3.2 the right limit vanishes leaving us with !. \no \n\nSuperficially, Theorem 3.4 would seem to indicate that there is no difference between the \nrepresentation power of a perceptron and a hyper-ridge network. However, since this result \nis only valid in the limit as the number of inputs goes to infinity, it would be interesting to \nknow the exact relationship between Dp(d, p) and Dh(d, p) for finite values of d. \n\nIn the right diagram of Figure 3 values of Dp(d,p)IDh(d,p) are plotted against various \nvalues of p. The figure is slightly misleading since the ratio appears to be linear in p, \nwhen, in fact, the ratio is only approximately linear in p. If we normalize the ratio by } \nand recompute the ratio in the limit as p approaches infinity the ratio becomes linear in d. \nTheorem 3.5 establishes this rigorously. \n\nTheorem 3.5 \n\nProof First, note that we can simplify the left hand side of the expression to \n\n1 Dh(d,p) \n\n. \n= hm -\nhm -\np-oopDp(d,p) p_oop \n\n. \n\n1 Dp(d,p) + (~~ :) \n\n1 (~~:) \n\n. \n\n= hm -\n\nDp(d,p) \n\np_oop Dp(d,p) \n\n(9) \n\nIn the next step, we will invert Equation 9, making it easier to work with. We need to show \nthat the new expression is equal to 2(d + 1). \n\nlim p Dp(d,p) = lim 2p \np-oo (~ ~ :) p_oo \n\nL~ (p~ 1) \n(~ ~ :) \n\nl \n\n= \n\n. 2:d \n\nhm 2p \np_oo \n\n. \n1=0 \n\n(P -\n\nI)! \n\n(d + 1)!(P - d - 2)! \n\ni!(P - i-I)! \n\n(P -\n\nI)! \n\n2:d (d + I)! (P - d - 2)! \n\ni! \n\n(P - i-I)! \n\n. \n\n= hm 2p \n\nP_oo. 
\n1=0 \n\n= \n\nd \n\nlim \np_oo (P - 1 - d) \n\np \n\n2(d+l)\"d!(p-d-l)!= lim2(d+l)\"d!(p-d-l)! (10) \n\ni! (P -\n\ni - 1)! \n\np_oo \n\n~ i! (P - i-I)! \ni=O \n\nd \n\n6 \n1=0 \n\nIn Equation 10, the summation can be reduced to 1 since \n\nd! (P - d - 1 )! _ {O when 0 :5 i < d \n\n1. \n1m -\np-oo i! (P - i-I)! \n\n-\n\n1 \n\nh \n\n. d \n\nw en l = \n\n\f562 \n\nG. W.FLAKE \n\nThus, Equation 10 is equal to 2(d + 1), which proves the theorem. \no \n\nTheorem 3.5 is valid only in the case when p ~ d, which is typically true in interesting \nclassification problems. The result of the theorem gives us a good estimate of how many \nmore dichotomies are computable with a hyper-ridge network when compared to a simple \nperceptron. When p ~ d the equation \n\nDh(d,p) \nDp(d,p) - 2(d+ 1) \n\nP \n\n(11) \n\nis an accurate estimate of the difference between the capacities of the two architectures. \nFor example, taking d = 4 and p = 60 and applying the values to Equation 11 yields the \nratio of 6, which should be interpreted as meaning that one could store six times the number \nof mappings in a hyper-ridge network than one could in a simple perceptron. Moreover, \nEquation 11 is in agreement with the right diagram of Figure 3 for all values of p ~ d. \n\n4 Conclusion \n\nAn interesting footnote to this work is that the VC dimension [8] of a hyper-ridge network \nis identical to a simple perceptron, namely d. However, the real difference between \nperceptrons and hyper-ridges is more noticeable in practice, especially when one considers \nthat linear inseparable problems are representable by hyper-ridges. \n\nWe also know that there is no such thing as a free lunch and that generalization is sure \nto suffer in just the cases when representation power is increased. 
Yet given all of the comparisons between MLPs and radial basis functions (RBFs), we find it encouraging that there may be a class of approximators that is a compromise between the local nature of RBFs and the global structure of MLPs.

References

[1] T. M. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, 14:326-334, 1965.

[2] M. R. W. Dawson and D. P. Schopflocher. Modifying the generalized delta rule to train networks of non-monotonic processors for pattern classification. Connection Science, 4(1), 1992.

[3] G. W. Flake. Nonmonotonic Activation Functions in Multilayer Perceptrons. PhD thesis, University of Maryland, College Park, MD, December 1993.

[4] E. Gardner. Maximum storage capacity in neural networks. Europhysics Letters, 4:481-485, 1987.

[5] F. Girosi, M. Jones, and T. Poggio. Priors, stabilizers and basis functions: from regularization to radial, tensor and additive splines. Technical Report A.I. Memo No. 1430, C.B.C.L. Paper No. 75, MIT AI Laboratory, 1993.

[6] E. Hartman and J. D. Keeler. Predicting the future: Advantages of semilocal units. Neural Computation, 3:566-578, 1991.

[7] N. J. Nilsson. Learning Machines: Foundations of Trainable Pattern Classifying Systems. McGraw-Hill, New York, 1965.

[8] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16:264-280, 1971.
", "award": [], "sourceid": 1136, "authors": [{"given_name": "Gary", "family_name": "Flake", "institution": null}]}