{"title": "Radial Basis Function Networks and Complexity Regularization in Function Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 197, "page_last": 203, "abstract": "", "full_text": "10. .......,. ...\u00b7. . . . . Basis Function Networks and Complexity\n\n--_.IIiIIIIIIIIlo............... JIIIL ....'IIIoo4II\u2022 .,JIIIL'IIU\"JIIILJIIIL in Function Learning\n\nAdam Krzyzak\n\nDepartment of Computer Science\n\nConcordia University\n\nMontreal, Canada\n\nkrzyzak@cs.concordia.ca\n\nTamas Linder\n\nDept. of Math. & Comp. Sci.\n\nTechnical University of Budapest\n\nBudapest, Hungary\nlinder@inf.bme.hu\n\nAbstract\n\nIn this paper we apply the method of complexity regularization to de(cid:173)\nrive estimation bounds for nonlinear function estimation using a single\nhidden layer radial basis function network. Our approach differs from\nthe previous complexity regularization neural network function learning\nschemes in that we operate with random covering numbers and 11 metric\nentropy, making it po~sibleto consider much broader families of activa(cid:173)\ntion functions, namely functions of bounded variation. Some constraints\npreviously imposed on the network parameters are also eliminated this\nway. The network is trained by means of complexity regularization in(cid:173)\nvolving empirical risk minimization. Bounds on the expected risk in\ntenns of the sample size are obtained for a large class of loss functions.\nRates of convergence to the optimal loss are also derived.\n\n1 INTRODUCTION\n\nArtificial neural networks have been found effective in learning input-outputmappings from\nnoisy examples. In this learning problem an unknown target function is to be inferred from a\nset of independent observations drawn according to some unknown probability distribution\nfrom the input-output space JRd x JR. 
Using this data set the learner tries to determine a function which fits the data in the sense of minimizing some given empirical loss function. The target function may or may not be in the class of functions which are realizable by the learner. In the case when the class of realizable functions consists of some class of artificial neural networks, the above problem has been extensively studied from different viewpoints.

In recent years a special class of artificial neural networks, the radial basis function (RBF) networks, has received considerable attention. RBF networks have been shown to be the solution of the regularization problem in function estimation with certain standard smoothness functionals used as stabilizers (see [5] and the references therein). Universal convergence of RBF nets in function estimation and classification has been proven by Krzyzak et al. [6]. Convergence rates of RBF approximation schemes have been shown to be comparable with those for sigmoidal nets by Girosi and Anzellotti [4]. In a recent paper Niyogi and Girosi [9] studied the tradeoff between approximation and estimation errors and provided an extensive review of the problem.

In this paper we consider one hidden layer RBF networks. We look at the problem of choosing the size of the hidden layer as a function of the available training data by means of complexity regularization. The complexity regularization approach has been applied to model selection by Barron [1], [2], resulting in a near optimal choice of sigmoidal network parameters. Our approach here differs from Barron's in that we use $l_1$ metric entropy instead of the supremum norm. This allows us to consider a more general class of activation functions, namely the functions of bounded variation, rather than a restricted class of activation functions satisfying a Lipschitz condition. For example, activations with jump discontinuities are allowed. 
In our complexity regularization approach we are able to choose the network parameters more freely, and no discretization of these parameters is required. For RBF regression estimation with squared error loss, we considerably improve the convergence rate result obtained by Niyogi and Girosi [9].

In Section 2 the problem is formulated and two results on the estimation error of complexity regularized RBF nets are presented: one for general loss functions (Theorem 1) and a sharpened version of the first one for the squared loss (Theorem 2). Approximation bounds are combined with the obtained estimation results in Section 3, yielding convergence rates for function learning with RBF nets.

2 PROBLEM FORMULATION

The task is to predict the value of a real random variable $Y$ upon the observation of an $\mathbb{R}^d$-valued random vector $X$. The accuracy of the predictor $f: \mathbb{R}^d \to \mathbb{R}$ is measured by the expected risk

$$J(f) = \mathbb{E} L(f(X), Y),$$

where $L: \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ is a nonnegative loss function. It will be assumed that there exists a minimizing predictor $f^*$ such that

$$J(f^*) = \inf_f J(f).$$

A good predictor $f_n$ is to be determined based on the data $(X_1, Y_1), \ldots, (X_n, Y_n)$, which are i.i.d. copies of $(X, Y)$. The goal is to make the expected risk $\mathbb{E} J(f_n)$ as small as possible, while $f_n$ is chosen from among a given class $\mathcal{F}$ of candidate functions. In this paper the set of candidate functions $\mathcal{F}$ will be the set of single-layer feedforward neural networks with radial basis function activation units, and we let $\mathcal{F} = \bigcup_{k=1}^{\infty} \mathcal{F}_k$, where $\mathcal{F}_k$ is the family of networks with $k$ hidden nodes whose weight parameters satisfy certain constraints. 
In particular, for radial basis functions characterized by a kernel $K: \mathbb{R}_+ \to \mathbb{R}$, $\mathcal{F}_k$ is the family of networks

$$f(x) = \sum_{i=1}^{k} w_i K([x - c_i]^t A_i [x - c_i]) + w_0,$$

where $w_0, w_1, \ldots, w_k$ are real numbers called weights, $c_1, \ldots, c_k \in \mathbb{R}^d$, the $A_i$ are nonnegative definite $d \times d$ matrices, and $x^t$ denotes the transpose of the column vector $x$.

The complexity regularization principle for the learning problem was introduced by Vapnik [10] and fully developed by Barron [1], [2] (see also Lugosi and Zeger [8]). It enables the learning algorithm to choose the candidate class $\mathcal{F}_k$ automatically, from which it picks the estimate function by minimizing the empirical error over the training data. Complexity regularization penalizes the large candidate classes, which are bound to have small approximation error, in favor of the smaller ones, thus balancing the estimation and approximation errors.

Let $\mathcal{F}$ be a subset of a space $\mathcal{X}$ of real functions over some set, and let $\rho$ be a pseudometric on $\mathcal{X}$. For $\epsilon > 0$ the covering number $N(\epsilon, \mathcal{F}, \rho)$ is defined to be the minimal number of closed $\epsilon$ balls whose union covers $\mathcal{F}$. In other words, $N(\epsilon, \mathcal{F}, \rho)$ is the least integer such that there exist $f_1, \ldots, f_N$ with $N = N(\epsilon, \mathcal{F}, \rho)$ satisfying

$$\sup_{f \in \mathcal{F}} \min_{1 \le i \le N} \rho(f, f_i) \le \epsilon.$$

In our case, $\mathcal{F}$ is a family of real functions on $\mathbb{R}^m$, and for any two functions $f$ and $g$, $\rho$ is given by the empirical $l_1$ distance

$$\rho(f, g) = \frac{1}{n} \sum_{i=1}^{n} |f(z_i) - g(z_i)|,$$

where $z_1, \ldots, z_n$ are $n$ given points in $\mathbb{R}^m$. In this case we will use the notation $N(\epsilon, \mathcal{F}, \rho) = N(\epsilon, \mathcal{F}, z_1^n)$, emphasizing the dependence of the metric $\rho$ on $z_1^n = (z_1, \ldots, z_n)$. Let us define the families of functions $\mathcal{H}_k$, $k = 1, 2, \ldots$ by

$$\mathcal{H}_k = \{L(f(\cdot), \cdot) : f \in \mathcal{F}_k\}.$$

Thus each member of $\mathcal{H}_k$ maps $\mathbb{R}^{d+1}$ into $\mathbb{R}$. 
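As a concrete illustration of the family $\mathcal{F}_k$, an RBF net of the above form can be evaluated directly. The following sketch is ours, not the paper's; the function and kernel names are invented for illustration. The window kernel included here is a bounded-variation activation with a jump discontinuity, which the covering-number framework of this paper admits:

```python
import numpy as np

def rbf_net(x, w0, weights, centers, mats, kernel):
    """Evaluate f(x) = sum_i w_i * K((x - c_i)^t A_i (x - c_i)) + w_0."""
    val = w0
    for w, c, A in zip(weights, centers, mats):
        d = x - c
        val += w * kernel(d @ A @ d)  # (x - c_i)^t A_i (x - c_i) >= 0 for A_i >= 0
    return val

# A bounded-variation kernel with a jump discontinuity (the "window" kernel):
window = lambda r: 1.0 if r <= 1.0 else 0.0
```

Any nonnegative definite $A_i$ is admissible; the sketch only evaluates a network and makes no attempt at training.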
It will be assumed that for each $k$ we are given a finite, almost sure uniform upper bound $N(\epsilon, \mathcal{H}_k)$ on the random covering numbers $N(\epsilon, \mathcal{H}_k, Z_1^n)$, where $Z_1^n = ((X_1, Y_1), \ldots, (X_n, Y_n))$. We may assume without loss of generality that $N(\epsilon, \mathcal{H}_k)$ is monotone decreasing in $\epsilon$. Finally, assume that $L(f(X), Y)$ is uniformly almost surely bounded by a constant $B$, i.e.,

$$P\{L(f(X), Y) \le B\} = 1, \quad f \in \mathcal{F}_k, \ k = 1, 2, \ldots \quad (1)$$

The complexity penalty of the $k$th class for $n$ training samples is a nonnegative number $\Delta_{kn}$ satisfying

$$\Delta_{kn} \ge \sqrt{\frac{128 B^2 \log N(\Delta_{kn}/8, \mathcal{H}_k) + c_k}{n}}, \quad (2)$$

where the nonnegative constants $c_k$ satisfy $\sum_{k=1}^{\infty} e^{-c_k} \le 1$. Note that since $N(\epsilon, \mathcal{H}_k)$ is nonincreasing in $\epsilon$, it is possible to choose such $\Delta_{kn}$ for all $k$ and $n$. The resulting complexity penalty optimizes the upper bound on the estimation error in the proof of Theorem 1 below. We can now define our estimate. Let

$$f_{kn} = \arg\min_{f \in \mathcal{F}_k} \frac{1}{n} \sum_{j=1}^{n} L(f(X_j), Y_j),$$

that is, $f_{kn}$ minimizes over $\mathcal{F}_k$ the empirical risk for $n$ training samples. The penalized empirical risk is defined for each $f \in \mathcal{F}_k$ as

$$J_n(f) = \frac{1}{n} \sum_{j=1}^{n} L(f(X_j), Y_j) + \Delta_{kn}.$$

The estimate $f_n$ is then defined as the $f_{kn}$ minimizing the penalized empirical risk over all classes:

$$f_n = \arg\min_{f_{kn}: k \ge 1} J_n(f_{kn}). \quad (3)$$

We have the following theorem for the expected estimation error of the above complexity regularization scheme.

Theorem 1 For any $n$ and $k$ the complexity regularization estimate (3) satisfies

$$\mathbb{E} J(f_n) - J(f^*) \le \min_{k \ge 1} \left( R_{kn} + \inf_{f \in \mathcal{F}_k} J(f) - J(f^*) \right),$$

where $R_{kn}$ is an explicit estimation-error term of the order of the penalty, $R_{kn} = \Delta_{kn} + O(1/\sqrt{n})$.

Assuming without loss of generality that $\log N(\epsilon, \mathcal{H}_k) \ge 1$, it is easy to see that the choice

$$\Delta_{kn} = \sqrt{\frac{128 B^2 \log N(B/\sqrt{n}, \mathcal{H}_k) + c_k}{n}} \quad (4)$$

satisfies (2).

2.1 SQUARED ERROR LOSS

For the special case when

$$L(x, y) = (x - y)^2$$

we can obtain a better upper bound. 
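The estimate for the squared loss is chosen by the same penalized minimization as in (3). As a minimal sketch of that selection rule (our own toy illustration: the candidate classes are represented by fitting routines, and the penalties $\Delta_{kn}$ are supplied by the caller rather than computed from covering numbers):

```python
import numpy as np

def complexity_regularized_fit(xs, ys, fitters, penalties, loss):
    """Return f_n = argmin_k [ empirical risk of f_kn + Delta_kn ],
    where f_kn minimizes the empirical risk over the k-th class."""
    best_score, best_f = None, None
    for fit, delta in zip(fitters, penalties):
        f = fit(xs, ys)  # f_kn: empirical risk minimizer over F_k
        emp_risk = np.mean([loss(f(x), y) for x, y in zip(xs, ys)])
        score = emp_risk + delta  # penalized empirical risk J_n(f_kn)
        if best_score is None or score < best_score:
            best_score, best_f = score, f
    return best_f
```

Larger classes get larger penalties $\Delta_{kn}$, so a class is selected only when its reduction in empirical risk outweighs its added complexity.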
The estimate will be the same as before, but instead of (2), the complexity penalty $\Delta_{kn}$ now has to satisfy

$$\Delta_{kn} \ge \frac{C_1 \log N(\Delta_{kn}/C_2, \mathcal{F}_k) + c_k}{n}, \quad (5)$$

where $C_1 = 3499\,C^4$, $C_2 = 2560\,C^3$, and $C = \max\{B, 1\}$. Here $N(\epsilon, \mathcal{F}_k)$ is a uniform upper bound on the random $l_1$ covering numbers $N(\epsilon, \mathcal{F}_k, X_1^n)$. Assume that the class $\mathcal{F} = \bigcup_k \mathcal{F}_k$ is convex, and let $\bar{\mathcal{F}}$ be the closure of $\mathcal{F}$ in $L^2(\mu)$, where $\mu$ denotes the distribution of $X$. Then there is a unique $\bar{f} \in \bar{\mathcal{F}}$ whose squared loss $J(\bar{f})$ achieves $\inf_{f \in \bar{\mathcal{F}}} J(f)$. We have the following bound on the difference $\mathbb{E} J(f_n) - J(\bar{f})$.

Theorem 2 Assume that $\mathcal{F} = \bigcup_k \mathcal{F}_k$ is a convex set of functions, and consider the squared error loss. Suppose that $|f(x)| \le B$ for all $x \in \mathbb{R}^d$ and $f \in \mathcal{F}$, and $P(|Y| > B) = 0$. Then the complexity regularization estimate with complexity penalty satisfying (5) gives

$$\mathbb{E} J(f_n) - J(\bar{f}) \le 2 \min_{k \ge 1} \left( \Delta_{kn} + \inf_{f \in \mathcal{F}_k} J(f) - J(\bar{f}) \right) + \frac{2 C_1}{n}.$$

The proof of this result uses an idea of Barron [1] and a Bernstein-type uniform probability inequality recently obtained by Lee et al. [7].

3 RBF NETWORKS

We will consider radial basis function (RBF) networks with one hidden layer. Such a network is characterized by a kernel $K: \mathbb{R}_+ \to \mathbb{R}$. An RBF net of $k$ nodes is of the form

$$f(x) = \sum_{i=1}^{k} w_i K([x - c_i]^t A_i [x - c_i]) + w_0, \quad (6)$$

where $w_0, w_1, \ldots, w_k$ are real numbers called weights, $c_1, \ldots, c_k \in \mathbb{R}^d$, and the $A_i$ are nonnegative definite $d \times d$ matrices. The $k$th candidate class $\mathcal{F}_k$ for the function estimation task is defined as the class of networks with $k$ nodes which satisfy the weight condition $\sum_{i=0}^{k} |w_i| \le b$ for a fixed $b > 0$:

$$\mathcal{F}_k = \left\{ \sum_{i=1}^{k} w_i K([x - c_i]^t A_i [x - c_i]) + w_0 : \sum_{i=0}^{k} |w_i| \le b \right\}. \quad (7)$$

Let $L(x, y) = |x - y|^p$, and

$$J(f) = \mathbb{E} |f(X) - Y|^p, \quad (8)$$

where $1 \le p < \infty$. Let $\mu$ denote the probability measure induced by $X$. 
Define $\bar{\mathcal{F}}$ to be the closure in $L^p(\mu)$ of the convex hull of the functions $\tilde{b} K([x - c]^t A [x - c])$ and the constant function $h(x) = 1$, $x \in \mathbb{R}^d$, where $|\tilde{b}| \le b$, $c \in \mathbb{R}^d$, and $A$ varies over all nonnegative definite $d \times d$ matrices. That is, $\bar{\mathcal{F}}$ is the closure of $\mathcal{F} = \bigcup_k \mathcal{F}_k$, where $\mathcal{F}_k$ is given in (7). Let $g \in \bar{\mathcal{F}}$ be arbitrary. If we assume that $|K|$ is uniformly bounded, then by Corollary 1 of Darken et al. [3], we have for $1 \le p \le 2$ that

$$\inf_{f \in \mathcal{F}_k} \|f - g\|_{L^p(\mu)} = O(1/\sqrt{k}),$$

where $\|f - g\|_{L^p(\mu)}$ denotes the $L^p(\mu)$ norm $\left( \int |f - g|^p \, d\mu \right)^{1/p}$, and $\mathcal{F}_k$ is given in (7). The approximation error $\inf_{f \in \mathcal{F}_k} J(f) - J(f^*)$ can be dealt with using this result if the optimal $f^*$ happens to be in $\bar{\mathcal{F}}$. In this case, we obtain

$$\inf_{f \in \mathcal{F}_k} J(f) - J(f^*) = O(1/\sqrt{k}) \quad (9)$$

for all $1 \le p \le 2$. Values of $p$ close to 1 are of great importance for robust neural network regression.

When the kernel $K$ has bounded total variation, it can be shown that $N(\epsilon, \mathcal{H}_k) \le (A_1/\epsilon)^{A_2 k}$, where the constants $A_1, A_2$ depend on $\sup_x |K(x)|$, the total variation $V$ of $K$, the dimension $d$, and on the constant $b$ in the definition (7) of $\mathcal{F}_k$. Then, if $1 \le p \le 2$, the following consequence of Theorem 1 can be proved for $L^p$ regression estimation.

Theorem 3 Let the kernel $K$ be of bounded variation and assume that $|Y|$ is bounded. Then for $1 \le p \le 2$ the error (8) of the complexity regularized estimate satisfies

$$\mathbb{E} J(f_n) - J(f^*) \le \min_{k \ge 1} \left[ O\left( \sqrt{\frac{k \log n}{n}} \right) + O\left( \frac{1}{\sqrt{k}} \right) \right] = O\left( \left( \frac{\log n}{n} \right)^{1/4} \right).$$

For $p = 1$, i.e., for $L^1$ regression estimation, this rate is known to be optimal within the logarithmic factor.

For squared error loss $J(f) = \mathbb{E}(f(X) - Y)^2$ we have $f^*(x) = \mathbb{E}(Y \mid X = x)$. If $f^* \in \bar{\mathcal{F}}$, then, since for the squared loss $J(f) - J(f^*) = \|f - f^*\|_{L^2(\mu)}^2$, the approximation bound above yields

$$\inf_{f \in \mathcal{F}_k} J(f) - J(f^*) = O(1/k). \quad (10)$$

It is easy to check that the class $\bigcup_k \mathcal{F}_k$ is convex if the $\mathcal{F}_k$ are the collections of RBF nets defined in (7). The next result shows that we can get rid of the square root in Theorem 3.

Theorem 4 Assume that $K$ is of bounded variation. 
Suppose furthermore that $|Y|$ is a bounded random variable, and let $L(x, y) = (x - y)^2$. Then the complexity regularization RBF squared regression estimate satisfies

$$\mathbb{E} J(f_n) - \inf_{f \in \bar{\mathcal{F}}} J(f) \le 2 \min_{k \ge 1} \left( \inf_{f \in \mathcal{F}_k} J(f) - \inf_{f \in \bar{\mathcal{F}}} J(f) + O\left( \frac{k \log n}{n} \right) \right) + O\left( \frac{1}{n} \right).$$

If $f^* \in \bar{\mathcal{F}}$, this result and (10) give

$$\mathbb{E} J(f_n) - J(f^*) \le \min_{k \ge 1} \left[ O\left( \frac{k \log n}{n} \right) + O\left( \frac{1}{k} \right) \right] = O\left( \left( \frac{\log n}{n} \right)^{1/2} \right). \quad (11)$$

This result sharpens and extends Theorem 3.1 of Niyogi and Girosi [9], where the weaker $O\left( \sqrt{\frac{k \log n}{n}} \right) + O\left( \frac{1}{k} \right)$ convergence rate was obtained (in a PAC-like formulation) for the squared loss of Gaussian RBF network regression estimation. The rate in (11) varies linearly with the dimension. Our result is valid for a very large class of RBF schemes, including the Gaussian RBF networks considered in [9]. Besides having improved on the convergence rate, our result has the advantage of allowing kernels which are not continuous, such as the window kernel.

The above convergence rate results hold in the case when there exists an $f^*$ minimizing the risk which is a member of the $L^p(\mu)$ closure of $\mathcal{F} = \bigcup_k \mathcal{F}_k$, where $\mathcal{F}_k$ is given in (7). In other words, $f^*$ should be such that for all $\epsilon > 0$ there exist a $k$ and a member $f$ of $\mathcal{F}_k$ with $\|f - f^*\|_{L^p(\mu)} < \epsilon$. The precise characterization of $\bar{\mathcal{F}}$ seems to be difficult. However, based on the work of Girosi and Anzellotti [4] we can describe a large class of functions that is contained in $\bar{\mathcal{F}}$.

Let $H(x, t)$ be a real and bounded function of two variables $x \in \mathbb{R}^d$ and $t \in \mathbb{R}^m$. Suppose that $\lambda$ is a signed measure on $\mathbb{R}^m$ with finite total variation $\|\lambda\|$. If $g(x)$ is defined as

$$g(x) = \int_{\mathbb{R}^m} H(x, t) \, \lambda(dt),$$

then $g \in L^p(\mu)$ for any probability measure $\mu$ on $\mathbb{R}^d$. One can reasonably expect that $g$ can be approximated well by functions $f(x)$ of the form

$$f(x) = \sum_{i=1}^{k} w_i H(x, t_i),$$

where $t_1, \ldots, t_k \in \mathbb{R}^m$ and $\sum_{i=1}^{k} |w_i| \le \|\lambda\|$. 
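This expectation can be checked numerically. In the sketch below (our own construction, not from the paper) we take $H(x, t) = e^{-(x-t)^2}$ and let $\lambda$ be the standard normal distribution, a positive measure with $\|\lambda\| = 1$; drawing $t_i$ i.i.d. from $\lambda / \|\lambda\|$ and using equal weights $w_i = \|\lambda\|/k$ (so $\sum |w_i| = \|\lambda\|$) gives a Monte Carlo realization of the approximating sums:

```python
import numpy as np

# H(x, t) = exp(-(x - t)^2); lambda = standard normal distribution,
# a positive measure with total variation ||lambda|| = 1 (our toy choice).
H = lambda x, t: np.exp(-(x - t) ** 2)

def g_exact(x):
    # Closed form of g(x) = integral of H(x, t) d(lambda)(t) for this H and
    # lambda: a Gaussian integral giving exp(-x^2 / 3) / sqrt(3).
    return np.exp(-x ** 2 / 3.0) / np.sqrt(3.0)

def g_mc(x, k, rng):
    """Approximate g by f(x) = sum_i w_i H(x, t_i) with t_i ~ lambda/||lambda||
    i.i.d. and equal weights w_i = ||lambda||/k, so sum |w_i| = ||lambda||."""
    t = rng.standard_normal(k)
    return float(np.mean(H(x, t)))
```

As $k \to \infty$ the random sums converge to $g$, in line with the qualitative approximation statement above; the weights stay inside the $\|\lambda\|$ constraint throughout.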
The case $m = d$ and $H(x, t) = G(x - t)$ is investigated in [4], where a detailed description of function spaces arising from the different choices of the basis function $G$ is given. Niyogi and Girosi [9] extend this approach to approximation by convex combinations of translates and dilates of a Gaussian function. In general, we can prove the following lemma.

Lemma 1 Let

$$g(x) = \int_{\mathbb{R}^m} H(x, t) \, \lambda(dt), \quad (12)$$

where $H(x, t)$ and $\lambda$ are as above. Define for each $k \ge 1$ the class of functions

$$\mathcal{G}_k = \left\{ f(x) = \sum_{i=1}^{k} w_i H(x, t_i) : \sum_{i=1}^{k} |w_i| \le \|\lambda\| \right\}.$$

Then for any probability measure $\mu$ on $\mathbb{R}^d$ and for any $1 \le p < \infty$, the function $g$ can be approximated in $L^p(\mu)$ arbitrarily closely by members of $\mathcal{G} = \bigcup_k \mathcal{G}_k$, i.e.,

$$\inf_{f \in \mathcal{G}_k} \|f - g\|_{L^p(\mu)} \to 0 \quad \text{as} \quad k \to \infty.$$

To prove this lemma one need only slightly adapt the proof of Theorem 8.2 in [4], or proceed in a more elementary way along the lines of the probabilistic proof of Theorem 1 of [6]. To apply the lemma to the RBF networks considered in this paper, let $m = d^2 + d$, $t = (A, c)$, and $H(x, t) = K([x - c]^t A [x - c])$. Then we obtain that $\bar{\mathcal{F}}$ contains all the functions $g$ with the integral representation

$$g(x) = \int_{\mathbb{R}^{d^2 + d}} K([x - c]^t A [x - c]) \, \lambda(dc \, dA),$$

for which $\|\lambda\| \le b$, where $b$ is the constraint on the weights as in (7).

Acknowledgements

This work was supported in part by NSERC grant OGP000270, Canadian National Networks of Centers of Excellence grant 293, and OTKA grant F014174.

References

[1] A. R. Barron. Complexity regularization with application to artificial neural networks. In G. Roussas, editor, Nonparametric Functional Estimation and Related Topics, pages 561-576. NATO ASI Series, Kluwer Academic Publishers, Dordrecht, 1991.

[2] A. R. Barron. Approximation and estimation bounds for artificial neural networks. Machine Learning, 14:115-133, 1994.

[3] C. Darken, M. Donahue, L. Gurvits, and E. Sontag. 
Rate of approximation results motivated by robust neural network learning. In Proc. Sixth Annual Workshop on Computational Learning Theory, pages 303-309. Morgan Kaufmann, 1993.

[4] F. Girosi and G. Anzellotti. Rates of convergence for radial basis functions and neural networks. In R. J. Mammone, editor, Artificial Neural Networks for Speech and Vision, pages 97-113. Chapman & Hall, London, 1993.

[5] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural network architectures. Neural Computation, 7:219-267, 1995.

[6] A. Krzyzak, T. Linder, and G. Lugosi. Nonparametric estimation and classification using radial basis function nets and empirical risk minimization. IEEE Transactions on Neural Networks, 7(2):475-487, March 1996.

[7] W. S. Lee, P. L. Bartlett, and R. C. Williamson. Efficient agnostic learning of neural networks with bounded fan-in. To be published in IEEE Transactions on Information Theory, 1995.

[8] G. Lugosi and K. Zeger. Concept learning using complexity regularization. IEEE Transactions on Information Theory, 42:48-54, 1996.

[9] P. Niyogi and F. Girosi. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation, 8:819-842, 1996.

[10] V. N. Vapnik. Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York, 1982.
", "award": [], "sourceid": 1184, "authors": [{"given_name": "Adam", "family_name": "Krzyzak", "institution": null}, {"given_name": "Tam\u00e1s", "family_name": "Linder", "institution": null}]}