{"title": "Algorithms for Better Representation and Faster Learning in Radial Basis Function Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 482, "page_last": 489, "abstract": null, "full_text": "482 \n\nSaba and Keeler \n\nAlgorithms/or Better Representation and \n\nFaster Learning in Radial \nBasis Function Networks \n\nA vijit Saba 1 \n\nJames D. Keeler \n\nMicroelectronics and Computer Technology corporation \n\n3500 West Balcones Center Drive \n\nAustin, Tx 78759 \n\nABSTRACT \n\nin \n\nlearning, using radial basis functions \n\nIn this paper we present upper bounds for the learning rates for \nhybrid models that employ a combination of both self-organized \nand supervised \nto build \nreceptive field representations \nthe hidden units. The learning \nperformance in such networks with nearest neighbor heuristic can \nbe improved upon by multiplying the individual receptive field \nwidths by a suitable overlap factor. We present results indicat!ng \noptimal values for such overlap factors. We also present a new \nalgorithm for determining receptive field centers. This method \nnegotiates more hidden units in the regions of the input space as a \nfunction of the output and is conducive to better learning when the \nnumber of patterns (hidden units) is small. \n\n1 INTRODUCTION \nFunctional approximation of experimental data ongmating from a continuous \ndynamical process is an important problem. Data is usually available in the form of \na set S consisting of {x,y} pairs, where x is a input vector and y is the corresponding \noutput vector. In particular, we consider networks with a single layer of hidden \nunits and the jth output unit computes Yj = L fa Ra { xj ' xa ' (J'a}' where, Yj is the \n1 \nUniversity of Texas at Austin, Dept. of ECE, Austin TX 78712 \n\n\fAlgorithms for Better Representation \n\n483 \n\nthe a.th \nnetwork output due to input xj, fa. 
is the synaptic weight associated with the ath hidden neuron and the jth output unit; R_a(x_j, x_a, sigma_a) is the Radial Basis Function (RBF) response of the ath hidden neuron. This technique of using a superposition of RBFs for approximation has been considered before by [Medgassy '61] and more recently by [Moody '88], [Casdagli '89] and [Poggio '89]. RBF networks are particularly attractive since such networks are potentially 1000 times faster than the ubiquitous backpropagation network for comparable error rates [Moody '88]. \n\nThe essence of the network model we consider is described in [Moody '88]. A typical network that implements a receptive field response consists of a layer of linear input units, a layer of linear output units and an intermediate (hidden) layer of nonlinear response units. Weights are associated only with the links connecting the hidden layer to the output layer. For the single-output case the real-valued functional mapping f: R^D -> R is characterized by the following equations: \n\nO(x_i) = sum_a f_a R_a(x_i)   (1) \n\nO(x_i) = sum_a f_a R_a(x_i) / sum_a R_a(x_i)   (2) \n\nR_a(x_i) = exp( - ( ||x_a - x_i|| / sigma_a )^2 )   (3) \n\nwhere x_a is a real-valued vector associated with the ath receptive field (hidden) unit and is of the same dimension as the input. The output can be normalized by the sum of the responses of the hidden units due to any input, and the expression for the output using the normalized response function is given in Equation 2. The x_a values are the centers of the receptive field units and the sigma_a values are their widths. Training in such networks can be performed as a two-stage hybrid combination of independent processes. In the first stage, a clustering of the input data is performed. 
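As a concrete illustration of Equations 1-3, the following is a minimal sketch of the receptive-field forward pass; the function and variable names are ours, not from the paper.

```python
import numpy as np

def rbf_responses(x, centers, widths):
    # Gaussian receptive-field responses, Equation 3:
    # R_a(x) = exp( - ( ||x_a - x|| / sigma_a )^2 )
    d = np.linalg.norm(centers - x, axis=1)
    return np.exp(-(d / widths) ** 2)

def rbf_output(x, centers, widths, f, normalized=False):
    # Network output: Equation 1, or Equation 2 when normalized=True.
    R = rbf_responses(x, centers, widths)
    return f @ R / R.sum() if normalized else f @ R
```

A unit whose center coincides with the input responds with exactly 1, so with the plain response function the output there is simply that unit's weight.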
The objective of this clustering algorithm is to establish appropriate x_a values for each of the receptive field units such that the cluster points represent the input distribution in the best possible manner. We use competitive learning with the nearest-neighbor heuristic as our clustering algorithm (Equation 5). The degree or quality of clustering achieved is quantified by the sum-square measure in Equation 4, which is the objective function we are trying to minimize in the clustering phase. \n\nTSS_KMEANS = sum_i ( x_a-closest - x_i )^2   (4) \n\nx_a-closest <- x_a-closest + lambda ( x_i - x_a-closest )   (5) \n\nAfter suitable cluster points (x_a values) are determined, the next step is to determine the sigma_a values, or widths, for each of the receptive fields. Once again we use the nearest-neighbor heuristic, where sigma_a (the width of the ath neuron) is set equal to the Euclidean distance between x_a and its nearest neighbor. Once the receptive field centers x_a and the widths sigma_a are found, the receptive field responses can be calculated for any input using Equation 3. Finally, the f_a values, or weights on the links connecting the hidden layer units to the output, are determined using the well-known gradient descent learning rule. Pseudo-inverse methods are usually impractical in these problems. The objective function and the weight update rule are given by Equations 6 and 7: \n\nE = sum_i ( O(x_i) - t_i )^2   (6) \n\nf_a <- f_a + eta ( t_i - O(x_i) ) R_a(x_i)   (7) \n\nwhere i indexes the input patterns, x_i is the input vector and t_i is the target output for the ith pattern. \n\n2 LEARNING RATES \n\nIn this section we present an adaptive formulation for the network learning rate eta (Equation 7). 
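The two-stage training procedure of Equations 4-7 can be sketched as follows; this is a simplified illustration under our own naming, with lambda and eta the learning rates of Equations 5 and 7.

```python
import numpy as np

def cluster_step(centers, x, lam=0.05):
    # Competitive learning with the nearest-neighbor heuristic (Equation 5):
    # only the center closest to the input x moves toward it.
    k = np.argmin(np.linalg.norm(centers - x, axis=1))
    centers[k] += lam * (x - centers[k])
    return centers

def nearest_neighbor_widths(centers):
    # sigma_a = Euclidean distance from each center to its nearest neighbor.
    D = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    return D.min(axis=1)

def weight_update(f, x, t, centers, widths, eta=0.1):
    # Per-pattern gradient-descent step on the output weights (Equation 7).
    R = np.exp(-(np.linalg.norm(centers - x, axis=1) / widths) ** 2)
    return f + eta * (t - f @ R) * R
```

One pass of clustering fixes the centers, the widths follow mechanically from the center geometry, and only the final linear stage is trained supervised, which is the source of the speed advantage over backpropagation noted above.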
Learning rates (eta) in such networks that use gradient descent are usually chosen in an ad hoc fashion, and a conservative value for eta is usually sufficient. However, there are two problems with such an approach. If the learning rate is not small enough, the TSS (Total Sum of Squares) measure can diverge to high values instead of decreasing. A very conservative estimate, on the other hand, will work with almost all sets of data but will unnecessarily slow down the learning process. The choice of learning rate is crucial, since for real-time or hardware implementations of such systems there is very little scope for interactive monitoring. \n\nThis problem is addressed by Theorem 1. We present the proof of this theorem for the special case of a single output. In the gradient descent algorithm, weight updates can be performed after each presentation of the entire set of patterns (per-epoch basis) or after each pattern (per-pattern basis); both cases are considered. Equation p.3 gives the upper bound for eta when updates are done on a per-epoch basis; only positive values of eta should be considered. Equations p.4 and p.5 give the bounds for eta when updates are done on a per-pattern basis without and with the normalized response function, respectively. We present some simulation results for the logistic map ( x(t+1) = r x(t) [ 1 - x(t) ] ) data in Figure 1. The plots are shown only for the normalized response case, and the learning rate was set to eta = mu ( ( sum_a R_a )^2 / sum_a (R_a)^2 ). We used a fixed number of 20 hidden units, and r was set to 4.0. The network TSS did not diverge until mu was set arbitrarily close to 1. This is shown in Figure 1, which indicates that, with the normalized response function, if the sum of squares of the hidden unit responses is nearly equal to the square of the sum of the responses, then a high effective learning rate (eta) can be used. 
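The per-pattern bounds p.4 and p.5 can be computed directly. The following sketch (our own naming) checks that with the update Delta f_a = 2 eta e R_a used in the proof of Theorem 1, the per-pattern squared error contracts exactly when eta stays below the bound:

```python
import numpy as np

def eta_bound(R, normalized=False):
    # Per-pattern learning-rate bounds (Equations p.4 and p.5):
    # 1 / sum(R^2) plain, (sum R)^2 / sum(R^2) with normalized responses.
    return R.sum() ** 2 / (R ** 2).sum() if normalized else 1.0 / (R ** 2).sum()

def squared_error_after_step(e, R, eta):
    # One per-pattern step with Delta f_a = 2*eta*e*R_a: the new error is
    # e * (1 - 2*eta*sum(R^2)), so the squared error shrinks iff eta < 1/sum(R^2).
    return (e * (1.0 - 2.0 * eta * (R ** 2).sum())) ** 2
```

Since the normalized bound equals (sum R)^2 / sum(R^2), it approaches the number of hidden units when all responses are nearly equal, which is why a high effective rate is safe in that regime.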
\n\nTheorem 1: The TSS measure of a network will be decreasing in time, provided the learning rate eta does not exceed sum_i e_i sum_a E_a R_ai / sum_i ( sum_a E_a R_ai )^2 if the network is trained on a per-epoch basis, and 1 / sum_a (R_ai)^2 when updates are done on a per-pattern basis. With the normalized response function, the upper bound for the learning rate is ( sum_a R_ai )^2 / sum_a (R_ai)^2. Note the similar result of [Widrow '85]. \n\nTSS(t) = sum_i ( t_i - sum_a f_a R_ai )^2   (p.1) \n\nTSS(t+1) = sum_i ( t_i - sum_a ( f_a + Delta f_a ) R_ai )^2   (p.2) \n\nwhere N is the number of exemplars, K is the number of receptive fields and t_i is the ith target output. \n\nFor stability, we impose the condition TSS(t) - TSS(t+1) >= 0. Writing e_i = t_i - sum_a f_a R_ai and E_a = sum_i e_i R_ai, substituting Delta f_a = 2 eta E_a (the factor 2 arising from the derivative of the square) in Equations p.1 and p.2, and expanding, we have: \n\nTSS(t) - TSS(t+1) = 4 eta sum_i e_i sum_a E_a R_ai - 4 eta^2 sum_i ( sum_a E_a R_ai )^2 \n\nFrom the above expression it follows that for stability in per-epoch training, the upper bound for the learning rate eta is given by: \n\neta <= sum_i e_i sum_a E_a R_ai / sum_i ( sum_a E_a R_ai )^2   (p.3) \n\nIf updates are done on a per-pattern basis, then N = 1, and dropping the summation over i we obtain the following bound: \n\neta <= 1 / sum_a (R_ai)^2   (p.4) \n\nWith the normalized response function the upper bound for the learning rate is: \n\neta <= ( sum_a R_ai )^2 / sum_a (R_ai)^2   (p.5) \n\nQ.E.D. \n\nFigure 1: Normalized error vs. 
fraction (mu) of maximum allowable learning rate. \n\n3 EFFECT OF WIDTH (sigma) ON APPROXIMATION ERROR \n\nIn the nearest-neighbor heuristic the sigma_a values of the hidden units are set equal to the Euclidean distance between each unit's center and the center of its nearest neighbor. This method is preferred mainly because it is computationally inexpensive. However, the performance can be improved by increasing the overlap between nearby hidden unit responses. This is done by multiplying the widths obtained with the nearest-neighbor heuristic by an overlap factor m, as shown in Equation 3.1: \n\nsigma_a = m ||x_a - x_a-nearest||   (3.1) \n\nwhere ||.|| is the Euclidean distance norm. \n\nFigure 2: Normalized errors vs. overlap factor for the logistic map (10 and 20 hidden units, r = 4.0). \n\nIn Figures 2 and 3 we show the network performance (normalized error) as a function of m. In the logistic map case a value of r = 4.0 was used, predicting 1 timestep into the future; the training set size was 10 times the number of hidden units and the test set size was 70 patterns. The results for the Mackey-Glass data are with parameter values a = 0.2, b = 0.1, Delta = 6, D = 4. The number of training patterns was 10 times the number of hidden units, and the normalized error was evaluated based on the presentation of 900 unseen patterns. For the Mackey-Glass data the optimal values were rather well-defined, whereas for the logistic map case we found that the optimal values were spread out over a range. 
\n\nFigure 3: Normalized errors vs. overlap factor for varying number of hidden units (50 to 900), Mackey-Glass data. \n\n4 EXTENDED METRIC CLUSTERING \n\nIn this method clustering is done in higher dimensions. In our experiments we set the initial K hidden unit center values based on the first K exemplars. The receptive fields are assigned vector values of dimension determined by the sizes of the input and output vectors: each center value is set equal to the vector obtained by concatenating the input and the corresponding output. During the clustering phase the output y_i is concatenated with the input x_i and presented to the hidden layer. This method finds cluster points in the (I+O)-dimensional space of the input-output map, as defined by Equations 4.1, 4.2 and 4.3: \n\nX_a = < x_a, y_a >   (4.1) \n\nX_i = < x_i, y_i >   (4.2) \n\nX_a-new = X_a-old + lambda ( X_i - X_a-old )   (4.3) \n\nOnce the cluster points, or centers, are determined, we disable the output field, and only the input field is used for computing the widths and receptive field responses. In Figure 4 we present a comparison of the performance of such a network with and without the enhanced metric clustering. Variable-size networks of only Gaussian RBF units were used. The plots presented are for the Mackey-Glass data with the same parameter values used in [Farmer '88]. This method works significantly better when the number of hidden units is low. 
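A minimal sketch of one extended metric clustering step (Equations 4.1-4.3), with hypothetical names of our own; after clustering, only the input coordinates of each center are retained for computing widths and responses:

```python
import numpy as np

def extended_metric_step(centers_io, x, y, lam=0.05):
    # Cluster in the joint (input + output) space: the concatenated exemplar
    # <x, y> (Equation 4.2) attracts its nearest concatenated center (4.3).
    z = np.concatenate([x, y])
    k = np.argmin(np.linalg.norm(centers_io - z, axis=1))
    centers_io[k] += lam * (z - centers_io[k])
    return centers_io

def input_fields(centers_io, input_dim):
    # After clustering the output fields are disabled; only the input part
    # of each center is used for widths and receptive-field responses.
    return centers_io[:, :input_dim]
```

Because the distance is measured in the joint space, inputs whose outputs differ sharply attract distinct centers, which is how the method allocates more hidden units where the output map varies.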
\n\nFigure 4: Performance of the enhanced metric clustering algorithm vs. the nearest-neighbor method (normalized error vs. number of units, Mackey-Glass data, a = 0.2, b = 0.1, D = 4, Delta = 6). \n\n5 CONCLUSIONS \n\nOne of the emerging application areas for neural network models is real-time signal processing. For such applications and for hardware implementations, adaptive methods for determining network parameters are essential, and our derivations of learning-rate bounds are important in such situations. We have presented results indicating that in RBF networks, performance can be improved by tuning the receptive field widths by a suitable overlap factor. We have also presented an extended metric algorithm that negotiates hidden units based on added output information. We have observed more than 20% improvement in the normalized error measure when the number of training patterns, and therefore the number of hidden units, used is reasonably small. \n\nReferences \n\nM. Casdagli. (1989). \"Nonlinear Prediction of Chaotic Time Series\". Physica 35D, 335-356. \n\nJ. D. Farmer and J. J. Sidorowich. (1988). \"Exploiting Chaos to Predict the Future and Reduce Noise\". Tech. Report No. LA-UR-88-901, Los Alamos National Laboratory. \n\nJohn Moody and Christian Darken. (1989). \"Learning with Localized Receptive Fields\". In: D. Touretzky, G. Hinton and T. Sejnowski (eds.), Proceedings of the 1988 Connectionist Models Summer School. Morgan Kaufmann, San Mateo, CA. \n\nP. Medgassy. (1961). Decomposition of Superposition of Distribution Functions. Publishing House of the Hungarian Academy of Sciences, Budapest. \n\nT. Poggio and F. Girosi. (1989). 
\"A Theory of Networks for Approximation and \nLearning\". A.I. Memo No. 1140, Massachusetts Institute of Technology. \nB. Widrow and S. Stearns (1985). Adaptive Signal Processing. Prentice-Hall Inc., \nEnglewood Cliffs, NJ, pp 49,102. \n\n\f", "award": [], "sourceid": 242, "authors": [{"given_name": "Avijit", "family_name": "Saha", "institution": null}, {"given_name": "James", "family_name": "Keeler", "institution": null}]}