{"title": "Leaning by Combining Memorization and Gradient Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 714, "page_last": 720, "abstract": "", "full_text": "Learning by Combining Memorization \n\nand Gradient Descent \n\nJohn C. Platt \nSynaptics, Inc. \n\n2860 Zanker Road, Suite 206 \n\nSan Jose, CA 95134 \n\nABSTRACT \n\nWe have created a radial basis function network that allocates a \nnew computational unit whenever an unusual pattern is presented \nto the network. The network learns by allocating new units and \nadjusting the parameters of existing units. If the network performs \npoorly on a presented pattern, then a new unit is allocated which \nmemorizes the response to the presented pattern. If the network \nperforms well on a presented pattern, then the network parameters \nare updated using standard LMS gradient descent. For predicting \nthe Mackey Glass chaotic time series, our network learns much \nfaster than do those using back-propagation and uses a comparable \nnumber of synapses. \n\nINTRODUCTION \n\n1 \nCurrently, networks that perform function interpolation tend to fall into one of two \ncategories: networks that use gradient descent for learning (e.g., back-propagation), \nand constructive networks that use memorization for learning (e.g., k-nearest neigh(cid:173)\nbors). \n\nNetworks that use gradient descent for learning tend to form very compact repre(cid:173)\nsentations, but use many learning cycles to find that representation. Networks that \nmemorize their inputs need to only be exposed to examples once, but grow linearly \nin the training set size. \n\nThe network presented here strikes a compromise between memorization and gradi(cid:173)\nent descent. It uses gradient descent for the \"easy\" input vectors and memorization \nfor the \"hard\" input vectors. 
If the network performs well on a particular input vector, or the particular input vector is already close to a stored vector, then the network adjusts its parameters using gradient descent. Otherwise, it memorizes the input vector and the corresponding output vector by allocating a new unit. The explicit storage of an input-output pair means that the pair can be used immediately to improve the performance of the system, instead of merely contributing information to gradient descent. \nThe network, called the resource-allocating network (RAN), uses units whose response is localized in input space. A unit with a non-local response must undergo gradient descent, because it has a non-zero output for a large fraction of the training data. \n\nBecause RAN is a constructive network, it automatically adjusts the number of units to reflect the complexity of the function being interpolated. Fixed-size networks either use too few units, in which case the network memorizes poorly, or too many, in which case the network generalizes poorly. Parzen windows and k-nearest neighbors both require a number of stored patterns that grows linearly with the number of presented patterns. With RAN, the number of stored patterns grows sublinearly, and eventually reaches a maximum. \n\n1.1 PREVIOUS WORK \nPrevious workers have used networks with localized basis functions (Broomhead & Lowe, 1988) (Moody & Darken, 1988, 1989) (Poggio & Girosi, 1990). Moody has further extended his work by incorporating a hash table lookup (Moody, 1989). The hash table is a resource-allocating network in which a value in the hash table becomes non-zero only if its entry is activated by the presence of non-zero input probability. 
\nThe RAN adjusts the centers of the Gaussian units based on the error at the output, like (Poggio & Girosi, 1990). Networks with centers placed on a high-dimensional grid, such as (Broomhead & Lowe, 1988) and (Moody, 1989), or networks that use unsupervised clustering for center placement, such as (Moody & Darken, 1988, 1989), generate larger networks than RAN, because they cannot move the centers to increase the accuracy. \n\nPrevious workers have also created function interpolation networks that allocate fewer units than the size of the training set. Cascade-correlation (Fahlman & Lebiere, 1990), SONN (Tenorio & Lee, 1989), and MARS (Friedman, 1988) all construct networks by adding units. These algorithms work well. The RAN algorithm improves on them by making the addition of a unit as simple as possible: RAN uses simple algebra to find the parameters of a new unit, while cascade-correlation and MARS use gradient descent and SONN uses simulated annealing. \n\n2 THE ALGORITHM \nThis section describes the resource-allocating network (RAN), which consists of a network, a strategy for allocating new units, and a learning rule for refining the network. \n\n2.1 THE NETWORK \nThe RAN is a two-layer radial-basis-function network. The first layer consists of units that respond to only a local region of the space of input values. The second layer linearly aggregates the outputs of these units and creates the function that approximates the input-output mapping over the entire space. \nA simple function that implements a locally tuned unit is a Gaussian: \n\nz_j = sum_k (c_jk - I_k)^2, \nx_j = exp(-z_j / w_j^2). \n\n(1) \n\nWe use a C1-continuous polynomial approximation to speed up the algorithm, without loss of network accuracy: \n\nx_j = (1 - z_j / (q w_j^2))^2, if z_j < q w_j^2; \nx_j = 0, otherwise; \n\n(2) \n\nwhere q = 2.67 is chosen empirically to give the best fit to a Gaussian. 
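As an illustration, the unit response of equations (1) and (2) can be written as follows. This is a sketch, not the original implementation; the function and variable names are our own.

```python
import numpy as np

Q = 2.67  # empirical constant from equation (2)

def unit_response(I, c, w, exact=False):
    """Response x_j of a locally tuned unit with center c and width w.

    exact=True uses the Gaussian of equation (1); otherwise the
    C1-continuous polynomial approximation of equation (2) is used.
    """
    z = np.sum((c - I) ** 2)                     # z_j of equation (1)
    if exact:
        return np.exp(-z / w ** 2)               # equation (1)
    if z < Q * w ** 2:
        return (1.0 - z / (Q * w ** 2)) ** 2     # equation (2)
    return 0.0                                   # silent outside the local region
```

Both forms equal 1 at the center, and the polynomial falls smoothly to exactly zero at z_j = q w_j^2, which is what makes it cheaper than the Gaussian to evaluate over many units.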
\nEach output of the network, y_i, is a sum of the unit outputs x_j, each weighted by the synaptic strength h_ij, plus a global polynomial. The x_j represent information about local parts of the space, while the polynomial represents global information: \n\ny_i = sum_j h_ij x_j + sum_k L_ik I_k + gamma_i. \n\n(3) \n\nThe h_ij x_j terms can be thought of as bumps that are added to or subtracted from the polynomial term sum_k L_ik I_k + gamma_i to yield the desired function. \nThe linear term is useful when the function has a strong linear component. In the results section, the Mackey-Glass equation was predicted with only a constant term. \n\n2.2 THE LEARNING ALGORITHM \nThe network starts with a blank slate: no patterns are yet stored. As patterns are presented to it, the network chooses to store some of them. At any given point the network has a current state, which reflects the patterns that have been stored previously. \nThe allocator may allocate a new unit to memorize a pattern. After the new unit is allocated, the network output is equal to the desired output f. Let the index of this new unit be n. \nThe peak of the response of the newly allocated unit is set to the memorized input vector I, \n\nc_nk = I_k. \n\n(4) \n\nThe linear synapses on the second layer are set to the difference between the desired output and the current output of the network, \n\nh_in = f_i - y_i(I). \n\n(5) \n\nThe width of the response of the new unit is proportional to the distance from the nearest stored vector to the novel input vector, \n\nw_n = kappa ||I - c_nearest||, \n\n(6) \n\nwhere kappa is an overlap factor. As kappa grows larger, the responses of the units overlap more and more. \nThe RAN uses a two-part novelty condition. An input-output pair (I, f) should be memorized if the input is far away from the existing centers, \n\n||I - c_nearest|| > delta(t), \n\n(7) \n\nand if the difference between the desired output and the output of the network is large, \n\n||f - y(I)|| > epsilon. 
\n\n(8) \n\nTypically, epsilon is the desired accuracy of the output of the network. Errors larger than epsilon are immediately corrected by the allocation of a new unit, while errors smaller than epsilon are gradually repaired using gradient descent. The distance delta(t) is the scale of resolution that the network is fitting at the t-th input presentation. The learning starts with delta(t) = delta_max, which is the largest length scale of interest, typically the size of the entire input space of non-zero probability density. The distance delta(t) shrinks until it reaches delta_min, which is the smallest length scale of interest. The network will average over features that are smaller than delta_min. We used the function \n\ndelta(t) = max(delta_max exp(-t / tau), delta_min), \n\n(9) \n\nwhere tau is a decay constant. \nAt first, the system creates a coarse representation of the function, then refines the representation by allocating units with smaller and smaller widths. Finally, when the system has learned the entire function to the desired accuracy and length scale, it stops allocating new units altogether. \nThe two-part novelty condition is necessary for creating a compact network. If only condition (7) is used, then the network will allocate units instead of using gradient descent to correct small errors. If only condition (8) is used, then fine-scale units may be allocated to represent coarse-scale features, which is wasteful. \nBy allocating new units, the RAN represents the desired function ever more closely as the network is trained. Fewer units are needed for a given accuracy if the first-layer synapses c_jk, the second-layer synapses h_ij, and the parameters of the global polynomial, gamma_i and L_ik, are adjusted to decrease the error E = ||y - f||^2 (Widrow & Hoff, 1960). We use gradient descent on the second-layer synapses to decrease the error whenever a new unit is not allocated: \n\nDelta h_ij = alpha (f_i - y_i) x_j, \nDelta gamma_i = alpha (f_i - y_i), \nDelta L_ik = alpha (f_i - y_i) I_k. 
\n\n(10) \n\nIn addition, we adjust the centers of the responses of the units to decrease the error: \n\nDelta c_jk = (2 alpha / w_j^2) (I_k - c_jk) x_j sum_i (f_i - y_i) h_ij. \n\n(11) \n\nEquation (11) is derived from gradient descent and equation (1). Empirically, equation (11) also works for the polynomial approximation (2). \n\n3 RESULTS \nOne application of an interpolating RAN is the prediction of complex time series. As a test case, a chaotic time series can be generated with a nonlinear algebraic or differential equation. Such a series has some short-range time coherence, but long-term prediction is very difficult. \nThe RAN was tested on a particular chaotic time series created by the Mackey-Glass delay-difference equation: \n\nx(t + 1) = (1 - b) x(t) + a x(t - tau) / (1 + x(t - tau)^10), \n\n(12) \n\nfor a = 0.2, b = 0.1, and tau = 17. We trained the network to predict the value x(T + dT), given the values x(T), x(T - 6), x(T - 12), and x(T - 18) as inputs. \nThe network was tested using two different learning modes: off-line learning with a limited amount of data, and on-line learning with a large amount of data. The Mackey-Glass equation has been learned off-line by other workers, using the back-propagation algorithm (Lapedes & Farber, 1987) and radial basis functions (Moody & Darken, 1989). We used RAN to predict the Mackey-Glass equation with the following parameters: alpha = 0.02, 400 learning epochs, delta_max = 0.7, kappa = 0.87, and delta_min = 0.07 reached after 100 epochs. RAN was simulated using epsilon = 0.02 and epsilon = 0.05. In all cases, dT = 85. \nFigure 1 shows the efficiency of the various learning algorithms: the smallest, most accurate algorithms are towards the lower left. When optimized for size of network (epsilon = 0.05), the RAN has about as many weights as back-propagation and is just as accurate. 
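The series in equation (12) is straightforward to iterate directly. The sketch below is illustrative; the constant initial history x0 = 1.2 is our assumption, since the paper does not state its initial conditions.

```python
import numpy as np

def mackey_glass(n, a=0.2, b=0.1, tau=17, x0=1.2):
    """Generate n samples of the Mackey-Glass delay-difference
    equation (12): x(t+1) = (1-b) x(t) + a x(t-tau) / (1 + x(t-tau)^10).
    """
    x = np.full(n + tau + 1, x0)   # constant warm-up history (assumed)
    for t in range(tau, n + tau):
        x[t + 1] = (1 - b) * x[t] + a * x[t - tau] / (1 + x[t - tau] ** 10)
    return x[tau + 1:]             # drop the warm-up history
```

From such a series, each training pair takes x(T), x(T - 6), x(T - 12), and x(T - 18) as the input vector and x(T + dT) as the desired output.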
The efficiency of RAN is roughly the same as back-propagation, but RAN requires much less computation: RAN takes approximately 8 minutes of SUN-4 CPU time to reach the accuracy listed in figure 1, while back-propagation took approximately 30-60 minutes of Cray X-MP time. \nThe Mackey-Glass equation has been learned using on-line techniques by hashing B-splines (Moody, 1989). We used on-line RAN with the following parameters: alpha = 0.05, delta_max = 0.7, delta_min = 0.07, kappa = 0.87, and delta_min reached after 5000 input presentations. Table 1 compares the on-line error versus the size of the network for both RAN and the hashing B-spline (Moody, personal communication). In both cases, dT = 50. The RAN algorithm has accuracy similar to the hashing B-splines, but the number of units allocated is between a factor of 2 and 8 smaller. \nFor more detailed results on the Mackey-Glass equation, see (Platt, 1991). \n\n[Figure 1 appears here: log-log plot of test-set error versus number of weights (100 to 100,000), comparing RAN, hashing B-spline, standard RBF, K-means RBF, and back-propagation.] \n\nFigure 1: The error on a test set versus the size of the network. Back-propagation stores the prediction function very compactly and accurately, but takes a large amount of computation to form the compact representation. RAN is as compact and accurate as back-propagation, but uses much less computation to form its representation. 
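Putting the pieces of section 2 together, the on-line procedure can be sketched as below. This is a simplified illustration, not the original code: it uses a scalar output, only the constant term gamma of equation (3), the exact Gaussian of equation (1), and omits the center update of equation (11); the fallback width for the very first unit is our assumption.

```python
import numpy as np

class RAN:
    """Simplified sketch of the resource-allocating network."""

    def __init__(self, dim, eps=0.05, kappa=0.87, delta_max=0.7,
                 delta_min=0.07, tau=5000.0, alpha=0.05):
        self.eps, self.kappa, self.alpha = eps, kappa, alpha
        self.delta_max, self.delta_min, self.tau = delta_max, delta_min, tau
        self.t = 0
        self.c = np.empty((0, dim))   # unit centers c_jk
        self.w = np.empty(0)          # unit widths w_j
        self.h = np.empty(0)          # second-layer synapses h_j
        self.gamma = 0.0              # constant polynomial term

    def _x(self, I):
        """Unit responses, equation (1)."""
        if len(self.w) == 0:
            return np.empty(0)
        z = np.sum((self.c - I) ** 2, axis=1)
        return np.exp(-z / self.w ** 2)

    def predict(self, I):
        """Network output, equation (3) with the constant term only."""
        return float(self.h @ self._x(I)) + self.gamma

    def train(self, I, f):
        self.t += 1
        # resolution scale, equation (9)
        delta = max(self.delta_max * np.exp(-self.t / self.tau), self.delta_min)
        y = self.predict(I)
        d = (np.sqrt(np.sum((self.c - I) ** 2, axis=1)).min()
             if len(self.w) else np.inf)
        if d > delta and abs(f - y) > self.eps:   # novelty conditions (7), (8)
            self.c = np.vstack([self.c, I])       # equation (4)
            self.h = np.append(self.h, f - y)     # equation (5)
            # equation (6); the delta fallback for the first unit is assumed
            self.w = np.append(self.w, self.kappa * (d if np.isfinite(d) else delta))
        else:                                     # LMS step, equation (10)
            x = self._x(I)
            self.h += self.alpha * (f - y) * x
            self.gamma += self.alpha * (f - y)
```

A newly allocated unit reproduces the stored pair exactly at its center, which is why allocation corrects a large error immediately, while the LMS branch only reduces small errors gradually.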
\n\nTable 1: Comparison between RAN and hashing B-splines \n\nMethod                                    Number of Units   Normalized RMS Error \nRAN, epsilon = 0.05                       50                0.071 \nRAN, epsilon = 0.02                       143               0.054 \nHashing B-spline, 1 level of hierarchy    284               0.074 \nHashing B-spline, 2 levels of hierarchy   1166              0.044 \n\n4 CONCLUSIONS \nThere are several desirable attributes for a network that learns: it should learn quickly, it should learn accurately, and it should form a compact representation. Formation of a compact representation is particularly important for networks that are implemented in hardware, because silicon area is at a premium. A compact representation is also important for statistical reasons: a network that has too many parameters can overfit the data and generalize poorly. \n\nMany previous network algorithms either learned quickly at the expense of a compact representation, or formed a compact representation only after laborious computation. The RAN is a network that can find a compact representation with a reasonable amount of computation. \n\nAcknowledgements \nThanks to Carver Mead, Carl Ruoff, and Fernando Pineda for useful comments on the paper. Special thanks to John Moody, who not only provided useful comments on the paper, but also provided data on the hashing B-splines. \n\nReferences \nBroomhead, D., Lowe, D., 1988, Multivariable function interpolation and adaptive networks, Complex Systems, 2, 321-355. \nFahlman, S. E., Lebiere, C., 1990, The Cascade-Correlation Learning Architecture, In: Advances in Neural Information Processing Systems 2, D. Touretzky, ed., 524-532, Morgan-Kaufmann, San Mateo. \nFriedman, J. H., 1988, Multivariate Adaptive Regression Splines, Department of Statistics, Stanford University, Tech. Report LCSI02. 
\nLapedes, A., Farber, R., 1987, Nonlinear Signal Processing Using Neural Networks: Prediction and System Modeling, Technical Report LA-UR-87-2662, Los Alamos National Laboratory, Los Alamos, NM. \nMoody, J., Darken, C., 1988, Learning with Localized Receptive Fields, In: Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, T. Sejnowski, eds., 133-143, Morgan-Kaufmann, San Mateo. \nMoody, J., Darken, C., 1989, Fast Learning in Networks of Locally-Tuned Processing Units, Neural Computation, 1(2), 281-294. \nMoody, J., 1989, Fast Learning in Multi-Resolution Hierarchies, In: Advances in Neural Information Processing Systems 1, D. Touretzky, ed., 29-39, Morgan-Kaufmann, San Mateo. \nPlatt, J., 1991, A Resource-Allocating Network for Function Interpolation, Neural Computation, 3(2), to appear. \nPoggio, T., Girosi, F., 1990, Regularization Algorithms for Learning that are Equivalent to Multilayer Networks, Science, 247, 978-982. \nPowell, M. J. D., 1987, Radial Basis Functions for Multivariable Interpolation: A Review, In: Algorithms for Approximation, J. C. Mason, M. G. Cox, eds., Clarendon Press, Oxford. \nTenorio, M. F., Lee, W., 1989, Self-Organizing Neural Networks for the Identification Problem, In: Advances in Neural Information Processing Systems 1, D. Touretzky, ed., 57-64, Morgan-Kaufmann, San Mateo. \nWidrow, B., Hoff, M., 1960, Adaptive Switching Circuits, In: 1960 IRE WESCON Convention Record, 96-104, IRE, New York. \n", "award": [], "sourceid": 325, "authors": [{"given_name": "John", "family_name": "Platt", "institution": null}]}