{"title": "Learning Mackey-Glass from 25 examples, Plus or Minus 2", "book": "Advances in Neural Information Processing Systems", "page_first": 1135, "page_last": 1142, "abstract": null, "full_text": "Learning Mackey-Glass from 25 \n\nexamples, Plus or Minus 2 \n\nMark Plutowski\u00b7 Garrison Cottrell\u00b7 Halbert White\u00b7\u00b7 \n\nInstitute for Neural Computation \n\n*Department of Computer Science and Engineering \n\n**Department of Economics \n\nUniversity of California, San Diego \n\nLa J oHa, CA 92093 \n\nAbstract \n\nWe apply active exemplar selection (Plutowski &. White, 1991; \n1993) to predicting a chaotic time series. Given a fixed set of ex(cid:173)\namples, the method chooses a concise subset for training. Fitting \nthese exemplars results in the entire set being fit as well as de(cid:173)\nsired. The algorithm incorporates a method for regulating network \ncomplexity, automatically adding exempla.rs and hidden units as \nneeded. Fitting examples generated from the Mackey-Glass equa(cid:173)\ntion with fractal dimension 2.1 to an rmse of 0.01 required about 25 \nexemplars and 3 to 6 hidden units. The method requires an order \nof magnitude fewer floating point operations than training on the \nentire set of examples, is significantly cheaper than two contend(cid:173)\ning exemplar selection techniques, and suggests a simpler active \nselection technique that performs comparably. \n\n1 \n\nIntroduction \n\nPlutowski &. White (1991; 1993), have developed a method of active selection of \ntraining exemplars for network learning. Active selection uses information about \nthe state of the network when choosing new exemplars. The approach uses the sta(cid:173)\ntistical sampling criterion Integrated Squared Bias (ISB) to derive a greedy selection \nmethod that picks the training example maximizing the decrement in this measure. \n(ISB is a special case of the more familiar Integrated Mean Squared Error in the \ncase that noise variance is zero.) 
We refer to this method as ΔISB. The method automatically regulates network complexity by growing the network as necessary to fit the selected exemplars, and terminates when the model fits the entire set of available examples to the desired accuracy. Hence the method is a nonparametric regression technique. In this paper we show that the method is practical by applying it to the Mackey-Glass time series prediction task. We compare ΔISB with the method of training on all the examples. ΔISB consistently learns the time series from a small subset of the available examples, finding solutions equivalent to solutions obtained using all of the examples. The networks obtained by ΔISB consistently perform better on test data for single step prediction, and do at least as well at iterated prediction, but are trained at much lower cost.

Having demonstrated that this particular type of exemplar selection is worthwhile, we compare ΔISB with three other exemplar selection methods which are easier to code and cost less to compute. We compare the total cost of training, as well as the size of the exemplar sets selected. One of the three contending methods was suggested by the ΔISB algorithm, and is also an active selection technique, as its calculation involves the network state. Among the four exemplar selection methods, we find that the two active selection methods provide the greatest computational savings and select the most concise training sets.

2 The Method

We are provided with a set of N \"candidate\" examples of the form (x_i, g(x_i)). Given g, we can denote this as x^N. Let f(., w) denote the network function parameterized by weights w. For a particular subset of the examples denoted x^n, let w_n = w_n(x^n) minimize

    \sum_{i=1}^{n} (g(x_i) - f(x_i, w))^2.

Let w^* be the \"best\" set of weights, which minimizes

    \int (g(x) - f(x, w))^2 \mu(dx),

where \mu is the distribution over the inputs.
Our objective is to select a subset x^n of x^N such that n < N, while minimizing \int (f(x, w_n) - f(x, w^*))^2 \mu(dx). Thus, we desire a subset representative of the whole set. We choose the x^n \subset x^N giving weights w_n that minimize the Integrated Squared Bias (ISB):

    ISB(x^n) = \int (f(x, w_n) - f(x, w^*))^2 \mu(dx).    (1)

We generate x^n incrementally. Given a candidate example x_{n+1}, let x^{n+1} = (x^n, x_{n+1}). Selecting x_1 optimally with respect to (1) is straightforward. Then, given x^n minimizing ISB(x^n), we opt to select x_{n+1} \in x^N maximizing ISB(x^n) - ISB(x^{n+1}). Note that using this property for x_{n+1} will not necessarily deliver the globally optimal solution. Nevertheless, this approach permits a computationally feasible and attractive method for sequential selection of training examples.

Choosing x_{n+1} to maximize this decrement directly is expensive. We use the following simple approximation (see Plutowski & White, 1991, for justification): given x^n, select x_{n+1} \in argmax_{x_{n+1}} ΔISB(x_{n+1} | x^n), where

    ΔISB(x_{n+1} | x^n) = Δw_{n+1}' \sum_{i=1}^{N} \nabla_w f(x_i, w_n) (g(x_i) - f(x_i, w_n)),

    Δw_{n+1} = H(x^n, w_n)^{-1} \nabla_w f(x_{n+1}, w_n) (g(x_{n+1}) - f(x_{n+1}, w_n)),

and

    H(x^n, w_n) = \sum_{i=1}^{n} \nabla_w f(x_i, w_n) \nabla_w f(x_i, w_n)'.

In practice we approximate H appropriately for the task at hand. Although we arrive at this criterion by making use of approximations valid for large n, this criterion has an appealing interpretation as picking the single example having individual error gradient most highly correlated with the average error gradient of the entire set of examples. Learning with this example is therefore likely to be especially informative. The ΔISB criterion thus possesses heuristic appeal in training sets of any size.

3 The Algorithm

Before presenting the algorithm we first explain certain implementation details. We integrated the ΔISB criterion with a straightforward method for regulating network complexity.
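Before turning to those details, the ΔISB selection step can be sketched in code. This is a minimal illustration rather than the paper's implementation: it uses the identity-matrix approximation for H (the choice adopted in the experiments below), a toy linear model stands in for the network, and all names are our own.

```python
import numpy as np

def delta_isb_scores(X, y, predict, grad, w):
    # Score each candidate x_i by the Delta-ISB decrement with H replaced
    # by the identity:  score(x) = dw(x) . sum_j grad_j * err_j,
    # where dw(x) = grad(x) * err(x) approximates the weight update.
    errs = y - np.array([predict(x, w) for x in X])
    grads = np.array([grad(x, w) for x in X])
    total = grads.T @ errs            # summed error gradient over all N candidates
    dws = grads * errs[:, None]       # per-candidate approximate weight change
    return dws @ total                # correlation of each update with the total

# Toy usage: a linear model f(x, w) = w . x, so the weight gradient is x itself.
predict = lambda x, w: w @ x
grad = lambda x, w: x
X = np.array([[0.1, 0.2], [0.9, 0.8], [0.5, 0.5]])
y = np.array([0.3, 1.7, 1.0])
w = np.zeros(2)
best = int(np.argmax(delta_isb_scores(X, y, predict, grad, w)))
```

The selected index is the candidate whose individual error gradient is most highly correlated with the summed error gradient over all candidates, matching the interpretation given above.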
We begin with a small network and an initial training set composed of a single exemplar. When a new exemplar is added, if training stalls, we randomize the network weights and restart training. After 5 stalls, we grow the network by adding another unit to each hidden layer.

Before we can select a new exemplar, we require that the network fit the current training set \"sufficiently well.\" Let e_n(x^m) measure the rmse (root mean squared error) network fit over m arbitrary examples x^m when trained on x^n. Let F_n denote the rmse fit we require over the current set of n exemplars before selecting a new one. Let F_N denote the rmse fit desired over all N examples. (Our goal is e_n(x^N) < F_N.) It typically suffices to set F_n = F_N, that is, to train to a fit over the exemplars which is at least as stringent as the fit desired over the entire set (normalized for the number of exemplars). However, active selection sometimes chooses a new exemplar \"too close\" to previously selected exemplars even when this is the case. This is easy to detect, and in this case we reject the new exemplar and continue with training.

We use an \"exemplar spacing\" parameter d to detect when a new exemplar is too close to a previous selection. Two examples x_i and x_j are \"close\" in this sense if they are within Euclidean distance d, and if additionally |g(x_i) - g(x_j)| < F_N. The additional condition allows the new exemplar to be accepted even when it is close to a previous selection in input space, provided it is sufficiently far away in the output space. In our experiments, the input and output space are of the same scale, so we set d = F_N. When a new selection is too close to a current exemplar, we reject the new selection, reduce F_n by 20%, and continue training, resetting F_n = F_N when a subsequent selection is appended to the current training set.
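The exemplar-spacing test just described can be sketched as follows (a minimal sketch with our own naming, assuming exemplars are stored as (input, target) pairs):

```python
import math

def too_close(x_new, g_new, exemplars, d, F_N):
    # x_new is rejected only if it is within Euclidean distance d of some
    # exemplar AND its target is within F_N of that exemplar's target;
    # being far enough away in output space alone is enough to accept it.
    for x, g in exemplars:
        if math.dist(x_new, x) <= d and abs(g_new - g) < F_N:
            return True
    return False
```

In the experiments d = F_N, since the input and output spaces share the same scale.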
We now outline the algorithm:

Initialize:

\u2022 Specify user-set parameters: initial network size, the desired fit F_N, the exemplar spacing parameter, and the maximum number of restarts.
\u2022 Select the first training set, x^1 = {x_1}. Set n = 1 and F_n = F_N. Train the network on x^1 until e_n(x^1) <= F_n.

While (e_n(x^N) > F_N) {
    Select a new exemplar, x_{n+1} \in x^N, maximizing ΔISB.
    If (x_{n+1} is \"too close\" to any x \in x^n) {
        Reject x_{n+1}.
        Reduce F_n by 20%. }
    Else {
        Append x_{n+1} to x^n.
        Increment n.
        Set F_n = F_N. }
    While (e_n(x^n) > F_n) {
        Train the network on the current training set x^n,
        restarting and growing as necessary. }}

4 The Problem

We generated the data from the Mackey-Glass equation (Mackey & Glass, 1977), with τ = 17, a = 0.2, and b = 0.1. We integrated the equation using fourth-order Runge-Kutta with step size 0.1, and the history initialized to 0.5. We generated two data sets. We iterated the equation for 100 time steps before beginning sampling; this marks t = 0. The next 1000 time steps comprise Data Set 1. We generated Data Set 2 from the 2000 examples following t = 5000.

We used the standard feed-forward network architecture with [0, 1] sigmoids and one or two hidden layers. Denoting the time series as x(t), the inputs were x(t), x(t-6), x(t-12), x(t-18), and the desired output is x(t+6) (Lapedes & Farber, 1987). We used conjugate gradient optimization for all of the training runs. The line search routine typically required 5 to 7 passes through the data set for each downhill step, and was restricted to use no more than 10.

Initially, the single hidden layer network has a single hidden unit, and the 2 hidden layer network has 2 units per hidden layer. A unit is added to each hidden layer when growing either architecture. All methods use the same growing procedure.
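The data generation described above can be sketched as below. This is an assumption-laden sketch rather than the authors' code: in particular, it holds the delayed term x(t - tau) fixed within each Runge-Kutta step, a common simplification that the text does not specify.

```python
def mackey_glass(n_steps, tau=17.0, a=0.2, b=0.1, h=0.1, x0=0.5):
    # Integrate dx/dt = a*x(t - tau) / (1 + x(t - tau)**10) - b*x(t)
    # with fourth-order Runge-Kutta, step size h, and constant history x0.
    delay = int(round(tau / h))           # 170 lagged samples at h = 0.1
    hist = [x0] * (delay + 1)             # buffer holding x(t - tau) .. x(t)
    out = []
    for _ in range(n_steps):
        x, xd = hist[-1], hist[0]         # current value and delayed value
        f = lambda y: a * xd / (1.0 + xd ** 10) - b * y
        k1 = f(x)
        k2 = f(x + 0.5 * h * k1)
        k3 = f(x + 0.5 * h * k2)
        k4 = f(x + h * k3)
        x_next = x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        hist.pop(0)
        hist.append(x_next)
        out.append(x_next)
    return out

# Generate a run of integration steps; an initial settling segment would be
# discarded before sampling, as in the text.
series = mackey_glass(2000)
```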
Other exemplar selection techniques are thus implemented by modifying how the next training set is obtained at the beginning of the outer while loop. The method of using all the training examples uses only the inner while loop.

In preliminary experiments we evaluated the sensitivity of ΔISB to the calculation of H. We compared two ways of estimating H, in terms of the number of exemplars selected and the total cost of training. The first approach uses the diagonal terms of H (Plutowski & White, 1993). The second approach replaces H with the identity matrix. Evaluated over 10 separate runs, fitting 500 examples to an rmse of 0.01, ΔISB gave similar results for both approaches, in terms of total computation used and the number of exemplars selected. Here, we used the second approach.

5 The Comparisons

We performed a number of experiments, each comparing the ΔISB algorithm with competing training methods. The competing methods include the conventional method of using all the examples, henceforth referred to as \"the strawman,\" as well as three other data selection techniques. In each comparison we denote the cost as the total number of floating point multiplies (the number of adds and divides is always proportional to this count).

For each comparison we ran two sets of experiments. The first compares the total cost of the competing methods as the fit requirement is varied between 0.02, 0.015, and 0.01, using the first 500 examples from Data Set 1. The second compares the cost as the size of the \"candidate\" set (the set of available examples) is varied using the first 500, 625, 750, 875, and 1000 examples of Data Set 1, and a tolerance of 0.01. To ensure that each method is achieving a comparable fit over novel data, we evaluated each network over a test set.
The generalization tests also looked at the iterated prediction error (IPE) over the candidate set and test set (Lapedes & Farber, 1987). Here we start the network on the first example from the set, and feed the output back into the network to obtain predictions in multiples of 6 time steps. Finally, for each of these we compare the final network sizes. Each data point reported is an average of five runs. For brevity, we only report results from the two hidden layer networks.

6 Comparison With Using All the Examples

We first compare ΔISB with the conventional method of using all the available examples, which we will refer to as \"the strawman.\" For this test, we used the first 500 examples of Data Set 1. For the two hidden layer architecture, each method required 2 units per hidden layer for a fit of 0.02 and 0.015 rmse, and from 3 to 4 (typically 3) units per hidden layer for a fit of 0.01 rmse. While both methods did quite well on the generalization tests, ΔISB clearly did better. Whereas the strawman networks do slightly worse on the test set than on the candidate set, networks trained by ΔISB tended to give test set fits close to the desired (training) fit. This is partially due to the control flow of the algorithm, which often fits the candidate set better than necessary. However, we also observed that ΔISB networks exhibited a test set fit better than the candidate set fit 7 times over these 15 training runs. This never occurred over any of the strawman runs.

Overall, ΔISB networks performed at least as well as the strawman with respect to IPE. Figure 1a shows the second half of Data Set 1, which is novel to this network, plotted along with the iterated prediction of a ΔISB network trained to a fit of 0.01, giving an IPE of 0.081 rmse, the median IPE observed for this set of five runs.
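The iterated (closed-loop) prediction procedure described above amounts to the following sketch, where predict stands for the trained network and seed holds the four true starting values x(t-18), x(t-12), x(t-6), x(t); the names are ours.

```python
def iterated_prediction(predict, seed, n_steps):
    # Repeatedly predict x(t + 6) from the current input window and feed
    # the prediction back in, so the forecast advances 6 time steps per call.
    window = list(seed)
    preds = []
    for _ in range(n_steps):
        y = predict(window)          # network output: x(t + 6)
        preds.append(y)
        window = window[1:] + [y]    # slide the 6-step-spaced window forward
    return preds
```

The IPE is then the rmse between these fed-back predictions and the true series sampled at the same 6-step spacing.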
Figure 1b shows the iterated prediction over the first 500 time steps of Data Set 2, which is 4500 time steps later than the training set. The IPE is 0.086 rmse, only slightly worse than over the \"nearer\" test set. This fit required 22 exemplars. Generalization tests were excellent for both methods, although ΔISB was again better overall. ΔISB networks performed better on Data Set 2 than they did on the candidate set 9 times out of the 25 runs; this never occurred for the strawman. These effects demand closer study before using them to infer that data selection can introduce a beneficial bias. However, they do indicate that the ΔISB networks performed at least as well as the strawman, ensuring the validity of our cost comparisons.

Figure 1: Iterated prediction for a 2 hidden layer network trained to 0.01 rmse over the first 500 time steps of Data Set 1. The dotted line gives the network prediction; the solid line is the target time series. Figure 1a, on the left, is over the next (consecutive) 500 time steps of Data Set 1, with IPE = 0.081 rmse. Figure 1b, on the right, is over the first 500 steps of Data Set 2, with IPE = 0.086 rmse. This network was typical, being the median IPE of 5 runs.

Figure 2a shows the average total cost versus required fit F_N for each method. The strawman required 109, 115, and 4740 million multiplies for the respective tolerances, whereas ΔISB required 8, 28, and 219 million multiplies, respectively. The strawman is severely penalized by a tighter fit because growing the network to fit requires expensive restarts using all of the examples. Figure 2b shows the average total cost versus the candidate set sizes. One reason for the difference is that ΔISB tended to select smaller networks. For candidate sets of size 500, 625, 750 and 875, each method typically required 3 units per hidden layer, occasionally 4.
Given 1000 examples, the strawman selected networks larger than 3 hidden units per layer over twice as often as ΔISB. ΔISB also never required more than 4 hidden units per layer, while the strawman sometimes required 6. This suggests that the growing technique is more likely to fit the data with a smaller network when exemplar selection is used.

Figure 2: Cost (in millions of multiplies) of training ΔISB, compared to the strawman. Figure 2a on the left gives total cost versus the desired fit, and Figure 2b on the right gives total cost versus the number of candidate examples. Each point is the average of 5 runs; the error bars are equal in width to twice the standard deviation.

7 Contending Data Selection Techniques

The results above clearly demonstrate that exemplar selection can cut the cost of training dramatically. In what follows we compare ΔISB with three other exemplar selection techniques. Each of these is easier to code and cheaper to compute, and they are considerably more challenging contenders than the strawman. In addition to comparing the overall training cost we will also evaluate their data compression ability by comparing the size of the exemplar sets each one selects. We proceed in the same manner as with ΔISB, sequentially growing the training set as necessary, until the candidate set fit is as desired.

Two of these contending techniques do not depend upon the state of the network, and are therefore not \"Active Selection\" methods. Random Selection selects an example randomly from the candidate set, without replacement, and appends it to the current exemplar set.
Uniform Grid exploits the time series representation of our data set to select training sets composed of exemplars evenly spaced at regular intervals in time. Note that Uniform Grid does not append a single exemplar to the training set; rather, it selects an entirely new set of exemplars each time the training set is grown. Note further that this technique relies heavily upon the time series representation. The problem of selecting exemplars uniformly spaced in the 4-dimensional input space would be much more difficult to compute.

The third method, \"Maximum Error,\" was suggested by the ΔISB algorithm, and is also an Active Selection technique, since it uses the network in selecting new exemplars. Note that the error between the network and the desired value is a component of the ΔISB criterion. ΔISB need not select an exemplar for which network error is maximum, due to the presence of terms involving the gradient of the network function. In comparison, the Maximum Error method selects an exemplar maximizing network error, ignoring gradient information entirely. It is cheaper to compute, typically requiring an order of magnitude fewer multiplies in overhead cost as compared to ΔISB. This comparison will test, for this particular learning task, whether the gradient information is worth its additional overhead.

7.1 Comparison with Random Selection

Random Selection fared the worst among the four contenders. However, it still performed better overall than the strawman method. This is probably because the cost due to growing is cheaper, since early on restarts are performed over small training sets. As the network fit improves, the likelihood of randomly selecting an informative exemplar decreases, and Random Selection typically reaches a point where it adds exemplars in rapid succession, often doubling the size of the exemplar set in order to attain a slightly better fit.
Random Selection also had a very high variance in cost and number of exemplars selected.

7.2 Comparison with Uniform Grid and Maximum Error

Uniform Grid and Maximum Error are comparable with ΔISB in cost as well as in the size of the selected exemplar sets. Overall, ΔISB and Maximum Error performed about the same, with Uniform Grid finishing respectably in third place. Maximum Error was comparable to ΔISB in generalization also, doing better on the test set than on the candidate set 10 times out of 40, whereas ΔISB did so a total of 16 times. This occurred only 3 times out of 40 for Uniform Grid.

Figure 3a shows that Uniform Grid requires more exemplars at all three tolerances, whereas ΔISB and Maximum Error select about the same number. Figure 3b shows that Uniform Grid typically requires about twice as many exemplars as the other two. Maximum Error and ΔISB selected about the same number of exemplars, typically selecting about 25 exemplars, plus or minus two.

Figure 3: Number of examples selected by three contending selection techniques: Uniform Grid, ΔISB (diamonds), and Max Error (triangles). Figure 3a on the left gives the number of examples selected versus the desired fit, and Figure 3b on the right is versus the number of candidate examples. The two Active Selection techniques selected about 25 exemplars, \u00b12. Each point is the average of 5 runs; the error bars are equal in width to twice the standard deviation. The data points for ΔISB and Max Error are shifted slightly in the graph to make them easier to distinguish.
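As a sketch, the Maximum Error rule compared in this section reduces to a few lines (our own naming; predict stands for the current network):

```python
def max_error_select(candidates, targets, predict, chosen):
    # Return the index of the not-yet-chosen candidate with the largest
    # absolute network error |g(x) - f(x, w)|; gradient information is
    # ignored entirely, unlike Delta-ISB.
    best, best_err = None, -1.0
    for i, (x, g) in enumerate(zip(candidates, targets)):
        if i not in chosen:
            err = abs(g - predict(x))
            if err > best_err:
                best, best_err = i, err
    return best
```

Its only per-candidate cost is a forward pass, which is where the order-of-magnitude overhead savings over ΔISB comes from.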
\n\n8 Conclusions \nThese results clearly demonstrate that exemplar selection can dramatically lower \nthe cost of training. This particular learning task also showed that Active Selection \nmethods are better overall than two contending exemplar selection techniques. \n~I S B and Maximum Error consistently selected concise sets of exemplars, reducing \nthe total cost of training despite the overhead associated with exemplar selection. \nThis particular learning task did not provide a clear distinction between the two \nActive Selection techniques. Maximum Error is more attractive on problems of this \nscope even though we have not justified it analytically, as it performs about as well \nas ~ISB but is easier to code and cheaper to compute. \n\nAcknowledgements \nThis work was supported by NSF grant IRI 92-03532. \n\nReferences \nLapedes, Alan, and Robert Farber. 1987. \"Nonlinear signal processing using neural net(cid:173)\nworks. Prediction and system modelling.\" Los Alamos technical report LA-UR-87-2662. \n\nMackey, M.C., and L. Glass. 1977. \"Oscillation and chaos in physiological control sys(cid:173)\ntems.\" Science 197, 287. \nPlutowski, Mark E., and Halbert White. 1991. \"Active selection of training examples for \nnetwork learning in noiseless environments.\" Technical Report No. CS91-180, CSE Dept., \nUCSD, La Jolla, California. \nPlutowski, Mark E., and Halbert White. 1993. \"Selecting concise training sets from clean \ndata.\" To appear, IEEE Transactions on Neural Networks. 3, 1. \n\n\f", "award": [], "sourceid": 784, "authors": [{"given_name": "Mark", "family_name": "Plutowski", "institution": null}, {"given_name": "Garrison", "family_name": "Cottrell", "institution": null}, {"given_name": "Halbert", "family_name": "White", "institution": null}]}