{"title": "Use of Bad Training Data for Better Predictions", "book": "Advances in Neural Information Processing Systems", "page_first": 343, "page_last": 350, "abstract": null, "full_text": "Use of Bad Training Data for Better Predictions

Tal Grossman
Complex Systems Group (T13) and CNLS
LANL, MS B213, Los Alamos, N.M. 87545

Alan Lapedes
Complex Systems Group (T13)
LANL, MS B213, Los Alamos, N.M. 87545
and The Santa Fe Institute, Santa Fe, New Mexico

Abstract

We show how randomly scrambling the output classes of various fractions of the training data may be used to improve the predictive accuracy of a classification algorithm. We present a method for calculating the "noise sensitivity signature" of a learning algorithm which is based on scrambling the output classes. This signature can be used to indicate a good match between the complexity of the classifier and the complexity of the data. Use of noise sensitivity signatures is distinctly different from other schemes to avoid overtraining, such as cross-validation, which uses only part of the training data, or various penalty functions, which are not data-adaptive. Noise sensitivity signature methods use all of the training data and are manifestly data-adaptive and non-parametric. They are well suited for situations with limited training data.

1 INTRODUCTION

A major problem of pattern recognition and classification algorithms that learn from a training set of examples is to select the complexity of the model to be trained. How is it possible to prevent an overparameterized algorithm from "memorizing" the training data? The dangers inherent in over-parameterization are typically illustrated by analogy to the simple numerical problem of fitting a curve to data points drawn from a simple function. If the fit is with a high degree polynomial then prediction on new points, i.e. 
generalization, can be quite bad even though the training set accuracy is quite good. The wild oscillations in the fitted function, needed to achieve high training set accuracy, cause poor predictions for new data. When using neural networks, this problem has two basic aspects. One is how to choose the optimal architecture (e.g. the number of layers and units in a feed-forward net); the other is to know when to stop training. Of course, these two aspects are related: training a large net to the highest training set accuracy usually causes overfitting. However, when training is stopped at the "correct" point (where train-set accuracy is lower), large nets generalize as well as, or even better than, small networks (as observed e.g. in Weigend 1994). This prompts serious consideration of methods to avoid overparameterization. Various methods to select network architecture or to decide when to stop training have been suggested. These include: (1) use of a penalty function (cf. Weigend et al. 1991); (2) use of cross-validation (Stone 1974); (3) minimum description length methods (Rissanen 1989); or (4) "pruning" methods (e.g. Le Cun et al. 1990).

Although all these methods are effective to various degrees, they all also suffer some form of non-optimality:

(1) Various forms of penalty function have been proposed, and results differ between them. Using a penalty function is generally preferable to not using one. However, it is not at all clear that there exists one "correct" penalty function, and hence any given penalty function is usually not optimal. (2) Cross-validation holds back part of the training data as a separate validation set. It therefore works best in the situation where use of smaller training sets, and use of relatively small validation sets, still allows close approximation to the optimal classifier. This is not likely to be the case in a significantly data-limited regime. 
(3) MDL methods may be viewed as a form of penalty function and are subject to the issues in point (1) above. (4) Pruning methods require training a large net, which can be time consuming, and then "de-tuning" the large network using penalty functions. The issues expressed in point (1) above apply.

We present a new method to avoid overfitting that uses "noisy" training data in which the output classes for a fraction of the examples are scrambled. We describe how to obtain the "noise sensitivity signature" of a classifier (with its learning algorithm), which is based on the scrambled data. This new methodology is not computationally cheap, but neither is it prohibitively expensive. It can provide an alternative to methods (1)-(4) above that (i) can test any complexity parameter of any classifying algorithm (i.e. the architecture, the stopping criterion, etc.), (ii) uses all the training data, and (iii) is data-adaptive, in contrast to fixed penalty/pruning functions.

2 A DETAILED DESCRIPTION OF THE METHOD

Define a "Learning Algorithm" L(S, P) as any procedure which produces a classifier f(x), a (discrete) function over a given input space X (x ∈ X). The input of the learning algorithm L is a training set S and a set of parameters P. The training set S is a set of M examples; each example is a pair of an input instance x_i and the desired output y_i associated with it (i = 1..M). We assume that the desired output represents an unknown "target function" f* which we try to approximate, i.e. y_i = f*(x_i). The set of parameters P includes all the relevant parameters of the specific learning algorithm and architecture used. When using a feed-forward neural network classifier this set usually includes the size of the network, its connectivity pattern, the distribution of the initial weights and the learning parameters (e.g. 
\nthe learning rate and momentum term size in usual back-propagation). Some of \nthese parameters determine the \"complexity\" of the classifiers produced by the \nlearning algorithm, or the set of functions f that are realizable by L. The number \nof hidden units in a two layer perceptron, for example, determines the number \nof free parameters of the model (the weights) that the learning algorithm will fit \nto tbe data (the training set). In general, the output of L can be any classifier: \na neural network, a decision tree, boolean formula etc. The classifier f can also \ndepend on some random choices, like the initial choice of weights in many network \nlenrning algortihm. It can also depend, like in pruning algorithms on any \"stopping \ncrite~'ion\" which may also influence its complexity. \n\n2.1 PRODUCING ff \n\nThe classification task is given as the training set S. The first step of our method \nis to prepare a set of noisy, or partially scrambled realizations of S. We define S: \nas one partiCUlar such realization, in which for fraction P of the M examples tne \ndesired ou.tpu.t values (classes) are changed. In this work we consider only binary \nclassification tasks, which means that we choose pM examples at random for which \n(f.L = l..n) is \nyf = 1 - Yi\u00b7 For each noise level p and set of n such realizations S; \n8-10 noise levels in the range p = 0.0 - 0.4, with n \"\" 4 - 10 realizations of S: for \nprepared, each with a different random choice of scrambled examples. Practically, \nof the different S: to produce the corresponding classifiers, which are the boolean \neach level are enough. The second step is to apply the learning algorithm to each \nfunctions ff = L(S;, P). \n\n2.2 NOISE SENSITIVITY MEASURES \nUsing the set of ff, three quantities are measured for each noise level p: \n\n\u2022 The average performance on the original (noise free) training set S. 
We define the average noise-free error as

E_f(p) = \frac{1}{Mn} \sum_{\mu=1}^{n} \sum_{i=1}^{M} |f_p^\mu(x_i) - y_i|   (1)

and the noise-free performance, or score, as Q_f(p) = 1 - E_f(p).

• In a similar way, we define the average error on the noisy training sets S_p^μ:

E_n(p) = \frac{1}{Mn} \sum_{\mu=1}^{n} \sum_{i=1}^{M} |f_p^\mu(x_i) - y_i^\mu|   (2)

Note that the error of each classifier f_p^μ is measured on the training set by which it was created. The noisy-set performance is then defined as Q_n(p) = 1 - E_n(p).

• The average functional distance between classifiers. The functional distance between two classifiers, or boolean functions, d(f, g), is the probability that f(z) ≠ g(z). For a uniform input distribution, it is simply the fraction of the input space X for which f(z) ≠ g(z). In order to approximate this quantity, we can use another set of examples. In contrast with validation set methods, these examples need not be classified, i.e. we only need a set of inputs z, without the target outputs y, so we can usually use an "artificial" set of m random inputs (although, in principle at least, these z instances should be taken from the same distribution as the original task examples). The approximated distance between two classifiers is therefore

d(f, g) = \frac{1}{m} \sum_{i=1}^{m} |f(z_i) - g(z_i)|   (3)

We then calculate the average distance, D(p), between the n classifiers f_p^μ obtained for each noise level p:

D(p) = \frac{2}{n(n-1)} \sum_{\mu > \nu} d(f_p^\mu, f_p^\nu)   (4)

3 NOISE SENSITIVITY BEHAVIOR

Observing the three quantities Q_f(p), Q_n(p) and D(p), can we distinguish between an overparametrized classifier and a "well tuned" one? Can we use this data in order to choose the best generalizer out of several candidates? Or to find the right point to stop the learning algorithm L in order to achieve better generalization? Let us estimate how the plots of Q_f, Q_n and D vs. 
p, which we call the "Noise Sensitivity Signature" (NSS) of the algorithm L, look in several different scenarios.

3.1 D(p)

The average functional distance between realizations, D(p), measures the sensitivity of the classifier (or the model) to noise. An over-parametrized architecture is expected to be very sensitive to noise, since it is capable of changing its classification boundary to learn the scrambled examples. Different realizations of the noisy training set will therefore result in different classifiers.

On the other hand, an under-parametrized classifier should be stable against at least a small amount of noise. Its classification boundary will not change when a few examples change their class. Note, however, that if the training set is not very "dense", an under-parametrized architecture can still yield different classifiers, even when trained on a noise free training set (e.g. when using BP with different initial weights). Therefore, it may be possible to observe some "background variance", i.e. non-zero average distance at small (down to zero) noise levels, for under-parametrized classifiers.

3.2 Q_f(p) AND Q_n(p)

Similar considerations apply for the two quantities Q_f(p) and Q_n(p). When the training set is large enough, an under-parametrized classifier cannot "follow" all the changed examples. Therefore most of them just add to the training error. Nevertheless, its performance on the noise free training set, Q_f(p), will not change much. As a result, when increasing the noise level p from zero (where Q_f(p) = Q_n(p)), we should find Q_f(p) > Q_n(p) up to a high noise level, where the decision boundary has changed enough so that the error on the original training set becomes larger than the error on the actual noisy set. The more parameters our model has, the sooner (i.e. at smaller p) it will switch to the Q_f(p) < Q_n(p) state. 
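The three quantities of Section 2.2 are straightforward to compute once the classifiers for a given noise level are trained. A minimal numpy sketch follows; the function names and array conventions are ours, not the paper's, and the classifiers are represented simply by their prediction arrays:

```python
import numpy as np

def scramble(y, p, rng):
    """Return a copy of binary (0/1) labels y with a random fraction p flipped."""
    y_noisy = y.copy()
    flip = rng.choice(len(y), size=int(p * len(y)), replace=False)
    y_noisy[flip] = 1 - y_noisy[flip]
    return y_noisy

def nss_quantities(preds, y, noisy_labels, zpreds):
    """Q_f, Q_n and D for one noise level p.

    preds        : list of n arrays, f_p^mu(x_i) on the original M inputs
    y            : the original (noise free) labels
    noisy_labels : list of n scrambled label arrays, one per classifier
    zpreds       : list of n arrays, f_p^mu(z_i) on m extra unlabeled inputs
    """
    # Average error vs. the clean labels (Eq. 1) and vs. each classifier's
    # own scrambled training labels (Eq. 2).
    E_f = np.mean([np.mean(np.abs(f - y)) for f in preds])
    E_n = np.mean([np.mean(np.abs(f - yn)) for f, yn in zip(preds, noisy_labels)])
    # Average pairwise functional distance over the n(n-1)/2 pairs (Eqs. 3-4).
    n = len(zpreds)
    D = np.mean([np.mean(zpreds[a] != zpreds[b])
                 for a in range(n) for b in range(a)])
    return 1.0 - E_f, 1.0 - E_n, D
```

Training the n classifiers themselves (the expensive step) is left to whatever learning algorithm L is under study.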
If a network starts with Q_f(p) = Q_n(p) and then exhibits Q_f(p) < Q_n(p) behavior, this is a signature of overparameterization.

3.3 THE TRAINING SET

In addition to the set of parameters P and the learning algorithm itself, there is another important factor in the learning process: the training set S. The dependence on M, the number of examples, is evident. When M is not large enough, the training set does not provide enough data to capture the full complexity of the original task. In other words, there are not enough constraints to approximate the target function f* well. Therefore overfitting will occur at smaller classifier complexity, and the optimal network will be smaller.

4 EXPERIMENTAL RESULTS

To demonstrate the possible outcomes of the method described above in several cases, we have performed the following experiment. A random neural network "teacher" was created as the target function f*. This is a two layer perceptron with 20 inputs, 5 hidden units and one output. A set of M random binary input examples was created and the teacher network was used to classify the training examples. Namely, a desired output y_i was obtained by recording the output of the teacher net when input x_i was presented to the network, and the output was calculated by applying the usual feed-forward dynamics:

x_j = \Theta\Big(\sum_i w_{ji} x_i\Big), \quad \Theta(h) = 1 if h > 0, and 0 otherwise   (5)

This binary threshold update rule is applied to each of the network's units j, i.e. the hidden and the output units. The weights of the teacher were chosen from a uniform distribution over [-1, 1]. No threshold (bias) weights were used.

The set of scrambled training sets S_p^μ was produced as explained above, and different network architectures were trained on it to produce the set of classifiers f_p^μ. The learning networks are standard two layer networks of sigmoid units, trained by conjugate gradient back-propagation, using a quadratic error function with tolerance, i.e. 
if the difference between an output of the net and the desired 0 or 1 target is smaller than the tolerance (taken as 0.2 in our experiment) it does not contribute to the error. The tolerance is, of course, another parameter which may influence the complexity of the resulting network; however, in this experiment it is fixed. The quantities Q_f(p), Q_n(p) and D(p) were calculated for networks with 1, 2, ..., 7 hidden units (1 hidden unit means just a perceptron, trained with the same error function). In our terminology, the architecture specification is part of the set of parameters P that is input to the learning algorithm L. The goal is to identify the "correct" architecture according to the behavior of Q_f, Q_n and D with p.

                         Training Set Size
hidden units       400           700           1024
     1         0.81 (0.04)  0.81 (0.001)  0.82 (0.001)
     2         0.81 (0.04)  0.84 (0.05)   0.86 (0.04)
     3         0.78 (0.02)  0.82 (0.06)   0.90 (0.03)
     4         0.77 (0.03)  0.81 (0.05)   0.90 (0.03)
     5         0.74 (0.03)  0.79 (0.03)   0.87 (0.04)
     6         0.74 (0.01)  0.80 (0.05)   0.89 (0.03)
     7         0.71 (0.01)  0.76 (0.02)   0.85 (0.05)

Table 1: The prediction rate for 1..7 hidden units, averaged over 4 nets that were trained on the noise-free training set of size M = 400, 700, 1024 (the standard deviation is given in parentheses).

The experiment was done with three training set sizes, M = 400, 700 and 1024. Another set of m = 1000 random examples was used to calculate D. As an "external control" this set was also classified by the teacher network and was used to measure the generalization (or prediction rate) of the different learning networks. The prediction rate for the networks trained on the noise free training set (averaged over 4 networks, trained with different random initial weights) is given for the 1 to 7 hidden unit architectures, for the 3 sizes of M, in Table 1. 
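The random teacher described above is easy to reproduce. A sketch of its construction and of the labeling step follows; the names are ours, and we assume 0/1 binary inputs, which the text does not specify:

```python
import numpy as np

rng = np.random.default_rng(0)
N_IN, N_HID, M = 20, 5, 400          # sizes used in the experiment

# Teacher net: two-layer perceptron, weights uniform in [-1, 1], no biases.
W1 = rng.uniform(-1.0, 1.0, size=(N_HID, N_IN))
w2 = rng.uniform(-1.0, 1.0, size=N_HID)

def teacher(X):
    """Binary threshold feed-forward pass, applied unit by unit (Eq. 5)."""
    h = (X @ W1.T > 0).astype(int)   # hidden-unit outputs in {0, 1}
    return (h @ w2 > 0).astype(int)  # single output unit

# Random binary inputs, labeled by the teacher to form the training set S.
X = rng.integers(0, 2, size=(M, N_IN))
y = teacher(X)
```

The scrambled sets S_p^μ are then produced from (X, y) by flipping pM of the labels, as in Section 2.1.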
The noise sensitivity signatures of three architectures trained with M = 400 (1, 2, 3 hidden units) and with M = 1024 examples (2, 4, 6 units) are shown in Figure 1. Compare these (representative) results with the expected behaviour of the NSS as described qualitatively in the previous section.

5 CONCLUSIONS and DISCUSSION

We have introduced a method of testing a learning model (with its learning algorithm) against a learning task given as a finite set of examples, by producing and characterizing its "noise sensitivity signature". Relying on the experimental results presented here, and on similar results obtained with other (less artificial) learning tasks and algorithms, we suggest some guidelines for using the NSS for model tuning:

1. If D(p) approaches zero as p → 0, or if Q_f(p) is significantly better than Q_n(p) for noise levels up to 0.3 or more, the network/model complexity can be safely increased.

2. If Q_f(p) < Q_n(p) already at small levels of noise (say 0.2 or less), reduce the network complexity.

3. In more delicate situations: a "good" model will have at least a trace of concavity in D(p). A clearly convex D(p) probably indicates an over-parametrized model. In a "good" model choice, Q_n(p) will follow Q_f(p) closely, from below, up to a high noise level.

Figure 1: The signatures (Q and D vs. p) of networks with 1, 2, 3 hidden units (top to bottom) trained on M = 400 examples (left), and networks with 2, 4, 6 hidden units trained on M = 1024 examples (right). The (noisy) training set score Q_n(p) is plotted with a full line, the noise free score Q_f(p) with a dotted line, and the average functional distance D(p) with error bars (representing the standard deviation of the distance).

5.1 Advantages of the Method

1. The method uses all the data for training. Therefore we can extract all the available information. Unlike validation set methods, there is no need to set aside part of the examples for testing (note that classified examples are not needed for the functional distance estimation). This may be an important advantage when the data is limited. As the experiment presented here shows, taking 300 examples out of the 1024 given may result in choosing a smaller network that will give inferior prediction (see Table 1). Using "delete-1 cross-validation" will minimize this problem, but will need at least as much computation as the NSS calculation in order to achieve a reliable prediction estimate.

2. It is an "external" method, i.e. independent of the classifier and the training algorithm. It can be used with neural nets, decision trees, boolean circuits, etc. It can evaluate different classifiers, algorithms or stopping/pruning criteria.

5.2 Disadvantages

1. Computationally expensive (but not prohibitively so). In principle one can use just a few noise levels to reduce computational cost.

2. 
Presently requires a subjective decision in order to identify the signature, unlike cross-validation methods, which produce a single number. In some situations, the noise sensitivity signature gives no clear distinction between similar architectures. In these cases, however, there is almost no difference in their generalization rates.

Acknowledgements

We thank David Wolpert, Michael Perrone and Jerome Friedman for many illuminating discussions and useful comments. We also thank Rob Farber for his invaluable help with software and for his assistance with the Connection Machine.

References

Le Cun Y., Denker J.S. and Solla S. (1990), in Advances in Neural Information Processing Systems 2, Touretzky D.S., ed. (Morgan Kaufmann, 1990) 598.

Rissanen J. (1989), Stochastic Complexity in Statistical Inquiry (World Scientific, 1989).

Stone M. (1974), J. Roy. Statist. Soc. Ser. B 36 (1974) 111.

Weigend A.S. (1994), in the Proceedings of the 1993 Connectionist Models Summer School, edited by M.C. Mozer, P. Smolensky, D.S. Touretzky, J.L. Elman and A.S. Weigend, pp. 335-342 (Erlbaum Associates, Hillsdale, NJ, 1994).

Weigend A.S., Rumelhart D. and Huberman B.A. (1991), in Advances in Neural Information Processing Systems 3, Lippmann et al., eds. (Morgan Kaufmann, 1991) 875.
", "award": [], "sourceid": 801, "authors": [{"given_name": "Tal", "family_name": "Grossman", "institution": null}, {"given_name": "Alan", "family_name": "Lapedes", "institution": null}]}