Part of Advances in Neural Information Processing Systems 9 (NIPS 1996)
Tom Heskes
We compare different methods to combine predictions from neural networks trained on different bootstrap samples of a regression problem. One of these methods, introduced in [6] and which we here call balancing, is based on the decomposition of the ensemble generalization error into an ambiguity term and a term incorporating the generalization performances of the individual networks. We show how to estimate these individual errors from the residuals on validation patterns. Weighting factors for the different networks follow from a quadratic programming problem. On a real-world problem concerning the prediction of sales figures and on the well-known Boston housing data set, balancing clearly outperforms other recently proposed alternatives such as bagging [1] and bumping [8].
1 EARLY STOPPING AND BOOTSTRAPPING
Stopped training is a popular strategy to prevent overfitting in neural networks. The complete data set is split up into a training and a validation set. During learning the weights are adapted to minimize the error on the training data. Training is stopped when the error on the validation data starts increasing. The final network depends on the accidental subdivision into training and validation set, and often also on the, usually random, initial weight configuration and the chosen minimization procedure. In other words, early stopped neural networks are highly unstable: small changes in the data or different initial conditions can produce large changes in the estimate. As argued in [1, 8], with unstable estimators it is advisable to resample, i.e., to apply the same procedure several times using different subdivisions into training and validation set and perhaps starting from different initial
configurations. In the neural network literature resampling is often referred to as training ensembles of neural networks [3, 6]. In this paper, we will discuss methods for combining the outputs of networks obtained through such a repetitive procedure.
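As a concrete illustration of stopped training on a single subdivision, the sketch below trains a one-hidden-layer regression network by plain gradient descent and keeps the weights that gave the lowest validation error so far, halting once the validation error has stopped improving. This is only an assumed, minimal implementation for exposition; all names, the tanh hidden units, and the hyperparameters are illustrative and not taken from the paper.

```python
# Minimal sketch of early stopping (illustrative assumptions, not the paper's code).
import numpy as np

def train_with_early_stopping(x_train, t_train, x_val, t_val,
                              n_hidden=10, lr=0.01, max_epochs=5000, patience=50):
    rng = np.random.default_rng(0)
    n_in = x_train.shape[1]
    # random initial weight configuration: hidden layer (W1, b1), linear output (w2, b2)
    W1 = rng.normal(0, 0.1, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
    w2 = rng.normal(0, 0.1, n_hidden);          b2 = 0.0

    def forward(x):
        h = np.tanh(x @ W1 + b1)          # hidden activations
        return h, h @ w2 + b2             # network output

    def mse(x, t):
        return np.mean((forward(x)[1] - t) ** 2)

    best = (mse(x_val, t_val), (W1.copy(), b1.copy(), w2.copy(), b2))
    stale = 0
    for epoch in range(max_epochs):
        h, o = forward(x_train)
        err = o - t_train                                   # residuals on training data
        n = len(err)
        # gradients of the halved mean squared training error
        grad_w2 = h.T @ err / n; grad_b2 = err.mean()
        dh = np.outer(err, w2) * (1 - h ** 2)
        grad_W1 = x_train.T @ dh / n; grad_b1 = dh.mean(axis=0)
        W1 -= lr * grad_W1; b1 -= lr * grad_b1
        w2 -= lr * grad_w2; b2 -= lr * grad_b2

        val_err = mse(x_val, t_val)                         # error on validation data
        if val_err < best[0]:
            best = (val_err, (W1.copy(), b1.copy(), w2.copy(), b2)); stale = 0
        else:
            stale += 1
            if stale >= patience:                           # validation error no longer improves
                break
    return best[1]                                          # weights at the validation minimum
```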
First, however, we have to choose how to generate the subdivisions into training and validation sets. Options are, among others, k-fold cross-validation, subsampling, and bootstrapping. In this paper we will consider bootstrapping [2], which is based on the idea that the available data set is nothing but a particular realization of some probability distribution. In principle, one would like to do inference on this "true" yet unknown probability distribution. A natural thing to do is then to define an empirical distribution. With so-called naive bootstrapping the empirical distribution is a sum of delta peaks on the available data points, each with probability content $1/p_{\rm data}$, with $p_{\rm data}$ the number of patterns. A bootstrap sample is a collection of $p_{\rm data}$ patterns drawn with replacement from this empirical probability distribution. Some of the data points will occur once, some twice, and some even more than twice in this bootstrap sample. The bootstrap sample is taken to be the training set; all patterns that do not occur in a particular bootstrap sample constitute the validation set. For large $p_{\rm data}$, the probability that a pattern becomes part of the validation set is $(1 - 1/p_{\rm data})^{p_{\rm data}} \approx 1/e \approx 0.368$. An advantage of bootstrapping over other resampling techniques is that most statistical theory on resampling is nowadays based on the bootstrap.
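As an illustration of naive bootstrapping (again a sketch under assumed names, not code from the paper), the subdivisions and the validation-indicator matrix $q_i^\mu$ used in the next paragraph can be generated as follows; for a reasonably large $p_{\rm data}$ the fraction of validation patterns indeed comes out close to $1/e \approx 0.368$.

```python
# Sketch of naive bootstrap subdivisions with out-of-sample validation sets
# (illustrative assumptions, not the paper's code).
import numpy as np

def naive_bootstrap_splits(p_data, n_run, seed=0):
    rng = np.random.default_rng(seed)
    splits = []
    q = np.zeros((n_run, p_data), dtype=int)   # q[i, mu] = 1 iff pattern mu is validation for run i
    for i in range(n_run):
        train_idx = rng.integers(0, p_data, size=p_data)   # p_data patterns drawn with replacement
        in_train = np.zeros(p_data, dtype=bool)
        in_train[train_idx] = True
        val_idx = np.flatnonzero(~in_train)                # patterns never drawn: validation set
        q[i, val_idx] = 1
        splits.append((train_idx, val_idx))
    return splits, q

splits, q = naive_bootstrap_splits(p_data=1000, n_run=25)
print(q.mean())   # fraction of validation patterns, roughly 0.368
```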
Using naive bootstrapping we generate $n_{\rm run}$ training and validation sets out of our complete data set of $p_{\rm data}$ input-output combinations $\{\vec{x}^\mu, t^\mu\}$. In this paper we will restrict ourselves to regression problems with, for notational convenience, just one output variable. We keep track of a matrix with components $q_i^\mu$ indicating whether pattern $\mu$ is part of the validation set for run $i$ ($q_i^\mu = 1$) or of the training set ($q_i^\mu = 0$). On each subdivision we train and stop a neural network with one layer of $n_{\rm hidden}$ hidden units. The output $o_i^\mu$ of network $i$ with weight vector $\vec{w}(i)$ on input $\vec{x}^\mu$ reads