{"title": "Balancing Between Bagging and Bumping", "book": "Advances in Neural Information Processing Systems", "page_first": 466, "page_last": 472, "abstract": "", "full_text": "Balancing between bagging and bumping \n\nTom Heskes \n\nRWCP Novel  Functions SNN  Laboratory;  University of Nijmegen \n\nGeert  Grooteplein 21 , 6525  EZ  Nijmegen, The Netherlands \n\ntom@mbfys.kun.nl \n\nAbstract \n\nWe  compare  different  methods  to  combine  predictions  from  neu(cid:173)\nral networks  trained on different  bootstrap samples of a  regression \nproblem.  One  of these  methods,  introduced  in  [6]  and  which  we \nhere  call  balancing,  is  based  on  the  analysis  of the  ensemble gen(cid:173)\neralization error into an ambiguity term  and  a  term incorporating \ngeneralization  performances of individual networks.  We  show  how \nto  estimate  these  individual  errors  from  the  residuals  on  valida(cid:173)\ntion  patterns.  Weighting  factors  for  the  different  networks  follow \nfrom  a  quadratic  programming problem.  On a  real-world problem \nconcerning  the  prediction  of sales  figures  and  on  the  well-known \nBoston  housing  data set,  balancing clearly  outperforms  other  re(cid:173)\ncently  proposed  alternatives as  bagging  [1]  and bumping [8]. \n\n1  EARLY STOPPING AND  BOOTSTRAPPING \n\nStopped  training is  a  popular  strategy  to  prevent  overfitting  in  neural  networks. \nThe  complete  data set  is  split  up  into  a  training  and  a  validation  set.  Through \nlearning  the  weights  are  adapted  in  order  to  minimize  the  error  on  the  training \ndata.  Training is  stopped  when  the  error  on  the  validation data starts  increasing. \nThe final  network depends on the accidental subdivision in training and validation \nset ,  and often  also on  the,  usually  random,  initial weight  configuration  and chosen \nminimization procedure.  In  other  words , early stopped  neural  networks  are  highly \nunstable:  small changes in the data or different  initial conditions can produce large \nchanges in the estimate.  As  argued in [1 , 8],  with unstable estimators it is  advisable \nto  resample,  i.e.,  to  apply  the  same  procedure  several  times  using  different  sub(cid:173)\ndivisions  in  training  and  validation set  and  perhaps  starting from  different  initial \n\nRWCP:  Real  World Computing  Partnership;  SNN:  Foundation for  Neural Networks. \n\n\fBalancing Between Bagging and Bumping \n\n467 \n\nconfigurations.  In  the  neural  network  literature  resampling  is  often  referred  to  as \ntraining ensembles of neural networks  [3,  6].  In  this paper,  we  will discuss  methods \nfor combining the outputs of networks obtained through such a repetitive procedure. \n\nFirst,  however,  we  have to choose  how  to generate  the subdivisions in  training and \nvalidation sets.  Options are, among others, k-fold cross-validation, subsampling and \nbootstrapping.  In  this  paper  we  will  consider  bootstrapping [2]  which  is  based  on \nthe idea that  the available data set  is  nothing  but  a  particular  realization  of some \nprobability distribution.  In principle,  one  would  like  to do inference  on  this  \"true\" \nyet unknown probability distribution.  A natural thing to do is then to define an em(cid:173)\npirical distribution.  
First, however, we have to choose how to generate the subdivisions into training and validation sets. Options are, among others, k-fold cross-validation, subsampling and bootstrapping. In this paper we consider bootstrapping [2], which is based on the idea that the available data set is nothing but a particular realization of some probability distribution. In principle, one would like to do inference on this \"true\" yet unknown probability distribution. A natural thing to do is then to define an empirical distribution. With so-called naive bootstrapping the empirical distribution is a sum of delta peaks on the available data points, each with probability content $1/p_{\rm data}$, where $p_{\rm data}$ is the number of patterns. A bootstrap sample is a collection of $p_{\rm data}$ patterns drawn with replacement from this empirical probability distribution. Some of the data points will occur once in a bootstrap sample, some twice and some even more often. The bootstrap sample is taken to be the training set; all patterns that do not occur in a particular bootstrap sample constitute the validation set. For large $p_{\rm data}$, the probability that a pattern becomes part of the validation set is $(1 - 1/p_{\rm data})^{p_{\rm data}} \approx 1/e \approx 0.368$. An advantage of bootstrapping over other resampling techniques is that most statistical theory on resampling is nowadays based on the bootstrap.\n\n
Using naive bootstrapping we generate $n_{\rm run}$ training and validation sets out of our complete data set of $p_{\rm data}$ input-output combinations $\{x^\mu, t^\mu\}$. In this paper we restrict ourselves to regression problems with, for notational convenience, just one output variable. We keep track of a matrix with components $q_i^\mu$ indicating whether pattern $\mu$ is part of the validation set for run $i$ ($q_i^\mu = 1$) or of the training set ($q_i^\mu = 0$). On each subdivision we train and stop a neural network with one layer of $n_{\rm hidden}$ hidden units. The output $o_i^\mu$ of network $i$ with weight vector $w(i)$ on input $x^\mu$ reads\n\n
$$o_i^\mu = \sum_{j=1}^{n_{\rm hidden}} w_j(i) \tanh\Bigl(\sum_k w_{jk}(i)\, x_k^\mu\Bigr) + w_0(i)\,,$$\n\n
where we use the definition $x_0^\mu \equiv 1$. The validation error for run $i$ can be written\n\n
$$E_{\rm validation}(i) \equiv \frac{1}{p_i} \sum_{\mu=1}^{p_{\rm data}} q_i^\mu r_i^\mu\,,$$\n\n
with $p_i \equiv \sum_\mu q_i^\mu \approx 0.368\, p_{\rm data}$ the number of validation patterns in run $i$, and $r_i^\mu \equiv (o_i^\mu - t^\mu)^2/2$ the error of network $i$ on pattern $\mu$.\n\n
After training we are left with $n_{\rm run}$ networks with, in practice, quite different performances on the complete data set. How should we combine all these outputs to get the best possible performance on new data?\n\n
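The bookkeeping of these subdivisions is easily implemented. The following sketch (ours, with hypothetical names such as make_subdivisions) draws naive bootstrap samples and records the indicator matrix $q_i^\mu$:\n\n
import numpy as np\n
\n
def make_subdivisions(p_data, n_run, rng=None):\n
    # Draw n_run naive bootstrap samples of size p_data and record\n
    # q[i, mu] = 1 if pattern mu is in the validation set of run i.\n
    rng = rng or np.random.default_rng(0)\n
    q = np.ones((n_run, p_data), dtype=int)\n
    samples = []\n
    for i in range(n_run):\n
        sample = rng.integers(0, p_data, size=p_data)  # draw with replacement\n
        q[i, sample] = 0            # drawn patterns form the training set\n
        samples.append(sample)\n
    return q, samples\n
\n
q, samples = make_subdivisions(p_data=500, n_run=100)\n
print(q.mean())   # fraction of validation patterns, close to 1/e = 0.368\n\n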
2 COMBINING ESTIMATORS\n\n
Several methods have been proposed to combine estimators (see e.g. [5] for a review). In this paper we only consider estimators with the same architecture but trained and stopped on different subdivisions of the data into training and validation sets. Recently, two such methods have been suggested for bootstrapped estimators: bagging [1], an acronym for bootstrap aggregating, and bumping [8], meaning bootstrap umbrella of model parameters. With bagging, the prediction on a newly arriving input vector is the average over all network predictions. Bagging completely disregards the performance of the individual networks on the data used for training and stopping. Bumping, on the other hand, throws away all networks except the one with the lowest error on the complete data set. (The idea behind bumping is more general and involved than discussed here; the interested reader is referred to [8]. In this paper we only consider its naive version.) In the following we describe an intermediate form due to [6], which we here call balancing. A theoretical analysis of the implications of this idea can be found in [7].\n\n
Suppose that after training we receive a new set of $p_{\rm test}$ test patterns for which we do not know the true targets $t^\nu$, but for which we can calculate the output $o_i^\nu$ of each network $i$. We give each network a weighting factor $\alpha_i$ and define the prediction of all networks on pattern $\nu$ as the weighted average\n\n
$$m^\nu = \sum_{i=1}^{n_{\rm run}} \alpha_i o_i^\nu\,.$$\n\n
The goal is to find the weighting factors $\alpha_i$, subject to the constraints\n\n
$$\sum_{i=1}^{n_{\rm run}} \alpha_i = 1 \quad \mbox{and} \quad \alpha_i \geq 0 \ \ \forall i\,, \eqno(1)$$\n\n
yielding the smallest possible generalization error\n\n
$$E_{\rm test} = \frac{1}{p_{\rm test}} \sum_{\nu=1}^{p_{\rm test}} (m^\nu - t^\nu)^2\,.$$\n\n
The problem, of course, is our ignorance about the targets $t^\nu$. Bagging simply takes $\alpha_i = 1/n_{\rm run}$ for all networks, whereas bumping implies $\alpha_i = \delta_{i\kappa}$ with\n\n
$$\kappa = \mathop{\rm argmin}_j\, \frac{1}{p_{\rm data}} \sum_{\mu=1}^{p_{\rm data}} (o_j^\mu - t^\mu)^2\,.$$\n\n
As in [6, 7] we write the generalization error in the form\n\n
$$E_{\rm test} = \frac{1}{p_{\rm test}} \sum_\nu \sum_{i,j} \alpha_i \alpha_j (o_i^\nu - t^\nu)(o_j^\nu - t^\nu) = \frac{1}{2 p_{\rm test}} \sum_\nu \sum_{i,j} \alpha_i \alpha_j \left[ (o_i^\nu - t^\nu)^2 + (o_j^\nu - t^\nu)^2 - (o_i^\nu - o_j^\nu)^2 \right] = \sum_{i,j} \alpha_i \alpha_j \Bigl[ E_{\rm test}(i) + E_{\rm test}(j) - \frac{1}{2 p_{\rm test}} \sum_\nu (o_i^\nu - o_j^\nu)^2 \Bigr]\,, \eqno(2)$$\n\n
with $E_{\rm test}(i) \equiv \frac{1}{2 p_{\rm test}} \sum_\nu (o_i^\nu - t^\nu)^2$ the generalization error of network $i$ alone. The last term in (2) depends only on the network outputs and can thus be calculated. This \"ambiguity\" term favors networks with conflicting outputs. The first part, containing the generalization errors $E_{\rm test}(i)$ of the individual networks, depends on the targets $t^\nu$ and is thus unknown. It favors networks that by themselves already have a low generalization error. In the next section we find reasonable estimates of these generalization errors based on the network performances on validation data. Once we have obtained these estimates, finding the optimal weighting factors $\alpha_i$ under the constraints (1) is a straightforward quadratic programming problem.\n\n
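Solving this quadratic program is routine. A sketch (ours, assuming scipy is available, and given estimates e_hat of the individual generalization errors as obtained in the next section) could look like this:\n\n
import numpy as np\n
from scipy.optimize import minimize\n
\n
def balance_weights(outputs, e_hat):\n
    # Minimize sum_ij a_i a_j [E(i) + E(j) - ambiguity_ij], cf. eq. (2),\n
    # over the simplex of eq. (1).  outputs: (n_run, p_test) array of\n
    # network outputs, e_hat: (n_run,) estimated generalization errors.\n
    n_run, p_test = outputs.shape\n
    diff = outputs[:, None, :] - outputs[None, :, :]\n
    ambiguity = (diff ** 2).sum(axis=2) / (2.0 * p_test)\n
    M = e_hat[:, None] + e_hat[None, :] - ambiguity      # quadratic form\n
    res = minimize(lambda a: a @ M @ a,\n
                   np.full(n_run, 1.0 / n_run),          # start from bagging\n
                   bounds=[(0.0, 1.0)] * n_run,\n
                   constraints=({'type': 'eq', 'fun': lambda a: a.sum() - 1.0},),\n
                   method='SLSQP')\n
    return res.x\n\n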
3 ESTIMATING THE GENERALIZATION ERROR\n\n
At first sight, a good estimate of the generalization error of network $i$ could be its performance on the validation data not included during training. However, the validation error $E_{\rm validation}(i)$ strongly depends on the accidental subdivision into training and validation set. For example, if a few outliers are, by pure coincidence, part of the validation set, the validation error will be relatively large and the training error relatively small. To correct for this bias resulting from the random subdivision, we introduce the \"expected\" validation error for run $i$. First we define $n^\mu$ as the number of runs in which pattern $\mu$ is part of the validation set and $\bar{E}^\mu_{\rm validation}$ as the error averaged over these runs:\n\n
$$n^\mu \equiv \sum_{i=1}^{n_{\rm run}} q_i^\mu \quad \mbox{and} \quad \bar{E}^\mu_{\rm validation} \equiv \frac{1}{n^\mu} \sum_{i=1}^{n_{\rm run}} q_i^\mu r_i^\mu\,.$$\n\n
The expected validation error then follows from\n\n
$$\hat{E}_{\rm validation}(i) \equiv \frac{1}{p_i} \sum_{\mu=1}^{p_{\rm data}} q_i^\mu \bar{E}^\mu_{\rm validation}\,.$$\n\n
The ratio between the observed and the expected validation error indicates whether the validation error for network $i$ is relatively high or low. Our estimate of the generalization error of network $i$ is this ratio multiplied by an overall scaling factor, the estimated average generalization error:\n\n
$$\hat{E}_{\rm test}(i) = \frac{E_{\rm validation}(i)}{\hat{E}_{\rm validation}(i)}\, \frac{1}{p_{\rm data}} \sum_{\mu=1}^{p_{\rm data}} \bar{E}^\mu_{\rm validation}\,.$$\n\n
Note that we implicitly assume that the bias introduced by stopping at the minimal error on the validation patterns is negligible, i.e., that the validation patterns used for stopping a network can be considered as new to this network as the completely independent test patterns.\n\n
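In code, this estimator amounts to a few array reductions. A sketch (ours, with hypothetical names, assuming every pattern occurs in at least one validation set):\n\n
import numpy as np\n
\n
def estimate_test_errors(r, q):\n
    # r[i, mu] = (o_i^mu - t^mu)**2 / 2, q[i, mu] as before; assumes\n
    # n_mu > 0 for every pattern mu.\n
    p_i = q.sum(axis=1)                    # validation patterns per run\n
    e_val = (q * r).sum(axis=1) / p_i      # observed validation errors\n
    n_mu = q.sum(axis=0)                   # runs validating pattern mu\n
    e_bar = (q * r).sum(axis=0) / n_mu     # average error on pattern mu\n
    e_hat = (q * e_bar).sum(axis=1) / p_i  # expected validation errors\n
    return e_val / e_hat * e_bar.mean()    # ratio times overall scale\n\n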
4 SIMULATIONS\n\n
We compare the following methods for combining neural network outputs; a code sketch of the simplest weighting schemes follows the list.\n\n
Individual: the average individual generalization error, i.e., the generalization error we will get on average when we decide to perform only one run. It serves as a reference with which the other methods are compared.\n\n
Bumping: the generalization error of the network with the lowest error on the data available for training and stopping.\n\n
Bagging: the generalization error when we take the average of all $n_{\rm run}$ network outputs as our prediction.\n\n
Ambiguity: the generalization error when the weighting factors are chosen to maximize the ambiguity, i.e., taking identical estimates of the individual generalization errors of all networks in expression (2).\n\n
Balancing: the generalization error when the weighting factors are chosen to minimize our estimate of the generalization error.\n\n
Unfair bumping: the smallest generalization error of any individual network, i.e., the result of bumping if we had indeed chosen the network with the smallest generalization error.\n\n
Unfair balancing: the lowest possible generalization error that we could obtain if we had perfect estimates of the individual generalization errors.\n\n
The last two methods, unfair bumping and unfair balancing, only serve as a reference and can never be used in practice.\n\n
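The bagging and bumping weighting factors are straightforward; ambiguity and balancing reuse the quadratic program sketched in section 2 (again an illustration of ours, with hypothetical names):\n\n
import numpy as np\n
\n
def bagging_weights(n_run):\n
    # bagging: plain average over all network outputs\n
    return np.full(n_run, 1.0 / n_run)\n
\n
def bumping_weights(train_errors):\n
    # bumping (naive version): all weight on the network with the lowest\n
    # error on the data available for training and stopping\n
    alpha = np.zeros(len(train_errors))\n
    alpha[np.argmin(train_errors)] = 1.0\n
    return alpha\n
\n
# ambiguity: balance_weights(outputs, np.ones(n_run)), i.e. identical error\n
# estimates; balancing: balance_weights(outputs, estimate_test_errors(r, q)).\n\n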
We applied these methods to a real-world problem concerning the prediction of sales figures for several department stores in the Netherlands. For each store, 100 networks with 4 hidden units were trained and stopped on bootstrap samples of about 500 patterns. The test set, on which the performances of the various combination methods were measured, consists of about 100 patterns. Inputs include weather conditions, day of the week, previous sales figures, and season. The results are summarized in Table 1, where we give the decrease in the generalization error relative to the average individual generalization error.\n\n
            bumping  bagging  ambiguity  balancing  unfair bumping  unfair balancing\n
store 1        4%       9%       10%        17%            17%               24%\n
store 2        5%      15%       22%        23%            23%               34%\n
store 3       -7%      11%       18%        25%            25%               36%\n
store 4        6%      11%       17%        26%            26%               31%\n
store 5        6%      10%       22%        19%            22%               26%\n
store 6        1%       8%       14%        19%            16%               26%\n
mean           3%      11%       17%        22%            22%               30%\n\n
Table 1: Decrease in generalization error relative to the average individual generalization error as a result of several methods for combining neural networks trained to predict the sales figures for several stores.\n\n
As can be seen in Table 1, bumping hardly improves the performance. The reason is that the error on the data used for training and stopping is a poor predictor of the generalization error, since some amount of overfitting is inevitable. The generalization performance obtained through bagging, i.e., first averaging over all outputs, can be proven to be always better than the average individual generalization error. On these data bagging is definitely better than bumping, but also worse than maximizing the ambiguity. In all cases, except for store 5 where maximization of the ambiguity is slightly better, balancing is a clear winner among the \"fair\" methods. The last column in Table 1 shows how much we could still gain with more accurate estimates of the generalization errors of the individual networks.\n\n
The method of balancing discards most of the networks, i.e., the solution of the quadratic programming problem (2) under constraints (1) yields just a few weighting factors different from zero (on average about 8 for this set of simulations). Balancing is thus indeed a compromise between bagging, taking all networks into account, and bumping, keeping just one network.\n\n
We also compared these methods on the well-known Boston housing data set, concerning the median housing price in several tracts based on 13 mainly socio-economic predictor variables (see e.g. [1] for more information). We left out 50 of the 506 available cases for assessment of the generalization performance. All other 456 cases were used for training and stopping neural networks with 4 hidden units. The average individual mean squared error over all 300 bootstrap runs is 16.2, which is comparable to the mean squared error reported in [1]. To study how the performance depends on the number of bootstrap replicates, we randomly drew sets of $n = 5, 10, 20, 40$ and $80$ bootstrap replicates out of our ensemble of 300 replicates and applied the combination methods to these sets. For each $n$ we did this 48 times. Figure 1 shows the mean decrease in the generalization error relative to the average individual generalization error, and its standard deviation.\n\n
Figure 1: Decrease of generalization error relative to the average individual generalization error as a function of the number of bootstrap replicates for different combination methods: bagging (dashdot, star), ambiguity (dotted, star), bumping (dashed, star), balancing (solid, star), unfair bumping (dashed, circle), unfair balancing (solid, circle). Shown are the mean (left) and the standard deviation (right) of the decrease in percentages. Networks are trained and tested on the Boston housing database.\n\n
Again, balancing comes out best, especially for a larger number of bootstrap replicates. It seems that beyond, say, 20 replicates both bumping and bagging are hardly helped by more runs, whereas both maximization of the ambiguity and balancing still improve their performance. Bagging, fully taking into account all network predictions, yields the smallest variation; bumping, keeping just one of them, by far the largest. Balancing and maximization of the ambiguity combine several predictions and thus yield a variation that is somewhere in between.\n\n
5 CONCLUSION AND DISCUSSION\n\n
Balancing, a compromise between bagging and bumping, is an attempt to arrive at better performances on regression problems. The crux in all of this is to obtain reasonable estimates of the quality of the different networks and to incorporate these estimates in the calculation of the proper weighting factors (see [5, 9] for similar ideas and related work in the context of stacked generalization).\n\n
Obtaining several estimators is computationally expensive. However, the notorious instability of feedforward neural networks hardly leaves us a choice. Furthermore, an ensemble of bootstrapped neural networks can also be used to deduce (approximate) confidence and prediction intervals (see e.g. [4]), to estimate the relevance of input fields, and so on. It has also been argued that the combination of several estimators destroys the structure that may be present in a single estimator [8]. Having hardly any interpretable structure, neural networks do not seem to have a lot they can lose. It is a challenge to show that an ensemble of neural networks not only gives more accurate predictions, but also reveals more information than a single network.\n\n
References\n\n
[1] L. Breiman. Bagging predictors. Machine Learning, 24:123-140, 1996.\n\n
[2] B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, London, 1993.\n\n
[3] L. Hansen and P. Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12:993-1001, 1990.\n\n
[4] T. Heskes. Practical confidence and prediction intervals. These proceedings, 1997.\n\n
[5] R. Jacobs. Methods for combining experts' probability assessments. Neural Computation, 7:867-888, 1995.\n\n
[6] A. Krogh and J. Vedelsby. Neural network ensembles, cross validation, and active learning. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems 7, pages 231-238, Cambridge, 1995. MIT Press.\n\n
[7] P. Sollich and A. Krogh. Learning with ensembles: How over-fitting can be useful. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 190-196, San Mateo, 1996. Morgan Kaufmann.\n\n
[8] R. Tibshirani and K. Knight. Model search and inference by bootstrap \"bumping\". Technical report, University of Toronto, 1995.\n\n
[9] D. Wolpert and W. Macready. Combining stacking with bagging to improve a learning algorithm. Technical report, Santa Fe Institute, Santa Fe, 1996.\n
", "award": [], "sourceid": 1182, "authors": [{"given_name": "Tom", "family_name": "Heskes", "institution": null}]}