{"title": "Learning with ensembles: How overfitting can be useful", "book": "Advances in Neural Information Processing Systems", "page_first": 190, "page_last": 196, "abstract": null, "full_text": "Learning with ensembles: How over-fitting can be useful \n\nPeter Sollich \nDepartment of Physics \nUniversity of Edinburgh, U.K. \nP.Sollich@ed.ac.uk \n\nAnders Krogh* \nNORDITA, Blegdamsvej 17 \n2100 Copenhagen, Denmark \nkrogh@sanger.ac.uk \n\nAbstract \n\nWe study the characteristics of learning with ensembles. Solving exactly the simple model of an ensemble of linear students, we find surprisingly rich behaviour. For learning in large ensembles, it is advantageous to use under-regularized students, which actually over-fit the training data. Globally optimal performance can be obtained by choosing the training set sizes of the students appropriately. For smaller ensembles, optimization of the ensemble weights can yield significant improvements in ensemble generalization performance, in particular if the individual students are subject to noise in the training process. Choosing students with a wide range of regularization parameters makes this improvement robust against changes in the unknown level of noise in the training data. \n\n1 INTRODUCTION \n\nAn ensemble is a collection of a (finite) number of neural networks or other types of predictors that are trained for the same task. A combination of many different predictors can often improve predictions, and in statistics this idea has been investigated extensively, see e.g. [1, 2, 3]. In the neural networks community, ensembles of neural networks have been investigated by several groups, see for instance [4, 5, 6, 7]. Usually the networks in the ensemble are trained independently and then their predictions are combined. 
\n\nIn this paper we study an ensemble of linear networks trained on different but overlapping training sets. The limit in which all the networks are trained on the full data set and the one where all the data sets are different has been treated in [8]. In this paper we treat the case of intermediate training set sizes and overlaps exactly, yielding novel insights into ensemble learning. Our analysis also allows us to study the effect of regularization and of having different predictors in an ensemble. \n\n*Present address: The Sanger Centre, Hinxton, Cambs CB10 1RQ, UK. \n\n\fLearning with Ensembles: How Overfitting Can Be Useful \n\n191 \n\n2 GENERAL FEATURES OF ENSEMBLE LEARNING \n\nWe consider the task of approximating a target function f0 from R^N to R. It will be assumed that we can only obtain noisy samples of the function, and the (now stochastic) target function will be denoted y(x). The inputs x are taken to be drawn from some distribution P(x). Assume now that an ensemble of K independent predictors f_k(x) of y(x) is available. A weighted ensemble average is denoted by a bar, like \n\nf̄(x) = Σ_k w_k f_k(x), (1) \n\nwhich is the final output of the ensemble. One can think of the weight w_k as the belief in predictor k, and we therefore constrain the weights to be positive and to sum to one. For an input x we define the error of the ensemble ε(x), the error of the kth predictor ε_k(x), and its ambiguity a_k(x): \n\nε(x) = (y(x) − f̄(x))², (2) \nε_k(x) = (y(x) − f_k(x))², (3) \na_k(x) = (f_k(x) − f̄(x))². (4) \n\nThe ensemble error can be written as ε(x) = ε̄(x) − ā(x) [7], where ε̄(x) = Σ_k w_k ε_k(x) is the average error over the individual predictors and ā(x) = Σ_k w_k a_k(x) is the average of their ambiguities, which is the variance of the output over the ensemble. 
By averaging over the input distribution P(x) (and implicitly over the target outputs y(x)), one obtains the ensemble generalization error \n\nε = ε̄ − ā, (5) \n\nwhere ε(x) averaged over P(x) is simply denoted ε, and similarly for ε̄ and ā. The first term on the right is the weighted average of the generalization errors of the individual predictors, and the second is the weighted average of the ambiguities, which we refer to as the ensemble ambiguity. An important feature of equation (5) is that it separates the generalization error into a term that depends on the generalization errors of the individual students and another term that contains all correlations between the students. The latter can be estimated entirely from unlabeled data, i.e., without any knowledge of the target function to be approximated. The relation (5) also shows that the more the predictors differ, the lower the error will be, provided the individual errors remain constant. \n\nIn this paper we assume that the predictors are trained on a sample of p examples of the target function, (x^μ, y^μ), where y^μ = f0(x^μ) + η^μ and η^μ is some additive noise (μ = 1, …, p). The predictors, to which we refer as students in this context because they learn the target function from the training examples, need not be trained on all the available data. In fact, since training on different data sets will generally increase the ambiguity, it is possible that training on subsets of the data will improve generalization. An additional advantage is that, by holding out for each student a different part of the total data set for the purpose of testing, one can use the whole data set for training the ensemble while still getting an unbiased estimate of the ensemble generalization error. Denoting this estimate by ε̂, one has \n\nε̂ = ε̄_test − ā, (6) \n\nwhere ε̄_test = Σ_k w_k ε_test,k is the average of the students' test errors. 
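The decomposition (5) is an exact identity for any predictors and any positive weights summing to one, and is easy to verify numerically. A minimal sketch (all data synthetic, variable names ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup: K predictors evaluated on n inputs (all values made up).
K, n = 5, 1000
y = rng.normal(size=n)                         # noisy target outputs y(x)
f = y + rng.normal(size=(K, n))                # predictor outputs f_k(x)
w = rng.random(K); w /= w.sum()                # positive weights summing to one

fbar = w @ f                                   # ensemble output, eq. (1)
eps = np.mean((y - fbar) ** 2)                 # ensemble error ε
eps_bar = w @ np.mean((y - f) ** 2, axis=1)    # weighted individual errors ε̄
amb = w @ np.mean((f - fbar) ** 2, axis=1)     # weighted ambiguities ā

print(np.isclose(eps, eps_bar - amb))          # ε = ε̄ − ā, eq. (5)
```

Note that `amb` is computed from the predictor outputs alone; no target values enter, which is exactly why the ambiguity term can be estimated from unlabeled data.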
As already pointed out, the estimate ā of the ensemble ambiguity can be found from unlabeled data. \n\n\f192 \n\nP. SOLLICH, A. KROGH \n\nSo far, we have not mentioned how to find the weights w_k. Often uniform weights are used, but optimization of the weights in some way is tempting. In [5, 6] the training set was used to perform the optimization, i.e., the weights were chosen to minimize the ensemble training error. This can easily lead to over-fitting, and in [7] it was suggested to minimize the estimated generalization error (6) instead. If this is done, the estimate (6) acquires a bias; intuitively, however, we expect this effect to be small for large ensembles. \n\n3 ENSEMBLES OF LINEAR STUDENTS \n\nIn preparation for our analysis of learning with ensembles of linear students, we now briefly review the case of a single linear student, sometimes referred to as 'linear perceptron learning'. A linear student implements the input-output mapping \n\nf(x) = (1/√N) wᵀx, \n\nparameterized in terms of an N-dimensional parameter vector w with real components; the scaling factor 1/√N is introduced here for convenience, and ᵀ denotes the transpose of a vector. The student parameter vector w should not be confused with the ensemble weights w_k. The most common method for training such a linear student (or parametric inference models in general) is minimization of the sum-of-squares training error \n\nE = Σ_μ (y^μ − f(x^μ))² + λw², \n\nwhere μ = 1, …, p numbers the training examples. To prevent the student from fitting noise in the training data, a weight decay term λw² has been added. The size of the weight decay parameter λ determines how strongly large parameter vectors are penalized; large λ corresponds to a stronger regularization of the student. For a linear student, the global minimum of E can easily be found. 
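Concretely, minimizing E is ridge regression, and its global minimum has the standard closed form. A sketch (function names ours), assuming the training inputs are stored as the rows of a p × N matrix:

```python
import numpy as np

def train_linear_student(X, y, lam):
    """Minimize E = Σ_μ (y^μ − f(x^μ))² + λw² for f(x) = wᵀx/√N.

    X: (p, N) training inputs, y: (p,) outputs, lam: weight decay λ > 0."""
    p, N = X.shape
    A = X / np.sqrt(N)                 # absorb the 1/√N scaling into the design
    # Normal equations of the regularized problem: (AᵀA + λI) w = Aᵀy
    return np.linalg.solve(A.T @ A + lam * np.eye(N), A.T @ y)

def student_output(w, X):
    """f(x) = wᵀx/√N for each row x of X."""
    return X @ w / np.sqrt(X.shape[1])
```

Since λ > 0 makes E strictly convex, the returned w is the unique global minimizer; any perturbation of it can only increase E.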
However, in practical applications using non-linear networks this is generally not true, and training can be thought of as a stochastic process yielding a different solution each time. We crudely model this by considering white noise added to gradient descent updates of the parameter vector w. This yields a limiting distribution of parameter vectors P(w) ∝ exp(−E/2T), where the 'temperature' T measures the amount of noise in the training process. \n\nWe focus our analysis on the 'thermodynamic limit' N → ∞ at constant normalized number of training examples, α = p/N. In this limit, quantities such as the training or generalization error become self-averaging, i.e., their averages over all training sets become identical to their typical values for a particular training set. Assume now that the training inputs x^μ are chosen randomly and independently from a Gaussian distribution P(x) ∝ exp(−x²/2), and that training outputs are generated by a linear target function corrupted by additive noise, i.e., y^μ = w0ᵀx^μ/√N + η^μ, where the η^μ are zero mean noise variables with variance σ². Fixing the length of the parameter vector of the target function to w0² = N for simplicity, the generalization error of a linear student with weight decay λ and learning noise T becomes [9] \n\nε = (σ² + T)G + λ(σ² − λ) ∂G/∂λ. (7) \n\nOn the r.h.s. of this equation we have dropped the term arising from the noise on the target function alone, which is simply σ², and we shall follow this convention throughout. The 'response function' G is [10, 11] \n\nG = G(α, λ) = (1 − α − λ + √((1 − α − λ)² + 4λ)) / (2λ). (8) \n\nFor zero training noise, T = 0, and for any α, the generalization error (7) is minimized when the weight decay is set to λ = σ²; its value is then σ²G(α, σ²), which is the minimum achievable generalization error [9]. 
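Equations (7) and (8) are straightforward to evaluate numerically; a sketch (function names ours, with ∂G/∂λ taken by a central finite difference):

```python
import numpy as np

def G(alpha, lam):
    """Response function G(α, λ) of eq. (8)."""
    u = 1.0 - alpha - lam
    return (u + np.sqrt(u * u + 4.0 * lam)) / (2.0 * lam)

def gen_error(alpha, lam, sigma2, T, h=1e-6):
    """Single-student generalization error, eq. (7) (σ² noise term dropped,
    following the text); ∂G/∂λ approximated by a central difference."""
    dG = (G(alpha, lam + h) - G(alpha, lam - h)) / (2.0 * h)
    return (sigma2 + T) * G(alpha, lam) + lam * (sigma2 - lam) * dG

# At T = 0 and λ = σ², the second term of (7) vanishes, so the error
# equals σ²·G(α, σ²), the stated minimum achievable value.
alpha, sigma2 = 1.0, 0.2
print(gen_error(alpha, sigma2, sigma2, 0.0), sigma2 * G(alpha, sigma2))
```

Comparing `gen_error(alpha, lam, sigma2, 0.0)` across λ confirms that λ = σ² beats an under-regularized choice such as λ = 0.05, consistent with the optimality statement above.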
\n\n3.1 ENSEMBLE GENERALIZATION ERROR \n\nWe now consider an ensemble of K linear students with weight decays λ_k and learning noises T_k (k = 1, …, K). Each student has an ensemble weight w_k and is trained on Nα_k training examples, with students k and l sharing Nα_kl training examples (of course, α_kk = α_k). As above, we consider noisy training data generated by a linear target function. The resulting ensemble generalization error can be calculated by diagrammatic [10] or response function [11] methods. We refer the reader to a forthcoming publication for details and only state the result: \n\nε = Σ_kl w_k w_l ε_kl, (9) \n\nwhere the error correlations ε_kl are given by \n\n(10) \n\nHere ρ_k is defined as ρ_k = λ_k G(α_k, λ_k). The Kronecker delta in the last term of (10) arises because the training noises of different students are uncorrelated. The generalization errors and ambiguities of the individual students are \n\nε_k = ε_kk, a_k = ε_kk − 2 Σ_l w_l ε_kl + Σ_lm w_l w_m ε_lm; \n\nthe result for the ε_k can be shown to agree with the single student result (7). In the following sections, we shall explore the consequences of the general result (9). We will concentrate on the case where the training set of each student is sampled randomly from the total available data set of size Nα. For the overlap of the training sets of students k and l (k ≠ l) one then has α_kl/α = (α_k/α)(α_l/α) and hence \n\nα_kl = α_k α_l / α (11) \n\nup to fluctuations which vanish in the thermodynamic limit. For finite ensembles one can construct training sets for which α_kl < α_k α_l / α. This is an advantage, because it results in a smaller generalization error, but for simplicity we use (11). \n\n4 LARGE ENSEMBLE LIMIT \n\nWe now use our main result (9) to analyse the generalization performance of an ensemble with a large number K of students, in particular when the sizes of the training sets for the individual students are chosen optimally. 
If the ensemble weights w_k are approximately uniform (w_k ≈ 1/K), the off-diagonal elements of the matrix (ε_kl) dominate the generalization error for large K, and the contributions from the training noises T_k are suppressed. For the special case where all students are identical and are trained on training sets of identical size, α_k = (1 − c)α, the ensemble generalization error is shown in Figure 1 (left). The minimum at a nonzero value of c, which is the fraction of the total data set held out for testing each student, can clearly be seen. This confirms our intuition: when the students are trained on smaller, less overlapping training sets, the increase in error of the individual students can be more than offset by the corresponding increase in ambiguity. \n\nThe optimal training set sizes α_k can be calculated analytically: \n\nc_k = 1 − α_k/α = (1 − λ_k/σ²) / (1 + G(α, σ²)). (12) \n\n\f194 \n\nP. SOLLICH, A. KROGH \n\nFigure 1: Generalization error and ambiguity for an infinite ensemble of identical students. Solid line: ensemble generalization error, ε; dotted line: average generalization error of the individual students, ε̄; dashed line: ensemble ambiguity, ā. For both plots α = 1 and σ² = 0.2. The left plot corresponds to under-regularized students with λ = 0.05 < σ². Here the generalization error of the ensemble has a minimum at a nonzero value of c. This minimum exists whenever λ < σ². The right plot shows the case of over-regularized students (λ = 0.3 > σ²), where the generalization error is minimal at c = 0. 
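Equation (12) gives the optimal held-out fraction directly; a sketch (function names ours) reproducing the two regimes discussed around Figure 1:

```python
import numpy as np

def G(alpha, lam):
    """Response function G(α, λ), eq. (8)."""
    u = 1.0 - alpha - lam
    return (u + np.sqrt(u * u + 4.0 * lam)) / (2.0 * lam)

def optimal_c(lam_k, alpha, sigma2):
    """Optimal test-set fraction c_k = 1 − α_k/α, eq. (12).

    A valid solution (c_k > 0) exists only for under-regularized
    students, λ_k < σ²; at λ_k = σ² the formula gives c_k = 0."""
    return (1.0 - lam_k / sigma2) / (1.0 + G(alpha, sigma2))

# Parameters of Figure 1: α = 1, σ² = 0.2.
print(optimal_c(0.05, 1.0, 0.2))   # under-regularized: hold out a nonzero fraction
print(optimal_c(0.2, 1.0, 0.2))    # λ = σ²: c = 0, train on everything
```

A negative return value for λ_k > σ² signals the over-regularized regime, where no data should be held out (Figure 1, right).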
\n\nThe resulting generalization error is ε = σ²G(α, σ²) + O(1/K), which is the globally minimal generalization error that can be obtained using all available training data, as explained in Section 3. Thus, a large ensemble with optimally chosen training set sizes can achieve globally optimal generalization performance. However, we see from (12) that a valid solution c_k > 0 exists only for λ_k < σ², i.e., if the ensemble is under-regularized. This is exemplified, again for an ensemble of identical students, in Figure 1 (right), which shows that for an over-regularized ensemble the generalization error is a monotonic function of c and thus minimal at c = 0. \n\nWe conclude this section by discussing how the adaptation of the training set sizes could be performed in practice, for simplicity confining ourselves to an ensemble of identical students, where only one parameter c = c_k = 1 − α_k/α has to be adapted. If the ensemble is under-regularized, one expects a minimum of the generalization error at some nonzero c, as in Figure 1. One could therefore start by training all students on a large fraction of the total data set (corresponding to c ≈ 0), and then gradually and randomly remove training examples from the students' training sets. Using (6), the generalization error of each student could be estimated from their performance on the examples on which they were not trained, and one would stop removing training examples when the estimate stops decreasing. The resulting estimate of the generalization error will be slightly biased; however, for a large enough ensemble the risk of a strongly biased estimate from systematically testing all students on too 'easy' training examples seems small, due to the random selection of examples. \n\n5 REALISTIC ENSEMBLE SIZES \n\nWe now discuss some effects that occur in learning with ensembles of 'realistic' sizes. 
\nIn an over-regularized ensemble, nothing can be gained by making the students more diverse by training them on smaller, less overlapping training sets. One would also expect this kind of 'diversification' to be unnecessary or even counterproductive when the training noise is high enough to provide sufficient 'inherent' diversity of students. In the large ensemble limit we saw that this effect is suppressed, but it does indeed occur in finite ensembles. Figure 2 shows the dependence of the generalization error on c for an ensemble of 10 identical, under-regularized students with identical training noises T_k = T. For small T, the minimum of ε at nonzero c persists. For larger T, ε is monotonically increasing with c, implying that further diversification of students beyond that caused by the learning noise is wasteful. The plot also shows the performance of the optimal single student (with λ chosen to minimize the generalization error at the given T), demonstrating that the ensemble can perform significantly better by effectively averaging out learning noise. \n\n\fLearning with Ensembles: How Overfitting Can Be Useful \n\n195 \n\nFigure 2: The generalization error of an ensemble with 10 identical students as a function of the test set fraction c. From bottom to top, the curves correspond to training noise T = 0, 0.1, 0.2, …, 1.0. The star on each curve shows the error of the optimal single perceptron (i.e., with optimal weight decay for the given T) trained on all examples, which is independent of c. The parameters for this example are: α = 1, λ = 0.05, σ² = 0.2. \n\nFor realistic ensemble sizes, the presence of learning noise generally reduces the potential for performance improvement by choosing optimal training set sizes. 
In such cases one can still adapt the ensemble weights to optimize performance, again on the basis of the estimate of the ensemble generalization error (6). An example is shown in Figure 3 for an ensemble of size K = 10 with the weight decays λ_k equally spaced on a logarithmic axis between 10⁻³ and 1. \n\nFigure 3: The generalization error of an ensemble of 10 students with different weight decays (marked by stars on the σ²-axis) as a function of the noise level σ². Left: training noise T = 0; right: T = 0.1. The dashed lines are for the ensemble with uniform weights, and the solid line is for optimized ensemble weights. The dotted lines are for the optimal single perceptron trained on all data. The parameters for this example are: α = 1, c = 0.2. \n\nFor both of the temperatures T shown, the ensemble with uniform weights performs worse than the optimal single student. With weight optimization, the generalization performance approaches that of the optimal single student for T = 0, and is actually better at T = 0.1 over the whole range of noise levels σ² shown. Even the best single student from the ensemble can never perform better than the optimal single student, so combining the student outputs in a weighted ensemble average is superior to simply choosing the best member of the ensemble by cross-validation, i.e., on the basis of its estimated generalization error. The reason is that the ensemble average suppresses the learning noise on the individual students. 
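The weight optimization itself needs only the students' test errors and their outputs on (unlabeled) inputs: minimize the estimate (6) over the simplex of positive, normalized weights. A minimal sketch, not the authors' actual procedure (all names and the optimization method ours): a softmax parameterization keeps the weights on the simplex, the gradient is taken by finite differences, and the best weights seen are retained.

```python
import numpy as np

def estimate(w, test_err, F):
    """ε̂(w) = Σ_k w_k ε_test,k − ā(w), eq. (6); F holds the outputs f_k(x)
    of each student on a set of (unlabeled) inputs, one row per student."""
    fbar = w @ F
    ambiguity = w @ np.mean((F - fbar) ** 2, axis=1)
    return w @ test_err - ambiguity

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

def optimize_weights(test_err, F, steps=300, lr=0.1, h=1e-6):
    """Minimize ε̂ over the simplex; starts from uniform weights."""
    K = len(test_err)
    theta = np.zeros(K)                    # softmax(0) = uniform weights
    best_w = softmax(theta)
    best = estimate(best_w, test_err, F)
    for _ in range(steps):
        base = estimate(softmax(theta), test_err, F)
        g = np.zeros(K)
        for i in range(K):                 # finite-difference gradient in θ
            t = theta.copy(); t[i] += h
            g[i] = (estimate(softmax(t), test_err, F) - base) / h
        theta -= lr * g
        w = softmax(theta)
        v = estimate(w, test_err, F)
        if v < best:
            best, best_w = v, w
    return best_w

# Toy demo: 10 students with increasingly noisy outputs (all data synthetic).
rng = np.random.default_rng(2)
K, n = 10, 500
y = rng.normal(size=n)
F = y + np.linspace(0.2, 2.0, K)[:, None] * rng.normal(size=(K, n))
test_err = np.mean((y - F) ** 2, axis=1)   # stand-in for held-out test errors
w_opt = optimize_weights(test_err, F)
```

Because the uniform starting point is kept as a candidate, the optimized weights can never score worse on ε̂ than uniform weighting.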
\n\n6 CONCLUSIONS \n\nWe have studied ensemble learning in the simple, analytically solvable scenario of an ensemble of linear students. Our main findings are: In large ensembles, one should use under-regularized students in order to maximize the benefits of the variance-reducing effects of ensemble learning. In this way, the globally optimal generalization error on the basis of all the available data can be reached by optimizing the training set sizes of the individual students. At the same time, an estimate of the generalization error can be obtained. For ensembles of more realistic size, we found that for students subjected to a large amount of noise in the training process it is unnecessary to increase the diversity of students by training them on smaller, less overlapping training sets. In this case, optimizing the ensemble weights can still yield substantially better generalization performance than an optimally chosen single student trained on all data with the same amount of training noise. This improvement is most insensitive to changes in the unknown noise level σ² if the weight decays of the individual students cover a wide range. We expect most of these conclusions to carry over, at least qualitatively, to ensemble learning with nonlinear models, and this correlates well with experimental results presented in [7]. \n\nReferences \n\n[1] C. Granger, Journal of Forecasting 8, 231 (1989). \n[2] D. Wolpert, Neural Networks 5, 241 (1992). \n[3] L. Breiman, Tutorial at NIPS 7 and personal communication. \n[4] L. Hansen and P. Salamon, IEEE Trans. Pattern Anal. and Mach. Intell. 12, 993 (1990). \n[5] M. P. Perrone and L. N. Cooper, in Neural Networks for Speech and Image Processing, ed. R. J. Mammone (Chapman-Hall, 1993). \n[6] S. Hashem, Optimal Linear Combinations of Neural Networks, Tech. Rep. PNL-SA-25166, submitted to Neural Networks (1995). \n[7] A. Krogh and J. Vedelsby, in NIPS 7, ed. G. Tesauro et al., p. 231 (MIT Press, 1995). \n[8] R. Meir, in NIPS 7, ed. G. Tesauro et al., p. 295 (MIT Press, 1995). \n[9] A. Krogh and J. A. Hertz, J. Phys. A 25, 1135 (1992). \n[10] J. A. Hertz, A. Krogh, and G. I. Thorbergsson, J. Phys. A 22, 2133 (1989). \n[11] P. Sollich, J. Phys. A 27, 7771 (1994). \n", "award": [], "sourceid": 1044, "authors": [{"given_name": "Peter", "family_name": "Sollich", "institution": null}, {"given_name": "Anders", "family_name": "Krogh", "institution": null}]}