{"title": "Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping", "book": "Advances in Neural Information Processing Systems", "page_first": 402, "page_last": 408, "abstract": null, "full_text": "Overfitting in Neural Nets: Backpropagation,\nConjugate Gradient, and Early Stopping\n\nRich Caruana\nCALD, CMU\n5000 Forbes Ave.\nPittsburgh, PA 15213\ncaruana@cs.cmu.edu\n\nSteve Lawrence\nNEC Research Institute\n4 Independence Way\nPrinceton, NJ 08540\nlawrence@research.nj.nec.com\n\nLee Giles\nInformation Sciences\nPenn State University\nUniversity Park, PA 16801\ngiles@ist.psu.edu\n\nAbstract\n\nThe conventional wisdom is that backprop nets with excess hidden units\ngeneralize poorly. We show that nets with excess capacity generalize\nwell when trained with backprop and early stopping. Experiments suggest\ntwo reasons for this: 1) Overfitting can vary significantly in different\nregions of the model. Excess capacity allows better fit to regions of high\nnon-linearity, and backprop often avoids overfitting the regions of low\nnon-linearity. 2) Regardless of size, nets learn task subcomponents in\nsimilar sequence. Big nets pass through stages similar to those learned\nby smaller nets. Early stopping can stop training the large net when it\ngeneralizes comparably to a smaller net. We also show that conjugate\ngradient can yield worse generalization because it overfits regions of low\nnon-linearity when learning to fit regions of high non-linearity.\n\n1 Introduction\nIt is commonly believed that large multi-layer perceptrons (MLPs) generalize poorly: nets\nwith too much capacity overfit the training data. Restricting net capacity prevents overfitting\nbecause the net has insufficient capacity to learn models that are too complex. This\nbelief is consistent with a VC-dimension analysis of net capacity vs. 
generalization: the\nmore free parameters in the net the larger the VC-dimension of the hypothesis space, and\nthe less likely the training sample is large enough to select a (nearly) correct hypothesis [2].\n\nOnce it became feasible to train large nets on real problems, a number of MLP users noted\nthat the overfitting they expected from nets with excess capacity did not occur. Large nets\nappeared to generalize as well as smaller nets, and sometimes better. The earliest report\nof this that we are aware of is Martin and Pittman in 1991: \"We find only marginal and\ninconsistent indications that constraining net capacity improves generalization\" [7].\n\nWe present empirical results showing that MLPs with excess capacity often do not overfit.\nOn the contrary, we observe that large nets often generalize better than small nets of\nsufficient capacity. Backprop appears to use excess capacity to better fit regions of high\nnon-linearity, while still fitting regions of low non-linearity smoothly. (This desirable\nbehavior can disappear if a fast training algorithm such as conjugate gradient is used instead\nof backprop.) Nets with excess capacity trained with backprop appear first to learn models\nsimilar to models learned by smaller nets. If early stopping is used, training of the large net\ncan be halted when the large net's model is similar to models learned by smaller nets.\n\n[Figure 1 panels: Order 10 and Order 20 (top); 10 Hidden Nodes and 50 Hidden Nodes (bottom). Legend: Approximation, Training Data, Target Function Without Noise.]\n\nFigure 1: Top: Polynomial fit to data from y = sin(x/3) + \u03bd. Order 20 overfits. Bottom: Small and\nlarge MLPs fit to same data. The large MLP does not overfit significantly more than the small MLP.\n\n2 Overfitting\nMuch has been written about overfitting and the bias/variance tradeoff in neural nets and\nother machine learning models [2, 12, 4, 8, 5, 13, 6]. The top of Figure 1 illustrates\npolynomial overfitting. We created a training dataset by evaluating y = sin(x/3) + \u03bd\nat x = 0, 1, 2, ..., 20, where \u03bd is a uniformly distributed random variable between -0.25 and\n0.25. We fit polynomial models with orders 2-20 to the data. Underfitting occurs with\norder 2. The fit is good with order 10. As the order (and number of parameters) increases,\nhowever, significant overfitting (poor generalization) occurs. At order 20, the polynomial\nfits the training data well, but interpolates poorly.\n\nThe bottom of Figure 1 shows MLPs fit to the data. We used a single hidden layer MLP,\nbackpropagation (BP), and 100,000 stochastic updates. The learning rate was reduced\nlinearly to zero from an initial rate of 0.5 (reducing the learning rate improves convergence,\nand linear reduction performs similarly to other schedules [3]). This schedule and number\nof updates trains the MLPs to completion. (We examine early stopping in Section 4.) As\nwith polynomials, the smallest net with one hidden unit (HU) (4 weights) underfits\nthe data. The fit is good with two HU (7 weights). Unlike polynomials, however, networks\nwith 10 HU (31 weights) and 50 HU (151 weights) also yield good models. 
MLPs with\nseven times as many parameters as data points trained with BP do not significantly overfit\nthis data. The experiments in Section 4 confirm that this bias of BP-trained MLPs towards\nsmooth models is not limited to the simple 2-D problem used here.\n\n3 Local Overfitting\nRegularization methods such as weight decay typically assume that overfitting is a global\nphenomenon. But overfitting can vary significantly in different regions of a model. Figure\n2 shows polynomial fits for data generated from the following equation:\n\ny = -cos(x) + \u03bd          for 0 \u2264 x < \u03c0\ny = cos(3(x - \u03c0)) + \u03bd    for \u03c0 \u2264 x \u2264 2\u03c0        (Equation 1)\n\nFive equally spaced points were generated in the first region, and 15 in the second region,\nso that the two regions have different data densities and different underlying functions.\nOverfitting is different in the two regions.\n\n[Figure 2 panels: Order 2, Order 6, Order 10, Order 16. Legend: Approximation, Training Data, Target Function Without Noise.]\n\nFigure 2: Polynomial approximation of data from Equation 1 as the order of the model is increased\nfrom 2 to 16. The overfitting behavior differs in the left and right hand regions.\n\nIn Figure 2 the order 6 model fits the left region\nwell, but larger models overfit it. The order 6 model underfits the region on the right, and\nthe order 10 model fits it better. No model performs well on both regions. Figure 3 shows\nMLPs trained on the same data (20,000 batch updates, learning rate linearly reduced to\nzero starting at 0.5). Small nets underfit. 
Larger nets, however, fit the entire function well\nwithout significant overfitting in the left region.\n\nThe ability of MLPs to fit both regions of low and high non-linearity well (without overfitting)\ndepends on the training algorithm. Conjugate gradient (CG) is the most popular\nsecond order method. CG results in lower training error for this problem, but overfits\nsignificantly. Figure 4 shows results for 10 trials for BP and CG. Large BP nets generalize\nbetter on this problem; even the optimal size CG net is prone to overfitting. The degree\nof overfitting varies in different regions. When the net is large enough to fit the region of\nhigh non-linearity, overfitting is often seen in the region of low non-linearity.\n\n4 Generalization, Network Capacity, and Early Stopping\nThe results in Sections 2 and 3 suggest that BP nets are less prone to overfitting than\nexpected. But MLPs can and do overfit. This section examines overfitting vs. net size\non seven problems: NETtalk [10], 7 and 12 bit parity, an inverse kinematic model for a\nrobot arm (thanks to Sebastian Thrun for the simulator), Base 1 and Base 2 (two sonar\nmodeling problems using data collected from a robot wandering hallways at CMU), and\nvision data used to learn to steer an autonomous car [9]. These problems exhibit a variety\nof characteristics. Some are Boolean. Others are continuous. Some have noise. Others are\nnoise-free. Some have many inputs or outputs. Others have few inputs or outputs.\n\n4.1 Results\nFor each problem we used small training sets (100-1000 points, depending on the problem)\nso that overfitting was possible. We trained fully connected feedforward MLPs with one\nhidden layer whose size varied from 2 to 800 HU (about 500-100,000 parameters). All the\nnets were trained with BP using stochastic updates, learning rate 0.1, and momentum 0.9.
\n\nWe used early stopping for regularization because it doesn't interfere with backprop's ability\nto control capacity locally. Early stopping combined with backprop is so effective that\nvery large nets can be trained without significant overfitting. Section 4.2 explains why.\n\n[Figure 3 panels: 1 Hidden Unit, 4 Hidden Units, 10 Hidden Units, 100 Hidden Units.]\n\nFigure 3: MLP approximation using backpropagation (BP) training of data from Equation 1 as the\nnumber of hidden units is increased. No significant overfitting can be seen.\n\n[Figure 4 panels: BP (left), CG (right). Horizontal axis: Number of Hidden Nodes (5, 10, 25, 50).]\n\nFigure 4: Test Normalized Mean Squared Error for MLPs trained with BP (left) and CG (right).\nResults are shown with both box-whiskers plots and the mean plus and minus one standard deviation.\n\nFigure 5 shows generalization curves for four of the problems. Examining the results for\nall seven problems, we observe that on only three (Base 1, Base 2, and ALVINN) do\nnets that are too large yield worse generalization than smaller networks, but the loss is\nsurprisingly small. Many trials were required before statistical tests confirmed that the\ndifferences between the optimal size net and the largest net were significant. Moreover, the\nresults suggest that generalization is hurt more by using a net that is a little too small than\nby using one that is far too large, i.e., it is better to make nets too large than too small.\n\nFor most tasks and net sizes, we trained well beyond the point where generalization performance\npeaked. Because we had complete generalization curves, we noticed something\nunexpected. On some tasks, small nets overtrained considerably. 
The NETtalk graph in\nFigure 5 is a good example. Regularization (e.g., early stopping) is critical for nets of all\nsizes, not just ones that are too big. Nets with restricted capacity can overtrain.\n\n4.2 Why Excess Capacity Does Not Hurt\nBP nets initialized with small weights can develop large weights only after the number of\nupdates is large. Thus BP nets consider hypotheses with small weights before hypotheses\nwith large weights. Nets with large weights have more representational power, so simple\nhypotheses are explored before complex hypotheses.\n\n[Figure 5 panels: NETtalk; Inverse Kinematics; Base 1: Average of 10 Runs; Base 2: Average of 10 Runs. Horizontal axis: Pattern Presentations. Legend entries: 2, 8, 32, 128, and 512 hidden units.]\n\nFigure 5: Generalization performance vs. net size for four of the seven test problems.\n\nWe analyzed what nets of different size learn while they are trained. We compared the\ninput/output behavior of nets at different stages of learning on large samples of test patterns.\nWe compare the input/output behavior of two nets by computing the squared error\nbetween the predictions made by the two nets. If two nets make the same predictions for\nall test cases, they have learned the same model (even though each model is represented\ndifferently), and the squared error between the two models is zero. If two nets make different\npredictions for test cases, they have learned different models, and the squared error\nbetween them is large. This is not the error the models make predicting the true labels, but\nthe difference between predictions made by two different models. Two models can have\npoor generalization (large error on true labels), but have near zero error compared to each\nother if they are similar models. But two models with good generalization (low error on\ntrue labels) must have low error compared to each other.\n\nThe first graph in Figure 5 shows learning curves for nets with 10, 25, 50, 100, 200, and 400\nHU trained on NETtalk. For each size, we saved the net from the epoch that generalized\nbest on a large test set. This gives us the best model of each size found by backprop. We\nthen trained a BP net with 800 HU, and after each epoch compared this net's model with\nthe best models saved for nets of 10-400 HU. 
This lets us compare the sequence of models\nlearned by the 800 HU net to the best models learned by smaller nets.\n\n[Figure 6 plot: Similarity of 800 HU Net During Training to Smaller Size Peak Performers. Horizontal axis: Pattern Presentations. Curves: 10hu peak, 25hu peak, 50hu peak, 100hu peak, 200hu peak, 400hu peak.]\n\nFigure 6: I/O similarity during training between an 800 hidden unit net and smaller nets (10, 25, 50,\n100, 200, and 400 hidden units) trained on NETtalk.\n\nFigure 6 shows this comparison. The horizontal axis is the number of backprop passes\napplied to the 800 HU net. The vertical axis is the error between the 800 HU net model\nand the best model for each smaller net. The 800 HU net starts off distant from the good\nsmaller models, then becomes similar to the good models, and then diverges from them.\nThis is expected. What is interesting is that the 800 HU net first becomes closest to the best\n10 HU net, then closest to the 25 HU net, then closest to the 50 HU net, etc. As it is trained,\nthe 800 HU net learns a sequence of models similar to the models learned by smaller nets.\nIf early stopping is used, training of the 800 HU net can be stopped when it behaves similarly\nto the best model that could be learned with nets of 10, 25, 50, ... HU. Large BP nets\nlearn models similar to those learned by smaller nets. 
If a BP net with too much capacity\nwould overfit, early stopping could stop training when the model was similar to a model\nthat would have been learned by a smaller net of optimal size.\n\nThe error between models is about 200-400, yet the generalization error is about 1600.\nThe models are much closer to each other than any of them are to the true model. With\nearly stopping, what counts is the closest approach of each model to the target function,\nnot where models end up late in training. With early stopping there is little disadvantage\nto using models that are too large because their learning trajectories are similar to those\nfollowed by smaller nets of more optimal size.\n\n5 Related Work\nOur results show that models learned by backprop are biased towards \"smooth\" solutions.\nAs nets with excess capacity are trained, they first explore smoother models similar to\nthe models smaller nets would have learned. Weigend [11] performed an experiment that\nshowed BP nets learn a problem's eigenvectors in sequence, learning the 1st eigenvector\nfirst, then the 2nd, etc. His result complements our analysis of what nets of different sizes\nlearn: if large nets learn an eigenvector sequence similar to smaller nets, then the models\nlearned by the large net will pass through intermediate stages similar to what is learned by\nsmall nets (but if and only if nets of different sizes learn the eigenvectors equally well, which is an\nassumption we do not need to make).\n\nTheoretical work by Bartlett [1] supports our results. 
Bartlett notes: \"the VC-bounds seem loose;\nneural nets often perform successfully with training sets that are considerably smaller than\nthe number of weights.\" Bartlett shows (for classification) that the number of training\nsamples only needs to grow according to A^{2l} (ignoring log factors) to avoid overfitting,\nwhere A is a bound on the total weight magnitudes and l is the number of layers in the\nnetwork. This result suggests that a net with smaller weights will generalize better than a\nsimilar net with large weights. Examining the weights from BP and CG nets shows that BP\ntraining typically results in smaller weights.\n\n6 Summary\nNets of all sizes overfit some problems. But generalization is surprisingly insensitive to\nexcess capacity if the net is trained with backprop. Because BP nets with excess capacity\nlearn a sequence of models functionally similar to what smaller nets learn, early stopping\ncan often be used to stop training large nets when they have learned models similar to those\nlearned by smaller nets of optimal size. This means there is little loss in generalization\nperformance for nets with excess capacity if early stopping can be used.\n\nOverfitting is not a global phenomenon, although methods for controlling it often assume\nthat it is. Overfitting can vary significantly in different regions of the model. MLPs trained\nwith BP use excess parameters to improve fit in regions of high non-linearity, while not\nsignificantly overfitting other regions. Nets trained with conjugate gradient, however, are\nmore sensitive to net size. BP nets appear to be better than CG nets at avoiding overfitting\nin regions with different degrees of non-linearity, perhaps because CG is more effective\nat learning more complex functions that overfit training data, while BP is biased toward\nlearning smoother functions.
\n\nReferences\n\n[1] Peter L. Bartlett. For valid generalization the size of the weights is more important than the size\nof the network. In Advances in Neural Information Processing Systems, volume 9, page 134.\nThe MIT Press, 1997.\n\n[2] E.B. Baum and D. Haussler. What size net gives valid generalization? Neural Computation,\n1(1):151-160, 1989.\n\n[3] C. Darken and J.E. Moody. Note on learning rate schedules for stochastic optimization. In\nAdvances in Neural Information Processing Systems, volume 3, pages 832-838. Morgan Kaufmann, 1991.\n\n[4] S. Geman et al. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1-58, 1992.\n\n[5] A. Krogh and J.A. Hertz. A simple weight decay can improve generalization. In Advances in\nNeural Information Processing Systems, volume 4, pages 950-957. Morgan Kaufmann, 1992.\n\n[6] Y. Le Cun, J.S. Denker, and S.A. Solla. Optimal Brain Damage. In D.S. Touretzky, editor,\nAdvances in Neural Information Processing Systems, volume 2, pages 598-605, San Mateo,\n1990. (Denver 1989), Morgan Kaufmann.\n\n[7] G.L. Martin and J.A. Pittman. Recognizing hand-printed letters and digits using backpropagation\nlearning. Neural Computation, 3:258-267, 1991.\n\n[8] J.E. Moody. The effective number of parameters: An analysis of generalization and regularization\nin nonlinear learning systems. In Advances in Neural Information Processing Systems,\nvolume 4, pages 847-854. Morgan Kaufmann, 1992.\n\n[9] D.A. Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In D.S. Touretzky,\neditor, Advances in Neural Information Processing Systems, volume 1, pages 305-313, San\nMateo, 1989. (Denver 1988), Morgan Kaufmann.\n\n[10] T. Sejnowski and C. Rosenberg. Parallel networks that learn to pronounce English text. Complex\nSystems, 1:145-168, 1987.\n\n[11] A. Weigend. On overfitting and the effective number of hidden units. In Proceedings of the\n1993 Connectionist Models Summer School, pages 335-342. Lawrence Erlbaum Associates,\n1993.\n\n[12] A.S. Weigend, D.E. Rumelhart, and B.A. Huberman. Generalization by weight-elimination with\napplication to forecasting. In Advances in Neural Information Processing Systems, volume 3,\npages 875-882. Morgan Kaufmann, 1991.\n\n[13] D. Wolpert. On bias plus variance. Neural Computation, 9(6):1211-1243, 1997.\n", "award": [], "sourceid": 1895, "authors": [{"given_name": "Rich", "family_name": "Caruana", "institution": null}, {"given_name": "Steve", "family_name": "Lawrence", "institution": null}, {"given_name": "C.", "family_name": "Giles", "institution": null}]}