{"title": "Regularizing AdaBoost", "book": "Advances in Neural Information Processing Systems", "page_first": 564, "page_last": 570, "abstract": "", "full_text": "Regularizing AdaBoost\n\nGunnar R\u00e4tsch, Takashi Onoda*, Klaus-R. M\u00fcller\nGMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany\n\n{raetsch, onoda, klaus}@first.gmd.de\n\nAbstract\n\nBoosting methods maximize a hard classification margin and are known as powerful techniques that do not exhibit overfitting for low noise cases. For noisy data, however, boosting still tries to enforce a hard margin and thereby gives too much weight to outliers, which leads to the dilemma of non-smooth fits and overfitting. Therefore we propose three algorithms to allow for soft margin classification by introducing regularization with slack variables into the boosting concept: (1) AdaBoostreg and regularized versions of (2) linear and (3) quadratic programming AdaBoost. Experiments show the usefulness of the proposed algorithms in comparison to another soft margin classifier: the support vector machine.\n\n1 Introduction\n\nBoosting and other ensemble methods have been used with success in several applications, e.g. OCR [13, 8]. For low noise cases several lines of explanation have been proposed as candidates for explaining why boosting methods work well. (a) Breiman proposed that during boosting a \"bagging effect\" also takes place [3], which reduces the variance and effectively limits the capacity of the system, and (b) Freund et al. [12] show that boosting classifies with large margins, since the error function of boosting can be written as a function of the margin and every boosting step tries to minimize this function by maximizing the margin [9, 11]. 
\n\nRecently, studies with noisy patterns have shown that boosting does indeed overfit on noisy data; this holds for boosted decision trees [10], RBF nets [11] and also other kinds of classifiers (e.g. [7]). So it is clearly a myth that boosting methods will not overfit. The fact that boosting tries to maximize the margin is exactly the argument that can be used to understand why boosting must necessarily overfit for noisy patterns or overlapping distributions, and we give asymptotic arguments for this statement in section 3. Because the hard margin (the smallest margin in the training set) plays a central role in causing overfitting, we propose to relax the hard margin classification and allow for misclassifications by using the soft margin classifier concept that has been applied successfully to support vector machines [5].\n\n*Permanent address: Communication & Information Research Lab. CRIEPI, 2-11-1 Iwado kita, Komae-shi, Tokyo 201-8511, Japan.\n\nOur view is that the margin concept is central for the understanding of both support vector machines and boosting methods. So far it is not clear what the optimal margin distribution should be that a learner has to achieve for optimal classification in the noisy case. For data without noise a hard margin might be the best choice. However, for noisy data there is always a trade-off between believing in the data and mistrusting it, as any given data point could be an outlier. In general (e.g. neural network) learning strategies this leads to the introduction of regularization, which reflects the prior that we have about a problem. We will also introduce a regularization strategy (analogous to weight decay) into boosting. 
This strategy uses slack variables to achieve a soft margin (section 4). Numerical experiments show the validity of our regularization approach in section 5, and finally a brief conclusion is given.\n\n2 AdaBoost Algorithm\n\nLet {h_t(x) : t = 1, ..., T} be an ensemble of T hypotheses defined on an input vector x, and let c = [c_1 ... c_T] be their weights, satisfying c_t > 0 and |c| = sum_t c_t = 1. In the binary classification case, the output is one of two class labels, i.e. h_t(x) = \u00b11. The ensemble generates the label which is the weighted majority of the votes: sgn(sum_t c_t h_t(x)). In order to train this ensemble of T hypotheses {h_t(x)} and c, several algorithms have been proposed: bagging, where the weighting is simply c_t = 1/T [2], and AdaBoost/Arcing, where the weighting scheme is more complicated [12]. In the following we give a brief description of AdaBoost/Arcing. We use a special form of Arcing, which is equivalent to AdaBoost [4]. In the binary classification case we define the margin for an input-output pair z_i = (x_i, y_i), i = 1, ..., l by\n\n    mg(z_i, c) = y_i sum_{t=1}^{T} c_t h_t(x_i),    (1)\n\nwhich is between -1 and 1 if |c| = 1. The correct class is predicted if the margin at z is positive. The larger the positive margin, the more confident the decision. AdaBoost maximizes the margin by (asymptotically) minimizing a function of the margin mg(z_i, c) [9, 11]\n\n    g(b) = sum_{i=1}^{l} exp{ -(|b|/2) mg(z_i, c) },    (2)\n\nwhere b = [b_1 ... b_T] and |b| = sum_t b_t (starting from b = 0). Note that b_t is the unnormalized weighting of the hypothesis h_t, whereas c is simply a normalized version of b, i.e. c = b/|b|. In order to find the hypothesis h_t, the learning examples z_i are weighted in each iteration t with w_t(z_i). 
Using a bootstrap on this weighted sample we train h_t; alternatively a weighted error function can be used (e.g. weighted MSE). The weights w_t(z_i) are computed according to^1\n\n    w_t(z_i) = exp{ -|b_{t-1}| mg(z_i, c_{t-1}) / 2 } / sum_{j=1}^{l} exp{ -|b_{t-1}| mg(z_j, c_{t-1}) / 2 }    (3)\n\nand the training error epsilon_t of h_t is computed as epsilon_t = sum_{i=1}^{l} w_t(z_i) I(y_i != h_t(x_i)), where I(true) = 1 and I(false) = 0. For each given hypothesis h_t we have to find a weight b_t such that g(b) is minimized. One can optimize this parameter by a line search or directly by analytic minimization [4], which gives b_t = log(1 - epsilon_t) - log epsilon_t. Interestingly, we can write\n\n    w_t(z_i) = [ dg(b_{t-1}) / d mg(z_i, b_{t-1}) ] / sum_{j=1}^{l} [ dg(b_{t-1}) / d mg(z_j, b_{t-1}) ]    (4)\n\nas a gradient of g(b_{t-1}) with respect to the margins. The weighted minimization with w_t(z_i) will give a hypothesis h_t which is an approximation to the best possible hypothesis h_t^* that would be obtained by minimizing g directly. Note that the weighted minimization (bootstrap, weighted LS) will not necessarily give h_t^*, even if epsilon_t is minimized [11]. AdaBoost is therefore an approximate gradient descent method which minimizes g asymptotically.\n\n^1 This direct way of computing the weights is equivalent to the update rule of AdaBoost.\n\n3 Hard margins\n\nA decrease of g(c, |b|) := g(b) is predominantly achieved by improvements of the margin mg(z_i, c). If the margin mg(z_i, c) is negative, then the error g(c, |b|) clearly takes a big value, which is additionally amplified by |b|. So, AdaBoost tries to correct negative margins efficiently in order to reduce the error g(c, |b|).\n\nNow, let us consider the asymptotic case, where the number of iterations and therefore also |b| take large values [9]. 
In this case, when the values of all mg(z_i, c), i = 1, ..., l, are almost the same but have small differences, these differences are amplified strongly in g(c, |b|). Obviously the function g(c, |b|) is asymptotically very sensitive to small differences between margins. Therefore, the margins mg(z_i, c) of the training patterns from the margin area (boundary area between classes) should asymptotically converge to the same value. From Eq. (3), when |b| takes a very big value, AdaBoost learning becomes a \"hard competition\" case: only the patterns with the smallest margin will get high weights, the other patterns are effectively neglected in the learning process. In order to confirm that the above reasoning is correct, Fig. 1 (left) shows margin distributions after 10^4 AdaBoost iterations for a toy example [9] at different noise levels generated by the uniform distribution U(0.0, sigma^2). From this figure, it becomes apparent that the margin distribution asymptotically makes a step at a fixed size of the margin for training patterns which are in the margin area. In previous studies [9, 11] we observed that those patterns exhibit a large overlap with the support vectors of support vector machines. The numerical results support our theoretical asymptotic analysis. The property of AdaBoost to produce a big margin area (no pattern in the area, i.e. a hard margin) will not always lead to the best generalization ability (cf. [5, 11]). This is especially true if the training patterns have classification or input noise. In our experiments with noisy data, we often observed that AdaBoost overfitted (for a high number of boosting iterations). Fig. 1 (middle) shows a typical overfitting behaviour in the generalization error for AdaBoost: after only 80 boosting iterations the best generalization performance is already achieved. Quinlan [10] and Grove et al. [7] also observed overfitting and that the generalization performance of AdaBoost is often worse than that of the single classifier if the data has classification noise.\n\nFigure 1: Margin distributions for AdaBoost (left) for different noise levels (sigma^2 = 0% (dotted), 9% (dashed), 16% (solid)) with a fixed number of RBF centers for the base hypothesis, typical overfitting behaviour in the generalization error as a function of the number of iterations (middle), and a typical decision line (right) generated by AdaBoost using RBF networks in the case with noise (here: 30 centers and sigma^2 = 16%; smoothed).\n\nThe first reason for overfitting is the increasing value of |b|: noisy patterns (e.g. badly labelled ones) can asymptotically have an \"unlimited\" influence on the decision line, leading to overfitting (cf. Eq. (3)). Another reason is the classification with a hard margin, which also means that all training patterns will asymptotically be correctly classified (without any capacity limitation!). In the presence of noise this will certainly not be the right concept, because the best decision line (e.g. Bayes) usually will not give a training error of zero. So, the achievement of large hard margins for noisy data will produce hypotheses which are too complex for the problem. 
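The hard competition effect described for Eq. (3) can be illustrated with a minimal Python sketch. It evaluates the normalized weights exp{-|b| mg / 2} for a fixed set of margins while |b| grows. The function name adaboost_weights and the margin values are illustrative assumptions and are not taken from the experiments of this paper.

```python
import math

def adaboost_weights(margins, b_norm):
    # Eq. (3): w(z_i) is proportional to exp(-|b| * mg(z_i, c) / 2).
    # Subtracting the smallest margin only rescales numerator and
    # denominator by the same factor and avoids overflow for large |b|.
    m_min = min(margins)
    raw = [math.exp(-b_norm * (m - m_min) / 2.0) for m in margins]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical margins of four training patterns (assumed values).
margins = [0.20, 0.21, 0.25, 0.60]
for b_norm in (1.0, 100.0, 10000.0):
    weights = adaboost_weights(margins, b_norm)
    print(b_norm, [round(w, 3) for w in weights])
```

For small |b| the weights are nearly uniform, while for very large |b| almost all weight moves to the pattern with the smallest margin, even though its margin differs from the next one by only 0.01.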
\n\n4 How to get Soft Margins\n\nChanging AdaBoost's error function. In order to avoid overfitting, we introduce slack variables, which are similar to those of the support vector algorithm [5, 14], into AdaBoost. We know that all training patterns will get non-negative stabilities after many iterations (see Fig. 1 (left)), i.e. mg(z_i, c) >= rho for all i = 1, ..., l, where rho is the minimum margin of the patterns. Due to this fact, AdaBoost often produces high weights for the difficult training patterns by enforcing a non-negative margin rho >= 0 (for every pattern including outliers), and this property will eventually lead to overfitting, as observed in Fig. 1. Therefore, we introduce some variables xi_i^t, the slack variables, and get\n\n    mg(z_i, c) >= rho - C xi_i^t,    xi_i^t > 0.    (5)\n\nIn these inequalities, the xi_i^t are positive, and if a training pattern has high weights in the previous iterations, xi_i^t should be increasing. In this way, for example, we do not force outliers to be classified according to their possibly wrong labels, but we allow for some errors. In this sense we get a trade-off between the margin and the importance of a pattern in the training process (depending on the constant C >= 0). If we choose C = 0 in Eq. (5), the original AdaBoost algorithm is retrieved. If C is chosen too high, the data is not taken seriously. We adopt a prior on the weights w_r(z_i) that punishes large weights in analogy to weight decay and choose\n\n    xi_i^t = ( sum_{r=1}^{t} c_r w_r(z_i) )^2,    (6)\n\nwhere the inner sum is the cumulative weight of the pattern in the previous iterations (we call it the influence of a pattern, similar to Lagrange multipliers in SVMs). With this xi_i^t, AdaBoost is not changed for easily classifiable patterns, but is changed for difficult patterns. From Eq. 
(5), we can derive a new error function:\n\n    g_reg(c_t, |b_t|) = sum_{i=1}^{l} exp{ -(|b_t|/2) [ mg(z_i, c_t) + C xi_i^t ] }.    (7)\n\nWith this error function, we can control the trade-off between the weights which the pattern had in the last iterations and the achieved margin. The weight w_t(z_i) of a pattern is computed as the derivative of Eq. (7) with respect to mg(z_i, b_{t-1}) (cf. Eq. (4)) and is given by\n\n    w_t(z_i) = exp{ -|b_{t-1}| ( mg(z_i, c_{t-1}) + C xi_i^{t-1} ) / 2 } / sum_{j=1}^{l} exp{ -|b_{t-1}| ( mg(z_j, c_{t-1}) + C xi_j^{t-1} ) / 2 }.    (8)\n\nThus we can get an update rule for the weight of a training pattern [11]\n\n    w_t(z_i) = w_{t-1}(z_i) exp{ b_{t-1} I(y_i != h_{t-1}(x_i)) + C xi_i^{t-2} |b_{t-2}| - C xi_i^{t-1} |b_{t-1}| }.    (9)\n\nIt is more difficult to compute the weight b_t of the t-th hypothesis analytically. However, we can get b_t by a line search procedure over Eq. (7), which has a unique solution because d^2 g_reg / d b_t^2 > 0 is satisfied. This line search can be implemented very efficiently. With this line search, we can now also use real-valued outputs of the base hypotheses, while the original AdaBoost algorithm could not (cf. also [6]).\n\nTable 1: Pseudocode description of the algorithms\n\nCommon steps for all three algorithms:\n  1. Run AdaBoost on dataset Z to get T hypotheses h and their weights c.\n  2. Construct the loss matrix L_{i,t} = -1 if h_t(x_i) != y_i, and 1 otherwise.\n\nLeft: LP-AdaBoost(Z, T)\n  minimize  -rho\n  s.t.      sum_{t=1}^{T} c_t L_{i,t} >= rho\n            c_t >= 0, sum_t c_t = 1\n\nMiddle: LPreg-AdaBoost(Z, T, C)\n  minimize  -rho + C sum_i xi_i\n  s.t.      sum_{t=1}^{T} c_t L_{i,t} >= rho - xi_i\n            c_t >= 0, sum_t c_t = 1\n            xi_i >= 0\n\nRight: QPreg-AdaBoost(Z, T, C)\n  minimize  ||b||^2 + C sum_i xi_i\n  s.t.      sum_{t=1}^{T} b_t L_{i,t} >= 1 - xi_i\n            b_t >= 0\n            xi_i >= 0\n\nOptimizing a given ensemble. In Grove et al. 
[7], it was shown how to use linear programming to maximize the minimum margin for a given ensemble, and LP-AdaBoost was proposed (table 1 left). This algorithm maximizes the minimum margin on the training patterns. It achieves a hard margin (as AdaBoost asymptotically does) already for a small number of iterations. By the reasoning on hard margins (section 3), this cannot generalize well. If we introduce slack variables into LP-AdaBoost, we get the algorithm LPreg-AdaBoost (table 1 middle) [11]. This modification allows some patterns to have lower margins than rho (especially lower than 0). There is a trade-off between (a) making all margins bigger than rho and (b) maximizing rho. This trade-off is controlled by the constant C.\n\nAnother formulation of an optimization problem can be derived from the support vector algorithm. The optimization objective of an SVM is to find a function h^w which minimizes a functional of the form E = ||w||^2 + C sum_i xi_i, where y_i h(x_i) >= 1 - xi_i and the norm of the parameter vector w is the measure for the complexity of the hypothesis h^w [14]. For ensemble learning we do not have such a measure of complexity, and so we use the norm of the hypothesis weight vector b. For |b| = 1 this is a small value if the elements are approximately equal (analogy to bagging) and a large value when there are some strongly emphasized hypotheses (far away from bagging). Experimentally, we found that ||b||^2 is often larger for more complex hypotheses. Thus, we can apply the optimization principles of SVMs to AdaBoost and get the algorithm QPreg-AdaBoost (table 1 right). We effectively use a linear SVM on top of the results of the base hypotheses. 
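The trade-off controlled by C in LPreg-AdaBoost can be made concrete with a small Python sketch. It does not solve the linear program. It only builds the loss matrix L from table 1 and evaluates the objective -rho + C sum_i xi_i for fixed hypothesis weights c, assuming the constraint sum_t c_t L_{i,t} >= rho - xi_i with xi_i >= 0. The function names loss_matrix and lp_reg_objective and the toy predictions are hypothetical.

```python
def loss_matrix(predictions, labels):
    # predictions[t][i] is h_t(x_i) in {-1, +1}; labels[i] is y_i in {-1, +1}.
    # L[i][t] = -1 if h_t(x_i) != y_i, and 1 otherwise.
    return [[1 if h_t[i] == y else -1 for h_t in predictions]
            for i, y in enumerate(labels)]

def lp_reg_objective(L, c, rho, C):
    # Margin of every pattern for the fixed hypothesis weights c.
    margins = [sum(c_t * l_it for c_t, l_it in zip(c, row)) for row in L]
    # Smallest slacks that satisfy margin_i >= rho - xi_i and xi_i >= 0.
    slacks = [max(0.0, rho - m) for m in margins]
    return -rho + C * sum(slacks), margins, slacks

# Toy ensemble of three hypotheses on four patterns (assumed values).
labels = [1, 1, -1, -1]
predictions = [
    [1, 1, -1, 1],
    [1, -1, -1, -1],
    [1, 1, 1, -1],
]
L = loss_matrix(predictions, labels)
c = [1.0 / 3.0] * 3
for C in (0.1, 1.0):
    for rho in (1.0 / 3.0, 1.0):
        objective, margins, slacks = lp_reg_objective(L, c, rho, C)
        print(C, round(rho, 3), round(objective, 3))
```

In this toy setting a small C makes the larger rho with non-zero slacks preferable, while a larger C favours the smaller rho that needs no slack, which is the trade-off described above.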
\n\n5 Experiments\n\nIn order to evaluate the performance of our new algorithms, we make a comparison among the single RBF classifier, the original AdaBoost algorithm, AdaBoostreg (with RBF nets), L/QPreg-AdaBoost and a Support Vector Machine (with RBF kernel). We use ten artificial and real world datasets from the UCI and DELVE benchmark repositories: banana (toy dataset as in [9, 11]), breast cancer, image segment, ringnorm, flare sonar, splice, new-thyroid, titanic, twonorm, waveform. Some of the problems are originally not binary classification problems, hence a (random) partition into two classes was used. At first we generate 20 partitions into training and test set (mostly ~ 60% : 40%). On each partition we train the classifier and get its test set error. The performance is averaged and we get table 2.\n\nTable 2: Comparison among the six methods: single RBF classifier, AdaBoost (AB), AdaBoostreg (ABreg), L/QPreg-AdaBoost (L/QPR) and a Support Vector Machine (SVM): estimation of generalization error in % on 10 datasets (best method in bold face). Clearly, AdaBoostreg gives the best overall performance. For further explanation see text. 
\n\n           RBF        AB         ABreg      LPR        QPR        SVM\nBanana     10.9\u00b10.5   12.3\u00b10.7   10.1\u00b10.5   10.8\u00b10.4   10.9\u00b10.5   11.5\u00b14.7\nCancer     28.7\u00b15.3   30.5\u00b14.5   26.3\u00b14.3   31.0\u00b14.2   26.2\u00b14.7   26.1\u00b14.8\nImage       2.8\u00b10.7    2.5\u00b10.7    2.5\u00b10.7    2.6\u00b10.6    2.4\u00b10.5    2.9\u00b10.7\nRingnorm    1.1\u00b10.3    2.0\u00b10.2    1.1\u00b10.2    2.2\u00b10.4    1.9\u00b10.2    1.1\u00b10.1\nFSonar     34.6\u00b12.1   35.6\u00b11.9   33.6\u00b11.7   35.7\u00b14.5   36.2\u00b11.7   32.5\u00b11.1\nSplice     10.0\u00b10.3   10.1\u00b10.3    9.5\u00b10.2   10.2\u00b11.6   10.1\u00b10.5   10.9\u00b10.7\nThyroid     4.8\u00b12.4    4.4\u00b11.9    4.4\u00b12.1    4.4\u00b12.0    4.4\u00b12.2    4.8\u00b12.2\nTitanic    23.4\u00b11.7   22.7\u00b11.2   22.5\u00b11.0   22.9\u00b11.9   22.7\u00b11.0   22.4\u00b11.0\nTwonorm     2.8\u00b10.2    3.1\u00b10.3    2.1\u00b12.1    3.4\u00b10.6    3.0\u00b10.3    3.0\u00b10.2\nWaveform   10.7\u00b11.0   10.8\u00b10.4    9.9\u00b10.9   10.6\u00b11.0   10.1\u00b10.5    9.8\u00b10.3\nMean %      6.7        9.6        1.0       11.1        4.7        6.3\nWinner %   16.4        8.2       28.5       15.0       15.3       16.6\n\nWe used RBF nets with adaptive centers (some conjugate gradient iterations to optimize positions and widths of the centers) as base hypotheses as described in [1, 11]. In all experiments, we combined 200 hypotheses. Clearly, this number of hypotheses may not be optimal; however, AdaBoost with optimal early stopping is not better than AdaBoostreg. The parameter C of the regularized versions of AdaBoost and the parameters (C, sigma) of the SVM are optimized on the first five training datasets. On each training set 5-fold cross validation is used to find the best model for this dataset^2. Finally, the model parameters are computed as the median of the five estimations. 
This way of estimating the parameters is surely not possible in practice, but it makes this comparison more robust and the results more reliable. The last but one line in Tab. 2, 'Mean %', is computed as follows: for each dataset the average error rates of all classifier types are divided by the minimum error rate and 1 is subtracted. These resulting numbers are averaged over the 10 datasets. The last line shows the probabilities that a method wins, i.e. gives the smallest generalization error, on the basis of our experiments (averaged over all ten datasets). Our experiments on noisy data show that (a) the results of AdaBoost are in almost all cases worse than those of the single classifier (clear overfitting effect) and (b) the results of AdaBoostreg are in all cases (much) better than those of AdaBoost and better than those of the single classifier. Furthermore, we see clearly that (c) the single classifier wins as often as the SVM, (d) L/QPreg-AdaBoost improves the results of AdaBoost, and (e) AdaBoostreg wins most often. L/QPreg-AdaBoost improves the results of AdaBoost in almost all cases because it establishes a soft margin. But the results are not as good as the results of AdaBoostreg and the SVM, because the hypotheses generated by AdaBoost (aimed at constructing a hard margin) may not be the appropriate ones to generate a good soft margin. We also observe that quadratic programming gives slightly better results than linear programming. This may be due to the fact that the hypothesis coefficients generated by LPreg-AdaBoost are more sparse (smaller ensemble). Bigger ensembles may have a better generalization ability (due to the reduction of variance [3]). 
The worse performance of the SVM compared to AdaBoostreg and the unexpected tie between the SVM and the RBF net may be explained by (a) the fixed sigma of the RBF kernel (losing multi-scale information), (b) coarse model selection, and (c) a worse error function of the SV algorithm (noise model). Summarizing, AdaBoost is useful for low noise cases, where the classes are separable (as shown for OCR [13, 8]). AdaBoostreg extends the applicability of boosting to \"difficult separable\" cases and should be applied if the data is noisy.\n\n^2 The parameters are only near-optimal. Only 10 values for each parameter are tested.\n\n6 Conclusion\n\nWe introduced three algorithms to alleviate the overfitting problems of boosting algorithms for high noise data: (1) direct incorporation of the regularization term into the error function (Eq. (7)), and the use of (2) linear and (3) quadratic programming with constraints given by the slack variables. The essence of our proposal is to introduce slack variables for regularization in order to allow for soft margin classification, in contrast to the hard margin classification used before. The slack variables basically allow us to control how much we trust the data, so we are permitted to ignore outliers which would otherwise have spoiled our classification. This generalization is very much in the spirit of support vector machines, which also trade off the maximization of the margin against the minimization of the classification errors in the slack variables. In our experiments, AdaBoostreg showed a better overall generalization performance than all other algorithms, including the Support Vector Machines. We conjecture that this unexpected result is mostly due to the fact that the SVM can only use one sigma and therefore loses scaling information. 
AdaBoost does not have this limitation. So far we balance our trust in the data and the margin maximization by cross validation. It would be better if we knew the \"optimal\" margin distribution that we could achieve for classifying noisy patterns; then we could of course balance the errors and the margin sizes optimally. In future work, we plan to establish more connections between AdaBoost and SVMs.\n\nAcknowledgements: We thank A. Smola, B. Sch\u00f6lkopf, T. Frie\u00df and D. Schuurmans for valuable discussions. Partial funding from EC STORM project grant number 25387 is gratefully acknowledged. The breast cancer domain was obtained from the University Medical Centre, Inst. of Oncology, Ljubljana, Yugoslavia. Thanks go to M. Zwitter and M. Soklic for providing the data.\n\nReferences\n\n[1] C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon, 1995.\n[2] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.\n[3] L. Breiman. Arcing classifiers. Tech. Rep. 460, Berkeley Stat. Dept., 1997.\n[4] L. Breiman. Prediction games and arcing algorithms. Tech. Rep. 504, Berkeley Stat. Dept., 1997.\n[5] C. Cortes, V. Vapnik. Support vector network. Mach. Learn., 20:273-297, 1995.\n[6] R. Schapire, Y. Singer. Improved Boosting Algorithms Using Confidence-rated Predictions. In Proc. of COLT'98.\n[7] A.J. Grove, D. Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In Proc. 15th Nat. Conf. on AI, 1998. To appear.\n[8] Y. LeCun et al. Learning algorithms for classification: A comparison on handwritten digit recognition. Neural Networks, pages 261-276, 1995.\n[9] T. Onoda, G. R\u00e4tsch, and K.-R. M\u00fcller. An asymptotic analysis of AdaBoost in the binary classification case. In Proc. of ICANN'98, April 1998. 
\n[10] J. Quinlan. Boosting first-order learning. In Proc. of the 7th Internat. Workshop on Algorithmic Learning Theory, LNAI, 1160, 143-155. Springer.\n[11] G. R\u00e4tsch. Soft Margins for AdaBoost. August 1998. Royal Holloway College, Technical Report NC-TR-1998-021. Submitted to Machine Learning.\n[12] R. Schapire, Y. Freund, P. Bartlett, W. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. Mach. Learn., 148-156, 1998.\n[13] H. Schwenk and Y. Bengio. AdaBoosting neural networks: Application to on-line character recognition. In ICANN'97, LNCS, 1327, 967-972, 1997. Springer.\n[14] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.\n", "award": [], "sourceid": 1615, "authors": [{"given_name": "Gunnar", "family_name": "R\u00e4tsch", "institution": null}, {"given_name": "Takashi", "family_name": "Onoda", "institution": null}, {"given_name": "Klaus", "family_name": "M\u00fcller", "institution": null}]}