{"title": "v-Arc: Ensemble Learning in the Presence of Outliers", "book": "Advances in Neural Information Processing Systems", "page_first": 561, "page_last": 567, "abstract": null, "full_text": "v-Arc:  Ensemble Learning \nin the Presence of Outliers \n\nG.  Ratscht ,  B.  Scholkopf1,  A.  Smola\", \nK.-R.  Miillert, T.  Onodatt ,  and S.  Mikat \n\nt Microsoft  Research,  1 Guildhall Street,  Cambridge CB2  3NH,  UK \n\nt GMD  FIRST,  Rudower  Chaussee 5,12489 Berlin, Germany \n* Dep.  of Engineering,  ANU,  Canberra ACT 0200,  Australia \ntt CRIEPI, 2-11-1,  Iwado Kita,  Komae-shi,  Tokyo,  Japan \n\n{raetsch,  klaus,  mika}~first.gmd.de,bsc~microsoft.com, \n\nAlex.Smola~anu.edu.au,onoda~criepi.denken.or.jp \n\nAbstract \n\nAdaBoost and other ensemble methods have successfully  been ap(cid:173)\nplied  to a  number  of classification  tasks,  seemingly  defying  prob(cid:173)\nlems of overfitting.  AdaBoost performs gradient descent in an error \nfunction  with  respect  to the margin,  asymptotically concentrating \non  the  patterns which  are  hardest to learn.  For  very  noisy  prob(cid:173)\nlems,  however,  this  can  be  disadvantageous.  Indeed,  theoretical \nanalysis has shown that the margin distribution,  as opposed to just \nthe minimal margin, plays a crucial role in understanding this phe(cid:173)\nnomenon.  Loosely  speaking,  some  outliers  should  be  tolerated  if \nthis  has  the  benefit  of substantially  increasing  the  margin  on  the \nremaining points.  We  propose a  new  boosting algorithm which  al(cid:173)\nlows for  the possibility of a  pre-specified fraction of points to lie  in \nthe margin area Or even on the wrong side of the decision boundary. \n\n1 \n\nIntroduction \n\nBoosting and related Ensemble learning methods have been recently used with great \nsuccess in  applications such as  Optical Character Recognition  (e.g.  [8,  16]). \nThe idea of a  large minimum  margin  [17]  explains  the good  generalization perfor(cid:173)\nmance  of  AdaBoost  in  the  low  noise  regime.  However,  AdaBoost  performs  worse \non noisy tasks  [10,  11],  such  as  the iris  and the  breast  cancer  benchmark data sets \n[1].  On the  latter tasks,  a  large margin  on  all  training points  cannot be achieved \nwithout  adverse effects  on the generalization error.  This  experimental observation \nwas supported by the study of [13]  where the generalization error of ensemble meth(cid:173)\nods was  bounded by the sum of the fraction of training points which have a margin \nsmaller than some value  p,  say,  plus  a  complexity term depending on  the base  hy(cid:173)\npotheses  and  p.  While  this  bound  can  only  capture  part  of what  is  going  on  in \npractice,  it nevertheless already conveys  the message that in some cases  it pays to \nallow for  some  points  which  have a  small  margin,  or  are misclassified,  if this leads \nto a  larger overall margin on the remaining points. \nTo  cope  with  this  problem,  it  was  mandatory to  construct  regularized  variants  of \nAdaBoost, which traded off the number of margin errors and the size of the margin \n\n\f562 \n\nG.  Riitsch, B.  Sch6lkopf, A. J.  Smola,  K.-R.  Muller,  T.  Onoda and S.  Mika \n\n[9,  11].  This goal, however, had so far been achieved in a heuristic way by introduc(cid:173)\ning  regularization  parameters  which  have  no  immediate  interpretation and  which \ncannot be adjusted easily. \nThe present paper addresses this problem in two ways.  
Primarily, it makes an algorithmic contribution to the problem of constructing regularized boosting algorithms. However, compared to the previous efforts, it parameterizes the above trade-off in a much more intuitive way: its only free parameter directly determines the fraction of margin errors. This, in turn, is also appealing from a theoretical point of view, since it involves a parameter which controls a quantity that plays a crucial role in the generalization error bounds (cf. also [9, 13]). Furthermore, it allows the user to roughly specify this parameter once a reasonable estimate of the expected error (possibly from other studies) can be obtained, thus reducing the training time.

2 Boosting and the Linear Programming Solution

Before deriving a new algorithm, we briefly discuss the properties of the solution generated by standard AdaBoost and the closely related Arc-GV [2], and show the relation to a linear programming (LP) solution over the class of base hypotheses G.

Let {g_t(x) : t = 1, ..., T} be a sequence of hypotheses and α = [α_1 ... α_T] their weights satisfying α_t ≥ 0. The hypotheses g_t are elements of a hypothesis class G = {g : x ↦ [-1, 1]}, which is defined by a base learning algorithm. The ensemble generates the label which is the weighted majority of the votes by

    sign(f(x))   where   f(x) = Σ_{t=1}^{T} (α_t / ||α||_1) g_t(x).    (1)

In order to express that f, and therefore also the margin ρ, depend on α, and for ease of notation, we define

    ρ(z, α) := y f(x),   where z := (x, y) and f is defined as in (1).    (2)

Likewise we use the normalized minimum margin

    ρ(α) := min_{1 ≤ i ≤ m} ρ(z_i, α).    (3)

Ensemble learning methods have to find both the hypotheses g_t ∈ G used for the combination and their weights α. In the following we will consider only AdaBoost-type algorithms (including Arcing). For more details see e.g. [4, 2]. The main idea of AdaBoost is to introduce weights w_t(z_i) on the training patterns. They are used to control the importance of each single pattern in learning a new hypothesis (i.e. while repeatedly running the base algorithm). Training patterns that are difficult to learn (which are misclassified repeatedly) become more important. The minimization objective of AdaBoost can be expressed in terms of margins as

    G(α) = Σ_{i=1}^{m} exp[ -||α||_1 ρ(z_i, α) ].    (4)

In every iteration, AdaBoost tries to minimize this error by a stepwise maximization of the margin. It is widely believed that AdaBoost tries to maximize the smallest margin on the training set [2, 5, 3, 13, 11]. Strictly speaking, however, a general proof is missing. It would imply that AdaBoost asymptotically approximates (up to scaling) the solution of the following linear programming problem over the complete hypothesis set G (cf. [7], assuming a finite number of basis hypotheses):

    maximize    ρ
    subject to  ρ(z_i, α) ≥ ρ    for all 1 ≤ i ≤ m
                α_t, ρ ≥ 0       for all 1 ≤ t ≤ |G|
                ||α||_1 = 1.    (5)

Since such a linear program cannot in general be solved exactly for an infinite hypothesis set, it is interesting to analyze approximation algorithms for this kind of problem.
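To make the quantities above concrete, the following is a minimal numerical sketch (our own illustration, not part of the paper), assuming the base-hypothesis outputs on the training set are available as a matrix G with G[i, t] = g_t(x_i) ∈ [-1, 1] and labels y ∈ {-1, +1}; it evaluates the margins of (2) and the exponential objective (4):

```python
import numpy as np

def margins(G, y, alpha):
    """Margins rho(z_i, alpha) = y_i * f(x_i) of eq. (2), with f as in eq. (1)."""
    f = G @ (alpha / np.abs(alpha).sum())  # hypothesis weights normalized by ||alpha||_1
    return y * f

def adaboost_objective(G, y, alpha):
    """AdaBoost's exponential objective of eq. (4)."""
    return np.exp(-np.abs(alpha).sum() * margins(G, y, alpha)).sum()
```

Stepwise minimization of (4) in α drives up the scaled margins, which is the connection to the LP (5) discussed next.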
Breiman [2] proposed a modification of AdaBoost, Arc-GV, making it possible to show the asymptotic convergence of ρ(α^t) to the global solution ρ_lp of (5):

Theorem 1 (Breiman [2]). Choose α_t in each iteration as

    α_t := argmin_{α_t ∈ [0,1]} Σ_i exp[ -||α^t||_1 ( ρ(z_i, α^t) - ρ(α^{t-1}) ) ],    (6)

and assume that the base learner always finds the hypothesis g ∈ G which minimizes the weighted training error with respect to the weights. Then

    lim_{t→∞} ρ(α^t) = ρ_lp.

Note that the algorithm above can be derived from the modified error function

    G_gv(α^t) := Σ_i exp[ -||α^t||_1 ( ρ(z_i, α^t) - ρ(α^{t-1}) ) ].    (7)

The question one might ask now is whether to use AdaBoost or rather Arc-GV in practice. Does Arc-GV converge fast enough to benefit from its asymptotic properties? In [12] we conducted experiments to investigate this question. We empirically found that (a) AdaBoost has problems finding the optimal combination if ρ_lp < 0, (b) Arc-GV's convergence does not depend on ρ_lp, and (c) for ρ_lp > 0, AdaBoost usually converges to the maximum margin solution slightly faster than Arc-GV. Observation (a) becomes clear from (4): G(α) will not converge to 0, and ||α||_1 can be bounded by some value. Thus the asymptotic case cannot be reached, whereas for Arc-GV the optimum is always found.

Moreover, the number of iterations necessary to converge to a good solution seems to be reasonable, but for a near optimal solution the number of iterations is rather high. This implies that for real world hypothesis sets, the number of iterations needed to find an almost optimal solution can become prohibitive, but we conjecture that in practice a reasonably good approximation to the optimum is provided by both AdaBoost and Arc-GV.
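One iteration of the line search (6) can be sketched as follows. This is a hedged illustration under assumptions not spelled out above: we parameterize the new ensemble as a convex recombination f_t = (1 - a) f_{t-1} + a g_t, and pass the current value of ||α^t||_1 in as norm1; f_prev and g_new are the predictions of the previous ensemble and of the newly returned base hypothesis on the training set.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def arc_gv_step(f_prev, g_new, y, rho_prev, norm1):
    """Pick the weight a in [0, 1] of the new hypothesis by minimizing the
    Arc-GV loss of eqs. (6)/(7): exponential loss of the margins shifted by
    the previous minimum margin rho_prev."""
    def loss(a):
        rho = y * ((1.0 - a) * f_prev + a * g_new)  # margins of the recombined ensemble
        return np.exp(-norm1 * (rho - rho_prev)).sum()
    return minimize_scalar(loss, bounds=(0.0, 1.0), method="bounded").x
```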
3 ν-Algorithms

For the LP-AdaBoost approach [7] it has been shown that on noisy problems the generalization performance is usually not as good as that of AdaBoost [7, 2, 11]. From Theorem 5 in [13] (cf. Theorem 3 below) this fact becomes clear, as the minimum of the right hand side of the inequality (cf. (13)) need not necessarily be achieved with a maximum margin. We now propose an algorithm to directly control the number of margin errors, and therefore also the contribution of both terms in the inequality separately (cf. Theorem 3). We first consider a small hypothesis class and end up with a linear program, ν-LP-AdaBoost. In subsection 3.2 we then combine this algorithm with the ideas from section 2 and get a new algorithm, ν-Arc, which approximates the ν-LP solution.

3.1 ν-LP-AdaBoost

Let us consider the case where we are given a (finite) set G = {g : x ↦ [-1, 1]} of T hypotheses. To find the coefficients α for the combined hypothesis f(x), we extend the LP-AdaBoost algorithm [7, 11] by incorporating the parameter ν [15] and solve the following linear optimization problem:

    maximize    ρ - (1/(νm)) Σ_{i=1}^{m} ξ_i
    subject to  ρ(z_i, α) ≥ ρ - ξ_i    for all 1 ≤ i ≤ m
                ξ_i, α_t, ρ ≥ 0        for all 1 ≤ t ≤ T and 1 ≤ i ≤ m
                ||α||_1 = 1.    (8)

This algorithm does not force all margins to be beyond zero, and we get a soft margin classification (cf. SVMs) with a regularization constant 1/(νm). The following proposition shows that ν has an immediate interpretation:

Proposition 2 (Rätsch et al. [12]). Suppose we run the algorithm given in (8) on some data with the resulting optimal ρ > 0. Then

1. ν upper-bounds the fraction of margin errors.
2. 1 - ν upper-bounds the fraction of patterns with margin larger than ρ.

Since the slack variables ξ_i only enter the cost function linearly, their absolute size is not important. Loosely speaking, this is due to the fact that for the optimum of the primal objective function, only derivatives with respect to the primal variables matter, and the derivative of a linear function is constant. In the case of SVMs [14], where the hypotheses can be thought of as vectors in some feature space, this statement can be translated into a precise rule for distorting training patterns without changing the solution: we can move them locally orthogonal to a separating hyperplane. This yields a desirable robustness property. Note that the algorithm essentially depends on the number of outliers, not on the size of the error [15].
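For a finite hypothesis set, (8) translates directly into a standard-form LP. The sketch below (our own illustration, assuming SciPy is available) stacks the variables as [α, ξ, ρ] and solves it with scipy.optimize.linprog:

```python
import numpy as np
from scipy.optimize import linprog

def nu_lp_adaboost(G, y, nu):
    """Solve the nu-LP of eq. (8): G[i, t] = g_t(x_i) in [-1, 1],
    labels y in {-1, +1}. Returns (alpha, xi, rho)."""
    m, T = G.shape
    # minimize -rho + (1/(nu*m)) * sum_i xi_i  (i.e. maximize the objective of (8))
    c = np.concatenate([np.zeros(T), np.ones(m) / (nu * m), [-1.0]])
    # margin constraints: rho - xi_i - y_i * (G alpha)_i <= 0
    A_ub = np.hstack([-(y[:, None] * G), -np.eye(m), np.ones((m, 1))])
    b_ub = np.zeros(m)
    # ||alpha||_1 = 1 (alpha >= 0, so the 1-norm is the plain sum)
    A_eq = np.concatenate([np.ones(T), np.zeros(m), [0.0]])[None, :]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (T + m + 1))
    assert res.success
    return res.x[:T], res.x[T:T + m], res.x[-1]
```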
3.2 The ν-Arc Algorithm

Suppose we have a very large (but finite) base hypothesis class G. Then it is difficult to solve (8), as (5), directly. To this end, we propose a new algorithm, ν-Arc, that approximates the solution of (8). The optimal ρ for fixed margins ρ(z_i, α) in (8) can be written as

    ρ_ν(α) := argmax_{ρ ∈ [0,1]} ( ρ - (1/(νm)) Σ_{i=1}^{m} (ρ - ρ(z_i, α))_+ ),    (9)

where (ξ)_+ := max(ξ, 0). Setting ξ_i := (ρ_ν(α) - ρ(z_i, α))_+ and subtracting (1/(νm)) Σ_{i=1}^{m} ξ_i on both sides of the resulting inequality yields (for all 1 ≤ i ≤ m)

    ρ(z_i, α) + ξ_i - (1/(νm)) Σ_{j=1}^{m} ξ_j  ≥  ρ_ν(α) - (1/(νm)) Σ_{j=1}^{m} ξ_j.    (10)

Two more substitutions are needed to transform the problem into one which can be solved by the AdaBoost algorithm. In particular, we have to get rid of the slack variables ξ_i again by absorbing them into quantities similar to ρ(z_i, α) and ρ(α). This works as follows: on the right hand side of (10) we have the objective function (cf. (8)), and on the left hand side a term that depends nonlinearly on α. Defining

    ρ̃_ν(α) := ρ_ν(α) - (1/(νm)) Σ_{i=1}^{m} ξ_i   and   ρ̃_ν(z_i, α) := ρ(z_i, α) + ξ_i - (1/(νm)) Σ_{j=1}^{m} ξ_j,    (11)

which we substitute for ρ(α) and ρ(z, α) in (5), respectively, we obtain a new optimization problem. Note that ρ̃_ν(α) and ρ̃_ν(z_i, α) play the role of a corrected or virtual margin. We obtain a nonlinear min-max problem,

    maximize    ρ̃_ν(α)
    subject to  ρ̃_ν(z_i, α) ≥ ρ̃_ν(α)    for all 1 ≤ i ≤ m
                α_t ≥ 0                   for all 1 ≤ t ≤ T
                ||α||_1 = 1,    (12)

which Arc-GV can solve approximately (cf. section 2). Hence, by replacing the margin ρ(z, α) by ρ̃_ν(z, α) in equation (4) and in the other formulas for Arc-GV (cf. [2]), we obtain a new algorithm which we refer to as ν-Arc.
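The two building blocks of ν-Arc, ρ_ν(α) from (9) and the virtual margins from (11), are cheap to compute once the plain margins are known. Below is a minimal sketch (our own illustration); it exploits the fact that the objective in (9) is piecewise linear and concave in ρ, so its maximum is attained at one of the margin values or at the boundary of [0, 1]:

```python
import numpy as np

def rho_nu(margins, nu):
    """rho_nu(alpha) of eq. (9), for a vector of margins rho(z_i, alpha)."""
    m = len(margins)
    def objective(rho):
        return rho - np.maximum(rho - margins, 0.0).sum() / (nu * m)
    candidates = np.concatenate([[0.0, 1.0], np.clip(margins, 0.0, 1.0)])
    return max(candidates, key=objective)

def virtual_margins(margins, nu):
    """Virtual margins tilde(rho)_nu(z_i, alpha) of eq. (11), which nu-Arc
    plugs into the Arc-GV formulas in place of rho(z_i, alpha)."""
    m = len(margins)
    xi = np.maximum(rho_nu(margins, nu) - margins, 0.0)  # slacks of eqs. (9)/(10)
    return margins + xi - xi.sum() / (nu * m)
```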
We can now state interesting properties of ν-Arc by using Theorem 5 of [13], which bounds the generalization error R(f) for ensemble methods. In our case R_ρ(f) ≤ ν by construction (i.e. the fraction of patterns with a margin smaller than ρ, cf. Proposition 2), and thus we get the following simple reformulation of this bound:

Theorem 3. Let p(x, y) be a distribution over X × [-1, 1], and let X be a sample of m examples chosen iid according to p. Suppose the base hypothesis space G has VC dimension h, and let δ > 0. Then with probability at least 1 - δ over the random choice of the training set X, Y, every function f generated by ν-Arc, where ν ∈ (0, 1) and ρ_ν > 0, satisfies the following bound:

    R(f) ≤ ν + sqrt( (c/m) ( h log²(m/h) / ρ_ν² + log(1/δ) ) ).    (13)

So, for minimizing the right hand side, we can trade off the first and the second term by controlling the easily interpretable regularization parameter ν.

4 Experiments

We show a set of toy experiments to illustrate the general behavior of ν-Arc. As base hypothesis class G we use the RBF networks of [11], and as data a two-class problem generated from several 2D Gauss blobs (cf. the banana shape dataset from http://www.first.gmd.de/~raetsch/data/banana.html). We obtain the following results:

• ν-Arc leads to approximately νm patterns that are effectively used in the training of the base learner: Figure 1 (left) shows the fraction of patterns that have high average weights during the learning process (i.e. an average weight over the run larger than 1/(2m)). We find that the number of the latter increases (almost) linearly with ν. This follows from (11), as the (soft) margin of patterns with ρ(z, α) < ρ_ν is set to ρ_ν, and the weights of those patterns will be the same.

• The (estimated) test error, averaged over 10 training sets, exhibits a rather flat minimum in ν (Figure 1, right). This indicates that, just as for ν-SVMs, where corresponding results have been obtained, ν is a well-behaved parameter in the sense that a slight misadjustment of it is not harmful.

• ν-Arc leads to the fraction ν of margin errors (cf. the dashed line in Figure 1), exactly as predicted in Proposition 2.

• Finally, a good value of ν can already be inferred from prior knowledge of the expected error. Setting it to a value similar to the latter provides a good starting point for further optimization (cf. Theorem 3).

Note that for ν = 1 we recover the Bagging algorithm (if we used bootstrap samples), as the weights of all patterns will be the same (w_t(z_i) = 1/m for all i = 1, ..., m) and the hypothesis weights will also be constant (α_t ≈ 1/T for all t = 1, ..., T).

[Figure 1 (plots omitted): Toy experiment (σ = 0). Left: average fraction of important patterns, average fraction of margin errors, and average training error for different values of the regularization constant ν for ν-Arc. Right: the corresponding generalization error. In both cases, the parameter ν allows us to reduce the test error far below that of the hard margin algorithm (for ν = 0 we recover Arc-GV/AdaBoost, and for ν = 1 we get Bagging).]

Finally, we present a small comparison on ten benchmark data sets obtained from the UCI benchmark repository [1] (cf. http://ida.first.gmd.de/~raetsch/data/benchmarks.html). We analyze the performance of single RBF networks, AdaBoost, ν-Arc and RBF-SVMs. For AdaBoost and ν-Arc we use RBF networks [11] as base hypotheses. The model parameters of the RBF networks (number of centers etc.), of ν-Arc (ν) and of the SVMs (σ, C) are optimized using 5-fold cross-validation. More details on the experimental setup can be found in [11]. Table 1 shows the generalization error estimates (after averaging over 100 realizations of the data sets) and the confidence intervals. The results of the best classifier and of the classifiers that are not significantly worse are set in bold face; to test the significance, we used a t-test (p = 80%).

              RBF            AB             ν-Arc          SVM
  Banana      10.8 ± 0.06    12.3 ± 0.07    10.6 ± 0.05    11.5 ± 0.07
  B.Cancer    27.6 ± 0.47    30.4 ± 0.47    25.8 ± 0.46    26.0 ± 0.47
  Diabetes    24.3 ± 0.19    26.5 ± 0.23    23.7 ± 0.20    23.5 ± 0.17
  German      24.7 ± 0.24    27.5 ± 0.25    24.4 ± 0.22    23.6 ± 0.21
  Heart       17.6 ± 0.33    20.3 ± 0.34    16.5 ± 0.36    16.0 ± 0.33
  Ringnorm     1.7 ± 0.02     1.9 ± 0.03     1.7 ± 0.02     1.7 ± 0.01
  F.Sonar     34.4 ± 0.20    35.7 ± 0.18    34.4 ± 0.19    32.4 ± 0.18
  Thyroid      4.5 ± 0.21     4.4 ± 0.22     4.4 ± 0.22     4.8 ± 0.22
  Titanic     23.3 ± 0.13    22.6 ± 0.12    23.0 ± 0.14    22.4 ± 0.10
  Waveform    10.7 ± 0.11    10.8 ± 0.06    10.0 ± 0.07     9.9 ± 0.04

Table 1: Generalization error estimates and confidence intervals. The best classifier for a particular data set, and those not significantly worse, are set in bold face in the original (see text).

On eight out of the ten data sets, ν-Arc performs significantly better than AdaBoost. This clearly shows the superior performance of ν-Arc on noisy data sets and supports the soft margin approach for AdaBoost. Furthermore, we find comparable performances for ν-Arc and SVMs: in three cases the SVM performs better, and in two cases ν-Arc performs best. Summarizing, AdaBoost is useful for low noise cases, where the classes are separable; ν-Arc extends the applicability of boosting to problems that are difficult to separate and should be applied if the data are noisy.
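The model selection just described can be sketched as plain k-fold cross-validation over a grid of ν values. In the sketch below (our own illustration), train_fn is a placeholder for fitting ν-Arc (or any of the other classifiers) and the grid and fold count follow the setup above only loosely:

```python
import numpy as np

def select_nu(X, y, train_fn, nu_grid=(0.05, 0.1, 0.2, 0.3, 0.4, 0.5), k=5, seed=0):
    """Pick nu by k-fold cross-validation; train_fn(X_tr, y_tr, nu) is assumed
    to return a callable predictor h with h(X) in {-1, +1}."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    cv_error = []
    for nu in nu_grid:
        errs = []
        for f in range(k):
            train = np.concatenate([folds[j] for j in range(k) if j != f])
            h = train_fn(X[train], y[train], nu)
            errs.append(np.mean(h(X[folds[f]]) != y[folds[f]]))
        cv_error.append(np.mean(errs))  # average validation error for this nu
    return nu_grid[int(np.argmin(cv_error))]
```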
\n\nRBF \n\n10.8 \u00b1  0.06 \nBanana \n27.6  \u00b1  0.47 \nB.Cancer \n24.3  \u00b1  0.19 \nDiabetes \n24.7  \u00b1  0.24 \nGerman \n17.6 \u00b1  0.33 \nHeart \n1.7 \u00b1  0.02 \nRingnorm \n34.4  \u00b1 0.20 \nF .Sonar \n4.5 \u00b1  0.21 \nThyroid \n23.3  \u00b1 0.13 \nTitanic \nWaveform  10.7 \u00b1  0.11 \n\nAB \n\n12.3 \u00b1  0.07 \n30.4  \u00b1  0.47 \n26.5  \u00b1  0.23 \n27.5  \u00b1  0.25 \n20.3  \u00b1  0.34 \n1.9  \u00b1  0.03 \n35.7 \u00b1 0.18 \n4.4 \u00b1  0.22 \n22.6 \u00b1  0.12 \n10.8 \u00b1  0.06 \n\nv-Arc \n\n10.6 \u00b1  0.05 \n25.8 \u00b1  0.46 \n23.7 \u00b1  0.20 \n24.4  \u00b1  0.22 \n16.5 \u00b1  0.36 \n1.7 \u00b1  0.02 \n34.4 \u00b1  0.19 \n4.4 \u00b1  0.22 \n23.0  \u00b1  0.14 \n10.0 \u00b1  0.07 \n\nSVM \n\n11.5  \u00b1  0.07 \n26.0 \u00b1  0.47 \n23.5 \u00b1  0.17 \n23.6 \u00b1  0.21 \n16.0 \u00b1  0.33 \n1.7 \u00b1  0.01 \n32.4 \u00b1  0.18 \n4.8 \u00b1  0.22 \n22.4 \u00b1  0.10 \n9.9 \u00b1  0.04 \n\nTable 1:  Generalization  error estimates and confidence intervals.  The best classifiers for  a \nparticular data set are  marked in bold face  (see  text). \n\n\fv-Arc:  Ensemble Learning in the Presence of Outliers \n\n567 \n\nWe  found  empirically  that  the  generalization  performance  in  v-Arc  depends  only \nslightly  on  the  choice  of the  regularization  constant.  This  makes  model  selection \n(e.g.  via cross-validation)  easier and faster. \nFuture work will study the detailed regularization properties of the regularized ver(cid:173)\nsions of AdaBoost,  in particular in comparison to v-LP  Support Vector  Machines . \nAcknowledgments:  Partial  funding  from  DFG  grant  (Ja  379/52)  is  gratefully \nacknowledged.  This work was done while  AS  and BS were at GMD  FIRST. \nReferences \n\n[1]  C.  Blake,  E.  Keogh,  and  C. J.  Merz.  UCI  repository  of machine learning  databases, \n\n1998.  http://www.ics. uci.edu/ \",mlearn/MLRepository.html. \n\n[2]  L.  Breiman.  Prediction games and arcing algorithms.  Technical Report 504, Statistics \n\nDepartment, University of California,  December  1997. \n\n[3]  M.  Frean and T. Downs.  A simple cost function for  boosting.  Technical report, Dept. \n\nof Computer Science  and Electrical  Eng.,  University of Queensland, 1998. \n\n[4]  Y. Freund and R.  E.  Schapire.  A decision-theoretic  generalization  of on-line learning \nand  an  application  to  boosting.  In  Computational  Learning  Theory:  Eurocolt  '95, \npages 23-37.  Springer-Verlag,  1995. \n\n[5]  Y.  Freund and R.  E. Schapire.  A decision-theoretic generalization  of on-line  learning \n\nand an  application to boosting.  J.  of Compo fj Syst.  Sc. , 55(1):119- 139,  1997. \n\n[6]  J . Friedman,  T.  Hastie,  and R.  Tibshirani.  Additive logistic  regression:  a  statistical \n\nview  of boosting.  Technical  report,  Stanford University,  1998. \n\n[7)  A.  Grove  and  D.  Schuurmans.  Boosting  in  the  limit:  Maximizing  the  margin  of \n\nlearned ensembles.  In  Proc.  of the  15th  Nat.  Conf.  on  AI,  pages 692- 699 ,  1998. \n\n[8]  Y.  LeCun,  L.  D .  Jackel,  L.  Bottou,  C.  Cortes,  J .  S.  Denker,  H. Drucker,  I.  Guyon, \nU.  A.  Muller,  E.  Sackinger,  P.  Simard,  and  V.  Vapnik.  Learning  algorithms  for \nclassification:  A comparison on handwritten digit recognition.  Neural Networks,  pages \n261-276,  1995. \n\n[9)  L.  Mason,  P.  L.  Bartlett,  and  J.  Baxter.  Improved  generalization  through  explicit \n\noptimization of margins.  Machine  Learning,  1999.  to appear. \n\n(10)  J.  R.  
[10] J. R. Quinlan. Boosting first-order learning (invited lecture). Lecture Notes in Computer Science, 1160:143, 1996.

[11] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Technical Report NC-TR-1998-021, Department of Computer Science, Royal Holloway, University of London, Egham, UK, 1998. To appear in Machine Learning.

[12] G. Rätsch, B. Schölkopf, A. Smola, S. Mika, T. Onoda, and K.-R. Müller. Robust ensemble learning. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 207-219. MIT Press, Cambridge, MA, 1999.

[13] R. Schapire, Y. Freund, P. L. Bartlett, and W. Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 1998. (Earlier appeared in: D. H. Fisher, Jr. (ed.), Proc. ICML'97, M. Kaufmann.)

[14] B. Schölkopf, C. J. C. Burges, and A. J. Smola. Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA, 1999.

[15] B. Schölkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12:1083-1121, 2000.

[16] H. Schwenk and Y. Bengio. Training methods for adaptive boosting of neural networks. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems, volume 10. MIT Press, 1998.

[17] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.