{"title": "Some New Bounds on the Generalization Error of Combined Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 245, "page_last": 251, "abstract": null, "full_text": "Some new bounds on the generalization error of \n\ncombined classifiers \n\nVladimir Koltchinskii \n\nDepartment of Mathematics and Statistics \n\nDmitriy Panchenko \n\nDepartment of Mathematics and Statistics \n\nUniversity of New Mexico \n\nAlbuquerque, NM 87131-1141 \n\nvlad@math.unm.edu \n\nUniversity of New Mexico \n\nAlbuquerque, NM 87131-1141 \n\npanchenk@math.unm.edu \n\nDepartment of Electrical and Computer Engineering \n\nFernando Lozano \n\nUniversity of New Mexico \nAlbuquerque, NM 87131 \nflozano@eece.unm.edu \n\nAbstract \n\nIn this paper we develop the method of bounding the generalization error \nof a classifier in terms of its margin distribution which was introduced in \nthe recent papers of Bartlett and Schapire, Freund, Bartlett and Lee.  The \ntheory of Gaussian and empirical processes allow us  to prove the margin \ntype inequalities for the most general functional classes, the complexity \nof the class being measured via the so called Gaussian complexity func(cid:173)\ntions.  As  a  simple  application  of our results,  we  obtain  the  bounds  of \nSchapire, Freund, Bartlett and Lee for the generalization error of boost(cid:173)\ning.  We  also  substantially improve the results  of Bartlett on bounding \nthe generalization error of neural networks in terms  of h -norms of the \nweights  of neurons.  Furthermore, under additional assumptions  on  the \ncomplexity of the class  of hypotheses we provide some tighter bounds, \nwhich in  the  case  of boosting improve the results  of Schapire,  Freund, \nBartlett and Lee. \n\n1 \n\nIntroduction and margin type inequalities for general functional \nclasses \n\nLet (X, Y) be a random couple, where X  is  an  instance in a space Sand Y  E  {-I, I} is \na label.  Let 9 be a set of functions from  S into JR.  For 9  E g, sign(g(X)) will  be used as \na predictor (a classifier) of the unknown label Y.  If the distribution of (X, Y) is unknown, \nthen  the choice of the predictor is  based on  the  training data  (Xl, Yl ), ... , (Xn, Yn) that \nconsists ofn i.i.d.  copies of (X, Y). The goal ofleaming is to find a predictor 9 E 9 (based \non  the  training data)  whose generalization (classification) error JP'{Yg(X)  :::;  O}  is  small \nenough.  We  will  first  introduce some probabilistic bounds for  general  functional classes \nand then give several examples of their applications to bounding the generalization error of \nboosting and neural networks.  We omit all the proofs and refer an interested reader to [5]. \n\n\fLet  (8, A, P)  be  a  probability  space  and  let F  be  a  class  of measurable functions  from \n(8, A)  into  lR.  Let  {Xd  be  a  sequence  of  i.i.d. \nrandom  variables  taking  values  in \n(8, A) with common distribution P.  Let Pn  be the empirical measure based on the sample \n(Xl,'\"  ,Xn), Pn  := n- l  E~=l c5x \"  where c5x denotes the probability distribution con(cid:173)\ncentrated at the point x.  We  will denote P! := Is !dP, Pn!  := Is !dPn, etc.  In  what \nfollows, \u00a3OO(F)  denotes the Banach  space of uniformly bounded real  valued functions on \nF  with the norm IIYII.:F  := sUPfE.:F 1Y(f)I,  Y  E \u00a3OO(F).  Define \nn \n\nn \n\nwhere {gi} is a sequence of i.i.d.  standard normal random variables, independent of {Xi}' \nWe will call n  t-+  Gn(F) the Gaussian complexity function of the class F. One can find in \nthe literature (see, e.g.  [11]) various upper bounds on such quantities as Gn (F) in terms of \nentropies, VC-dimensions, etc. \n\nWe give below a bound in terms of margin cost functions (compare to [6, 7]) and Gaussian \ncomplexities. \nLet <P  = {CPk  : IR -+  1R}~1 be a class of Lipschitz functions such that (1 + sgn( -x) )/2 :-::; \nCPk(X)  for all x  E IR and all k.  For each cP  E <P,  L(cp)  will denote it's Lipschitz constant. \n\nTheorem 1  For all t  > 0, \n\nlP'{ =3/  E F: P{! :-::;  O}  > \n\n+ \n\nLet us consider a special family of cost functions.  Assume that cP  is a fixed non increasing \nLipschitz function from IR into IR such that cp(x)  2:  (1 + sgn( -x)) /2 for all x E lR. One can \neasily observe that L( cpU 15))  :-::;  L( cP )15- 1 .  Applying Theorem 1 to the class of Lipschitz \nfunctions <P  := {cpU 15k)  : k 2:  O},  where 15k  := 2- k, we get the following result. \n\nTheorem 2  For all t > 0, \n\nlP'{3!  E F: P{! :-::;  O}  > \n\ninf  [Pncp(L)  +  2y'2irL(cp) Gn(F) \naE[O,l] \n\n+  cogIOg~(2c5-l)r/2] + t:n2 }:-::;  2exp{-2t2}. \n\n15 \n\n15 \n\nIn [5]  an example was given which shows that, in general, the order of the factor 15- 1  in the \nsecond term of the bound can not be improved. \nGiven  a  metric  space  (T, d),  we  denote  Hd(Tj c:)  the  c:-entropy  of T  with respect  to  d, \ni.e.  Hd(Tj c:)  := log Nd(Tj c:),  where Nd(Tj c:)  is  the  minimal number of balls of radius \nc:  covering T.  The next theorem improves the  previous results  under some additional as(cid:173)\nsumptions on the growth of random entropies Hdp \n\n2 (Fj .). Define for \"(  E  (0,1] \n\nn, \n\nand \n\n8n (-yjf):= sup { 15  E (0,1): c5\"fPn {/:-::;  c5}:-::;  n-1+!}. \n\nWe call c5n (\"(j f) and 8n (\"(j f), respectively, the ,,(-margin and the empirical ,,(-margin of f. \n\n\fTheorem 3  Suppose that for some a  E  (0,2) and for some constant D  > 0 \n\nHdpn ,2  (Fj u)  ~ Du- a ,  u > 0 a.s. \n\n(1) \n\nThen for any \"(  ~ 2~a ,for some constants A, B  > 0 andfor all large enough n \n\nJIItv'f  E  F: A- 18n(\"(jJ)  ~ 8nbjf) ~ A8nbjJ)} \n~ 1 - B(log210g2 n) exp { -n t /2}. \n\nThis implies that with high probability for all f  E F \n\nP{f ~ O}  ~ c(nl -'Y/28n bj f)'Y)-I. \n\nThe  bound  of Theorem 2  corresponds  to  the  case  of \"(  =  1.  It is  easy  to  see  from  the \ndefinitions  of \"(-margins that the  quantity  (n l -'Y/28n bj f)'Y)-1  increases  in  \"(  E  (0,1]. \nThis  shows  that  the  bound  in  the  case  of \"(  <  1  is  tighter.  Further discussion  of this \ntype of bounds and their experimental study in the case of convex combinations of simple \nclassifiers is given in the next section. \n\n2  Bounding the generalization error of convex combinations of \n\nclassifiers \n\nRecently, several authors  ([1, 8]) suggested a new class of upper bounds on generalization \nerror that are expressed in terms of the empirical distribution of the margin of the predictor \n(the classifier),  The margin is  defined as  the product Y g(X). The bounds in question are \nespecially useful in the case of the classifiers that are the combinations of simpler classifiers \n(that belong, say, to a class 1-\u00a3).  One of the examples of such classifiers is provided by the \nclassifiers obtained by boosting [3, 4], bagging [2]  and other voting methods of combining \nthe classifiers. We will now demonstrate how our general results can be applied to the case \nof convex combinations of simple base classifiers. \nWe assume that S := 8x {-1, 1} andF:= {]: f  E  F}, where j(x,y) := yf(x). Pwill \ndenote the distribution of (X, Y), Pn  the empirical distribution based on  the observations \n((Xl, YI ), ... , (Xn, Yn)) . It is  easy  to  see  that  Gn(F)  =  Gn(F).  One  can  easily  see \nthat  if F  :=  conv(1-\u00a3),  where 1-\u00a3  is  a  class  of base  classifiers,  then  Gn(F)  =  Gn(1-\u00a3). \nThese easy  observations allow us  to  obtain useful bounds for boosting and  other methods \nof combining the classifiers.  For instance, we get in  this  case  the following  theorem that \nimplies the bound of Schapire, Freund, Bartlett and Lee [8] when 1\u00a3 is  a VC-class of sets. \n\nTheorem 4  Let F  := conv(1\u00a3),  where 1-\u00a3  is  a class of measurable functions from  (8, A) \ninto R  For all t > 0, \nlP'{ 3f E  F  : P{yf(x) ~ O} \n\nIn particular, if 1-\u00a3  is a VC--class of classifiers h  : 8  H  {-1, 1} (which means that the class \nof sets {{x: h(x)  =  +1}  : h E 1-\u00a3}  is a Vapnik-Chervonenkis class) with VC--dimension \nV(1-\u00a3),  we have with some constant C  > 0, Gn(1-\u00a3)  ~ C(V(1-\u00a3)/n)I/2.  This implies that \nwith probability at least 1 - a \n\nP{yf(x) ~ O}  ~  inf  [Pn{yf(x) ~ 8}  + ~ JV(1-\u00a3) + \n\nOE(O ,I] \n\nu \n\nn \n\n\f( log log2 (28-1)) 1/ 2]  V! log ~ + 2 \n+ \n\n+ \n\n, \n\nr,;; \nyn \n\nn \n\nwhich  slightly improves the  bound obtained previously by  Schapire, Freund, Bartlett and \nLee [8]. \n\nTheorem 3  provides  some  improvement of the  above  bounds  on  generalization error of \nconvex combinations  of base  classifiers.  To  be  specific,  consider the  case  when  H  is  a \nVC-class of classifiers.  Let V  := V(H) be its VC-dimension.  A well known bound (going \nback to Dudley) on the entropy of the convex hull (see [11], p.  142) implies that \n\nHdpn,2(conv(H);u)::; \n\nsup  HdQ,2(conv(H);u)::;  Du- - v -\n\n. \n\n2 ( V - l) \n\nQEP(S) \n\nIt immediately follows  from Theorem 3 that for all  'Y  2:  2J~::::~)  and for  some constants \nC,B \n\nIF'{3f E conv(H): p{f::; a} > \n\nwhere \n\n? \n\n}::; BIog210g2nexP{--21nt}, \n\nn 1- 'Y/28n (,,(;  f) 'Y \n\n8n(\"(;  f)  := sup{ 8 E  (0,1) : 8'Y Pn {(x, y)  : yf(x) ::;  8}  ::;  n-1+t }. \n\nThis  shows  that  in  the  case  when  the  VC-dimension  of the  base  is  relatively  small  the \ngeneralization error of boosting and some other convex combinations of simple classifiers \nobtained by various versions of voting methods becomes better than it was suggested by the \nbounds of Schapire, Freund, Bartlett and Lee.  One can also conjecture that the remarkable \ngeneralization ability of these methods observed in  numerous experiments can be related \nto  the  fact  that the  combined classifier belongs  to  a  subset of the  convex hull  for  which \nthe random entropy Hdp  2  is  much smaller than for the whole convex hull (see [9,  10] for \nimproved margin type bounds in a much more special setting). \n\nTo  demonstrate the improvement provided by  our bounds over previous results, we  show \nsome experimental evidence obtained for a simple artificially generated problem, for which \nwe are able to compute exactly the generalization error as  well as the 'Y-margins. \n\nWe consider the problem of learning a classifier consisting of the indicator function of the \nunion of a finite  number of intervals in the input space S  =  [0,1] . We used the Adaboost \nalgorithm [4]  to  find  a combined classifier using as  base class 11.  = {[a, b]  : b E  [0, In u \n{[b,l] : b E [0, In (i.e.  decision stumps).  Notice that in this case V  = 2, and according to \nthe theory values of gamma in (2/3, 1) should result in tighter bounds on the generalization \nerror. \n\nFor our experiments we used a target function with 10 equally spaced intervals, and a sam(cid:173)\nple size of 1000, generated according to the uniform distribution in [0, 1]. We ran Adaboost \nfor 500 rounds, and computed at each round the generalization error of the combined clas(cid:173)\nsifier and the bound C(n1- 'Y/28n(\"(;  f) 'Y )-1  for different values of 'Y.  We  set the constant \nC to one. \nIn  figure  1 we plot the  generalization error and  the  bounds for 'Y  = 1, 0.8 and  2/3.  As \nexpected,  for'Y  = 1 (which corresponds roughly to  the  bounds in  [8]) the  bound is  very \nloose,  and  as  'Y  decreases,  the  bound gets  closer to  the  generalization error.  In figure  2 \nwe  show  that by  reducing further  the  value  of 'Y  we get a curve even closer to  the actual \ngeneralization error (although for 'Y  = 0.2 we do not get an upper bound).  This seems  to \nsupport the conjecture that Adaboost generates combined classifiers that belong to a subset \nof of the  convex  hull  of 11.  with  a  smaller random entropy.  In  figure  3  we  plot the  ratio \n8-;\"(\"(;  f)/8n(\"(;  f) for'Y = 0.4,2/3 and 0.8 against the boosting iteration.  We can see that \nthe ratio is  close to  one in  all  the examples indicating that the  value of the constant A  in \ntheorem 3 is close to one in this case. \n\n\f\u00b7 - - - - - - - - -\n\nboosbnground \n\nFigure 1:  Comparison of the generalization error (thicker line) with  (nl-'Y/2 8n b; f)'Y)-l \nfor'Y = 1,0.8 and 2/3 (thinner lines, top to bottom). \n\nFigure 2:  Comparison of the generalization error (thicker line) with  (nl-'Y/2 8n b; f)'Y)-l \nfor'Y = 0.5,0.4 and 0.2 (thinner lines, top to bottom). \n\nboostlrQround \n\n_ \n\n_ \n\n_ \n\n_ \n\n_ \n\n_ \n\n_ \n\n_ \n\n_ \n\no \n\n\u2022 \n\n:I f  \u2022 \u2022 \u2022 \u2022 \u2022 \u2022 I \n:1 ! 11 11 11  I. \" \n'::'ffl  \u2022 I \n~I  I i III  II  III  Jam,  \".~., I \n\n_ \n\n_ \n\no \n\n\u2022 \n\no \n\n\u2022 \n\n_ \n\n_ \n\n_ \n\n_ \n\n_ \n\n_ \n\n_ \n\n_ \n\n_ \n\n_ \n\n_ \n\n_ \n\n_ \n\n_ \n\nFigure  3:  Ratio  8:b;f)/8n b;f) versus  boosting  round  for'Y  =  0.4,2/3,0.8 (top  to \nbottom) \n\n\f3  Bounding the generalization error in neural network learning \n\nWe turn now to the applications of the bounds of previous section in  neural network learn(cid:173)\ning.  Let 1i  be  a  class  of measurable functions  from  (8, A)  into  R  Given  a  sigmoid  U \nfrom  lR  into  [-l,l]andavectorw  :=  (Wl, ... ,Wn)  E  lRn,  let  Nu,w(Ul, . .. ,Un)  := \nu(~~=l WjUj).  We  call  the  function  Nu,w  a neuron with  weights  wand sigmoid u.  For \nwE lRn,  [[w[[t l  := ~~=l [Wit.  Let Uj  : j  ~ 1 be functions from lR into [-1,1], satisfying \nthe Lipschitz conditions: \n\n[Uj(u)  - Uj(v)[  :\"S  Lj[u - vi,  u,v E R \n\nLet {Aj} be a  sequence of positive numbers.  We  define recursively classes of neural net(cid:173)\nworks with restrictions on the weights of neurons (j below is the number of layers): \n\n1lo  =1i,  1lj(Al , ... ,Aj ):= \n\n:= {Nuj,w(hl , ... , hn) : n  ~ 0, hi  E 1ij-l (Al'\"'' Aj-d, wE lRn,  [[w[[t l \n\nU 1ij-l (A l , .. . , A j- l ). \n\n:\"S  Aj} U \n\nTheorem 5  For all t > 0 and for alll ~ 1 \n. \nlOf  [Pn(fJh-)  + ~ II (2Lj Aj + l)Gn(1l)+ \n<lE(O,l] \n\nlP'{ =V  E 1l1(Al , ... ,AI) : P{J :\"S  O}  > \n\nU  k=l \n\n_ \n\nj \n\nU \n\n1 \n\nI \n\n( loglog2(28-l ))l/2] \n\n+ \n\nn \n\n2  {2 2} \n+-- <  exp-t \n\nt+2} \n..;n  -\n\nRemark.  Bartlett [1]  obtained a similar bound for a more special class 1l and with larger \nconstants.  In  the case  when  Aj  ==  A, Lj  ==  L  (the case considered by  Bartlett) the ex-\npression in the right hand side of his bound includes  (AL)I;;+l )/2,  which is replaced in  our \n\nbound  by  (Af)l.  These improvement can  be  substantial in  applications,  since  the  above \nquantities play the role of complexity penalties. \n\nFinally, it is worth mentioning that the theorems of Section 1 can be applied also to bound(cid:173)\ning  the  generalization error in multi-class  problems.  Namely,  we  assume  that the  labels \ntake  values  in  a  finite  set Y  with  card(Y)  =:  L.  Consider a  class  j: of functions  from \nS := 8  x Y  into lR. A function f  E  j: predicts a label y  E Y  for an example x  E 8  iff \n\nf(x,y) > maxf(x,y'). \n\ny'#y \n\nThe margin of an example (x, y) is defined as \n\nmj(x,y) := f(x,y) - maxf(x,y'), \n\ny'#y \n\nso  f  misclassifies the example (x, y)  iff mj(x, y)  :\"S  O.  Let \n\nF:= {J(.,y): y E Y,f E  j:}. \n\nThe next result follows from Theorem 2. \n\nTheorem 6  For all t > 0, \n\nlP'{3f  E  j:: P{mj :\"S  O}  > \n\ninf  [Pn{mj:\"S  8} + 4y'27rL~2L - 1) Gn(F)+ \n<lE(O,l] \n\n( 10glog2(28-l ))l/2] \n\n+ \n\nn \n\n{22} \n+-- <  exp-t. \n\n2 \n\nt+2} \n..;n  -\n\n\fReferences \n\n[1]  Bartlett, P.  (1998) The Sample Complexity of Pattern Classification with Neural Net(cid:173)\n\nworks:  The Size of the Weights is More Important than the Size of the Network. IEEE \nTransactions on Information Theory, 44, 525-536. \n\n[2]  Breiman, L.  (1996). Bagging Predictors. Machine Learning, 26(2), 123-140. \n[3]  Freund y.  (1995) Boosting a weak learning algorithm by majority.  Information and \n\nComputation, 121 ,2,256-285. \n\n[4]  Freund Y.  and  Schapire, R.E.  (1997)  A decision-theoretic generalization of on-line \nlearning and an  application to  boosting.  Journal of Computer and System Sciences, \n55(1),119-139. \n\n[5]  Koltchinskii, V.  and Panchenko, D.  (2000) Empirical margin distributions and bound(cid:173)\n\ning the generalization error of combined classifiers, preprint. \n\n[6]  Mason, L., Bartlett, P. and Baxter, J. (1999) Improved Generalization through Explicit \n\nOptimization of Margins. Machine Learning, 0, 1-11. \n\n[7]  Mason,  L., Baxter,  J.,  Bartlett,  P.  and  Frean,  M.  (1999)  Functional  Gradient Tech(cid:173)\n\nniques for Combining Hypotheses. In:  Advances in Large Margin Classifiers.  Smola, \nBartlett, Sch61kopf and Schnurmans (Eds), to appear. \n\n[8]  Schapire, R.,  Freund, Y.,  Bartlett, P.  and Lee, W.  S.  (1998) Boosting the Margin:  A \n\nNew Explanation of Effectiveness of Voting Methods. Ann. Statist., 26, 1651-1687. \n[9]  Shawe-Taylor,1.  and  Cristianini,  N.  (1999) Margin Distribution Bounds on  Gener(cid:173)\n\nalization. In:  Lecture Notes in Artificial Intelligence, 1572. Computational Learning \nTheory, 4th European Conference, EuroCOLT'99, 263-273. \n\n[10]  Shawe-Taylor, J.  and Cristianini, N.  (1999) Further Results on the Margin Distribu(cid:173)\n\ntion. Proc. of COLT'99, 278-285. \n\n[11]  van der Vaart, A.  and Wellner, 1.  (1996) Weak convergence and Empirical Processes. \n\nWith Applications to Statistics. Springer-Verlag, New York. \n\n\f", "award": [], "sourceid": 1899, "authors": [{"given_name": "Vladimir", "family_name": "Koltchinskii", "institution": null}, {"given_name": "Dmitriy", "family_name": "Panchenko", "institution": null}, {"given_name": "Fernando", "family_name": "Lozano", "institution": null}]}