{"title": "Some New Bounds on the Generalization Error of Combined Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 245, "page_last": 251, "abstract": null, "full_text": "Some new bounds on the generalization error of combined classifiers \n\nVladimir Koltchinskii \nDepartment of Mathematics and Statistics \nUniversity of New Mexico \nAlbuquerque, NM 87131-1141 \nvlad@math.unm.edu \n\nDmitriy Panchenko \nDepartment of Mathematics and Statistics \nUniversity of New Mexico \nAlbuquerque, NM 87131-1141 \npanchenk@math.unm.edu \n\nFernando Lozano \nDepartment of Electrical and Computer Engineering \nUniversity of New Mexico \nAlbuquerque, NM 87131 \nflozano@eece.unm.edu \n\nAbstract \n\nIn this paper we develop the method of bounding the generalization error of a classifier in terms of its margin distribution, which was introduced in the recent papers of Bartlett and of Schapire, Freund, Bartlett and Lee. The theory of Gaussian and empirical processes allows us to prove margin type inequalities for the most general functional classes, the complexity of the class being measured via the so-called Gaussian complexity functions. As a simple application of our results, we obtain the bounds of Schapire, Freund, Bartlett and Lee for the generalization error of boosting. We also substantially improve the results of Bartlett on bounding the generalization error of neural networks in terms of $\\ell_1$-norms of the weights of neurons. Furthermore, under additional assumptions on the complexity of the class of hypotheses we provide some tighter bounds, which in the case of boosting improve the results of Schapire, Freund, Bartlett and Lee. \n\n1 Introduction and margin type inequalities for general functional classes \n\nLet $(X, Y)$ be a random couple, where $X$ is an instance in a space $S$ and $Y \\in \\{-1, 1\\}$ is a label. Let $\\mathcal{G}$ be a set of functions from $S$ into $\\mathbb{R}$. 
For $g \\in \\mathcal{G}$, $\\mathrm{sign}(g(X))$ will be used as a predictor (a classifier) of the unknown label $Y$. If the distribution of $(X, Y)$ is unknown, then the choice of the predictor is based on the training data $(X_1, Y_1), \\dots, (X_n, Y_n)$ that consists of $n$ i.i.d. copies of $(X, Y)$. The goal of learning is to find a predictor $g \\in \\mathcal{G}$ (based on the training data) whose generalization (classification) error $\\mathbb{P}\\{Y g(X) \\le 0\\}$ is small enough. We will first introduce some probabilistic bounds for general functional classes and then give several examples of their applications to bounding the generalization error of boosting and neural networks. We omit all the proofs and refer an interested reader to . \n\nLet $(S, \\mathcal{A}, P)$ be a probability space and let $\\mathcal{F}$ be a class of measurable functions from $(S, \\mathcal{A})$ into $\\mathbb{R}$. Let $\\{X_i\\}$ be a sequence of i.i.d. random variables taking values in $(S, \\mathcal{A})$ with common distribution $P$. Let $P_n$ be the empirical measure based on the sample $(X_1, \\dots, X_n)$, \n\n$$P_n := n^{-1} \\sum_{i=1}^{n} \\delta_{X_i},$$ \n\nwhere $\\delta_x$ denotes the probability distribution concentrated at the point $x$. We will denote $Pf := \\int_S f\\,dP$, $P_n f := \\int_S f\\,dP_n$, etc. In what follows, $\\ell^{\\infty}(\\mathcal{F})$ denotes the Banach space of uniformly bounded real valued functions on $\\mathcal{F}$ with the norm $\\|Y\\|_{\\mathcal{F}} := \\sup_{f \\in \\mathcal{F}} |Y(f)|$, $Y \\in \\ell^{\\infty}(\\mathcal{F})$. Define \n\n$$G_n(\\mathcal{F}) := \\mathbb{E} \\Big\\| n^{-1} \\sum_{i=1}^{n} g_i \\delta_{X_i} \\Big\\|_{\\mathcal{F}},$$ \n\nwhere $\\{g_i\\}$ is a sequence of i.i.d. standard normal random variables, independent of $\\{X_i\\}$. We will call $n \\mapsto G_n(\\mathcal{F})$ the Gaussian complexity function of the class $\\mathcal{F}$. One can find in the literature (see, e.g. ) various upper bounds on such quantities as $G_n(\\mathcal{F})$ in terms of entropies, VC-dimensions, etc. \n\nWe give below a bound in terms of margin cost functions (compare to [6, 7]) and Gaussian complexities. \nLet

0, \n\n$\\mathbb{P}\\{\\exists f \\in \\mathcal{F} : P\\{f \\le 0\\} >$ \n\n+ \n\nLet us consider a special family of cost functions. Assume that $\\varphi$ is a fixed nonincreasing Lipschitz function from $\\mathbb{R}$ into $\\mathbb{R}$ such that $\\varphi(x) \\ge (1 + \\mathrm{sgn}(-x))/2$ for all $x \\in \\mathbb{R}$. One can easily observe that $L(\\varphi(\\cdot/\\delta)) \\le L(\\varphi)\\delta^{-1}$. Applying Theorem 1 to the class of Lipschitz \nfunctions

0, \n\n$$\\mathbb{P}\\Big\\{\\exists f \\in \\mathcal{F} : P\\{f \\le 0\\} > \\inf_{\\delta \\in (0,1]} \\Big[ P_n \\varphi\\Big(\\frac{f}{\\delta}\\Big) + \\frac{2\\sqrt{2\\pi}\\,L(\\varphi)}{\\delta} G_n(\\mathcal{F}) + \\Big(\\frac{\\log\\log_2(2\\delta^{-1})}{n}\\Big)^{1/2} \\Big] + \\frac{t+2}{\\sqrt{n}} \\Big\\} \\le 2\\exp\\{-2t^2\\}.$$ \n\nIn  an example was given which shows that, in general, the order of the factor $\\delta^{-1}$ in the second term of the bound cannot be improved. \n\nGiven a metric space $(T, d)$, we denote by $H_d(T; \\varepsilon)$ the $\\varepsilon$-entropy of $T$ with respect to $d$, i.e. $H_d(T; \\varepsilon) := \\log N_d(T; \\varepsilon)$, where $N_d(T; \\varepsilon)$ is the minimal number of balls of radius $\\varepsilon$ covering $T$. The next theorem improves the previous results under some additional assumptions on the growth of random entropies $H_{d_{P_n,2}}(\\mathcal{F}; \\cdot)$. Define for $\\gamma \\in (0, 1]$ \n\n$$\\delta_n(\\gamma; f) := \\sup\\{\\delta \\in (0,1) : \\delta^{\\gamma} P\\{f \\le \\delta\\} \\le n^{-1+\\gamma/2}\\}$$ \n\nand \n\n$$\\hat{\\delta}_n(\\gamma; f) := \\sup\\{\\delta \\in (0,1) : \\delta^{\\gamma} P_n\\{f \\le \\delta\\} \\le n^{-1+\\gamma/2}\\}.$$ \n\nWe call $\\delta_n(\\gamma; f)$ and $\\hat{\\delta}_n(\\gamma; f)$, respectively, the $\\gamma$-margin and the empirical $\\gamma$-margin of $f$. \n\nTheorem 3 Suppose that for some $\\alpha \\in (0, 2)$ and for some constant $D > 0$ \n\n$$H_{d_{P_n,2}}(\\mathcal{F}; u) \\le D u^{-\\alpha}, \\quad u > 0 \\text{ a.s.} \\qquad (1)$$ \n\nThen for any $\\gamma \\ge \\frac{2\\alpha}{2+\\alpha}$, for some constants $A, B > 0$ and for all large enough $n$ \n\n$$\\mathbb{P}\\{\\forall f \\in \\mathcal{F} : A^{-1}\\hat{\\delta}_n(\\gamma; f) \\le \\delta_n(\\gamma; f) \\le A\\hat{\\delta}_n(\\gamma; f)\\} \\ge 1 - B(\\log_2\\log_2 n)\\exp\\{-n^{\\gamma/2}/2\\}.$$ \n\nThis implies that with high probability for all $f \\in \\mathcal{F}$ \n\n$$P\\{f \\le 0\\} \\le c\\,(n^{1-\\gamma/2}\\hat{\\delta}_n(\\gamma; f)^{\\gamma})^{-1}.$$ \n\nThe bound of Theorem 2 corresponds to the case of $\\gamma = 1$. It is easy to see from the definitions of $\\gamma$-margins that the quantity $(n^{1-\\gamma/2}\\hat{\\delta}_n(\\gamma; f)^{\\gamma})^{-1}$ increases in $\\gamma \\in (0, 1]$. This shows that the bound in the case of $\\gamma < 1$ is tighter. Further discussion of this type of bounds and their experimental study in the case of convex combinations of simple classifiers is given in the next section. \n\n2 Bounding the generalization error of convex combinations of classifiers \n\nRecently, several authors ([1, 8]) suggested a new class of upper bounds on generalization error that are expressed in terms of the empirical distribution of the margin of the predictor (the classifier). The margin is defined as the product $Y g(X)$. 
The bounds in question are especially useful in the case of classifiers that are combinations of simpler classifiers (that belong, say, to a class $\\mathcal{H}$). Examples of such classifiers are provided by the classifiers obtained by boosting [3, 4], bagging  and other voting methods of combining the classifiers. We will now demonstrate how our general results can be applied to the case of convex combinations of simple base classifiers. \n\nWe assume that $\\tilde{S} := S \\times \\{-1, 1\\}$ and $\\tilde{\\mathcal{F}} := \\{\\tilde{f} : f \\in \\mathcal{F}\\}$, where $\\tilde{f}(x, y) := y f(x)$. $P$ will denote the distribution of $(X, Y)$, $P_n$ the empirical distribution based on the observations $((X_1, Y_1), \\dots, (X_n, Y_n))$. It is easy to see that $G_n(\\tilde{\\mathcal{F}}) = G_n(\\mathcal{F})$. One can easily see that if $\\mathcal{F} := \\mathrm{conv}(\\mathcal{H})$, where $\\mathcal{H}$ is a class of base classifiers, then $G_n(\\mathcal{F}) = G_n(\\mathcal{H})$. These easy observations allow us to obtain useful bounds for boosting and other methods of combining the classifiers. For instance, we get in this case the following theorem that implies the bound of Schapire, Freund, Bartlett and Lee  when $\\mathcal{H}$ is a VC-class of sets. \n\nTheorem 4 Let $\\mathcal{F} := \\mathrm{conv}(\\mathcal{H})$, where $\\mathcal{H}$ is a class of measurable functions from $(S, \\mathcal{A})$ into $\\mathbb{R}$. For all $t > 0$, \n\n$$\\mathbb{P}\\Big\\{\\exists f \\in \\mathcal{F} : P\\{y f(x) \\le 0\\} > \\inf_{\\delta \\in (0,1]} \\Big[ P_n \\varphi\\Big(\\frac{y f(x)}{\\delta}\\Big) + \\frac{2\\sqrt{2\\pi}\\,L(\\varphi)}{\\delta} G_n(\\mathcal{H}) + \\Big(\\frac{\\log\\log_2(2\\delta^{-1})}{n}\\Big)^{1/2} \\Big] + \\frac{t+2}{\\sqrt{n}} \\Big\\} \\le 2\\exp\\{-2t^2\\}.$$ \n\nIn particular, if $\\mathcal{H}$ is a VC-class of classifiers $h : S \\mapsto \\{-1, 1\\}$ (which means that the class of sets $\\{\\{x : h(x) = +1\\} : h \\in \\mathcal{H}\\}$ is a Vapnik-Chervonenkis class) with VC-dimension $V(\\mathcal{H})$, we have with some constant $C > 0$, $G_n(\\mathcal{H}) \\le C(V(\\mathcal{H})/n)^{1/2}$. This implies that with probability at least $1 - \\alpha$ \n\n$$P\\{y f(x) \\le 0\\} \\le \\inf_{\\delta \\in (0,1]} \\Big[ P_n\\{y f(x) \\le \\delta\\} + \\frac{C}{\\delta}\\sqrt{\\frac{V(\\mathcal{H})}{n}} + \\Big(\\frac{\\log\\log_2(2\\delta^{-1})}{n}\\Big)^{1/2} \\Big] + \\frac{\\sqrt{\\frac{1}{2}\\log\\frac{2}{\\alpha}} + 2}{\\sqrt{n}},$$ \n\nwhich slightly improves the bound obtained previously by Schapire, Freund, Bartlett and Lee . \n\nTheorem 3 provides some improvement of the above bounds on generalization error of convex combinations of base classifiers. 
To be specific, consider the case when $\\mathcal{H}$ is a VC-class of classifiers. Let $V := V(\\mathcal{H})$ be its VC-dimension. A well known bound (going back to Dudley) on the entropy of the convex hull (see , p. 142) implies that \n\n$$H_{d_{P_n,2}}(\\mathrm{conv}(\\mathcal{H}); u) \\le \\sup_{Q \\in \\mathcal{P}(S)} H_{d_{Q,2}}(\\mathrm{conv}(\\mathcal{H}); u) \\le D u^{-\\frac{2(V-1)}{V}}.$$ \n\nIt immediately follows from Theorem 3 that for all $\\gamma \\ge \\frac{2(V-1)}{2V-1}$ and for some constants $C, B$ \n\n$$\\mathbb{P}\\Big\\{\\exists f \\in \\mathrm{conv}(\\mathcal{H}) : P\\{\\tilde{f} \\le 0\\} > \\frac{C}{n^{1-\\gamma/2}\\hat{\\delta}_n(\\gamma; f)^{\\gamma}}\\Big\\} \\le B \\log_2\\log_2 n \\exp\\Big\\{-\\frac{1}{2} n^{\\gamma/2}\\Big\\},$$ \n\nwhere \n\n$$\\hat{\\delta}_n(\\gamma; f) := \\sup\\{\\delta \\in (0,1) : \\delta^{\\gamma} P_n\\{(x, y) : y f(x) \\le \\delta\\} \\le n^{-1+\\gamma/2}\\}.$$ \n\nThis shows that in the case when the VC-dimension of the base is relatively small the generalization error of boosting and some other convex combinations of simple classifiers obtained by various versions of voting methods becomes better than it was suggested by the bounds of Schapire, Freund, Bartlett and Lee. One can also conjecture that the remarkable generalization ability of these methods observed in numerous experiments can be related to the fact that the combined classifier belongs to a subset of the convex hull for which the random entropy $H_{d_{P_n,2}}$ is much smaller than for the whole convex hull (see [9, 10] for improved margin type bounds in a much more special setting). \n\nTo demonstrate the improvement provided by our bounds over previous results, we show some experimental evidence obtained for a simple artificially generated problem, for which we are able to compute exactly the generalization error as well as the $\\gamma$-margins. \n\nWe consider the problem of learning a classifier consisting of the indicator function of the union of a finite number of intervals in the input space $S = [0, 1]$. We used the Adaboost algorithm  to find a combined classifier using as base class $\\mathcal{H} = \\{[0, b] : b \\in [0, 1]\\} \\cup \\{[b, 1] : b \\in [0, 1]\\}$ (i.e. decision stumps). 
Notice that in this case $V = 2$, and according to the theory values of $\\gamma$ in $(2/3, 1)$ should result in tighter bounds on the generalization error. \n\nFor our experiments we used a target function with 10 equally spaced intervals, and a sample size of 1000, generated according to the uniform distribution in $[0, 1]$. We ran Adaboost for 500 rounds, and computed at each round the generalization error of the combined classifier and the bound $C(n^{1-\\gamma/2}\\hat{\\delta}_n(\\gamma; f)^{\\gamma})^{-1}$ for different values of $\\gamma$. We set the constant $C$ to one. \n\nIn figure 1 we plot the generalization error and the bounds for $\\gamma = 1$, $0.8$ and $2/3$. As expected, for $\\gamma = 1$ (which corresponds roughly to the bounds in ) the bound is very loose, and as $\\gamma$ decreases, the bound gets closer to the generalization error. In figure 2 we show that by reducing further the value of $\\gamma$ we get a curve even closer to the actual generalization error (although for $\\gamma = 0.2$ we do not get an upper bound). This seems to support the conjecture that Adaboost generates combined classifiers that belong to a subset of the convex hull of $\\mathcal{H}$ with a smaller random entropy. In figure 3 we plot the ratio $\\hat{\\delta}_n(\\gamma; f)/\\delta_n(\\gamma; f)$ for $\\gamma = 0.4$, $2/3$ and $0.8$ against the boosting iteration. We can see that the ratio is close to one in all the examples, indicating that the value of the constant $A$ in Theorem 3 is close to one in this case. \n\nFigure 1: Comparison of the generalization error (thicker line) with $(n^{1-\\gamma/2}\\hat{\\delta}_n(\\gamma; f)^{\\gamma})^{-1}$ for $\\gamma = 1$, $0.8$ and $2/3$ (thinner lines, top to bottom). Horizontal axis: boosting round. \n\nFigure 2: Comparison of the generalization error (thicker line) with $(n^{1-\\gamma/2}\\hat{\\delta}_n(\\gamma; f)^{\\gamma})^{-1}$ for $\\gamma = 0.5$, $0.4$ and $0.2$ (thinner lines, top to bottom). Horizontal axis: boosting round. \n\nFigure 3: Ratio $\\hat{\\delta}_n(\\gamma; f)/\\delta_n(\\gamma; f)$ versus boosting round for $\\gamma = 0.4$, $2/3$, $0.8$ (top to bottom). \n\n3 Bounding the generalization error in neural network learning \n\nWe turn now to the applications of the bounds of the previous section in neural network learning. Let $\\mathcal{H}$ be a class of measurable functions from $(S, \\mathcal{A})$ into $\\mathbb{R}$. Given a sigmoid $\\sigma$ from $\\mathbb{R}$ into $[-1, 1]$ and a vector $w := (w_1, \\dots, w_n) \\in \\mathbb{R}^n$, let $N_{\\sigma,w}(u_1, \\dots, u_n) := \\sigma(\\sum_{j=1}^{n} w_j u_j)$. We call the function $N_{\\sigma,w}$ a neuron with weights $w$ and sigmoid $\\sigma$. For $w \\in \\mathbb{R}^n$, $\\|w\\|_{\\ell_1} := \\sum_{i=1}^{n} |w_i|$. Let $\\sigma_j$, $j \\ge 1$, be functions from $\\mathbb{R}$ into $[-1, 1]$, satisfying the Lipschitz conditions: \n\n$$|\\sigma_j(u) - \\sigma_j(v)| \\le L_j |u - v|, \\quad u, v \\in \\mathbb{R}.$$ \n\nLet $\\{A_j\\}$ be a sequence of positive numbers. We define recursively classes of neural networks with restrictions on the weights of neurons ($j$ below is the number of layers): \n\n$$\\mathcal{H}_0 = \\mathcal{H}, \\quad \\mathcal{H}_j(A_1, \\dots, A_j) := \\{N_{\\sigma_j,w}(h_1, \\dots, h_n) : n \\ge 0,\\ h_i \\in \\mathcal{H}_{j-1}(A_1, \\dots, A_{j-1}),\\ w \\in \\mathbb{R}^n,\\ \\|w\\|_{\\ell_1} \\le A_j\\} \\cup \\mathcal{H}_{j-1}(A_1, \\dots, A_{j-1}).$$ \n\nTheorem 5 For all $t > 0$ and for all $l \\ge 1$ \n\n$$\\mathbb{P}\\Big\\{\\exists f \\in \\mathcal{H}_l(A_1, \\dots, A_l) : P\\{\\tilde{f} \\le 0\\} > \\inf_{\\delta \\in (0,1]} \\Big[ P_n \\varphi\\Big(\\frac{\\tilde{f}}{\\delta}\\Big) + \\frac{2\\sqrt{2\\pi}\\,L(\\varphi)}{\\delta} \\prod_{j=1}^{l} (2 L_j A_j + 1)\\, G_n(\\mathcal{H}) + \\Big(\\frac{\\log\\log_2(2\\delta^{-1})}{n}\\Big)^{1/2} \\Big] + \\frac{t+2}{\\sqrt{n}} \\Big\\} \\le 2\\exp\\{-2t^2\\}.$$ \n\nRemark. Bartlett  obtained a similar bound for a more special class $\\mathcal{H}$ and with larger constants. In the case when $A_j \\equiv A$, $L_j \\equiv L$ (the case considered by Bartlett) the expression in the right hand side of his bound includes $(AL)^{l(l+1)/2}$, which is replaced in our bound by $(AL)^{l}$. This improvement can be substantial in applications, since the above quantities play the role of complexity penalties. 
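To make the comparison in the Remark concrete, the following minimal Python sketch evaluates the two complexity factors for equal per-layer constants, together with the product of the terms 2 L_j A_j + 1 that multiplies the Gaussian complexity in Theorem 5. The function names and the sample values A = 2, L = 1 are illustrative assumptions and do not come from the paper.

```python
import math


def bartlett_factor(A, L, l):
    # Factor (A*L)^(l*(l+1)/2) that the Remark attributes to Bartlett's bound.
    # l*(l+1) is always even, so the integer division is exact.
    return (A * L) ** (l * (l + 1) // 2)


def new_factor(A, L, l):
    # Factor (A*L)^l that replaces it according to the Remark.
    return (A * L) ** l


def layer_product(lipschitz, weight_bounds):
    # Product over layers of (2 * L_j * A_j + 1), the multiplier of the
    # Gaussian complexity as read from the statement of Theorem 5.
    return math.prod(2 * L_j * A_j + 1 for L_j, A_j in zip(lipschitz, weight_bounds))


if __name__ == '__main__':
    A, L = 2, 1  # hypothetical weight bound and Lipschitz constant
    for l in (1, 2, 3, 4):
        print(l, bartlett_factor(A, L, l), new_factor(A, L, l),
              layer_product([L] * l, [A] * l))
```

With three layers the exponent drops from 6 to 3, which shows how quickly the gap between the two penalties widens as the number of layers grows.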
\n\nFinally, it is worth mentioning that the theorems of Section 1 can be applied also to bounding the generalization error in multi-class problems. Namely, we assume that the labels take values in a finite set $\\mathcal{Y}$ with $\\mathrm{card}(\\mathcal{Y}) =: L$. Consider a class $\\tilde{\\mathcal{F}}$ of functions from $\\tilde{S} := S \\times \\mathcal{Y}$ into $\\mathbb{R}$. A function $f \\in \\tilde{\\mathcal{F}}$ predicts a label $y \\in \\mathcal{Y}$ for an example $x \\in S$ iff \n\n$$f(x, y) > \\max_{y' \\ne y} f(x, y').$$ \n\nThe margin of an example $(x, y)$ is defined as \n\n$$m_f(x, y) := f(x, y) - \\max_{y' \\ne y} f(x, y'),$$ \n\nso $f$ misclassifies the example $(x, y)$ iff $m_f(x, y) \\le 0$. Let \n\n$$\\mathcal{F} := \\{f(\\cdot, y) : y \\in \\mathcal{Y}, f \\in \\tilde{\\mathcal{F}}\\}.$$ \n\nThe next result follows from Theorem 2. \n\nTheorem 6 For all $t > 0$, \n\nlP'{3f E j:: P{mj :\"S O} > \n\ninf [Pn{mj:\"S 8} + 4y'27rL~2L - 1) Gn(F)+ \n