{"title": "A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work", "book": "Advances in Neural Information Processing Systems", "page_first": 224, "page_last": 230, "abstract": null, "full_text": "A PAC-Bayesian Margin Bound for Linear \n\nClassifiers: Why SVMs work \n\nRalf Herbrich \n\nStatistics Research Group \n\nComputer Science Department \nTechnical University of Berlin \n\nralfh@cs.tu-berlin.de \n\nThore Graepel \n\nStatistics Research Group \n\nComputer Science Department \nTechnical University of Berlin \n\nguru@cs.tu-berlin.de \n\nAbstract \n\nWe present a bound on the generalisation error of linear classifiers \nin terms of a refined margin quantity on the training set. The \nresult is obtained in a PAC- Bayesian framework and is based on \ngeometrical arguments in the space of linear classifiers. The new \nbound constitutes an exponential improvement of the so far tightest \nmargin bound by Shawe-Taylor et al. [8] and scales logarithmically \nin the inverse margin. Even in the case of less training examples \nthan input dimensions sufficiently large margins lead to non-trivial \nbound values and -\nplexity term. Furthermore, the classical margin is too coarse a \nmeasure for the essential quantity that controls the generalisation \nerror: the volume ratio between the whole hypothesis space and \nthe subset of consistent hypotheses. The practical relevance of the \nresult lies in the fact that the well-known support vector machine \nis optimal w.r.t. the new bound only if the feature vectors are all of \nthe same length. As a consequence we recommend to use SVMs on \nnormalised feature vectors only -\na recommendation that is well \nsupported by our numerical experiments on two benchmark data \nsets. 
\n\nfor maximum margins -\n\nto a vanishing com(cid:173)\n\n1 \n\nIntroduction \n\nLinear classifiers are exceedingly popular in the machine learning community due \nto their straight-forward applicability and high flexibility which has recently been \nboosted by the so-called kernel methods [13]. A natural and popular framework \nfor the theoretical analysis of classifiers is the PAC (probably approximately cor(cid:173)\nrect) framework [11] which is closely related to Vapnik's work on the generalisation \nerror [12]. For binary classifiers it turned out that the growth function is an ap(cid:173)\npropriate measure of \"complexity\" and can tightly be upper bounded by the VC \n(Vapnik-Chervonenkis) dimension [14]. Later, structural risk minimisation [12] was \nsuggested for directly minimising the VC dimension based on a training set and an \na priori structuring of the hypothesis space. \n\nIn practice, e.g. in the case of linear classifiers, often a thresholded real-valued func-\n\n\ftion is used for classification. In 1993, Kearns [4] demonstrated that considerably \ntighter bounds can be obtained by considering a scale-sensitive complexity measure \nknown as the fat shattering dimension. Further results [1] provided bounds on the \nGrowth function similar to those proved by Vapnik and others [14,6]. The popular(cid:173)\nity of the theory was boosted by the invention of the support vector machine (SVM) \n[13] which aims at directly minimising the complexity as suggested by theory. \n\nUntil recently, however, the success of the SVM remained somewhat obscure because \nin PAC/VC theory the structuring of the hypothesis space must be independent of \nthe training data -\nin contrast to the data-dependence of the canonical hyperplane. \nAs a consequence Shawe-Taylor et.al. [8] developed the luckiness framework, where \nluckiness refers to a complexity measure that is a function of both hypothesis and \ntraining sample. 
\n\nRecently, David McAllester presented some PAC-Bayesian theorems [5] that bound the generalisation error of Bayesian classifiers independently of the correctness of the prior and regardless of the underlying data distribution - thus fulfilling the basic desiderata of PAC theory. In [3] McAllester's bounds on the Gibbs classifier were extended to the Bayes (optimal) classifier. The PAC-Bayesian framework provides a posteriori bounds and is thus closely related in spirit to the luckiness framework¹.\n\nIn this paper we give a tight margin bound for linear classifiers in the PAC-Bayesian framework. The main idea is to identify the generalisation error of the classifier h of interest with that of the Bayes (optimal) classifier of a (point-symmetric) subset Q that is summarised by h. We show that for a uniform prior the normalised margin of h is directly related to the volume of a large subset Q summarised by h. In particular, the result suggests that a learning algorithm for linear classifiers should aim at maximising the normalised margin instead of the classical margin. In Sections 2 and 3 we review the basic PAC-Bayesian theorem and show how it can be applied to single classifiers. In Section 4 we give our main result and outline its proof. In Section 5 we discuss the consequences of the new result for the application of SVMs and demonstrate experimentally that a normalisation of the feature vectors indeed leads to considerably superior generalisation performance.\n\nWe denote n-tuples by italic bold letters (e.g. x = (x_1, ..., x_n)), vectors by roman bold letters (e.g. x), random variables by sans serif font (e.g. X) and vector spaces by calligraphic capitalised letters (e.g. X). The symbols P, E, I and ℓ_2^n denote a probability measure, the expectation of a random variable, the indicator function and the normed space (2-norm) of sequences of length n, respectively.
\n\n2 A PAC Margin Bound\n\nWe consider learning in the PAC framework. Let X be the input space, and let Y = {-1, +1}. Let a labelled training sample z = (x, y) ∈ (X × Y)^m = Z^m be drawn iid according to some unknown probability measure P_Z = P_{Y|X} P_X. Furthermore, for a given hypothesis space H ⊆ Y^X we assume the existence of a \"true\" hypothesis h* ∈ H that labelled the data,\n\nP_{Y|X=x}(y) = I_{y=h*(x)}.    (1)\n\nWe consider linear hypotheses\n\nH = {h_w : x ↦ sign(⟨w, φ(x)⟩_K) | w ∈ W},   W = {w ∈ K : ||w||_K = 1},    (2)\n\nwhere the mapping φ : X → K ⊆ ℓ_2^n maps² the input data to some feature space K, and ||w||_K = 1 leads to a one-to-one correspondence of hypotheses h_w to their parameters w. From the existence of h* we know that there exists a version space V(z) ⊆ W,\n\nV(z) = {w ∈ W | ∀(x, y) ∈ z : h_w(x) = y}.\n\nOur analysis aims at bounding the true risk R[w] of consistent hypotheses h_w,\n\nR[w] = P_{XY}(h_w(X) ≠ Y).\n\nSince all classifiers w ∈ V(z) are indistinguishable in terms of the number of errors committed on the given training set z, let us introduce the concept of the margin γ_z(w) of a classifier w, i.e.\n\nγ_z(w) = min_{(x_i, y_i) ∈ z} y_i ⟨w, x_i⟩_K / ||w||_K.    (3)\n\nThe following theorem due to Shawe-Taylor et al. [8] bounds the generalisation errors R[w] of all classifiers w ∈ V(z) in terms of the margin γ_z(w).\n\nTheorem 1 (PAC margin bound). For all probability measures P_Z such that P_X(||φ(X)||_K ≤ ς) = 1, for any δ > 0, with probability at least 1 - δ over the random draw of the training set z, if we succeed in correctly classifying m samples z with a linear classifier w achieving a positive margin γ_z(w) > √(32ς²/m), then the generalisation error R[w] of w is bounded from above by\n\n(4)\n\nAs the bound on R[w] depends linearly on γ_z^{-2}(w) we see that Theorem 1 provides a theoretical foundation of all algorithms that aim at maximising γ_z(w), e.g. SVMs and Boosting [13, 7].\n\n3 PAC-Bayesian Analysis\n\nWe first present a result [5] that bounds the risk of the generalised Gibbs classification strategy Gibbs_{W(z)} by the measure P_W(W(z)) of a consistent subset W(z) ⊆ V(z). This average risk is then related via the Bayes-Gibbs lemma to the risk of the Bayes classification strategy Bayes_{W(z)} on W(z). For a single consistent hypothesis w ∈ W it is then necessary to identify a consistent subset Q(w) such that the Bayes strategy Bayes_{Q(w)} on Q(w) always agrees with w. Let us define the Gibbs classification strategy Gibbs_{W(z)} w.r.t. the subset W(z) ⊆ V(z) by\n\nGibbs_{W(z)}(x) = h_w(x),   w ~ P_{W|W∈W(z)}.    (5)\n\nThen the following theorem [5] holds for the risk of Gibbs_{W(z)}.\n\nTheorem 2 (PAC-Bayesian bound for subsets of classifiers). For any measure P_W and any measure P_Z, for any δ > 0, with probability at least 1 - δ over the random draw of the training set z, for all subsets W(z) ⊆ V(z) such that P_W(W(z)) > 0 the generalisation error of the associated Gibbs classification strategy Gibbs_{W(z)} is bounded from above by\n\nR[Gibbs_{W(z)}] ≤ (1/m) (ln(1/P_W(W(z))) + 2 ln(m) + ln(1/δ) + 1).    (6)\n\n¹In fact, even Shawe-Taylor et al. concede that \"... a Bayesian might say that luckiness is just a complicated way of encoding a prior. The sole justification for our particular way of encoding is that it allows us to get the PAC like results we sought ...\" [9, p. 4].\n\n²For notational simplicity we sometimes abbreviate φ(x) by x, which should not be confused with the sample x of training objects.
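The right-hand side of bound (6) is straightforward to evaluate numerically. The following sketch (an illustration of ours, not code from the paper; the function name is our own) computes the Theorem 2 bound for a given prior mass P_W(W(z)), sample size m and confidence δ:

```python
import math

def gibbs_bound(prior_mass, m, delta):
    """Right-hand side of the PAC-Bayesian bound (6) of Theorem 2:
    an upper bound on the risk of the Gibbs strategy over a consistent
    subset W(z) with prior measure P_W(W(z)) = prior_mass."""
    assert 0.0 < prior_mass <= 1.0 and m >= 1 and 0.0 < delta < 1.0
    return (math.log(1.0 / prior_mass)
            + 2.0 * math.log(m)
            + math.log(1.0 / delta)
            + 1.0) / m
```

Note that the bound decays essentially as ln(m)/m and degrades only logarithmically as the prior mass of the consistent subset W(z) shrinks.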
\n\nNow consider the Bayes classifier Bayes_{W(z)},\n\nBayes_{W(z)}(x) = sign(E_{W|W∈W(z)}[h_W(x)]),\n\nwhere the expectation E_{W|W∈W(z)} is taken over a cut-off posterior given by combining the PAC-likelihood (1) and the prior P_W.\n\nLemma 1 (Bayes-Gibbs Lemma). For any two measures P_W and P_{XY} and any set W ⊆ W,\n\nP_{XY}(Bayes_W(X) ≠ Y) ≤ 2 · P_{XY}(Gibbs_W(X) ≠ Y).    (7)\n\nProof. (Sketch) Consider only the simple PAC setting we need. At all those points x ∈ X at which Bayes_W is wrong, by definition at least half of the classifiers w ∈ W under consideration make a mistake as well. □\n\nThe combination of Lemma 1 with Theorem 2 yields a bound on the risk of Bayes_{W(z)}. For a single hypothesis w ∈ W let us find a (Bayes-admissible) subset Q(w) of version space V(z) such that Bayes_{Q(w)} on Q(w) agrees with w on every point in X.\n\nDefinition 1 (Bayes-admissibility). Given the hypothesis space in (2) and a prior measure P_W over W we call a subset Q(w) ⊆ W Bayes admissible w.r.t. w and P_W if and only if\n\n∀x ∈ X :   h_w(x) = Bayes_{Q(w)}(x).\n\nAlthough difficult to achieve in general, the following geometrically plausible lemma establishes Bayes-admissibility for the case of interest.\n\nLemma 2 (Bayes-admissibility for linear classifiers). For uniform measure P_W over W each ball Q(w) = {v ∈ W : ||w - v||_K ≤ r} is Bayes admissible w.r.t. its centre w.\n\nPlease note that by considering a ball Q(w) rather than just w we make use of the fact that w summarises all its neighbouring classifiers v ∈ Q(w). Now, using a uniform prior P_W, the normalised margin\n\nΓ_z(w) = min_{(x_i, y_i) ∈ z} y_i ⟨w, x_i⟩_K / (||w||_K ||x_i||_K)    (8)\n\nquantifies the relative volume of classifiers summarised by w and thus allows us to bound its risk.
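In a finite-dimensional feature space the normalised margin (8) can be computed directly. The following numpy sketch (our illustration; the function names are our own) contrasts it with the classical margin (3), which omits the ||x_i||_K factor:

```python
import numpy as np

def classical_margin(w, X, y):
    """gamma_z(w) of (3): min_i y_i <w, x_i> / ||w||."""
    w, X, y = map(np.asarray, (w, X, y))
    return np.min(y * (X @ w)) / np.linalg.norm(w)

def normalised_margin(w, X, y):
    """Gamma_z(w) of (8): min_i y_i <w, x_i> / (||w|| ||x_i||);
    dimensionless and invariant under rescaling of w and of each x_i."""
    w, X, y = map(np.asarray, (w, X, y))
    per_example = y * (X @ w) / (np.linalg.norm(w) * np.linalg.norm(X, axis=1))
    return per_example.min()
```

Rescaling a single training point changes the classical margin but leaves the normalised margin untouched, which is exactly the invariance exploited below.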
Note that in contrast to the classical margin γ_z (see (3)) this normalised margin is a dimensionless quantity and constitutes a measure for the relative size of the version space that is invariant under rescaling of both weight vector w and feature vectors x_i.\n\n4 A PAC-Bayesian Margin Bound\n\nCombining the ideas outlined in the previous section allows us to derive a generalisation error bound for linear classifiers w ∈ V(z) in terms of their normalised margin Γ_z(w).\n\nFigure 1: Illustration of the volume ratio for the classifier at the north pole. Four training points shown as great circles make up version space - the polyhedron on top of the sphere. The radius of the \"cap\" of the sphere is proportional to the margin Γ_z, which only for constant ||x_i||_K is maximised by the SVM.\n\nTheorem 3 (PAC-Bayesian margin bound). Suppose K ⊆ ℓ_2^n is a given feature space of dimensionality n. For all probability measures P_Z, for any δ > 0, with probability at least 1 - δ over the random draw of the training set z, if we succeed in correctly classifying m samples z with a linear classifier w achieving a positive margin Γ_z(w) > 0, then the generalisation error R[w] of w is bounded from above by\n\nR[w] ≤ (1/m) (d ln(1/(1 - √(1 - Γ_z²(w)))) + 2 ln(m) + ln(1/δ) + 2),    (9)\n\nwhere d = min(m, n).\n\nProof. Geometrically the hypothesis space W is the unit sphere in ℝ^n (see Figure 1). Let us assume that P_W is uniform on the unit sphere, as suggested by symmetry. Given the training set z and a classifier w, all classifiers v ∈ Q(w),\n\nQ(w) = {v ∈ W | ⟨w, v⟩_K > √(1 - Γ_z²(w))},    (10)\n\nare within V(z) (for a proof see [2]). Such a set Q(w) is Bayes-admissible by Lemma 2 and hence we can use P_W(Q(w)) to bound the generalisation error of w.
Since P_W is uniform, the value -ln(P_W(Q(w))) is simply the logarithm of the volume ratio between the surface of the unit sphere and the surface of all v fulfilling equation (10). In [2] it is shown that this ratio is exactly given by\n\nln( ∫_0^π sin^{n-2}(θ) dθ / ∫_0^{arccos(√(1 - Γ_z²(w)))} sin^{n-2}(θ) dθ ).\n\nIt can be shown that this ratio is tightly bounded from above by\n\nn ln(1/(1 - √(1 - Γ_z²(w)))) + ln(2).\n\nWith ln(2) < 1 we obtain the desired result. Note that m points maximally span an m-dimensional space and thus we can marginalise over the remaining n - m dimensions of feature space K. This gives d = min(m, n). □\n\nFigure 2: Generalisation errors of classifiers learned by an SVM with (dashed line) and without (solid line) normalisation of the feature vectors x_i, plotted as a function of p. The error bars indicate one standard deviation over 100 random splits of the data sets. The two plots are obtained on the (a) thyroid and (b) sonar data set.\n\nAn appealing feature of equation (9) is that for Γ_z(w) = 1 the bound reduces to (1/m)(2 ln(m) + ln(1/δ) + 2), with a rapid decay to zero as m increases. In the case of margins Γ_z(w) > 0.91 the troublesome situation of d = m, which occurs e.g. for RBF kernels, is compensated for. Furthermore, upper bounding 1/(1 - √(1 - Γ_z²(w))) by 2/Γ_z²(w) we see that Theorem 3 is an exponential improvement of Theorem 1 in terms of the attained margins. It should be noted, however, that the new bound depends on the dimensionality of the input space via d = min(m, n).\n\n5 Experimental Study\n\nTheorem 3 suggests the following learning algorithm: given a version space V(z) (through a given training set z), find the classifier w that maximises Γ_z(w).
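The right-hand side of (9) is equally simple to evaluate. The sketch below (our illustration, not code from the paper) makes the behaviour of the bound easy to check numerically:

```python
import math

def pac_bayes_margin_bound(gamma, m, n, delta):
    """Right-hand side of bound (9) of Theorem 3 for a consistent linear
    classifier with normalised margin gamma = Gamma_z(w) in (0, 1]."""
    assert 0.0 < gamma <= 1.0 and 0.0 < delta < 1.0
    d = min(m, n)
    volume_term = d * math.log(1.0 / (1.0 - math.sqrt(1.0 - gamma ** 2)))
    return (volume_term + 2.0 * math.log(m) + math.log(1.0 / delta) + 2.0) / m
```

At gamma = 1 the volume term vanishes and the bound reduces to (2 ln(m) + ln(1/δ) + 2)/m, in agreement with the discussion of equation (9) above.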
This algorithm, however, is given by the SVM only if the training data in feature space K are normalised. We investigate the influence of such a normalisation on the generalisation error in the feature space K of all monomials up to the p-th degree (well known from handwritten digit recognition, see [13]). Since the SVM learning algorithm as well as the resulting classifier only refer to inner products in K, it suffices to use an easy-to-calculate kernel function k : X × X → ℝ such that for all x, x' ∈ X, k(x, x') = ⟨φ(x), φ(x')⟩_K, given in our case by the polynomial kernel\n\n∀p ∈ ℕ :   k(x, x') = (⟨x, x'⟩ + 1)^p.\n\nEarlier experiments have shown [13] that without normalisation too large values of p may lead to \"overfitting\". We used the UCI [10] data sets thyroid (d = 5, m = 140, m_test = 75) and sonar (d = 60, m = 124, m_test = 60) and plotted the generalisation error of SVM solutions (estimated over 100 different splits of the data set) as a function of p (see Figure 2). As suggested by Theorem 3, in almost all cases the normalisation improved the performance of the support vector machine solution at a statistically significant level. As a consequence, we recommend:\n\nWhen training an SVM, always normalise your data in feature space.\n\nIntuitively, it is only the spatial direction of both weight vector and feature vectors that determines the classification. Hence the different lengths of feature vectors in the training set should not enter the SVM optimisation problem.\n\n6 Conclusion\n\nThe PAC-Bayesian framework together with simple geometrical arguments yields the so far tightest margin bound for linear classifiers. The role of the normalised margin Γ_z in the new bound suggests that the SVM is theoretically justified only for input vectors of constant length.
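The feature-space normalisation recommended above never requires computing φ explicitly: it can be carried out entirely at the kernel level. The following sketch (our illustration, not code from the paper; the function names are our own) normalises the polynomial kernel of Section 5 so that every input is mapped to a unit-length feature vector:

```python
import numpy as np

def poly_kernel(x, xp, p):
    """Polynomial kernel k(x, x') = (<x, x'> + 1)^p from Section 5."""
    return (np.dot(x, xp) + 1.0) ** p

def normalised_poly_kernel(x, xp, p):
    """Inner product of the feature vectors rescaled to unit length in K:
    k(x, x') / sqrt(k(x, x) * k(x', x'))."""
    return poly_kernel(x, xp, p) / np.sqrt(poly_kernel(x, x, p) * poly_kernel(xp, xp, p))
```

With unit-length feature vectors the classical and normalised margins coincide, so the SVM trained on the normalised kernel maximises Γ_z(w).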
We hope that this result is recognised as a useful bridge between theory and practice in the spirit of Vapnik's famous statement:\n\nNothing is more practical than a good theory.\n\nAcknowledgements. We would like to thank David McAllester, John Shawe-Taylor, Bob Williamson, Olivier Chapelle, John Langford, Alex Smola and Bernhard Schölkopf for interesting discussions and useful suggestions on earlier drafts.\n\nReferences\n\n[1] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence and learnability. Journal of the ACM, 44(4):615-631, 1997.\n\n[2] R. Herbrich. Learning Linear Classifiers - Theory and Algorithms. PhD thesis, Technische Universität Berlin, 2000. Accepted for publication by MIT Press.\n\n[3] R. Herbrich, T. Graepel, and C. Campbell. Bayesian learning in reproducing kernel Hilbert spaces. Technical report, Technical University of Berlin, 1999. TR 99-11.\n\n[4] M. J. Kearns and R. Schapire. Efficient distribution-free learning of probabilistic concepts. Journal of Computer and System Sciences, 48(2):464-497, 1993.\n\n[5] D. A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 230-234, Madison, Wisconsin, 1998.\n\n[6] N. Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13:145-147, 1972.\n\n[7] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In Proceedings of the 14th International Conference on Machine Learning, 1997.\n\n[8] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926-1940, 1998.\n\n[9] J. Shawe-Taylor and R. C. Williamson. A PAC analysis of a Bayesian estimator.
\n\nTechnical report, Royal Holloway, University of London, 1997. NC2-TR-1997-013.\n\n[10] UCI. University of California Irvine: Machine Learning Repository, 1990.\n\n[11] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984.\n\n[12] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, 1982.\n\n[13] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.\n\n[14] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-281, 1971.\n", "award": [], "sourceid": 1844, "authors": [{"given_name": "Ralf", "family_name": "Herbrich", "institution": null}, {"given_name": "Thore", "family_name": "Graepel", "institution": null}]}