{"title": "PAC-Bayes & Margins", "book": "Advances in Neural Information Processing Systems", "page_first": 439, "page_last": 446, "abstract": "", "full_text": "shorter argument and much tighter than previous margin bounds. There are two mathematical flavors of margin bound dependent upon the weights $w_i$ of the vote and the features $x_i$ that the vote is taken over.\n\n1. Those ([12], [1]) with a bound on $\\sum_i w_i^2$ and $\\sum_i x_i^2$ (\"$l_2/l_2$\" bounds).\n2. Those ([11], [6]) with a bound on $\\sum_i w_i$ and $\\max_i x_i$ (\"$l_1/l_\\infty$\" bounds).\n\nThe results here are of the \"$l_2/l_2$\" form. We improve on Shawe-Taylor et al. [12] and Bartlett [1] by a $\\log(m)^2$ sample complexity factor and much tighter constants (1000 or unstated versus 9 or 18 as suggested by Section 2.2). In addition, the bound here covers margin errors without weakening the error-free case.\n\nHerbrich and Graepel [3] moved significantly towards the approach adopted in our paper, but the methodology adopted meant that their result does not scale well to high dimensional feature spaces as the bound here (and earlier results) do.\n\nThe layout of our paper is simple: we first show how to construct a stochastic classifier with a good true error bound given a margin, and then construct a margin bound.\n\n2 Margin Implies PAC-Bayes Bound\n\n2.1 Notation and theorem\n\nConsider a feature space $X$ which may be used to make predictions about the value in an output space $Y = \\{-1, +1\\}$. We use the notation $x = (x_1, \\ldots, x_N)$ to denote an $N$ dimensional vector. Let the vote of a voting classifier be given by:\n\n$$v_w(x) = w \\cdot x = \\sum_i w_i x_i.$$\n\nThe classifier is given by $c(x) = \\mathrm{sign}(v_w(x))$. The number of \"margin violations\" or \"margin errors\" at $\\gamma$ is given by:\n\n$$\\hat{e}_\\gamma(c) = \\Pr_{(x,y) \\sim U(S)} \\left( y\\, v_w(x) < \\gamma \\right),$$\n\nwhere $U(S)$ is the uniform distribution over the sample set $S$.\n\nFor convenience, we assume $v_x(x) \\le 1$ and $v_w(w) \\le 1$. 
Without this assumption, our results scale as $\\sqrt{v_x(x)} \\sqrt{v_w(w)} / \\gamma$ rather than $1/\\gamma$.\n\nAny margin bound applies to a vector $w$ in $N$ dimensional space. For every example, we can decompose the example into a portion which is parallel to $w$ and a portion which is perpendicular to $w$:\n\n$$x_\\top = x - \\frac{v_w(x)}{\\|w\\|^2} w, \\qquad x_\\| = x - x_\\top.$$\n\nThe argument is simple: we exhibit a \"prior\" over the weight space and a \"posterior\" over the weight space with an analytical form for the KL-divergence. The stochastic classifier defined by the posterior has a slightly larger empirical error and a small true error bound.\n\nFor the next theorem, let $\\bar{F}(x) = 1 - \\int_{-\\infty}^{x} \\frac{1}{\\sqrt{2\\pi}} e^{-z^2/2}\\, dz$ be the tail probability of a Gaussian with mean 0 and variance 1. Also let\n\n$$e_{Q(w,\\gamma,\\epsilon)} = \\Pr_{(x,y) \\sim D,\\; h \\sim Q(w,\\gamma,\\epsilon)} \\left( h(x) \\ne y \\right)$$\n\nbe the true error rate of a stochastic classifier with distribution $Q(w, \\gamma, \\epsilon)$ dependent on a free parameter $\\epsilon$, the weights $w$ of an averaging classifier, and a margin $\\gamma$.\n\nTheorem 2.1 There exists a function $Q$ mapping a weight vector $w$, margin $\\gamma$, and value $\\epsilon > 0$ to a distribution $Q(w, \\gamma, \\epsilon)$ such that\n\n$$\\Pr_{S \\sim D^m} \\left( \\forall w, \\gamma, \\epsilon: \\; KL\\left( \\hat{e}_\\gamma(c) + \\epsilon \\,\\|\\, e_{Q(w,\\gamma,\\epsilon)} \\right) \\le \\frac{\\ln \\frac{1}{\\bar{F}\\left( \\bar{F}^{-1}(\\epsilon) / \\gamma \\right)} + \\ln \\frac{m+1}{\\delta}}{m} \\right) \\ge 1 - \\delta$$\n\nwhere $KL(q \\| p) = q \\ln \\frac{q}{p} + (1 - q) \\ln \\frac{1-q}{1-p}$ is the Kullback-Leibler divergence between two coins of bias $q < p$.\n\n2.2 Discussion\n\nTheorem 2.1 shows that when a margin exists it is always possible to find a \"posterior\" distribution (in the style of [5]) which introduces only a small amount of additional training error rate. The true error bound for this stochastization of the large-margin classifier is not dependent on the dimensionality except via the margin.\n\nSince the Gaussian tail decreases exponentially, the value of $\\bar{F}^{-1}(\\epsilon)$ is not very large. In particular, at $\\bar{F}(3)$, we have $\\epsilon \\le 0.01$. 
Thus, for any reasonable value of $\\epsilon$ and for the purpose of understanding, we can replace $\\bar{F}^{-1}(\\epsilon)$ with 3 and consider $\\epsilon \\approx 0$.\n\nOne useful approximation for $\\bar{F}(x)$ with large $x$ is:\n\n$$\\bar{F}(x) \\approx \\frac{e^{-x^2/2}}{\\sqrt{2\\pi}} (1/x).$$\n\nIf there are no margin errors, $\\hat{e}_\\gamma(c) = 0$, then these approximations yield the approximate bound:\n\n$$\\Pr_{S \\sim D^m} \\left( e_{Q(w,\\gamma,0)} \\le \\frac{\\frac{9}{2\\gamma^2} + \\ln \\frac{3\\sqrt{2\\pi}}{\\gamma} + \\ln \\frac{m+1}{\\delta}}{m} \\right) \\ge 1 - \\delta.$$\n\nIn particular, for large $m$ the true error is approximately bounded by $\\frac{9}{2\\gamma^2 m}$.\n\nAs an example, if $\\gamma = 0.25$, the bound is less than 1 around $m = 100$ examples and less than 0.5 around $m = 200$ examples.\n\nLater we show (see Lemmas 4.1 and 4.2 or Theorem 4.3) that the generalisation error of the original averaging classifier is only a factor 2 or 4 larger than that of the stochastic classifiers considered here. Hence, the bounds of Theorems 2.1 and 3.1 also give bounds on the averaging classifiers $w$.\n\nThis theorem is robust in the presence of noise and margin errors. Since the PAC-Bayes bound works for any \"posterior\" $Q$, we are free to choose $Q$ dependent upon the data in any way. In practice, it may be desirable to follow an approach similar to [5] and allow the data to determine the \"right\" posterior $Q$. Using the data rather than the margin $\\gamma$ allows the bound to take into account a fortuitous data distribution and robust behavior in the presence of a \"soft margin\" (a margin with errors). This is developed (along with a full proof) in the next section.\n\n3 Main Full Result\n\nWe now present the main result. Here we state a bound which can take into account the distribution of the training set. Theorem 2.1 is a simple consequence of this result. This theorem demonstrates the flexibility of the technique since it incorporates significantly more data-dependent information into the bound calculation. 
When applying the bound one would choose $\\mu$ to make the inequality (1) an equality. Hence, any choice of $\\mu$ determines $\\epsilon$ and hence the overall bound. We then have the freedom to choose $\\mu$ to optimise the bound.\n\nAs noted earlier, given a weight vector $w$, any particular feature vector $x$ decomposes into a portion $x_\\|$ which is parallel to $w$ and a portion $x_\\top$ which is perpendicular to $w$. Hence, we can write $x = x_\\| e_\\| + x_\\top e_\\top$, where $e_\\|$ is a unit vector in the direction of $w$ and $e_\\top$ is a unit vector in the direction of $x_\\top$. Note that we may have $y x_\\| < 0$, if $x$ is misclassified by $w$.\n\nTheorem 3.1 For all averaging classifiers $c$ with normalized weights $w$ and for all $\\epsilon > 0$ stochastic error rates, if we choose $\\mu > 0$ such that\n\n$$E_{x,y \\sim S}\\, \\bar{F}\\left( \\frac{y x_\\|}{x_\\top} \\mu \\right) = \\epsilon$$\n\nthen there exists a posterior distribution $Q(w, \\mu, \\epsilon)$ such that\n\n$$\\Pr_{S \\sim D^m} \\left( \\forall \\epsilon, w, \\mu: \\; KL\\left( \\epsilon \\,\\|\\, e_{Q(w,\\mu,\\epsilon)} \\right) \\le \\frac{\\ln \\frac{1}{\\bar{F}(\\mu)} + \\ln \\frac{m+1}{\\delta}}{m} \\right) \\ge 1 - \\delta \\qquad (1)$$\n\nwhere $KL(q \\| p) = q \\ln \\frac{q}{p} + (1 - q) \\ln \\frac{1-q}{1-p}$ is the Kullback-Leibler divergence between two coins of bias $q < p$.\n\nProof. The proof uses the PAC-Bayes bound, which states that for all prior distributions $P$,\n\n$$\\Pr_{S \\sim D^m} \\left( \\forall Q: \\; KL(\\hat{e}_Q \\| e_Q) \\le \\frac{KL(Q \\| P) + \\ln \\frac{m+1}{\\delta}}{m} \\right) \\ge 1 - \\delta.$$\n\nWe choose $P = N(0, I)$, an isotropic Gaussian (see footnote 1). A choice of the \"posterior\" $Q$ completes the proof. The $Q$ we choose depends upon the direction $w$, the margin $\\gamma$, and the stochastic error $\\epsilon$. 
In particular, $Q$ equals $P$ in every direction perpendicular to $w$, and a rectified Gaussian tail in the $w$ direction (see footnote 2). The distribution of a rectified Gaussian tail is given by $R(\\mu) = 0$ for $x < \\mu$ and $R(\\mu) = \\frac{1}{\\bar{F}(\\mu) \\sqrt{2\\pi}} e^{-x^2/2}$ for $x \\ge \\mu$.\n\nThe chain rule for relative entropy (Theorem 2.5.3 of [2]) and the independence of draws in each dimension imply that:\n\n$$\\begin{aligned} KL(Q \\| P) &= KL(Q_\\| \\| P_\\|) + KL(Q_\\top \\| P_\\top) \\\\ &= KL(R(\\mu) \\| N(0, 1)) + KL(P_\\top \\| P_\\top) \\\\ &= KL(R(\\mu) \\| N(0, 1)) + 0 \\\\ &= \\int_\\mu^\\infty \\ln \\frac{1}{\\bar{F}(\\mu)} R(x)\\, dx \\\\ &= \\ln \\frac{1}{\\bar{F}(\\mu)} \\end{aligned}$$\n\nFootnote 1: Later, the fact that an isotropic Gaussian has the same representation in all rotations of the coordinate system will be useful.\n\nFootnote 2: Note that we use the invariance under rotation of $N(0, I)$ here to line up one dimension with $w$.\n\nThus, our choice of posterior implies the theorem if the empirical error rate is $\\hat{e}_{Q(w,\\mu,\\epsilon)} \\le E_{x,y \\sim S}\\, \\bar{F}\\left( \\frac{y x_\\|}{x_\\top} \\mu \\right) \\le \\epsilon$, which we show next.\n\nGiven a point $x$, our choice of posterior implies that we can decompose the stochastic weight vector, $\\tilde{w} = \\tilde{w}_\\| e_\\| + \\tilde{w}_\\top e_\\top + w'$, where $e_\\|$ is parallel to $w$, $e_\\top$ is parallel to $x_\\top$ and $w'$ is a residual vector perpendicular to both. By our definition of the stochastic generation $\\tilde{w}_\\| \\sim R(\\mu)$ and $\\tilde{w}_\\top \\sim N(0, 1)$. To avoid an error, we must have:\n\n$$y = \\mathrm{sign}(v_{\\tilde{w}}(x)) = \\mathrm{sign}(\\tilde{w}_\\| x_\\| + \\tilde{w}_\\top x_\\top).$$\n\nThen, since $\\tilde{w}_\\| \\ge \\mu$, no error occurs if:\n\n$$y(\\mu x_\\| + \\tilde{w}_\\top x_\\top) > 0.$$\n\nSince $\\tilde{w}_\\top$ is drawn from $N(0, 1)$ the probability of this event is:\n\n$$\\Pr\\left( y(\\mu x_\\| + \\tilde{w}_\\top x_\\top) > 0 \\right) \\ge 1 - \\bar{F}\\left( \\frac{y x_\\|}{x_\\top} \\mu \\right).$$\n\nAnd so, the empirical error rate of the stochastic classifier is bounded by:\n\n$$\\hat{e}_Q \\le E_{x,y \\sim S}\\, \\bar{F}\\left( \\frac{y x_\\|}{x_\\top} \\mu \\right) = \\epsilon$$\n\nas required. \u25a1\n\n3.1 Proof of Theorem 2.1\n\nProof. (sketch) The theorem follows from a relaxation of Theorem 3.1. In particular, we treat every example with a margin less than $\\gamma$ as an error and use the bounds $\\|x_\\top\\| \\le 1$ and $\\|x_\\|\\| \\ge \\gamma$. \u25a1\n\n3.2 Further results\n\nSeveral aspects of Theorem 3.1 appear arbitrary, but they are not. 
In particular, the choice of \"prior\" is not that arbitrary as the following lemma indicates.\n\nLemma 3.2 The set of $P$ satisfying $\\exists P_{\\|\\|}: P(x) = P_{\\|\\|}(\\|x\\|^2)$ (rotational invariance) and $P(x) = \\prod_{i=1}^{N} p_i(x_i)$ (independence of each dimension) is $N(0, \\lambda I)$ for $\\lambda > 0$.\n\nProof. Rotational invariance together with the dimension independence imply that for all $i, j, x$: $p_i(x) = p_j(x)$, which implies that:\n\n$$P(x) = \\prod_{i=1}^{N} p(x_i)$$\n\nfor some function $p(\\cdot)$. Applying rotational invariance, we have that:\n\n$$P(x) = P_{\\|\\|}(\\|x\\|^2) = \\prod_{i=1}^{N} p(x_i).$$\n\nThis implies:\n\n$$\\log P_{\\|\\|}\\left( \\sum_{i=1}^{N} x_i^2 \\right) = \\sum_{i=1}^{N} \\log p(x_i).$$\n\nTaking the derivative of this equation with respect to $x_i$ gives\n\n$$\\frac{P'_{\\|\\|}(\\|x\\|^2)\\, 2 x_i}{P_{\\|\\|}(\\|x\\|^2)} = \\frac{p'(x_i)}{p(x_i)}.$$\n\nSince this holds for all values of $x$ we must have\n\n$$P'_{\\|\\|}(t) = \\lambda P_{\\|\\|}(t)$$\n\nfor some constant $\\lambda$, or $P_{\\|\\|}(t) = C \\exp(\\lambda t)$, for some constant $C$. Hence, $P(x) = C \\exp(\\lambda \\|x\\|^2)$, as required. \u25a1\n\nThe constant $\\lambda$ in the previous lemma is a free parameter. However, the results do not depend upon the precise value of $\\lambda$ so we choose 1 for simplicity. Some freedom in the choice of the \"posterior\" $Q$ does exist and the results are dependent on this choice. A rectified Gaussian appears simplest.\n\n4 Margin Implies Margin Bound\n\nThere are two methods for constructing a margin bound for the original averaging classifier. The first method is simplest while the second is sometimes significantly tighter.\n\n4.1 Simple Margin Bound\n\nFirst we note a trivial bound arising from a folk theorem and the relationship to our result.\n\nLemma 4.1 (Simple Averaging bound) For any stochastic classifier with distribution $Q$ and true error rate $e_Q$, the averaging classifier,\n\n$$c_Q(x) = \\mathrm{sign}\\left( \\int h(x)\\, dQ(h) \\right)$$\n\nhas true error rate:\n\n$$e(c_Q) \\le 2 e_Q.$$\n\nProof. For every example $(x, y)$, every time the averaging classifier errs, the probability of the stochastic classifier erring must be at least 1/2. 
\u25a1\n\nThis result is interesting and of practical use when the empirical error rate of the original averaging classifier is low. Furthermore, we can prove that $c_Q(x)$ is the original averaging classifier.\n\nLemma 4.2 For $Q = Q(w, \\gamma, \\epsilon)$ derived according to Theorems 2.1 and 3.1 and $c_Q(x)$ as in Lemma 4.1:\n\n$$c_Q(x) = \\mathrm{sign}(v_w(x)).$$\n\nProof. For every $x$ this equation holds because of two simple facts:\n\n1. For any $\\tilde{w}$ that classifies an input $x$ differently from the averaging classifier, there is a unique equiprobable paired weight vector that agrees with the averaging classifier.\n\n2. If $v_w(x) \\ne 0$, then there exists a nonzero measure of classifier pairs which always agrees with the averaging classifier.\n\nCondition (1) is met by reversing the sign of $\\tilde{w}_\\top$ and noting that either the original random vector or the reversed random vector must agree with the averaging classifier.\n\nCondition (2) is met by the randomly drawn classifier $\\tilde{w} = \\lambda w$ and nearby classifiers for any $\\lambda > 0$. Since the example is not on the hyperplane, there exists some small sphere of paired classifiers (in the sense of condition (1)). This sphere has a positive measure. \u25a1\n\nThe simple averaging bound is elegant, but it breaks down when the empirical error is large because:\n\n$$e(c) \\le 2 e_Q = 2(\\hat{e}_Q + \\Delta_m) \\simeq 2 \\hat{e}_\\gamma(c) + 2 \\Delta_m$$\n\nwhere $\\hat{e}_Q$ is the empirical error rate of a stochastic classifier and $\\Delta_m$ goes to zero as $m \\to \\infty$. Next, we construct a bound of the form $e(c_Q) \\le \\hat{e}_\\gamma(c) + \\Delta'_m$ where $\\Delta'_m > \\Delta_m$ but $\\hat{e}_\\gamma(c) \\le 2 \\hat{e}_\\gamma(c)$.\n\n4.2 A (Sometimes) Tighter Bound\n\nBy altering our choice of $\\mu$ and our notion of \"error\" we can construct a bound which holds without randomization. 
In particular, we have the following theorem:\n\nTheorem 4.3 For all averaging classifiers $c$ with normalized weights $w$, for all $\\epsilon > 0$ \"extra\" error rates and $\\gamma > 0$ margins:\n\n$$\\Pr_{S \\sim D^m} \\left( \\forall \\epsilon, w, \\gamma: \\; KL\\left( \\hat{e}_\\gamma(c) + \\epsilon \\,\\|\\, e(c) - \\epsilon \\right) \\le \\frac{\\ln \\frac{1}{\\bar{F}\\left( 2 \\bar{F}^{-1}(\\epsilon) / \\gamma \\right)} + 2 \\ln \\frac{m+1}{\\delta}}{m} \\right) \\ge 1 - \\delta$$\n\nwhere $KL(q \\| p) = q \\ln \\frac{q}{p} + (1 - q) \\ln \\frac{1-q}{1-p}$ is the Kullback-Leibler divergence between two coins of bias $q < p$.\n\nThe proof of this statement is strongly related to the proof given in [11] but noticeably simpler. It is also closely related to the proof of Theorem 2.1.\n\nProof. (sketch) Instead of choosing $\\tilde{w}_\\|$ so that the empirical error rate is increased by $\\epsilon$, we instead choose $\\tilde{w}_\\|$ so that the number of margin violations at margin $\\gamma/2$ is increased by at most $\\epsilon$. This can be done by drawing from a distribution such as\n\n$$\\tilde{w}_\\| \\sim R\\left( \\frac{2 \\bar{F}^{-1}(\\epsilon)}{\\gamma} \\right).$$\n\nApplying the PAC-Bayes bound to this we reach a bound on the number of margin violations at $\\gamma/2$ for the true distribution. In particular, we have:\n\n$$\\Pr_{S \\sim D^m} \\left( KL\\left( \\hat{e}_\\gamma(c) + \\epsilon \\,\\|\\, e_{Q,\\gamma/2} \\right) \\le \\frac{\\ln \\frac{1}{\\bar{F}\\left( 2 \\bar{F}^{-1}(\\epsilon) / \\gamma \\right)} + \\ln \\frac{m+1}{\\delta}}{m} \\right) \\ge 1 - \\delta.$$\n\nThe application is tricky because the bound does not hold uniformly for all $\\gamma$ (see footnote 3). Instead we can discretize $\\gamma$ at scale $1/m$ and apply a union bound to get $\\delta \\to \\delta/(m+1)$. For any fixed example $(x, y)$, with probability $1 - \\delta$, we know that with probability at least $1 - e_{Q,\\gamma/2}$, the example has a margin of at least $\\gamma/2$. Since the example has a margin of at least $\\gamma/2$ and our randomization doesn't change the margin by more than $\\gamma/2$ with probability $1 - \\epsilon$, the averaging classifier almost always predicts in the same way as the stochastic classifier, implying the theorem. \u25a1\n\nFootnote 3: Thanks to David McAllester for pointing this out.\n\n4.3 Discussion & Open Problems\n\nThe bound we have obtained here is considerably tighter than previous bounds for averaging classifiers; in fact it is tight enough to consider applying to real learning problems and using the results in decision making.\n\nCan this argument be improved? 
The simple averaging bound (Lemma 4.1) and the margin bound (Theorem 4.3) each have a regime in which they dominate. We expect that there exists some natural theorem which does well in both regimes simultaneously.\n\nIn order to verify that the margin bound is as tight as possible, it would also be instructive to study lower bounds.\n\n4.4 Acknowledgements\n\nMany thanks to David McAllester for critical reading and comments.\n\nReferences\n\n[1] P. L. Bartlett, \"The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network,\" IEEE Transactions on Information Theory, vol. 44, no. 2, pp. 525-536, 1998.\n\n[2] Thomas Cover and Joy Thomas, \"Elements of Information Theory,\" Wiley, New York, 1991.\n\n[3] Ralf Herbrich and Thore Graepel, \"A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work,\" in Advances in Neural Information Processing Systems 13, pages 224-230, 2001.\n\n[4] T. Jaakkola, M. Meila, T. Jebara, \"Maximum Entropy Discrimination,\" NIPS 1999.\n\n[5] John Langford and Rich Caruana, \"(Not) Bounding the True Error,\" NIPS 2001.\n\n[6] John Langford, Matthias Seeger, and Nimrod Megiddo, \"An Improved Predictive Accuracy Bound for Averaging Classifiers,\" ICML 2001.\n\n[7] John Langford and Matthias Seeger, \"Bounds for Averaging Classifiers,\" CMU tech report, CMU-CS-01-102, 2001.\n\n[8] David McAllester, \"PAC-Bayesian Model Averaging,\" COLT 1999.\n\n[9] Yoav Freund and Robert E. Schapire, \"A Decision Theoretic Generalization of On-line Learning and an Application to Boosting,\" Eurocolt 1995.\n\n[10] Matthias Seeger, \"PAC-Bayesian Generalization Error Bounds for Gaussian Processes,\" Tech Report, Division of Informatics report EDI-INF-RR-0094. http://www.dai.ed.ac.uk/homes/seeger/papers/gpmcall-tr.ps.gz\n\n[11] Robert E. 
Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee, \"Boosting the Margin: A new explanation for the effectiveness of voting methods,\" The Annals of Statistics, 26(5):1651-1686, 1998.\n\n[12] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony, \"Structural risk minimization over data-dependent hierarchies,\" IEEE Transactions on Information Theory, 44(5):1926-1940, 1998.", "award": [], "sourceid": 2317, "authors": [{"given_name": "John", "family_name": "Langford", "institution": null}, {"given_name": "John", "family_name": "Shawe-Taylor", "institution": null}]}