{"title": "PAC-Bayesian Generic Chaining", "book": "Advances in Neural Information Processing Systems", "page_first": 1125, "page_last": 1132, "abstract": "", "full_text": "PAC-Bayesian Generic Chaining\n\nJean-Yves Audibert*\n\nUniversit\u00e9 Paris 6\n\nLaboratoire de Probabilit\u00e9s et Mod\u00e8les al\u00e9atoires\n\n175 rue du Chevaleret\n75013 Paris - France\n\njyaudibe@ccr.jussieu.fr\n\nOlivier Bousquet\n\nMax Planck Institute for Biological Cybernetics\n\nSpemannstrasse 38\n\nD-72076 T\u00fcbingen - Germany\n\nolivier.bousquet@tuebingen.mpg.de\n\nAbstract\n\nThere exist many different generalization error bounds for classification. Each of these bounds contains an improvement over the others for certain situations. Our goal is to combine these different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester [1], which is interesting for averaging classifiers, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand [2]. This combination is quite natural since the generic chaining is based on the notion of majorizing measures, which can be considered as priors on the set of classifiers, and such priors also arise in the PAC-Bayesian setting.\n\n1 Introduction\n\nSince the first results of Vapnik and Chervonenkis on uniform laws of large numbers for classes of $\\{0, 1\\}$-valued functions, there has been a considerable amount of work aiming at obtaining generalizations and refinements of these bounds. This work has been carried out by different communities. On the one hand, people developing empirical processes theory like Dudley and Talagrand (among others) obtained very interesting results concerning the behaviour of the suprema of empirical processes. 
On the other hand, people exploring learning theory tried to obtain refinements for specific algorithms with an emphasis on data-dependent bounds.\nOne crucial aspect of all the generalization error bounds is that they aim at controlling the behaviour of the function that is returned by the algorithm. This function is data-dependent and thus unknown before seeing the data. As a consequence, if one wants to make statements about its behaviour (e.g. the difference between its empirical error and true error), one has to be able to predict which function is likely to be chosen by the algorithm. But since this cannot be done exactly, there is a need to provide guarantees that hold simultaneously for several candidate functions. This is known as the union bound. The way to perform this union bound optimally is now well mastered in the empirical processes community.\n\n*Secondary affiliation: CREST, ENSAE, Laboratoire de Finance et Assurance, Malakoff, France\n\nIn the learning theory setting, one is interested in bounds that are as algorithm and data dependent as possible. This particular focus has made concentration inequalities (see e.g. [3]) popular as they allow one to obtain data-dependent results in an effortless way. Another aspect that is of interest for learning is the case where the classifiers are randomized or averaged. McAllester [1, 4] has proposed a new type of bound that takes the randomization into account in a clever way.\nOur goal is to combine several of these improvements, bringing together the power of the majorizing measures as an optimal union bound technique and the power of the PAC-Bayesian bounds that handle randomized predictions efficiently, and obtain a generalization of both that is suited for learning applications.\nThe paper is structured as follows. The next section introduces the notation and reviews the previous improved bounds that have been proposed. 
Then we give our main result and discuss its applications, showing in particular how to recover previously known results. Finally we give the proof of the presented results.\n\n2 Previous results\n\nWe first introduce the notation and then give an overview of existing generalization error bounds. We consider an input space $\\mathcal{X}$, an output space $\\mathcal{Y}$ and a probability distribution $P$ on the product space $\\mathcal{Z} \\triangleq \\mathcal{X} \\times \\mathcal{Y}$. Let $Z \\triangleq (X, Y)$ denote a pair of random variables distributed according to $P$ and for a given integer $n$, let $Z_1, \\dots, Z_n$ and $Z'_1, \\dots, Z'_n$ be two independent samples of $n$ independent copies of $Z$. We denote by $P_n$, $P'_n$ and $P_{2n}$ the empirical measures associated respectively to the first, the second and the union of both samples.\nTo each function $g : \\mathcal{X} \\to \\mathcal{Y}$ we associate the corresponding loss function $f : \\mathcal{Z} \\to \\mathbb{R}$ defined by $f(z) = L[g(x), y]$ where $L$ is a loss function. In classification, the loss function is $L = I_{g(x) \\neq y}$ where $I$ denotes the indicator function. $\\mathcal{F}$ will denote a set of such functions. For such functions, we denote their expectation under $P$ by $P f$ and their empirical expectation by $P_n f$ (i.e. 
$P_n f = n^{-1} \\sum_{i=1}^n f(Z_i)$). $E_n$, $E'_n$ and $E_{2n}$ denote the expectation with respect to the first, second and union of both training samples.\nWe consider the pseudo-distances $d^2(f_1, f_2) = P(f_1 - f_2)^2$ and similarly $d_n$, $d'_n$ and $d_{2n}$. We define the covering number $N(\\mathcal{F}, \\epsilon, d)$ as the minimum number of balls of radius $\\epsilon$ needed to cover $\\mathcal{F}$ in the pseudo-distance $d$.\nWe denote by $\\rho$ and $\\pi$ two probability measures on the space $\\mathcal{F}$, so that $\\rho P f$ will actually mean the expectation of $P f$ when $f$ is sampled according to the probability measure $\\rho$. For two such measures, $K(\\rho, \\pi)$ will denote their Kullback-Leibler divergence ($K(\\rho, \\pi) = \\rho \\log \\frac{d\\rho}{d\\pi}$ when $\\rho$ is absolutely continuous with respect to $\\pi$ and $K(\\rho, \\pi) = +\\infty$ otherwise). Also, $\\beta$ denotes some positive real number while $C$ is some positive constant (whose value may differ from line to line) and $\\mathcal{M}^1_+(\\mathcal{F})$ is the set of probability measures on $\\mathcal{F}$. We assume that the functions in $\\mathcal{F}$ have range in $[a, b]$.\nGeneralization error bounds give an upper bound on the difference between the true and empirical error of functions in a given class, which holds with high probability with respect to the sampling of the training set.\nSingle function. By Hoeffding's inequality one easily gets that for each fixed $f \\in \\mathcal{F}$, with probability at least $1 - \\beta$,\n\n$$P f - P_n f \\le C \\sqrt{\\frac{\\log(1/\\beta)}{n}}. \\tag{1}$$\n\nFinite union bound. It is easy to convert the above statement into one which is valid simultaneously for a finite set of functions $\\mathcal{F}$. The simplest form of the union bound gives that with probability at least $1 - \\beta$,\n\n$$\\forall f \\in \\mathcal{F}, \\quad P f - P_n f \\le C \\sqrt{\\frac{\\log |\\mathcal{F}| + \\log(1/\\beta)}{n}}. \\tag{2}$$\n\nSymmetrization. 
When $\\mathcal{F}$ is infinite, the trick is to introduce the second sample $Z'_1, \\dots, Z'_n$ and to consider the set of vectors formed by the values of each function in $\\mathcal{F}$ on the double sample. When the functions have values in $\\{0, 1\\}$, this is a finite set and the above union bound applies. This idea was first used by Vapnik and Chervonenkis [5] to obtain that with probability at least $1 - \\beta$,\n\n$$\\forall f \\in \\mathcal{F}, \\quad P f - P_n f \\le C \\sqrt{\\frac{\\log E_{2n} N(\\mathcal{F}, 1/n, d_{2n}) + \\log(1/\\beta)}{n}}. \\tag{3}$$\n\nWeighted union bound and localization. The finite union bound can be directly extended to the countable case by introducing a probability distribution $\\pi$ over $\\mathcal{F}$ which weights each function and gives that with probability at least $1 - \\beta$,\n\n$$\\forall f \\in \\mathcal{F}, \\quad P f - P_n f \\le C \\sqrt{\\frac{\\log(1/\\pi(f)) + \\log(1/\\beta)}{n}}. \\tag{4}$$\n\nIt is interesting to notice that now the bound depends on the actual function $f$ being considered and not just on the set $\\mathcal{F}$. This can thus be called a localized bound.\nVariance. Since the deviations between $P f$ and $P_n f$ for a given function $f$ actually depend on its variance (which is upper bounded by $P f^2 / n$ or $P f / n$ when the functions are in $[0, 1]$), one can refine (1) into\n\n$$P f - P_n f \\le C \\left( \\sqrt{\\frac{P f^2 \\log(1/\\beta)}{n}} + \\frac{\\log(1/\\beta)}{n} \\right), \\tag{5}$$\n\nand combine this improvement with the above union bounds. This was done by Vapnik and Chervonenkis [5] (for functions in $\\{0, 1\\}$).\nAveraging. Consider a probability distribution $\\rho$ defined on a countable $\\mathcal{F}$, take the expectation of (4) with respect to $\\rho$ and use Jensen's inequality. This gives with probability at least $1 - \\beta$,\n\n$$\\forall \\rho, \\quad \\rho(P f - P_n f) \\le C \\sqrt{\\frac{K(\\rho, \\pi) + H(\\rho) + \\log(1/\\beta)}{n}},$$\n\nwhere $H(\\rho)$ is the Shannon entropy. The l.h.s. 
is the difference between true and empirical error of a randomized classifier which uses $\\rho$ as weights for choosing the decision function (independently of the data). The PAC-Bayes bound [1] is a refined version of the above bound since it has the form (for possibly uncountable $\\mathcal{F}$)\n\n$$\\forall \\rho, \\quad \\rho(P f - P_n f) \\le C \\sqrt{\\frac{K(\\rho, \\pi) + \\log n + \\log(1/\\beta)}{n}}. \\tag{6}$$\n\nTo some extent, one can consider that the PAC-Bayes bound is a refined union bound where the gain happens when $\\rho$ is not concentrated on a single function (or more precisely $\\rho$ has entropy larger than $\\log n$).\nRademacher averages. The quantity $E_n E_\\sigma \\sup_{f \\in \\mathcal{F}} \\frac{1}{n} \\sum_{i=1}^n \\sigma_i f(Z_i)$, where the $\\sigma_i$ are independent random signs ($+1$, $-1$ with probability $1/2$), called the Rademacher average for $\\mathcal{F}$, is, up to a constant, equal to $E_n \\sup_{f \\in \\mathcal{F}} (P f - P_n f)$, which means that it best captures the complexity of $\\mathcal{F}$. One has with probability $1 - \\beta$,\n\n$$\\forall f \\in \\mathcal{F}, \\quad P f - P_n f \\le C \\left( \\frac{1}{n} E_n E_\\sigma \\sup_{f \\in \\mathcal{F}} \\sum_{i=1}^n \\sigma_i f(Z_i) + \\sqrt{\\frac{\\log(1/\\beta)}{n}} \\right). \\tag{7}$$\n\nChaining. Another direction in which the union bound can be refined is by considering finite covers of the set of functions at different scales. This is called the chaining technique, pioneered by Dudley (see e.g. [6]), since one constructs a chain of functions that approximate a given function more and more closely. The results involve the Koltchinskii-Pollard entropy integral as, for example in [7], with probability $1 - \\beta$,\n\n$$\\forall f \\in \\mathcal{F}, \\quad P f - P_n f \\le C \\left( \\frac{1}{\\sqrt{n}} E_n \\int_0^{\\infty} \\sqrt{\\log N(\\mathcal{F}, \\epsilon, d_n)} \\, d\\epsilon + \\sqrt{\\frac{\\log(1/\\beta)}{n}} \\right). \\tag{8}$$\n\nGeneric chaining. It has been noticed by Fernique and Talagrand that it is possible to capture the complexity in a better way than using minimal covers by considering majorizing measures (essentially optimal for Gaussian processes). 
Let $r > 0$ and $(A_j)_{j \\ge 1}$ be partitions of $\\mathcal{F}$ of diameter $r^{-j}$ w.r.t. the distance $d_n$ such that $A_{j+1}$ refines $A_j$. Using (7) and techniques from [2] we obtain that with probability $1 - \\beta$, $\\forall f \\in \\mathcal{F}$\n\n$$P f - P_n f \\le C \\left( \\frac{1}{\\sqrt{n}} E_n \\inf_{\\pi \\in \\mathcal{M}^1_+(\\mathcal{F})} \\sup_{f \\in \\mathcal{F}} \\sum_{j=1}^{\\infty} r^{-j} \\sqrt{\\log(1/\\pi A_j(f))} + \\sqrt{\\frac{\\log(1/\\beta)}{n}} \\right). \\tag{9}$$\n\nIf one takes partitions induced by minimal covers of $\\mathcal{F}$ at radii $r^{-j}$, one recovers (8) up to a constant.\nConcentration. Using concentration inequalities as in [3] for example, one can get rid of the expectation appearing in the r.h.s. of (3), (8), (7) or (9) and thus obtain a bound that can be computed from the data.\n\nRefining the bound (7) is possible as one can localize it (see e.g. [8]) by computing the Rademacher average only on a small ball around the function of interest. So this comes close to combining all improvements. However it has not been combined with the PAC-Bayes improvement. Our goal is to try and combine all the above improvements.\n\n3 Main results\n\nLet $\\mathcal{F}$ be as defined in section 2 with $a = 0$, $b = 1$ and $\\pi \\in \\mathcal{M}^1_+(\\mathcal{F})$. Instead of using partitions as in (9) we use approximating sets (which also induce partitions but are easier to handle here). Consider a sequence $S_j$ of embedded finite subsets of $\\mathcal{F}$: $\\{f_0\\} \\triangleq S_0 \\subset \\cdots \\subset S_{j-1} \\subset S_j \\subset \\cdots$.\nLet $p_j : \\mathcal{F} \\to S_j$ be maps (which can be thought of as projections) satisfying $p_j(f) = f$ for $f \\in S_j$ and $p_{j-1} \\circ p_j = p_{j-1}$.\nThe quantities $\\pi$, $S_j$ and $p_j$ are allowed to depend on $X_1^{2n}$ in an exchangeable way (i.e. exchanging $X_i$ and $X'_i$ does not affect their value). For a probability distribution $\\rho$ on $\\mathcal{F}$, define its $j$-th projection as $\\rho_j = \\sum_{f \\in S_j} \\rho\\{f' : p_j(f') = f\\} \\delta_f$, where $\\delta_f$ denotes the Dirac measure on $f$. 
To shorten notations, we denote the average distance between two successive \u201cprojections\u201d by $\\rho d_j^2 \\triangleq \\rho d_{2n}^2[p_j(f), p_{j-1}(f)]$. Finally, let $\\Delta_{n,j}(f) \\triangleq P'_n[f - p_j(f)] - P_n[f - p_j(f)]$.\nTheorem 1 If the following condition holds\n\n$$\\lim_{j \\to +\\infty} \\sup_{f \\in \\mathcal{F}} \\Delta_{n,j}(f) = 0, \\quad \\text{a.s.} \\tag{10}$$\n\nthen for any $0 < \\beta < 1/2$, with probability at least $1 - \\beta$, for any distribution $\\rho$, we have\n\n$$\\rho P'_n f - P'_n f_0 \\le \\rho P_n f - P_n f_0 + 5 \\sum_{j=1}^{+\\infty} \\sqrt{\\frac{\\rho d_j^2 \\, K(\\rho_j, \\pi_j)}{n}} + \\frac{1}{\\sqrt{n}} \\sum_{j=1}^{+\\infty} \\chi_j(\\rho d_j^2),$$\n\nwhere $\\chi_j(x) = 4 \\sqrt{x \\log\\left(4 j^2 \\beta^{-1} \\log(e^2/x)\\right)}$.\nRemark 1 Assumption (10) is not very restrictive. For instance, it is satisfied when $\\mathcal{F}$ is finite, or when $\\lim_{j \\to +\\infty} \\sup_{f \\in \\mathcal{F}} |f - p_j(f)| = 0$ almost surely, or also when the empirical process $[f \\mapsto P f - P_n f]$ is uniformly continuous (which happens for classes with finite VC dimension in particular) and $\\lim_{j \\to +\\infty} \\sup_{f \\in \\mathcal{F}} d_{2n}(f, p_j(f)) = 0$.\nRemark 2 Let $\\mathcal{G}$ be a model (i.e. a set of prediction functions). Let $\\tilde{g}$ be a reference function (not necessarily in $\\mathcal{G}$). Consider the class of functions $\\mathcal{F} = \\{z \\mapsto$ 
$L[g(x), y] : g \\in \\mathcal{G} \\cup \\{\\tilde{g}\\}\\}$. Let $f_0 = L[\\tilde{g}(x), y]$. The previous theorem compares the risk on the second sample of any (randomized) estimator with the risk on the second sample of the reference function $\\tilde{g}$.\n\nNow let us give a version of the previous theorem in which the second sample does not appear.\n\nTheorem 2 If the following condition holds\n\n$$\\lim_{j \\to +\\infty} \\sup_{f \\in \\mathcal{F}} E'_n[\\Delta_{n,j}(f)] = 0, \\quad \\text{a.s.} \\tag{11}$$\n\nthen for any $0 < \\beta < 1/2$, with probability at least $1 - \\beta$, for any distribution $\\rho$, we have\n\n$$\\rho P f - P f_0 \\le \\rho P_n f - P_n f_0 + 5 \\sum_{j=1}^{+\\infty} \\sqrt{\\frac{E'_n[\\rho d_j^2] \\, E'_n[K(\\rho_j, \\pi_j)]}{n}} + \\frac{1}{\\sqrt{n}} \\sum_{j=1}^{+\\infty} \\chi_j\\left(E'_n[\\rho d_j^2]\\right).$$\n\n4 Discussion\n\nWe now discuss in which sense the result presented above combines several previous improvements in a single bound.\nNotice that our bound is localized in the sense that it depends on the function of interest (or rather on the averaging distribution $\\rho$) and does not involve a supremum over the class. Also, the union bound is performed in an optimal way since, if one plugs in a distribution $\\rho$ concentrated on a single function, takes a supremum over $\\mathcal{F}$ in the r.h.s., and upper bounds the squared distance by the diameter of the partition, one recovers a result similar to (9) up to logarithmic factors but which is localized. Also, when two successive projections are identical, they do not enter in the bound (which comes from the fact that the variance weights the complexity terms). 
Moreover Theorem 1 also includes the PAC-Bayesian improvement for averaging classifiers since if one considers the set $S_1 = \\mathcal{F}$ one recovers a result similar to McAllester's (6) which in addition contains the variance improvement such as in [9].\nFinally due to the power of the generic chaining, it is possible to upper bound our result by Rademacher averages, up to logarithmic factors (using the results of [10] and [11]).\nAs a remark, the choice of the sequence of sets $S_j$ can generally be done by taking successive covers of the hypothesis space with geometrically decreasing radii.\n\nHowever, the obtained bound is not completely empirical since it involves the expectation with respect to an extra sample. In the transduction setting, this is not an issue; it is even an advantage as one can use the unlabeled data in the computation of the bound. However, in the induction setting, this is a drawback. Future work will focus on using concentration inequalities to give a fully empirical bound.\n\n5 Proofs\n\nProof of Theorem 1: The proof is inspired by previous works on PAC-Bayesian bounds [12, 13] and on the generic chaining [2]. We first prove the following lemma.\n\nLemma 1 For any $\\beta > 0$, $\\lambda > 0$, $j \\in \\mathbb{N}^*$ and any exchangeable function $\\pi : \\mathcal{X}^{2n} \\to \\mathcal{M}^1_+(\\mathcal{F})$, with probability at least $1 - \\beta$, for any probability distribution $\\rho \\in \\mathcal{M}^1_+(\\mathcal{F})$, we have\n\n$$\\rho\\{P'_n[p_j(f) - p_{j-1}(f)] - P_n[p_j(f) - p_{j-1}(f)]\\} \\le \\frac{2\\lambda}{n} \\rho d_{2n}^2[p_j(f), p_{j-1}(f)] + \\frac{K(\\rho, \\pi) + \\log(\\beta^{-1})}{\\lambda}.$$\n\nProof Let $\\lambda > 0$ and let $\\pi : \\mathcal{X}^{2n} \\to \\mathcal{M}^1_+(\\mathcal{F})$ be an exchangeable function. Introduce the quantity $\\Delta_i \\triangleq p_j(f)(Z_{n+i}) - p_{j-1}(f)(Z_{n+i}) + p_{j-1}(f)(Z_i) - p_j(f)(Z_i)$ and\n\n$$h \\triangleq \\lambda P'_n[p_j(f) - p_{j-1}(f)] - \\lambda P_n[p_j(f) - p_{j-1}(f)] - \\frac{2\\lambda^2}{n} d_{2n}^2[p_j(f), p_{j-1}(f)]. \\tag{12}$$\n\nBy using the exchangeability of $\\pi$, for any $\\sigma \\in \\{-1, +1\\}^n$, we have\n\n$$E_{2n} \\pi e^h = E_{2n} \\pi \\exp\\left(-\\frac{2\\lambda^2}{n} d_{2n}^2[p_j(f), p_{j-1}(f)] + \\frac{\\lambda}{n} \\sum_{i=1}^n \\Delta_i\\right) = E_{2n} \\pi \\exp\\left(-\\frac{2\\lambda^2}{n} d_{2n}^2[p_j(f), p_{j-1}(f)] + \\frac{\\lambda}{n} \\sum_{i=1}^n \\sigma_i \\Delta_i\\right).$$\n\nNow take the expectation w.r.t. $\\sigma$, where $\\sigma$ is an $n$-dimensional vector of Rademacher variables. We obtain\n\n$$E_{2n} \\pi e^h = E_{2n} \\pi \\exp\\left(-\\frac{2\\lambda^2}{n} d_{2n}^2[p_j(f), p_{j-1}(f)]\\right) \\prod_{i=1}^n \\cosh\\left(\\frac{\\lambda}{n} \\Delta_i\\right) \\le E_{2n} \\pi \\exp\\left(-\\frac{2\\lambda^2}{n} d_{2n}^2[p_j(f), p_{j-1}(f)]\\right) \\exp\\left(\\sum_{i=1}^n \\frac{\\lambda^2}{2n^2} \\Delta_i^2\\right),$$\n\nwhere at the last step we use that $\\cosh s \\le e^{s^2/2}$. Since\n\n$$\\Delta_i^2 \\le 2[p_j(f)(Z_{n+i}) - p_{j-1}(f)(Z_{n+i})]^2 + 2[p_j(f)(Z_i) - p_{j-1}(f)(Z_i)]^2,$$\n\nwe obtain that for any $\\lambda > 0$, $E_{2n} \\pi e^h \\le 1$. Therefore, for any $\\beta > 0$, we have\n\n$$E_{2n} I_{\\log \\pi e^{h + \\log \\beta} > 0} = E_{2n} I_{\\pi e^{h + \\log \\beta} > 1} \\le E_{2n} \\pi e^{h + \\log \\beta} \\le \\beta. \\tag{13}$$\n\nOn the event $\\{\\log \\pi e^{h + \\log \\beta} \\le 0\\}$, by the Legendre transform, for any probability distribution $\\rho \\in \\mathcal{M}^1_+(\\mathcal{F})$, we have\n\n$$\\rho h + \\log \\beta \\le \\log \\pi e^{h + \\log \\beta} + K(\\rho, \\pi) \\le K(\\rho, \\pi), \\tag{14}$$\n\nwhich proves the lemma.\nNow let us apply this result to the projected measures $\\pi_j$ and $\\rho_j$. Since, by definition, $\\pi$, $S_j$ and $p_j$ are exchangeable, $\\pi_j$ is also exchangeable. 
Since $p_j(f) = f$ for any $f \\in S_j$, with probability at least $1 - \\beta$, uniformly in $\\rho$, we have\n\n$$\\rho_j\\{P'_n[f - p_{j-1}(f)] - P_n[p_j(f) - p_{j-1}(f)]\\} \\le \\frac{2\\lambda}{n} \\rho_j d_{2n}^2[f, p_{j-1}(f)] + \\frac{K'_j}{\\lambda},$$\n\nwhere $K'_j \\triangleq K(\\rho_j, \\pi_j) + \\log(\\beta^{-1})$. By definition of $\\rho_j$, it implies that\n\n$$\\rho\\{P'_n[p_j(f) - p_{j-1}(f)] - P_n[p_j(f) - p_{j-1}(f)]\\} \\le \\frac{2\\lambda}{n} \\rho d_{2n}^2[p_j(f), p_{j-1}(f)] + \\frac{K'_j}{\\lambda}. \\tag{15}$$\n\nTo shorten notations, define $\\rho d_j^2 \\triangleq \\rho d_{2n}^2[p_j(f), p_{j-1}(f)]$ and $\\rho \\Delta_j \\triangleq \\rho\\{P'_n[p_j(f) - p_{j-1}(f)] - P_n[p_j(f) - p_{j-1}(f)]\\}$. The parameter $\\lambda$ minimizing the RHS of the previous equation depends on $\\rho$. Therefore, we need to get a version of this inequality which holds uniformly in $\\lambda$.\nFirst let us note that when $\\rho d_j^2 = 0$, we have $\\rho \\Delta_j = 0$. When $\\rho d_j^2 > 0$, let $m \\triangleq \\sqrt{\\frac{n \\log 2}{2 \\rho d_j^2}}$ and $\\lambda_k = m e^{k/2}$ and let $b$ be a function from $\\mathbb{R}^*$ to $(0, 1]$ such that $\\sum_{k \\ge 1} b(\\lambda_k) \\le 1$. From the previous lemma and a union bound, we obtain that for any $\\beta > 0$ and any integer $j$, with probability at least $1 - \\beta$, for any $k \\in \\mathbb{N}^*$ and any distribution $\\rho$, we have\n\n$$\\rho \\Delta_j \\le \\frac{2\\lambda_k}{n} \\rho d_j^2 + \\frac{K(\\rho_j, \\pi_j) + \\log\\left([b(\\lambda_k)]^{-1} \\beta^{-1}\\right)}{\\lambda_k}.$$\n\nLet us take the function $b$ such that $\\lambda \\mapsto \\frac{\\log([b(\\lambda)]^{-1})}{\\lambda}$ is continuous and decreasing. Then there exists a parameter $\\lambda^* > 0$ such that $\\frac{2\\lambda^*}{n} \\rho d_j^2 = \\frac{K(\\rho_j, \\pi_j) + \\log([b(\\lambda^*)]^{-1} \\beta^{-1})}{\\lambda^*}$. For any $\\beta < 1/2$, we have $(\\lambda^*)^2 \\rho d_j^2 \\ge \\frac{\\log 2}{2} n$; hence $\\lambda^* \\ge m$. So there exists an integer $k \\in \\mathbb{N}^*$ such that $\\lambda_k e^{-1/2} \\le \\lambda^* \\le \\lambda_k$. Then we have\n\n$$\\rho \\Delta_j \\le \\frac{2\\lambda^*}{n} \\sqrt{e} \\, \\rho d_j^2 + \\frac{K(\\rho_j, \\pi_j) + \\log([b(\\lambda^*)]^{-1} \\beta^{-1})}{\\lambda^*} = (1 + \\sqrt{e}) \\sqrt{\\frac{2}{n} \\rho d_j^2 \\left[K(\\rho_j, \\pi_j) + \\log([b(\\lambda^*)]^{-1} \\beta^{-1})\\right]}. \\tag{16}$$\n\nTo have an explicit bound, it remains to find an upper bound of $[b(\\lambda^*)]^{-1}$. When $b$ is decreasing, this comes down to upper bounding $\\lambda^*$. Let us choose $b(\\lambda) = \\frac{1}{[\\log(e^2 \\lambda / m)]^2}$ when $\\lambda \\ge m$ and $b(\\lambda) = 1/4$ otherwise. Since $b(\\lambda_k) = \\frac{4}{(k+4)^2}$, we have $\\sum_{k \\ge 1} b(\\lambda_k) \\le 1$. Tedious computations give $\\lambda^* \\le 7 m \\sqrt{K'_j}$ which, combined with (16), yield\n\n$$\\rho \\Delta_j \\le 5 \\sqrt{\\frac{\\rho d_j^2 \\, K(\\rho_j, \\pi_j)}{n}} + 3.75 \\sqrt{\\frac{\\rho d_j^2}{n} \\log\\left(2 \\beta^{-1} \\log\\left[\\frac{e^2}{\\rho d_j^2}\\right]\\right)}.$$\n\nBy simply using a union bound with weights taken proportional to $1/j^2$, we have that the previous inequality holds uniformly in $j \\in \\mathbb{N}^*$ provided that $\\beta^{-1}$ is replaced with $\\frac{\\pi^2}{6} j^2 \\beta^{-1}$ (since $\\sum_{j \\in \\mathbb{N}^*} 1/j^2 = \\pi^2/6 \\approx 1.64$). Notice that\n\n$$\\rho[P'_n f - P'_n f_0 + P_n f_0 - P_n f] = \\rho \\Delta_{n,J}(f) + \\sum_{j=1}^{J} \\rho_j[(P'_n - P_n) f - (P'_n - P_n) p_{j-1}(f)]$$\n\nbecause $p_{j-1} = p_{j-1} \\circ p_j$. So, with probability at least $1 - \\beta$, for any distribution $\\rho$, we have\n\n$$\\rho[P'_n f - P'_n f_0 + P_n f_0 - P_n f] \\le \\sup_{\\mathcal{F}} \\Delta_{n,J} + 5 \\sum_{j=1}^{J} \\sqrt{\\frac{\\rho d_j^2 \\, K(\\rho_j, \\pi_j)}{n}} + 3.75 \\sum_{j=1}^{J} \\sqrt{\\frac{\\rho d_j^2}{n} \\log\\left(3.3 j^2 \\beta^{-1} \\log\\left[\\frac{e^2}{\\rho d_j^2}\\right]\\right)}.$$\n\nMaking $J \\to +\\infty$, we obtain theorem 1. $\\square$\nProof of Theorem 2: It suffices to modify slightly the proof of theorem 1. Introduce $U \\triangleq \\sup_\\rho \\{\\rho h + \\log \\beta - K(\\rho, \\pi)\\}$, where $h$ is still defined as in equation (12). Inequality (14) implies that $E_{2n} e^U \\le \\beta$. By Jensen's inequality, we get $E_n e^{E'_n U} \\le \\beta$; hence $E_n I_{\\{E'_n U \\ge 0\\}} \\le \\beta$. So with probability at least $1 - \\beta$, we have $\\sup_\\rho E'_n\\{\\rho h + \\log \\beta - K(\\rho, \\pi)\\} \\le E'_n U \\le 0$. $\\square$\n\n6 Conclusion\n\nWe have obtained a generalization error bound for randomized classifiers which combines several previous improvements. It contains an optimal union bound, both in the sense of optimally taking into account the metric structure of the set of functions (via the majorizing measure approach) and in the sense of taking into account the averaging distribution. We believe that this is a very natural way of combining these two aspects as the result relies on the comparison of a majorizing measure which can be thought of as a prior probability distribution and a randomization distribution which can be considered as a posterior distribution.\nFuture work will focus on giving a totally empirical bound (in the induction setting) and investigating possible constructions for the approximating sets $S_j$.\n\nReferences\n\n[1] D. A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the 11th Annual Conference on Computational Learning Theory, pages 230\u2013234. ACM Press, 1998.\n\n[2] M. Talagrand. Majorizing measures: The generic chaining. Annals of Probability, 24(3):1049\u20131103, 1996.\n\n[3] S. Boucheron, G. Lugosi, and P. Massart. A sharp concentration inequality with applications. Random Structures and Algorithms, 16:277\u2013292, 2000.\n\n[4] D. A. McAllester. PAC-Bayesian model averaging. In Proceedings of the 12th Annual Conference on Computational Learning Theory. ACM Press, 1999.\n\n[5] V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition [in Russian]. Nauka, Moscow, 1974. (German Translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie\u2013Verlag, Berlin, 1979).\n\n[6] R. M. Dudley. A course on empirical processes. 
Lecture Notes in Mathematics, 1097:2\u2013142, 1984.\n\n[7] L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation. Springer Series in Statistics. Springer Verlag, New York, 2001.\n\n[8] P. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Preprint, 2003.\n\n[9] D. A. McAllester. Simplified PAC-Bayesian margin bounds. In Proceedings of Computational Learning Theory (COLT), 2003.\n\n[10] M. Ledoux and M. Talagrand. Probability in Banach spaces. Springer-Verlag, Berlin, 1991.\n\n[11] M. Talagrand. The Glivenko-Cantelli problem. Annals of Probability, 6:837\u2013870, 1987.\n\n[12] O. Catoni. Localized empirical complexity bounds and randomized estimators, 2003. Preprint.\n\n[13] J.-Y. Audibert. Data-dependent generalization error bounds for (noisy) classification: a PAC-Bayesian approach. 2003. Work in progress.", "award": [], "sourceid": 2387, "authors": [{"given_name": "Jean-yves", "family_name": "Audibert", "institution": null}, {"given_name": "Olivier", "family_name": "Bousquet", "institution": null}]}