{"title": "From PAC-Bayes Bounds to KL Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 603, "page_last": 610, "abstract": "We show that convex KL-regularized objective functions are obtained from a PAC-Bayes risk bound when using convex loss functions for the stochastic Gibbs classifier that upper-bound the standard zero-one loss used for the weighted majority vote. By restricting ourselves to a class of posteriors, that we call quasi uniform, we propose a simple coordinate descent learning algorithm to minimize the proposed KL-regularized cost function. We show that standard ell_p-regularized objective functions currently used, such as ridge regression and ell_p-regularized boosting, are obtained from a relaxation of the KL divergence between the quasi uniform posterior and the uniform prior. We present numerical experiments where the proposed learning algorithm generally outperforms ridge regression and AdaBoost.", "full_text": "From PAC-Bayes Bounds to KL Regularization\n\nPascal Germain, Alexandre Lacasse, Franc\u00b8ois Laviolette, Mario Marchand, Sara Shanian\n\nDepartment of Computer Science and Software Engineering\n\nLaval University, Qu\u00b4ebec (QC), Canada\n\nfirstname.secondname@ift.ulaval.ca\n\nAbstract\n\nWe show that convex KL-regularized objective functions are obtained from a\nPAC-Bayes risk bound when using convex loss functions for the stochastic Gibbs\nclassi\ufb01er that upper-bound the standard zero-one loss used for the weighted ma-\njority vote. By restricting ourselves to a class of posteriors, that we call quasi\nuniform, we propose a simple coordinate descent learning algorithm to minimize\nthe proposed KL-regularized cost function. We show that standard (cid:96)p-regularized\nobjective functions currently used, such as ridge regression and (cid:96)p-regularized\nboosting, are obtained from a relaxation of the KL divergence between the quasi\nuniform posterior and the uniform prior. We present numerical experiments where\nthe proposed learning algorithm generally outperforms ridge regression and Ada-\nBoost.\n\n1 Introduction\n\nWhat should a learning algorithm optimize on the training data in order to give classi\ufb01ers having the\nsmallest possible true risk? Many different speci\ufb01cations of what should be optimized on the train-\ning data have been provided by using different inductive principles. But the universally accepted\nguarantee on the true risk, however, always comes with a so-called risk bound that holds uniformly\nover a set of classi\ufb01ers. Since a risk bound can be computed from what a classi\ufb01er achieves on the\ntraining data, it automatically suggests that learning algorithms should \ufb01nd a classi\ufb01er that mini-\nmizes a tight risk (upper) bound.\nAmong the data-dependent bounds that have been proposed recently, the PAC-Bayes bounds [6, 8,\n4, 1, 3] seem to be especially tight. These bounds thus appear to be a good starting point for the\ndesign of a bound-minimizing learning algorithm. In that respect, [4, 5, 3] have proposed to use\nisotropic Gaussian posteriors over the space of linear classi\ufb01ers. But a computational drawback of\nthis approach is the fact the Gibbs empirical risk is not a quasi-convex function of the parameters\nof the posterior. Consequently, the resultant PAC-Bayes bound may have several local minima for\ncertain data sets\u2014thus giving an intractable optimization problem in the general case. To avoid\nsuch computational problems, we propose here to use convex loss functions for stochastic Gibbs\nclassi\ufb01ers that upper-bound the standard zero-one loss used for the weighted majority vote. By\nrestricting ourselves to a class of posteriors, that we call quasi uniform, we propose a simple coor-\ndinate descent learning algorithm to minimize the proposed KL-regularized cost function. We show\nthat there are no loss of discriminative power by restricting the posterior to be quasi uniform. We\nalso show that standard (cid:96)p-regularized objective functions currently used, such as ridge regression\nand (cid:96)p-regularized boosting, are obtained from a relaxation of the KL divergence between the quasi\nuniform posterior and the uniform prior. We present numerical experiments where the proposed\nlearning algorithm generally outperforms ridge regression and AdaBoost [7].\n\n1\n\n\f2 Basic De\ufb01nitions\nWe consider binary classi\ufb01cation problems where the input space X consists of an arbitrary subset\nof Rd and the output space Y = {\u22121, +1}. An example is an input-output (x, y) pair where x \u2208 X\nand y \u2208 Y. Throughout the paper, we adopt the PAC setting where each example (x, y) is drawn\naccording to a \ufb01xed, but unknown, distribution D on X \u00d7 Y.\nThe risk R(h) of any classi\ufb01er h: X \u2192 Y is de\ufb01ned as the probability that h misclassi\ufb01es an\nexample drawn according to D. Given a training set S of m examples, the empirical risk RS(h) of\nany classi\ufb01er h is de\ufb01ned by the frequency of training errors of h on S. Hence\n\nR(h) def=\n\nE\n\n(x,y)\u223cD\n\nI(h(x) (cid:54)= y)\n\n; RS(h) def=\n\n1\nm\n\nI(h(xi) (cid:54)= yi) ,\n\nm(cid:88)\n\ni=1\n\nwhere I(a) = 1 if predicate a is true and 0 otherwise.\nAfter observing the training set S, the task of the learner is to choose a posterior distribution Q\nover a space H of classi\ufb01ers such that the Q-weighted majority vote classi\ufb01er BQ will have the\nsmallest possible risk. On any input example x, the output BQ(x) of the majority vote classi\ufb01er BQ\n(sometimes called the Bayes classi\ufb01er) is given by\n\n(cid:20)\n\n(cid:21)\n\nBQ(x) def= sgn\n\nE\nh\u223cQ\n\nh(x)\n\n,\n\nwhere sgn(s) = +1 if s > 0 and sgn(s) = \u22121 otherwise. The output of the deterministic majority\nvote classi\ufb01er BQ is closely related to the output of a stochastic classi\ufb01er called the Gibbs classi\ufb01er\nGQ. To classify an input example x, the Gibbs classi\ufb01er GQ chooses randomly a (deterministic)\nclassi\ufb01er h according to Q to classify x. The true risk R(GQ) and the empirical risk RS(GQ) of the\nGibbs classi\ufb01er are thus given by\n\nR(GQ) = E\nh\u223cQ\n\nR(h)\n\n; RS(GQ) = E\nh\u223cQ\n\nRS(h) .\n\nAny bound for R(GQ) can straightforwardly be turned into a bound for the risk of the majority vote\nR(BQ). Indeed, whenever BQ misclassi\ufb01es x, at least half of the classi\ufb01ers (under measure Q)\nmisclassi\ufb01es x. It follows that the error rate of GQ is at least half of the error rate of BQ. Hence\nR(BQ) \u2264 2R(GQ). As shown in [5], this factor of 2 can sometimes be reduced to (1 + \u0001).\n\n3 PAC-Bayes Bounds and General Loss Functions\n\n(cid:20)\n\n(cid:104)\n\n1\nm\n\n(cid:105)(cid:21)(cid:19)\n\n1\n\u03b4\n\n(cid:18)\n\nIn this paper, we use the following PAC-Bayes bound which is obtained directly from Theorem 1.2.1\nof [1] and Corollary 2.2 of [3] by using 1 \u2212 exp(\u2212x) \u2264 x \u2200x \u2208 R.\nTheorem 3.1. For any distribution D, any set H of classi\ufb01ers, any distribution P of support H,\nany \u03b4 \u2208 (0, 1], and any positive real number C(cid:48), we have\n\nPr\n\nS\u223cDm\n\n\u2200 Q onH: R(GQ) \u2264\n\n1\n\n1 \u2212 e\u2212C(cid:48)\n\nC(cid:48)\u00b7RS(GQ) +\n\nKL(Q(cid:107)P ) + ln\n\n\u2265 1 \u2212 \u03b4 ,\n\nwhere KL(Q(cid:107)P ) def= E\nh\u223cQ\n\nln Q(h)\n\nP (h) is the Kullback-Leibler divergence between Q and P .\n\nNote that the dependence on Q of the upper bound on R(GQ) is realized via Gibbs\u2019 empirical risk\nRS(GQ) and the PAC-Bayes regularizer KL(Q(cid:107)P ). As in boosting, we focus on the case where\nthe a priori de\ufb01ned class H consists (mostly) of \u201cweak\u201d classi\ufb01ers having large risk R(h) . In this\ncase, R(GQ) is (almost) always large (near 1/2) for any Q even if the majority vote BQ has null\nrisk. In this case the disparity between R(BQ) and R(GQ) is enormous and the upper-bound on\nR(GQ) has very little relevance with R(BQ). On way to obtain a more relevant bound on R(BQ)\nfrom PAC-Bayes theory is to use a loss function \u03b6Q(x, y) for stochastic classi\ufb01ers which is distinct\nfrom the loss used for the deterministic classi\ufb01ers (the zero-one loss in our case). In order to obtain\na tractable optimization problem for a learning algorithm to solve, we propose here to use a loss\n\u03b6Q(x, y) which is convex in Q and that upper-bounds as closely as possible the zero-one loss of the\ndeterministic majority vote BQ.\n\n2\n\n\fConsider WQ(x, y) def= Eh\u223cQI(h(x) (cid:54)= y), the Q-fraction of binary classi\ufb01ers that err on exam-\nple (x, y). Then, R(GQ) = E(x,y)\u223cD WQ(x, y). Following [2], we consider any non-negative\nconvex loss \u03b6Q(x, y) that can be expanded in a Taylor series around WQ(x, y) = 1/2:\n\n\u03b6Q(x, y) def= 1 +\n\nak (2WQ(x, y) \u2212 1)k = 1 +\n\nak\n\nE\nh\u223cQ\n\n\u2212 yh(x)\n\n,\n\nthat upper bounds the risk of the majority vote BQ, i.e.,\n\n(cid:18)\n\n(cid:19)\n\n1\n2\n\n\u03b6Q(x, y) \u2265 I\n\nWQ(x, y) >\n\n\u2200Q, x, y .\n\nIt has been shown [2] that \u03b6Q(x, y) can be expressed in terms of the risk on example (x, y) of a\nGibbs classi\ufb01er described by a transformed posterior Q on N \u00d7 H\u221e, i.e.,\n\n(cid:18)\n\n\u221e(cid:88)\n\nk=1\n\ndef= (cid:80)\u221e\n\nwhere ca\n\nk=1 |ak| and where\n\n\u03b6Q(x, y) = 1 + ca\n\n(cid:104)\n(cid:105)\n2WQ(x, y) \u2212 1\n(cid:16)\n\n,\n\n(cid:19)k\n\n(cid:17)\n\n\u221e(cid:88)\n\nk=1\n\n\u221e(cid:88)\n\nk=1\n\nWQ(x, y) def=\n\n1\nca\n\n|ak| E\nh1\u223cQ\n\n. . . E\n\nhk\u223cQ\n\nI\n\n(\u2212y)kh1(x) . . . hk(x) = \u2212sgn(ak)\n\n.\n\nSince WQ(x, y) is the expectation of boolean random variable, Theorem 3.1 holds if we replace\n(P, Q) by (P , Q) with R(GQ) def=\ni=1 WQ(xi, yi). More-\nover, it has been shown [2] that\n\nWQ(x, y) and RS(GQ) def= 1\n\n(x,y)\u223cD\n\nE\n\nm\n\nKL(Q(cid:107)P ) = k \u00b7 KL(Q(cid:107)P ) , where k def=\n\n1\nca\n\n|ak| \u00b7 k .\n\n(cid:80)m\n\n\u221e(cid:88)\n\nk=1\n\nIf we de\ufb01ne\n\n\u03b6Q\n\n(cid:99)\u03b6Q\n\ndef=\n\ndef=\n\n\u03b6(x, y) = 1 + ca[2R(GQ) \u2212 1]\n\n\u03b6(xi, yi) = 1 + ca[2RS(GQ) \u2212 1] ,\n\nE\n\nm(cid:88)\n\n(x,y)\u223cD\n1\nm\n\ni=1\n\nTheorem 3.1 gives an upper bound on \u03b6Q and, consequently, on the true risk R(BQ) of the majority\nvote. More precisely, we have the following theorem.\nTheorem 3.2. For any D, any H, any P of support H, any \u03b4 \u2208 (0, 1], any positive real number C(cid:48),\nany loss function \u03b6Q(x, y) de\ufb01ned above, we have\n\n(cid:20)(cid:99)\u03b6Q +\n\n(cid:104)\n\n2ca\nmC(cid:48)\n\nk \u00b7 KL(Q(cid:107)P ) + ln\n\n1\n\u03b4\n\n\u2265 1 \u2212 \u03b4 ,\n\n(cid:105)(cid:21)(cid:19)\n\n(cid:18)\n\nPr\n\n\u2200 Q onH: \u03b6Q \u2264 g(ca, C(cid:48)) +\n\nS\u223cDm\nwhere g(ca, C(cid:48)) def= 1 \u2212 ca + C(cid:48)\n\n1 \u2212 e\u2212C(cid:48)\n1\u2212e\u2212C(cid:48) \u00b7 (ca \u2212 1).\n\nC(cid:48)\n\n4 Bound Minimization Learning Algorithms\n\nThe task of the learner is to \ufb01nd the posterior Q that minimizes the upper bound on \u03b6Q for a \ufb01xed\nloss function given by the coef\ufb01cients {ak}\u221e\nk=1 of the Taylor series expansion for \u03b6Q(x, y). Finding\nQ that minimizes the upper bound given by Theorem 3.2 is equivalent to \ufb01nding Q that minimizes\n\nf(Q) def= C\n\nwhere C def= C(cid:48)/(2cak) .\n\n\u03b6Q(xi, yi) + KL(Q(cid:107)P ) ,\n\nm(cid:88)\n\ni=1\n\n3\n\n\fexp\n\ny\n\nQ(h)h(x)\n\n= exp\n\n[2WQ(x, y) \u2212 1]\n\n.\n\n(cid:32)\n\u2212 1\n\u03b3\n\n(cid:88)\n\nh\u2208H\n\n(cid:32)\n\n(cid:88)\n\nh\u2208H\n\n1\n\u03b3\n\ny\n\n(cid:33)\n\n(cid:33)2\n\n(cid:18) 1\n\n\u03b3\n\n(cid:18) 1\n\n\u03b3\n\n(cid:19)\n\n(cid:19)2\n\nTo compare the proposed learning algorithms with AdaBoost, we will consider, for \u03b6Q(x, y), the\nexponential loss given by\n\nFor this choice of loss, we have ca = e\u03b3\u22121 \u22121 and k = \u03b3\u22121/(1\u2212 e\u2212\u03b3\u22121). Because of its simplicity,\nwe will also consider, for \u03b6Q(x, y), the quadratic loss given by\n\nQ(h)h(x) \u2212 1\n\n=\n\n[1 \u2212 2WQ(x, y)] \u2212 1\n\n.\n\nhas the minimum value of zero for examples having a margin y(cid:80)\n\nFor this choice of loss, we have ca = 2\u03b3\u22121 + \u03b3\u22122 and k = (2\u03b3 + 2)/(2\u03b3 + 1). Note that this loss\n\nh\u2208H Q(h)h(x) = \u03b3.\n\nWith these two choices of loss functions, \u03b6Q(x, y) is convex in Q. Moreover, KL(Q(cid:107)P ) is also\nconvex in Q. Since a sum of convex functions is also convex, it follows that objective function f\nis convex in Q (which has a convex domain). Consequently, f has a single local minimum which\ncoincides with the global minimum. We therefore propose to minimize f coordinate-wise, similarly\nh\u2208H Q(h) =\n1), each coordinate minimization will consist of a transfer of weight from one classi\ufb01er to another.\n\nas it is done for AdaBoost [7]. However, to ensure that Q is a distribution (i.e., that(cid:80)\n\n4.1 Quasi Uniform Posteriors\nWe consider learning algorithms that work in a space H of binary classi\ufb01ers such that for\neach h \u2208 H, the boolean complement of h is also in H. More speci\ufb01cally, we have H =\n{h1, . . . , hn, hn+1, . . . , h2n} where hi(x) = \u2212hn+i(x) \u2200x \u2208 X and \u2200i \u2208 {1, . . . , n}. We thus\nsay that (hi, hn+i) constitutes a boolean complement pair of classi\ufb01ers.\nWe consider a uniform prior distribution P over H, i.e., Pi = 1\nThe posterior distribution Q over H is constrained to be quasi uniform. By this, we mean that\nn \u2200i \u2208 {1, . . . , n}, i.e., the total weight assigned to each boolean complement pair of\nQi + Qi+n = 1\ndef= Qi \u2212 Qi+n \u2200i \u2208 {1, . . . , n}. Then wi \u2208 [\u22121/n, +1/n] \u2200i \u2208\nclassi\ufb01ers is \ufb01xed to 1/n. Let wi\n{1, . . . , n} whereas Qi \u2208 [0, 1/n] \u2200i \u2208 {1, . . . , 2n}.\nFor any quasi uniform Q, the output BQ(x) of the majority vote on any example x is given by\n\n2n \u2200i \u2208 {1, . . . , 2n}.\n\n(cid:16) 2n(cid:88)\n\n(cid:17)\n\n(cid:16) n(cid:88)\n\n(cid:17) def= sgn\n\n(cid:16)\n\n(cid:17)\n\nw \u00b7 h(x)\n\n.\n\nBQ(x) = sgn\n\nQihi(x)\n\n= sgn\n\nwihi(x)\n\ni=1\n\ni=1\n\nConsequently, the set of majority votes BQ over quasi uniform posteriors is isomorphic to the set\nof linear separators with real weights. There is thus no loss of discriminative power if we restrict\nourselves to quasi uniform posteriors.\n\nSince all loss functions that we consider are functions of 2WQ(x, y) \u2212 1 = \u2212y(cid:80)\n\nare thus functions of yw \u00b7 h(x). Hence we will often write \u03b6(yw \u00b7 h(x)) for \u03b6Q(x, y).\nThe basic iteration for the learning algorithm consists of choosing (at random) a boolean com-\nplement pair of classi\ufb01ers, call it (h1, hn+1), and then attempting to change only Q1, Qn+1, w1\naccording to:\n\ni Qihi(x), they\n\nQ1 \u2190 Q1 + \u03b4\n2\n\n; Qn+1 \u2190 Qn+1 \u2212 \u03b4\n2\n\n; w1 \u2190 w1 + \u03b4 ,\n\n(1)\n\nfor some optimally chosen value of \u03b4.\nLet Q\u03b4 and w\u03b4 be, respectively, the new posterior and the new weight vector obtained with such a\nchange. The above-mentioned convex properties of objective function f imply that we only need to\nlook for the value of \u03b4\u2217 satisfying\n\n= 0 .\n\n(2)\n\ndf(Q\u03b4)\n\nd\u03b4\n\n4\n\n\fIf w1 + \u03b4\u2217 > 1/n, then w1 \u2190 1/n, Q1 \u2190 1/n, Qn+1 \u2190 0. If w1 + \u03b4\u2217 < \u22121/n, then w1 \u2190\n\u22121/n, Q1 \u2190 0, Qn+1 \u2190 1/n. Otherwise, we accept the change described by Equation 1 with\n\u03b4 = \u03b4\u2217.\nFor objective function f we simply have\n\n+ dKL(Q\u03b4(cid:107)P )\n\n,\n\nd\u03b4\n\n(cid:18)\n\nQ1 + \u03b4\n2\n\n+\n\nQn+1 \u2212 \u03b4\n2\n\n(cid:35)\n\n(cid:19)\n\nln\n\nQn+1 \u2212 \u03b4\n\n2\n\n1\n2n\n\nd(cid:100)\u03b6Q\u03b4\n\nd\u03b4\n\ndf(Q\u03b4)\n\nd\u03b4\n\n= Cm\n\n(cid:34)(cid:18)\n(cid:19)\n(cid:20) Q1 + \u03b4/2\n\nQ1 + \u03b4\n2\n\nln\n\n(cid:21)\n\nQn+1 \u2212 \u03b4/2\n\n.\n\n1\n2n\n\nm(cid:88)\n\ni=1\n\nwhere\n\ndKL(Q\u03b4(cid:107)P )\n\nd\u03b4\n\n= d\nd\u03b4\n1\n2\nFor the quadratic loss, we \ufb01nd\n\n=\n\nm\n\nwhere\n\n=\n\n2m\u03b4\n\u03b32 +\n\n2\n\u03b32\n\nDql\n\nw(i)yih1(xi) ,\n\n(3)\n\n(4)\n\n(5)\n\n(6)\n\n(7)\n\n(8)\n\n(9)\n\nln\n\nd(cid:100)\u03b6Q\u03b4\n\nd\u03b4\n\nm(cid:88)\nm(cid:88)\n\nConsequently, for the quadratic loss case, the optimal value \u03b4\u2217 satis\ufb01es\n\nDql\n\nw(i)\n\ndef= yiw \u00b7 h(xi) \u2212 \u03b3 .\n\n2Cm\u03b4\n\u03b32 +\n\n2C\n\u03b32\n\nDql\n\nw(i)yih1(xi) +\n\n1\n2\n\nln\n\n(cid:21)\n\n= 0 .\n\ni=1\nFor the exponential loss, we \ufb01nd\n\nd(cid:100)\u03b6Q\u03b4\n\nd\u03b4\n\nm\n\nwhere\n\n= e\u03b4/\u03b3\n\u03b3\n\nw(i)I(h1(xi) (cid:54)= yi) \u2212 e\u2212\u03b4/\u03b3\n\nDel\n\ni=1\n\nDel\n\nw(i)\n\ndef= exp\n\n(cid:18)\n\u2212 1\n\u03b3\n\nConsequently, for the exponential loss case, the optimal value \u03b4\u2217 satis\ufb01es\n\nQn+1 \u2212 \u03b4/2\n\n(cid:20) Q1 + \u03b4/2\nm(cid:88)\n(cid:19)\n\nDel\n\ni=1\n\n\u03b3\n\nyiw \u00b7 h(xi)\n\n.\n\nw(i)I(h1(xi) = yi) ,\n\nm(cid:88)\n\ni=1\n\nCe\u03b4/\u03b3\n\n\u03b3\n\nw(i)I(h1(xi) (cid:54)= yi)\nDel\nm(cid:88)\n\u2212 Ce\u2212\u03b4/\u03b3\n\n\u03b3\n\ni=1\n\nDel\n\nw(i)I(h1(xi) = yi) +\n\n(cid:20) Q1 + \u03b4/2\n\nQn+1 \u2212 \u03b4/2\n\n(cid:21)\n\n1\n2\n\nln\n\n= 0 .\n\n(10)\n\n\u2212 1\n\nw(i) + yih1(xi)\u03b4\nw(i)e\n\u03b3 yih1(xi)\u03b4\n\n(quadratic loss case)\n(exponential loss case) .\n\nAfter changing w1, we need to recompute1 Dw(i) \u2200i \u2208 {1, . . . , m}. This can be done with the\nfollowing update rules.\nw(i) \u2190 Dql\nDql\nw(i) \u2190 Del\nDel\nSince, initially we have\nw(i) = \u2212\u03b3 \u2200i \u2208 {1, . . . , m} (quadratic loss case)\nDql\nw(i) = 1 \u2200i \u2208 {1, . . . , m} (exponential loss case) ,\nDel\n\n(13)\n(14)\nthe dot product present in Equations 6 and 9 never needs to be computed. Consequently, updating\nDw takes \u0398(m) time.\nThe computation of the summations over the m examples in Equation 7 or 10 takes \u0398(m) time.\nOnce these summations are computed, solving Equation 7 or 10 takes \u0398(1) time. Consequently,\nit takes \u0398(m) time to perform one basic iteration of the learning algorithm which consist of (1)\nsolving Equation 7 or 10 to \ufb01nd \u03b4\u2217, (2) modifying w1, Q1, Qn+1, and (3) updating Dw according to\nEquation 11 or 12. The complete algorithm, called f minimization, is described by the pseudo code\nof Algorithm 1.\n\n(11)\n(12)\n\n1Dw(i) stands for either Dql\n\nw(i) or Del\n\nw(i).\n\n5\n\n\fAlgorithm 1 : f minimization\n1: Initialization: Let Qi = Qn+i = 1\n\n2n , wi = 0, \u2200i \u2208 {1, . . . , n}.\n\nInitialize Dw according to Equation 13 or 14.\n\n2: repeat\n3:\n\n4:\n\n5:\n\n6:\n\n7:\n\nn < w1 + \u03b4\u2217 < 1\n\nChoose at random h \u2208 H and call it h1 (hn+1 is then the boolean complement of h1).\nFind \u03b4\u2217 that solves Equation 7 or 10.\nIf [\u22121\nIf [w1 + \u03b4\u2217 \u2265 1\nn] then Q1 \u2190 1\nIf [w1 + \u03b4\u2217 \u2264 \u22121\nn ] then Q1 \u2190 0; Qn+1 \u2190 1\nUpdate Dw according to Equation 11 or 12.\n\nn] then Q1 \u2190 Q1 + \u03b4/2; Qn+1 \u2190 Qn+1 \u2212 \u03b4/2; w1 \u2190 w1 + \u03b4.\n\nn; Qn+1 \u2190 0; w1 \u2190 1\nn.\nn; w1 \u2190 \u22121\nn .\n\n8:\n9: until Convergence\n\nThe repeat-until loop in Algorithm 1 was implemented as follows. We \ufb01rst mix at random the n\nboolean complement pairs of classi\ufb01ers and then go sequentially over each pair (hi, hn+i) to update\nwi and Dw. We repeat this sequence until no weight change exceeds a speci\ufb01ed small number \u0001.\n4.2 From KL(Q(cid:107)P ) to (cid:96)p Regularization\nWe can recover (cid:96)2 regularization if we upper-bound KL(Q(cid:107)P ) by a quadratic function. Indeed, if\nwe use\n\n(cid:18) 1\n\nn\n\n(cid:19)\n\n(cid:18) 1\n\nn\n\nq ln q +\n\n\u2212 q\n\nln\n\n\u2212 q\n\nln\n\n1\n2n\n\n+ 4n\n\nwe obtain, for the uniform prior Pi = 1/(2n),\n\nKL(Q(cid:107)P ) = ln(2n) +\n\nQi ln Qi +\n\nWith this approximation, the objective function to minimize becomes\n\ni=1\n\n\u2264 4n\n\nn(cid:88)\n\n(cid:18)\nf(cid:96)2(w) = C(cid:48)(cid:48) m(cid:88)\n\ni=1\n\nQi \u2212 1\n2n\n\n(cid:19)2\n(cid:18) 1\n\n\u03b6\n\n\u03b3\n\ni=1\n\n= n\n\nw2\ni .\n\nyiw \u00b7 h(xi)\n\n+ (cid:107)w(cid:107)2\n2 ,\n\n(cid:19)\n\u2264 1\nn\n(cid:20)\nn(cid:88)\n\n(cid:19)2 \u2200q \u2208 [0, 1/n] ,\n(cid:18) 1\n(cid:19)(cid:21)\n\n\u2212 Qi\n\nln\n\nn\n\nq \u2212 1\n2n\n(cid:19)\n\n\u2212 Qi\n\n(cid:18)\n(cid:18) 1\nn(cid:88)\n\nn\n\ni=1\n\n(cid:19)\n\n(15)\n\n(16)\n\n(17)\n\nsubject to the (cid:96)\u221e constraint |wj| \u2264 1/n \u2200j \u2208 {1, . . . , n}. Here (cid:107)w(cid:107)2 denotes the Euclidean norm\nof w and \u03b6(x) = (x \u2212 1)2 for the quadratic loss and e\u2212x for the exponential loss.\nIf, instead, we minimize f(cid:96)2 for v def= w/\u03b3 and remove the (cid:96)\u221e constraint, we recover exactly ridge\nregression for the quadratic loss case and (cid:96)2-regularized boosting for the exponential loss case.\nWe can obtain a (cid:96)1-regularized version of Equation 17 by repeating the above steps and us-\n\n(cid:12)(cid:12) \u2200q \u2208 [0, 1/n] since, in that case, we \ufb01nd that KL(Q(cid:107)P ) \u2264\n\n(cid:1)2 \u2264 2(cid:12)(cid:12)q \u2212 1\n\n2n\n\ning 4n(cid:0)q \u2212 1\n(cid:80)n\ni=1 |wi| def= (cid:107)w(cid:107)1.\n\n2n\n\nTo sum up, the KL-regularized objective function f immediately follows from PAC-Bayes theory\nand (cid:96)p regularization is obtained from a relaxation of f. Consequently, PAC-Bayes theory favors\nthe use of KL regularization if the goal of the learner is to produce a weighted majority vote with\ngood generalization.2\n\n2Interestingly, [9] has recently proposed a KL-regularized version of LPBoost but their objective function\n\nwas not derived from a uniform risk bound.\n\n6\n\n\f5 Empirical Results\n\nFor the sake of comparison, all learning algorithms of this subsection are producing a weighted\nmajority vote classi\ufb01er on the set of basis functions {h1, . . . , hn} known as decision stumps. Each\ndecision stump hi is a threshold classi\ufb01er that depends on a single attribute: its output is +b if\nthe tested attribute exceeds a threshold value t, and \u2212b otherwise, where b \u2208 {\u22121, +1}. For each\nattribute, at most ten equally-spaced possible values for t were determined a priori. Recall that,\nalthough Algorithm 1 needs a set H of 2n classi\ufb01ers containing n boolean complement pairs, it\noutputs a majority vote with n real-valued weights de\ufb01ned on {h1, . . . , hn}.\nThe results obtained for all tested algorithms are summarized in Table 1. We have compared Al-\ngorithm 1 with quadratic loss (KL-QL) and exponential loss (KL-EL) to AdaBoost [7] (AdB) and\nridge regression (RR).\nExcept for MNIST, all data sets were taken from the UCI repository. Each data set was randomly\nsplit into a training set S of |S| examples and a testing set T of |T| examples. The number a\nof attributes for each data set is also speci\ufb01ed in Table 1. For AdaBoost, the number of boosting\nrounds was \ufb01xed to 200. For all algorithms, RT refers to the frequency of errors, measured on the\ntesting set T .\nIn addition to this, the \u201cC and \u201c\u03b3\u201d columns in Table 1 refer, respectively, to the C value of the\nobjective function f and to the \u03b3 parameter present in the loss functions. These hyperparameters\nwere determined from the training set only by performing the 10-fold cross validation (CV) method.\nThe hyperparameters that gave the smallest 10-fold CV error were then used to train the Algorithms\non the whole training set and the resulting classi\ufb01ers were then run on the testing set.\n\nTable 1: Summary of results.\n\n(1) AdB (2) RR\n\nC\n\n1\n\n(3) KL-EL\n\u03b3\n0.1\n0.360 0.5 0.02\n0.227 0.1\n0.2\n0.187 500 0.01\n0.2\n0.1\n\n(4) KL-QL\nRT\n\u03b3\nRT C RT C\n0.047 0.02 0.4\n0.050 10 0.047 0.1\n0.286 0.02 0.3\n0.309 5\n0.157 2\n0.183 0.02 0.05\n0.196 0.02 0.01\n0.206 5\n0.273 100 0.253 500\n0.260 0.02 0.5\n0.177 0.05 0.2\n0.197 1\n0.211 0.2\n0.131 0.05 0.120 20 0.0001 0.097 0.2 0.1\n0.004 0.5 0.006 0.1 0.02\n0.006 1000 0.1\n0.026 0.05 0.019 500 0.01\n0.020 0.02 0.05\n0.045 0.5 0.043 10 0.0001 0.047 0.1 0.05\n0.015 0.05 0.006 500 0.001 0.015 0.2 0.02\n0.012 1\n0.014 1000 0.1\n0.024 0.2 0.016 0.2 0.001 0.031\n0.02\n0.033 0.2 0.035 500 0.0001 0.029 0.02 0.05\n0.001 0.5 0.000 10 0.001 0.000 1000 0.02\n0.037 0.05 0.025 500 0.01\n0.039 0.05 0.05\n0.115 1000 0.1\n0.192 0.05 0.135 500 0.05\n0.055 1000 0.05\n0.060 2\n0.1\n0.079 0.02 0.080 0.2 0.05\n0.080 0.02 0.05\n0.049 0.2 0.039 500 0.02\n0.046 1000 0.1\n\n0.014 500 0.02\n\n0.060 0.5\n\nDataset\n|S|\n\n|T|\nName\na\n9\nBreastCancer 343 340\n170 175\nLiver\n6\n353 300 15\nCredit-A\nGlass\n107 107\n9\n144 150\nHaberman\n3\n150 147 13\nHeart\n176 175 34\nIonosphere\n500 1055 16\nLetter:AB\nLetter:DO\n500 1058 16\nLetter:OQ\n500 1036 16\nMNIST:0vs8 500 1916 784\nMNIST:1vs7 500 1922 784\nMNIST:1vs8 500 1936 784\nMNIST:2vs3 500 1905 784\nMushroom 4062 4062 22\n3700 3700 20\nRingnorm\n104 104 60\nSonar\nUsvotes\n235 200 16\n4000 4000 21\nWaveform\nWdbc\n285 284 30\n\nRT\n0.053\n0.320\n0.170\n0.178\n0.260\n0.252\n0.120\n0.010\n0.036\n0.038\n0.008\n0.013\n0.025\n0.047\n0.000\n0.043\n0.231\n0.055\n0.085\n0.049\n\nSSB\n\n(3) < (2, 4)\n\n(3) < (4)\n(4) < (1)\n\n(3) < (1, 2, 4)\n\nWe clearly see that the cross-validation method generally chooses very small values for \u03b3. This, in\nturn, gives a risk bound (computed from Theorem 3.2) having very large values (results not shown\nhere). We have also tried to choose C and \u03b3 from the risk bound values.3 This method for selecting\nhyperparameters turned out to produce classi\ufb01ers having larger testing errors (results not shown\nhere).\nTo determine whether or not a difference of empirical risk measured on the testing set T is statisti-\ncally signi\ufb01cant, we have used the test set bound method of [4] (based on the binomial tail inversion)\n3From the standard union bound argument, the bound of Theorem 3.2 holds simultaneously for k different\n\nchoices of (\u03b3, C) if we replace \u03b4 by \u03b4/k.\n\n7\n\n\fwith a con\ufb01dence level of 95%. It turns out that no algorithm has succeeded in choosing a majority\nvote classi\ufb01er which was statistically signi\ufb01cantly better (SSB) than the one chosen by another al-\ngorithm except for the 4 cases that are listed in the column \u201cSSB\u201d of Table 1. We see that on these\ncases, Algorithm 1 turned out to be statistically signi\ufb01cantly better.\n\n6 Conclusion\n\nOur numerical results indicate that Algorithm 1 generally outperforms AdaBoost and ridge regres-\nsion when the hyperparameters C and \u03b3 are chosen by cross-validation. This indicates that the\n\nempirical loss(cid:99)\u03b6Q and the KL(Q(cid:107)P ) regularizer that are present in the PAC-Bayes bound of Theo-\nPAC-Bayes theory does not yet capture quantitatively the proper tradeoff between(cid:99)\u03b6Q and KL(Q(cid:107)P )\n\nrem 3.2 are key ingredients for learning algorithms to focus on. The fact that cross-validation turns\nout to be more ef\ufb01cient than Theorem 3.2 at selecting good values for hyperparameters indicates that\n\nthat learners should optimize on the trading data. However, we feel that it is important to pursue\nthis research direction since it could potentially eliminate the need to perform the time-consuming\ncross-validation method for selecting hyperparameters and provide better guarantees on the gener-\nalization error of classi\ufb01ers output by learning algorithms. In short, it could perhaps yield the best\ngeneric optimization problem for learning.\n\nAcknowledgments\n\nWork supported by NSERC discovery grants 122405 (M.M.) and 262067 (F.L.).\n\nReferences\n[1] Olivier Catoni.\n\ntistical\nMonograph series of\nhttp://arxiv.org/abs/0712.0248, December 2007.\n\nlearning.\n\nPAC-Bayesian surpevised classi\ufb01cation:\n\nthe thermodynamics of sta-\nthe Institute of Mathematical Statistics,\n\n[2] Pascal Germain, Alexandre Lacasse, Franc\u00b8ois Laviolette, and Mario Marchand. A pac-bayes\nrisk bound for general loss functions. In B. Sch\u00a8olkopf, J. Platt, and T. Hoffman, editors, Ad-\nvances in Neural Information Processing Systems 19, pages 449\u2013456. MIT Press, Cambridge,\nMA, 2007.\n\n[3] Pascal Germain, Alexandre Lacasse, Franc\u00b8ois Laviolette, and Mario Marchand. PAC-Bayesian\nIn L\u00b4eon Bottou and Michael Littman, editors, Proceedings of\nlearning of linear classi\ufb01ers.\nthe 26th International Conference on Machine Learning, pages 353\u2013360, Montreal, June 2009.\nOmnipress.\n\n[4] John Langford. Tutorial on practical prediction theory for classi\ufb01cation. Journal of Machine\n\nLearning Research, 6:273\u2013306, 2005.\n\n[5] John Langford and John Shawe-Taylor. PAC-Bayes & margins.\n\nIn S. Thrun S. Becker and\nK. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 423\u2013430.\nMIT Press, Cambridge, MA, 2003.\n\n[6] David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51:5\u201321,\n\n2003.\n\n[7] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new\nexplanation for the effectiveness of voting methods. The Annals of Statistics, 26:1651\u20131686,\n1998.\n\n[8] Matthias Seeger. PAC-Bayesian generalization bounds for gaussian processes. Journal of Ma-\n\nchine Learning Research, 3:233\u2013269, 2002.\n\n[9] Manfred K. Warmuth, Karen A. Glocer, and S.V.N. Vishwanathan. Entropy regularized LP-\nBoost. In Proceedings of the 2008 conference on Algorithmic Learning Theory, Springer LNAI\n5254,, pages 256\u2013271, 2008.\n\n8\n\n\f", "award": [], "sourceid": 456, "authors": [{"given_name": "Pascal", "family_name": "Germain", "institution": null}, {"given_name": "Alexandre", "family_name": "Lacasse", "institution": null}, {"given_name": "Mario", "family_name": "Marchand", "institution": null}, {"given_name": "Sara", "family_name": "Shanian", "institution": null}, {"given_name": "Fran\u00e7ois", "family_name": "Laviolette", "institution": null}]}