{"title": "Learning Kernels Using Local Rademacher Complexity", "book": "Advances in Neural Information Processing Systems", "page_first": 2760, "page_last": 2768, "abstract": "We use the notion of local Rademacher complexity to design new algorithms for learning kernels. Our algorithms thereby benefit from the sharper learning bounds based on that notion which, under certain general conditions, guarantee a faster convergence rate.  We devise two new learning kernel algorithms: one based on a convex optimization problem for which we give an efficient solution using existing learning kernel techniques, and another one that can be formulated as a DC-programming problem for which we describe a solution in detail. We also report the results of experiments with both algorithms in both binary and multi-class classification tasks.", "full_text": "Learning Kernels Using Local Rademacher Complexity\n\nCorinna Cortes\nGoogle Research\n76 Ninth Avenue\nNew York, NY 10011\ncorinna@google.com\n\nMarius Kloft*\nCourant Institute &\nSloan-Kettering Institute\n251 Mercer Street\nNew York, NY 10012\nmkloft@cims.nyu.edu\n\nMehryar Mohri\nCourant Institute &\nGoogle Research\n251 Mercer Street\nNew York, NY 10012\nmohri@cims.nyu.edu\n\nAbstract\n\nWe use the notion of local Rademacher complexity to design new algorithms for learning kernels. Our algorithms thereby benefit from the sharper learning bounds based on that notion which, under certain general conditions, guarantee a faster convergence rate. We devise two new learning kernel algorithms: one based on a convex optimization problem for which we give an efficient solution using existing learning kernel techniques, and another one that can be formulated as a DC-programming problem for which we describe a solution in detail. 
We also report the results of experiments with both algorithms in both binary and multi-class classification tasks.\n\n1 Introduction\n\nKernel-based algorithms are widely used in machine learning and have been shown to often provide very effective solutions. For such algorithms, the features are provided intrinsically via the choice of a positive-semi-definite symmetric kernel function, which can be interpreted as a similarity measure in a high-dimensional Hilbert space. In the standard setting of these algorithms, the choice of the kernel is left to the user. That choice is critical since a poor choice, as with a sub-optimal choice of features, can make learning very challenging. In the last decade or so, a number of algorithms and theoretical results have been given for a wider setting known as that of learning kernels or multiple kernel learning (MKL) (e.g., [1, 2, 3, 4, 5, 6]). That setting, instead of requiring the user to take the risk of specifying a particular kernel function, only requires the user to provide a family of kernels. Both tasks of selecting the kernel out of that family of kernels and choosing a hypothesis based on that kernel are then left to the learning algorithm.\n\nOne of the most useful data-dependent complexity measures used in the theoretical analysis and design of learning kernel algorithms is the notion of Rademacher complexity (e.g., [7, 8]). Tight learning bounds based on this notion were given in [2], improving earlier results of [4, 9, 10]. These generalization bounds provide a strong theoretical foundation for a family of learning kernel algorithms based on a non-negative linear combination of base kernels. 
Most of these algorithms, whether for binary classification or multi-class classification, are based on controlling the trace of the combined kernel matrix.\n\nThis paper seeks to use a finer notion of complexity for the design of algorithms for learning kernels: the notion of local Rademacher complexity [11, 12]. One shortcoming of the general notion of Rademacher complexity is that it does not take into consideration the fact that, typically, the hypotheses selected by a learning algorithm have a better performance than in the worst case and belong to a more favorable sub-family of the set of all hypotheses. The notion of local Rademacher complexity is precisely based on this idea by considering Rademacher averages of smaller subsets of the hypothesis set. It leads to sharper learning bounds which, under certain general conditions, guarantee a faster convergence rate.\n\n*Alternative address: Memorial Sloan-Kettering Cancer Center, 415 E 68th street, New York, NY 10065, USA. Email: kloft@cbio.mskcc.org.\n\nWe show how the notion of local Rademacher complexity can be used to guide the design of new algorithms for learning kernels. For kernel-based hypotheses, the local Rademacher complexity can be both upper- and lower-bounded in terms of the tail sum of the eigenvalues of the kernel matrix [13]. This motivates the introduction of two natural families of hypotheses based on non-negative combinations of base kernels with kernels constrained by a tail sum of the eigenvalues. We study and compare both families of hypotheses and derive learning kernel algorithms based on both. For the first family of hypotheses, the algorithm is based on a convex optimization problem. 
We show how that problem can be solved using optimization solutions for existing learning kernel algorithms. For the second hypothesis set, we show that the problem can be formulated as a DC-programming (difference of convex functions programming) problem and describe in detail our solution. We report empirical results for both algorithms in both binary and multi-class classification tasks.\n\nThe paper is organized as follows. In Section 2, we present some background on the notion of local Rademacher complexity by summarizing the main results relevant to our theoretical analysis and the design of our algorithms. Section 3 describes and analyzes two new kernel learning algorithms, as just discussed. In Section 4, we give strong theoretical guarantees in support of both algorithms. In Section 5, we report the results of preliminary experiments, in a series of both binary classification and multi-class classification tasks.\n\n2 Background on local Rademacher complexity\n\nIn this section, we present an introduction to local Rademacher complexities and related properties.\n\n2.1 Core ideas and definitions\n\nWe consider the standard set-up of supervised learning where the learner receives a sample $z_1 = (x_1, y_1), \\ldots, z_n = (x_n, y_n)$ of size $n \\ge 1$ drawn i.i.d. from a probability distribution $P$ over $\\mathcal{Z} = \\mathcal{X} \\times \\mathcal{Y}$. Let $\\mathcal{F}$ be a set of functions mapping from $\\mathcal{X}$ to $\\mathcal{Y}$, and let $l : \\mathcal{Y} \\times \\mathcal{Y} \\to [0, 1]$ be a loss function. The learning problem is that of selecting a function $f \\in \\mathcal{F}$ with small risk or expected loss $E[l(f(x), y)]$. Let $\\mathcal{G} := l(\\mathcal{F}, \\cdot)$ denote the loss class; then this is equivalent to finding a function $g \\in \\mathcal{G}$ with small average $E[g]$. For convenience, in what follows, we assume that the infimum of $E[g]$ over $\\mathcal{G}$ is reached and denote by $g^* \\in \\operatorname{argmin}_{g \\in \\mathcal{G}} E[g]$ the most accurate predictor in $\\mathcal{G}$. When the infimum is not reached, in the following results, $E[g^*]$ can be equivalently replaced by $\\inf_{g \\in \\mathcal{G}} E[g]$.\n\nDefinition 1. 
Let $\\sigma_1, \\ldots, \\sigma_n$ be an i.i.d. family of Rademacher variables taking values $-1$ and $+1$ with equal probability independent of the sample $(z_1, \\ldots, z_n)$. Then, the global Rademacher complexity of $\\mathcal{G}$ is defined as\n\n$$R_n(\\mathcal{G}) := E\\Big[\\sup_{g \\in \\mathcal{G}} \\frac{1}{n} \\sum_{i=1}^{n} \\sigma_i g(z_i)\\Big].$$\n\nGeneralization bounds based on the notion of Rademacher complexity are standard [7]. In particular, for the empirical risk minimization (ERM) hypothesis $\\hat{g}_n$, for any $\\delta > 0$, the following bound holds with probability at least $1 - \\delta$:\n\n$$E[\\hat{g}_n] - E[g^*] \\le 2 \\sup_{g \\in \\mathcal{G}} \\big(E[g] - \\hat{E}[g]\\big) \\le 4 R_n(\\mathcal{G}) + \\sqrt{\\frac{2 \\log \\frac{2}{\\delta}}{n}}. \\quad (1)$$\n\n$R_n(\\mathcal{G})$ is in the order of $O(1/\\sqrt{n})$ for various classes used in practice, including when $\\mathcal{F}$ is a kernel class with bounded trace and when the loss $l$ is Lipschitz. In such cases, the bound (1) converges at rate $O(1/\\sqrt{n})$. For some classes $\\mathcal{G}$, we may, however, obtain fast rates of up to $O(1/n)$. The following presentation is based on [12]. Using Talagrand\u2019s inequality, one can show that with probability at least $1 - \\delta$,\n\n$$E[\\hat{g}_n] - E[g^*] \\le 8 R_n(\\mathcal{G}) + \\Sigma(\\mathcal{G}) \\sqrt{\\frac{8 \\log \\frac{2}{\\delta}}{n}} + \\frac{3 \\log \\frac{2}{\\delta}}{n}. \\quad (2)$$\n\nHere, $\\Sigma^2(\\mathcal{G}) := \\sup_{g \\in \\mathcal{G}} E[g^2]$ is a bound on the variance of the functions in $\\mathcal{G}$. The key idea to obtain fast rates is to choose a much smaller class $\\mathcal{G}_n^\\star \\subseteq \\mathcal{G}$ with as small a variance as possible, while requiring that $\\hat{g}_n$ still lies in $\\mathcal{G}_n^\\star$. Since such a small class can also have a substantially smaller Rademacher complexity $R_n(\\mathcal{G}_n^\\star)$, the bound (2) can be sharper than (1).\n\nBut how can we find a small class $\\mathcal{G}_n^\\star$ that is just large enough to contain $\\hat{g}_n$? We give some further background on how to construct such a class in the supplementary material section 1. It turns out that the order of convergence of $E[\\hat{g}_n] - E[g^*]$ is determined by the order of the fixed point of the local Rademacher complexity, defined below.\n\nDefinition 2. For any $r > 0$, the local Rademacher complexity of $\\mathcal{G}$ is defined as\n\n$$R_n(\\mathcal{G}; r) := R_n\\big(\\{g \\in \\mathcal{G} : E[g^2] \\le r\\}\\big).$$\n\nIf the local Rademacher complexity is known, it can be used to compare $\\hat{g}_n$ with $g^*$, as $E[\\hat{g}_n] - E[g^*]$ can be bounded in terms of the fixed point of the Rademacher complexity of $\\mathcal{F}$, besides constants and $O(1/n)$ terms. But, while the global Rademacher complexity is generally of the order of $O(1/\\sqrt{n})$ at best, its local counterpart can converge at orders up to $O(1/n)$. We give an example of such a class, particularly relevant for this paper, below.\n\nFigure 1: Illustration of the bound (3). The volume of the gray shaded area amounts to the term $\\theta r + \\sum_{j>\\theta} \\lambda_j$ occurring in (3). The left- and right-most figures show the cases of $\\theta$ too small or too large, and the center figure the case corresponding to the appropriate value of $\\theta$.\n\n2.2 Kernel classes\n\nThe local Rademacher complexity for kernel classes can be accurately described and shown to admit a simple expression in terms of the eigenvalues of the kernel [13] (cf. also Theorem 6.5 in [11]).\n\nTheorem 3. Let $k$ be a Mercer kernel with corresponding feature map $\\phi_k$ and reproducing kernel Hilbert space $H_k$. Let $k(x, \\tilde{x}) = \\sum_{j=1}^{\\infty} \\lambda_j \\varphi_j(x)^\\top \\varphi_j(\\tilde{x})$ be its eigenvalue decomposition, where $(\\lambda_i)_{i=1}^{\\infty}$ is the sequence of eigenvalues arranged in descending order. Let $\\mathcal{F} := \\{f_w = (x \\mapsto \\langle w, \\phi_k(x) \\rangle) : \\|w\\|_{H_k} \\le 1\\}$. Then, for every $r > 0$,\n\n$$E[R(\\mathcal{F}; r)] \\le \\sqrt{\\frac{2}{n} \\min_{\\theta \\ge 0} \\Big(\\theta r + \\sum_{j>\\theta} \\lambda_j\\Big)} = \\sqrt{\\frac{2}{n} \\sum_{j=1}^{\\infty} \\min(r, \\lambda_j)}. \\quad (3)$$\n\nMoreover, there is an absolute constant $c$ such that, if $\\lambda_1 \\ge \\frac{1}{n}$, then for every $r \\ge \\frac{1}{n}$,\n\n$$\\frac{c}{\\sqrt{n}} \\sqrt{\\sum_{j=1}^{\\infty} \\min(r, \\lambda_j)} \\le E[R(\\mathcal{F}; r)].$$\n\nWe summarize the proof of this result in the supplementary material section 2. 
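

The equality in (3) can be checked numerically. The following is only a minimal sketch, assuming a finite list of eigenvalues in place of the infinite spectrum and restricting the cut-off to integers; the function names and the geometric spectrum are hypothetical choices made for illustration.

```python
import numpy as np

def tail_sum_bound(eigvals, r, n):
    # Right-hand side of (3): sqrt((2/n) * sum_j min(r, lambda_j)).
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    return np.sqrt(2.0 / n * np.minimum(r, lam).sum())

def cutoff_bound(eigvals, r, n):
    # Middle expression of (3): minimize theta * r + sum_{j > theta} lambda_j
    # over integer cut-offs theta = 0, ..., len(eigvals).
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    # tails[theta] = sum of the eigenvalues left after removing the theta largest
    tails = lam.sum() - np.concatenate(([0.0], np.cumsum(lam)))
    values = np.arange(len(lam) + 1) * r + tails
    best = int(np.argmin(values))
    return np.sqrt(2.0 / n * values[best]), best

# Hypothetical spectrum with geometric decay (illustration only).
lam = 0.5 ** np.arange(1, 21)
bound, theta = cutoff_bound(lam, r=0.01, n=1000)
```

In this sketch the minimizing cut-off is the number of eigenvalues exceeding r, which is why both expressions in (3) coincide.
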
In view of (3), the local Rademacher complexity for kernel classes is determined by the tail sum of the eigenvalues. A core idea of the proof is to optimize over the \u201ccut-off point\u201d $\\theta$ of the tail sum of the eigenvalues in the bound. Solving for the optimal $\\theta$ gives a bound in terms of truncated eigenvalues, which is illustrated in Figure 1.\n\nConsider, for instance, the special case where $r = \\infty$. We can then recover the familiar upper bound on the Rademacher complexity: $R_n(\\mathcal{F}) \\le \\sqrt{\\operatorname{Tr}(k)/n}$. But, when $\\sum_{j>\\theta} \\lambda_j = O(\\exp(-\\theta))$, as in the case of Gaussian kernels [14], then\n\n$$O\\Big(\\min_{\\theta \\ge 0} \\theta r + \\exp(-\\theta)\\Big) = O(r \\log(1/r)).$$\n\nTherefore, we have $R(\\mathcal{F}; r) = O\\big(\\sqrt{\\frac{r}{n} \\log(1/r)}\\big)$, which has the fixed point $r^* = O\\big(\\frac{\\log(n)}{n}\\big)$. Thus, by Theorem 8 (shown in the supplemental material), we have $E[\\hat{g}_n] - E[g^*] = O\\big(\\frac{\\log(n)}{n}\\big)$, which yields a much stronger learning guarantee.\n\n3 Algorithms\n\nIn this section, we will use the properties of the local Rademacher complexity just discussed to devise a novel family of algorithms for learning kernels.\n\n3.1 Motivation and analysis\n\nMost learning kernel algorithms are based on a family of hypotheses based on a kernel $k_\\mu = \\sum_{m=1}^{M} \\mu_m k_m$ that is a non-negative linear combination of $M$ base kernels. This is described by the following hypothesis class:\n\n$$H := \\big\\{ f_{w,k_\\mu} = (x \\mapsto \\langle w, \\phi_{k_\\mu}(x) \\rangle) : \\|w\\|_{H_{k_\\mu}} \\le \\Lambda, \\ \\mu \\succeq 0 \\big\\}.$$\n\nIt is known that the Rademacher complexity of $H$ can be upper-bounded in terms of the trace of the combined kernel. Thus, most existing algorithms for learning kernels [1, 4, 6] add the following constraint to restrict $H$:\n\n$$\\operatorname{Tr}(k_\\mu) \\le 1. \\quad (4)$$\n\nAs we saw in the previous section, however, the tail sum of the eigenvalues of the kernel, rather than its trace, determines the local Rademacher complexity. Since the local Rademacher complexity can lead to tighter generalization bounds than the global Rademacher complexity, this motivates us to consider the following hypothesis class for learning kernels:\n\n$$H_1 := \\Big\\{ f_{w,k_\\mu} \\in H : \\sum_{j>\\theta} \\lambda_j(k_\\mu) \\le 1 \\Big\\}.$$\n\nHere, $\\theta$ is a free parameter controlling the tail sum. The trace is a linear function and thus the constraint (4) defines a half-space, therefore a convex set, in the space of kernels. The function $k \\mapsto \\sum_{j>\\theta} \\lambda_j(k)$, however, is concave since it can be expressed as the difference of the trace and the sum of the $\\theta$ largest eigenvalues, which is a convex function.\n\nNevertheless, the following upper bound holds, denoting $\\tilde{\\mu}_m := \\mu_m / \\|\\mu\\|_1$,\n\n$$\\sum_{m=1}^{M} \\mu_m \\sum_{j>\\theta} \\lambda_j(k_m) = \\sum_{m=1}^{M} \\tilde{\\mu}_m \\sum_{j>\\theta} \\lambda_j(\\|\\mu\\|_1 k_m) \\le \\sum_{j>\\theta} \\lambda_j\\Big( \\underbrace{\\sum_{m=1}^{M} \\tilde{\\mu}_m \\|\\mu\\|_1 k_m}_{= k_\\mu} \\Big), \\quad (5)$$\n\nwhere the equality holds by linearity and the inequality by the concavity just discussed. This leads us to consider alternatively the following class:\n\n$$H_2 := \\Big\\{ f_{w,k_\\mu} \\in H : \\sum_{m=1}^{M} \\mu_m \\sum_{j>\\theta} \\lambda_j(k_m) \\le 1 \\Big\\}.$$\n\nThe class $H_2$ is convex because it is the restriction of the convex class $H$ via a linear inequality constraint. $H_2$ is thus more convenient to work with. The following proposition helps us compare these two families.\n\nProposition 4. The following statements hold for the sets $H_1$ and $H_2$:\n\n(a) $H_1 \\subseteq H_2$.\n(b) If $\\theta = 0$, then $H_1 = H_2$.\n(c) Let $\\theta > 0$. There exist kernels $k_1, \\ldots, k_M$ and a probability measure $P$ such that $H_1 \\subsetneq H_2$.\n\nThe proposition shows that, in general, the convex class $H_2$ can be larger than $H_1$. The following result shows that in general an even stronger result holds.\n\nProposition 5. Let $\\theta > 0$. There exist kernels k1, . . . 
, kM and a probability measure $P$ such that $\\operatorname{conv}(H_1) \\subsetneq H_2$.\n\nThe proofs of these propositions are given in the supplemental material. These results show that in general $H_2$ could be a richer class than $H_1$ and even its convex hull. This would suggest working with $H_1$ to further limit the risk of overfitting; however, as already pointed out, $H_2$ is more convenient since it is a convex class. Thus, in the next section, we will consider both hypothesis sets and introduce two distinct learning kernel algorithms, each based on one of these families.\n\n3.2 Convex optimization algorithm\n\nThe simpler algorithm performs regularized empirical risk minimization based on the convex class $H_2$. Note that by a renormalization of the kernels $k_1, \\ldots, k_M$, according to $\\tilde{k}_m := \\big(\\sum_{j>\\theta} \\lambda_j(k_m)\\big)^{-1} k_m$ and $\\tilde{k}_\\mu = \\sum_{m=1}^{M} \\mu_m \\tilde{k}_m$, we can simply rewrite $H_2$ as\n\n$$H_2 = \\tilde{H}_2 := \\big\\{ f_{w,\\tilde{k}_\\mu} = (x \\mapsto \\langle w, \\phi_{\\tilde{k}_\\mu}(x) \\rangle) : \\|w\\|_{H_{\\tilde{k}_\\mu}} \\le \\Lambda, \\ \\mu \\succeq 0, \\ \\|\\mu\\|_1 \\le 1 \\big\\}, \\quad (6)$$\n\nwhich is the commonly studied hypothesis class in multiple kernel learning. Of course, in practice, we replace the kernel $k$ by its empirical version, the kernel matrix $K = (k(x_i, x_j))_{i,j=1}^{n}$, and consider $\\lambda_1, \\ldots, \\lambda_n$ as the eigenvalues of the kernel matrix and not of the kernel itself. Hence, we can easily exploit existing software solutions:\n\n1. For all $m = 1, \\ldots, M$, compute $\\sum_{j>\\theta} \\lambda_j(K_m)$;\n2. For all $m = 1, \\ldots, M$, normalize the kernel matrices according to $\\tilde{K}_m := \\big(\\sum_{j>\\theta} \\lambda_j(K_m)\\big)^{-1} K_m$;\n3. Use any of the many existing ($\\ell_1$-norm) MKL solvers to compute the minimizer of ERM over $\\tilde{H}_2$.\n\nNote that the tail sum can be computed in $O(n^2 \\theta)$ for each kernel because it is sufficient to compute the $\\theta$ largest eigenvalues and the trace: $\\sum_{j>\\theta} \\lambda_j(K_m) = \\operatorname{Tr}(K_m) - \\sum_{j=1}^{\\theta} \\lambda_j(K_m)$.\n\n3.3 DC-programming\n\nIn the more challenging case, we perform penalized ERM over the class $H_1$, that is, we aim to solve\n\n$$\\min_{w,\\, \\mu \\succeq 0} \\ \\frac{1}{2} \\|w\\|^2_{H_{K_\\mu}} + C \\sum_{i=1}^{n} l(f_{w,K_\\mu}(x_i), y_i) \\quad \\text{s.t.} \\quad \\sum_{j>\\theta} \\lambda_j(K_\\mu) \\le 1. \\quad (7)$$\n\nThis is a convex optimization problem with an additional concave constraint $\\sum_{j>\\theta} \\lambda_j(K_\\mu) \\le 1$. This constraint is not differentiable, but it admits a subdifferential at any point $\\mu_0 \\in \\mathbb{R}^M$. Denote the subdifferential of the function $\\mu \\mapsto \\lambda_j(K_\\mu)$ by $\\partial_{\\mu_0} \\lambda_j(K_{\\mu_0}) := \\{v \\in \\mathbb{R}^M : \\lambda_j(K_\\mu) - \\lambda_j(K_{\\mu_0}) \\ge \\langle v, \\mu - \\mu_0 \\rangle, \\ \\forall \\mu \\in \\mathbb{R}^M\\}$. Moreover, let $u_1, \\ldots, u_n$ be the eigenvectors of $K_{\\mu_0}$ sorted in descending order of their eigenvalues. Defining $v_m := \\sum_{j>\\theta} u_j^\\top K_m u_j$, one can verify, using the sub-differentiability of the max operator, that $v = (v_1, \\ldots, v_M)^\\top$ is contained in the subdifferential $\\partial_{\\mu_0} \\sum_{j>\\theta} \\lambda_j(K_{\\mu_0})$. Thus, we can linearly approximate the constraint, for any $\\mu_0 \\in \\mathbb{R}^M$, via\n\n$$\\sum_{j>\\theta} \\lambda_j(K_\\mu) \\approx \\langle v, \\mu - \\mu_0 \\rangle = \\sum_{j>\\theta} u_j^\\top K_{(\\mu - \\mu_0)} u_j.$$\n\nWe can thus tackle problem (7) using the DCA algorithm [15], which in this context reduces to alternating between the linearization of the concave constraint and solving the resulting convex problem, that is, for any $\\mu_0 \\in \\mathbb{R}^M$,\n\n$$\\min_{w,\\, \\mu \\succeq 0} \\ \\frac{1}{2} \\|w\\|^2_{H_{K_\\mu}} + C \\sum_{i=1}^{n} l(f_{w,K_\\mu}(x_i), y_i) \\quad \\text{s.t.} \\quad \\sum_{j>\\theta} u_j^\\top K_{(\\mu - \\mu_0)} u_j \\le 1. \\quad (8)$$\n\nNote that $\\mu_0$ changes in every iteration, and so may the eigenvectors $u_1, \\ldots, u_n$ of $K_{\\mu_0}$, until the DCA algorithm converges. The DCA algorithm is proven to converge to a local minimum, even when the concave function is not differentiable [15]. 
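

The tail-sum normalization of Section 3.2 and the subgradient vector v used in the linearization above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names and the toy kernels are hypothetical, and for brevity it uses a full dense eigendecomposition rather than a partial solver that returns only the leading eigenpairs, so it does not attain the cost stated in Section 3.2.

```python
import numpy as np

def tail_sum(K, theta):
    # Tr(K) minus the theta largest eigenvalues of the symmetric matrix K.
    eigvals = np.linalg.eigvalsh(K)  # returned in ascending order
    return np.trace(K) - eigvals[len(eigvals) - theta:].sum()

def normalize_kernels(kernels, theta):
    # Steps 1 and 2 of the conv recipe: divide each K_m by its tail sum.
    return [K / tail_sum(K, theta) for K in kernels]

def tail_subgradient(kernels, mu0, theta):
    # v_m = sum_{j > theta} u_j^T K_m u_j, where u_j are the eigenvectors of
    # K_{mu0} = sum_m mu0_m K_m ordered by decreasing eigenvalue.
    K_mu0 = sum(m * K for m, K in zip(mu0, kernels))
    _, U = np.linalg.eigh(K_mu0)         # columns ascending by eigenvalue
    U_tail = U[:, : U.shape[1] - theta]  # drop the theta leading eigenvectors
    return np.array([np.trace(U_tail.T @ K @ U_tail) for K in kernels])

# Hypothetical toy data: a shifted linear kernel and a Gaussian kernel.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
kernels = [X @ X.T + np.eye(8), np.exp(-0.5 * sq_dists)]
mu0 = np.array([0.5, 0.5])
v = tail_subgradient(kernels, mu0, theta=2)
```

By construction, the inner product of v with mu0 equals the tail sum of the combined matrix at mu0, which gives a quick sanity check of the subgradient computation.
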
The algorithm is also close to the CCCP algorithm of Yuille and Rangarajan [16], modulo the use of subgradients instead of the gradients.\n\nTo solve (8), we alternate the optimization with respect to $\\mu$ and $w$. Note that, for fixed $w$, we can compute the optimal $\\mu$ analytically. Up to normalization the following holds:\n\n$$\\forall m = 1, \\ldots, M : \\quad \\mu_m = \\sqrt{\\frac{\\|w\\|^2_{H_{k_\\mu}}}{\\sum_{j>\\theta} u_j^\\top K_m u_j}}. \\quad (9)$$\n\nA very similar optimality expression has been used in the context of the group Lasso and $\\ell_p$-norm multiple kernel learning by [3]. In turn, we need to compute a $w$ that is optimal in (8), for fixed $\\mu$. We perform this computation in the dual; e.g., for the hinge loss $l(t, y) = \\max(0, 1 - ty)$, this reduces to a standard support vector machine (SVM) [17, 18] problem,\n\n$$\\max_{0 \\preceq \\alpha \\preceq C} \\ \\mathbf{1}^\\top \\alpha - \\frac{1}{2} (\\alpha \\circ y)^\\top K_\\mu (\\alpha \\circ y), \\quad (10)$$\n\nwhere $\\circ$ denotes the Hadamard product.\n\nAlgorithm 1 (DC algorithm for learning kernels based on the local Rademacher complexity).\n1: input: kernel matrix $K = (k(x_i, x_j))_{i,j=1}^{n}$ and labels $y_1, \\ldots, y_n \\in \\{-1, 1\\}$, optimization precision $\\varepsilon$\n2: initialize $\\mu_m := 1/M$ for all $m = 1, \\ldots, M$\n3: while optimality conditions are not satisfied within tolerance $\\varepsilon$ do\n4:   SVM training: compute a new $\\alpha$ by solving the SVM problem (10)\n5:   eigenvector computation: compute the eigenvectors $u_1, \\ldots, u_n$ of $K_\\mu$\n6:   store $\\mu_0 := \\mu$\n7:   $\\mu$ update: compute a new $\\mu$ according to (9) using (11)\n8:   normalize $\\mu$ such that $\\sum_{j>\\theta} u_j^\\top K_{(\\mu - \\mu_0)} u_j = 1$\n9: end while\n10: SVM training: solve (10) with respect to $\\alpha$\n11: output: $\\varepsilon$-accurate $\\alpha$ and kernel weights $\\mu$\n\nFor the computation of (9), we can recover the term $\\|w\\|^2_{H_{k_\\mu}}$ corresponding to the $\\alpha$ that is optimal in (10) via\n\n$$\\|w\\|^2_{H_{K_\\mu}} = \\mu_m^2 (\\alpha \\circ y)^\\top K_m (\\alpha \\circ y), \\quad (11)$$\n\nwhich follows from the KKT conditions with respect to (10). In summary, the proposed algorithm, which is shown in Algorithm 1, alternatingly optimizes $\\alpha$ and $\\mu$, where prior to each $\\mu$ step the linear approximation is updated by computing an eigenvalue decomposition of $K_\\mu$.\n\nIn the discussion that precedes, for the sake of simplicity of the presentation, we restricted ourselves to the case of an $\\ell_1$-regularization, that is, we showed how the standard trace-regularization can be replaced by a regularization based on the tail sum of the eigenvalues. It should be clear that in the same way we can replace the familiar $\\ell_p$-regularization used in learning kernel algorithms [3] for $p \\ge 1$ with $\\ell_p$-regularization in terms of the tail eigenvalues. In fact, as in the $\\ell_1$ case, in the $\\ell_p$ case, our convex optimization algorithm can be solved using existing MKL optimization solutions. The results we report in Section 5 will in fact also include those obtained by using the $\\ell_2$ version of our algorithm.\n\n4 Learning guarantees\n\nAn advantage of the algorithms presented is that they benefit from strong theoretical guarantees. Since $H_1 \\subseteq H_2$, it is sufficient to present these guarantees for $H_2$; any bound that holds for $H_2$ a fortiori holds for $H_1$. To present the result, recall from Section 3.2 that, by a re-normalization of the kernels, we may equivalently express $H_2$ by $\\tilde{H}_2$, as defined in (6). Thus, the algorithms presented enjoy the following bound on the local Rademacher complexity, which was shown in [19] (Theorem 5). 
Similar results were shown in [20, 21].\n\nTheorem 6 (Local Rademacher complexity). Assume that the kernels are uniformly bounded (for all $m$, $\\|\\tilde{k}_m\\|_\\infty < \\infty$) and uncorrelated. Then, the local Rademacher complexity of $\\tilde{H}_2$ can be bounded as follows:\n\n$$R(\\tilde{H}_2; r) \\le \\sqrt{\\frac{16e}{n} \\max_{m=1,\\ldots,M} \\sum_{j=1}^{\\infty} \\min\\Big(r, e^2 \\Lambda^2 \\log^2(M) \\lambda_j(\\tilde{k}_m)\\Big)} + O\\Big(\\frac{1}{n}\\Big).$$\n\nNote that we show the result under the assumption of uncorrelated kernels only for simplicity of presentation. More generally, a similar result holds for correlated kernels and arbitrary $p \\ge 1$ (cf. [19], Theorem 5). Subsequently, we can derive the following bound on the excess risk from Theorem 6 using a result of [11] (presented as Theorem 8 in the supplemental material 1).\n\nTheorem 7. Let $l(t, y) = \\frac{1}{2}(t - y)^2$ be the squared loss. Assume that for all $m$, there exists $d$ such that $\\lambda_j(\\tilde{k}_m) \\le d j^{-\\gamma}$ for some $\\gamma > 1$ (this is a common assumption and, for example, met for finite rank kernels and Gaussian kernels [14]). Then, under the assumptions of the previous theorem, for any $\\delta > 0$, with probability at least $1 - \\delta$ over the draw of the sample, the excess loss of the class $\\tilde{H}_2$ can be bounded as follows:\n\n$$E[\\hat{g}_n] - E[g^*] \\le 186 \\sqrt{\\frac{3 - \\gamma}{1 - \\gamma}} \\big(4 d \\Lambda^2 \\log^2(M)\\big)^{\\frac{1}{1+\\gamma}} e (M/e)^{\\frac{\\gamma-1}{\\gamma+1}} n^{-\\frac{\\gamma}{\\gamma+1}} + O\\Big(\\frac{1}{n}\\Big).$$\n\nWe observe that the above bound converges in $O\\big(\\log^2(M)^{\\frac{1}{1+\\gamma}} M^{\\frac{\\gamma-1}{\\gamma+1}} n^{-\\frac{\\gamma}{1+\\gamma}}\\big)$. This can be almost as slow as $O(\\log(M)/\\sqrt{n})$ (when $\\gamma \\approx 1$) and almost as fast as $O(M/n)$ (when letting $\\gamma \\to \\infty$). The latter is the case, for instance, for finite-rank or Gaussian kernels.\n\nFigure 2: Results of the TSS experiment. LEFT: average AUCs of the compared algorithms. CENTER: for each kernel, the average kernel weight and single-kernel AUC. RIGHT: for each kernel $K_m$, the tail sum $\\sum_{j>\\theta} \\lambda_j$ as a function of the eigenvalue cut-off point $\\theta$.\n[Figure panels not reproduced. LEFT: AUC versus n for l1, l2, unif, conv, and dc. CENTER (n=100): kernel weight per kernel for l1, l2, conv, and dc, with single-kernel AUCs TSS 85.2, Promo 80.9, 1st Ex 85.8, Angle 55.6, Energ 72.1. RIGHT (n=100): log(tailsum($\\theta$)) versus $\\theta$ for each kernel.]\n\n5 Experiments\n\nIn this section, we report the results of experiments with the two algorithms we introduced, which we will denote by conv and dc for short. We will compare our algorithms with the classical $\\ell_1$-norm MKL (denoted by l1) and the more recent $\\ell_2$-norm MKL [3] (denoted by l2). We also measure the performance of the uniform kernel combination, denoted by unif, which has frequently been shown to achieve competitive performances [22]. In all experiments, we use the hinge loss as a loss function, including a bias term.\n\n5.1 Transcription Start Site Detection\n\nOur first experiment aims at detecting transcription start sites (TSS) of RNA Polymerase II binding genes in genomic DNA sequences. We experiment on the TSS data set, which we downloaded from http://mldata.org/. This data set, which is a subset of the data used in the larger study of [23], comes with 5 kernels, capturing various complementary aspects: a weighted-degree kernel representing the TSS signal (TSS), two spectrum kernels around the promoter region (Promo) and the 1st exon (1st Ex), respectively, and two linear kernels based on twisting angles (Angle) and stacking energies (Energ), respectively. The SVM based on the uniform combination of these 5 kernels was found to have the highest overall performance among 19 promoter prediction programs [24]; it therefore constitutes a strong baseline. 
To be consistent with previous studies [24, 3, 23], we will use the area under the ROC curve (AUC) as an evaluation criterion.\n\nAll kernel matrices $K_m$ were normalized such that $\\operatorname{Tr}(K_m) = n$ for all $m$, prior to the experiment. SVM computations were performed using the SHOGUN toolbox [25]. For both conv and dc, we experiment with $\\ell_1$- and $\\ell_2$-norms. We randomly drew an n-element training set and split the remaining set into validation and test sets of equal size. The random partitioning was repeated 100 times. We selected the optimal model parameters $\\theta \\in \\{2^i, i = 0, 1, \\ldots, 4\\}$ and $C \\in \\{10^i, i = -2, -1, 0, 1, 2\\}$ on the validation set, based on their maximal mean AUC, and report mean AUCs on the test set as well as standard deviations (the latter are within the interval [1.1, 2.5] and are shown in detail in the supplemental material 4). The experiment was carried out for all $n \\in \\{100, 250, 1000\\}$. Figure 2 (left) shows the mean AUCs on the test sets.\n\nWe observe that unif and l2 outperform l1, except when n = 100, in which case the three methods are on par. This is consistent with the result reported by [3]. For all sample sizes investigated, conv and dc yield the highest AUCs.\n\nWe give a brief explanation for the outcome of the experiment. To further investigate, we compare the average kernel weights $\\mu$ output by the compared algorithms (for n = 100). They are shown in Figure 2 (center), where we report, below each kernel, also its performance in terms of its AUC when training an SVM on that single kernel alone. We observe that l1 focuses on the TSS kernel using the TSS signal, which has the second highest AUC among the kernels (85.2). However, l1 discards the 1st exon kernel, which also has a high predictive performance (AUC of 85.8). 
A similar order of kernel importance is determined by l2, which, however, distributes the weights more broadly, while still mostly focusing on the TSS kernel. In contrast, conv and dc distribute their weight only over the TSS, Promoter, and 1st Exon kernels, which are also the kernels that have the highest predictive accuracies. The considerably weaker kernels Angle and Energ are discarded.\n\nBut why are Angle and Energ discarded? This can be explained by means of Figure 2 (right), where we show the tail sum of each kernel as a function of the cut-off point $\\theta$. We observe that Angle and Energ have only moderately large first and second eigenvalues, which is why they hardly profit when using conv or dc. The Promo and Exon kernels, however, which are discarded by l1, have large first (and also second) eigenvalues, which is why they are promoted by conv or dc. Indeed, the model selection determines the optimal cut-off, for both conv and dc, at $\\theta = 1$.\n\nTable 1: The training split (sp) fraction, dataset size (n), and multi-class accuracies shown with \u00b11 standard error. The performance results for MKL and conv correspond to the best values obtained using either $\\ell_1$-norm or $\\ell_2$-norm regularization.\n\ndataset | sp | n | unif | MKL | conv | $\\theta$\nplant | 0.5 | 940 | 91.1 \u00b1 0.8 | 90.6 \u00b1 0.9 | 91.4 \u00b1 0.7 | 32\nnonpl | 0.5 | 2732 | 87.2 \u00b1 1.6 | 87.7 \u00b1 1.3 | 87.6 \u00b1 0.9 | 4\npsortPos | 0.8 | 541 | 90.5 \u00b1 3.1 | 90.6 \u00b1 3.4 | 90.8 \u00b1 2.8 | 1\npsortNeg | 0.5 | 1444 | 90.3 \u00b1 1.8 | 90.7 \u00b1 1.2 | 91.2 \u00b1 1.3 | 8\nprotein | 0.5 | 694 | 57.2 \u00b1 2.0 | 57.2 \u00b1 2.0 | 59.6 \u00b1 2.4 | 8\n\n5.2 Multi-class Experiments\n\nWe next carried out a series of experiments with the conv algorithm in the multi-class classification setting, which has repeatedly been shown to be amenable to MKL learning [26, 27]. 
As described in Section 3.2, the conv problem can be solved by simply re-normalizing the kernels by the tail sum of the eigenvalues and making use of any $\\ell_p$-norm MKL solver. For our experiments, we used the ufo algorithm [26] from the DOGMA toolbox http://dogma.sourceforge.net/. For both conv and ufo we experiment with both $\\ell_1$ and $\\ell_2$ regularization and report the best performance achieved in each case.\n\nWe used the data sets evaluated in [27] (plant, nonpl, psortPos, and psortNeg), which consist of either 3 or 4 classes and use 69 biologically motivated sequence kernels (footnote 1). Furthermore, we also considered the proteinFold data set of [28], which consists of 27 classes and uses 12 biologically motivated base kernels (footnote 2).\n\nThe results are summarized in Table 1: they represent mean accuracy values with one standard deviation as computed over 10 random splits of the data into training and test folds. The fraction of the data used for training, as well as the total number of examples, is also shown. The optimal value for the parameter $\\theta \\in \\{2^i, i = 0, 1, \\ldots, 8\\}$ was determined by cross-validation. For the parameters $\\alpha$ and $C$ of the ufo algorithm we followed the methodology of [26]. For plant, psortPos, and psortNeg, the results show that conv leads to a consistent improvement in a difficult multi-class setting, although we cannot attest to their significance due to the insufficient size of the data sets. They also demonstrate a significant performance improvement over l1 and unif in the proteinFold data set, a more difficult task where the classification accuracies are below 60%.\n\n6 Conclusion\n\nWe showed how the notion of local Rademacher complexity can be used to derive new algorithms for learning kernels by using a regularization based on the tail sum of the eigenvalues of the kernels. 
We introduced two natural hypothesis sets based on that regularization, discussed their relationships, and showed how they can be used to design an algorithm based on a convex optimization and one based on solving a DC-programming problem. Our algorithms benefit from strong learning guarantees. Our empirical results show that they can lead to performance improvement in some challenging tasks. Finally, our analysis based on local Rademacher complexity could be used as the basis for the design of new learning kernel algorithms.\n\nAcknowledgments\n\nWe thank Gunnar R\u00e4tsch for helpful discussions. This work was partly funded by the NSF award IIS-1117591 and a postdoctoral fellowship funded by the German Research Foundation (DFG).\n\n1 Accessible from http://raetschlab.org//projects/protsubloc.\n2 Accessible from http://mkl.ucsd.edu/dataset/protein-fold-prediction.\n\nReferences\n\n[1] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan, \u201cMultiple kernel learning, conic duality, and the SMO algorithm,\u201d in Proc. 21st ICML, ACM, 2004.\n[2] C. Cortes, M. Mohri, and A. Rostamizadeh, \u201cGeneralization bounds for learning kernels,\u201d in Proceedings, 27th ICML, pp. 247\u2013254, 2010.\n[3] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien, \u201c$\\ell_p$-norm multiple kernel learning,\u201d Journal of Machine Learning Research, vol. 12, pp. 953\u2013997, Mar 2011.\n[4] G. Lanckriet, N. Cristianini, L. E. Ghaoui, P. Bartlett, and M. I. Jordan, \u201cLearning the kernel matrix with semi-definite programming,\u201d JMLR, vol. 5, pp. 27\u201372, 2004.\n[5] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet, \u201cSimpleMKL,\u201d J. Mach. Learn. Res., vol. 9, pp. 2491\u20132521, 2008.\n[6] S. Sonnenburg, G. R\u00e4tsch, C. Sch\u00e4fer, and B. Sch\u00f6lkopf, \u201cLarge scale multiple kernel learning,\u201d Journal of Machine Learning Research, vol. 7, pp. 1531\u20131565, July 2006.\n[7] P. Bartlett and S. 
Mendelson, \u201cRademacher and Gaussian complexities: Risk bounds and structural results,\u201d Journal of Machine Learning Research, vol. 3, pp. 463\u2013482, Nov. 2002.\n[8] V. Koltchinskii and D. Panchenko, \u201cEmpirical margin distributions and bounding the generalization error of combined classifiers,\u201d Annals of Statistics, vol. 30, pp. 1\u201350, 2002.\n[9] N. Srebro and S. Ben-David, \u201cLearning bounds for support vector machines with learned kernels,\u201d in Proc. 19th COLT, pp. 169\u2013183, 2006.\n[10] Y. Ying and C. Campbell, \u201cGeneralization bounds for learning the kernel problem,\u201d in COLT, 2009.\n[11] P. L. Bartlett, O. Bousquet, and S. Mendelson, \u201cLocal Rademacher complexities,\u201d Annals of Statistics, vol. 33, no. 4, pp. 1497\u20131537, 2005.\n[12] V. Koltchinskii, \u201cLocal Rademacher complexities and oracle inequalities in risk minimization,\u201d Annals of Statistics, vol. 34, no. 6, pp. 2593\u20132656, 2006.\n[13] S. Mendelson, \u201cOn the performance of kernel classes,\u201d Journal of Machine Learning Research, vol. 4, pp. 759\u2013771, Dec. 2003.\n[14] B. Sch\u00f6lkopf and A. Smola, Learning with Kernels. Cambridge, MA: MIT Press, 2002.\n[15] P. D. Tao and L. T. H. An, \u201cA DC optimization algorithm for solving the trust-region subproblem,\u201d SIAM Journal on Optimization, vol. 8, no. 2, pp. 476\u2013505, 1998.\n[16] A. L. Yuille and A. Rangarajan, \u201cThe concave-convex procedure,\u201d Neural Computation, vol. 15, pp. 915\u2013936, Apr. 2003.\n[17] C. Cortes and V. Vapnik, \u201cSupport vector networks,\u201d Machine Learning, vol. 20, pp. 273\u2013297, 1995.\n[18] B. Boser, I. Guyon, and V. Vapnik, \u201cA training algorithm for optimal margin classifiers,\u201d in Proc. 5th Annual ACM Workshop on Computational Learning Theory (D. Haussler, ed.), pp. 144\u2013152, 1992.\n[19] M. Kloft and G. 
Blanchard, \u201cOn the convergence rate of \u2113_p-norm multiple kernel learning,\u201d Journal of Machine Learning Research, vol. 13, pp. 2465\u20132502, Aug. 2012.\n[20] V. Koltchinskii and M. Yuan, \u201cSparsity in multiple kernel learning,\u201d Annals of Statistics, vol. 38, no. 6, pp. 3660\u20133695, 2010.\n[21] T. Suzuki, \u201cUnifying framework for fast learning rate of non-sparse multiple kernel learning,\u201d in Advances in Neural Information Processing Systems 24, pp. 1575\u20131583, 2011.\n[22] P. Gehler and S. Nowozin, \u201cOn feature combination for multiclass object classification,\u201d in International Conference on Computer Vision, pp. 221\u2013228, 2009.\n[23] S. Sonnenburg, A. Zien, and G. R\u00e4tsch, \u201cArts: Accurate recognition of transcription starts in human,\u201d Bioinformatics, vol. 22, no. 14, pp. e472\u2013e480, 2006.\n[24] T. Abeel, Y. V. de Peer, and Y. Saeys, \u201cTowards a gold standard for promoter prediction evaluation,\u201d Bioinformatics, 2009.\n[25] S. Sonnenburg, G. R\u00e4tsch, S. Henschel, C. Widmer, J. Behr, A. Zien, F. de Bona, A. Binder, C. Gehl, and V. Franc, \u201cThe SHOGUN Machine Learning Toolbox,\u201d Journal of Machine Learning Research, 2010.\n[26] F. Orabona and L. Jie, \u201cUltra-fast optimization algorithm for sparse multi kernel learning,\u201d in Proceedings of the 28th International Conference on Machine Learning, 2011.\n[27] A. Zien and C. S. Ong, \u201cMulticlass multiple kernel learning,\u201d in ICML 24, pp. 1191\u20131198, ACM, 2007.\n[28] T. Damoulas and M. A. Girolami, \u201cProbabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection,\u201d Bioinformatics, vol. 24, no. 10, pp. 1264\u20131270, 2008.\n[29] P. Bartlett and S. Mendelson, \u201cEmpirical minimization,\u201d Probab. Theory Related Fields, vol. 135(3), pp. 311\u2013334, 2006.\n[30] A. B. 
Tsybakov, \u201cOptimal aggregation of classifiers in statistical learning,\u201d Annals of Statistics, vol. 32, pp. 135\u2013166, 2004.", "award": [], "sourceid": 1273, "authors": [{"given_name": "Corinna", "family_name": "Cortes", "institution": "Google Research"}, {"given_name": "Marius", "family_name": "Kloft", "institution": "Courant Institute, NYU & Sloan-Kettering Institute (MSKCC)"}, {"given_name": "Mehryar", "family_name": "Mohri", "institution": "Courant Institute, NYU & Google"}]}