{"title": "A General and Efficient Multiple Kernel Learning Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 1273, "page_last": 1280, "abstract": "", "full_text": "A General and Ef\ufb01cient Multiple Kernel\n\nLearning Algorithm\n\nS\u00a8oren Sonnenburg\u2217\nFraunhofer FIRST\n\nKekul\u00b4estr. 7\n12489 Berlin\n\nGermany\n\nGunnar R\u00a8atsch\n\nFriedrich Miescher Lab\n\nMax Planck Society\n\nSpemannstr. 39\n\nT\u00a8ubingen, Germany\n\nChristin Sch\u00a8afer\nFraunhofer FIRST\n\nKekul\u00b4estr. 7\n12489 Berlin\n\nGermany\n\nsonne@first.fhg.de\n\nraetsch@tue.mpg.de\n\nchristin@first.fhg.de\n\nAbstract\n\nWhile classical kernel-based learning algorithms are based on a single\nkernel, in practice it is often desirable to use multiple kernels. Lankriet\net al. (2004) considered conic combinations of kernel matrices for classi-\n\ufb01cation, leading to a convex quadratically constraint quadratic program.\nWe show that it can be rewritten as a semi-in\ufb01nite linear program that\ncan be ef\ufb01ciently solved by recycling the standard SVM implementa-\ntions. Moreover, we generalize the formulation and our method to a\nlarger class of problems, including regression and one-class classi\ufb01ca-\ntion. Experimental results show that the proposed algorithm helps for\nautomatic model selection, improving the interpretability of the learn-\ning result and works for hundred thousands of examples or hundreds of\nkernels to be combined.\n\nf (x) = sign N\nXi=1\n\n\u03b1iyik(xi, x) + b! ,\n\n(1)\n\nIntroduction\n\n1\nKernel based methods such as Support Vector Machines (SVMs) have proven to be pow-\nerful for a wide range of different data analysis problems. They employ a so-called kernel\nfunction k(xi, xj) which intuitively computes the similarity between two examples xi and\nxj. 
The result of SVM learning is the α-weighted linear combination of kernel elements and the bias b in (1), where the x_i's are the N labeled training examples (y_i ∈ {±1}).

Recent developments in the literature on SVMs and other kernel methods have shown the need to consider multiple kernels. This provides flexibility and also reflects the fact that typical learning problems often involve multiple, heterogeneous data sources. While this so-called "multiple kernel learning" (MKL) problem can in principle be solved via cross-validation, several recent papers have focused on more efficient methods for multiple kernel learning [4, 5, 1, 7, 3, 9, 2].

One of the problems of kernel methods compared to other techniques is that the resulting decision function (1) is hard to interpret and, hence, difficult to use for extracting relevant knowledge about the problem at hand. One can approach this problem by considering convex combinations of K kernels, i.e.

    k(x_i, x_j) = Σ_{k=1}^{K} β_k k_k(x_i, x_j)                                  (2)

with β_k ≥ 0 and Σ_k β_k = 1, where each kernel k_k uses only a distinct set of features of each instance. For appropriately designed sub-kernels k_k, the optimized combination coefficients can then be used to understand which features of the examples are important for discrimination: if an accurate classification can be obtained with a sparse weighting β_k, then the resulting decision function is easy to interpret. We will illustrate that the considered MKL formulation provides useful insights and is at the same time very efficient.

*For more details, datasets and pseudocode see http://www.fml.tuebingen.mpg.de/raetsch/projects/mkl_silp.
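On precomputed kernel matrices, the convex combination (2) is a single weighted sum; a minimal NumPy sketch (the matrices and weights below are illustrative only):

```python
import numpy as np

def combine_kernels(kernel_matrices, beta):
    # k(x_i, x_j) = sum_k beta_k k_k(x_i, x_j), cf. equation (2);
    # the weights must lie on the simplex: beta_k >= 0, sum_k beta_k = 1
    beta = np.asarray(beta, dtype=float)
    assert np.all(beta >= 0.0) and np.isclose(beta.sum(), 1.0)
    return sum(b * K for b, K in zip(beta, kernel_matrices))
```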
This is an important property missing in current kernel-based algorithms.

We consider the framework proposed by [7], which results in a convex optimization problem - a quadratically-constrained quadratic program (QCQP). This problem is more challenging than the standard SVM QP, but it can in principle be solved by general-purpose optimization toolboxes. Since the use of such algorithms is only feasible for small problems with few data points and kernels, [1] suggested an algorithm based on sequential minimal optimization (SMO) [10]. While the kernel learning problem is convex, it is also non-smooth, making the direct application of simple local descent algorithms such as SMO infeasible. [1] therefore considered a smoothed version of the problem to which SMO can be applied.

In this work we follow a different direction: We reformulate the problem as a semi-infinite linear program (SILP), which can be efficiently solved using an off-the-shelf LP solver and a standard SVM implementation (cf. Section 2 for details). Using this approach we are able to solve problems with more than a hundred thousand examples or with several hundred kernels quite efficiently. We have used it for sequence analysis problems, leading to a better understanding of the biological problem at hand [16, 13]. We extend our previous work and show that the transformation to a SILP works for a large class of convex loss functions (cf. Section 3). Our column-generation based algorithm for solving the SILP works by repeatedly calling an algorithm that can efficiently solve the single-kernel problem in order to solve the MKL problem.
Hence, if there exists an algorithm that solves the simpler problem efficiently (like SVMs), then our new algorithm can efficiently solve the multiple kernel learning problem.

We conclude the paper by illustrating the usefulness of our algorithms in several examples relating to the interpretation of results and to automatic model selection.

2 Multiple Kernel Learning for Classification using SILP

In the Multiple Kernel Learning (MKL) problem for binary classification one is given N data points (x_i, y_i) (y_i ∈ {±1}), where x_i is translated via K mappings Φ_k(x) ∈ R^{D_k}, k = 1, ..., K, from the input space into K feature spaces (Φ_1(x_i), ..., Φ_K(x_i)), where D_k denotes the dimensionality of the k-th feature space. One then solves the following optimization problem [1], which is equivalent to the linear SVM for K = 1:¹

    min_{w_k ∈ R^{D_k}, ξ ∈ R^N_+, β ∈ R^K_+, b ∈ R}   (1/2) ( Σ_{k=1}^{K} β_k ||w_k||_2 )² + C Σ_{i=1}^{N} ξ_i          (3)

    s.t.   y_i ( Σ_{k=1}^{K} β_k w_k^T Φ_k(x_i) + b ) ≥ 1 − ξ_i   for all i = 1, ..., N,   and   Σ_{k=1}^{K} β_k = 1.

¹ [1] used a slightly different but equivalent formulation (assuming tr(K_k) = 1, k = 1, ..., K) without the β's, which we introduced for illustration.

Note that the ℓ1-norm of β is constrained to one, while the ℓ2-norm of w_k is penalized in each block k separately. The idea is that ℓ1-norm constrained or penalized variables tend to have sparse optimal solutions, while ℓ2-norm penalized variables do not [11]. Thus the above optimization problem offers the possibility to find sparse solutions on the block level with non-sparse solutions within the blocks.

Bach et al.
[1] derived the dual for problem (3), which can be equivalently written as:

    min_{γ ∈ R, C1 ≥ α ∈ R^N_+}   γ
    s.t.   S_k(α) := (1/2) Σ_{i,j=1}^{N} α_i α_j y_i y_j k_k(x_i, x_j) − Σ_{i=1}^{N} α_i  ≤  γ   for k = 1, ..., K,
    and    Σ_{i=1}^{N} α_i y_i = 0,                                              (4)

where k_k(x_i, x_j) = ⟨Φ_k(x_i), Φ_k(x_j)⟩. Note that we have one quadratic constraint per kernel (S_k(α) ≤ γ). In the case of K = 1, the above problem reduces to the original SVM dual.

In order to solve (4), one may solve the following saddle point problem (Lagrangian):

    L := γ + Σ_{k=1}^{K} β_k (S_k(α) − γ),                                       (5)

minimized w.r.t. α ∈ R^N_+, γ ∈ R (subject to α ≤ C1 and Σ_i α_i y_i = 0) and maximized w.r.t. β ∈ R^K_+. Setting the derivative w.r.t. γ to zero yields the constraint Σ_k β_k = 1, and (5) simplifies to L = S(α, β) := Σ_{k=1}^{K} β_k S_k(α). This leads to the min-max problem

    max_{β ∈ R^K_+}  min_{C1 ≥ α ∈ R^N_+}   Σ_{k=1}^{K} β_k S_k(α)
    s.t.   Σ_{i=1}^{N} α_i y_i = 0   and   Σ_{k=1}^{K} β_k = 1.                  (6)

Let α* be the optimal solution; then θ* := S(α*, β) is minimal and, hence, S(α, β) ≥ θ* for all α (subject to the above constraints). Finding a saddle point of (5) is therefore equivalent to solving the following semi-infinite linear program:

    max_{θ ∈ R, β ∈ R^K_+}   θ
    s.t.   Σ_{k=1}^{K} β_k = 1   and   Σ_{k=1}^{K} β_k S_k(α) ≥ θ
           for all α with 0 ≤ α ≤ C1 and Σ_i y_i α_i = 0.                        (7)

Note that this is a linear program, as θ and β are only linearly constrained. However, there are infinitely many constraints: one for each α ∈ R^N satisfying 0 ≤ α ≤ C1 and Σ_{i=1}^{N} α_i y_i = 0. Both problems (6) and (7) have the same solution. To illustrate this, fix β and let α* be the minimizer in (6). Then we can decrease the value of θ in (7) as long as no α-constraint of (7) is violated, i.e. down to θ = Σ_{k=1}^{K} β_k S_k(α*). Similarly, as we increase θ for fixed α, the maximizing β is found. We will discuss in Section 4 how to solve such semi-infinite linear programs.

3 Multiple Kernel Learning with General Cost Functions

In this section we consider the more general class of MKL problems, where one is given an arbitrary strictly convex differentiable loss function, for which we derive the corresponding MKL SILP formulation. We then instantiate this general MKL SILP for different loss functions, in particular the soft-margin loss, the ε-insensitive loss and the quadratic loss.

We define the MKL primal formulation for a strictly convex and differentiable loss function L as (for simplicity we omit a bias term):

    min_{w_k ∈ R^{D_k}}   (1/2) ( Σ_{k=1}^{K} ||w_k||_2 )² + Σ_{i=1}^{N} L(f(x_i), y_i)
    s.t.   f(x_i) = Σ_{k=1}^{K} ⟨Φ_k(x_i), w_k⟩.                                 (8)

In analogy to [1] we treat problem (8) as a second order cone program (SOCP), leading to the following dual (see the Supplementary Website or [17] for details):

    min_{γ ∈ R, α ∈ R^N}   γ − Σ_{i=1}^{N} L(L′⁻¹(α_i, y_i), y_i) + Σ_{i=1}^{N} α_i L′⁻¹(α_i, y_i)
    s.t.   (1/2) || Σ_{i=1}^{N} α_i Φ_k(x_i) ||_2²  ≤  γ,   ∀k = 1, ..., K,      (9)

where L′⁻¹(·, y) denotes the inverse of the derivative of L(·, y) with respect to its first argument. To derive the SILP formulation we follow the same recipe as in Section 2: deriving the Lagrangian leads to a max-min problem that is eventually reformulated as a SILP:

    max_{θ ∈ R, β ∈ R^K_+}   θ
    s.t.   Σ_{k=1}^{K} β_k = 1   and   Σ_{k=1}^{K} β_k S_k(α) ≥ θ,   ∀α ∈ R^N,

where

    S_k(α) = − Σ_{i=1}^{N} L(L′⁻¹(α_i, y_i), y_i) + Σ_{i=1}^{N} α_i L′⁻¹(α_i, y_i) + (1/2) || Σ_{i=1}^{N} α_i Φ_k(x_i) ||_2².

We assumed that L(x, y) is strictly convex and differentiable in x. Unfortunately, the soft-margin and ε-insensitive losses do not have these properties. We therefore consider them separately in the sequel.

Soft-Margin Loss.  We use the following loss in order to approximate the soft-margin loss: L_σ(x, y) = (C/σ) log(1 + exp((1 − xy)σ)). It is easy to verify that lim_{σ→∞} L_σ(x, y) = C(1 − xy)_+. Moreover, L_σ is strictly convex and differentiable for σ < ∞. Using this loss and assuming y_i ∈ {±1}, we obtain

    S_k(α) = − (C/σ) Σ_{i=1}^{N} [ log( C y_i / (α_i + C y_i) ) + log( −α_i / (α_i + C y_i) ) ]
             + Σ_{i=1}^{N} α_i y_i + (1/2) || Σ_{i=1}^{N} α_i Φ_k(x_i) ||_2².

If σ → ∞, the first two terms vanish, provided that −C ≤ α_i ≤ 0 if y_i = 1 and 0 ≤ α_i ≤ C if y_i = −1. Substituting α_i = −α̃_i y_i, we then obtain

    S_k(α̃) = − Σ_{i=1}^{N} α̃_i + (1/2) || Σ_{i=1}^{N} α̃_i y_i Φ_k(x_i) ||_2²,

with 0 ≤ α̃_i ≤ C (i = 1, ..., N), which is very similar to (4): only the Σ_i α_i y_i = 0 constraint is missing, since we omitted the bias.

One-Class Soft-Margin Loss.  The one-class SVM soft margin (e.g. [15]) is very similar to the two-class case and leads to

    S_k(α) = (1/2) || Σ_{i=1}^{N} α_i Φ_k(x_i) ||_2²,

subject to 0 ≤ α ≤ (1/(νN)) 1 and Σ_{i=1}^{N} α_i = 1.

ε-Insensitive Loss.  Using the same technique for the ε-insensitive loss L(x, y) = C(|x − y| − ε)_+, we obtain

    S_k(α, α*) = (1/2) || Σ_{i=1}^{N} (α_i − α*_i) Φ_k(x_i) ||_2² + Σ_{i=1}^{N} (α_i + α*_i) ε − Σ_{i=1}^{N} (α_i − α*_i) y_i,

with 0 ≤ α, α* ≤ C1. When including a bias term, we additionally have the constraint Σ_{i=1}^{N} (α_i − α*_i) = 0.

It is straightforward to derive the dual problem for other loss functions, such as the quadratic loss. Note that the dual SILPs differ only in the definition of S_k and the domains of the α's.

4 Algorithms to solve SILPs

The SILPs considered in this work all have the following form:

    max_{θ ∈ R, β ∈ R^K_+}   θ   s.t.   Σ_{k=1}^{K} β_k = 1   and   Σ_{k=1}^{K} β_k S_k(α) ≥ θ   for all α ∈ C          (10)

for some appropriate S_k(α) and a feasible set C ⊆ R^N for α, both depending on the choice of the cost function. Using Theorem 5 in [12] one can show that the above SILP has a solution if the corresponding primal is feasible and bounded.
Moreover, there is no duality gap if M = co{ [S_1(α), ..., S_K(α)]^T | α ∈ C } is a closed set. For all loss functions considered in this paper this holds true.

We propose to use a technique called column generation to solve (10). The basic idea is to compute the optimal (β, θ) in (10) for a restricted subset of constraints; this is called the restricted master problem. A second algorithm then generates a new constraint determined by α. In the best case this second algorithm finds the constraint that maximizes the constraint violation for the given intermediate solution (β, θ), i.e.

    α_β := argmin_{α ∈ C}  Σ_k β_k S_k(α).                                       (11)

If α_β satisfies the constraint Σ_{k=1}^{K} β_k S_k(α_β) ≥ θ, then the solution is optimal. Otherwise, the constraint is added to the set of constraints.

Algorithm 1 is a special case of the class of SILP algorithms known as exchange methods. These methods are known to converge (cf. Theorem 7.2 in [6]); however, no convergence rates for such algorithms are known so far.² Since it is often sufficient to obtain an approximate solution, we have to define a suitable convergence criterion. Note that the problem is solved when all constraints are satisfied. Hence, it is natural to use the normalized maximal constraint violation as a convergence criterion, i.e.

    ε := | 1 − ( Σ_{k=1}^{K} β^t_k S_k(α^t) ) / θ^t |,

where (β^t, θ^t) is the optimal solution at iteration t − 1 and α^t corresponds to the newly found maximally violating constraint of the next iteration.

We need an algorithm to identify unsatisfied constraints, which, fortunately, turns out to be particularly simple. Note that (11) is, for all considered cases, exactly the dual optimization problem of the single-kernel case for fixed β.
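A minimal sketch of this column-generation scheme. The helper names are ours, not the paper's: `S_funcs[k]` evaluates S_k(α), and `solve_single_kernel(beta)` is assumed to return the α minimizing Σ_k β_k S_k(α) over C (for classification, a standard SVM trained on the β-weighted kernel). The restricted master LP is handed to `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

def mkl_silp(S_funcs, solve_single_kernel, beta0, eps=1e-3, max_iter=100):
    """Column generation for the SILP (10).

    S_funcs             -- list of K callables, S_funcs[k](alpha) = S_k(alpha)
    solve_single_kernel -- callable(beta) -> alpha minimizing sum_k beta_k S_k(alpha)
    """
    K = len(S_funcs)
    beta = np.asarray(beta0, dtype=float)
    theta = -np.inf
    rows = []  # one row [S_1(alpha_r), ..., S_K(alpha_r)] per generated constraint
    for _ in range(max_iter):
        alpha = solve_single_kernel(beta)        # most violated constraint, cf. (11)
        S = np.array([Sk(alpha) for Sk in S_funcs])
        if np.isfinite(theta) and abs(1.0 - (beta @ S) / theta) <= eps:
            break                                # normalized violation below eps
        rows.append(S)
        # restricted master problem: maximize theta subject to
        #   sum_k beta_k = 1  and  sum_k beta_k S_k(alpha_r) >= theta  for stored r;
        # variables x = [theta, beta_1, ..., beta_K]; linprog minimizes c @ x
        A_ub = np.hstack([np.ones((len(rows), 1)), -np.asarray(rows)])
        res = linprog(c=[-1.0] + [0.0] * K,
                      A_ub=A_ub, b_ub=np.zeros(len(rows)),
                      A_eq=[[0.0] + [1.0] * K], b_eq=[1.0],
                      bounds=[(None, None)] + [(0.0, None)] * K)
        theta, beta = res.x[0], res.x[1:]
    return beta, theta
```

On exit, β is the learned kernel weighting and θ the SILP objective value; the stopping test is the normalized constraint violation described above (it assumes θ ≠ 0 once constraints exist).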
For instance, for binary classification, (11) reduces to the standard SVM dual using the kernel k(x_i, x_j) = Σ_k β_k k_k(x_i, x_j):

    min_{α ∈ R^N}   (1/2) Σ_{i,j=1}^{N} α_i α_j y_i y_j k(x_i, x_j) − Σ_{i=1}^{N} α_i
    with   0 ≤ α ≤ C1   and   Σ_{i=1}^{N} α_i y_i = 0.

We can therefore use a standard SVM implementation to identify the most violated constraint. Since there exists a large number of efficient algorithms for solving the single-kernel problems for all sorts of cost functions, we have thus found an easy way to extend their applicability to the problem of multiple kernel learning. In some cases it is possible to extend existing SMO-based implementations to simultaneously optimize β and α. In [16] we considered such an algorithm for the binary classification case that frequently recomputes the β's.³ Empirically it is a few times faster than the column generation algorithm, but it is, on the other hand, much harder to implement.

² It has been shown that solving semi-infinite problems like (7) using a method related to boosting (e.g. [8]) requires at most T = O(log(M)/ε̂²) iterations, where ε̂ is the remaining constraint violation; the constants may depend on the kernels and the number of examples N [11, 14]. At least for not too small values of ε̂, this technique quickly produces reasonably good approximate solutions.
³ Simplex-based LP solvers often offer the possibility to efficiently restart the computation when only a few constraints are added.

Algorithm 1  The column generation algorithm employs a linear programming solver to iteratively solve the semi-infinite linear optimization problem (10). The accuracy parameter ε is a parameter of the algorithm. S_k(α) and C are determined by the cost function.

    S⁰ = 1,  θ¹ = −∞,  β¹_k = 1/K  for k = 1, ..., K
    for t = 1, 2, ... do
        compute α^t = argmin_{α ∈ C} Σ_{k=1}^{K} β^t_k S_k(α)
            by the single-kernel algorithm with kernel K = Σ_{k=1}^{K} β^t_k K_k
        S^t = Σ_{k=1}^{K} β^t_k S_k(α^t)
        if |1 − S^t / θ^t| ≤ ε then break
        (β^{t+1}, θ^{t+1}) = argmax θ   w.r.t. β ∈ R^K_+, θ ∈ R
            with Σ_{k=1}^{K} β_k = 1 and Σ_{k=1}^{K} β_k S_k(α^r) ≥ θ for r = 1, ..., t
    end for

5 Experiments
In this section we discuss toy examples for binary classification and regression, demonstrating that MKL can recover information about the problem at hand, followed by a brief review of problems for which MKL has been successfully used.

5.1 Classification
In Figure 1 we consider a binary classification problem, where we used MKL-SVMs with five RBF kernels of different widths to distinguish the dark star-like shape from the light star. (The distance between the stars increases from left to right.) Shown are the obtained kernel weightings for the five kernels and the test error, which quickly drops to zero as the problem becomes separable. Note that the RBF kernel with the largest width was not appropriate and was thus never chosen. With increasing distance between the stars, kernels with greater widths are used. This illustrates that MKL can indeed recover such tendencies.

5.2 Regression
We applied the newly derived MKL support vector regression formulation to the task of learning a sine function, using three RBF kernels with different widths. We then increased the frequency of the sine wave. As can be seen in Figure 2, MKL-SV regression abruptly switches to the width of the RBF kernel that fits the regression problem best.
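For illustration, the RBF sub-kernel matrices for such toy experiments can be precomputed as follows; the widths are those listed for Figure 2, while the input grid is a hypothetical choice of ours:

```python
import numpy as np

def rbf_matrix(X, width):
    # Gaussian kernel matrix: K[i, j] = exp(-||x_i - x_j||^2 / (2 * width^2))
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * width ** 2))

# a hypothetical 1-D input grid and the five candidate widths shown in Figure 2
X = np.linspace(0.0, 10.0, 50)[:, None]
widths = [0.005, 0.05, 0.5, 1.0, 10.0]
kernels = [rbf_matrix(X, w) for w in widths]   # the sub-kernels k_k fed to MKL
```

Each matrix would then enter the β-weighted combination (2) inside the MKL solver.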
In another regression experiment, we combined a linear function with two sine waves, one of lower frequency and one of high frequency, i.e. f(x) = c·x + sin(ax) + sin(bx). Using ten RBF kernels of different widths (see Figure 3) we trained an MKL-SVR and display the learned weights (one column in the figure per frequency). The largest selected width (100) models the linear component (since RBF kernels with large widths are effectively linear) and the medium width (1) corresponds to the lower-frequency sine. We varied the frequency of the high-frequency sine wave from low to high (left to right in the figure). One observes that MKL determines an appropriate combination of kernels of low and high widths, decreasing the RBF-kernel width as the frequency increases. This shows that MKL can be more powerful than cross-validation: to achieve a similar result with cross-validation one has to use 3 nested loops to tune 3 RBF-kernel widths, e.g. train 10·9·8/6 = 120 SVMs, which in preliminary experiments was much slower than using MKL (800 vs. 56 seconds).

Figure 1: A 2-class toy problem where the dark grey star-like shape is to be distinguished from the light grey star inside of the dark grey star. For details see text.

Figure 2: MKL support vector regression for the task of learning a sine wave: kernel weights for the five RBF widths (0.005, 0.05, 0.5, 1, 10) as a function of the frequency (please see text for details).

Figure 3: MKL support vector regression on a linear combination of three functions: f(x) = c·x + sin(ax) + sin(bx); learned weights for ten RBF-kernel widths (0.001 to 1000) as a function of the frequency of the high-frequency sine. MKL recovers that the original function is a combination of functions of low and high complexity. For more details see text.

5.3 Applications in the Real World
MKL has been successfully used on real-world datasets in the field of computational biology [7, 16]. It was shown to improve classification performance on the tasks of ribosomal and membrane protein prediction, where a weighting over different kernels, each corresponding to a different feature set, was learned. Random channels obtained low kernel weights. Moreover, on a splice site recognition task we used MKL as a tool for interpreting the SVM classifier [16], as displayed in Figure 4. Using specifically optimized string kernels, we were able to solve the classification MKL SILP for N = 1,000,000 examples and K = 20 kernels, as well as for N = 10,000 examples and K = 550 kernels.

Figure 4: An importance weighting for each position in a DNA sequence (around a so-called splice site). MKL was used to learn these weights, each corresponding to a sub-kernel that uses information at that position to discriminate true splice sites from fake ones. Different peaks correspond to different biologically known signals (see [16] for details). We used 65,000 examples for training with 54 sub-kernels.

6 Conclusion
We have proposed a simple, yet efficient algorithm to solve the multiple kernel learning problem for a large class of loss functions. The proposed method is able to exploit existing single-kernel algorithms, thereby extending their applicability.
In experiments we have illustrated that MKL for classification and regression can be useful for automatic model selection and for obtaining comprehensible information about the learning problem at hand. It remains future work to evaluate MKL algorithms for unsupervised learning, such as kernel PCA, and for one-class classification.

Acknowledgments
The authors gratefully acknowledge partial support from the PASCAL Network of Excellence (EU #506778) and DFG grants JA 379/13-2 and MU 987/2-1. We thank Guido Dornhege, Olivier Chapelle, Olaf Weiss, Joaquin Quiñonero Candela, Sebastian Mika and K.-R. Müller for great discussions.

References
[1] Francis R. Bach, Gert R. G. Lanckriet, and Michael I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Twenty-first International Conference on Machine Learning. ACM Press, 2004.
[2] Kristin P. Bennett, Michinari Momma, and Mark J. Embrechts. MARK: a boosting algorithm for heterogeneous kernel models. In KDD, pages 24-31, 2002.
[3] Jinbo Bi, Tong Zhang, and Kristin P. Bennett. Column-generation boosting methods for mixture of kernels. In KDD, pages 521-526, 2004.
[4] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1-3):131-159, 2002.
[5] Y. Grandvalet and S. Canu. Adaptive scaling for feature selection in SVMs. In Advances in Neural Information Processing Systems, 2002.
[6] R. Hettich and K. O. Kortanek. Semi-infinite programming: theory, methods and applications. SIAM Review, 35(3):380-429, September 1993.
[7] G. R. G. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A statistical framework for genomic data fusion. Bioinformatics, 2004.
[8] R. Meir and G. Rätsch. An introduction to boosting and leveraging. In S. Mendelson and A. Smola, editors, Proc. of the first Machine Learning Summer School in Canberra, LNCS, pages 119-184. Springer, 2003.
[9] C. S. Ong, A. J. Smola, and R. C. Williamson. Hyperkernels. In Advances in Neural Information Processing Systems, volume 15, pages 495-502, 2003.
[10] J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 185-208. MIT Press, Cambridge, MA, 1999.
[11] G. Rätsch. Robust Boosting via Convex Optimization. PhD thesis, University of Potsdam, Computer Science Dept., August-Bebel-Str. 89, 14482 Potsdam, Germany, 2001.
[12] G. Rätsch, A. Demiriz, and K. Bennett. Sparse regression ensembles in infinite and finite hypothesis spaces. Machine Learning, 48(1-3):193-221, 2002. Special Issue on New Methods for Model Selection and Model Combination; also NeuroCOLT2 Technical Report NC-TR-2000-085.
[13] G. Rätsch, S. Sonnenburg, and C. Schäfer. Learning interpretable SVMs for biological sequence classification. BMC Bioinformatics, 7(Suppl. 1):S9, February 2006. Special issue from the NIPS workshop on New Problems and Methods in Computational Biology, Whistler, Canada, 18 December 2004.
[14] G. Rätsch and M. K. Warmuth. Marginal boosting. NeuroCOLT2 Technical Report 97, Royal Holloway College, London, July 2001.
[15] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[16] S. Sonnenburg, G. Rätsch, and C. Schäfer. Learning interpretable SVMs for biological sequence classification. In RECOMB 2005, LNBI 3500, pages 389-407. Springer-Verlag, Berlin Heidelberg, 2005.
[17] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 2006. Accepted.