{"title": "Multiple Kernel Learning and the SMO Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 2361, "page_last": 2369, "abstract": "Our objective is to train $p$-norm Multiple Kernel Learning (MKL) and, more generally, linear MKL regularised by the Bregman divergence, using the Sequential Minimal Optimization (SMO) algorithm. The SMO algorithm is simple, easy to implement and adapt, and efficiently scales to large problems. As a result, it has gained widespread acceptance, and SVMs are routinely trained using SMO in diverse real-world applications. Training using SMO has been a long-standing goal in MKL for the very same reasons. Unfortunately, the standard MKL dual is not differentiable and therefore cannot be optimised using SMO-style co-ordinate ascent. In this paper, we demonstrate that linear MKL regularised with the $p$-norm squared, or with certain Bregman divergences, can indeed be trained using SMO. The resulting algorithm retains both simplicity and efficiency and is significantly faster than the state-of-the-art specialised $p$-norm MKL solvers. We show that we can train on a hundred thousand kernels in approximately seven minutes and on fifty thousand points in less than half an hour on a single core.", "full_text": "Multiple Kernel Learning and the SMO Algorithm

S. V. N. Vishwanathan, Zhaonan Sun, Nawanol Theera-Ampornpunt
Purdue University
vishy@stat.purdue.edu, sunz@stat.purdue.edu, ntheeraa@cs.purdue.edu

Manik Varma
Microsoft Research India
manik@microsoft.com

Abstract

Our objective is to train p-norm Multiple Kernel Learning (MKL) and, more generally, linear MKL regularised by the Bregman divergence, using the Sequential Minimal Optimization (SMO) algorithm. The SMO algorithm is simple, easy to implement and adapt, and efficiently scales to large problems.
As a result, it has gained widespread acceptance, and SVMs are routinely trained using SMO in diverse real-world applications. Training using SMO has been a long-standing goal in MKL for the very same reasons. Unfortunately, the standard MKL dual is not differentiable and therefore cannot be optimised using SMO-style co-ordinate ascent. In this paper, we demonstrate that linear MKL regularised with the p-norm squared, or with certain Bregman divergences, can indeed be trained using SMO. The resulting algorithm retains both simplicity and efficiency and is significantly faster than state-of-the-art specialised p-norm MKL solvers. We show that we can train on a hundred thousand kernels in approximately seven minutes and on fifty thousand points in less than half an hour on a single core.

1 Introduction

Research on Multiple Kernel Learning (MKL) needs to follow a two-pronged approach. It is important to explore formulations which lead to improvements in prediction accuracy. Recent trends indicate that performance gains can be achieved by non-linear kernel combinations [7, 18, 21], learning over large kernel spaces [2] and by using general, or non-sparse, regularisation [6, 7, 12, 18]. Simultaneously, efficient optimisation techniques need to be developed to scale MKL out of the lab and into the real world. Such algorithms can help in investigating new application areas and different facets of the MKL problem, including dealing with a very large number of kernels and data points.

Optimisation using decompositional algorithms such as Sequential Minimal Optimization (SMO) [15] has been a long-standing goal in MKL [3] as the algorithms are simple, easy to implement and efficiently scale to large problems.
The hope is that they might do for MKL what SMO did for SVMs: allow people to play with MKL on their laptops, modify and adapt it for diverse real-world applications, and explore large-scale settings in terms of the number of kernels and data points.

Unfortunately, the standard MKL formulation, which learns a linear combination of base kernels subject to l1 regularisation, leads to a dual which is not differentiable. SMO cannot be applied as a result, and [3] had to resort to expensive Moreau-Yosida regularisation to smooth the dual. State-of-the-art algorithms today overcome this limitation by solving an intermediate saddle point problem rather than the dual itself [12, 16].

Our focus, in this paper, is on training p-norm MKL, with p > 1, using the SMO algorithm. More generally, we prove that linear MKL regularised by certain Bregman divergences can also be trained using SMO. We shift the emphasis firmly back towards solving the dual in such cases. The lp-MKL dual is shown to be differentiable and thereby amenable to co-ordinate ascent. Placing the p-norm squared regulariser in the objective lets us efficiently solve the core reduced two-variable optimisation problem analytically in some cases and algorithmically in others. Using results from [4, 9], we can compute the lp-MKL Hessian, which brings into play second-order variable selection methods which tremendously speed up the rate of convergence [8]. The standard decompositional method proof of convergence [14] to the global optimum holds with minor modifications.

The resulting optimisation algorithm, which we call SMO-MKL, is straightforward to implement and efficient. We demonstrate that SMO-MKL can be significantly faster than the state-of-the-art specialised p-norm solvers [12]. We empirically show that the SMO-MKL algorithm is robust, with the desirable property that it is not greatly affected within large operating ranges of p.
This implies that our algorithm is well suited for learning both sparse and non-sparse kernel combinations. Furthermore, SMO-MKL scales well to large problems. We show that we can efficiently combine a hundred thousand kernels in approximately seven minutes, or train on fifty thousand points in less than half an hour, using a single core on standard hardware where other solvers fail to produce results. The SMO-MKL code can be downloaded from [20].

2 Related Work

Recent trends indicate that there are three promising directions of research for obtaining performance improvements using MKL. The first involves learning non-linear kernel combinations. A framework for learning general non-linear kernel combinations subject to general regularisation was presented in [18]. It was demonstrated that, for feature selection, the non-linear GMKL formulation could perform significantly better than not only linear MKL but also state-of-the-art wrapper methods and filter methods with averaging. Very significant performance gains in terms of pure classification accuracy were reported in [21] by learning a different kernel combination per data point or cluster. Again, the results were better than not only linear MKL but also baselines such as averaging. Similar trends were observed for regression while learning polynomial kernel combinations [7]. Other promising directions which have resulted in performance gains are sticking to standard MKL but combining an exponentially large number of kernels [2], and linear MKL with p-norm regularisers [6, 12]. Thus MKL-based methods are beginning to define the state-of-the-art for very competitive applications, such as object recognition on the Caltech 101 database [21] and object detection on the PASCAL VOC 2009 challenge [19].

In terms of optimisation, initial work on MKL leveraged general-purpose SDP and QCQP solvers [13]. The SMO+M.-Y.
regularisation method of [3] was one of the first techniques that could efficiently tackle medium-scale problems. This was superseded by the SILP technique of [17], which could, very impressively, train on a million-point problem with twenty kernels using parallelism. Unfortunately, the method did not scale well with the number of kernels. In response, many two-stage wrapper techniques emerged [2, 10, 12, 16, 18] which could be significantly faster when the number of training points was reasonable but the number of kernels large. SMO could indirectly be used in some of these cases to solve the inner SVM optimisation. The primary disadvantage of these techniques was that they solved the inner SVM to optimality. In fact, the solution needed to be of high enough precision that the kernel weight gradient computation was accurate and the algorithm converged. In addition, Armijo-rule-based step size selection was also very expensive and could involve tens of inner SVM evaluations in a single line search. This was particularly expensive since the kernel cache would be invalidated from one SVM evaluation to the next. The one big advantage of such two-stage methods for l1-MKL was that they could quickly identify, and discard, the kernels with zero weights and thus scaled well with the number of kernels. Most recently, [12] have come up with specialised p-norm solvers which make substantial gains by not solving the inner SVM to optimality and by working with a small active set to better utilise the kernel cache.

3 The lp-MKL Formulation

The objective in MKL is to jointly learn kernel and SVM parameters from training data {(x_i, y_i)}. Given a set of base kernels {K_k} and corresponding feature maps {φ_k}, linear MKL aims to learn a linear combination of the base kernels as K = Σ_k d_k K_k.
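In matrix terms, the combined kernel is just a weighted sum of the base Gram matrices. A minimal NumPy sketch (the toy data, base kernels and weights below are purely illustrative, not from the paper's experiments):

```python
import numpy as np

def combine_kernels(kernels, d):
    """Form K = sum_k d_k K_k from base Gram matrices and non-negative weights d."""
    assert np.all(d >= 0), "kernel weights must be non-negative"
    return sum(dk * Kk for dk, Kk in zip(d, kernels))

# Two toy base kernels on three 1-D points: linear and degree-2 polynomial.
X = np.array([[1.0], [2.0], [3.0]])
K1 = X @ X.T          # linear kernel
K2 = K1 ** 2          # (homogeneous) polynomial kernel of degree 2
K = combine_kernels([K1, K2], np.array([0.5, 0.25]))
```

Since each d_k is non-negative and each K_k is positive semi-definite, the combination K is again a valid kernel.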
If the kernel weights are restricted to be non-negative, then the MKL task corresponds to learning a standard SVM in the feature space formed by concatenating the vectors √d_k φ_k. The primal can therefore be formulated as

min_{w,b,ξ≥0,d≥0} (1/2) Σ_k w_k^t w_k + C Σ_i ξ_i + (λ/2) (Σ_k d_k^p)^{2/p}  s.t.  y_i(Σ_k √d_k w_k^t φ_k(x_i) + b) ≥ 1 − ξ_i   (1)

The regularisation on the kernel weights is necessary to prevent them from shooting off to infinity. Which regulariser one uses depends on the task at hand. In this Section, we limit ourselves to the p-norm squared regulariser with p > 1. If it is felt that certain kernels are noisy and should be discarded, then a sparse solution can be obtained by letting p tend to unity from above. Alternatively, if the application demands dense solutions, then larger values of p should be selected. Note that the primal above can be made convex by substituting w_k for √d_k w_k to get

min_{w,b,ξ≥0,d≥0} (1/2) Σ_k w_k^t w_k / d_k + C Σ_i ξ_i + (λ/2) (Σ_k d_k^p)^{2/p}  s.t.  y_i(Σ_k w_k^t φ_k(x_i) + b) ≥ 1 − ξ_i   (2)

We first derive an intermediate saddle point optimisation problem obtained by minimising only over w, b and ξ. The Lagrangian is

L = (1/2) Σ_k w_k^t w_k / d_k + Σ_i (C − β_i) ξ_i + (λ/2) (Σ_k d_k^p)^{2/p} − Σ_i α_i [y_i(Σ_k w_k^t φ_k(x_i) + b) − 1 + ξ_i]   (3)

Differentiating with respect to w, b and ξ to get the optimality conditions and substituting back results in the following intermediate saddle point problem

min_{d≥0} max_{α∈A} 1^t α − (1/2) Σ_k d_k α^t H_k α + (λ/2) (Σ_k d_k^p)^{2/p}   (4)

where A = {α | 0 ≤ α ≤ C1, 1^t Y α = 0}, H_k = Y K_k Y and Y is a diagonal matrix with the labels on the diagonal.
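Concretely, the ingredients of Eq. (4) can be assembled as follows; this is a small sketch with helper names of our own choosing, assuming pre-computed Gram matrices and ±1 labels:

```python
import numpy as np

def make_H(K_list, y):
    """Form H_k = Y K_k Y, where Y = diag(y) holds the +/-1 labels."""
    Y = np.diag(y)
    return [Y @ Kk @ Y for Kk in K_list]

def in_feasible_set(alpha, y, C, tol=1e-9):
    """Check alpha lies in A = {alpha | 0 <= alpha <= C*1, 1^t Y alpha = 0}."""
    return (alpha.min() >= -tol and alpha.max() <= C + tol
            and abs(y @ alpha) <= tol)

def saddle_objective(alpha, d, H, p, lam):
    """Inner objective of the intermediate saddle point problem, Eq. (4)."""
    quad = sum(dk * (alpha @ Hk @ alpha) for dk, Hk in zip(d, H))
    return alpha.sum() - 0.5 * quad + 0.5 * lam * np.sum(d ** p) ** (2.0 / p)
```

For a fixed weight vector d, the inner maximisation over α is exactly a standard SVM dual with kernel Σ_k d_k K_k, which is what lets an SMO-style solver be used at all.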
Note that most MKL methods end up optimising either this, or a very similar, saddle point problem. To now eliminate d we again form the Lagrangian

L = 1^t α − (1/2) Σ_k d_k α^t H_k α + (λ/2) (Σ_k d_k^p)^{2/p} − Σ_k γ_k d_k   (5)

∂L/∂d_k = 0 ⇒ λ (Σ_k d_k^p)^{2/p − 1} d_k^{p−1} = γ_k + (1/2) α^t H_k α   (6)

⇒ λ (Σ_k d_k^p)^{2/p} = Σ_k d_k (γ_k + (1/2) α^t H_k α)   (7)

⇒ L = 1^t α − (λ/2) (Σ_k d_k^p)^{2/p} = 1^t α − (1/(2λ)) (Σ_k (γ_k + (1/2) α^t H_k α)^q)^{2/q}   (8)

where 1/p + 1/q = 1. Since H_k is positive semi-definite, α^t H_k α ≥ 0, and since γ_k ≥ 0 it is clear that the optimal value of γ_k is zero. Our lp-MKL dual therefore becomes

D ≡ max_{α∈A} 1^t α − (1/(8λ)) (Σ_k (α^t H_k α)^q)^{2/q}   (9)

and the kernel weights can be recovered from the dual variables as

d_k = (1/(2λ)) (Σ_k (α^t H_k α)^q)^{1/q − 1/p} (α^t H_k α)^{q/p}   (10)

Note that our dual objective, unlike the objective in [3], is differentiable with respect to α. The SMO algorithm can therefore be brought to bear, where two variables are selected and optimised using gradient or Newton methods and the process repeated until convergence.

Also note that it has sometimes been observed that l2 regularisation can provide better results than l1 [6, 7, 12, 18]. For this special case, when p = q = 2, the reduced two variable problem can be solved analytically. This was one of the primary motivations for choosing the p-norm squared regulariser and placing it in the primal objective (the other was to be consistent with other p-norm formulations [9, 11]). Had we included the regulariser as a primal constraint then the dual would have the q-norm rather than the q-norm squared.
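As a numerical sanity check on the derivation, the dual objective of Eq. (9) and the weight recovery of Eq. (10) can be evaluated directly; a sketch (the function name is ours, and the H_k = Y K_k Y matrices are assumed pre-computed):

```python
import numpy as np

def lp_mkl_dual(alpha, H, p, lam):
    """Evaluate the lp-MKL dual D(alpha) of Eq. (9) and recover the kernel
    weights d_k of Eq. (10). H is a list of the matrices H_k = Y K_k Y."""
    q = p / (p - 1.0)                                 # conjugate exponent: 1/p + 1/q = 1
    a = np.array([alpha @ Hk @ alpha for Hk in H])    # a_k = alpha^t H_k alpha >= 0
    S = np.sum(a ** q)
    D = alpha.sum() - S ** (2.0 / q) / (8.0 * lam)
    d = (S ** (1.0 / q - 1.0 / p)) * (a ** (q / p)) / (2.0 * lam)
    return D, d
```

For p = q = 2 this reduces to D = 1^t α − (1/(8λ)) Σ_k (α^t H_k α)^2 with d_k = α^t H_k α / (2λ), the special case exploited in Section 5.1.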
Our dual would then be nearly identical to Eq. (9) in [12]. However, it would then no longer have been possible to solve the two variable reduced problem analytically for the 2-norm special case.

4 SMO-MKL Optimisation

We now develop the SMO-MKL algorithm for optimising the lp-MKL dual. The algorithm has three main components: (a) reduced variable optimisation; (b) working set selection; and (c) stopping criterion and kernel caching. We build the SMO-MKL algorithm around the LibSVM code base [5].

4.1 The Reduced Variable Optimisation

The SMO algorithm works by repeatedly choosing two variables (assumed to be α_1 and α_2 without loss of generality in this Subsection) and optimising them while holding all other variables constant. If α_1 ← α_1 + Δ and α_2 ← α_2 + sΔ, the dual simplifies to

Δ* = argmax_{L≤Δ≤U} (1 + s)Δ − (1/(8λ)) (Σ_k (a_k Δ^2 + 2 b_k Δ + c_k)^q)^{2/q}   (11)

where s = −y_1 y_2, L = (s == +1) ? max(−α_1, −α_2) : max(−α_1, α_2 − C), U = (s == +1) ? min(C − α_1, C − α_2) : min(C − α_1, α_2), a_k = H_{11k} + H_{22k} + 2sH_{12k}, b_k = α^t(H_{:1k} + sH_{:2k}) and c_k = α^t H_k α. Unlike in standard SMO, Δ* cannot be found analytically for arbitrary p. Nevertheless, since this is a simple one-dimensional concave optimisation problem, we can efficiently find the global optimum using a variety of methods.
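Because the reduced problem of Eq. (11) is a one-dimensional concave maximisation over a box, its global optimum is easy to locate. The sketch below uses bisection on the derivative rather than the Newton-Raphson iteration of the actual implementation, purely for brevity; the derivative expression follows by differentiating Eq. (11):

```python
import numpy as np

def solve_reduced(a, b, c, s, L, U, p, lam, tol=1e-10):
    """Maximise phi(D) = (1+s)D - (1/(8 lam)) (sum_k (a_k D^2 + 2 b_k D + c_k)^q)^(2/q)
    over L <= D <= U. a, b, c are per-kernel coefficient arrays as in Eq. (11)."""
    q = p / (p - 1.0)

    def dphi(delta):
        # u_k(delta) = a_k delta^2 + 2 b_k delta + c_k >= 0 since H_k is PSD.
        u = a * delta ** 2 + 2.0 * b * delta + c
        S = np.sum(u ** q) + 1e-300               # guard against 0 ** (negative power)
        return (1.0 + s) - S ** (2.0 / q - 1.0) * np.sum(
            u ** (q - 1.0) * (a * delta + b)) / (2.0 * lam)

    if dphi(L) <= 0.0:                            # maximiser clipped to the lower bound
        return L
    if dphi(U) >= 0.0:                            # maximiser clipped to the upper bound
        return U
    lo, hi = L, U                                 # derivative changes sign inside: bisect
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if dphi(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With a single kernel, s = +1, a = 1, b = c = 0, λ = 1 and p = 2, the objective is 2Δ − Δ^4/8, whose unconstrained maximiser is Δ = 4^{1/3} ≈ 1.587.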
We tried bisection search and Brent's algorithm, but the Newton-Raphson method worked best, partly because the one-dimensional Hessian was already available from the working set selection step.

4.2 Working Set Selection

The choice of which two variables to select for optimisation can have a big impact on training time. Very simple strategies, such as random sampling, have very little cost per iteration but need many iterations to converge. First- and second-order working set selection techniques are more expensive per iteration but converge in far fewer iterations.

We implement the greedy second-order working set selection strategy of [8]. We do not give the variable selection equations due to lack of space but refer the interested reader to the WSS2 method of [8] and our source code [20]. The critical thing is that the selection of the first (second) variable involves computing the gradient (Hessian) of the dual. These are readily derived to be

∇_α D = 1 − Σ_k d_k H_k α = 1 − Hα   (12)

∇^2_α D = −H − (1/λ) Σ_k ∇_{θ_k} f^{-1}(θ) (H_k α)(H_k α)^t   (13)

where ∇_{θ_k} f^{-1}(θ) = (2 − q) ‖θ‖_q^{2−2q} θ_k^{2q−2} + (q − 1) ‖θ‖_q^{2−q} θ_k^{q−2} and θ_k = (1/(2λ)) α^t H_k α   (14)

and where D has been overloaded to now refer to the dual objective. Rather than compute the gradient ∇_α D repeatedly, we speed up variable selection by caching, separately for each kernel, H_k α. The cache needs to be updated every time we change α in the reduced variable optimisation. However, since only two variables are changed, H_k α can be updated by summing along just two columns of the kernel matrix. This involves only O(M) work in all, where M is the number of kernels, since the column sums can be pre-computed for each kernel. The Hessian is too expensive to cache and is
The Hessian is too expensive to cache and is\nrecomputed on demand.\n\n4.3 Stopping Criterion and Kernel Caching\n\nWe terminate the SMO-MKL algorithm when the duality gap falls below a pre-speci\ufb01ed threshold.\nKernel caching strategies can have a big impact on performance since kernel computations can\ndominate everything else in some cases. While a few different kernel caching techniques have been\nexplored for SVMs, we stick to the standard one used in LibSVM [5]. A Least Recently Used\n(LRU) cache is implemented as a circular queue. Each element in the queue is a pointer to a recently\naccessed (common) row of each of the individual kernel matrices.\n\n4\n\n\f5 Special Cases and Extensions\n\nWe brie\ufb02y discuss a few special cases and extensions which impact our SMO-MKL optimisation.\n\n5.1\n\n2-Norm MKL\n\nAs we noted earlier, 2-norm MKL has sometimes been found to outperform MKL trained with l1\nregularisation [6, 7, 12, 18]. For this special case, when p = q = 2, our dual and reduced variable\noptimisation problems simplify to polynomials of degree four\n\nD2 \u2261 max\n\n\u03b1\u2208A\n\n\u2206\u2217 = argmax\nL\u2264\u2206\u2264U\n\n1\n\n1t\u03b1 \u2212\n\n8\u03bbXk\n(1 + s)\u2206 \u2212\n\n(\u03b1tHk\u03b1)2\n\n1\n\n8\u03bbXk\n\n(ak\u22062 + 2bk\u2206 + ck)2\n\n(15)\n\n(16)\n\nJust as in standard SMO, \u2206\u2217 can now be found analytically by using the expressions for the roots of\na cubic. This makes our SMO-MKL algorithm particularly ef\ufb01cient for p = 2 and our code defaults\nto the analytic solver for this special case.\n\n5.2 The Bregman Divergence as a Regulariser\n\nThe Bregman divergence generalises the squared p-norm. It is not a metric as it is not symmetric and\ndoes not obey the triangle inequality. In this Subsection, we demonstrate that our MKL formulation\ncan also incorporate the Bregman divergence as a regulariser.\nLet F be any differentiable, strictly convex function and f = \u2207F represent its gradient. 
The Bregman divergence generated by F is given by r_F(d) = F(d) − F(d_0) − (d − d_0)^t f(d_0). Note that ∇r_F(d) = f(d) − f(d_0). Incorporating the Bregman divergence as a regulariser in our primal objective leads to the following intermediate saddle point problem and Lagrangian

I_B ≡ min_{d≥0} max_{α∈A} 1^t α − (1/2) Σ_k d_k α^t H_k α + λ r_F(d)   (17)

L_B = 1^t α − Σ_k d_k (γ_k + (1/2) α^t H_k α) + λ r_F(d)   (18)

∇_d L_B = 0 ⇒ f(d) − f(d_0) = g(α, γ)/λ   (19)

⇒ d = f^{-1}(f(d_0) + g(α, γ)/λ) = f^{-1}(θ(α, γ))   (20)

where g is a vector with entries g_k(α, γ) = γ_k + (1/2) α^t H_k α and θ(α, γ) = f(d_0) + g(α, γ)/λ. Substituting back in the Lagrangian and discarding terms dependent on just d_0 results in the dual

D_R ≡ max_{α∈A, γ≥0} 1^t α + λ (F(f^{-1}(θ)) − θ^t f^{-1}(θ))   (21)

In many cases the optimal value of γ will turn out to be zero, and the optimisation can efficiently be carried out over α using our SMO-MKL algorithm.

Generalised KL Divergence. To take a concrete example, different from the p-norm squared used thus far, we investigate the use of the generalised KL divergence as a regulariser. Choosing F(d) = Σ_k d_k (log(d_k) − 1) leads to the generalised KL divergence between d and d_0

r_KL(d) = Σ_k d_k log(d_k/d_{0k}) − Σ_k d_k + Σ_k d_{0k}   (22)

Plugging r_KL into I_B and following the steps above leads to the following dual problem

max_{α∈A} 1^t α − λ Σ_k d_{0k} e^{α^t H_k α / (2λ)}   (23)

which can be optimised straightforwardly using our SMO-MKL algorithm once we plug in the gradient and Hessian information. However, discussing this further would take us too far beyond the scope of this paper. We therefore stay focused on lp-MKL for the remainder of this paper.

5.3 Regression and Other Loss Functions

While we have discussed MKL based classification so far, we can easily adapt our formulation to handle other convex loss functions such as regression, novelty detection, etc. We demonstrate this for the ε-insensitive loss function for regression. The primal, intermediate saddle point and final dual problems are given by

P_R ≡ min_{w,b,ξ±≥0,d≥0} (1/2) Σ_k w_k^t w_k / d_k + C Σ_i (ξ_i^+ + ξ_i^−) + (λ/2) (Σ_k d_k^p)^{2/p}   (24)

such that ±(Σ_k w_k^t φ_k(x_i) + b − y_i) ≤ ε + ξ_i^±   (25)

I_R ≡ min_{d≥0} max_{0≤|α|≤C1, 1^t α=0} 1^t(Y α − ε|α|) − (1/2) Σ_k d_k α^t K_k α + (λ/2) (Σ_k d_k^p)^{2/p}   (26)

D_R ≡ max_{0≤|α|≤C1, 1^t α=0} 1^t(Y α − ε|α|) − (1/(8λ)) (Σ_k (α^t K_k α)^q)^{2/q}   (27)

SMO has a slightly harder time optimising D_R due to the |α| term which, though not itself differentiable, can be handled by substituting α = α^+ − α^− at the cost of doubling the number of dual variables.

6 Experiments

In this Section, we empirically compare the performance of our proposed SMO-MKL algorithm against the specialised lp-MKL solver of [12], which is referred to as Shogun. Code, scripts and parameter settings were helpfully provided by the authors, and we ensure that our stopping criteria are compatible. All experiments are carried out on a single core of an AMD 2380 2.5 GHz processor with 32 GB RAM.
Our focus in these experiments is purely on training time and speed of optimisation, as the prediction accuracy improvements of lp-MKL have already been documented [12].

We carry out two sets of experiments. The first, on small-scale UCI data sets, are carried out using pre-computed kernels. This performs a direct comparison of the algorithmic components of SMO-MKL and Shogun. We also carry out a few large-scale experiments with kernels computed on the fly. This experiment compares the two methods in totality. In this case, kernel caching can have an effect, but not a significant one, as the two methods have very similar caching strategies.

For each UCI data set we generated kernels as recommended in [16]. We generated RBF kernels with ten bandwidths for each individual dimension of the feature vector as well as for the full feature vector itself. Similarly, we also generated polynomial kernels of degrees 1, 2 and 3. All kernel matrices were pre-computed and normalised to have unit trace. We set C = 100 as it gives us a reasonable accuracy on the test set. Note that for some value of λ, SMO-MKL and Shogun will converge to exactly the same solution [12]. Since this value is not known a priori, we arbitrarily set λ = 1.

Training times on the UCI data sets are presented in Table 1. Means and standard deviations are reported for five-fold cross-validation. As can be seen, SMO-MKL is significantly faster than Shogun at converging to similar solutions and obtaining similar test accuracies. In many cases, SMO-MKL is more than four times as fast, and in some cases more than ten or twenty times as fast. Note that our test classification accuracy on Liver is a lot lower than Shogun's. This is due to the arbitrary choice of λ.
We can vary our λ on Liver to recover the same accuracy and solution as Shogun with a further decrease in our training time.

Another very positive finding is that SMO-MKL appears to be relatively stable across a large operating range of p. The code is, in most cases as expected, fastest when p = 2 and gets slower as one increases or decreases p. Interestingly though, the algorithm doesn't appear to be significantly slower for other values of p. Therefore, it is hoped that SMO-MKL can be used to learn sparse kernel combinations as well as non-sparse ones.

Moving on to the large-scale experiments with kernels computed on the fly, we first tried combining a hundred thousand RBF kernels on the Sonar data set with 208 points and 59 dimensional features.

Table 1: Training times on UCI data sets with N training points, D dimensional features, M kernels and T test points. Means and standard deviations are reported for 5-fold cross-validation.

(a) Australian: N=552, T=138, D=13, M=195.

p | Training Time (s): SMO-MKL, Shogun | Test Accuracy (%): SMO-MKL, Shogun | # Kernels Selected: SMO-MKL, Shogun
1.10 | 4.89 ± 0.31, 58.52 ± 16.49 | 85.22 ± 2.96, 85.22 ± 2.81 | 26.4 ± 0.8, 137.2 ± 53.8
1.33 | 4.16 ± 0.16, 33.58 ± 2.58 | 85.36 ± 3.79, 85.07 ± 2.85 | 40.8 ± 1.3, 62.4 ± 4.7
1.66 | 4.31 ± 0.19, 31.89 ± 1.25 | 85.65 ± 3.73, 85.07 ± 2.85 | 72.2 ± 4.8, 100.2 ± 3.7
2.00 | 4.27 ± 0.10, 27.08 ± 7.18 | 85.80 ± 3.74, 85.22 ± 2.99 | 126.4 ± 4.3, 134.4 ± 5.6
2.33 | 4.88 ± 0.18, 24.92 ± 6.46 | 85.80 ± 3.74, 85.07 ± 2.85 | 162.8 ± 3.6, 177.8 ± 8.3
2.66 | 5.19 ± 0.05, 26.90 ± 2.05 | 85.80 ± 3.68, 85.22 ± 2.85 | 188.2 ± 4.7, 188.8 ± 5.1
3.00 | 5.48 ± 0.21, 27.06 ± 2.20 | 85.51 ± 3.69, 85.22 ± 2.85 | 192.0 ± 2.6, 194.4 ± 1.2

(b) Ionosphere: N=280, T=71, D=33, M=442. (Columns as in (a).)

1.10 | 2.85 ± 0.16, 19.82 ± 4.02 | 92.60 ± 1.35, 92.03 ± 1.68 | 50.0 ± 2.7, 125.2 ± 7.3
1.33 | 2.78 ± 1.18, 8.49 ± 0.61 | 92.03 ± 1.42, 92.60 ± 1.86 | 120.8 ± 6.0, 217.0 ± 23.4
1.66 | 2.42 ± 0.28, 10.49 ± 2.27 | 91.74 ± 2.08, 91.74 ± 1.37 | 200.8 ± 4.4, 291.4 ± 33.0
2.00 | 2.16 ± 0.16, 13.99 ± 4.68 | 92.03 ± 1.68, 91.17 ± 2.45 | 328.0 ± 6.6, 364.2 ± 15.4
2.33 | 2.35 ± 0.25, 24.90 ± 9.43 | 92.03 ± 1.68, 91.74 ± 2.08 | 413.6 ± 5.6, 412.2 ± 6.6
2.66 | 2.50 ± 0.32, 33.05 ± 3.66 | 92.03 ± 1.68, 92.03 ± 1.68 | 430.6 ± 4.6, 436.6 ± 4.3
3.00 | 3.03 ± 0.99, 36.23 ± 3.62 | 92.31 ± 1.41, 91.75 ± 2.05 | 434.4 ± 4.8, 442.0 ± 0.0

(c) Liver: N=276, T=69, D=5, M=91. (Columns as in (a).)

1.10 | 0.53 ± 0.03, 2.15 ± 0.12 | 62.90 ± 9.81, 66.67 ± 9.91 | 9.40 ± 1.02, 39.40 ± 1.50
1.33 | 0.54 ± 0.03, 0.92 ± 0.05 | 66.09 ± 8.48, 71.59 ± 8.92 | 24.40 ± 2.06, 43.60 ± 2.42
1.66 | 0.56 ± 0.04, 1.14 ± 0.23 | 66.96 ± 7.53, 70.72 ± 9.28 | 44.20 ± 2.23, 57.00 ± 3.29
2.00 | 0.54 ± 0.04, 1.72 ± 0.57 | 66.96 ± 7.06, 72.17 ± 6.94 | 71.00 ± 5.29, 78.00 ± 2.28
2.33 | 0.63 ± 0.03, 2.35 ± 0.36 | 66.38 ± 7.36, 73.33 ± 6.71 | 82.40 ± 2.42, 88.20 ± 1.72
2.66 | 0.65 ± 0.02, 2.53 ± 0.44 | 65.22 ± 6.80, 72.75 ± 7.96 | 83.20 ± 2.32, 90.80 ± 0.40
3.00 | 0.67 ± 0.03, 3.40 ± 0.55 | 65.22 ± 6.74, 73.91 ± 7.28 | 85.20 ± 3.37, 91.00 ± 0.00

(d) Sonar: N=166, T=42, D=59, M=793. (Columns as in (a).)

1.10 | 4.95 ± 0.29, 47.19 ± 3.85 | 85.15 ± 7.99, 81.25 ± 8.71 | 91.2 ± 6.9, 258.0 ± 24.8
1.33 | 4.00 ± 0.76, 18.28 ± 1.63 | 84.65 ± 9.37, 87.03 ± 6.85 | 247.8 ± 7.7, 374.2 ± 20.9
1.66 | 4.48 ± 1.63, 20.27 ± 8.84 | 88.47 ± 6.68, 87.51 ± 6.28 | 383.0 ± 5.7, 451.6 ± 12.0
2.00 | 3.31 ± 0.31, 31.52 ± 5.07 | 88.94 ± 6.00, 88.95 ± 6.33 | 661.2 ± 10.2, 664.8 ± 35.2
2.33 | 3.54 ± 0.35, 51.83 ± 17.96 | 88.94 ± 4.97, 88.94 ± 5.41 | 770.8 ± 4.4, 763.0 ± 7.0
2.66 | 3.83 ± 0.38, 64.59 ± 9.19 | 88.94 ± 4.97, 88.94 ± 4.97 | 782.0 ± 3.4, 789.4 ± 2.8
3.00 | 3.96 ± 0.45, 70.08 ± 9.18 | 88.94 ± 4.97, 89.92 ± 5.13 | 786.0 ± 4.1, 792.2 ± 1.1

Note that these kernels do not form any special hierarchy, so approaches such as [2] are not applicable. Timing results on a log-log scale are given in Figure (1a). As can be seen, SMO-MKL appears to be scaling linearly with the number of kernels, and we converge in less than half an hour on all hundred thousand kernels for both p = 2 and p = 1.33. If we were to run the same experiment using pre-computed kernels, then we converge in approximately seven minutes (see Fig. (1b)). On the other hand, Shogun took six hundred seconds to combine just ten thousand kernels computed on the fly.

The trend was the same when we increased the number of training points. Figures (1c) and (1d) plot timing results on a log-log scale as the number of training points is varied on the Adult and Web data sets (please see [1] for data set details and downloads).
We used 50 kernels computed on the fly for these experiments.

Figure 1: Large scale experiments varying the number of kernels and points. See text for details. [Each panel plots log(Time (s)) on the vertical axis: (a) Sonar and (b) Sonar with pre-computed kernels against log(# Kernels), with SMO-MKL curves for p = 1.33 and p = 2.00; (c) Adult against log(# Training Points), with SMO-MKL and Shogun curves for p = 1.33 and p = 2.00; (d) Web against log(# Training Points), with SMO-MKL curves for p = 1.33 and p = 2.00.]

On Adult, until about six thousand points, SMO-MKL is roughly 1.5 times faster than Shogun for p = 1.33 and 5 times faster for p = 2. However, on reaching eleven thousand points, Shogun starts taking more and more time to converge, and we could not get results for sixteen thousand points or more. SMO-MKL was unaffected and converged on the full data set with 32,561 points in 9245.80 seconds for p = 1.33 and 8511.12 seconds for p = 2. We tried the Web data set to see whether the SMO-MKL algorithm would scale beyond 32K points. Training on all 49,749 points and 50 kernels took 1574.73 seconds (i.e. less than half an hour) with p = 1.33 and 2023.35 seconds with p = 2.

7 Conclusions

We developed the SMO-MKL algorithm for efficiently optimising the lp-MKL formulation.
We placed the emphasis firmly back on optimising the MKL dual rather than the intermediate saddle point problem on which all state-of-the-art MKL solvers are based. We showed that the lp-MKL dual is differentiable and that placing the p-norm squared regulariser in the primal objective lets us analytically solve the reduced variable problem for p = 2. We could also solve the convex, one-dimensional reduced variable problem when p ≠ 2 by the Newton-Raphson method. A second-order working set selection algorithm was implemented to speed up convergence. The resulting algorithm is simple, easy to implement and efficiently scales to large problems. We also showed how to generalise the algorithm to handle not just p-norms squared but also certain Bregman divergences.

In terms of empirical performance, we compared the SMO-MKL algorithm to the specialised lp-MKL solver of [12], referred to as Shogun. SMO-MKL was demonstrated to be significantly faster than Shogun on both small and large scale data sets, sometimes by an order of magnitude. SMO-MKL was also found to be relatively stable across various values of p and can therefore be used to learn both sparse and non-sparse kernel combinations. We demonstrated that the algorithm could combine a hundred thousand kernels on Sonar in approximately seven minutes using pre-computed kernels and in less than half an hour using kernels computed on the fly. This is significant as many non-linear kernel combination forms, which lead to performance improvements but are non-convex, can be recast as convex linear MKL with a much larger set of base kernels. The SMO-MKL algorithm can now be used to tackle such problems as long as an appropriate regulariser can be found. We were also able to train on the entire Web data set, with nearly fifty thousand points and fifty kernels computed on the fly, in less than half an hour.
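To make the one-dimensional Newton-Raphson solve concrete, here is a minimal sketch. The paper's actual reduced objective involves the combined kernel terms, so we substitute a simple stand-in convex function with a known minimiser; `newton_1d` and the stand-in objective are our own illustrative constructions, not the paper's exact update.

```python
def newton_1d(grad, hess, x0, tol=1e-10, max_iter=50):
    """Minimise a smooth one-dimensional convex function by Newton-Raphson.

    grad: first derivative of the objective.
    hess: second derivative (strictly positive for a strictly convex objective).
    """
    x = x0
    for _ in range(max_iter):
        step = grad(x) / hess(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Stand-in reduced problem: minimise f(d) = |d|^p / p - c*d over d, whose
# minimiser for c > 0 is d* = c^(1/(p-1)) -- purely illustrative.
p, c = 1.33, 2.0
d_star = newton_1d(
    lambda d: (1.0 if d >= 0 else -1.0) * abs(d) ** (p - 1) - c,  # f'(d)
    lambda d: (p - 1) * abs(d) ** (p - 2),                        # f''(d)
    x0=1.0,
)
```

For p = 2 the same stationarity condition is linear in d and is solved in closed form, which matches the distinction the conclusions draw between the p = 2 and p ≠ 2 cases.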
Other solvers were not able to return results on these problems. All experiments were carried out on a single core and therefore, we believe, redefine the state-of-the-art in terms of MKL optimisation. The SMO-MKL code is available for download from [20].

Acknowledgements

We are grateful to Saurabh Gupta, Marius Kloft and Sören Sonnenburg for helpful discussions, feedback and help with Shogun.

References

[1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html.
[2] F. R. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In NIPS, pages 105-112, 2008.
[3] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, pages 6-13, 2004.
[4] A. Ben-Tal, T. Margalit, and A. Nemirovski. The ordered subsets mirror descent optimization method with applications to tomography. SIAM Journal on Optimization, 12(1):79-108, 2001.
[5] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[6] C. Cortes, M. Mohri, and A. Rostamizadeh. L2 regularization for learning kernels. In UAI, 2009.
[7] C. Cortes, M. Mohri, and A. Rostamizadeh. Learning non-linear combinations of kernels. In NIPS, 2009.
[8] R. E. Fan, P. H. Chen, and C. J. Lin. Working set selection using second order information for training SVM. JMLR, 6:1889-1918, 2005.
[9] C. Gentile. Robustness of the p-norm algorithms. ML, 53(3):265-299, 2003.
[10] M. Gonen and E. Alpaydin. Localized multiple kernel learning. In ICML, 2008.
[11] J. Kivinen, M. K. Warmuth, and B. Hassibi. The p-norm generalization of the LMS algorithm for adaptive filtering. IEEE Trans. Signal Processing, 54(5):1782-1793, 2006.
[12] M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K.-R. Müller, and A. Zien.
Efficient and accurate lp-norm Multiple Kernel Learning. In NIPS, 2009.
[13] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. JMLR, 5:27-72, 2004.
[14] C. J. Lin, S. Lucidi, L. Palagi, A. Risi, and M. Sciandrone. Decomposition algorithm model for singly linearly-constrained problems subject to lower and upper bounds. JOTA, 141(1):107-126, 2009.
[15] J. Platt. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods - Support Vector Learning, pages 185-208, 1999.
[16] A. Rakotomamonjy, F. Bach, Y. Grandvalet, and S. Canu. SimpleMKL. JMLR, 9:2491-2521, 2008.
[17] S. Sonnenburg, G. Raetsch, C. Schaefer, and B. Schoelkopf. Large scale multiple kernel learning. JMLR, 7:1531-1565, 2006.
[18] M. Varma and B. R. Babu. More generality in efficient multiple kernel learning. In ICML, 2009.
[19] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object detection. In ICCV, 2009.
[20] S. V. N. Vishwanathan, Z. Sun, N. Theera-Ampornpunt, and M. Varma, 2010. The SMO-MKL code: http://research.microsoft.com/~manik/code/SMO-MKL/download.html.
[21] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao. Group-sensitive multiple kernel learning for object categorization. In ICCV, 2009.