{"title": "Multi-Class Learning: From Theory to Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 1586, "page_last": 1595, "abstract": "In this paper, we study the generalization performance of multi-class classification and obtain a sharper data-dependent generalization error bound with fast convergence rate, substantially improving the state-of-the-art bounds in the existing data-dependent generalization analysis. The theoretical analysis motivates us to devise two effective multi-class kernel learning algorithms with statistical guarantees. Experimental results show that our proposed methods can significantly outperform the existing multi-class classification methods.", "full_text": "Multi-Class Learning: From Theory to Algorithm\n\nJian Li1,2, Yong Liu1\u2217, Rong Yin1,2, Hua Zhang1, Lizhong Ding5, Weiping Wang1,3,4\n\n1Institute of Information Engineering, Chinese Academy of Sciences\n2School of Cyber Security, University of Chinese Academy of Sciences\n3National Engineering Research Center for Information Security\n4National Engineering Laboratory for Information Security Technology\n5Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, UAE\n\n{lijian9026,liuyong,yinrong,wangweiping}@iie.ac.cn\nlizhong.ding@inceptioniai.org\n\nAbstract\n\nIn this paper, we study the generalization performance of multi-class classification and obtain a sharper data-dependent generalization error bound with fast convergence rate, substantially improving the state-of-the-art bounds in the existing data-dependent generalization analysis. 
The theoretical analysis motivates us to devise two effective multi-class kernel learning algorithms with statistical guarantees. Experimental results show that our proposed methods can significantly outperform the existing multi-class classification methods.\n\n1 Introduction\n\nMulti-class classification is an important problem in various applications, such as natural language processing, information retrieval, computer vision, web advertising, etc. The statistical learning theory of binary classification is by now relatively well developed [19, 20, 21, 23, 27, 34], but there are still numerous statistical challenges to its multi-class extensions [25].\nTo understand the existing multi-class classification algorithms and guide the development of new ones, people have investigated the generalization ability of multi-class classification algorithms. In recent years, some generalization bounds have been proposed to estimate the ability of multi-class classification algorithms based on different measures, such as VC-dimension [1], Natarajan dimension [7], covering number [9, 11, 37], Rademacher complexity [5, 14, 27], stability [10], PAC-Bayesian [26], etc. Although there have been several recent advances in the study of generalization bounds of multi-class classification algorithms, convergence rates of the existing generalization bounds are usually $O(K^2/\\sqrt{n})$, where $K$ and $n$ are the number of classes and the size of the sample, respectively.\nIn this paper, we derive a novel data-dependent generalization bound for multi-class classification via the notion of local Rademacher complexity and further devise two effective multi-class kernel learning algorithms based on the above theoretical analysis. The rate of this bound is $O((\\log K)^{2+1/\\log K}/n)$, which substantially improves on the existing data-dependent generalization bounds. 
Moreover, the proposed multi-class kernel learning algorithms have statistical guarantees and fast convergence rates. Experimental results on many benchmark datasets show that our proposed methods can significantly outperform the existing multi-class classification methods. The major contributions of this paper include: 1) A new local Rademacher complexity based bound with fast convergence rate for multi-class classification is established. Existing works [16, 27] on multi-class classifiers with Rademacher complexity do not take into account couplings among different classes. To obtain a sharper bound, we introduce a new structural complexity result on function classes induced by general classes via the maximum operator, while preserving the correlations among different components. Thus, our result in this paper is a non-trivial extension of the local Rademacher complexity analysis from binary classification to multi-class classification; 2) Two novel multi-class classification algorithms are proposed with statistical guarantees: a) Conv-MKL. Using precomputed kernel matrices regularized by local Rademacher complexity, this method can be implemented by any $\\ell_p$-norm multi-class MKL solver; b) SMSD-MKL. This method puts local Rademacher complexity in penalized ERM with an $\\ell_{2,p}$-norm regularizer, implemented by stochastic sub-gradient descent with updating dual weights.\n\n\u2217Corresponding author\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n2 Related Work\n\n2.1 Multi-Class Classification Bounds\n\nRademacher Complexity Bounds. Koltchinskii and Panchenko [14] and Koltchinskii, Panchenko, and Lozano [15] first introduced a margin-based bound for multi-class classification in terms of Rademacher complexity. This bound was slightly improved in [27, 5]. 
Maximov and Reshetova [25] gave a new Rademacher complexity based bound that is linear in the number of classes. Based on the $\\ell_p$-norm regularization, Lei, Binder, and Kloft [18] introduced a bound with a logarithmic dependence on the number of classes. Instead of global Rademacher complexity, in this paper, we use local Rademacher complexity to obtain a sharper bound, which substantially improves generalization performance over existing global Rademacher complexity methods.\nVC-dimension Bounds. Allwein, Schapire, and Singer [1] used the notion of VC-dimension for multi-class learning problems, and derived a VC-dimension based bound. Natarajan dimension was introduced in [28] in order to characterize multi-class PAC learnability, which exactly matches the notion of Vapnik-Chervonenkis dimension in the case of binary classification. Daniely and Shalev-Shwartz [7] derived a risk bound with Natarajan dimension for multi-class classification. VC dimension and Natarajan dimension are important tools to derive generalization bounds; however, these bounds are usually dimension dependent, which makes them hardly applicable to practical large-scale problems (such as typical computer vision problems).\nCovering Number Bounds. Based on the $\\ell_\\infty$-norm covering number bound of linear operators, Guermeur [9] obtained a generalization bound exhibiting a linear dependence on the class size, which was improved by [37] to a square-root dependence. Hill and Doucet [11] derived a class-size independent risk guarantee. However, their bound is based on a delicate definition of margin, which is not commonly used in mainstream multi-class literature.\nStability Bounds and PAC-Bayesian Bounds. 
Stability [10] and PAC-Bayesian [26] are two popular tools to analyze generalization performance on neural networks for the multi-class setting. Hardt, Recht and Singer [10] generated generalization bounds for models learned with stochastic gradient descent. McAllester [26] proposed a dropout bound for neural networks with PAC-Bayesian. However, the convergence rate based on stability and PAC-Bayesian is usually at most $O(1/\\sqrt{n})$.\n\n2.2 Local Rademacher Complexity\n\nIn recent years, several authors have applied local Rademacher complexity to obtain better generalization error bounds for traditional binary classification [2, 13, 22, 24]; similar analysis has been explored in multi-label learning [35] and multi-task learning [36] as well. However, numerous statistical challenges remain in the multi-class case, and it is still unclear how to use this tool to derive a tighter bound for multi-class. In this paper, we bridge this gap by deriving a sharper generalization bound using local Rademacher complexity.\n\n2.3 Multi-Class Kernel Learning Algorithms\n\nAs one of the success stories in multiple kernel learning, improvements in multi-class MKL have emerged [38], in which a one-stage multi-class MKL algorithm was presented as a generalization of multi-class loss function [6, 33]. Orabona designed stochastic gradient methods, named OBSCURE [30] and UFO-MKL [29], which optimize primal versions of equivalent problems. In this paper, we consider the use of the local Rademacher complexity to devise novel multi-class classification algorithms, which have statistical guarantees and fast convergence rates.\n\n3 Notations and Preliminaries\n\nWe consider multi-class classification problems with $K \\ge 2$ classes in this paper. Let $\\mathcal{X}$ be the input space and $\\mathcal{Y} = \\{1, 2, \\ldots, K\\}$ the output space. Assume that we are given a sample $S = \\{z_1 = (x_1, y_1), \\ldots, z_n = (x_n, y_n)\\}$ of size $n$ drawn i.i.d. 
from a fixed but unknown probability distribution $\\mu$ on $\\mathcal{Z} = \\mathcal{X} \\times \\mathcal{Y}$. Based on the training examples $S$, we wish to learn a scoring rule $h$ from a space $\\mathcal{H}$ mapping from $\\mathcal{Z}$ to $\\mathbb{R}$ and use the mapping $x \\to \\arg\\max_{y \\in \\mathcal{Y}} h(x, y)$ to predict. For any hypothesis $h \\in \\mathcal{H}$, the margin of a labeled example $z = (x, y)$ is defined as\n\n$$\\rho_h(z) := h(x, y) - \\max_{y' \\neq y} h(x, y').$$\n\nThe hypothesis $h$ misclassifies the labeled example $z = (x, y)$ if $\\rho_h(z) \\le 0$, and thus the expected risk incurred from using $h$ for prediction is $L(h) := \\mathbb{E}_\\mu[1_{\\rho_h(z) \\le 0}]$, where $1_{t \\le 0}$ is the 0-1 loss: $1_{t \\le 0} = 1$ if $t \\le 0$, otherwise 0. Since the 0-1 loss is hard to handle in learning machines, one usually considers a proxy loss, such as the square hinge loss $\\ell(t) = (1 - t)_+^2$ and the square margin loss $\\ell_s(t) = \\big(1_{t \\le 0} + (1 - t s^{-1}) 1_{0 < t \\le s}\\big)^2$, $s > 0$. In the following, we assume that: 1) $\\ell(t)$ bounds the 0-1 loss: $1_{t \\le 0} \\le \\ell(t)$; 2) $\\ell$ is decreasing and it has a zero point $c_\\ell$, i.e., $\\ell(c_\\ell) = 0$; 3) $\\ell$ is $\\zeta$-smooth, that is $|\\ell'(t) - \\ell'(s)| \\le \\zeta |t - s|$. Note that both the square hinge loss and the square margin loss satisfy the above assumptions.\nAny function $h : \\mathcal{X} \\times \\mathcal{Y} \\to \\mathbb{R}$ can be equivalently represented by the vector-valued function $(h_1, \\ldots, h_K)$ with $h_j(x) = h(x, j)$, $\\forall j = 1, \\ldots, K$. Let $\\kappa : \\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}$ be a Mercer kernel with $\\phi$ being the associated feature map, i.e., $\\kappa(x, x') = \\langle \\phi(x), \\phi(x') \\rangle$. The $\\ell_p$-norm hypothesis space associated with the kernel $\\kappa$ is denoted by:\n\n$$\\mathcal{H}_{p,\\kappa} = \\big\\{ h_w = (\\langle w_1, \\phi(x) \\rangle, \\ldots, \\langle w_K, \\phi(x) \\rangle) : \\|w\\|_{2,p} \\le 1,\\ 1 \\le p \\le 2 \\big\\}, \\quad (1)$$\n\nwhere $w = (w_1, \\ldots, w_K)$ and $\\|w\\|_{2,p} = \\big[\\sum_{i=1}^K \\|w_i\\|_2^p\\big]^{1/p}$ is the $\\ell_{2,p}$-norm. For any $p \\ge 1$, let $q$ be the dual exponent of $p$ satisfying $1/p + 1/q = 1$.\nThe space of loss functions associated with $\\mathcal{H}_{p,\\kappa}$ is denoted by\n\n$$\\mathcal{L} = \\{ \\ell_h := \\ell(\\rho_h(z)) : h \\in \\mathcal{H}_{p,\\kappa} \\}. \\quad (2)$$\n\nLet $L(\\ell_h)$ and $\\hat{L}(\\ell_h)$ be the expected generalization error and the empirical error with respect to $\\ell_h$:\n\n$$L(\\ell_h) := \\mathbb{E}_\\mu[\\ell(\\rho_h(z))] \\quad \\text{and} \\quad \\hat{L}(\\ell_h) = \\frac{1}{n} \\sum_{i=1}^n \\ell(\\rho_h(z_i)).$$\n\nDefinition 1 (Rademacher complexity). Assume $\\mathcal{L}$ is a space of loss functions as defined in Equation (2). Then the empirical Rademacher complexity of $\\mathcal{L}$ is:\n\n$$\\hat{\\mathcal{R}}(\\mathcal{L}) := \\mathbb{E}_\\sigma \\Big[ \\frac{1}{n} \\sup_{\\ell_h \\in \\mathcal{L}} \\sum_{i=1}^n \\sigma_i \\ell_h(z_i) \\Big],$$\n\nwhere $\\sigma_1, \\sigma_2, \\ldots, \\sigma_n$ is an i.i.d. family of Rademacher variables taking values -1 and 1 with equal probability, independent of the sample $S = (z_1, \\ldots, z_n)$. The Rademacher complexity of $\\mathcal{L}$ is $\\mathcal{R}(\\mathcal{L}) = \\mathbb{E}_\\mu \\hat{\\mathcal{R}}(\\mathcal{L})$.\nGeneralization bounds based on the notion of Rademacher complexity for multi-class classification are standard [14, 15, 27]: with probability $1 - \\delta$, $L(h) \\le \\inf_{0 < \\gamma < 1} \\big( \\hat{L}(h_\\gamma) + O\\big( \\mathcal{R}(\\mathcal{L})/\\gamma + \\log(1/\\delta)/\\sqrt{n} \\big) \\big)$, where $\\hat{L}(h_\\gamma) = \\frac{1}{n} \\sum_{i=1}^n 1_{\\rho_h(z_i) \\le \\gamma}$. Since $\\mathcal{R}(\\mathcal{L})$ is of order $O(K^2/\\sqrt{n})$ for various kernel multi-class methods in practice, the standard Rademacher complexity bounds usually converge at rate $O(K^2/\\sqrt{n})$.\nAlthough Rademacher complexity is widely used in generalization analysis, it does not take into consideration the fact that, typically, the hypotheses selected by a learning algorithm have a better performance than in the worst case and belong to a more favorable sub-family of the set of all hypotheses [4]. Therefore, to derive a sharper generalization bound, we consider the use of the local Rademacher complexity in this paper.\n\nDefinition 2 (Local Rademacher Complexity). For any $r > 0$, the local Rademacher complexity of $\\mathcal{L}$ is defined as\n\n$$\\mathcal{R}(\\mathcal{L}_r) := \\mathcal{R}\\big\\{ a \\ell_h \\,\\big|\\, a \\in [0, 1],\\ \\ell_h \\in \\mathcal{L},\\ L[(a \\ell_h)^2] \\le r \\big\\},$$\n\nwhere $L(\\ell_h^2) = \\mathbb{E}_\\mu[\\ell^2(\\rho_h(z))]$.\nThe key idea to obtain a sharper generalization error bound is to choose a much smaller class $\\mathcal{L}_r \\subseteq \\mathcal{L}$ with as small a variance as possible, while requiring that the solution is still in $\\{ h \\mid h \\in \\mathcal{H}_{p,\\kappa}, \\ell_h \\in \\mathcal{L}_r \\}$.\nIn the following, we assume that $\\vartheta = \\sup_{x \\in \\mathcal{X}} \\kappa(x, x) < \\infty$, and that $\\ell_h : \\mathcal{Z} \\to [0, d]$, where $d > 0$ is a constant. These two assumptions are common restrictions on kernel functions and loss functions, which are satisfied by the popular Gaussian kernels and bounded hypotheses, respectively.\n\n4 Sharper Generalization Bounds\n\nIn this section, we first estimate the local Rademacher complexity, and further derive a sharper generalization bound.\n\n4.1 Local Rademacher Complexity\n\nThe estimate of the local Rademacher complexity of multi-class classification is given as follows.\nTheorem 1. 
With probability at least $1 - \\delta$,\n\n$$\\mathcal{R}(\\mathcal{L}_r) \\le \\frac{c_{d,\\vartheta}\\, \\xi(K) \\sqrt{\\zeta r}\\, \\log^{3/2}(n)}{\\sqrt{n}} + \\frac{4 \\log(1/\\delta)}{n},$$\n\nwhere\n\n$$\\xi(K) = \\begin{cases} \\sqrt{e}\\,(4 \\log K)^{1 + \\frac{1}{2 \\log K}}, & \\text{if } q \\ge 2 \\log K, \\\\ (2q)^{1 + \\frac{1}{q}} K^{\\frac{1}{q}}, & \\text{otherwise,} \\end{cases}$$\n\nand $c_{d,\\vartheta}$ is a constant depending on $d$ and $\\vartheta$.\n\nNote that the order of the (global) Rademacher complexity over $\\mathcal{L}$ is usually $O(K^2/\\sqrt{n})$ for various kernel multi-class methods. From Theorem 1, one can see that the order of the local Rademacher complexity is $\\mathcal{R}(\\mathcal{L}_r) = O(\\sqrt{r}\\, \\xi(K)/\\sqrt{n} + 1/n)$. Note that $\\xi(K)$ has a logarithmic dependence on $K$ when $q \\ge 2 \\log K$. For $2 \\le q < 2 \\log K$, $\\xi(K) = O(K^{2/q})$, which is also substantially milder than the quadratic dependence for Rademacher complexity. If we choose a suitable value of $r$, the order can even reach $O((\\log K)^{2 + 1/\\log K}/n)$ (see the next subsection), which substantially improves on the Rademacher complexity bounds.\n\n4.2 A Sharper Generalization Bound\n\nA sharper bound for multi-class classification based on the notion of local Rademacher complexity is derived as follows.\nTheorem 2. $\\forall h \\in \\mathcal{H}_{p,\\kappa}$ and $\\forall k > \\max(1, \\frac{\\sqrt{2}}{2d})$, with probability at least $1 - \\delta$, we have\n\n$$L(h) \\le \\max\\Big\\{ \\frac{k}{k - 1} \\hat{L}(\\ell_h),\\ \\hat{L}(\\ell_h) + \\frac{c_{d,\\vartheta,\\zeta,k}\\, \\xi^2(K) \\log^3 n}{n} + \\frac{c_\\delta}{n} \\Big\\},$$\n\nwhere\n\n$$\\xi(K) = \\begin{cases} \\sqrt{e}\\,(4 \\log K)^{1 + \\frac{1}{2 \\log K}}, & \\text{if } q \\ge 2 \\log K, \\\\ (2q)^{1 + \\frac{1}{q}} K^{\\frac{1}{q}}, & \\text{otherwise,} \\end{cases}$$\n\n$c_{d,\\vartheta,\\zeta,k}$ is a constant depending on $d, \\vartheta, \\zeta, k$, and $c_\\delta$ is a constant depending on $\\delta$.\n\nThe order of the generalization bound in Theorem 2 is $O(\\xi^2(K)/n)$. 
From the definition of $\\xi(K)$, we can obtain that\n\n$$O\\Big(\\frac{\\xi^2(K)}{n}\\Big) = \\begin{cases} O\\big((\\log K)^{2 + 1/\\log K}/n\\big), & \\text{if } q \\ge 2 \\log K, \\\\ O\\big(K^{2/q}/n\\big), & \\text{if } 2 \\le q < 2 \\log K. \\end{cases}$$\n\nNote that our bound has a linear dependence on the reciprocal of the sample size $n$, while the existing data-dependent bounds all have a square-root dependence. Furthermore, our bound enjoys a mild dependence on the number of classes. The dependence is polynomial with degree $2/q$ for $2 \\le q < 2 \\log K$ and becomes logarithmic if $q \\ge 2 \\log K$, which is substantially milder than the quadratic dependence established in [14, 15, 27, 5].\n\n4.3 Comparison with the Related Work\n\nRademacher Complexity Bounds. Koltchinskii and Panchenko [14] and Koltchinskii, Panchenko, and Lozano [15] introduce a margin-based bound for multi-class classification in terms of Rademacher complexities: $L(h) \\le \\inf_{0 < \\gamma < 1} \\hat{L}(h_\\gamma) + O\\big(\\frac{K^2}{\\gamma \\sqrt{n}} + \\frac{\\log(1/\\delta)}{\\sqrt{n}}\\big)$. The order is $O(K^2/\\sqrt{n})$, which is slightly improved (by a constant factor prior to the Rademacher complexity term) by [27, 5]. Maximov and Reshetova [25] give a new Rademacher complexity bound: $L(h) \\le \\inf_{0 < \\gamma < 1} \\hat{L}(h_\\gamma) + O\\big(K/(\\gamma \\sqrt{n}) + \\log(1/\\delta)/\\sqrt{n}\\big)$, which has the form of $O(K/\\sqrt{n})$. Based on the $\\ell_p$-norm regularization, Lei, Binder, and Kloft [18] derive a new bound: $L(h) \\le \\hat{L}(\\ell_h) + O(\\log^2 K/\\sqrt{n})$. The existing bounds based on Rademacher complexity all have a square-root dependence on the reciprocal of the sample size.\nIn this paper, we derive a sharper bound based on the local Rademacher complexity with order $O((\\log K)^{2 + 1/\\log K}/n)$, substantially sharper than the existing bounds of Rademacher complexity.\nCovering Number Bounds. Based on the $\\ell_\\infty$-norm covering number bound of linear operators, Guermeur [9] obtains a generalization bound of form $O(K/\\sqrt{n})$, which is improved by [37] to a square-root dependence: $L(h) \\le \\hat{L}(\\ell_h) + O(\\sqrt{K/n})$. Hill and Doucet [11] derive a class-size independent risk guarantee of form $O(\\sqrt{1/n})$. However, their bound is based on a delicate definition of margin, which is not commonly used in mainstream multi-class literature.\nVC-dimension Bounds. VC-dimension is an important tool to derive the generalization bound for binary classification. Allwein, Schapire, and Singer [1] show how to use it for multi-class learning problems, and derive a VC-dimension based bound: $L(h) \\le \\hat{L}(h_\\gamma) + O(\\sqrt{V} \\log K/\\sqrt{n})$, where $V$ is the VC-dimension. Natarajan dimension is introduced in [28] in order to characterize multi-class PAC learnability. Daniely and Shalev-Shwartz [7] derive a generalization bound with Natarajan dimension: $L(h) \\le \\hat{L}(h_\\gamma) + O(d_{Nat}/n)$, where $d_{Nat}$ is the Natarajan dimension. Note that VC-dimension bounds, as well as Natarajan dimension bounds, are usually dimension dependent, which makes them hardly applicable to practical large-scale problems (such as typical computer vision problems).\nStability and PAC-Bayesian Bounds. Stability [10] and PAC-Bayesian [26] are two useful tools to analyze generalization performance on neural networks for a multi-class setting. Hardt, Recht and Singer [10] generated generalization bounds for models learned with stochastic gradient descent using stability: $L(h) \\le \\hat{L}(h_\\gamma) + O(1/\\sqrt{n})$. McAllester [26] used the PAC-Bayesian theory to derive a generalization bound: $L(h) \\le \\hat{L}(h_\\gamma) + O\\big(\\sqrt{\\hat{L}(h_\\gamma)/n}\\big)$.\n\n5 Multi-Class Multiple Kernel Learning\n\nMotivated by the above analysis of the generalization bound, we will exploit the properties of the local Rademacher complexity to devise two algorithms for multi-class multiple kernel learning (MC-MKL).\nIn this paper, we consider the use of multiple kernels, $\\kappa_\\mu = \\sum_{m=1}^M \\mu_m \\kappa_m$. A common approach to multi-class classification is the use of joint feature maps $\\phi(x) : \\mathcal{X} \\to \\mathcal{H}$ [33]. For multiple kernel learning, we have $M$ feature mappings $\\phi_m$ with $\\kappa_m(x, x') = \\langle \\phi_m(x), \\phi_m(x') \\rangle$, $m = 1, \\ldots, M$. Let $\\phi_\\mu(x) = [\\phi_1(x), \\ldots, \\phi_M(x)]$. Using Theorem 2, to obtain a sharper generalization bound, we confine $q \\ge 2 \\log K$, thus $1 < p \\le \\frac{2 \\log K}{2 \\log K - 1}$. The $\\ell_p$ hypothesis space of multiple kernels can be written as:\n\n$$\\mathcal{H}_{mkl} = \\Big\\{ h_{w,\\kappa_\\mu} = (\\langle w_1, \\phi_\\mu(x) \\rangle, \\ldots, \\langle w_K, \\phi_\\mu(x) \\rangle),\\ \\|w\\|_{2,p} \\le 1,\\ 1 < p \\le \\frac{2 \\log K}{2 \\log K - 1} \\Big\\}.$$\n\n5.1 Conv-MKL\n\nThe global Rademacher complexity of $\\mathcal{H}_{mkl}$ can be bounded by the trace of the kernel matrix $K_\\mu = \\sum_{m=1}^M \\mu_m K_m$. Existing works [17, 32] use the following constraint on $\\mathcal{H}_{mkl}$: $\\mathrm{Tr}(K_\\mu) \\le 1$. According to the above theoretical analysis, the local Rademacher complexity (the tail sum of the eigenvalues of the kernel) leads to tighter generalization bounds than the global Rademacher complexity (the trace). Thus, we add the local Rademacher complexity to restrict $\\mathcal{H}_{mkl}$:\n\n$$\\mathcal{H}_1 = \\Big\\{ h_{w,\\kappa_\\mu} \\in \\mathcal{H}_{mkl} : \\sum_{j > \\zeta} \\lambda_j(K_\\mu) \\le 1 \\Big\\},$$\n\nwhere $\\lambda_j(K_\\mu)$ is the $j$-th eigenvalue of $K_\\mu$ and $\\zeta$ is a free parameter that removes the $\\zeta$ largest eigenvalues to control the tail sum. Note that the tail sum is the difference between the trace and the sum of the $\\zeta$ largest eigenvalues: $\\sum_{j > \\zeta} \\lambda_j(K_\\mu) = \\mathrm{Tr}(K_\\mu) - \\sum_{j=1}^{\\zeta} \\lambda_j(K_\\mu)$, thus the tail sum can be calculated in $O(n^2 \\zeta)$ for each kernel.\nOne can see that $\\mathcal{H}_1$ is not convex, and we know that $\\sum_{m=1}^M \\mu_m \\sum_{j > \\zeta} \\lambda_j(K_m) = \\sum_{m=1}^M \\frac{\\mu_m}{\\|\\mu\\|_1} \\sum_{j > \\zeta} \\lambda_j(\\|\\mu\\|_1 K_m) \\le \\sum_{j > \\zeta} \\lambda_j(K_\\mu)$. Thus, we consider the use of the convex $\\mathcal{H}_2$:\n\n$$\\mathcal{H}_2 = \\Big\\{ h_{w,\\kappa_\\mu} \\in \\mathcal{H}_{mkl} : \\sum_{m=1}^M \\mu_m \\sum_{j > \\zeta} \\lambda_j(K_m) \\le 1 \\Big\\}.$$\n\nWith the normalized kernels $\\tilde{\\kappa}_m = \\big(\\sum_{j > \\zeta} \\lambda_j(K_m)\\big)^{-1} \\kappa_m$ and $\\tilde{\\kappa}_\\mu = \\sum_{m=1}^M \\mu_m \\tilde{\\kappa}_m$, we can simply rewrite $\\mathcal{H}_2$ as $\\big\\{ h_{w,\\tilde{\\kappa}_\\mu} = (\\langle w_1, \\tilde{\\phi}_\\mu(x) \\rangle, \\ldots, \\langle w_K, \\tilde{\\phi}_\\mu(x) \\rangle),\\ \\|w\\|_{2,p} \\le 1,\\ 1 < p \\le \\frac{2 \\log K}{2 \\log K - 1},\\ \\mu \\succeq 0,\\ \\|\\mu\\|_1 \\le 1 \\big\\}$, which is a commonly studied hypothesis class in multi-class multiple kernel learning. A simple process with precomputed kernel matrices regularized by local Rademacher complexity can be seen in Algorithm 1:\n\nAlgorithm 1 Conv-MKL\n\nInput: precomputed kernel matrices $K_1, \\ldots, K_M$ and $\\zeta$\nfor $m = 1$ to $M$ do\n    Compute tail sum: $r_m = \\sum_{j > \\zeta} \\lambda_j(K_m)$\n    Normalize precomputed kernel matrix: $\\tilde{K}_m = K_m / r_m$\nend for\nUse $\\tilde{K}_m$, $m = 1, \\ldots, M$, as the basic kernels in any $\\ell_p$-norm MKL solver\n\n5.2 SMSD-MKL\n\nConsidering a more challenging case, we perform penalized ERM over the class $\\mathcal{H}_1$, aiming to solve a convex optimization problem with an additional term representing local Rademacher complexity:\n\n$$\\min_{w,\\mu}\\ \\underbrace{\\frac{1}{n} \\sum_{i=1}^n \\ell(w, \\phi_\\mu(x_i), y_i)}_{C(w)} + \\underbrace{\\frac{\\alpha}{2} \\|w\\|_{2,p}^2 + \\beta \\sum_{m=1}^M \\mu_m r_m}_{\\Omega(w)}, \\quad (3)$$\n\nwhere $\\ell(w, \\phi_\\mu(x_i), y_i) = \\big| 1 - \\big( \\langle w_{y_i}, \\phi_\\mu(x_i) \\rangle - \\max_{y \\neq y_i} \\langle w_y, \\phi_\\mu(x_i) \\rangle \\big) \\big|_+$ and $r_m = \\sum_{j > \\zeta} \\lambda_j(K_m)$ is the tail sum of the eigenvalues of the $m$-th kernel matrix, $m = 1, \\ldots, M$.\n\nAlgorithm 2 SMSD-MKL\n\nInput: $\\alpha, \\beta, r, T$\nInitialize: $w^1 = 0$, $\\boldsymbol{\\theta}^1 = 0$, $\\mu^1 = 1$, $q = 2 \\log K$\nfor $t = 1$ to $T$ do\n    Sample at random $(x^t, y^t)$\n    Compute the dual weight: $\\boldsymbol{\\theta}^{t+1} = \\boldsymbol{\\theta}^t - \\partial C(w^t)$\n    $\\nu_m^{t+1} = \\|\\boldsymbol{\\theta}_m^{t+1}\\| - t \\beta r_m$, $\\forall m = 1, \\ldots, M$\n    $\\mu_m^{t+1} = \\frac{\\mathrm{sgn}(\\nu_m^{t+1}) |\\nu_m^{t+1}|^{q-1}}{\\alpha \\|\\boldsymbol{\\theta}_m^{t+1}\\| \\|\\nu^{t+1}\\|_q^{q-2}}$, $\\forall m = 1, \\ldots, M$\nend for\n\nBased on the stochastic mirror descent framework for minimization problems in [31, 29], we design a stochastic mirror and sub-gradient descent algorithm, called SMSD-MKL, to minimize (3), as shown in Algorithm 2.\nAs in the mirror descent algorithm, it maintains two weight vectors: the primal vector $w$ and the dual vector $\\boldsymbol{\\theta}$. 
Meanwhile, the optimization formulation can be divided into two parts: $C(w)$ to update $\\boldsymbol{\\theta}$ and $\\Omega(w)$ to update $w$ by the gradient of the Fenchel dual of $\\Omega$. Actually, the algorithm puts the kernel weight $\\boldsymbol{\\mu}$ aside when updating $\\boldsymbol{\\theta}$, but $\\boldsymbol{\\mu}$ is updated together with $w$ according to a tricky link function given in Theorem 3.\n\n\u2022 For $C(w)$, the algorithm updates the dual vector with the gradient of $C(w)$. Since the hinge loss used in $C(w)$ is not differentiable, the algorithm uses the sub-gradient $z^t = \\partial \\ell(w^t, \\phi_\\mu(x^t), y^t)$, where $\\partial \\ell(w^t, \\phi_\\mu(x^t), y^t)$ is the sub-gradient w.r.t. $w^t$.\n\u2022 For $\\Omega(w)$, as in UFO-MKL [29], the algorithm uses $w = \\nabla \\Omega^*(\\boldsymbol{\\theta})$ to update the primal vector $w$, the calculation of which is given in Theorem 3.\n\nThe algorithm starts with $w^1 = 0$, $\\boldsymbol{\\theta}^1 = 0$ and $\\boldsymbol{\\mu}^1 = 1$. In particular, the algorithm initializes $q = 2 \\log K$ to make the order of the generalization bound reach $O((\\log K)^{2 + 1/\\log K}/n)$, according to Theorem 2.\nIn each iteration, the algorithm randomly samples a training example from the training set. Actually, the algorithm updates the real numbers $\\|\\boldsymbol{\\theta}_m^{t+1}\\|$, $\\nu_m^{t+1}$ and $\\mu_m^{t+1}$ in scalar products instead of the high-dimensional variables $w^{t+1}$ and $\\boldsymbol{\\theta}_m^{t+1}$. The norm $\\|\\boldsymbol{\\theta}_m^{t+1}\\|$ can be calculated in an efficient incremental way from scalar values as follows:\n\n$$\\|\\boldsymbol{\\theta}_m^{t+1}\\|_2^2 = \\|\\boldsymbol{\\theta}_m^t - z_m^t\\|_2^2 = \\|\\boldsymbol{\\theta}_m^t\\|_2^2 - 2 \\boldsymbol{\\theta}_m^t \\cdot z_m^t + \\|z_m^t\\|_2^2,$$\n\nwhere $z^t = \\partial \\ell(w^t, \\phi_\\mu(x^t), y^t)$.\nTheorem 3. Let $\\nu = \\big[ \\|\\boldsymbol{\\theta}_1\\| - \\beta r_1, \\ldots, \\|\\boldsymbol{\\theta}_M\\| - \\beta r_M \\big]$, then the $m$-th component of $\\nabla \\Omega^*(\\boldsymbol{\\theta})$ is\n\n$$\\frac{\\mathrm{sgn}(\\nu_m) \\boldsymbol{\\theta}_m}{\\alpha \\|\\boldsymbol{\\theta}_m\\|} \\cdot \\frac{|\\nu_m|^{q-1}}{\\|\\nu\\|_q^{q-2}},$$\n\nwhere $\\mathrm{sgn}(x)$ is defined as $\\mathrm{sgn}(x) = 1$ if $x > 0$, $\\mathrm{sgn}(x) = -1$ if $x < 0$, and $\\mathrm{sgn}(x) \\in [-1, +1]$ if $x = 0$.\n\n6 Experiments\n\nIn this section, we compare our proposed Conv-MKL (Algorithm 1) and SMSD-MKL (Algorithm 2) with 7 popular multi-class classification methods: One-against-One [12], One-against-the-Rest [3], $\\ell_1$-norm linear multi-class SVM (LMC) [6], generalized minimal norm problem solver (GMNP) [8], the Multiclass MKL (MC-MKL) with $\\ell_1$-norm and $\\ell_2$-norm [38] and mixed-norm MKL solved by stochastic gradient descent (UFO-MKL) [29]. We run the comparison tests using the implementations in LIBSVM (One-against-One and One-against-the-Rest), the DOGMA library\u00b2 (LMC, GMNP, $\\ell_1$-norm and $\\ell_2$-norm MC-MKL) and SHOGUN-6.1.3\u00b3 (UFO-MKL). We implement our proposed Conv-MKL and SMSD-MKL algorithms based on UFO-MKL.\n\n\u00b2Available at http://dogma.sourceforge.net\n\u00b3Available at http://www.shogun-toolbox.org/\n\nTable 1: Comparison of average test accuracies of our Conv-MKL and SMSD-MKL with the others. We bold the numbers of the best method and underline the numbers of the other methods which are not significantly worse than the best one.\n\nLMC\n\nGMNP\n\nSMSD-MKL\n\nOne vs. One One vs. 
Rest\n\nConv-MKL\n77.14\u00b12.25 78.01\u00b12.17 70.12\u00b12.96 75.83\u00b12.69 75.17\u00b12.68 75.42\u00b13.64 77.60\u00b12.63\nplant\n74.41\u00b13.35 76.23\u00b13.39 63.85\u00b13.94 73.33\u00b14.21 71.70\u00b14.89 73.55\u00b14.22 71.87\u00b14.87\npsortPos\n74.07\u00b12.16 74.66\u00b11.90 57.85\u00b12.49 73.74\u00b12.87 71.94\u00b12.50 74.27\u00b12.51 72.83\u00b12.20\npsortNeg\n79.15\u00b11.51 78.69\u00b11.58 75.16\u00b11.48 77.78\u00b11.52 77.49\u00b11.53 78.35\u00b11.46 77.89\u00b11.79\nnonpl\n92.83\u00b12.62 93.39\u00b10.70 93.16\u00b10.66 90.61\u00b10.69 91.34\u00b10.61\nsector\n96.79\u00b10.91 97.62\u00b10.83 95.07\u00b11.11 97.08\u00b10.61 97.02\u00b10.80 96.87\u00b10.80 96.98\u00b10.64\nsegment\n79.35\u00b12.27 77.28\u00b12.78 75.61\u00b13.56 78.72\u00b11.92 79.11\u00b11.94 81.57\u00b12.24 74.96\u00b12.93\nvehicle\n98.82\u00b11.19 98.83\u00b15.57 62.32\u00b14.97 98.12\u00b11.76 98.22\u00b11.83 97.04\u00b11.85 98.27\u00b11.22\nvowel\n99.63\u00b10.96 99.63\u00b10.96 97.87\u00b12.80 97.24\u00b13.05 98.14\u00b13.04 97.69\u00b12.43 98.61\u00b11.75\nwine\n96.08\u00b10.83 96.30\u00b10.79 92.02\u00b11.50 95.89\u00b10.56 95.61\u00b10.73 94.60\u00b10.94 96.27\u00b10.68\ndna\n75.19\u00b15.05 73.72\u00b15.80 63.95\u00b16.04 71.98\u00b15.75 70.00\u00b15.75 71.24\u00b18.14 69.07\u00b18.08\nglass\n96.67\u00b12.94 97.00\u00b12.63 88.00\u00b17.82 95.93\u00b13.25 95.87\u00b13.20 95.40\u00b17.34 95.40\u00b16.46\niris\nsvmguide2 82.69\u00b15.65 85.17\u00b13.83 81.10\u00b14.15 84.79\u00b13.45 84.27\u00b13.03 81.77\u00b13.45 83.16\u00b13.63\n91.64\u00b10.88 91.78\u00b10.82 84.95\u00b11.15 90.67\u00b10.91 89.29\u00b10.96 89.97\u00b10.81 91.86\u00b10.62\nsatimage\n\n(cid:96)1 MC-MKL (cid:96)2 MC-MKL UFO-MKL\n75.49\u00b12.48 76.77\u00b12.42\n70.70\u00b14.89 74.56\u00b14.04\n72.42\u00b12.65 73.80\u00b12.26\n77.95\u00b11.64 78.07\u00b11.56\n92.15\u00b12.57 92.60\u00b10.47\n97.58\u00b10.68 97.20\u00b10.82\n76.27\u00b13.15 76.92\u00b12.83\n97.86\u00b11.75 
98.22\u00b11.62\n98.52\u00b11.89 99.44\u00b11.13\n95.06\u00b10.92 95.84\u00b10.61\n74.03\u00b16.41 72.46\u00b16.12\n94.00\u00b17.82 95.93\u00b12.88\n83.84\u00b14.21 82.91\u00b13.09\n90.43\u00b11.27 91.92\u00b10.83\n\n\\\n\n\\\n\nWe experiment on 14 publicly available datasets: four of them evaluated in [38] (plant, nonpl, psortPos, and psortNeg) and the others from LIBSVM Data. For each dataset, we use the Gaussian kernel $K(x, x') = \\exp(-\\|x - x'\\|_2^2 / 2\\tau)$ as our basic kernels, where $\\tau \\in \\{2^i : i = -10, -9, \\ldots, 9, 10\\}$. For single kernel methods (One vs. One, One vs. Rest and GMNP), we choose the kernel which has the highest performance among the basic kernels, estimated by 10-fold cross-validation. Meanwhile, we use all basic kernels in MKL methods (Conv-MKL, SMSD-MKL, $\\ell_1$ MC-MKL, $\\ell_2$ MC-MKL and UFO-MKL). The regularization parameter $\\alpha \\in \\{2^i : i = -2, \\ldots, 12\\}$ in all algorithms, and $\\zeta \\in \\{2^i : i = 1, 2, \\ldots, 4\\}$, $\\beta \\in \\{10^i : i = -4, \\ldots, 1\\}$ in SMSD-MKL, are determined by 10-fold cross-validation on training data. Other parameters in compared algorithms follow the same experimental setting in their papers. For each dataset, we run all methods 50 times with randomly selected 80% for training and 20% for testing, offering an estimate of the statistical significance of differences in performance between methods. All statements of statistical significance in the remainder refer to a 95% level of significance under the t-test.\nThe average test accuracies are reported in Table 1. 
The results show: 1) Our methods, Conv-MKL and SMSD-MKL, give the best results on nearly all datasets, the exceptions being vehicle and satimage; 2) SMSD-MKL is better than Conv-MKL overall, winning on about 2/3 of the datasets; 3) Compared with typical MKL methods, our methods obtain better results on almost all datasets, the only exception being that UFO-MKL works slightly better than ours on satimage; 4) The MKL methods usually work better than the compared single kernel methods (One vs. One, One vs. Rest and GMNP); 5) The kernel classification methods perform better than the linear classification machine (LMC) on all datasets.\nThe above results show that the use of the local Rademacher complexity can significantly improve the performance of multi-class multiple kernel learning algorithms, which conforms to our theoretical analysis.\n\n7 Conclusion\n\nIn this paper, we studied the generalization performance of multi-class classification and derived a data-dependent generalization error bound using the local Rademacher complexity, which is much sharper than existing data-dependent generalization bounds for multi-class classification. Then, we designed two algorithms with statistical guarantees and fast convergence rates: Conv-MKL and SMSD-MKL. Based on local Rademacher complexity, our analysis can be used as a solid basis for the design of new multi-class kernel learning algorithms.\n\nAcknowledgments\n\nThis work is supported in part by the National Natural Science Foundation of China (No.61703396, No.61673293, No.61602467), the National Key Research and Development Program of China (No.2016YFB1000604), the Science and Technology Project of Beijing (No.Z181100002718004) and the Excellent Talent Introduction of Institute of Information Engineering of CAS (Y7Z0111107).\n\nReferences\n\n[1] E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. 
Journal of Machine Learning Research, 1:113\u2013141, 2000.\n\n[2] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497\u20131537, 2005.\n\n[3] L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, L. D. Jackel, Y. LeCun, U. A. Muller, E. Sackinger, P. Simard, et al. Comparison of classifier methods: a case study in handwritten digit recognition. In Proceedings of the 12th IAPR International Conference on Pattern Recognition, pages 77\u201382, 1994.\n\n[4] C. Cortes, M. Kloft, and M. Mohri. Learning kernels using local Rademacher complexity. In Advances in Neural Information Processing Systems 25 (NIPS), pages 2760\u20132768, 2013.\n\n[5] C. Cortes, M. Mohri, and A. Rostamizadeh. Multi-class classification with maximum margin multiple kernel. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 46\u201354, 2013.\n\n[6] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265\u2013292, 2002.\n\n[7] A. Daniely and S. Shalev-Shwartz. Optimal learners for multiclass problems. In Proceedings of the 27th Conference on Learning Theory (COLT), pages 287\u2013316, 2014.\n\n[8] V. Franc. Optimization algorithms for kernel methods. PhD dissertation, Czech Technical University, Prague, 2005.\n\n[9] Y. Guermeur. Combining discriminant models with new multi-class SVMs. Pattern Analysis & Applications, 5(2):168\u2013179, 2002.\n\n[10] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 1225\u20131234, 2016.\n\n[11] S. I. Hill and A. Doucet. A framework for kernel-based multi-category classification. Journal of Artificial Intelligence Research, 30:525\u2013564, 2007.\n\n[12] S. Knerr, L. Personnaz, and G. 
Dreyfus. Single-layer learning revisited: a stepwise procedure for building and training a neural network. In Neurocomputing, pages 41\u201350. Springer, 1990.\n\n[13] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593\u20132656, 2006.\n\n[14] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30:1\u201350, 2002.\n\n[15] V. Koltchinskii, D. Panchenko, and F. Lozano. Some new bounds on the generalization error of combined classifiers. In Advances in Neural Information Processing Systems 14 (NIPS), pages 245\u2013251, 2001.\n\n[16] V. Kuznetsov, M. Mohri, and U. Syed. Multi-class deep boosting. In Advances in Neural Information Processing Systems 27 (NIPS), pages 2501\u20132509, 2014.\n\n[17] G. R. G. Lanckriet, N. Cristianini, P. L. Bartlett, L. E. Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27\u201372, 2004.\n\n[18] Y. Lei, U. Dogan, A. Binder, and M. Kloft. Multi-class SVMs: From tighter data-dependent generalization bounds to novel algorithms. In Advances in Neural Information Processing Systems 27 (NIPS), pages 2035\u20132043, 2015.\n\n[19] Y. Liu, S. Jiang, and S. Liao. Eigenvalues perturbation of integral operator for kernel selection. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM), pages 2189\u20132198, 2013.\n\n[20] Y. Liu, S. Jiang, and S. Liao. Efficient approximation of cross-validation for kernel methods using Bouligand influence function. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 324\u2013332, 2014.\n\n[21] Y. Liu and S. Liao. Preventing over-fitting of cross-validation with kernel stability. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML), pages 290\u2013305, 2014.\n\n[22] Y. Liu and S. Liao. Eigenvalues ratio for kernel selection of kernel methods. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI), pages 2814\u20132820, 2015.\n\n[23] Y. Liu, S. Liao, and Y. Hou. Learning kernels with upper bounds of leave-one-out error. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM), pages 2205\u20132208, 2011.\n\n[24] Y. Liu, S. Liao, H. Lin, Y. Yue, and W. Wang. Infinite kernel learning: generalization bounds and algorithms. In Proceedings of the 21st AAAI Conference on Artificial Intelligence (AAAI), pages 2280\u20132286, 2017.\n\n[25] Y. Maximov and D. Reshetova. Tight risk bounds for multi-class margin classifiers. Pattern Recognition and Image Analysis, 26(4):673\u2013680, 2016.\n\n[26] D. McAllester. A PAC-Bayesian tutorial with a dropout bound. arXiv preprint, 2013.\n\n[27] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of machine learning. MIT Press, 2012.\n\n[28] B. K. Natarajan. On learning sets and functions. Machine Learning, 4(1):67\u201397, 1989.\n\n[29] F. Orabona and J. Luo. Ultra-fast optimization algorithm for sparse multi kernel learning. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 249\u2013256, 2011.\n\n[30] F. Orabona, J. Luo, and B. Caputo. Online-batch strongly convex multi kernel learning. In The 23rd IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 787\u2013794, 2010.\n\n[31] S. Shalev-Shwartz and A. Tewari. Stochastic methods for l1-regularized loss minimization. Journal of Machine Learning Research, 12:1865\u20131892, 2011.\n\n[32] S. Sonnenburg, G. R\u00e4tsch, C. Sch\u00e4fer, and B. Sch\u00f6lkopf. 
Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531\u20131565, 2006.\n\n[33] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the 21st International Conference on Machine Learning (ICML), page 104, 2004.\n\n[34] V. Vapnik. The nature of statistical learning theory. Springer Verlag, 2000.\n\n[35] C. Xu, T. Liu, D. Tao, and C. Xu. Local Rademacher complexity for multi-label learning. IEEE Transactions on Image Processing, 25(3):1495\u20131507, 2016.\n\n[36] N. Yousefi, Y. Lei, M. Kloft, M. Mollaghasemi, and G. Anagnostopoulos. Local Rademacher complexity-based learning guarantees for multi-task learning. arXiv preprint arXiv:1602.05916, 2016.\n\n[37] T. Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225\u20131251, 2004.\n\n[38] A. Zien and C. S. Ong. Multiclass multiple kernel learning. In Proceedings of the 24th International Conference on Machine Learning (ICML), pages 1191\u20131198, 2007.", "award": [], "sourceid": 802, "authors": [{"given_name": "Jian", "family_name": "Li", "institution": "Institute of Information Engineering, CAS"}, {"given_name": "Yong", "family_name": "Liu", "institution": "Institute of Information Engineering, CAS"}, {"given_name": "Rong", "family_name": "Yin", "institution": "School of Cyber Security, University of Chinese Academy of Sciences"}, {"given_name": "Hua", "family_name": "Zhang", "institution": "Institute of Information Engineering,Chinese Academy of Sciences"}, {"given_name": "Lizhong", "family_name": "Ding", "institution": "KAUST"}, {"given_name": "Weiping", "family_name": "Wang", "institution": "Institute of Information Engineering, CAS, China"}]}