{"title": "Multi-class SVMs: From Tighter Data-Dependent Generalization Bounds to Novel Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 2035, "page_last": 2043, "abstract": "This paper studies the generalization performance of multi-class classification algorithms, for which we obtain, for the first time, a data-dependent generalization error bound with a logarithmic dependence on the class size, substantially improving the state-of-the-art linear dependence in the existing data-dependent generalization analysis. The theoretical analysis motivates us to introduce a new multi-class classification machine based on lp-norm regularization, where the parameter p controls the complexity of the corresponding bounds. We derive an efficient optimization algorithm based on Fenchel duality theory. Benchmarks on several real-world datasets show that the proposed algorithm can achieve significant accuracy gains over the state of the art.", "full_text": "Multi-class SVMs: From Tighter Data-Dependent\n\nGeneralization Bounds to Novel Algorithms\n\nYunwen Lei\n\nDepartment of Mathematics\nCity University of Hong Kong\nyunwelei@cityu.edu.hk\n\n\u00a8Ur\u00a8un Dogan\n\nMicrosoft Research\n\nCambridge CB1 2FB, UK\n\nudogan@microsoft.com\n\nAlexander Binder\n\nISTD Pillar\n\nSingapore University of Technology and Design\n\nMachine Learning Group, TU Berlin\n\nalexander binder@sutd.edu.sg\n\nMarius Kloft\n\nDepartment of Computer Science\nHumboldt University of Berlin\n\nkloft@hu-berlin.de\n\nAbstract\n\nThis paper studies the generalization performance of multi-class classi\ufb01cation al-\ngorithms, for which we obtain\u2014for the \ufb01rst time\u2014a data-dependent generaliza-\ntion error bound with a logarithmic dependence on the class size, substantially\nimproving the state-of-the-art linear dependence in the existing data-dependent\ngeneralization analysis. 
The theoretical analysis motivates us to introduce a new multi-class classification machine based on ℓp-norm regularization, where the parameter p controls the complexity of the corresponding bounds. We derive an efficient optimization algorithm based on Fenchel duality theory. Benchmarks on several real-world datasets show that the proposed algorithm can achieve significant accuracy gains over the state of the art.

1 Introduction

Typical multi-class application domains such as natural language processing [1], information retrieval [2], image annotation [3], and web advertising [4] involve tens or hundreds of thousands of classes, and these datasets are still growing [5]. To handle such learning tasks, it is essential to build algorithms that scale favorably with the number of classes. Over the past years, much progress in this respect has been achieved on the algorithmic side [4–7], including efficient stochastic gradient optimization strategies [8].

Although theoretical properties such as consistency [9–11] and finite-sample behavior [1, 12–15] have also been studied, there is still a discrepancy between algorithms and theory in the sense that the corresponding theoretical bounds often do not scale well with the number of classes. This discrepancy is most pronounced in research on data-dependent generalization bounds, that is, bounds that measure the generalization performance of prediction models purely from the training samples and are therefore very appealing for model selection [16]. A crucial advantage of these bounds is that they can better capture the properties of the distribution that generated the data, which can lead to tighter estimates [17] than conservative data-independent bounds.

To the best of our knowledge, the first data-dependent error bounds for multi-class classification were given by [14]. 
These bounds exhibit a quadratic dependence on the class size and were used by [12] and [18] to derive bounds for kernel-based multi-class classification and multiple kernel learning (MKL) problems, respectively. More recently, [13] improved the quadratic dependence to a linear one by introducing a novel surrogate for the multi-class margin that is independent of the true realization of the class label.

However, a heavy dependence on the class size, such as linear or quadratic, implies a poor generalization guarantee for large-scale multi-class classification problems with a massive number of classes. In this paper, we show data-dependent generalization bounds for multi-class classification problems that, for the first time, exhibit a sublinear dependence on the number of classes. With an appropriate choice of regularization, this dependence can be as mild as logarithmic. We achieve these improved bounds via the use of Gaussian complexities, while previous bounds are based on a well-known structural result on Rademacher complexities for classes induced by the maximum operator. The proposed proof technique based on Gaussian complexities exploits potential coupling among the different components of the multi-class classifier, a fact ignored by previous analyses. The result shows that the generalization ability is strongly influenced by the employed regularization, which motivates us to propose a new learning machine performing block-norm regularization over the multi-class components. As a natural choice we investigate here the application of the proven ℓp-norm [19]. This results in a novel ℓp-norm multi-class support vector machine (MC-SVM), which contains the classical model of Crammer & Singer [20] as the special case p = 2. 
The bounds indicate that the parameter p crucially controls the complexity of the resulting prediction models. We develop an efficient optimization algorithm for the proposed method based on its Fenchel dual representation. We empirically evaluate its effectiveness on several standard benchmarks for multi-class classification taken from various domains, where the proposed approach significantly outperforms the state-of-the-art method of [20].

The remainder of this paper is structured as follows. Section 2 introduces the problem setting and presents the main theoretical results, motivated by which we propose a new multi-class classification model in Section 3 together with an efficient optimization algorithm based on Fenchel duality theory. In Section 4 we evaluate the approach for the application of visual image recognition and on several standard benchmark datasets taken from various application domains. Section 5 concludes.

2 Theory

2.1 Problem Setting and Notations

This paper considers multi-class classification problems with c ≥ 2 classes. Let X denote the input space and Y = {1, 2, . . . , c} the output space. Assume that we are given a sequence of examples S = {(x_1, y_1), . . . , (x_n, y_n)} ∈ (X × Y)^n, drawn independently according to a probability measure P defined on the sample space Z = X × Y. Based on the training examples S, we wish to learn a prediction rule h from a space H of hypotheses mapping Z to R and use the mapping x → arg max_{y∈Y} h(x, y) to predict (ties are broken by favoring the class with the lower index, for which the loss function defined below always counts an error). 
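As a concrete illustration of the prediction rule just described, the following sketch (our illustration, not part of the paper; it assumes plain linear scorers h(x, y) = ⟨w_y, x⟩ and NumPy) implements the arg-max prediction with the lowest-index tie-breaking used in the text:

```python
import numpy as np

def predict(W, x):
    """Predict a class for input x given per-class weight vectors W (shape c x d).

    Scores are h(x, y) = <w_y, x>; ties are broken in favor of the class with
    the lower index, matching the convention in the problem setting above.
    """
    scores = W @ x
    # np.argmax returns the first (lowest-index) maximizer on ties.
    return int(np.argmax(scores))

W = np.array([[1.0, 0.0],
              [1.0, 0.0],   # identical to class 0: induces a tie
              [0.0, 1.0]])
x = np.array([1.0, 0.0])
print(predict(W, x))  # tie between classes 0 and 1 -> 0
```

Note that under this convention a tie at the true class still counts as an error, since the margin defined next is then non-positive.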
For any hypothesis h ∈ H, the margin ρ_h(x, y) of the function h at a labeled example (x, y) is

    ρ_h(x, y) := h(x, y) − max_{y′≠y} h(x, y′).

The prediction rule h makes an error at (x, y) if ρ_h(x, y) ≤ 0, and thus the expected risk incurred from using h for prediction is R(h) := E[1_{ρ_h(x,y)≤0}].

Any function h : X × Y → R can be equivalently represented by the vector-valued function (h_1, . . . , h_c) with h_j(x) = h(x, j) for all j = 1, . . . , c. We denote by H̃ := {ρ_h : h ∈ H} the class of margin functions associated to H. Let k : X × X → R be a Mercer kernel with φ the associated feature map, i.e., k(x, x̃) = ⟨φ(x), φ(x̃)⟩ for all x, x̃ ∈ X. We denote by ‖·‖_* the dual norm of ‖·‖, i.e., ‖w‖_* := sup_{‖w̄‖≤1} ⟨w, w̄⟩. For a convex function f, we denote by f* its Fenchel conjugate, i.e., f*(v) := sup_w [⟨w, v⟩ − f(w)]. For any w = (w_1, . . . , w_c) we define the ℓ_{2,p}-norm by ‖w‖_{2,p} := [Σ_{j=1}^c ‖w_j‖_2^p]^{1/p}. For any p ≥ 1, we denote by p* the dual exponent of p satisfying 1/p + 1/p* = 1, and set p̄ := p(2 − p)^{−1}. We require the following definitions.

Definition 1 (Strong Convexity). A function f : X → R is said to be β-strongly convex w.r.t. a norm ‖·‖ iff for all x, y ∈ X and all α ∈ (0, 1) we have

    f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) − (β/2) α(1 − α) ‖x − y‖².

Definition 2 (Regular Loss). 
We call ℓ an L-regular loss if it satisfies the following properties:

(i) ℓ(t) bounds the 0-1 loss from above: ℓ(t) ≥ 1_{t≤0};
(ii) ℓ is L-Lipschitz in the sense that |ℓ(t_1) − ℓ(t_2)| ≤ L|t_1 − t_2|;
(iii) ℓ(t) is decreasing and has a zero point c_ℓ, i.e., ℓ(c_ℓ) = 0.

Examples of L-regular loss functions include the hinge loss ℓ_h(t) = (1 − t)_+ and the margin loss

    ℓ_ρ(t) = 1_{t≤0} + (1 − t ρ^{−1}) 1_{0<t≤ρ},   ρ > 0.   (1)

2.2 Main results

Our discussion of data-dependent generalization error bounds is based on the established methodology of Rademacher and Gaussian complexities [21].

Definition 3 (Rademacher and Gaussian Complexity). Let H be a family of real-valued functions defined on Z and let S = (z_1, . . . , z_n) be a fixed sample of size n with elements in Z. Then, the empirical Rademacher and Gaussian complexities of H with respect to the sample S are defined by

    R_S(H) = E_σ[ sup_{h∈H} (1/n) Σ_{i=1}^n σ_i h(z_i) ],   G_S(H) = E_g[ sup_{h∈H} (1/n) Σ_{i=1}^n g_i h(z_i) ],

where σ_1, . . . , σ_n are independent random variables taking the values +1 and −1 with equal probability, and g_1, . . . , g_n are independent N(0, 1) random variables.

Note that we have the following comparison inequality relating Rademacher and Gaussian complexities (cf. Section 4.2 in [22]):

    R_S(H) ≤ √(π/2) G_S(H) ≤ 3 √(π/2) √(log n) R_S(H).   (2)

Existing work on data-dependent generalization bounds for multi-class classifiers [12–14, 18] builds on the following structural result on Rademacher complexities (e.g., [12], Lemma 8.1):

    R_S({max{h_1, . . . , h_c} : h_j ∈ H_j, j = 1, . . . , c}) ≤ Σ_{j=1}^c R_S(H_j),   (3)

where H_1, . . . , H_c are c hypothesis sets. 
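For a finite hypothesis class, the empirical complexities of Definition 3 can be estimated by Monte Carlo. The sketch below (our illustration, not part of the paper; names and the toy class are ours) draws Rademacher and Gaussian noise, averages the suprema, and numerically respects the first comparison inequality in (2):

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_complexity(H, noise, n_draws=2000):
    """Monte-Carlo estimate of E[ sup_{h in H} (1/n) sum_i s_i h(z_i) ].

    H: array of shape (num_hypotheses, n) holding the values h(z_i)
       of a finite hypothesis class on a fixed sample of size n.
    noise: 'rademacher' (random signs) or 'gaussian' (standard normals).
    """
    m, n = H.shape
    total = 0.0
    for _ in range(n_draws):
        if noise == "rademacher":
            s = rng.choice([-1.0, 1.0], size=n)
        else:
            s = rng.standard_normal(n)
        total += np.max(H @ s) / n   # sup over the finite class
    return total / n_draws

# Toy finite class: 3 hypotheses evaluated at n = 5 sample points.
H = rng.standard_normal((3, 5))
r = empirical_complexity(H, "rademacher")
g = empirical_complexity(H, "gaussian")
```

Up to Monte-Carlo noise, the estimates satisfy R_S(H) ≤ √(π/2) G_S(H), the left inequality of (2).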
This result is crucial for the standard generalization analysis of multi-class classification since the margin ρ_h involves the maximum operator, which is removed by (3), but at the expense of a linear dependence on the class size. In the following we show that this linear dependence is suboptimal because (3) does not take into account the coupling among the different classes. For example, a common regularizer used in multi-class learning algorithms is r(h) = Σ_{j=1}^c ‖h_j‖_2² [20], for which the components h_1, . . . , h_c are correlated via a ‖·‖_{2,2} regularizer, and the bound (3), which ignores this correlation, is not effective in this case [12–14, 18].

As a remedy, we here introduce a new structural complexity result on function classes induced from general classes via the maximum operator that preserves the correlations among the different components. Instead of the Rademacher complexity, Lemma 4 concerns the structural relationship of Gaussian complexities, since it rests on a comparison result among different Gaussian processes.

Lemma 4 (Structural result on Gaussian complexity). Let H be a class of functions defined on X × Y with Y = {1, . . . , c}. Let g_1, . . . , g_{nc} be independent N(0, 1) random variables. Then, for any sample S = {x_1, . . . , x_n} of size n, we have

    G_S({max{h_1, . . . , h_c} : h = (h_1, . . . , h_c) ∈ H}) ≤ (1/n) E_g sup_{h=(h_1,...,h_c)∈H} Σ_{i=1}^n Σ_{j=1}^c g_{(j−1)n+i} h_j(x_i),   (4)

where E_g denotes the expectation w.r.t. the Gaussian variables g_1, . . . , g_{nc}.

The proof of Lemma 4 is given in Supplementary Material A. Equipped with Lemma 4, we are now able to present a general data-dependent margin-based generalization bound. 
The proofs of the following results (Theorem 5, Theorem 7 and Corollary 8) are given in Supplementary Material B.

Theorem 5 (Data-dependent generalization bound for multi-class classification). Let H ⊂ R^{X×Y} be a hypothesis class with Y = {1, . . . , c}. Let ℓ be an L-regular loss function and denote B_ℓ := sup_{(x,y),h} ℓ(ρ_h(x, y)). Suppose that the examples S = {(x_1, y_1), . . . , (x_n, y_n)} are independently drawn from a probability measure defined on X × Y. Then, for any δ > 0, with probability at least 1 − δ, the following multi-class classification generalization bound holds for any h ∈ H:

    R(h) ≤ (1/n) Σ_{i=1}^n ℓ(ρ_h(x_i, y_i)) + (√(2π) L / n) E_g sup_{h=(h_1,...,h_c)∈H} Σ_{i=1}^n Σ_{j=1}^c g_{(j−1)n+i} h_j(x_i) + 3 B_ℓ √(log(2/δ) / (2n)),

where g_1, . . . , g_{nc} are independent N(0, 1) random variables.

Remark 6. Under the same conditions as Theorem 5, [12] derive the following data-dependent generalization bound (cf. Corollary 8.1 in [12]):

    R(h) ≤ (1/n) Σ_{i=1}^n ℓ(ρ_h(x_i, y_i)) + (4Lc/n) R_S({x → h(x, y) : y ∈ Y, h ∈ H}) + 3 B_ℓ √(log(2/δ) / (2n)).

This linear dependence on c is due to the use of (3). For comparison, Theorem 5 implies that the dependence on c is governed by the term sup_h Σ_{i=1}^n Σ_{j=1}^c g_{(j−1)n+i} h_j(x_i), an advantage of which is that the components h_1, . . . , h_c are jointly coupled. As we will see, this allows us to derive an improved result with a favorable dependence on c when a constraint is imposed on (h_1, . . . , h_c).

The following Theorem 7 applies the general result of Theorem 5 to kernel-based methods. 
The hypothesis space is defined by imposing a constraint with a general strongly convex function.

Theorem 7 (Data-dependent generalization bound for kernel-based multi-class learning algorithms and MC-SVMs). Suppose that the hypothesis space is defined by

    H := H_{f,Λ} = {h_w = (⟨w_1, φ(x)⟩, . . . , ⟨w_c, φ(x)⟩) : f(w) ≤ Λ},

where f is a β-strongly convex function w.r.t. a norm ‖·‖ defined on H satisfying f*(0) = 0. Let ℓ be an L-regular loss function and denote B_ℓ := sup_{(x,y),h} ℓ(ρ_h(x, y)). Let g_1, . . . , g_{nc} be independent N(0, 1) random variables. Then, for any δ > 0, with probability at least 1 − δ we have

    R(h_w) ≤ (1/n) Σ_{i=1}^n ℓ(ρ_{h_w}(x_i, y_i)) + (4L/n) √(πΛ/β) E_g ‖( Σ_{i=1}^n g_{(j−1)n+i} φ(x_i) )_{j=1,...,c}‖_* + 3 B_ℓ √(log(2/δ) / (2n)).

We now consider the following specific hypothesis spaces using a ‖·‖_{2,p} constraint:

    H_{p,Λ} := {h_w = (⟨w_1, φ(x)⟩, . . . , ⟨w_c, φ(x)⟩) : ‖w‖_{2,p} ≤ Λ},   1 ≤ p ≤ 2.   (5)

Corollary 8 (ℓp-norm MC-SVM generalization bound). Let ℓ be an L-regular loss function and denote B_ℓ := sup_{(x,y),h} ℓ(ρ_h(x, y)). 
Then, with probability at least 1 − δ, for any h_w ∈ H_{p,Λ}, 1 ≤ p ≤ 2, the generalization error R(h_w) can be upper bounded by:

    R(h_w) ≤ (1/n) Σ_{i=1}^n ℓ(ρ_{h_w}(x_i, y_i)) + 3 B_ℓ √(log(2/δ) / (2n)) + (2LΛ/n) √(Σ_{i=1}^n k(x_i, x_i)) × { √e (4 log c)^{1 + 1/(2 log c)},  if p* ≥ 2 log c;  c^{1/p*} (2p*)^{1 + 1/p*},  otherwise }.

Remark 9. The bounds in Corollary 8 enjoy a mild dependence on the number of classes. The dependence is polynomial with exponent 1/p* for 2 < p* < 2 log c and becomes logarithmic for p* ≥ 2 log c. Even in the theoretically unfavorable case p = 2 [20], the bounds still exhibit only a radical (square-root) dependence on the number of classes, which is substantially milder than the quadratic dependence established in [12, 14, 18] and the linear dependence established in [13]. Our generalization bound is data-dependent and shows clearly how the margin affects the generalization performance (when ℓ is the margin loss ℓ_ρ): a large margin ρ increases the empirical error while decreasing the model's complexity, and vice versa.

2.3 Comparison of the Achieved Bounds to the State of the Art

Related work on data-independent bounds. The large body of theoretical work on multi-class learning considers data-independent bounds. Based on the ℓ∞-norm covering number bound of linear operators, [15] obtain a generalization bound exhibiting a linear dependence on the class size, which is improved by [9] to a radical dependence of the form O(n^{−1/2} (log^{3/2} n) c^{1/2} ρ^{−1}). Under conditions analogous to Corollary 8, [23] derive a class-size independent generalization guarantee. 
However, their bound is based on a delicate definition of the margin, which is why it is commonly not used in the mainstream multi-class literature. [1] derive the following generalization bound:

    E[ (1/p) log(1 + Σ_{ỹ≠y} e^{p(ρ − ⟨ŵ_y − ŵ_ỹ, φ(x)⟩)}) ] ≤ inf_{w∈H} { E[ (1/p) log(1 + Σ_{ỹ≠y} e^{p(ρ − ⟨w_y − w_ỹ, φ(x)⟩)}) ] + λn ‖w‖_{2,2}² / (2(n + 1)) } + 2 sup_{x∈X} k(x, x) / (λn),   (6)

where ρ is a margin condition, p > 0 a scaling factor, and λ a regularization parameter. Eq. (6) is class-size independent, yet Corollary 8 is superior in the following respects. First, for SVMs (i.e., the margin loss ℓ_ρ), our bound consists of an empirical error term (1/n) Σ_{i=1}^n ℓ_ρ(ρ_{h_w}(x_i, y_i)) and a complexity term divided by the margin value (note that L = 1/ρ in Corollary 8). When the margin is large (which is often desirable) [14], the last term in the bound given by Corollary 8 becomes small, while, on the contrary, the bound (6) is an increasing function of ρ, which is undesirable. Second, Theorem 7 applies to general loss functions, expressed through a strongly convex function over a general hypothesis space, while the bound (6) only applies to a specific regularization algorithm. Lastly, all of the above-mentioned results are conservative data-independent estimates.

Related work on data-dependent bounds. The techniques used in the above-mentioned papers do not straightforwardly translate to data-dependent bounds, which is the type of bounds in the focus of the present work. 
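To make the regimes discussed in Remark 9 concrete, the following sketch (our numerical illustration based on the reconstructed class-size factor of Corollary 8; the function name and the choice c = 10,000 are ours) evaluates the factor for different values of p:

```python
import math

def class_size_factor(p, c):
    """Class-size factor of the Corollary 8 bound, for 1 < p <= 2 and c >= 2.

    Polynomial regime c**(1/p*) * (2 p*)**(1 + 1/p*) for p* < 2 log c,
    logarithmic regime sqrt(e) * (4 log c)**(1 + 1/(2 log c)) otherwise.
    """
    p_star = p / (p - 1)          # dual exponent: 1/p + 1/p* = 1
    if p_star >= 2 * math.log(c):
        return math.sqrt(math.e) * (4 * math.log(c)) ** (1 + 1 / (2 * math.log(c)))
    return c ** (1 / p_star) * (2 * p_star) ** (1 + 1 / p_star)

c = 10_000
# p = 2 (Crammer & Singer): p* = 2, so the factor scales like sqrt(c).
# p = 1.1: p* = 11, a much milder c**(1/11) dependence.
print(class_size_factor(2.0, c) > class_size_factor(1.1, c))
```

The two branches agree at the threshold p* = 2 log c, since c^{1/(2 log c)} = √e; this continuity is what the case distinction exploits.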
The investigation of these was initiated, to the best of our knowledge, by [14]: with the structural complexity bound (3) for function classes induced via the maximum operator, [14] derive a margin bound admitting a quadratic dependence on the number of classes. [12] use these results of [14] to study the generalization performance of MC-SVMs, where the components h_1, . . . , h_c are coupled with a ‖·‖_{2,p}, p ≥ 1 constraint. Due to the use of the suboptimal Eq. (3), [12] obtain a margin bound growing quadratically w.r.t. the number of classes. [18] develop a new multi-class classification algorithm based on a natural notion called the multi-class margin of a kernel. [18] also present a novel multi-class Rademacher complexity margin bound based on Eq. (3), and this bound also depends quadratically on the class size. More recently, [13] give a refined Rademacher complexity bound with a linear dependence on the class size. The key reason for this improvement is the introduction of ρ_{θ,h} := min_{y′∈Y} [h(x, y) − h(x, y′) + θ 1_{y′=y}], which bounds the margin ρ_h from below; since the optimization in ρ_{θ,h} is over the whole set Y rather than the subset Y − {y_i} used in ρ_h, one need not consider the random realization of y_i. We also use this trick in our proof of Theorem 5. However, due to the use of the suboptimal structural result (3), [13] fail to improve this linear dependence to the logarithmic dependence we achieve in Corollary 8.

3 Algorithms

Motivated by the generalization analysis given in Section 2, we now present a new multi-class learning algorithm based on performing empirical risk minimization in the hypothesis space (5). This corresponds to the following ℓp-norm MC-SVM (1 ≤ p ≤ 2):

Problem 10 (Primal problem: ℓp-norm MC-SVM).

    min_w  (1/2) [ Σ_{j=1}^c ‖w_j‖_2^p ]^{2/p} + C Σ_{i=1}^n ℓ(t_i)   (P)
    s.t.   t_i = ⟨w_{y_i}, φ(x_i)⟩ − max_{y≠y_i} ⟨w_y, φ(x_i)⟩.

For p = 2 we recover the seminal multi-class algorithm of Crammer & Singer [20] (CS), which is thus a special case of the proposed formulation. An advantage of the proposed approach over [20] is that, as shown in Corollary 8, the dependence of the generalization performance on the class size becomes milder as p decreases to 1.

3.1 Dual problems

Since the optimization problem (P) is convex, we can derive the associated dual problem for the construction of efficient optimization algorithms. The derivation of the following dual problem is deferred to Supplementary Material C. For a matrix α ∈ R^{n×c}, we denote by α_i the i-th row. Denote by e_j the j-th unit vector in R^c and by 1 the vector in R^c with all components equal to one.

Problem 11 (Completely dualized problem for general loss). The Lagrangian dual of Problem 10 is:

    sup_{α∈R^{n×c}}  −(1/2) [ Σ_{j=1}^c ‖ Σ_{i=1}^n α_{ij} φ(x_i) ‖_2^{p/(p−1)} ]^{2(p−1)/p} − C Σ_{i=1}^n ℓ*(−α_{i y_i}/C)   (D)
    s.t.  α_{ij} ≤ 0 ∧ α_i · 1 = 0,  ∀ j ≠ y_i, i = 1, . . . , n.

Theorem 12 (Representer theorem). For any dual variable α ∈ R^{n×c}, the associated primal variable w = (w_1, . . . , w_c) minimizing the Lagrangian saddle problem can be represented by:

    w_j = [ Σ_{j̃=1}^c ‖ Σ_{i=1}^n α_{i j̃} φ(x_i) ‖_2^{p*} ]^{2/p* − 1} ‖ Σ_{i=1}^n α_{ij} φ(x_i) ‖_2^{p*−2} [ Σ_{i=1}^n α_{ij} φ(x_i) ].

For the hinge loss ℓ_h(t) = (1 − t)_+, the Fenchel–Legendre conjugate is ℓ_h*(t) = t if −1 ≤ t ≤ 0 and ∞ otherwise. Hence ℓ_h*(−α_{i y_i}/C) = −α_{i y_i}/C if −1 ≤ −α_{i y_i}/C ≤ 0 and ∞ otherwise. We thus have the following dual problem for the hinge loss:

Problem 13 (Completely dualized problem for the hinge loss (ℓp-norm MC-SVM)).

    sup_{α∈R^{n×c}}  −(1/2) [ Σ_{j=1}^c ‖ Σ_{i=1}^n α_{ij} φ(x_i) ‖_2^{p/(p−1)} ]^{2(p−1)/p} + Σ_{i=1}^n α_{i y_i}   (7)
    s.t.  α_i ≤ e_{y_i} · C ∧ α_i · 1 = 0,  ∀ i = 1, . . . , n.

3.2 Optimization Algorithms

The dual problems (D) and (7) are not quadratic programs for p ≠ 2, and thus generally not easy to solve. To circumvent this difficulty, we rewrite Problem 10 as the following equivalent problem:

    min_{w,β}  Σ_{j=1}^c ‖w_j‖_2² / (2β_j) + C Σ_{i=1}^n ℓ(t_i)   (8)
    s.t.  t_i ≤ ⟨w_{y_i}, φ(x_i)⟩ − ⟨w_y, φ(x_i)⟩,  y ≠ y_i, i = 1, . . . , n,
          ‖β‖_{p̄} ≤ 1,  p̄ = p(2 − p)^{−1},  β_j ≥ 0.

The class weights β_1, . . . , β_c in Eq. (8) play a role similar to that of the kernel weights in ℓp-norm MKL algorithms [19]. The equivalence between problem (P) and Eq. (8) follows directly from Lemma 26 in [24], which shows that the optimal β = (β_1, . . . , β_c) in Eq. (8) can be represented explicitly in closed form. Motivated by the recent work on ℓp-norm MKL, we propose to solve problem (8) by alternately optimizing w and β. As we will show, for temporarily fixed β, the optimization of w reduces to a standard multi-class classification problem. Furthermore, the update of β for fixed w can be carried out via an analytic formula.

Problem 14 (Partially dualized problem for a general loss). For fixed β, the partial dual problem of the sub-optimization problem (8) w.r.t. w is

    sup_{α∈R^{n×c}}  −(1/2) Σ_{j=1}^c β_j ‖ Σ_{i=1}^n α_{ij} φ(x_i) ‖_2² − C Σ_{i=1}^n ℓ*(−α_{i y_i}/C)   (9)
    s.t.  α_{ij} ≤ 0 ∧ α_i · 1 = 0,  ∀ j ≠ y_i, i = 1, . . . , n.

The primal variable w minimizing the associated Lagrangian saddle problem is

    w_j = β_j Σ_{i=1}^n α_{ij} φ(x_i).   (10)

We defer the proof to Supplementary Material C. Analogous to Problem 13, we have the following partial dual problem for the hinge loss.

Problem 15 (Partially dualized problem for the hinge loss (ℓp-norm MC-SVM)).

    sup_{α∈R^{n×c}}  f(α) := −(1/2) Σ_{j=1}^c β_j ‖ Σ_{i=1}^n α_{ij} φ(x_i) ‖_2² + Σ_{i=1}^n α_{i y_i}   (11)
    s.t.  α_i ≤ e_{y_i} · C ∧ α_i · 1 = 0,  ∀ i = 1, . . . , n.

Problems 14 and 15 are quadratic, so we can use the dual coordinate ascent algorithm [25] to solve them very efficiently in the case of linear kernels. 
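For linear kernels (φ(x) = x), the objective of Problem 15 and the representer (10) can be evaluated directly. The sketch below (our illustration; function and variable names are ours, and the tiny data and uniform class weights are only for demonstration) computes f(α) and the primal variables w_j for a feasible dual point:

```python
import numpy as np

def partial_dual_objective(alpha, beta, X, y, C):
    """Evaluate the hinge-loss partial dual for a linear kernel:

        f(alpha) = -1/2 sum_j beta_j || sum_i alpha_ij x_i ||^2 + sum_i alpha_{i,y_i}

    together with the representer w_j = beta_j * sum_i alpha_ij x_i.
    Feasibility requires alpha_i <= C * e_{y_i} componentwise and alpha_i . 1 = 0.
    """
    n, c = alpha.shape
    V = alpha.T @ X               # V[j] = sum_i alpha_ij x_i, shape (c, d)
    W = beta[:, None] * V         # representer: w_j = beta_j * V[j]
    f = -0.5 * np.sum(beta * np.sum(V * V, axis=1)) + alpha[np.arange(n), y].sum()
    return f, W

# Tiny feasible example: n = 2 examples, c = 3 classes, d = 2 features.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([0, 1])
C = 1.0
alpha = np.array([[0.5, -0.25, -0.25],     # each row sums to zero,
                  [-0.25, 0.5, -0.25]])    # alpha_{i,y_i} <= C, others <= 0
beta = np.full(3, 1.0 / 3.0)               # illustrative uniform class weights
f, W = partial_dual_objective(alpha, beta, X, y, C)
```

Coordinate ascent then repeatedly improves f by changing one row α_i at a time under the same row constraints.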
To this end, we need to compute the gradient and solve the restricted problem of optimizing a single α_i for each i, keeping all other dual variables fixed [25]. The gradient of f can be expressed exactly in terms of w:

    ∂f/∂α_{ij} = −β_j Σ_{ĩ=1}^n α_{ĩj} k(x_i, x_ĩ) + 1_{y_i=j} = 1_{y_i=j} − ⟨w_j, φ(x_i)⟩.   (12)

If δα_i denotes the additive change to be applied to the current α_i, then from (12) we have

    f(α_1, . . . , α_{i−1}, α_i + δα_i, α_{i+1}, . . . , α_n) = −(1/2) Σ_{j=1}^c β_j k(x_i, x_i) [δα_{ij}]² + Σ_{j=1}^c (∂f/∂α_{ij}) δα_{ij} + const.

Therefore, the sub-problem of optimizing δα_i is given by

    max_{δα_i}  −(1/2) Σ_{j=1}^c β_j k(x_i, x_i) [δα_{ij}]² + Σ_{j=1}^c (∂f/∂α_{ij}) δα_{ij}   (13)
    s.t.  δα_i ≤ e_{y_i} · C − α_i ∧ δα_i · 1 = 0.

We now consider the sub-problem of updating the class weights β for temporarily fixed w, for which we have the following analytic solution. The proof is deferred to Supplementary Material C.1.

Proposition 16 (Solving the sub-problem with respect to the class weights). For fixed w_j, the minimal β_j optimizing problem (8) is attained at

    β_j = ‖w_j‖_2^{2−p} ( Σ_{j̃=1}^c ‖w_{j̃}‖_2^p )^{(p−2)/p}.   (14)

The update of β_j based on Eq. (14) requires calculating ‖w_j‖_2², which can easily be done by recalling the representation established in Eq. (10).

The resulting training algorithm for the proposed ℓp-norm MC-SVM is given in Algorithm 1. The algorithm alternates between solving a MC-SVM problem for fixed class weights (Line 3) and updating the class weights in a closed-form manner (Line 5). Recall that Problem 11 establishes a completely dualized problem, which can be used as a sound stopping criterion for Algorithm 1.

Algorithm 1: Training algorithm for the ℓp-norm MC-SVM.
    input: examples {(x_i, y_i)}_{i=1}^n and the kernel k.
    initialize β_j = (1/c)^{1/p̄}, w_j = 0 for all j = 1, . . . , c
    while optimality conditions are not satisfied do
        optimize the multi-class classification problem (9)
        compute ‖w_j‖_2² for all j = 1, . . . , c, according to Eq. (10)
        update β_j for all j = 1, . . . , c, according to Eq. (14)
    end

4 Empirical Analysis

We implemented the proposed ℓp-norm MC-SVM (Algorithm 1) in C++ and solved the involved MC-SVM problem using dual coordinate ascent [25]. We experiment on six benchmark datasets: the Sector dataset studied in [26], the News 20 dataset collected by [27], the Rcv1 dataset collected by [28], Birds 15 and Birds 50 as parts of [29], and Caltech 256, collected by Griffin et al. We used fc6 features from the BVLC reference caffenet of [30]. Table 1 summarizes these datasets.

We compare with the classical CS method of [20], which constitutes a strong baseline on these datasets [25]. We employ 5-fold cross-validation on the training set to tune the regularization parameter C by grid search over the set {2^{−12}, 2^{−11}, . . . , 2^{12}} and p over 10 equidistant points from 1.1 to 2. We repeat the experiments 10 times and report in Table 2 the average accuracies and standard deviations attained on the test set.

Dataset      | No. of Classes | No. of Training Examples | No. of Test Examples | No. of Attributes
Sector       | 105            | 6,412                    | 3,207                | 55,197
News 20      | 20             | 15,935                   | 3,993                | 62,060
Rcv1         | 53             | 15,564                   | 518,571              | 47,236
Birds 15     | 200            | 3,000                    | 8,788                | 4,096
Birds 50     | 200            | 9,958                    | 1,830                | 4,096
Caltech 256  | 256            | 12,800                   | 16,980               | 4,096

Table 1: Description of the datasets used in the experiments.

Method / Dataset  | Sector      | News 20     | Rcv1        | Birds 15   | Birds 50   | Caltech 256
ℓp-norm MC-SVM    | 94.20±0.34  | 86.19±0.12  | 85.74±0.71  | 13.73±1.4  | 27.86±0.2  | 56.00±1.2
Crammer & Singer  | 93.89±0.27  | 85.12±0.29  | 85.21±0.32  | 12.53±1.6  | 26.28±0.3  | 54.96±1.1

Table 2: Accuracies achieved by CS and the proposed ℓp-norm MC-SVM on the benchmark datasets.

We observe that the proposed ℓp-norm MC-SVM consistently outperforms CS [20] on all considered datasets. Specifically, our method attains accuracy gains of 0.31% on Sector, 1.07% on News 20, 0.53% on Rcv1, 1.2% on Birds 15, 1.58% on Birds 50, and 1.04% on Caltech 256. A Wilcoxon signed rank test between the accuracies of CS and our method on the benchmark datasets yields a p-value of 0.03, so our method is significantly better than CS at the significance level of 0.05. These promising results indicate that the proposed ℓp-norm MC-SVM could further lift the state of the art in multi-class classification, also in real-world applications beyond the ones studied in this paper.

5 Conclusion

Motivated by the ever-growing size of multi-class datasets in real-world applications such as image annotation and web advertising, which involve tens or hundreds of thousands of classes, we studied the influence of the class size on the generalization behavior of multi-class classifiers. We focus here on data-dependent generalization bounds, which have the ability to capture the properties of the distribution that has generated the data. 
Of independent interest, for hypothesis classes that are given as a maximum over base classes, we developed a new structural result on Gaussian complexities that is able to preserve the coupling among different components, while the existing structural results ignore this coupling and may yield suboptimal generalization bounds. We applied the new structural result to study learning rates for multi-class classifiers, and derived, for the first time, a data-dependent bound with a logarithmic dependence on the class size, which substantially outperforms the linear dependence in the state-of-the-art data-dependent generalization bounds.

Motivated by the theoretical analysis, we proposed a novel ℓp-norm MC-SVM, where the parameter p controls the complexity of the corresponding bounds. This class of algorithms contains the classical CS [20] as a special case for p = 2. We developed an effective optimization algorithm based on the Fenchel dual representation. For several standard benchmarks taken from various domains, the proposed approach surpassed the state-of-the-art method of CS [20] by up to 1.5%.

A future direction will be to derive a data-dependent bound that is completely independent of the class size (even overcoming the mild logarithmic dependence here). To this end, we will study more powerful structural results than Lemma 4 for controlling complexities of function classes induced via the maximum operator. As a good starting point, we will consider ℓ∞-norm covering numbers.

Acknowledgments

We thank Mehryar Mohri for helpful discussions. This work was partly funded by the German Research Foundation (DFG) award KL 2698/2-1.

References

[1] T. Zhang, “Class-size independent generalization analysis of some discriminative multi-category classification,” in Advances in Neural Information Processing Systems, pp. 1625–1632, 2004.

[2] T. Hofmann, L. Cai, and M.
Ciaramita, “Learning with taxonomies: Classifying documents and words,” in NIPS workshop on syntax, semantics, and statistics, 2003.

[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255, IEEE, 2009.

[4] A. Beygelzimer, J. Langford, Y. Lifshits, G. Sorkin, and A. Strehl, “Conditional probability tree estimation analysis and algorithms,” in Proceedings of UAI, pp. 51–58, AUAI Press, 2009.

[5] S. Bengio, J. Weston, and D. Grangier, “Label embedding trees for large multi-class tasks,” in Advances in Neural Information Processing Systems, pp. 163–171, 2010.

[6] P. Jain and A. Kapoor, “Active learning for large multi-class problems,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 762–769, IEEE, 2009.

[7] O. Dekel and O. Shamir, “Multiclass-multilabel classification with more classes than examples,” in International Conference on Artificial Intelligence and Statistics, pp. 137–144, 2010.

[8] M. R. Gupta, S. Bengio, and J. Weston, “Training highly multiclass classifiers,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1461–1492, 2014.

[9] T. Zhang, “Statistical analysis of some multi-category large margin classification methods,” The Journal of Machine Learning Research, vol. 5, pp. 1225–1251, 2004.

[10] A. Tewari and P. L. Bartlett, “On the consistency of multiclass classification methods,” The Journal of Machine Learning Research, vol. 8, pp. 1007–1025, 2007.

[11] T. Glasmachers, “Universal consistency of multi-class support vector classification,” in Advances in Neural Information Processing Systems, pp. 739–747, 2010.

[12] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning. MIT Press, 2012.

[13] V. Kuznetsov, M. Mohri, and U. Syed, “Multi-class deep boosting,” in Advances in Neural Information Processing Systems, pp. 2501–2509, 2014.

[14] V. Koltchinskii and D. Panchenko, “Empirical margin distributions and bounding the generalization error of combined classifiers,” Annals of Statistics, pp. 1–50, 2002.

[15] Y. Guermeur, “Combining discriminant models with new multi-class svms,” Pattern Analysis & Applications, vol. 5, no. 2, pp. 168–179, 2002.

[16] L. Oneto, D. Anguita, A. Ghio, and S. Ridella, “The impact of unlabeled patterns in rademacher complexity theory for kernel classifiers,” in Advances in Neural Information Processing Systems, pp. 585–593, 2011.

[17] V. Koltchinskii and D. Panchenko, “Rademacher processes and bounding the risk of function learning,” in High Dimensional Probability II, pp. 443–457, Springer, 2000.

[18] C. Cortes, M. Mohri, and A. Rostamizadeh, “Multi-class classification with maximum margin multiple kernel,” in ICML-13, pp. 46–54, 2013.

[19] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien, “Lp-norm multiple kernel learning,” The Journal of Machine Learning Research, vol. 12, pp. 953–997, 2011.

[20] K. Crammer and Y. Singer, “On the algorithmic implementation of multiclass kernel-based vector machines,” The Journal of Machine Learning Research, vol. 2, pp. 265–292, 2002.

[21] P. L. Bartlett and S. Mendelson, “Rademacher and gaussian complexities: Risk bounds and structural results,” J. Mach. Learn. Res., vol. 3, pp. 463–482, 2002.

[22] M. Ledoux and M. Talagrand, Probability in Banach Spaces: Isoperimetry and Processes, vol. 23. Berlin: Springer, 1991.

[23] S. I. Hill and A. Doucet, “A framework for kernel-based multi-category classification,” J. Artif. Intell. Res. (JAIR), vol. 30, pp. 525–564, 2007.

[24] C. A. Micchelli and M. Pontil, “Learning the kernel function via regularization,” Journal of Machine Learning Research, pp. 1099–1125, 2005.

[25] S. S. Keerthi, S. Sundararajan, K.-W. Chang, C.-J. Hsieh, and C.-J. Lin, “A sequential dual method for large scale multi-class linear svms,” in 14th ACM SIGKDD, pp. 408–416, ACM, 2008.

[26] J. D. Rennie and R. Rifkin, “Improving multiclass text classification with the support vector machine,” tech. rep., AIM-2001-026, MIT, 2001.

[27] K. Lang, “Newsweeder: Learning to filter netnews,” in Proceedings of the 12th International Conference on Machine Learning, pp. 331–339, 1995.

[28] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, “Rcv1: A new benchmark collection for text categorization research,” The Journal of Machine Learning Research, vol. 5, pp. 361–397, 2004.

[29] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona, “Caltech-UCSD Birds 200,” Tech. Rep. CNS-TR-2010-001, California Institute of Technology, 2010.

[30] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.