{"title": "Multiclass Learning with Simplex Coding", "book": "Advances in Neural Information Processing Systems", "page_first": 2789, "page_last": 2797, "abstract": "In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, that allows us to generalize to multiple classes a relaxation approach commonly used in binary classification. In this framework a relaxation error analysis can be developed avoiding constraints on the considered hypotheses class. Moreover, we show that in this setting it is possible to derive the first provably consistent regularized methods with training/tuning complexity which is {\\em independent} of the number of classes. Tools from convex analysis are introduced that can be used beyond the scope of this paper.", "full_text": "Multiclass Learning with Simplex Coding\n\nYoussef Mroueh\u266f,\u2021, Tomaso Poggio\u266f,\u2021, Lorenzo Rosasco\u266f,\u2021 Jean-Jacques E. Slotine\u2020\n\n\u266f - CBCL, McGovern Institute, MIT;\u2020 -LCSL, MIT- IIT; \u2020 - ME, BCS, MIT\n\nymroueh, lrosasco,jjs@mit.edu tp@ai.mit.edu\n\nAbstract\n\nIn this paper we discuss a novel framework for multiclass learning, defined by\na suitable coding/decoding strategy, namely the simplex coding, that allows us\nto generalize to multiple classes a relaxation approach commonly used in binary\nclassification. In this framework, we develop a relaxation error analysis that avoids\nconstraints on the considered hypotheses class. Moreover, using this setting we\nderive the first provably consistent regularized method with training/tuning\ncomplexity that is independent of the number of classes. We introduce tools from\nconvex analysis that can be used beyond the scope of this paper.\n\n1 Introduction\n\nAs bigger and more complex datasets become available, multiclass learning is becoming increasingly\nimportant in machine learning. 
While theory and algorithms for solving binary classi\ufb01cation problems\nare well established, the problem of multicategory classi\ufb01cation is much less understood. Practical\nmulticlass algorithms often reduce the problem to a collection of binary classi\ufb01cation problems. Bi-\nnary classi\ufb01cation algorithms are often based on a relaxation approach: classi\ufb01cation is posed as a\nnon-convex minimization problem and then relaxed to a convex one, de\ufb01ned by suitable convex loss\nfunctions. In this context, results in statistical learning theory quantify the error incurred by relax-\nation and in particular derive comparison inequalities explicitly relating the excess misclassi\ufb01cation\nrisk to the excess expected loss. We refer to [2, 27, 14, 29] and [18] Chapter 3 for an exhaustive\npresentation as well as generalizations.\nGeneralizing the above approach and results to more than two classes is not straightforward. Over\nthe years, several computational solutions have been proposed (among others, see [10, 6, 5, 25, 1,\n21]). Indeed, most of these methods can be interpreted as a kind of relaxation. Most proposed\nmethods have complexity which is more than linear in the number of classes and simple one-vs\nall in practice offers a good alternative both in terms of performance and speed [15]. Much fewer\nworks have focused on deriving theoretical guarantees. Results in this sense have been pioneered\nby [28, 20], see also [11, 7, 23]. In these works the error due to relaxation is studied asymptotically\nand under constraints on the function class to be considered. More quantitative results in terms of\ncomparison inequalities are given in [4] under similar restrictions (see also [19]). Notably, the above\nresults show that seemingly intuitive extensions of binary classi\ufb01cation algorithms might lead to\nmethods which are not consistent. 
Further, it is interesting to note that the restrictions on the function\nclass, needed to prove the theoretical guarantees, make the computations in the corresponding\nalgorithms more involved and are in fact often ignored in practice.\nIn this paper we discuss a novel framework for multiclass learning, defined by a suitable\ncoding/decoding strategy, namely the simplex coding, in which a relaxation error analysis can be\ndeveloped avoiding constraints on the considered hypotheses class. Moreover, we show that in this\nframework it is possible to derive the first provably consistent regularized method with training/tuning\ncomplexity that is independent of the number of classes. Interestingly, using the simplex coding,\nwe can naturally generalize results, proof techniques and methods from the binary case, which is\nrecovered as a special case of our theory. Due to space restrictions, in this paper we focus on\nextensions of the least squares and SVM loss functions, but our analysis can be generalized to a large class\nof simplex loss functions, including extensions of the logistic and exponential loss functions (used\nin boosting). Tools from convex analysis are developed in the supplementary material and can be\nuseful beyond the scope of this paper, in particular in structured prediction.\nThe rest of the paper is organized as follows. In Section 2 we discuss the problem statement and\nbackground. In Section 3 we discuss the simplex coding framework, which we analyze in Section\n4. Algorithmic aspects and numerical experiments are discussed in Section 5 and Section 6,\nrespectively. Proofs and supplementary technical results are given in the appendices.\n\n2 Problem Statement and Previous Work\nLet (X, Y ) be two random variables with values in two measurable spaces X and Y = {1 . . . T},\nT \u2265 2. Denote by \u03c1X , the law of X on X , and by \u03c1j(x), the conditional probabilities for j \u2208 Y. 
The\ndata is a sample S = (x_i, y_i)_{i=1}^n, from n identical and independent copies of (X, Y ). We can think of\nX as a set of possible inputs and of Y as a set of labels describing a set of semantic categories/classes\nthe input can belong to. A classification rule is a map b : X \u2192 Y, and its error is measured by the\nmisclassification risk R(b) = P(b(X) \u2260 Y ) = E(1I[b(x)\u2260y](X, Y )). The optimal classification rule\nthat minimizes R is the Bayes rule b\u03c1(x) = arg max_{y\u2208Y} \u03c1y(x), x \u2208 X . Computing the Bayes rule\nby directly minimizing the risk R is not possible since the probability distribution is unknown. One\nmight think of minimizing the empirical risk (ERM) RS(b) = (1/n) \u2211_{i=1}^n 1I[b(x)\u2260y](x_i, y_i), which is an\nunbiased estimator of R, but the corresponding optimization problem is in general not feasible.\nIn binary classification, one of the most common ways to obtain computationally efficient methods\nis based on a relaxation approach. We recall this approach in the next section and describe its\nextension to multiclass in the rest of the paper.\nRelaxation Approach to Binary Classification. If T = 2, we can set Y = {\u00b11}. Most modern\nmachine learning algorithms for binary classification consider a convex relaxation of the ERM\nfunctional RS. More precisely: 1) the indicator function in RS is replaced by a non-negative loss\nV : Y \u00d7 R \u2192 R+ which is convex in the second argument and is sometimes called a surrogate loss;\n2) the classification rule b is replaced by a real valued measurable function f : X \u2192 R. A classification\nrule is then obtained by considering the sign of f. It often suffices to consider a special class\nof loss functions, namely large margin loss functions V : R \u2192 R+ of the form V (\u2212yf (x)). 
This\nlast expression is suggested by the observation that the misclassification risk, using the labels \u00b11,\ncan be written as R(f ) = E(\u0398(\u2212Y f (X))), where \u0398 is the Heaviside step function. The quantity\nm = \u2212yf (x), sometimes called the margin, is a natural point-wise measure of the classification\nerror. Among other examples of large margin loss functions (such as the logistic and exponential\nloss), we recall the hinge loss V (m) = |1 + m|+ = max{1 + m, 0} used in the support vector\nmachine, and the square loss V (m) = (1 + m)^2 used in regularized least squares (note that\n(1 \u2212 yf (x))^2 = (y \u2212 f (x))^2 for y = \u00b11). Using large margin loss functions it is possible to design effective\nlearning algorithms replacing the empirical risk with regularized empirical risk minimization\n\nE^\u03bb_S(f ) = (1/n) \u2211_{i=1}^n V (y_i, f (x_i)) + \u03bbR(f ),    (1)\n\nwhere R is a suitable regularization functional and \u03bb is the regularization parameter (see Section\n5).\n\n2.1 Relaxation Error Analysis\n\nAs we replace the misclassification loss with a convex surrogate loss, we are effectively changing\nthe problem: the misclassification risk is replaced by the expected loss, E(f ) = E(V (\u2212Y f (X))).\nThe expected loss can be seen as a functional on a large space of functions F = F_{V,\u03c1}, which depends\non V and \u03c1. Its minimizer, denoted by f\u03c1, replaces the Bayes rule as the target of our algorithm.\nThe question arises of the price we pay by considering a relaxation approach: \u201cWhat is the\nrelationship between f\u03c1 and b\u03c1?\u201d More generally, \u201cWhat is the price we incur by estimating the\nexpected risk rather than the misclassification risk?\u201d The relaxation error for a given loss function\ncan be quantified by the following two requirements:\n1) Fisher Consistency. 
A loss function is Fisher consistent if sign(f\u03c1(x)) = b\u03c1(x) almost surely\n(this property is related to the notion of classification-calibration [2]).\n2) Comparison inequalities. The excess misclassification risk and the excess expected loss are\nrelated by a comparison inequality\n\nR(sign(f )) \u2212 R(b\u03c1) \u2264 \u03c8(E(f ) \u2212 E(f\u03c1)),\n\nfor any function f \u2208 F, where \u03c8 = \u03c8_{V,\u03c1} is a suitable function that depends on V , and possibly\non the data distribution. In particular \u03c8 should be such that \u03c8(s) \u2192 0 as s \u2192 0, so that if fn\nis a (possibly random) sequence of functions, such that E(fn) \u2192 E(f\u03c1) (possibly in probability),\nthen the corresponding sequence of classification rules cn = sign(fn) is Bayes consistent, i.e.\nR(cn) \u2192 R(b\u03c1) (possibly in probability). If \u03c8 is explicitly known, then bounds on the excess\nexpected loss yield bounds on the excess misclassification risk.\nThe relaxation error in the binary case has been thoroughly studied in [2, 14]. In particular, Theorem\n2 in [2] shows that if a large margin surrogate loss is convex, differentiable and decreasing in a\nneighborhood of 0, then the loss is Fisher consistent. Moreover, in this case it is possible to give\nan explicit expression for the function \u03c8. In particular, for the hinge loss the target function is\nexactly the Bayes rule and \u03c8(t) = |t|. For least squares, f\u03c1(x) = 2\u03c11(x) \u2212 1, and \u03c8(t) = \u221at.\nThe comparison inequality for the square loss can be improved for a suitable class of probability\ndistributions satisfying the so-called Tsybakov noise condition [22], \u03c1X ({x \u2208 X , |f\u03c1(x)| \u2264 s}) \u2264\nB_q s^q, s \u2208 [0, 1], q > 0. Under this condition the probability of points such that \u03c1y(x) \u223c 1/2 decreases\npolynomially. 
In this case the comparison inequality for the square loss is given by \u03c8(t) = c_q t^{(q+1)/(q+2)},\nsee [2, 27].\nPrevious Works in Multiclass Classification. From a practical perspective, over the years, several\ncomputational solutions to multiclass learning have been proposed. Among others, we mention\nfor example [10, 6, 5, 25, 1, 21]. Indeed, most of the above methods can be interpreted as a kind\nof relaxation of the original multiclass problem. Interestingly, the study in [15] suggests that the\nsimple one-vs-all scheme should be a practical benchmark for multiclass algorithms, as it seems\nexperimentally to achieve performance that is similar to or better than that of more sophisticated methods.\nAs we previously mentioned, from a theoretical perspective a general account of a large class of\nmulticlass methods has been given in [20], building on results in [2] and [28]. Notably, these results\nshow that seemingly intuitive extensions of binary classification algorithms can lead to inconsistent\nmethods. These results, see also [11, 23], are developed in a setting where a classification rule\nis found by applying a suitable prediction/decoding map to a function f : X \u2192 R^T where f is\nfound considering a loss function V : Y \u00d7 R^T \u2192 R+. The considered functions have to satisfy\nthe constraint \u2211_{y\u2208Y} f^y(x) = 0, for all x \u2208 X . The latter requirement is problematic as it makes\nthe computations in the corresponding algorithms more involved. It is in fact often ignored, so that\npractical algorithms often come with no consistency guarantees. In all the above papers relaxation\nis studied in terms of Fisher and Bayes consistency and the explicit form of the function \u03c8 is not\ngiven. 
More quantitative results in terms of explicit comparison inequalities are given in [4] (see\nalso [19]), but they also need to impose the \u201csum to zero\u201d constraint on the considered function class.\n\n3 A Relaxation Approach to Multicategory Classification\n\nIn this section we propose a natural extension of the relaxation approach that avoids constraining\nthe class of functions to be considered, and allows us to derive explicit comparison inequalities. See\nRemark 1 for related approaches.\n\nFigure 1: Decoding with simplex coding T = 3.\n\nSimplex Coding. We start by considering a suitable coding/decoding strategy. A coding map turns\na label y \u2208 Y into a code vector. The corresponding decoding map, given a vector, returns a label in\nY. Note that this is what we implicitly did while treating binary classification: we encoded the label\nspace Y = {1, 2} using the coding \u00b11, so that the natural decoding strategy is simply sign(f (x)).\nThe coding/decoding strategy we study here is described by the following definition.\nDefinition 1 (Simplex Coding). The simplex coding is a map C : Y \u2192 R^{T\u22121}, C(y) = c_y,\nwhere the code vectors C = {c_y | y \u2208 Y} \u2282 R^{T\u22121} satisfy: 1) \u2016c_y\u2016^2 = 1, \u2200y \u2208 Y, 2) \u27e8c_y, c_{y\u2032}\u27e9 =\n\u22121/(T\u22121), for y \u2260 y\u2032 with y, y\u2032 \u2208 Y, and 3) \u2211_{y\u2208Y} c_y = 0. The corresponding decoding is the map\nD : R^{T\u22121} \u2192 {1, . . . , T}, D(\u03b1) = arg max_{y\u2208Y} \u27e8\u03b1, c_y\u27e9, \u2200\u03b1 \u2208 R^{T\u22121}.\n\nThe simplex coding has been considered in [8], [26], and [16]. 
It corresponds to T maximally separated\nvectors on the hypersphere S^{T\u22122} in R^{T\u22121}, that are the vertices of a simplex (see Figure 1).\nFor binary classification it reduces to the \u00b11 coding and the decoding map is equivalent to taking\nthe sign of f. The decoding map has a natural geometric interpretation: an input point is mapped\nto a vector f (x) by a function f : X \u2192 R^{T\u22121}, and hence assigned to the class having closest code\nvector (for y, y\u2032 \u2208 Y and \u03b1 \u2208 R^{T\u22121}, we have \u2016c_y \u2212 \u03b1\u2016^2 \u2264 \u2016c_{y\u2032} \u2212 \u03b1\u2016^2 \u21d4 \u27e8c_{y\u2032}, \u03b1\u27e9 \u2264 \u27e8c_y, \u03b1\u27e9, since all code vectors have unit norm).\nRelaxation for Multiclass Learning. We use the simplex coding to propose an extension of binary\nclassification. Following the binary case, the relaxation can be described in two steps:\n\n1. using the simplex coding, the indicator function is upper bounded by a non-negative loss\nfunction V : Y \u00d7 R^{T\u22121} \u2192 R+, such that 1I[b(x)\u2260y](x, y) \u2264 V (y, C(b(x))), for all b : X \u2192\nY, and x \u2208 X , y \u2208 Y,\n\n2. rather than C \u25e6 b we consider functions f : X \u2192 R^{T\u22121}, so that\n\nV (y, C(b(x))) \u2264 V (y, f (x)), for all b : X \u2192 Y, f : X \u2192 R^{T\u22121} and x \u2208 X , y \u2208 Y.\n\nIn the next section we discuss several loss functions satisfying the above conditions and we study in\nparticular the extension of the least squares and SVM loss functions.\nMulticlass Simplex Loss Functions. Several loss functions for binary classification can be naturally\nextended to multiple classes using the simplex coding. Due to space restrictions, in this paper\nwe focus on extensions of the least squares and SVM loss functions, but our analysis can be generalized\nto a large class of loss functions, including extensions of logistic and exponential loss functions\n(used in boosting). 
The Simplex Least Square loss (S-LS) is given by V (y, f (x)) = \u2016c_y \u2212 f (x)\u2016^2,\nand reduces to the usual least square approach to binary classification for T = 2. One natural\nextension of the SVM\u2019s hinge loss in this setting would be to consider the Simplex Half space\nSVM loss (SH-SVM) V (y, f (x)) = |1 \u2212 \u27e8c_y, f (x)\u27e9|+. We will see in the following that while\nthis loss function would induce efficient algorithms, in general it is not Fisher consistent unless further\nconstraints are assumed. These latter constraints would considerably slow down the computations.\nWe then consider a second loss function, Simplex Cone SVM (SC-SVM), which is defined as\nV (y, f (x)) = \u2211_{y\u2032\u2260y} |1/(T\u22121) + \u27e8c_{y\u2032}, f (x)\u27e9|+. The latter loss function is related to the one considered\nin the multiclass SVM proposed in [10]. We will see that it is possible to quantify the relaxation\nerror of the loss function without requiring further constraints. Both of the above SVM loss functions\nreduce to the binary SVM hinge loss if T = 2.\n\nRemark 1 (Related approaches). An SVM loss is considered in [8] where V (y, f (x)) =\n\u2211_{y\u2032\u2260y} |\u03b5 \u2212 \u27e8f (x), v_{y\u2032}(y)\u27e9|+ and v_{y\u2032}(y) = (c_y \u2212 c_{y\u2032})/\u2016c_y \u2212 c_{y\u2032}\u2016, with \u03b5 = \u27e8c_y, v_{y\u2032}(y)\u27e9 = (1/\u221a2) \u221a(T/(T\u22121)). 
More\nrecently [26] considered the loss function V (y, f (x)) = |\u2016c_y \u2212 f (x)\u2016 \u2212 \u03b5|+, and a simplex multiclass\nboosting loss was introduced in [16], in our notation V (y, f (x)) = \u2211_{y\u2032\u2260y} e^{\u2212\u27e8c_y \u2212 c_{y\u2032}, f (x)\u27e9}.\nWhile all those losses introduce a certain notion of margin that makes use of the geometry of the\nsimplex coding, it is not clear how to derive explicit comparison theorems; moreover, the\ncomputational complexity of the resulting algorithms scales linearly with the number of classes in the\ncase of the losses considered in [16, 26] and as O((nT )^\u03b3), \u03b3 \u2208 {2, 3}, for the losses considered in [8].\n\nFigure 2: Level sets of the different losses considered for T = 3. A classification is correct if an\ninput (x, y) is mapped to a point f (x) that lies in the neighborhood of the vertex c_y. The shape of\nthe neighborhood is defined by the loss. It takes the form of a cone supported on a vertex, in the\ncase of SC-SVM, a half space delimited by the hyperplane orthogonal to the vertex in the case of\nthe SH-SVM, and a sphere centered on the vertex, in the case of S-LS.\n\n4 Relaxation Error Analysis\nIf we consider the simplex coding, a function f taking values in R^{T\u22121}, and the decoding operator\nD, the misclassification risk can also be written as: R(D(f )) = \u222b_X (1 \u2212 \u03c1_{D(f (x))}(x)) d\u03c1X (x). Then,\nfollowing a relaxation approach, we replace the misclassification loss by the expected risk induced\nby one of the loss functions V defined in the previous section. As in the binary case we consider\nthe expected loss E(f ) = \u222b V (y, f (x)) d\u03c1(x, y). 
Let Lp(X , \u03c1X ) = {f : X \u2192 R^{T\u22121} | \u2016f\u2016^p_\u03c1 =\n\u222b \u2016f (x)\u2016^p d\u03c1X (x) < \u221e}, p \u2265 1.\n\nThe following theorem studies the relaxation error for SH-SVM, SC-SVM, and S-LS loss functions.\nTheorem 1. For SH-SVM, SC-SVM, and S-LS loss functions, there exists a p such that E :\nLp(X , \u03c1X ) \u2192 R+ is convex and continuous. Moreover,\n\n1. The minimizer f\u03c1 of E over F = {f \u2208 Lp(X , \u03c1X ) | f (x) \u2208 K a.s.} exists and D(f\u03c1) = b\u03c1.\n2. For any f \u2208 F, R(D(f )) \u2212 R(D(f\u03c1)) \u2264 C_T (E(f ) \u2212 E(f\u03c1))^\u03b1, where the expressions of\n\np, K, f\u03c1, C_T , and \u03b1 are given in Table 1.\n\nLoss | p | K | f\u03c1 | C_T | \u03b1\nSH-SVM | 1 | conv(C) | c_{b\u03c1} | T \u2212 1 | 1\nSC-SVM | 1 | R^{T\u22121} | c_{b\u03c1} | T \u2212 1 | 1\nS-LS | 2 | R^{T\u22121} | \u2211_{y\u2208Y} \u03c1y c_y | \u221a(2(T\u22121)/T) | 1/2\n\nTable 1: conv(C) is the convex hull of the set C defined in Definition 1.\n\nThe proof of this theorem is given in Appendix B: in Theorems 1 and 2 for S-LS, and in Theorems 3 and 4 for\nSC-SVM and SH-SVM, respectively.\nThe above theorem can be improved for Least Squares under certain classes of distributions. Toward\nthis end we introduce the following notion of misclassification noise that generalizes Tsybakov\u2019s\nnoise condition.\nDefinition 2. Fix q > 0, we say that the distribution \u03c1 satisfies the multiclass noise condition with\nparameter B_q, if\n\n\u03c1X ({x \u2208 X | 0 \u2264 min_{j\u2260D(f\u03c1(x))} ((T \u2212 1)/T ) \u27e8c_{D(f\u03c1(x))} \u2212 c_j, f\u03c1(x)\u27e9 \u2264 s}) \u2264 B_q s^q,    (2)\n\nwhere s \u2208 [0, 1].\n\nIf a distribution \u03c1 is characterized by a very large q, then, for each x \u2208 X , f\u03c1(x) is arbitrarily close\nto one of the coding vectors. 
For T = 2, the above condition reduces to the binary Tsybakov noise condition.\nIndeed, let c1 = 1 and c2 = \u22121; if f\u03c1(x) > 0, then (1/2)(c1 \u2212 c2)f\u03c1(x) = f\u03c1(x), and if f\u03c1(x) < 0,\nthen (1/2)(c2 \u2212 c1)f\u03c1(x) = \u2212f\u03c1(x).\nThe following result improves the exponent of simplex-least square to (q+1)/(q+2) > 1/2:\nTheorem 2. For each f \u2208 L2(X , \u03c1X ), if (2) holds, then for S-LS we have the following inequality,\n\nR(D(f )) \u2212 R(D(f\u03c1)) \u2264 K ((2(T \u2212 1)/T ) (E(f ) \u2212 E(f\u03c1)))^{(q+1)/(q+2)},    (3)\n\nwith K = (2\u221aB_q + 1)^{(2q+2)/(q+2)}.\n\nRemark 2. Note that the comparison inequalities show a tradeoff between the exponent \u03b1 and the\nconstant C(T ) for S-LS and SVM losses. While the constant is of order T for SVM, it is of order 1 for\nS-LS; on the other hand, the exponent is 1 for SVM losses and 1/2 for S-LS. The latter could be enhanced\nto 1 for close to separable classification problems by virtue of the Tsybakov noise condition.\nRemark 3. The comparison inequalities given in Theorems 1 and 2 can be used to derive\ngeneralization bounds on the excess misclassification risk. For least squares, min-max sharp bounds for\nvector valued regression are known [3].\nStandard techniques for deriving sample complexity bounds in binary classification, extended to\nmulti-class SVM losses, are found in [7] and could be adapted to our setting. The obtained bounds\nare not known to be tight. Better bounds, akin to those in [18], will be the subject of future work.\n\n5 Computational Aspects and Regularization Algorithms\n\nThe simplex coding framework allows us to extend batch and online kernel methods to the\nmulticlass setting.\nComputing the Simplex Coding. 
We begin by noting that the simplex coding can be easily computed\nvia the recursion: C[i + 1] = [1, u\u22a4; v, C[i] \u00d7 \u221a(1 \u2212 1/i^2)], C[2] = [1 \u22121], where the semicolon separates the two block rows, u = (\u22121/i, . . . , \u22121/i)\n(column vector in R^i) and v = (0, . . . , 0) (column vector in R^{i\u22121}) (see Algorithm C.1). Indeed we\nhave the following result (see Appendix C.1 for the proof).\nLemma 1. The T columns of C[T ] are a set of T \u2212 1 dimensional vectors satisfying the properties\nof Definition 1.\nThe above algorithm stems from the observation that the simplex in R^{T\u22121} can be obtained by\nprojecting the simplex in R^T onto the hyperplane orthogonal to the element (1, . . . , 0) of the canonical\nbasis in R^T .\nRegularized Kernel Methods. We consider regularized methods of the form (1), induced by simplex\nloss functions and where the hypothesis space is a vector-valued reproducing kernel Hilbert\nspace H (VV-RKHS) and the regularizer is the corresponding norm \u2016f\u2016^2_H. See Appendix D.2 for a brief\nintroduction to VV-RKHS.\nIn the following, we consider a class of kernels K such that if f minimizes (1) for R(f ) = \u2016f\u2016^2_H\nwe have that f (x) = \u2211_{i=1}^n K(x, x_i)a_i, a_i \u2208 R^{T\u22121} [12], where we note that the coefficients\nare vectors in R^{T\u22121}. In the case that the kernel is induced by a finite dimensional feature map,\nk(x, x\u2032) = \u27e8\u03a6(x), \u03a6(x\u2032)\u27e9, where \u03a6 : X \u2192 R^p, and \u27e8\u00b7,\u00b7\u27e9 is the inner product in R^p, we can\nwrite each function in H as f (x) = W \u03a6(x), where W \u2208 R^{(T\u22121)\u00d7p}.\nIt is known [12] that the representer theorem [9] can be easily extended to a vector valued\nsetting, so that the minimizer of a simplex version of Tikhonov regularization is given by f^\u03bb_S(x) =\n\u2211_{j=1}^n k(x, x_j)a_j, a_j \u2208 R^{T\u22121}, for all x \u2208 X , where the explicit expression of the coefficients\ndepends on the considered loss function. We use the following notation: K \u2208 R^{n\u00d7n}, K_{ij} =\nk(x_i, x_j), \u2200i, j \u2208 {1 . . . n}, A \u2208 R^{n\u00d7(T\u22121)}, A = (a_1, ..., a_n)^\u22a4.\nSimplex Regularized Least squares (S-RLS). S-RLS is obtained by substituting the simplex least\nsquare loss in the Tikhonov functional.\nIt is easy to see [15] that in this case the coefficients\nmust satisfy either (K + \u03bbnI)A = Y\u0302 or (X\u0302^\u22a4 X\u0302 + \u03bbnI)W = X\u0302^\u22a4 Y\u0302 in the linear case, where\nX\u0302 \u2208 R^{n\u00d7p}, X\u0302 = (\u03a6(x_1), ..., \u03a6(x_n))^\u22a4 and Y\u0302 \u2208 R^{n\u00d7(T\u22121)}, Y\u0302 = (c_{y_1}, ..., c_{y_n})^\u22a4.\nInterestingly, the classical results from [24] can be extended to show that the value fSi(x_i), obtained\nby computing the solution fSi after removing the i-th point from the training set (the leave-one-out\nsolution), can be computed in closed form. Let f^\u03bb_loo \u2208 R^{n\u00d7(T\u22121)}, f^\u03bb_loo = (f^\u03bb_{S1}(x_1), . . . , f^\u03bb_{Sn}(x_n)).\nLet K(\u03bb) = (K + \u03bbnI)^{\u22121} and C(\u03bb) = K(\u03bb)Y\u0302. Define M (\u03bb) \u2208 R^{n\u00d7(T\u22121)}, such that:\nM (\u03bb)_{ij} = 1/K(\u03bb)_{ii}, \u2200 j = 1 . . . T \u2212 1. One can show that f^\u03bb_loo = Y\u0302 \u2212 C(\u03bb) \u2299 M (\u03bb), where\n\u2299 is the Hadamard product [15]. Then, the leave-one-out error (1/n) \u2211_{i=1}^n 1I[y\u2260D(fSi(x))](y_i, x_i) can\nbe minimized at essentially no extra cost by precomputing the eigendecomposition of K (or X\u0302^\u22a4 X\u0302).\nSimplex Cone Support Vector Machine (SC-SVM). Using standard reasoning it is easy to show\nthat (see Appendix C.2), for the SC-SVM the coefficients in the representer theorem are given by\na_i = \u2212\u2211_{y\u2260y_i} \u03b1^y_i c_y, i = 1, . . . , n, where \u03b1_i = (\u03b1^y_i)_{y\u2208Y} \u2208 R^T , i = 1, . . . , n, solve the quadratic\nprogramming (QP) problem\n\nmax_{\u03b1_1,...,\u03b1_n\u2208R^T} { \u2212(1/2) \u2211_{y,y\u2032,i,j} \u03b1^y_i K_{ij} G_{yy\u2032} \u03b1^{y\u2032}_j + (1/(T \u2212 1)) \u2211_{i=1}^n \u2211_{y=1}^T \u03b1^y_i }\nsubject to 0 \u2264 \u03b1^y_i \u2264 C0 \u03b4_{y,y_i}, \u2200 i = 1, . . . , n, y \u2208 Y,\n\nwhere G_{y,y\u2032} = \u27e8c_y, c_{y\u2032}\u27e9 \u2200y, y\u2032 \u2208 Y and C0 = 1/(2n\u03bb), \u03b1_i = (\u03b1^y_i)_{y\u2208Y} \u2208 R^T , for i = 1, . . . , n and \u03b4_{i,j}\nis the Kronecker delta.\nSimplex Halfspaces Support Vector Machine (SH-SVM). A similar, yet more complicated\nprocedure, can be derived for the SH-SVM. Here, we omit this derivation and observe instead that\nif we neglect the convex hull constraint from Theorem 1, that requires f (x) \u2208 conv(C) for almost all\nx \u2208 X , then the SH-SVM has an especially simple formulation at the price of losing consistency\nguarantees. In fact, in this case the coefficients are given by a_i = \u03b1_i c_{y_i}, i = 1, . . . , n, where\n\u03b1_i \u2208 R, with i = 1, . . . , n solve the quadratic programming (QP) problem\n\nmax_{\u03b1_1,...,\u03b1_n\u2208R} \u2212(1/2) \u2211_{i,j} \u03b1_i K_{ij} G_{y_i y_j} \u03b1_j + \u2211_{i=1}^n \u03b1_i    (4)\nsubject to 0 \u2264 \u03b1_i \u2264 C0, \u2200 i = 1 . . . n,\n\nwhere C0 = 1/(2n\u03bb). The latter formulation can be solved at the same complexity as the binary SVM\n(worst case O(n^3)) but lacks consistency.\nOnline/Incremental Optimization. The regularized estimators induced by the simplex loss functions\ncan be computed by means of online/incremental first order (sub) gradient methods. 
Indeed,\nwhen considering finite dimensional feature maps, these strategies offer computationally feasible\nsolutions to train estimators for large datasets where neither a p \u00d7 p nor an n \u00d7 n matrix fits in memory.\nFollowing [17] we can alternate a step of stochastic descent on a data point: Wtmp = (1 \u2212 \u03b7_i\u03bb)W_i \u2212\n\u03b7_i\u2202(V (y_i, f_{W_i}(x_i))) and a projection on the Frobenius ball W_i = min(1, 1/(\u221a\u03bb \u2016Wtmp\u2016_F)) Wtmp (see\nAlgorithm C.5 for details). The algorithm depends on the used loss function through computation of\nthe (point-wise) subgradient \u2202(V ). The latter can be easily computed for all the loss functions\npreviously discussed. For the S-LS loss we have \u2202(V (y_i, f_W (x_i))) = 2(W x_i \u2212 c_{y_i})x_i^\u22a4, while for the\nSC-SVM loss we have \u2202(V (y_i, f_W (x_i))) = (\u2211_{k\u2208I_i} c_k)x_i^\u22a4 where I_i = {y \u2260 y_i | \u27e8c_y, W x_i\u27e9 > \u22121/(T\u22121)}.\nFor the SH-SVM loss we have: \u2202(V (y_i, f_W (x_i))) = \u2212c_{y_i}x_i^\u22a4\nif \u27e8c_{y_i}, W x_i\u27e9 < 1 and 0 otherwise.\n\n5.1 Comparison of Computational Complexity\n\nThe cost of solving S-RLS for fixed \u03bb is in the worst case O(n^3) (for example via Cholesky\ndecomposition). If we are interested in computing the regularization path for N regularization parameter\nvalues, then as noted in [15] it might be convenient to perform an eigendecomposition of the kernel\nmatrix rather than solving the systems N times. For explicit p-dimensional feature maps the\ncost is O(np^2), so that the cost of computing the regularization path for simplex RLS algorithm is\nO(min(n^3, np^2)) and hence independent of T . One can contrast this complexity with that of a naive\nOne Versus All (OVA) approach that would lead to a O(N n^3 T ) complexity. 
Simplex SVMs can be\nsolved using solvers available for binary SVMs that are considered to have complexity O(n^\u03b3) with\n\u03b3 \u2208 {2, 3} (the actual complexity scales with the number of support vectors). For SC-SVM, though,\nwe have nT rather than n unknowns and the complexity is O((nT )^\u03b3). SH-SVM, in which we omit\nthe constraint, can be trained with the same complexity as the binary SVM (worst case O(n^3)) but\nlacks consistency. Note that unlike for S-RLS, there is no straightforward way to compute the\nregularization path and the leave-one-out error for any of the above SVMs. The online algorithms induced\nby the different simplex loss functions are essentially the same. In particular, each iteration depends\nlinearly on the number of classes.\n\n6 Numerical Results\n\nWe conduct several experiments to evaluate the performance of our batch and online algorithms,\non 5 UCI datasets as listed in Table 2, as well as on Caltech101 and Pubfig83. We compare the\nperformance of our algorithms to one versus all svm (libsvm), as well as simplex-based boosting\n[16]. For UCI datasets we use the raw features, on Caltech101 we use hierarchical features (hmax),\nand on Pubfig83 we use the feature maps from [13]. In all cases the parameter selection is based\neither on a hold out (ho) (80% training \u2212 20% validation) or a leave-one-out error (loo). For the\nmodel selection of \u03bb in S-LS, 100 values are chosen in the range [\u03bbmin, \u03bbmax] (where \u03bbmin and\n\u03bbmax correspond to the smallest and biggest eigenvalues of K). In the case of a Gaussian kernel\n(rbf) we use a heuristic that sets the width of the Gaussian \u03c3 to the 25-th percentile of pairwise\ndistances between distinct points in the training set. 
In Table 2 we collect the resulting classification\naccuracies.\n\nMethod                  Landsat  Optdigit Pendigit Letter   Isolet   Ctech    Pubfig83\nSC-SVM Online (ho)      65.15%   89.57%   81.62%   52.82%   88.58%   63.33%   84.70%\nSH-SVM Online (ho)      75.43%   85.58%   72.54%   38.40%   77.65%   45%      49.76%\nS-LS Online (ho)        63.62%   91.68%   81.39%   54.29%   92.62%   58.39%   83.61%\nS-LS Batch (loo)        65.88%   91.90%   80.69%   54.96%   92.55%   66.35%   86.63%\nS-LS rbf Batch (loo)    90.15%   97.09%   98.17%   96.48%   97.05%   69.38%   86.75%\nSVM batch ova (ho)      72.81%   92.13%   86.93%   62.78%   90.59%   70.13%   85.97%\nSVM rbf batch ova (ho)  95.33%   98.07%   98.88%   97.12%   96.99%   51.77%   85.60%\nSimplex boosting [16]   86.65%   92.82%   92.94%   59.65%   91.02%   -        -\n\nTable 2: Accuracies of our algorithms on several datasets.\n\nAs suggested by the theory, the consistent methods SC-SVM and S-LS have a large performance\nadvantage over SH-SVM (where we omitted the convex hull constraint). Batch methods are overall\nsuperior to online methods. Online SC-SVM achieves the best results among online methods. More\ngenerally, we see that rbf S-LS has the best performance amongst the simplex methods, including\nsimplex boosting [16]. We see that S-LS rbf achieves essentially the same performance as One\nVersus All SVM-rbf.\n\nReferences\n[1] Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: a\nunifying approach for margin classifiers. Journal of Machine Learning Research, 1:113\u2013141,\n2000.\n\n[2] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk\nbounds. Journal of the American Statistical Association, 101(473):138\u2013156, 2006.\n\n[3] A. Caponnetto and E. De Vito. Optimal rates for regularized least-squares algorithm. Foundations\nof Computational Mathematics, 2006.\n\n[4] D. Chen and T. Sun. Consistency of multiclass empirical risk minimization methods based on\nconvex loss. 
Journal of Machine Learning Research, X, 2006.\n\n[5] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector\nmachines. JMLR, 2001.\n\n[6] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting\noutput codes. Journal of Artificial Intelligence Research, 2:263\u2013286, 1995.\n\n[7] Yann Guermeur. VC theory of large margin multi-category classifiers. Journal of Machine\nLearning Research, 8:2551\u20132594, 2007.\n\n[8] Simon I. Hill and Arnaud Doucet. A framework for kernel-based multi-category classification.\nJ. Artif. Int. Res., 30(1):525\u2013564, December 2007.\n\n[9] G. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation of stochastic\nprocesses and smoothing by splines. Ann. Math. Stat., 41:495\u2013502, 1970.\n\n[10] Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines: Theory and application\nto the classification of microarray data and satellite radiance data. Journal of the American\nStatistical Association, 2004.\n\n[11] Y. Liu. Fisher consistency of multicategory support vector machines. Eleventh International\nConference on Artificial Intelligence and Statistics, pages 289\u2013296, 2007.\n\n[12] C.A. Micchelli and M. Pontil. On learning vector-valued functions. Neural Computation,\n17:177\u2013204, 2005.\n\n[13] N. Pinto, Z. Stone, T. Zickler, and D.D. Cox. Scaling-up biologically-inspired computer vision:\nA case-study on facebook. 2011.\n\n[14] M.D. Reid and R.C. Williamson. Composite binary losses. JMLR, 11, September 2010.\n\n[15] R. Rifkin and A. Klautau. In defense of one versus all classification. Journal of Machine\nLearning Research, 2004.\n\n[16] M. Saberian and N. Vasconcelos. Multiclass boosting: Theory and algorithms. In NIPS 2011,\n2011.\n\n[17] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient\nsolver for SVM. 
In Proceedings of the 24th ICML, ICML \u201907, pages 807\u2013814, New\nYork, NY, USA, 2007. ACM.\n\n[18] I. Steinwart and A. Christmann. Support vector machines. Information Science and Statistics.\nSpringer, New York, 2008.\n\n[19] B. Tarigan and S. van de Geer. A moment bound for multicategory support vector machines. JMLR,\n9:2171\u20132185, 2008.\n\n[20] A. Tewari and P. L. Bartlett. On the consistency of multiclass classification methods. In\nProceedings of the 18th Annual Conference on Learning Theory, volume 3559, pages 143\u2013157.\nSpringer, 2005.\n\n[21] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured\nand interdependent output variables. JMLR, 6(2):1453\u20131484, 2005.\n\n[22] Alexandre B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of\nStatistics, 32:135\u2013166, 2004.\n\n[23] Elodie Vernet, Robert C. Williamson, and Mark D. Reid. Composite multiclass losses. In\nProceedings of Neural Information Processing Systems (NIPS 2011), 2011.\n\n[24] G. Wahba. Spline models for observational data, volume 59 of CBMS-NSF Regional Conference\nSeries in Applied Mathematics. SIAM, Philadelphia, PA, 1990.\n\n[25] J. Weston and C. Watkins. Support vector machine for multi class pattern recognition. Proceedings\nof the Seventh European Symposium on Artificial Neural Networks, 1999.\n\n[26] Tong Tong Wu and Kenneth Lange. Multicategory vertex discriminant analysis for high-dimensional\ndata. Ann. Appl. Stat., 4(4):1698\u20131721, 2010.\n\n[27] Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constructive\nApproximation, 26(2):289\u2013315, 2007.\n\n[28] T. Zhang. Statistical analysis of some multi-category large margin classification methods.\nJournal of Machine Learning Research, 5:1225\u20131251, 2004.\n\n[29] Tong Zhang. 
Statistical behavior and consistency of classification methods based on convex\nrisk minimization. The Annals of Statistics, 32(1):56\u2013134, 2004.\n", "award": [], "sourceid": 1282, "authors": [{"given_name": "Youssef", "family_name": "Mroueh", "institution": null}, {"given_name": "Tomaso", "family_name": "Poggio", "institution": null}, {"given_name": "Lorenzo", "family_name": "Rosasco", "institution": null}, {"given_name": "Jean-jeacques", "family_name": "Slotine", "institution": null}]}