{"title": "Universal Consistency of Multi-Class Support Vector Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 739, "page_last": 747, "abstract": "Steinwart was the first to prove universal consistency of support vector machine classification. His proof analyzed the 'standard' support vector machine classifier, which is restricted to binary classification problems. In contrast, recent analysis has resulted in the common belief that several extensions of SVM classification to more than two classes are inconsistent. Countering this belief, we prove the universal consistency of the multi-class support vector machine by Crammer and Singer. Our proof extends Steinwart's techniques to the multi-class case.", "full_text": "Universal Consistency of Multi-Class Support Vector Classification\n\nTobias Glasmachers\n\nDalle Molle Institute for Artificial Intelligence (IDSIA), 6928 Manno-Lugano, Switzerland\n\ntobias@idsia.ch\n\nAbstract\n\nSteinwart was the first to prove universal consistency of support vector machine classification. His proof analyzed the 'standard' support vector machine classifier, which is restricted to binary classification problems. In contrast, recent analysis has resulted in the common belief that several extensions of SVM classification to more than two classes are inconsistent. Countering this belief, we prove the universal consistency of the multi-class support vector machine by Crammer and Singer. Our proof extends Steinwart's techniques to the multi-class case.\n\n1 Introduction\n\nSupport vector machines (SVMs) as proposed in [1, 8] are powerful classifiers, especially in the binary case of two possible classes. 
They can be extended to multi-class problems, that is, problems involving more than two classes, in multiple ways, all of which reduce to the standard machine in the binary case.\n\nThis is trivially the case for general techniques such as one-versus-one architectures and the one-versus-all approach, which combine a set of binary machines into a multi-class decision maker. At least three different 'true' multi-class SVM extensions have been proposed in the literature: the canonical multi-class machine proposed by Vapnik [8] and independently by Weston and Watkins [9], the variant by Crammer and Singer [2], and a conceptually different extension by Lee, Lin, and Wahba [4].\n\nRecently, consistency of multi-class support vector machines has been investigated based on properties of the loss function Ψ measuring empirical risk in machine training [7]. The analysis is based on the technical property of classification calibration (refer to [7] for details). This work is conceptually related to Fisher consistency, in contrast to universal statistical consistency, see [3, 5]. 
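The one-versus-all reduction mentioned above can be made concrete with a short sketch (illustrative code, not from the paper; the binary scorers and the tie-breaking rule are placeholders chosen to match the decision function defined later in Section 2):

```python
# Sketch of a one-versus-all reduction: q binary real-valued scorers are
# combined into a multi-class decision by picking the class with the largest
# score, breaking ties towards the smallest class index.
def one_versus_all(scorers, x):
    # scorers: list of q functions f_u mapping an input to a real score.
    scores = [f(x) for f in scorers]
    best = max(scores)
    return min(u for u, s in enumerate(scores, start=1) if s == best)

# Example with three hand-picked linear scorers on the real line.
scorers = [lambda x: x, lambda x: -x, lambda x: 0.0]
print(one_versus_all(scorers, 2.0))  # class 1 has the largest score
```

The same combiner works unchanged for one-versus-one voting if the scores are replaced by vote counts.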
Schematically, Theorem 2 by Tewari and Bartlett [7] establishes the relation\n\nSA ⇔ (SB ⇒ SC) ,   (1)\n\nfor the terms\n\nSA: The loss function Ψ is classification calibrated.\nSB: The Ψ-risk of a sequence (f̂n)n∈N of classifiers converges to the minimal possible Ψ-risk: limn→∞ RΨ(f̂n) = R∗Ψ.\nSC: The 0-1-risk of the same sequence (f̂n)n∈N of classifiers converges to the minimal possible 0-1-risk (Bayes risk): limn→∞ R(f̂n) = R∗.\n\nThe classifiers f̂n are assumed to result from structural risk minimization [8], that is, the space Fn for which we obtain f̂n = arg min{RΨ(f) | f ∈ Fn} grows suitably with the size of the training set such that SB holds.\n\nThe confusion around the consistency of multi-class machines arises from mixing the equivalence and the implication in statement (1). Examples 1 and 2 in [7] show that the loss functions Ψ used in the machines by Crammer and Singer [2] and by Weston and Watkins [9] are not classification calibrated, thus SA = false. Then it is deduced that the corresponding machines are not consistent (SC = false), although it can only be deduced that the implication SB ⇒ SC does not hold. This tells us nothing about SC, even if SB can be established by construction.\n\nWe argue that the consistency of a machine is not necessarily determined by properties of its loss function. This is because for SVMs it is necessary to provide a sequence of regularization parameters in order to make the infinite sample limit well-defined. Thus, we generalize Steinwart's universal consistency theorem for binary SVMs (Theorem 2 in [6]) to the multi-class support vector machine [2] proposed by Crammer and Singer:\n\nTheorem 2. 
Let X ⊂ Rd be compact and k : X × X → R be a universal kernel with1 N((X, dk), ε) ∈ O(ε^(−α)) for some α > 0. Suppose that we have a positive sequence (Cℓ)ℓ∈N with ℓ · Cℓ → ∞ and Cℓ ∈ O(ℓ^(β−1)) for some 0 < β < 1/α. Then for all Borel probability measures P on X × Y and all ε > 0 it holds\n\nlimℓ→∞ Pr∗({T ∈ (X × Y)^ℓ | R(fT,k,Cℓ) ≤ R∗ + ε}) = 1 .\n\nThe corresponding notation will be introduced in sections 2 and 3. The theorem not only establishes the universal consistency of the multi-class SVM by Crammer and Singer, it also gives precise conditions on how the complexity control parameter needs to be coupled to the training set size in order to obtain universal consistency. Moreover, the rigorous proof of this statement implies that the common belief in the inconsistency of the popular multi-class SVM by Crammer and Singer is wrong. This important learning machine is indeed universally consistent.\n\n2 Multi-Class Support Vector Classification\n\nA multi-class classification problem is stated by a training dataset T = ((x1, y1), . . . , (xℓ, yℓ)) ∈ (X × Y)^ℓ with label set Y of size |Y| = q < ∞. W.l.o.g., the label space is represented by Y = {1, . . . , q}. In contrast to the conceptually simpler binary case we have q > 2. The training examples are supposed to be drawn i.i.d. from a probability distribution P on X × Y.\n\nLet k : X × X → R be a positive definite (Mercer) kernel function, and let φ : X → H be a corresponding feature map into a feature Hilbert space H such that ⟨φ(x), φ(x′)⟩ = k(x, x′). We call a function f on X induced by k if there exists w ∈ H such that f(x) = ⟨w, φ(x)⟩. 
Let dk(x, x′) := ‖φ(x) − φ(x′)‖H = √(k(x, x) − 2k(x, x′) + k(x′, x′)) denote the metric induced on X by the kernel k.\n\nAnalogous to Steinwart [6] we require that the input space X is a compact subset of Rd, and define the notion of a universal kernel:\n\nDefinition 1. (Definition 2 in [6]) A continuous positive definite kernel function k : X × X → R on a compact subset X ⊂ Rd is called universal if the set of induced functions is dense in the space C0(X) of continuous functions, i.e., for all g ∈ C0(X) and all ε > 0 there exists an induced function f with ‖g − f‖∞ < ε.\n\nIntuitively, this property makes sure that the feature space of a kernel is rich enough to achieve consistency for all possible data generating distributions. For a detailed treatment of universal kernels we refer to [6].\n\nAn SVM classifier for a q-class problem is given in the form of a vector-valued function f : X → Rq with component functions fu : X → R, u ∈ Y (sometimes restricted by the so-called sum-to-zero constraint ∑u∈Y fu = 0). Each of its components takes the form fu(x) = ⟨wu, φ(x)⟩ + bu with wu ∈ H and bu ∈ R. Then we turn f into a classifier by feeding its result into the 'decision' function\n\nκ : Rq → Y ; (v1, . . . , vq)T ↦ min{ arg max{vu | u ∈ Y} } ∈ Y .\n\n1For f, g : R+ → R+ we define f(x) ∈ O(g(x)) iff ∃ c, x0 > 0 such that f(x) ≤ c · g(x) ∀ x > x0.\n\nHere, the arbitrary rule for breaking ties favors the smallest class index.2 We denote the SVM hypothesis by h = κ ◦ f : X → Y.\n\nThe multi-class SVM variant proposed by Crammer and Singer uses functions without offset terms (bu = 0 for all u ∈ Y). For a given training set T = ((x1, y1), . . . 
, (xℓ, yℓ)) ∈ (X × Y)^ℓ this machine defines the function f, determined by (w1, . . . , wq) ∈ H^q, as the solution of the quadratic program\n\nminimize ∑u∈Y ⟨wu, wu⟩ + (C/ℓ) · ∑i=1,...,ℓ ξi\ns.t. ⟨wyi − wu, φ(xi)⟩ ≥ 1 − ξi ∀ i ∈ {1, . . . , ℓ}, u ∈ Y \ {yi} .   (2)\n\nThe slack variables in the optimum can be written as\n\nξi = max v∈Y\{yi} { [1 − (fyi(xi) − fv(xi))]+ } ≥ [1 − δh(xi),yi − fyi(xi) + fh(xi)(xi)]+ ,   (3)\n\nwith the auxiliary function [t]+ := max{0, t}. We denote the function induced by the solution of this problem by f = fT,k,C = (⟨w1, ·⟩, . . . , ⟨wq, ·⟩)T.\n\nLet s(x) := 1 − max{P(y|x) | y ∈ Y} denote the noise level, that is, the probability of error of a Bayes optimal classifier. We denote the Bayes risk by R∗ = ∫X s(x) dPX(x). For a given (measurable) hypothesis h we define its error as Eh(x) := 1 − P(h(x)|x), and its suboptimality w.r.t. Bayes-optimal classification as ηh(x) := Eh(x) − s(x) = max{P(y|x) | y ∈ Y} − P(h(x)|x). We have Eh(x) ≥ s(x) and thus ηh(x) ≥ 0 up to a zero set.\n\n3 The Central Construction\n\nIn this section we introduce a number of definitions and constructions preparing the proofs in the later sections. Most of the differences to the binary case are incorporated into these constructions such that the lemmas and theorems proven later on naturally extend to the multi-class case. Let ∆ := {p ∈ Rq | pu ≥ 0 ∀ u ∈ Y and ∑u∈Y pu = 1} denote the probability simplex over Y. We introduce the covering number of the metric space (X, dk) as\n\nN((X, dk), ε) := min{ n | ∃ {x1, . . . 
, xn} ⊂ X such that X ⊂ ∪i=1,...,n B(xi, ε) } ,\n\nwith B(x, ε) = {x′ ∈ X | dk(x, x′) < ε} being the open ball of radius ε > 0 around x ∈ X.\n\nNext we construct a partition of a large part of the input space X into suitable subsets. In a first step we partition the probability simplex, then we transfer this partition to the input space, and finally we discard small subsets of negligible impact. The resulting partition has a number of properties of importance for the proofs of diverse lemmas in the next section.\n\nWe start by defining τ = ε/(q + 5), where ε is the error bound found in Theorems 1 and 2. Thus, τ is simply a multiple of ε, which we can think of as an arbitrarily small positive number.\n\nWe split the simplex ∆ into a partition of 'classification-aligned' subsets\n\n∆y := κ^(−1)({y}) = { p ∈ ∆ | py > pu for u < y and py ≥ pu for u > y }\n\nfor y ∈ Y, on which the decision function κ decides for class y. We define the grid\n\nΓ̃ = { [n1·τ, (n1 + 1)·τ) × · · · × [nq·τ, (nq + 1)·τ) ⊂ Rq | (n1, . . . , nq)T ∈ Zq }\n\nof half-open cubes. Then we combine both constructions to the partition\n\nΓ := ∪y∈Y { γ̃ ∩ ∆y | γ̃ ∈ Γ̃ and γ̃ ∩ ∆y ≠ ∅ }\n\nof ∆ into classification-aligned subsets of side length upper bounded by τ. We have the trivial upper bound |Γ| ≤ D := q · (1/τ + 1)^q for the size of the partition.\n\n2Note that any other deterministic rule for breaking ties can be realized by permuting the class indices.\n\nThe partition Γ will serve as an index set in a number of cases. 
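The decision regions ∆y and the τ-grid above admit a direct computational sketch (illustrative code, not from the paper; the tie-breaking matches the decision function κ of Section 2, and τ is chosen to avoid floating-point edge cases):

```python
import math

def kappa(p):
    # Decision function: 1-based index of the largest component,
    # ties broken towards the smallest class index.
    best = max(p)
    return min(u for u, pu in enumerate(p, start=1) if pu == best)

def gamma_cell(p, tau):
    # Identify the cell of the partition Gamma containing p: the pair of the
    # decision region Delta_y and the half-open cube [n_u*tau, (n_u+1)*tau).
    return kappa(p), tuple(math.floor(pu / tau) for pu in p)

p = (0.5, 0.3, 0.2)           # a point of the probability simplex, q = 3
print(gamma_cell(p, 0.25))    # -> (1, (2, 1, 0))
```

Every probability vector lands in exactly one such cell, which is what makes Γ a partition of ∆.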
The first one of these is the partition X = ∪γ∈Γ Xγ with Xγ := {x ∈ X | P(y|x) ∈ γ}.\n\nThe compactness of X ensures that the distribution P is regular. Thus, for each γ ∈ Γ there exists a compact subset K̃γ ⊂ Xγ with P(K̃γ) ≥ (1 − τ/2) · P(Xγ). We choose minimal partitions Ãγ of each K̃γ = ∪A∈Ãγ A such that the diameter of each A ∈ Ãγ is bounded by σ = τ/(2√C). All of these sets are summarized in the partition Ã = ∪γ∈Γ Ãγ. Now we drop all A ∈ Ãγ below a certain probability mass, resulting in\n\nAγ := { A ∈ Ãγ | PX(A) ≥ τ/(2M) } ,   (4)\n\nwith M := D · N((X, dk), σ). We summarize these sets in Kγ = ∪A∈Aγ A and A := ∪γ∈Γ Aγ.\n\nThese sets cover nearly all probability mass of PX in the sense\n\nPX(∪γ∈Γ Kγ) = PX(∪A∈A A) ≥ PX(∪A∈Ã A) − τ/2 = PX(∪γ∈Γ K̃γ) − τ/2 ≥ PX(∪γ∈Γ Xγ) − τ/2 − τ/2 = PX(X) − τ = 1 − τ .\n\nThe first estimate makes use of |Ã| ≤ M and condition (4), while the second inequality follows from the definition of K̃γ.\n\nTo simplify notation, we associate a number of quantities with the sets γ ∈ Γ and Xγ. We denote the Bayes-optimal decision by y(Xγ) = y(γ) := κ(p) for any p ∈ γ, and for y ∈ Y we define the lower and upper bounds\n\nLy(Xγ) = Ly(γ) := inf{ py | p ∈ γ } and Uy(Xγ) = Uy(γ) := sup{ py | p ∈ γ }\n\non the corresponding components in the probability simplex. We canonically extend these definitions to the above defined sets Kγ, K̃γ, and A ∈ A, which are all subsets of exactly one of the sets Xγ, by defining y(S) := y(γ) for all non-empty subsets S ⊂ Xγ. The resulting construction has the following properties:\n\n(P1) The decision function κ is constant on each set γ ∈ Γ, and thus h = κ ◦ f is constant on each set Xγ as well as on each of their subsets, most importantly on each A ∈ A.\n(P2) For each y ∈ Y, the side length Uy(γ) − Ly(γ) of each set γ ∈ Γ is upper bounded by τ.\n(P3) It follows from the construction of Γ that for each y ∈ Y and γ ∈ Γ we have either Ly(γ) = 0 or Ly(γ) ≥ τ.\n(P4) The cardinality of the partition Γ is upper bounded by D = q · (1/τ + 1)^q, which depends only on τ and q, but not on T, k, or C.3\n(P5) The cardinality of the partition A is upper bounded by M = D · N((X, dk), τ/(2√C)), which is finite by Lemma 1.\n(P6) The set ∪A∈A A = ∪γ∈Γ Kγ ⊂ X covers a probability mass (w.r.t. PX) of at least 1 − τ.\n(P7) Each A ∈ A covers a probability mass (w.r.t. PX) of at least τ/(2M).\n(P8) Each A ∈ A has diameter less than σ = τ/(2√C), that is, for x, x′ ∈ A we have dk(x, x′) < σ.\n\nWith properties (P2) and (P6) it is straightforward to obtain the inequality\n\n1 − ∑γ∈Γ Ly(γ)(γ) · PX(Xγ) − τ ≤ 1 − ∑γ∈Γ Uy(γ)(γ) · PX(Xγ) ≤ R∗ ≤ 1 − ∑γ∈Γ Ly(γ)(γ) · PX(Xγ) ≤ 1 − ∑γ∈Γ Uy(γ)(γ) · PX(Xγ) + τ   (5)\n\nfor the risk.\n\n3A tight bound would be in O(τ^(1−q)).\n\nNow we are in the position to define the notion of a 'typical' training set. For ℓ ∈ N, u ∈ Y, and A ∈ A, we define\n\nFℓ^{A,u} := { ((x1, y1), . . . , (xℓ, yℓ)) ∈ (X × Y)^ℓ | |{n ∈ {1, . . . , ℓ} | xn ∈ A, yn = u}| ≥ ℓ · (1 − τ) · Lu(A) · PX(A) } .\n\nIntuitively, we ask that the number of examples of class u in A does not deviate too much from its expectation, introducing two approximations: the multiplicative factor (1 − τ), and the lower bound Lu(A) on the conditional probability of class u in A. We combine the properties of all these sets in the set Fℓ := ∩u∈Y ∩A∈A Fℓ^{A,u} of training sets of size ℓ, with the same lower bound on the number of training examples in all sets A ∈ A, and for all classes u ∈ Y.\n\n4 Preparations\n\nThe proof of our main result follows the proofs of Theorems 1 and 2 in [6] as closely as possible. 
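The defining condition of the 'typical' sets above is a plain counting condition, which a small sketch makes explicit (illustrative code, not from the paper; the set A is represented by a membership predicate, and Lu(A) and PX(A) are given numbers):

```python
def is_typical(T, in_A, u, L_u, P_A, tau):
    # Check the defining condition of F_ell^{A,u}: the number of examples
    # (x, y) in T with x in A and y = u must be at least
    # ell * (1 - tau) * L_u * P_A.
    ell = len(T)
    count = sum(1 for x, y in T if in_A(x) and y == u)
    return count >= ell * (1 - tau) * L_u * P_A

# Toy example: A = [0, 1/2), class u = 1, with assumed L_u = 0.6, P_X(A) = 0.5.
T = [(0.1, 1), (0.2, 1), (0.4, 2), (0.7, 1), (0.9, 2), (0.3, 1)]
print(is_typical(T, lambda x: 0.0 <= x < 0.5, 1, 0.6, 0.5, 0.1))  # -> True
```

Membership in Fℓ is then simply the conjunction of this condition over all A ∈ A and all u ∈ Y.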
For the sake of clarity we organize the proof such that all six lemmas in this section directly correspond to Lemmas 1-6 in [6].\n\nLemma 1. (Lemma 1 from [6]) Let k : X × X → R be a universal kernel on a compact subset X of Rd and Φ : X → H be a feature map of k. Then Φ is continuous and\n\ndk(x, x′) := ‖Φ(x) − Φ(x′)‖\n\ndefines a metric on X such that id : (X, ‖·‖) → (X, dk) is continuous. In particular, N((X, dk), ε) is finite for all ε > 0.\n\nLemma 2. Let X ⊂ Rd be compact and let k : X × X → R be a universal kernel. Then, for all ε > 0 and all pairwise disjoint and compact (or empty) subsets K̃u ⊂ X, u ∈ Y, there exists an induced function\n\nf : X → [−1/2 · (1 + ε), 1/2 · (1 + ε)]^q ; x ↦ (⟨w∗1, x⟩, . . . , ⟨w∗q, x⟩)T ,\n\nsuch that\n\nfu(x) ∈ [1/2, 1/2 · (1 + ε)] if x ∈ K̃u\nfu(x) ∈ [−1/2 · (1 + ε), −1/2] if x ∈ K̃v for some v ∈ Y \ {u}\n\nfor all u ∈ Y.\n\nProof. This lemma directly corresponds to Lemma 2 in [6], with slightly different cases. Its proof is completely analogous.\n\nLemma 3. The probability of the training sets Fℓ is lower bounded by\n\nP^ℓ(Fℓ) ≥ 1 − q · M · exp(−(1/8) · (τ^6/M^2) · ℓ) .\n\nProof. Let us fix A ∈ A and u ∈ Y. In the case Lu(A) = 0 we trivially have P^ℓ((X × Y)^ℓ \ Fℓ^{A,u}) = 0. Otherwise we consider T = ((x1, y1), . . . , (xℓ, yℓ)) ∈ (X × Y)^ℓ and define the binary variables zi := 1{A×{u}}(xi, yi), where the indicator function 1S(s) is one for s ∈ S and zero otherwise. This definition allows us to express the cardinality |{n ∈ {1, . . . 
, ℓ} | xn ∈ A, yn = u}| = ∑i=1,...,ℓ zi found in the definition of Fℓ^{A,u} in a form suitable for the application of Hoeffding's inequality. The inequality, applied to the variables zi, states\n\nP^ℓ( ∑i=1,...,ℓ zi ≤ (1 − τ) · E · ℓ ) ≤ exp(−2(τE)^2 ℓ) ,\n\nwhere E := E[zi] = ∫A×{u} dP(x, y) = ∫A P(u|x) dPX(x) ≥ Lu(A) · PX(A) > 0. Due to E > 0 we can use the relation\n\n∑i=1,...,ℓ zi < (1 − τ) · E · ℓ ⇒ ∑i=1,...,ℓ zi ≤ (1 − τ/2) · E · ℓ\n\nin order to obtain Hoeffding's formula for the case of strict inequality\n\nP^ℓ( ∑i=1,...,ℓ zi < (1 − τ) · E · ℓ ) ≤ exp(−(1/2)(τE)^2 ℓ) .\n\nCombining E ≥ Lu(A) · PX(A) and the equivalence ∑i=1,...,ℓ zi < (1 − τ) · Lu(A) · PX(A) · ℓ ⇔ T ∉ Fℓ^{A,u} we obtain\n\nP^ℓ((X × Y)^ℓ \ Fℓ^{A,u}) = P^ℓ( ∑i=1,...,ℓ zi < (1 − τ) · Lu(A) · PX(A) · ℓ ) ≤ P^ℓ( ∑i=1,...,ℓ zi < (1 − τ) · E · ℓ ) ≤ exp(−(1/2)(τE)^2 ℓ) ≤ exp(−(1/2)(τ · Lu(A) · PX(A))^2 ℓ) .\n\nProperties (P3) and (P7) ensure Lu(A) ≥ τ and PX(A) ≥ τ/(2M). Applying these to the previous inequality results in\n\nP^ℓ((X × Y)^ℓ \ Fℓ^{A,u}) ≤ exp(−(1/2)(τ^3/(2M))^2 ℓ) = exp(−(1/8)(τ^6/M^2) ℓ) ,\n\nwhich also holds in the case Lu(A) = 0 treated earlier. 
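As a numerical sanity check of the tail bound used above, one can simulate the binary variables zi and compare the empirical frequency of the event with exp(−(1/2)(τE)^2 ℓ) (an illustrative simulation with arbitrarily chosen values of τ, E, and ℓ; it is not part of the proof):

```python
import math
import random

def strict_tail_bound(tau, E, ell):
    # Hoeffding bound for P( sum z_i < (1 - tau) * E * ell ).
    return math.exp(-0.5 * (tau * E) ** 2 * ell)

def empirical_tail(tau, E, ell, trials=1000, seed=0):
    # Empirical frequency of the event over i.i.d. Bernoulli(E) samples.
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(1 for _ in range(ell) if rng.random() < E)
        if s < (1 - tau) * E * ell:
            hits += 1
    return hits / trials

tau, E, ell = 0.3, 0.5, 400
print(empirical_tail(tau, E, ell) <= strict_tail_bound(tau, E, ell))
```

The bound decays exponentially in ℓ, which is exactly what drives the union bound over all q · M events in Lemma 3.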
Finally, we use the union bound\n\n1 − P^ℓ(Fℓ) = 1 − P^ℓ( ∩u∈Y ∩A∈A Fℓ^{A,u} ) = P^ℓ( ∪u∈Y ∪A∈A ((X × Y)^ℓ \ Fℓ^{A,u}) ) ≤ |Y| · |A| · exp(−(1/8)(τ^6/M^2) ℓ) ≤ q · M · exp(−(1/8)(τ^6/M^2) ℓ)\n\nand properties (P4) and (P5) to prove the assertion.\n\nLemma 4. The SVM solution f and the hypothesis h = κ ◦ f fulfill\n\nR(f) ≤ R∗ + ∫X ηh(x) dPX(x) .\n\nProof. The lemma follows directly from the definition of ηh, even with equality. We keep it here because it is the direct counterpart to the (stronger) Lemma 4 in [6].\n\nLemma 5. For all training sets T ∈ Fℓ the SVM solution given by (w1, . . . , wq) and (ξ1, . . . , ξℓ) fulfills\n\n∑u∈Y ⟨wu, wu⟩ + (C/ℓ) ∑i=1,...,ℓ ξi ≤ ∑u∈Y ⟨w∗u, w∗u⟩ + C(R∗ + 2τ) ,\n\nwith (w∗1, . . . , w∗q) as defined in Lemma 2.\n\nProof. The optimality of the SVM solution for the primal problem (2) implies\n\n∑u∈Y ⟨wu, wu⟩ + (C/ℓ) ∑i=1,...,ℓ ξi ≤ ∑u∈Y ⟨w∗u, w∗u⟩ + (C/ℓ) ∑i=1,...,ℓ ξ∗i\n\nfor any feasible choice of the slack variables ξ∗i. We choose the values of these variables as ξ∗i = 1 + τ for P(y|xi) ∉ ∆yi and zero otherwise, which corresponds to a feasible solution according to the construction of w∗u in Lemma 2. 
Then it remains to show that ∑i=1,...,ℓ ξ∗i ≤ ℓ · (R∗ + 2τ).\n\nLet n+ = |{i ∈ {1, . . . , ℓ} | P(y|xi) ∈ ∆yi}| denote the number of training examples correctly classified by the Bayes rule expressed by ∆yi (or κ). Then we have ∑i=1,...,ℓ ξ∗i = (1 + τ)(ℓ − n+). The definition of Fℓ yields\n\nn+ ≥ ∑u∈Y ∑A∈A, y(A)=u ℓ · (1 − τ) · Lu(A) · PX(A) = ℓ · (1 − τ) · ∑u∈Y ∑γ∈Γ, y(γ)=u [ Lu(γ) · ∑A∈Aγ PX(A) ] = ℓ · (1 − τ) · ∑u∈Y ∑γ∈Γ, y(γ)=u [ Lu(γ) · PX(Kγ) ] = ℓ · (1 − τ) · ∑γ∈Γ [ Ly(γ)(γ) · PX(Kγ) ] ≥ ℓ · (1 − τ) · ( ∑γ∈Γ [ Ly(γ)(γ) · PX(Xγ) ] − τ ) ≥ ℓ · (1 − τ) · (1 − R∗) ,\n\nwhere the last line is due to inequality (5). We obtain\n\n∑i=1,...,ℓ ξ∗i ≤ ℓ · (1 + τ) · (1 − (1 − τ) · (1 − R∗)) = ℓ · [R∗ + τ + τ^2 · (1 − R∗)] ≤ ℓ · [R∗ + τ + τ^2] ≤ ℓ · (R∗ + 2τ) ,\n\nwhich proves the claim.\n\nLemma 6. For all training sets T ∈ Fℓ the sum of the slack variables (ξ1, . . . , ξℓ) corresponding to the SVM solution fulfills\n\n∑i=1,...,ℓ ξi ≥ ℓ · (1 − τ)^2 · ( R∗ + ∫X ηh(x) dPX(x) − q · τ ) .\n\nProof. Problem (2) takes the value C in the feasible solution w1 = · · · = wq = 0 and ξ1 = · · · = ξℓ = 1. Thus, we have ∑u∈Y ‖wu‖^2 ≤ C in the optimum, and we deduce ‖wu‖ ≤ √C for each u ∈ Y. Thus, property (P8) makes sure that |fu(x) − fu(x′)| ≤ τ/2 for all x, x′ ∈ A and u ∈ Y.\n\nThe proof works through the following series of inequalities. The details are discussed below.\n\n∑i=1,...,ℓ ξi = ∑A∈A ∑u∈Y ∑xi∈A, yi=u ξi\n≥ ∑A∈A ∑u∈Y ∑xi∈A, yi=u [1 − δh(xi),u + fh(xi)(xi) − fu(xi)]+\n≥ ∑A∈A ∑u∈Y ∑xi∈A, yi=u (1/PX(A)) · ∫A [1 − δh(x),u + fh(x)(x) − fu(x) − 2 · (τ/2)]+ dPX(x)\n≥ ∑A∈A ∑u∈Y ℓ · (1 − τ) · Lu(A) · ∫A [1 − τ − δh(x),u + fh(x)(x) − fu(x)]+ dPX(x)\n= ℓ · (1 − τ) · ∑A∈A ∫A ∑u∈Y [1 − τ − δh(x),u + fh(x)(x) − fu(x)]+ · Lu(A) dPX(x)\n≥ ℓ · (1 − τ) · ∑A∈A ∫A (1 − τ) · ∑u∈Y\{h(x)} Lu(A) dPX(x)\n≥ ℓ · (1 − τ)^2 · ∑A∈A ∫A ( 1 − q · τ − Lh(x)(A) ) dPX(x)\n≥ ℓ · (1 − τ)^2 · ∑A∈A ∫A ( 1 − q · τ − 1 + s(x) + ηh(x) ) dPX(x)\n= ℓ · (1 − τ)^2 · ( R∗ + ∫X ηh(x) dPX(x) − q · τ )\n\nThe first inequality follows from equation (3). The second inequality is clear from the definition of Fℓ^{A,u} together with |fu(x) − fu(x′)| ≤ τ/2 within each A ∈ A. For the third inequality we use that the case u = h(x) does not contribute, and the non-negativity of fh(x)(x) − fu(x). 
In the next steps we make use of ∑u∈Y Lu(A) ≥ 1 − q · τ and the lower bound Lh(x)(A) ≤ P(h(x)|x) = 1 − Eh(x) = 1 − s(x) − ηh(x), which can be deduced from properties (P1) and (P2).\n\n5 Proof of the Main Result\n\nJust like the lemmas, we organize our theorems analogous to the ones found in [6]. We start with a detailed but technical auxiliary result.\n\nTheorem 1. Let X ⊂ Rd be compact, Y = {1, . . . , q}, and k : X × X → R a universal kernel. Then, for all Borel probability measures P on X × Y and all ε > 0 there exists a constant C∗ > 0 such that for all C ≥ C∗ and all ℓ ≥ 1 we have\n\nPr∗({T ∈ (X × Y)^ℓ | R(fT,k,C) ≤ R∗ + ε}) ≥ 1 − qM exp(−(1/8)(τ^6/M^2)ℓ) ,\n\nwhere Pr∗ is the outer probability of P^ℓ, fT,k,C is the solution of problem (2), M = q · (1/τ + 1)^q · N((X, dk), τ/(2√C)), and τ = ε/(q + 5).\n\nProof. According to Lemma 3 it is sufficient to show R(fT,k,C) ≤ R∗ + ε for all T ∈ Fℓ. Lemma 4 provides the estimate R(f) ≤ R∗ + ∫X ηh(x) dPX(x), such that it remains to show that ∫X ηh(x) dPX(x) ≤ ε for T ∈ Fℓ. Consider w∗u as defined in Lemma 2, then we combine Lemmas 5 and 6 to\n\n(1 − τ)^2 · ( R∗ + ∫X ηh(x) dPX(x) − q · τ ) ≤ (1/C) · ( ∑u∈Y ‖w∗u‖^2 − ∑u∈Y ‖wu‖^2 ) + (R∗ + 2τ) ,\n\nwhere (1 − τ)^2 ≤ 1 and ∑u∈Y ‖wu‖^2 ≥ 0. Using a − τ ≤ (1 − τ) · a for any a ∈ [0, 1], we derive ∫X ηh(x) dPX(x) ≤ (1/C) ∑u∈Y ‖w∗u‖^2 + (q + 4) · τ. With the choice C∗ = (1/τ) ∑u∈Y ‖w∗u‖^2 and the condition C ≥ C∗ we obtain ∫X ηh(x) dPX(x) ≤ τ + (q + 4) · τ = (q + 5) · τ = ε.\n\nProof of Theorem 2. 
Up to constants, this short proof coincides with the proof of Theorem 2 in [6]. Because of the importance of the statement and the brevity of the proof we repeat it here:\n\nSince ℓ · Cℓ → ∞ there exists an integer ℓ0 such that ℓ · Cℓ ≥ C∗ for all ℓ ≥ ℓ0. Thus for ℓ ≥ ℓ0 Theorem 1 yields\n\nPr∗({T ∈ (X × Y)^ℓ | R(fT,k,Cℓ) ≤ R∗ + ε}) ≥ 1 − qMℓ exp(−(1/8)(τ^6/Mℓ^2)ℓ) ,\n\nwhere Mℓ = D · N((X, dk), τ/(2√Cℓ)). Moreover, by the assumption on the covering numbers of (X, dk) we obtain Mℓ^2 ∈ O((ℓ · Cℓ)^α) and thus ℓ · Mℓ^(−2) → ∞.\n\n6 Conclusion\n\nWe have proven the universal consistency of the popular multi-class SVM by Crammer and Singer. This result disproves the common belief that this machine is in general inconsistent. The proof itself can be understood as an extension of Steinwart's universal consistency result for binary SVMs. Just like there are different extensions of the binary SVM to multi-class classification in the literature, we strongly believe that our proof can be further generalized to cover other multi-class machines, such as the one proposed by Weston and Watkins, which is a possible direction for future research.\n\nReferences\n\n[1] C. Cortes and V. Vapnik. Support-Vector Networks. Machine Learning, 20(3):273–297, 1995.\n\n[2] K. Crammer and Y. Singer. 
On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2002.\n\n[3] S. Hill and A. Doucet. A Framework for Kernel-Based Multi-Category Classification. Journal of Artificial Intelligence Research, 30:525–564, 2007.\n\n[4] Y. Lee, Y. Lin, and G. Wahba. Multicategory Support Vector Machines: Theory and Application to the Classification of Microarray Data and Satellite Radiance Data. Journal of the American Statistical Association, 99(465):67–82, 2004.\n\n[5] Y. Liu. Fisher Consistency of Multicategory Support Vector Machines. Journal of Machine Learning Research, 2:291–298, 2007.\n\n[6] I. Steinwart. Support Vector Machines are Universally Consistent. J. Complexity, 18(3):768–791, 2002.\n\n[7] A. Tewari and P. L. Bartlett. On the Consistency of Multiclass Classification Methods. Journal of Machine Learning Research, 8:1007–1025, 2007.\n\n[8] V. Vapnik. Statistical Learning Theory. Wiley, New-York, 1998.\n\n[9] J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In M. Verleysen, editor, Proceedings of the Seventh European Symposium On Artificial Neural Networks (ESANN), pages 219–224, 1999.", "award": [], "sourceid": 33, "authors": [{"given_name": "Tobias", "family_name": "Glasmachers", "institution": null}]}