{"title": "Large Margin DAGs for Multiclass Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 547, "page_last": 553, "abstract": null, "full_text": "Large Margin DAGs for \nMulticlass Classification \n\nJohn C. Platt \n\nMicrosoft Research \n\n1 Microsoft Way \n\nRedmond, WA 98052 \njpiatt@microsojt.com \n\nNello Cristianini \n\nDept. of Engineering Mathematics \n\nUniversity of Bristol \nBristol, BS8 1 TR - UK \n\nnello.cristianini@bristol.ac.uk \n\nJohn Shawe-Taylor \n\nDepartment of Computer Science \n\nRoyal Holloway College - University of London \n\nEGHAM, Surrey, TW20 OEX - UK \nj.shawe-taylor@dcs.rhbnc.ac.uk \n\nAbstract \n\nWe present a new learning architecture: the Decision Directed Acyclic \nGraph (DDAG), which is used to combine many two-class classifiers \ninto a multiclass classifier. For an N -class problem, the DDAG con(cid:173)\ntains N(N - 1)/2 classifiers, one for each pair of classes. We present a \nVC analysis of the case when the node classifiers are hyperplanes; the re(cid:173)\nsulting bound on the test error depends on N and on the margin achieved \nat the nodes, but not on the dimension of the space. This motivates an \nalgorithm, DAGSVM, which operates in a kernel-induced feature space \nand uses two-class maximal margin hyperplanes at each decision-node \nof the DDAG. The DAGSVM is substantially faster to train and evalu(cid:173)\nate than either the standard algorithm or Max Wins, while maintaining \ncomparable accuracy to both of these algorithms. \n\n1 Introduction \n\nThe problem of multiclass classificatIon, especially for systems like SVMs, doesn't present \nan easy solution. It is generally simpler to construct classifier theory and algorithms for two \nmutually-exclusive classes than for N mutually-exclusive classes. We believe constructing \nN -class SVMs is still an unsolved research p~oblem. \nThe standard method for N -class SVMs [10] is to construct N SVMs. 
The ith SVM is trained with all of the examples in the ith class given positive labels, and all other examples given negative labels. We refer to SVMs trained in this way as 1-v-r SVMs (short for one-versus-rest). The final output of the N 1-v-r SVMs is the class that corresponds to the SVM with the highest output value. Unfortunately, there is no bound on the generalization error for the 1-v-r SVM, and the training time of the standard method scales linearly with N. \n\nAnother method for constructing N-class classifiers from SVMs is derived from previous research into combining two-class classifiers. Knerr [5] suggested constructing all possible two-class classifiers from a training set of N classes, each classifier being trained on only two out of the N classes. There would thus be K = N(N-1)/2 classifiers. When applied to SVMs, we refer to this as 1-v-1 SVMs (short for one-versus-one). \n\nKnerr suggested combining these two-class classifiers with an \"AND\" gate [5]. Friedman [4] suggested a Max Wins algorithm: each 1-v-1 classifier casts one vote for its preferred class, and the final result is the class with the most votes. Friedman shows circumstances in which this algorithm is Bayes optimal. Kreßel [6] applies the Max Wins algorithm to Support Vector Machines with excellent results. \n\nA significant disadvantage of the 1-v-1 approach, however, is that, unless the individual classifiers are carefully regularized (as in SVMs), the overall N-class classifier system will tend to overfit. The \"AND\" combination method and the Max Wins combination method do not have bounds on the generalization error. Finally, the size of the 1-v-1 classifier may grow superlinearly with N, and hence may be slow to evaluate on large problems.
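As a concrete illustration of the Max Wins rule described above (a minimal sketch, not the authors' code; the `classifiers[(i, j)](x)` callable API is a hypothetical stand-in for a trained 1-v-1 classifier that returns its preferred class):

```python
from collections import Counter
from itertools import combinations

def max_wins(x, classifiers, classes):
    """Combine 1-v-1 classifiers by voting (Friedman's Max Wins rule).

    classifiers[(i, j)](x) is assumed to return i or j, whichever class
    the i-vs-j two-class classifier prefers for input x.
    """
    votes = Counter()
    for (i, j) in combinations(classes, 2):
        votes[classifiers[(i, j)](x)] += 1
    # The class with the most votes wins; ties are broken arbitrarily.
    return votes.most_common(1)[0][0]

# Toy example: three classes, where every pairwise classifier happens to
# prefer the larger class label, so class 2 collects both of its votes.
clfs = {(i, j): (lambda x, i=i, j=j: max(i, j))
        for (i, j) in combinations([0, 1, 2], 2)}
print(max_wins(None, clfs, [0, 1, 2]))  # -> 2
```

Note that all K = N(N-1)/2 classifiers are consulted for every test point, which is part of the evaluation cost the DDAG avoids.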
\nIn Section 2, we introduce a new multiclass learning architecture, called the Decision Directed Acyclic Graph (DDAG). The DDAG contains N(N-1)/2 nodes, each with an associated 1-v-1 classifier. In Section 3, we present a VC analysis of DDAGs whose classifiers are hyperplanes, showing that the margins achieved at the decision nodes and the size of the graph both affect their performance, while the dimensionality of the input space does not. The VC analysis indicates that building large margin DAGs in high-dimensional feature spaces can yield good generalization performance. Using this bound as a guide, in Section 4, we introduce a novel algorithm for multiclass classification based on placing 1-v-1 SVMs into the nodes of a DDAG. This algorithm, called DAGSVM, is efficient to train and evaluate. Empirical evidence of this efficiency is shown in Section 5. \n\n2 Decision DAGs \n\nA Directed Acyclic Graph (DAG) is a graph whose edges have an orientation and no cycles. A Rooted DAG has a unique node such that it is the only node which has no arcs pointing into it. A Rooted Binary DAG has nodes which have either 0 or 2 arcs leaving them. We will use Rooted Binary DAGs in order to define a class of functions to be used in classification tasks. The class of functions computed by Rooted Binary DAGs is formally defined as follows. \n\nDefinition 1 Decision DAGs (DDAGs). Given a space X and a set of boolean functions F = {f : X -> {0, 1}}, the class DDAG(F) of Decision DAGs on N classes over F are functions which can be implemented using a rooted binary DAG with N leaves labeled by the classes, where each of the K = N(N-1)/2 internal nodes is labeled with an element of F. The nodes are arranged in a triangle with the single root node at the top, two nodes in the second layer, and so on until the final layer of N leaves. The i-th node in layer j < N is connected to the i-th and (i+1)-st node in the (j+1)-st layer.
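The triangular layout of Definition 1 can be checked with a small sketch (an editorial illustration, not from the paper; node `(j, i)` denotes the i-th node in layer j):

```python
def ddag_layout(N):
    """Internal-node layout of a DDAG on N classes: layer j (1-based)
    holds j nodes, and node i in layer j < N - 1 feeds nodes i and i + 1
    in layer j + 1, forming a triangle of K = N(N-1)/2 decision nodes."""
    nodes = [(j, i) for j in range(1, N) for i in range(1, j + 1)]
    edges = [((j, i), (j + 1, child))
             for (j, i) in nodes if j < N - 1
             for child in (i, i + 1)]
    return nodes, edges

nodes, edges = ddag_layout(4)
assert len(nodes) == 4 * 3 // 2   # K = N(N-1)/2 internal nodes
# Every non-final internal node has exactly 2 outgoing arcs.
assert all(sum(1 for e in edges if e[0] == n) == 2
           for n in nodes if n[0] < 3)
```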
\n\nTo evaluate a particular DDAG G on input x in X, starting at the root node, the binary function at a node is evaluated. The node is then exited via the left edge if the binary function is zero, or the right edge if the binary function is one. The next node's binary function is then evaluated. The value of the decision function D(x) is the value associated with the final leaf node (see Figure 1(a)). The path taken through the DDAG is known as the evaluation path. The input x reaches a node of the graph if that node is on the evaluation path for x. We refer to the decision node distinguishing classes i and j as the ij-node. Assuming that the number of a leaf is its class, this node is the i-th node in the (N - j + i)-th layer, provided i < j. Similarly, the j-nodes are those nodes involving class j, that is, the internal nodes on the two diagonals containing the leaf labeled by j. \n\n[Figure 1 appears here.] \n\nFigure 1: (a) The decision DAG for finding the best class out of four classes. The equivalent list state for each node is shown next to that node. (b) A diagram of the input space of a four-class problem. A 1-v-1 SVM can only exclude one class from consideration: test points on one side of the 1-vs-4 hyperplane cannot be in class 1, and test points on the other side cannot be in class 4. \n\nThe DDAG is equivalent to operating on a list, where each node eliminates one class from the list. The list is initialized with all of the classes. A test point is evaluated against the decision node that corresponds to the first and last elements of the list. If the node prefers one of the two classes, the other class is eliminated from the list, and the DDAG proceeds to test the first and last elements of the new list. The DDAG terminates when only one class remains in the list.
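The list-elimination procedure above can be sketched as follows (an editorial illustration, not the authors' implementation; `node_classifier` is a hypothetical callable standing in for the trained 1-v-1 decision node):

```python
def evaluate_ddag(x, node_classifier, classes):
    """Evaluate a DDAG via its equivalent list-elimination procedure.

    node_classifier(i, j, x) is assumed to return the class (i or j)
    preferred by the 1-v-1 classifier at the ij-node.
    """
    remaining = list(classes)
    while len(remaining) > 1:
        i, j = remaining[0], remaining[-1]
        # The losing class of the i-vs-j node is removed from the list.
        if node_classifier(i, j, x) == i:
            remaining.pop()       # j is eliminated
        else:
            remaining.pop(0)      # i is eliminated
    return remaining[0]

# Toy run with 4 classes and a node that always prefers the smaller
# label: exactly N - 1 = 3 nodes are evaluated before class 1 remains.
print(evaluate_ddag(None, lambda i, j, x: min(i, j), [1, 2, 3, 4]))  # -> 1
```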
Thus, for a problem with N classes, N - 1 decision nodes will be evaluated in order to derive an answer. \n\nThe current state of the list is the total state of the system. Therefore, since a list state is reachable by more than one possible path through the system, the decision graph the algorithm traverses is a DAG, not simply a tree. \n\nDecision DAGs naturally generalize the class of Decision Trees, allowing for a more efficient representation of redundancies and repetitions that can occur in different branches of the tree, by allowing the merging of different decision paths. The class of functions implemented is the same as that of Generalized Decision Trees [1], but this particular representation presents both computational and learning-theoretical advantages. \n\n3 Analysis of Generalization \n\nIn this paper we study DDAGs where the node classifiers are hyperplanes. We define a Perceptron DDAG to be a DDAG with a perceptron at every node. Let w be the (unit) weight vector correctly splitting the i and j classes at the ij-node with threshold theta. We define the margin of the ij-node to be gamma = min_{c(x) = i, j} |<w, x> - theta|, where c(x) is the class associated to training example x. Note that, in this definition, we only take into account examples with class labels equal to i or j. \n\nTheorem 1 Suppose we are able to classify a random m-sample of labeled examples using a Perceptron DDAG on N classes containing K decision nodes with margins gamma_i at node i. Then we can bound the generalization error with probability greater than 1 - delta to be less than \n\n(130 R^2 / m) (D' log(4em) log(4m) + log(2(2m)^K / delta)), \n\nwhere D' = sum_{i=1}^{K} 1/gamma_i^2, and R is the radius of a ball containing the distribution's support. \n\nProof: see Appendix. \n\nTheorem 1 implies that we can control the capacity of DDAGs by enlarging their margin.
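This capacity control can be illustrated numerically (an editorial sketch, not part of the paper): evaluating the Theorem 1 quantity (130 R^2 / m)(D' log(4em) log(4m) + log(2(2m)^K / delta)) with D' = sum_i 1/gamma_i^2 shows the bound tighten as the node margins grow.

```python
import math

def theorem1_bound(margins, R, m, delta):
    """Numeric value of the Theorem 1 generalization bound, with the
    constants as stated in the theorem (margins is the list of the K
    node margins gamma_1, ..., gamma_K)."""
    K = len(margins)
    D_prime = sum(1.0 / g**2 for g in margins)
    return (130.0 * R**2 / m) * (
        D_prime * math.log(4 * math.e * m) * math.log(4 * m)
        + math.log(2 * (2 * m)**K / delta))

# Doubling every margin divides the D' term by 4, so the bound shrinks.
b_small = theorem1_bound([0.1] * 6, R=1.0, m=10000, delta=0.05)
b_large = theorem1_bound([0.2] * 6, R=1.0, m=10000, delta=0.05)
assert b_large < b_small
```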
\nNote that, in some situations, this bound may be pessimistic: the DDAG partitions the input space into polytopic regions, each of which is mapped to a leaf node and assigned to a specific class. Intuitively, the only margins that should matter are the ones relative to the boundaries of the cell where a given training point is assigned, whereas the bound in Theorem 1 depends on all the margins in the graph. \n\nBy the above observations, we would expect that a DDAG whose j-node margins are large would be accurate at identifying class j, even when other nodes do not have large margins. Theorem 2 substantiates this by showing that the appropriate bound depends only on the j-node margins, but first we introduce the notation: epsilon_j(G) = P{x : (x is in class j and x is misclassified by G) or x is misclassified as class j by G}. \n\nTheorem 2 Suppose we are able to correctly distinguish class j from the other classes in a random m-sample with a DDAG G over N classes containing K decision nodes with margins gamma_i at node i. Then with probability 1 - delta, \n\nepsilon_j(G) <= (130 R^2 / m) (D' log(4em) log(4m) + log(2(2m)^{N-1} / delta)), \n\nwhere D' = sum_{i in j-nodes} 1/gamma_i^2, and R is the radius of a ball containing the support of the distribution. \n\nProof: follows exactly Lemma 4 and Theorem 1, but is omitted. \n\n4 The DAGSVM algorithm \n\nBased on the previous analysis, we propose a new algorithm, called the Directed Acyclic Graph SVM (DAGSVM) algorithm, which combines the results of 1-v-1 SVMs. We will show that this combination method is efficient to train and evaluate. \n\nThe analysis of Section 3 indicates that maximizing the margin of all of the nodes in a DDAG will minimize a bound on the generalization error. This bound is also independent of input dimensionality. Therefore, we will create a DDAG whose nodes are maximum margin classifiers over a kernel-induced feature space.
Such a DDAG is obtained by training each ij-node only on the subset of training points labeled by i or j. The final class decision is derived by using the DDAG architecture, described in Section 2. \n\nThe DAGSVM separates the individual classes with large margin. It is safe to discard the losing class at each 1-v-1 decision because, for the hard margin case, all of the examples of the losing class are far away from the decision surface (see Figure 1(b)). \n\nFor the DAGSVM, the choice of the class order in the list (or DDAG) is arbitrary. The experiments in Section 5 simply use a list of classes in the natural numerical (or alphabetical) order. Limited experimentation with re-ordering the list did not yield significant changes in accuracy performance. \n\nThe DAGSVM algorithm is superior to other multiclass SVM algorithms in both training and evaluation time. Empirically, SVM training is observed to scale super-linearly with the training set size m [7], according to a power law T = c m^gamma, where gamma ~ 2 for algorithms based on the decomposition method, with some proportionality constant c. For the standard 1-v-r multiclass SVM training algorithm, the entire training set is used to create all N classifiers. Hence the training time for 1-v-r is \n\nT_{1-v-r} = c N m^gamma. (1) \n\nAssuming that the classes have the same number of examples, training each 1-v-1 SVM only requires 2m/N training examples. Thus, training all K 1-v-1 SVMs would require \n\nT_{1-v-1} = c (N(N-1)/2) (2m/N)^gamma ~ 2^{gamma-1} c N^{2-gamma} m^gamma. (2) \n\nFor a typical case, where gamma = 2, the amount of time required to train all of the 1-v-1 SVMs is independent of N, and is only twice that of training a single 1-v-r SVM. Using 1-v-1 SVMs with a combination algorithm is thus preferred for training time.
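The training-time comparison of equations (1) and (2) can be sketched directly (an editorial illustration with an arbitrary proportionality constant c):

```python
def train_time_1vr(c, m, N, gamma=2.0):
    # Equation (1): each of the N one-versus-rest machines sees all m points.
    return c * N * m**gamma

def train_time_1v1(c, m, N, gamma=2.0):
    # Equation (2): K = N(N-1)/2 machines, each trained on ~2m/N points.
    return c * (N * (N - 1) / 2) * (2 * m / N)**gamma

# For gamma = 2 the ratio reduces to 2(N-1)/N^2, so the total 1-v-1 cost
# is roughly 2 c m^2: about twice one 1-v-r machine, and well below the
# N-machine 1-v-r total.
ratio = train_time_1v1(1.0, 1000, 10) / train_time_1vr(1.0, 1000, 10)
print(ratio)  # -> 0.18, i.e. 2*(10-1)/10**2
```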
\n\n5 Empirical Comparisons and Conclusions \n\nThe DAGSVM algorithm was evaluated on three different test sets: the USPS handwritten digit data set [10], the UCI Letter data set [2], and the UCI Covertype data set [2]. The USPS digit data consists of 10 classes (0-9), whose inputs are pixels of a scaled input image. There are 7291 training examples and 2007 test examples. The UCI Letter data consists of 26 classes (A-Z), whose inputs are measured statistics of printed font glyphs. We used the first 16000 examples for training, and the last 4000 for testing. All inputs of the UCI Letter data set were scaled to lie in [-1, 1]. The UCI Covertype data consists of 7 classes of trees, where the inputs are terrain features. There are 11340 training examples and 565893 test examples. All of the continuous inputs for Covertype were scaled to have zero mean and unit variance. Discrete inputs were represented as a 1-of-n code. \n\nOn each data set, we trained N 1-v-r SVMs and K 1-v-1 SVMs, using SMO [7], with soft margins. We combined the 1-v-1 SVMs both with the Max Wins algorithm and with DAGSVM. The choice of kernel and of the regularizing parameter C was determined via performance on a validation set. The validation performance was measured by training on 70% of the training set and testing the combination algorithm on 30% of the training set (except for Covertype, where the UCI validation set was used). The best kernel was selected from a set of polynomial kernels (from degree 1 through 6), both homogeneous and inhomogeneous, and Gaussian kernels with various sigma. The Gaussian kernel was always found to be best.
\n\n                 sigma     C   Error     Kernel        Training CPU   Classifier Size \n                               Rate (%)  Evaluations   Time (sec)     (Kparameters) \nUSPS \n1-v-r            3.58    100    4.7      2936          3532           760 \nMax Wins         5.06    100    4.5      1877           307           487 \nDAGSVM           5.06    100    4.4       819           307           487 \nNeural Net [10]                 5.9 \nUCI Letter \n1-v-r            0.447   100    2.2      8183          1764           148 \nMax Wins         0.632   100    2.4      7357           441           160 \nDAGSVM           0.447    10    2.2      3834           792           223 \nNeural Net                      4.3 \nUCI Covertype \n1-v-r            1        10   30.2      7366          4210           105 \nMax Wins         1        10   29.0      7238          1305           107 \nDAGSVM           1        10   29.2      4390          1305           107 \nNeural Net [2]                 30 \n\nTable 1: Experimental Results \n\nTable 1 shows the results of the experiments. The optimal parameters for all three multiclass SVM algorithms are very similar for both data sets. Also, the error rates are similar for all three algorithms for both data sets. Neither 1-v-r nor Max Wins is statistically significantly better than DAGSVM using McNemar's test [3] at a 0.05 significance level for USPS or UCI Letter. For UCI Covertype, Max Wins is slightly better than either of the other SVM-based algorithms. The results for a neural network trained on the same data sets are shown for a baseline accuracy comparison. \n\nThe three algorithms distinguish themselves in training time, evaluation time, and classifier size. The number of kernel evaluations is a good indication of evaluation time. For 1-v-r and Max Wins, the number of kernel evaluations is the total number of unique support vectors for all SVMs. For the DAGSVM, the number of kernel evaluations is the number of unique support vectors averaged over the evaluation paths through the DDAG taken by the test set. As can be seen in Table 1, Max Wins is faster than 1-v-r SVMs, due to shared support vectors between the 1-v-1 classifiers.
The DAGSVM has the fastest evaluation: it is between a factor of 1.6 and 2.3 times faster to evaluate than Max Wins. The DAGSVM algorithm is also substantially faster to train than the standard 1-v-r SVM algorithm: a factor of 2.2 and 11.5 times faster for these two data sets. The Max Wins algorithm shares a similar training speed advantage. \n\nBecause the SVM basis functions are drawn from a limited set, they can be shared across classifiers for a great savings in classifier size. The number of parameters for DAGSVM (and Max Wins) is comparable to the number of parameters for 1-v-r SVM, even though there are N(N-1)/2 classifiers, rather than N. \n\nIn summary, we have created a Decision DAG architecture, which is amenable to a VC-style bound on generalization error. Using this bound, we created the DAGSVM algorithm, which places a two-class SVM at every node of the DDAG. The DAGSVM algorithm was tested versus the standard 1-v-r multiclass SVM algorithm and Friedman's Max Wins combination algorithm. The DAGSVM algorithm yields comparable accuracy and memory usage to the other two algorithms, but yields substantial improvements in both training and evaluation time. \n\n6 Appendix: Proof of Main Theorem \n\nDefinition 2 Let F be a set of real-valued functions. We say that a set of points X is gamma-shattered by F relative to r = (r_x)_{x in X} if there are real numbers r_x, indexed by x in X, such that for all binary vectors b indexed by X, there is a function f_b in F satisfying (2b_x - 1) f_b(x) >= (2b_x - 1) r_x + gamma. The fat-shattering dimension, fat_F, of the set F is a function from the positive real numbers to the integers which maps a value gamma to the size of the largest gamma-shattered set, if the set is finite, or maps to infinity otherwise. \n\nAs a relevant example, consider the class F_lin = {x -> <w, x> - theta : ||w|| = 1}. We quote the following result from [1].
\n\nTheorem 3 Let F_lin be restricted to points in a ball of n dimensions of radius R about the origin. Then \n\nfat_{F_lin}(gamma) <= min{R^2/gamma^2, n} + 1. \n\nWe will bound generalization with a technique that closely resembles the technique used in [1] to study Perceptron Decision Trees. We will now give a lemma and a theorem: the lemma bounds the probability over a double sample that the first half has zero error and the second has error greater than an appropriate epsilon. We assume that the DDAG on N classes has K = N(N-1)/2 nodes, and we denote fat_{F_lin}(gamma) by fat(gamma). \n\nLemma 4 Let G be a DDAG on N classes with K = N(N-1)/2 decision nodes with margins gamma_1, gamma_2, ..., gamma_K at the decision nodes, satisfying k_i = fat(gamma_i/8), where fat is continuous from the right. Then the following bound holds: P^{2m}{xy : there exists a graph G which separates classes i and j at the ij-node for all x in x, and a fraction of points misclassified in y greater than epsilon(m, K, delta)} < delta, where epsilon(m, K, delta) = (2/m)(D log(8m) + log(2^K/delta)) and D = sum_{i=1}^{K} k_i log(4em/k_i). \n\nProof: The proof of Lemma 4 is omitted for space reasons, but is formally analogous to the proof of Lemma 4.4 in [8], and can easily be reconstructed from it. \n\nLemma 4 applies to a particular DDAG with a specified margin gamma_i at each node. In practice, we observe these quantities after generating the DDAG. Hence, to obtain a bound that can be applied in practice, we must bound the probabilities uniformly over all of the possible margins that can arise. We can now give the proof of Theorem 1. \n\nProof of Main Theorem: We must bound the probabilities over different margins. We first use a standard result due to Vapnik [9, page 168] to bound the probability of error in terms of the probability of the discrepancy between the performance on two halves of a double sample. Then we combine this result with Lemma 4. We must consider all possible patterns of k_i's over the decision nodes.
The largest allowed value of k_i is m, and so, for fixed K, we can bound the number of possibilities by m^K. Hence, there are m^K applications of Lemma 4 for a fixed N. Since K = N(N-1)/2, we can let delta_k = delta/m^K, so that the sum over all applications gives sum_k delta_k = delta. Choosing \n\nepsilon(m, K, delta_k) = (130 R^2 / m) (D' log(4em) log(4m) + log(2(2m)^K / delta_k)) (3) \n\nin the applications of Lemma 4 ensures that the probability of any of the statements failing to hold is less than delta/2. Note that we have replaced the constant 8^2 = 64 by 65 in order to ensure the continuity from the right required for the application of Lemma 4, and have upper bounded log(4em/k_i) by log(4em). Applying Vapnik's Lemma [9, page 168] in each case, the probability that the statement of the theorem fails to hold is less than delta. \n\nMore details on this style of proof, omitted in this paper for space constraints, can be found in [1]. \n\nReferences \n\n[1] K. Bennett, N. Cristianini, J. Shawe-Taylor, and D. Wu. Enlarging the margin in perceptron decision trees. Machine Learning (submitted). http://lara.enm.bris.ac.uk/cig/pubs/ML-PDT.ps \n\n[2] C. Blake, E. Keogh, and C. Merz. UCI repository of machine learning databases. Dept. of Information and Computer Sciences, University of California, Irvine, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html \n\n[3] T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10:1895-1924, 1998. \n\n[4] J. H. Friedman. Another approach to polychotomous classification. Technical report, Stanford Department of Statistics, 1996. http://www-stat.stanford.edu/reports/friedman/poly.ps.Z \n\n[5] S. Knerr, L. Personnaz, and G. Dreyfus. Single-layer learning revisited: A stepwise procedure for building and training a neural network. In Fogelman-Soulie and Herault, editors, Neurocomputing: Algorithms, Architectures and Applications, NATO ASI. Springer, 1990. \n\n[6] U. Kreßel.
Pairwise classification and support vector machines. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 255-268. MIT Press, Cambridge, MA, 1999. \n\n[7] J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 185-208. MIT Press, Cambridge, MA, 1999. \n\n[8] J. Shawe-Taylor and N. Cristianini. Data dependent structural risk minimization for perceptron decision trees. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information Processing Systems, volume 10, pages 336-342. MIT Press, 1999. \n\n[9] V. Vapnik. Estimation of Dependences Based on Empirical Data [in Russian]. Nauka, Moscow, 1979. (English translation: Springer Verlag, New York, 1982). \n\n[10] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998. \n", "award": [], "sourceid": 1773, "authors": [{"given_name": "John", "family_name": "Platt", "institution": null}, {"given_name": "Nello", "family_name": "Cristianini", "institution": null}, {"given_name": "John", "family_name": "Shawe-Taylor", "institution": null}]}