{"title": "Using Analytic QP and Sparseness to Speed Training of Support Vector Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 557, "page_last": 563, "abstract": null, "full_text": "Using Analytic QP and Sparseness to Speed \n\nTraining of Support Vector Machines \n\nJohn C. Platt \n\nMicrosoft Research \n\n1 Microsoft Way \n\nRedmond, WA 98052 \njplatt@microsoft.com \n\nAbstract \n\nTraining a Support Vector Machine (SVM) requires the solution of a very \nlarge quadratic programming (QP) problem. This paper proposes an al(cid:173)\ngorithm for training SVMs: Sequential Minimal Optimization, or SMO. \nSMO breaks the large QP problem into a series of smallest possible QP \nproblems which are analytically solvable. Thus, SMO does not require \na numerical QP library. SMO's computation time is dominated by eval(cid:173)\nuation of the kernel, hence kernel optimizations substantially quicken \nSMO. For the MNIST database, SMO is 1.7 times as fast as PCG chunk(cid:173)\ning; while for the UCI Adult database and linear SVMs, SMO can be \n1500 times faster than the PCG chunking algorithm. \n\n1 \n\nINTRODUCTION \n\nIn the last few years, there has been a surge of interest in Support Vector Machines \n(SVMs) [1]. SVMs have empirically been shown to give good generalization performance \non a wide variety of problems. However, the use of SVMs is stilI limited to a small group of \nresearchers. One possible reason is that training algorithms for SVMs are slow, especially \nfor large problems. Another explanation is that SVM training algorithms are complex, \nsubtle, and sometimes difficult to implement. This paper describes a new SVM learning \nalgorithm that is easy to implement, often faster, and has better scaling properties than the \nstandard SVM training algorithm. The new SVM learning algorithm is called Sequential \nMinimal Optimization (or SMO). 
\n\n1.1 OVERVIEW OF SUPPORT VECTOR MACHINES \n\nA general non-linear SVM can be expressed as \n\nU = LQiYiK(Xi,X) - b \n\n(1) \n\n\f558 \n\nJ C. Platt \n\nwhere U is the output of the SVM, K is a kernel function which measures the similarity \nof a stored training example Xi to the input x, Yi E {-1, + 1} is the desired output of the \nclassifier, b is a threshold, and (li are weights which blend the different kernels [1]. For \nlinear SVMs, the kernel function K is linear, hence equation (1) can be expressed as \n\nu=w\u00b7x-b \n\n(2) \n\nwhere W = Li (liYiXi\u00b7 \nTraining of an SVM consists of finding the (li. The training is expressed as a minimization \nof a dual quadratic form: \n\nsubject to box constraints, \n\nand one linear equality constraint \n\nN \nLYi(li = O. \ni=l \n\n(3) \n\n(4) \n\n(5) \n\nThe (li are Lagrange multipliers of a primal quadratic programming (QP) problem: there \nis a one-to-one correspondence between each (li and each training example Xi. \nEquations (3-5) form a QP problem that the SMO algorithm will solve. The SMO algo(cid:173)\nrithm will terminate when all of the Karush-Kuhn-Tucker (KKT) optimality conditions of \nthe QP problem are fulfilled. These KKT conditions are particularly simple: \n\n(li = 0 '* YiUi ~ 1, 0 < (li < C '* YiUi = 1, \n\n(li = C '* YiUi :::; 1, \n\n(6) \n\nwhere Ui is the output of the SVM for the ith training example. \n\n1.2 PREVIOUS METHODS FOR TRAINING SUPPORT VECTOR MACHINES \n\nDue to its immense size, the QP problem that arises from SVMs cannot be easily solved via \nstandard QP techniques. The quadratic form in (3) involves a Hessian matrix of dimension \nequal to the number of training examples. This matrix cannot be fit into 128 Megabytes if \nthere are more than 4000 training examples. \n\nVapnik [9] describes a method to solve the SVM QP, which has since been known as \n\"chunking.\" Chunking relies on the fact that removing training examples with (li = 0 \ndoes not change the solution. 
Chunking thus breaks down the large QP problem into a series of smaller QP sub-problems, whose objective is to identify the training examples with non-zero α_i. Every QP sub-problem updates the subset of the α_i that are associated with the sub-problem, while leaving the rest of the α_i unchanged. The QP sub-problem consists of every non-zero α_i from the previous sub-problem combined with the M worst examples that violate the KKT conditions (6), for some M [1]. At the last step, the entire set of non-zero α_i has been identified, hence the last step solves the entire QP problem.

Chunking reduces the dimension of the matrix from the number of training examples to approximately the number of non-zero α_i. If standard QP techniques are used, chunking cannot handle large-scale training problems, because even this reduced matrix cannot fit into memory. Kaufman [3] has described a QP algorithm that does not require the storage of the entire Hessian.

The decomposition technique [6] is similar to chunking: decomposition breaks the large QP problem into smaller QP sub-problems. However, Osuna et al. [6] suggest keeping a fixed-size matrix for every sub-problem, deleting some examples and adding others which violate the KKT conditions. Using a fixed-size matrix allows SVMs to be trained on very large training sets. Joachims [2] suggests adding and subtracting examples according to heuristics for rapid convergence.

Figure 1: The Lagrange multipliers α_1 and α_2 must fulfill all of the constraints of the full problem. The inequality constraints cause the Lagrange multipliers to lie in the box. The linear equality constraint causes them to lie on a diagonal line: α_1 − α_2 = k when y_1 ≠ y_2, and α_1 + α_2 = k when y_1 = y_2.
However, until SMO, decomposition required the use of a numerical QP library, which can be costly or slow.

2 SEQUENTIAL MINIMAL OPTIMIZATION

Sequential Minimal Optimization quickly solves the SVM QP problem without using numerical QP optimization steps at all. SMO decomposes the overall QP problem into fixed-size QP sub-problems, similar to the decomposition method [7].

Unlike previous methods, however, SMO chooses to solve the smallest possible optimization problem at each step. For the standard SVM, the smallest possible optimization problem involves two elements of α, because the α must obey one linear equality constraint. At each step, SMO chooses two α_i to jointly optimize, finds the optimal values for these α_i, and updates the SVM to reflect these new values.

The advantage of SMO lies in the fact that solving for two α_i can be done analytically. Thus, numerical QP optimization is avoided entirely. The inner loop of the algorithm can be expressed in a short amount of C code, rather than invoking an entire QP library routine.

By avoiding numerical QP, the computation time is shifted from QP to kernel evaluation. Kernel evaluation time can be dramatically reduced in certain common situations, e.g., when a linear SVM is used, or when the input data is sparse (mostly zero). The result of kernel evaluations can also be cached in memory [1].

There are two components to SMO: an analytic method for solving for the two α_i, and a heuristic for choosing which multipliers to optimize. Pseudo-code for the SMO algorithm can be found in [8, 7], along with the relationship to other optimization and machine learning algorithms.

2.1 SOLVING FOR TWO LAGRANGE MULTIPLIERS

To solve for the two Lagrange multipliers α_1 and α_2, SMO first computes the constraints on these multipliers and then solves for the constrained minimum.
For convenience, all quantities that refer to the first multiplier will have a subscript 1, while all quantities that refer to the second multiplier will have a subscript 2. Because there are only two multipliers, the constraints can easily be displayed in two dimensions (see figure 1). The constrained minimum of the objective function must lie on a diagonal line segment.

The ends of the diagonal line segment can be expressed quite simply in terms of α_2. Let s = y_1 y_2. The following bounds apply to α_2:

L = max(0, α_2 + s α_1 − ½(s + 1)C),    H = min(C, α_2 + s α_1 − ½(s − 1)C).    (7)

Under normal circumstances, the objective function is positive definite, and there is a minimum along the direction of the linear equality constraint. In this case, SMO computes the minimum along the direction of the linear equality constraint:

α_2^new = α_2 + y_2(E_1 − E_2) / (K(x_1, x_1) + K(x_2, x_2) − 2K(x_1, x_2)),    (8)

where E_i = u_i − y_i is the error on the ith training example. As a next step, the constrained minimum is found by clipping α_2^new into the interval [L, H]. The value of α_1 is then computed from the new, clipped, α_2:

α_1^new = α_1 + s(α_2 − α_2^new,clipped).    (9)

For both linear and non-linear SVMs, the threshold b is re-computed after each step, so that the KKT conditions are fulfilled for both optimized examples.

2.2 HEURISTICS FOR CHOOSING WHICH MULTIPLIERS TO OPTIMIZE

In order to speed convergence, SMO uses heuristics to choose which two Lagrange multipliers to jointly optimize.

There are two separate choice heuristics: one for α_1 and one for α_2. The choice of α_1 provides the outer loop of the SMO algorithm. If an example is found to violate the KKT conditions by the outer loop, it is eligible for optimization.
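As a concrete illustration, the analytic step of equations (7)-(9) can be written in a few lines. This is a simplified sketch, not the C++ implementation benchmarked later: it assumes a precomputed kernel matrix K, assumes the usual case where the denominator of (8) is positive, and omits the re-computation of the threshold b; all function and variable names are illustrative.

```python
def svm_output(alpha, y, K, b, i):
    # Equation (1) for training example i: u_i = sum_j alpha_j y_j K(x_j, x_i) - b.
    return sum(alpha[j] * y[j] * K[j][i] for j in range(len(alpha))) - b

def smo_step(alpha, y, K, b, i1, i2, C):
    # One analytic SMO step on the multipliers at i1 and i2 (equations 7-9).
    # Assumes the usual positive-definite case (eta > 0); the threshold update
    # and degenerate cases are omitted for brevity.
    a1, a2 = alpha[i1], alpha[i2]
    s = y[i1] * y[i2]
    # Bounds on alpha_2, equation (7).
    L = max(0.0, a2 + s * a1 - 0.5 * (s + 1) * C)
    H = min(C, a2 + s * a1 - 0.5 * (s - 1) * C)
    # Errors E_i = u_i - y_i on the two chosen examples.
    E1 = svm_output(alpha, y, K, b, i1) - y[i1]
    E2 = svm_output(alpha, y, K, b, i2) - y[i2]
    # Unconstrained minimum along the equality constraint, equation (8).
    eta = K[i1][i1] + K[i2][i2] - 2.0 * K[i1][i2]
    a2_new = a2 + y[i2] * (E1 - E2) / eta
    # Clip into [L, H], then recover alpha_1 from equation (9).
    a2_clipped = min(H, max(L, a2_new))
    alpha[i1] = a1 + s * (a2 - a2_clipped)
    alpha[i2] = a2_clipped
    return alpha
```

Note that the update leaves the linear equality constraint (5) invariant by construction, which is why only these two multipliers need to move at each step.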
The outer loop alternates single passes through the entire training set with multiple passes through the non-bound α_i (α_i ∉ {0, C}). The multiple passes terminate when all of the non-bound examples obey the KKT conditions within ε. The entire SMO algorithm terminates when the entire training set obeys the KKT conditions within ε. Typically, ε = 10^−3.

The first choice heuristic concentrates the CPU time on the examples that are most likely to violate the KKT conditions, i.e., the non-bound subset. As the SMO algorithm progresses, α_i that are at the bounds are likely to stay at the bounds, while α_i that are not at the bounds will move as other examples are optimized.

As a further optimization, SMO uses the shrinking heuristic proposed in [2]. After the pass through the entire training set, shrinking finds examples which fulfill the KKT conditions by a wider margin than the worst example violates them. Further passes through the training set ignore these examples until a final pass at the end of training, which ensures that every example fulfills its KKT condition.

Once an α_1 is chosen, SMO chooses an α_2 to maximize the size of the step taken during joint optimization. SMO approximates the step size by the absolute value of the numerator in equation (8): |E_1 − E_2|. SMO keeps a cached error value E for every non-bound example in the training set and then chooses an error to approximately maximize the step size. If E_1 is positive, SMO chooses an example with minimum error E_2. If E_1 is negative, SMO chooses an example with maximum error E_2.

2.3 KERNEL OPTIMIZATIONS

Because the computation time for SMO is dominated by kernel evaluations, SMO can be accelerated by optimizing these kernel evaluations.
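One generally applicable optimization, elaborated below, is exploiting sparse inputs: if each example is stored as a sorted list of (index, value) pairs, a dot product costs time proportional to the number of non-zeros rather than the full dimension, and a Gaussian kernel reduces to an exponential of such dot products. The following sketch is illustrative only (the representation and names are assumptions; the paper's C++ code is not reproduced here):

```python
import math

def sparse_dot(a, b):
    # Dot product of two sparse vectors, each stored as (index, value) pairs
    # sorted by index. A single merge-style scan touches only the non-zeros.
    i, j, total = 0, 0, 0.0
    while i < len(a) and j < len(b):
        ia, va = a[i]
        ib, vb = b[j]
        if ia == ib:
            total += va * vb
            i += 1
            j += 1
        elif ia < ib:
            i += 1
        else:
            j += 1
    return total

def gaussian_kernel(a, b, a_sq, b_sq, two_sigma_sq):
    # Gaussian kernel as an exponential of sparse dot products:
    # K(a, b) = exp(-(|a|^2 - 2 a.b + |b|^2) / (2 sigma^2)).
    # The squared norms a_sq and b_sq can be precomputed once per example.
    return math.exp(-(a_sq - 2.0 * sparse_dot(a, b) + b_sq) / two_sigma_sq)
```

Precomputing each example's squared norm once means every subsequent Gaussian kernel evaluation costs only one sparse dot product.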
Utilizing sparse inputs is a generally applicable kernel optimization. For commonly-used kernels, equations (1) and (2) can be dramatically sped up by exploiting the sparseness of the input. For example, a Gaussian kernel can be expressed as an exponential of a linear combination of sparse dot products. Sparsely storing the training set also achieves substantial reduction in memory consumption.

To compute a linear SVM, only a single weight vector needs to be stored, rather than all of the training examples that correspond to non-zero α_i. If the QP sub-problem succeeds, the stored weight vector is updated to reflect the new α_i values.

Table 1: Parameters for various experiments

Experiment     Kernel    Sparse  Kernel   C     Training  Number of  % Sparse
                         Inputs  Caching        Set Size  Support    Inputs
                         Used    Used                     Vectors
AdultLin       Linear    Y       mix      0.05  11221     4158       89
AdultLinD      Linear    N       mix      0.05  11221     4158       0
WebLin         Linear    Y       mix      1     49749     1723       96
WebLinD        Linear    N       mix      1     49749     1723       0
AdultGaussK    Gaussian  Y       Y        1     11221     4206       89
AdultGauss     Gaussian  Y       N        1     11221     4206       89
AdultGaussKD   Gaussian  N       Y        1     11221     4206       0
AdultGaussD    Gaussian  N       N        1     11221     4206       0
WebGaussK      Gaussian  Y       Y        5     49749     4484       96
WebGauss       Gaussian  Y       N        5     49749     4484       96
WebGaussKD     Gaussian  N       Y        5     49749     4484       0
WebGaussD      Gaussian  N       N        5     49749     4484       0
MNIST          Polynom.  Y       N        100   60000     3450       81

3 BENCHMARKING SMO

The SMO algorithm is tested against the standard chunking algorithm and against the decomposition method on a series of benchmarks. Both SMO and chunking are written in C++, using Microsoft's Visual C++ 6.0 compiler.
Joachims' package SVMlight (version 2.01) with a default working set size of 10 is used to test the decomposition method. The CPU time of all algorithms is measured on an unloaded 266 MHz Pentium II processor running Windows NT 4.

The chunking algorithm uses the projected conjugate gradient algorithm as its QP solver, as suggested by Burges [1]. All algorithms use sparse dot product code and kernel caching, as appropriate [1, 2]. Both SMO and chunking share folded linear SVM code.

The SMO algorithm is tested on three real-world data sets. The results of the experiments are shown in Tables 1 and 2. Further tests on artificial data sets can be found in [8, 7].

The first test set is the UCI Adult data set [5]. The SVM is given 14 attributes of a census form of a household and asked to predict whether that household has an income greater than $50,000. Out of the 14 attributes, eight are categorical and six are continuous. The six continuous attributes are discretized into quintiles, yielding a total of 123 binary attributes.

The second test set is text categorization: classifying whether a web page belongs to a category or not. Each web page is represented as 300 sparse binary keyword attributes.

The third test set is the MNIST database of handwritten digits, from AT&T Research Labs [4]. One classifier of MNIST, class 8, is trained.
The inputs are 784-dimensional non-binary vectors and are stored as sparse vectors. A fifth-order polynomial kernel is used to match the AT&T accuracy results.

The Adult set and the Web set are trained both with linear SVMs and Gaussian SVMs with variance of 10. For the Adult and Web data sets, the C parameter is chosen to optimize accuracy on a validation set. Experiments on the Adult and Web sets are performed with and without sparse inputs and with and without kernel caching, in order to determine the effect these kernel optimizations have on computation time. When a kernel cache is used, the cache size for SMO and SVMlight is 40 megabytes. The chunking algorithm always uses kernel caching: matrix values from the previous QP step are re-used. For the linear experiments, SMO does not use kernel caching, while SVMlight does.

Table 2: Timings of algorithms on various data sets.

Experiment     SMO Time  SVMlight    Chunking    SMO       SVMlight  Chunking
               (sec)     Time (sec)  Time (sec)  Scaling   Scaling   Scaling
                                                 Exponent  Exponent  Exponent
AdultLin       13.7      217.9       20711.3     1.8       2.1       3.1
AdultLinD      21.9      n/a         21141.1     1.0       n/a       3.0
WebLin         339.9     3980.8      17164.7     1.6       2.2       2.5
WebLinD        4589.1    n/a         17332.8     1.5       n/a       2.5
AdultGaussK    442.4     284.7       11910.6     2.0       2.0       2.9
AdultGauss     523.3     737.5       n/a         2.0       2.0       n/a
AdultGaussKD   1433.0    n/a         14740.4     2.5       n/a       2.8
AdultGaussD    1810.2    n/a         n/a         2.0       n/a       n/a
WebGaussK      2477.9    2949.5      23877.6     1.6       2.0       2.0
WebGauss       2538.0    6923.5      n/a         1.6       1.8       n/a
WebGaussKD     23365.3   n/a         50371.9     2.6       n/a       2.0
WebGaussD      24758.0   n/a         n/a         1.6       n/a       n/a
MNIST          19387.9   38452.3     33109.0     n/a       n/a       n/a
\nIn Table 2, the scaling of each algorithm is measured as a function of the training set size, \nwhich is varied by taking random nested subsets of the full training set. A line is fitted \nto the log of the training time versus the log of the set size. The slope of the line is an \nempirical scaling exponent. \n\n4 CONCLUSIONS \n\nAs can be seen in Table 2, standard PCG chunking is slower than SMa for the data sets \nshown, even for dense inputs. Decomposition and SMa have the advantage, over standard \nPCG chunking, of ignoring the examples whose Lagrange multipliers are at C. This ad(cid:173)\nvantage is reflected in the scaling exponents for PCG chunking versus SMa and SVMlight . \nPCG chunking can be altered to have a similar property [3]. Notice that PCG chunking uses \nthe same sparse dot product code and linear SVM folding code as SMa. However, these \noptimizations do not speed up PCG chunking due to the overhead of numerically solving \nlarge QP sub-problems. \n\nSMa and SVM1ight are similar: they decompose the large QP problem into very small QP \nsub-problems. SMa decomposes into even smaller sub-problems: it uses analytical solu(cid:173)\ntions of two-dimensional sub-problems, while SVMlight uses numerical QP to solve 10-\ndimensional sub-problems. The difference in timings between the two methods is partly \ndue to the numerical QP overhead, but mostly due to the difference in heuristics and kernel \noptimizations. For example, SMa is faster than SVMlight by an order of magnitude on \n\n\fAnalytic QP and Sparseness to Speed Training of Support Vector Machines \n\n563 \n\nlinear problems, due to linear SVM folding. However, SVMlight can also potentially use \nlinear SVM folding . In these experiments, SMO uses a very simple least-recently-used ker(cid:173)\nnel cache of Hessian rows, while SVMlight uses a more complex kernel cache and modifies \nits heuristics to utilize the kernel effectively [2]. 
Therefore, SMO does not benefit from the kernel cache at the largest problem sizes, while SVMlight speeds up by a factor of 2.5.

Utilizing sparseness to compute kernels yields a large advantage for SMO due to the lack of heavy numerical QP overhead. For the sparse data sets shown, SMO speeds up by a factor of between 3 and 13, while PCG chunking only obtains a maximum speed-up of 2.1 times.

The MNIST experiments were performed without a kernel cache, because the MNIST data set takes up most of the memory of the benchmark machine. Due to sparse inputs, SMO is a factor of 1.7 faster than PCG chunking, even though none of the Lagrange multipliers are at C. On a machine with more memory, SVMlight would be as fast or faster than SMO for MNIST, due to kernel caching.

In summary, SMO is a simple method for training support vector machines which does not require a numerical QP library. Because its CPU time is dominated by kernel evaluation, SMO can be dramatically quickened by the use of kernel optimizations, such as linear SVM folding and sparse dot products. SMO can be anywhere from 1.7 to 1500 times faster than the standard PCG chunking algorithm, depending on the data set.

Acknowledgements

Thanks to Chris Burges for running data sets through his projected conjugate gradient code and for various helpful suggestions.

References

[1] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 1998.

[2] T. Joachims. Making large-scale SVM learning practical. In B. Scholkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 169-184. MIT Press, 1998.

[3] L. Kaufman. Solving the quadratic programming problem arising in support vector classification. In B. Scholkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 147-168.
MIT Press, 1998.

[4] Y. LeCun. MNIST handwritten digit database. Available on the web at http://www.research.att.com/~yann/ocr/mnist/.

[5] C. J. Merz and P. M. Murphy. UCI repository of machine learning databases, 1998. [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.

[6] E. Osuna, R. Freund, and F. Girosi. Improved training algorithm for support vector machines. In Proc. IEEE Neural Networks in Signal Processing '97, 1997.

[7] J. C. Platt. Fast training of SVMs using sequential minimal optimization. In B. Scholkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 185-208. MIT Press, 1998.

[8] J. C. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, Microsoft Research, 1998. Available at http://www.research.microsoft.com/~jplatt/smo.html.

[9] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.
", "award": [], "sourceid": 1577, "authors": [{"given_name": "John", "family_name": "Platt", "institution": null}]}