{"title": "Parallel Support Vector Machines: The Cascade SVM", "book": "Advances in Neural Information Processing Systems", "page_first": 521, "page_last": 528, "abstract": null, "full_text": " \n\n \n\n \n\nParallel Support Vector Machines: \n\nThe Cascade SVM \n\nHans Peter Graf, Eric Cosatto, \n\nLeon Bottou, Igor Durdanovic, Vladimir Vapnik \n\n4 Independence Way, Princeton, NJ 08540 \n\n{hpg, cosatto, leonb, igord, vlad}@nec-labs.com \n\n \n\nNEC Laboratories \n\nAbstract \n\nWe describe an algorithm for support vector machines (SVM) that \ncan be parallelized efficiently and scales to very large problems with \nhundreds of thousands of training vectors. Instead of analyzing the \nwhole training set in one optimization step, the data are split into \nsubsets and optimized separately with multiple SVMs. The partial \nresults are combined and filtered again in a \u2018Cascade\u2019 of SVMs, until \nthe global optimum is reached. The Cascade SVM can be spread over \nmultiple processors with minimal communication overhead and \nrequires far less memory, since the kernel matrices are much smaller \nthan for a regular SVM. Convergence to the global optimum is \nguaranteed with multiple passes through the Cascade, but already a \nsingle pass provides good generalization. A single pass is 5x \u2013 10x \nfaster than a regular SVM for problems of 100,000 vectors when \nimplemented on a single processor. Parallel implementations on a \ncluster of 16 processors were tested with over 1 million vectors \n(2-class problems), converging in a day or two, while a regular SVM \nnever converged in over a week. \n\n1 Introduction \nSupport Vector Machines [1] are powerful classification and regression tools, but \ntheir compute and storage requirements increase rapidly with the number of training \nvectors, putting many problems of practical interest out of their reach. 
The core of an SVM is a quadratic programming problem (QP), separating support vectors from the rest of the training data. General-purpose QP solvers tend to scale with the cube of the number of training vectors (O(k^3)). Specialized algorithms, typically based on gradient descent methods, achieve impressive gains in efficiency, but still become impractically slow for problem sizes on the order of 100,000 training vectors (2-class problems). \n\nOne approach for accelerating the QP is based on \u2018chunking\u2019 [2][3][4], where subsets of the training data are optimized iteratively, until the global optimum is reached. \u2018Sequential Minimal Optimization\u2019 (SMO) [5], which reduces the chunk size to 2 vectors, is the most popular of these algorithms. Eliminating non-support vectors early during the optimization process is another strategy that provides substantial savings in computation. Efficient SVM implementations incorporate steps known as \u2018shrinking\u2019 for identifying non-support vectors early [4][6][7]. In combination with caching of the kernel data, such techniques reduce the computation requirements by orders of magnitude. Another approach, named \u2018digesting\u2019, optimizes subsets closer to completion before adding new data [8], saving considerable amounts of storage. \n\nImproving compute speed through parallelization is difficult due to dependencies between the computation steps. Parallelizations have been proposed by splitting the problem into smaller subsets and training a network to assign samples to different subsets [9]. Variations of the standard SVM algorithm, such as the Proximal SVM, have been developed that are better suited for parallelization [10], but how widely they are applicable, in particular to high-dimensional problems, remains to be seen. A parallelization scheme was proposed where the kernel matrix is approximated by a block-diagonal matrix [11]. 
A technique called the variable projection method [12] looks promising for improving the parallelization of the optimization loop. \n\nIn order to break through the limits of today\u2019s SVM implementations we developed a distributed architecture, where smaller optimizations are solved independently and can be spread over multiple processors, yet the ensemble is guaranteed to converge to the globally optimal solution. \n\n2 The Cascade SVM \n\nAs mentioned above, eliminating non-support vectors early from the optimization proved to be an effective strategy for accelerating SVMs. Using this concept we developed a filtering process that can be parallelized efficiently. After evaluating multiple techniques, such as projections onto subspaces (in feature space) or clustering techniques, we opted to use SVMs as filters. This makes it straightforward to drive partial solutions towards the global optimum, while alternative techniques may optimize criteria that are not directly relevant for finding the global solution. \n\nFigure 1: Schematic of a binary Cascade architecture. The data are split into subsets and each one is evaluated individually for support vectors in the first layer. The results are combined two-by-two and entered as training sets for the next layer. The resulting support vectors are tested for global convergence by feeding the result of the last layer into the first layer, together with the non-support vectors. TD: Training data, SVi: Support vectors produced by optimization i. 
\n\nWe initialize the problem with a number of independent, smaller optimizations and combine the partial results in later stages in a hierarchical fashion, as shown in Figure 1. Splitting the data and combining the results can be done in many different ways. Figure 1 merely represents one possible architecture, a binary Cascade that proved to be efficient in many tests. It is guaranteed to advance the optimization function in every layer, requires only modest communication from one layer to the next, and converges to a good solution quickly. \n\nIn the architecture of Figure 1, sets of support vectors from two SVMs are combined and the optimization proceeds by finding the support vectors in each of the combined subsets. This continues until only one set of vectors is left. Often a single pass through this Cascade produces satisfactory accuracy, but if the global optimum has to be reached, the result of the last layer is fed back into the first layer. Each of the SVMs in the first layer receives all the support vectors of the last layer as inputs and tests its fraction of the input vectors to see whether any of them have to be incorporated into the optimization. If this is not the case for any SVM of the input layer, the Cascade has converged to the global optimum; otherwise it proceeds with another pass through the network. \n\nIn this architecture a single SVM never has to deal with the whole training set. If the filters in the first few layers are efficient in extracting the support vectors, then the largest optimization, the one of the last layer, has to handle only a few more vectors than the number of actual support vectors. Therefore, in problems where the support vectors are a small subset of the training vectors - which is usually the case - each of the sub-problems is much smaller than the whole problem (compare section 4). \n\n2.1 Notation (2-class, maximum margin) \n\nWe discuss here the 2-class classification problem, solved in dual formulation. The Cascade does not depend on details of the optimization algorithm, and alternative formulations or regression algorithms map equally well onto this architecture. The 2-class problem is the most difficult one to parallelize because there is no natural split into sub-problems; multi-class problems can always be separated into 2-class problems. \n\nLet us consider a set of l training examples (x_i, y_i), where x_i \u2208 R^d represents a d-dimensional pattern and y_i = \u00b11 the class label. K(x_i, x_j) is the matrix of kernel values between patterns and \u03b1_i the Lagrange coefficients to be determined by the optimization. The SVM solution for this problem consists in maximizing the following quadratic optimization function (dual formulation): \n\nmax_\u03b1 W(\u03b1) = \u22121/2 \u2211_{i=1}^{l} \u2211_{j=1}^{l} \u03b1_i \u03b1_j y_i y_j K(x_i, x_j) + \u2211_{i=1}^{l} \u03b1_i   (1) \n\nsubject to: 0 \u2264 \u03b1_i \u2264 C, \u2200 i, and \u2211_{i=1}^{l} y_i \u03b1_i = 0. \n\nThe gradient G = \u2207W(\u03b1) of W with respect to \u03b1 is then: \n\nG_i = \u2202W/\u2202\u03b1_i = 1 \u2212 y_i \u2211_{j=1}^{l} \u03b1_j y_j K(x_i, x_j)   (2) \n\n2.2 Formal proof of convergence \n\nThe main issue is whether a Cascade architecture will actually converge to the global optimum. The following theorems show that this is the case for a wide range of conditions. Let S denote a subset of the training set \u03a9, let W(S) be the optimal objective function over S (equation 1), and let Sv(S) \u2282 S be the subset of S for which the optimal \u03b1 are non-zero (support vectors of S). It is obvious that: \n\n\u2200 S \u2282 \u03a9: W(Sv(S)) = W(S) \u2264 W(\u03a9)   (3) \n\nLet us consider a family F of sets of training examples for which we can independently compute the SVM solution. The set S* \u2208 F that achieves the greatest W(S) will be called the best set in family F. We will write W(F) as a shorthand for W(S*), that is: \n\nW(F) = max_{S \u2208 F} W(S) \u2264 W(\u03a9)   (4) \n\nWe are interested in defining a sequence of families F_t such that W(F_t) converges to the optimum. Two results are relevant for proving convergence. \n\nTheorem 1: Let us consider two families F and G of subsets of \u03a9. If a set T \u2208 G contains the support vectors of the best set S*_F \u2208 F, then W(G) \u2265 W(F). \n\nProof: Since Sv(S*_F) \u2282 T, we have W(F) = W(S*_F) = W(Sv(S*_F)) \u2264 W(T) \u2264 W(G). \u220e \n\nTheorem 2: Let us consider two families F and G of subsets of \u03a9. Assume that every set T \u2208 G contains the support vectors of the best set S*_F \u2208 F. If W(G) = W(F), then W(\u222a_{T \u2208 G} T) = W(F). \n\nProof: Theorem 1 implies that W(G) \u2265 W(F). Consider a vector \u03b1* solution of the SVM problem restricted to the support vectors Sv(S*_F). For all T \u2208 G, we have W(T) \u2265 W(Sv(S*_F)) because Sv(S*_F) is a subset of T. We also have W(T) \u2264 W(G) = W(F) = W(Sv(S*_F)). Therefore W(T) = W(Sv(S*_F)). This implies that \u03b1* is also a solution of the SVM on set T. Therefore \u03b1* satisfies all the KKT conditions corresponding to all sets T \u2208 G. This implies that \u03b1* also satisfies the KKT conditions for the union of all sets in G. \u220e \n\nDefinition 1: A Cascade is a sequence (F_t) of families of subsets of \u03a9 satisfying: \n\ni) For all t > 1, a set T \u2208 F_t contains the support vectors of the best set in F_{t-1}. \n\nii) For all t, there is a k > t such that: \n\n\u2022 All sets T \u2208 F_k contain the support vectors of the best set in F_{k-1}. \n\n\u2022 The union of all sets in F_k is equal to \u03a9. \n\nTheorem 3: A Cascade (F_t) converges to the SVM solution of \u03a9 in finite time, namely: \u2203 t* such that \u2200 t > t*: W(F_t) = W(\u03a9). \n\nProof: Assumption i) of Definition 1 plus Theorem 1 imply that the sequence W(F_t) is monotonically increasing. The sequence W(F_t) takes its values in the finite set of the W(S) for all S \u2282 \u03a9; since it is bounded by W(\u03a9), it converges to some value W* \u2264 W(\u03a9). Therefore there is an l > 0 such that \u2200 t > l: W(F_t) = W*. This observation, assertion ii) of Definition 1, plus Theorem 2 imply that there is a k > l such that W(F_k) = W(\u03a9). Since W(F_t) is monotonically increasing, W(F_t) = W(\u03a9) for all t > k. \u220e \n\nAs stated in Theorem 3, a layered Cascade architecture is guaranteed to converge to the global optimum if we keep the best set of support vectors produced in one layer and use it in at least one of the subsets of the next layer. This is the case in the binary Cascade shown in Figure 1. However, not all layers meet assertion ii) of Definition 1: the union of the sets in a layer is not equal to the whole training set, except in the first layer. 
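The layered filtering described above can be sketched in a few lines of NumPy. The sketch below is illustrative only, not the paper's implementation: it uses a simplified dual solver (projected gradient ascent on a no-bias variant of the dual (1), so the equality constraint drops and only the box constraints remain), a linear kernel, and hypothetical function names `solve_dual` and `cascade_pass` of our own choosing.

```python
import numpy as np

def solve_dual(K, y, C=1.0, steps=3000):
    """Projected gradient ascent on the dual (1).

    Simplified for this sketch: the equality constraint sum(alpha_i y_i) = 0
    is dropped, which corresponds to an SVM without bias term.
    """
    Q = np.outer(y, y) * K
    lr = 1.0 / np.linalg.norm(Q)     # Frobenius norm bounds the curvature
    alpha = np.zeros(len(y))
    for _ in range(steps):
        grad = 1.0 - Q @ alpha       # gradient (2) for the no-bias dual
        alpha = np.clip(alpha + lr * grad, 0.0, C)  # project onto [0, C]
    return alpha

def cascade_pass(X, y, n_subsets=4, C=1.0, seed=0):
    """One pass through a binary Cascade (Figure 1)."""
    kernel = lambda A, B: A @ B.T    # linear kernel keeps the toy example simple
    # Filter a set of sample indices down to its support vectors (alpha > 0).
    sv = lambda m: m[solve_dual(kernel(X[m], X[m]), y[m], C) > 1e-6]
    rng = np.random.default_rng(seed)
    groups = np.array_split(rng.permutation(len(y)), n_subsets)
    groups = [sv(g) for g in groups]           # first layer: filter each subset
    while len(groups) > 1:                     # later layers: merge two-by-two
        groups = [sv(np.concatenate(groups[i:i + 2]))
                  for i in range(0, len(groups), 2)]
    return groups[0]                 # indices of the surviving support vectors
```

On a separable toy set, a single pass returns far fewer indices than the training set size, illustrating the filtering effect; a full implementation would add the feedback loop and KKT test of Definition 1 on top of this pass.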
By introducing the feedback loop that enters the result of the last layer into the first one, combined with all non-support vectors, we fulfill all assertions of Definition 1. We can test for global convergence in layer 1 and do a fast filtering in the subsequent layers. \n\n2.3 Interpretation of the SVM filtering process \n\nAn intuitive picture of the filtering process is provided in Figure 2. If a subset S \u2282 \u03a9 is chosen randomly from the training set, it will most likely not contain all support vectors of \u03a9, and its support vectors may not be support vectors of the whole problem. However, if there is not a serious bias in a subset, support vectors of S are likely to contain some support vectors of the whole problem. Stated differently, it is plausible that \u2018interior\u2019 points in a subset are going to be \u2018interior\u2019 points in the whole set. Therefore, a non-support vector of a subset has a good chance of being a non-support vector of the whole set, and we can eliminate it from further analysis. \n\nFigure 2: A toy problem illustrating the filtering process. Two disjoint subsets are selected from the training data and each of them is optimized individually (left, center; the data selected for the optimizations are the solid elements). The support vectors in each of the subsets are marked with frames. They are combined for the final optimization (right), resulting in a classification boundary (solid curve) close to the one for the whole problem (dashed curve). \n\n3 Distributed Optimization \n\nW_i = \u22121/2 \u03b1_i^T Q_i \u03b1_i + e_i^T \u03b1_i ;  G_i = \u2212Q_i \u03b1_i + e_i   (5) \n\nFigure 3: A Cascade with two input sets D1, D2. 
W_i, G_i and Q_i are the objective function, gradient, and kernel matrix, respectively, of SVM_i (in vector notation); e_i is a vector of all 1s. The gradients of SVM1 and SVM2 are merged (Extend) as indicated in (6) and are entered into SVM3. Support vectors of SVM3 are used to test D1, D2 for violations of the KKT conditions. Violators are combined with the support vectors for the next iteration. \n\nSection 2 shows that a distributed architecture like the Cascade indeed converges to the global solution, but no indication is given of how efficient this approach is. For good performance we try to advance the optimization as much as possible in each stage. This depends on how the data are split initially, how partial results are merged, and how well an optimization can start from the partial results provided by the previous stage. We focus on gradient-ascent algorithms here, and discuss how to handle merging efficiently. \n\n3.1 Merging subsets \n\nFor this discussion we look at a Cascade with two layers (Figure 3). When merging the two results of SVM1 and SVM2, we can initialize the optimization of SVM3 to different starting points. 
In the general case the merged set starts with the following optimization function and gradient: \n\nW_3 = \u22121/2 [\u03b1_1; \u03b1_2]^T [Q_1, Q_12; Q_21, Q_2] [\u03b1_1; \u03b1_2] + [e_1; e_2]^T [\u03b1_1; \u03b1_2] \n\nG = \u2212[Q_1, Q_12; Q_21, Q_2] [\u03b1_1; \u03b1_2] + [e_1; e_2]   (6) \n\nWe consider two possible initializations: \n\nCase 1: \u03b1_1 = \u03b1_1 of SVM1 ; \u03b1_2 = 0. \n\nCase 2: \u03b1_1 = \u03b1_1 of SVM1 ; \u03b1_2 = \u03b1_2 of SVM2.   (7) \n\nSince each of the subsets fulfills the KKT conditions, each of these cases represents a feasible starting point with \u2211_i y_i \u03b1_i = 0. \n\nIntuitively one would probably assume that case 2 is the preferred one, since we start from a point that is optimal in the two spaces defined by the vectors of D1 and D2. If Q_12 is 0 (Q_21 is then also 0, since the kernel matrix is symmetric), the two spaces are orthogonal (in feature space) and the sum of the two solutions is the solution of the whole problem. Therefore, case 2 is indeed the best choice for initialization in this situation, because it represents the final solution. If, on the other hand, the two subsets are identical, then an initialization with case 1 is optimal, since this now represents the solution of the whole problem. 
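The merged objective relates to the per-subset objectives through the cross-kernel block Q12 of equation (6): W3 at the stacked point [\u03b1_1; \u03b1_2] equals W1 + W2 minus the cross term \u03b1_1^T Q_12 \u03b1_2, which vanishes exactly when Q12 = 0 (the orthogonal case discussed above). A quick numeric check, using hypothetical random data of our own invention for the two subsets, confirms this identity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data standing in for the two subsets D1 (5 points), D2 (7 points).
X1, X2 = rng.normal(size=(5, 3)), rng.normal(size=(7, 3))
y1, y2 = np.sign(rng.normal(size=5)), np.sign(rng.normal(size=7))
a1, a2 = rng.uniform(0, 1, size=5), rng.uniform(0, 1, size=7)

lin = lambda A, B: A @ B.T                  # linear kernel for simplicity
Q1 = np.outer(y1, y1) * lin(X1, X1)
Q2 = np.outer(y2, y2) * lin(X2, X2)
Q12 = np.outer(y1, y2) * lin(X1, X2)        # cross-kernel block; Q21 = Q12^T

def W(Q, a):
    # Per-subset objective of equation (5): W = -1/2 a^T Q a + e^T a
    return -0.5 * a @ Q @ a + a.sum()

# Merged objective of equation (6) on the stacked vector [a1; a2].
Q3 = np.block([[Q1, Q12], [Q12.T, Q2]])
a3 = np.concatenate([a1, a2])
W3 = -0.5 * a3 @ Q3 @ a3 + a3.sum()

# W3 differs from the sum of the parts only by the cross term a1^T Q12 a2.
cross = a1 @ Q12 @ a2
```

This makes the intuition behind case 2 concrete: the case-2 starting value of W3 is the sum of the two converged subset objectives corrected by a single cross term.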
In general we are somewhere between these two cases, and therefore it is not obvious which case is best. \n\nWhile the theorems of section 2 guarantee the convergence to the global optimum, they do not provide any indication how fast this is going to happen. Empirically we find that the Cascade converges quickly to the global solution, as is indicated in the examples below. All the problems we tested converge in 2 to 5 passes. \n\n4 Experimental results \n\nWe implemented the Cascade architecture for a single processor as well as for a cluster of processors and tested it extensively with several problems; the largest are: MNIST(1), FOREST(2), NORB(3) (all converted to 2-class problems). One of the main advantages of the Cascade architecture is that it requires far less memory than a single SVM, because the size of the kernel matrix scales with the square of the active set. This effect is illustrated in Figure 4. It has to be emphasized that both cases, single SVM and Cascade, use shrinking, but shrinking alone does not solve the problem of exorbitant sizes of the kernel matrix. \n\nA good indication of the Cascade\u2019s inherent efficiency is obtained by counting the number of kernel evaluations required for one pass. As shown in Table 1, a 9-layer Cascade requires only about 30% as many kernel evaluations as a single SVM for 100,000 training vectors. How many kernel evaluations actually have to be computed depends on the caching strategy and the memory size. \n\n(1) MNIST: handwritten digits, d=784 (28x28 pixels); training: 60,000; testing: 10,000; classes: odd digits versus even digits; http://yann.lecun.com/exdb/mnist. \n\n(2) FOREST: d=54; class 2 versus rest; training: 560,000; testing: 58,100; ftp://ftp.ics.uci.edu/pub/machine-learning-databases/covtype/covtype.info. \n\n(3) NORB: images, d=9,216; training: 48,600; testing: 48,600; monocular; merged classes 0 and 1 versus the rest; http://www.cs.nyu.edu/~ylclab/data/norb-v1.0. 
\n\nFigure 4: The size of the active set as a function of the number of iterations for a problem with 30,000 training vectors. The upper curve represents a single SVM, while the lower one shows the active set size for a 4-layer Cascade. \n\nAs indicated in Table 1, this parameter, and with it the compute times, are reduced even more. Therefore, even a simulation on a single processor can produce a speed-up of 5x to 10x or more, depending on the available memory size. For practical purposes often a single pass through the Cascade produces sufficient accuracy (compare Figure 5). This offers a particularly simple way for solving problems of a size that would otherwise be out of reach for SVMs. \n\nNumber of layers:       1    2    3    4    5    6    7    8    9 \nK-eval requests x10^9:  106  89   77   68   61   55   48   42   38 \nK-eval x10^9:           33   12   4.5  3.9  2.7  2.4  1.9  1.6  1.4 \n\nTable 1: Number of kernel evaluations (requests and actual, with a cache size of 800MB) for different numbers of layers in the Cascade (single pass). The number of kernel evaluations is reduced as the number of Cascade layers increases; larger portions of the problem then fit in the cache, reducing the actual kernel computations even more. Problem: FOREST, 100K vectors. \n\nIteration  Training time  Max # training vect. per machine  # Support vectors  W       Acc. \n0          21.6h          72,658                            54,647             167427  99.08% \n1          22.2h          67,876                            61,084             174560  99.14% \n2          0.8h           61,217                            61,102             174564  99.13% \n\nTable 2: Training times for a large data set with 1,016,736 vectors (MNIST was expanded by warping the handwritten digits). A Cascade with 5 layers is executed on a Linux cluster with 16 machines (AMD 1800, dual processors, 2GB RAM per machine). The solution converges in 3 iterations. 
Shown are also the maximum number of training vectors on one machine and the number of support vectors in the last stage. W: optimization function; Acc: accuracy on the test set. Kernel: RBF, gamma=1; C=50. \n\nTable 2 shows how a problem with over one million vectors is solved in about a day (single pass) with a generalization performance equivalent to the fully converged solution. While the full training set contains over 1M vectors, one processor never handles more than 73k vectors in the optimization and 130k for the convergence test. \n\nThe Cascade provides several advantages over a single SVM because it can reduce compute as well as storage requirements. The main limitation is that the last layer consists of one single optimization, and its size has a lower limit given by the number of support vectors. This is why the acceleration saturates at a relatively small number of layers. Yet this is not a hard limit, since a single optimization can be distributed over multiple processors as well, and we are working on efficient implementations of such algorithms. \n\nFigure 5: Speed-up for a parallel implementation of the Cascades with 1 to 5 layers (1 to 16 SVMs, each running on a separate processor), relative to a single SVM: single pass (left), fully converged (middle) (MNIST, NORB: 3 iterations, FOREST: 5 iterations). On the right is the generalization performance of a 5-layer Cascade, measured after each iteration. For MNIST and NORB, the accuracy after one pass is the same as after full convergence (3 iterations). For FOREST, the accuracy improves from 90.6% after a single pass to 91.6% after convergence (5 iterations). Training set sizes: MNIST: 60k, NORB: 48k, FOREST: 186k. \n\nReferences \n\n[1] V. Vapnik, \u201cStatistical Learning Theory\u201d, Wiley, New York, 1998. \n\n[2] B. Boser, I. Guyon, V. Vapnik, \u201cA training algorithm for optimal margin classifiers\u201d, in Proc. 
5th Annual Workshop on Computational Learning Theory, Pittsburgh, ACM, 1992. \n\n[3] E. Osuna, R. Freund, F. Girosi, \u201cTraining Support Vector Machines, an Application to Face Detection\u201d, in Computer Vision and Pattern Recognition, pp. 130-136, 1997. \n\n[4] T. Joachims, \u201cMaking large-scale support vector machine learning practical\u201d, in Advances in Kernel Methods, B. Sch\u00f6lkopf, C. Burges, A. Smola (eds.), Cambridge, MIT Press, 1998. \n\n[5] J. C. Platt, \u201cFast training of support vector machines using sequential minimal optimization\u201d, in Advances in Kernel Methods, B. Sch\u00f6lkopf, C. Burges, A. Smola (eds.), 1998. \n\n[6] C. Chang, C. Lin, \u201cLIBSVM\u201d, http://www.csie.ntu.edu.tw/~cjlin/libsvm/. \n\n[7] R. Collobert, S. Bengio, J. Mari\u00e9thoz, \u201cTorch: A modular machine learning software library\u201d, Technical Report IDIAP-RR 02-46, IDIAP, 2002. \n\n[8] D. DeCoste, B. Sch\u00f6lkopf, \u201cTraining Invariant Support Vector Machines\u201d, Machine Learning, 46, 161-190, 2002. \n\n[9] R. Collobert, Y. Bengio, S. Bengio, \u201cA Parallel Mixture of SVMs for Very Large Scale Problems\u201d, in Neural Information Processing Systems, Vol. 17, MIT Press, 2004. \n\n[10] A. Tveit, H. Engum, \u201cParallelization of the Incremental Proximal Support Vector Machine Classifier using a Heap-based Tree Topology\u201d, Tech. Report, IDI, NTNU, Trondheim, 2003. \n\n[11] J. X. Dong, A. Krzyzak, C. Y. Suen, \u201cA Fast Parallel Optimization for Training Support Vector Machine\u201d, in Proceedings of the 3rd International Conference on Machine Learning and Data Mining, P. Perner and A. Rosenfeld (eds.), Springer Lecture Notes in Artificial Intelligence (LNAI 2734), pp. 96-105, Leipzig, Germany, July 5-7, 2003. \n\n[12] G. Zanghirati, L. Zanni, \u201cA parallel solver for large quadratic programs in training support vector machines\u201d, Parallel Computing, Vol. 29, pp. 535-551, 2003. 
\n\n\f", "award": [], "sourceid": 2608, "authors": [{"given_name": "Hans", "family_name": "Graf", "institution": null}, {"given_name": "Eric", "family_name": "Cosatto", "institution": null}, {"given_name": "L\u00e9on", "family_name": "Bottou", "institution": null}, {"given_name": "Igor", "family_name": "Dourdanovic", "institution": null}, {"given_name": "Vladimir", "family_name": "Vapnik", "institution": null}]}