{"title": "A Randomized Algorithm for Large Scale Support Vector Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 793, "page_last": 800, "abstract": "We propose a randomized algorithm for large scale SVM learning which solves the problem by iterating over random subsets of the data. Crucial to the algorithm for scalability is the size of the subsets chosen. In the context of text classification we show that, by using ideas from random projections, a sample size of O(log n) can be used to obtain a solution which is close to the optimal with a high probability. Experiments done on synthetic and real life data sets demonstrate that the algorithm scales up SVM learners, without loss in accuracy.", "full_text": "A Randomized Algorithm for Large Scale Support Vector Learning\n\nKrishnan S.\nDepartment of Computer Science and Automation, Indian Institute of Science, Bangalore-12\nkrishi@csa.iisc.ernet.in\n\nChiranjib Bhattacharyya\nDepartment of Computer Science and Automation, Indian Institute of Science, Bangalore-12\nchiru@csa.iisc.ernet.in\n\nRamesh Hariharan\nStrand Genomics, Bangalore-80\nramesh@strandls.com\n\nAbstract\n\nThis paper investigates the application of randomized algorithms to large scale SVM learning. The key contribution of the paper is to show, using ideas from random projections, that the minimal number of support vectors required to solve almost separable classification problems, such that the solution obtained is near optimal with very high probability, is O(log n). A dataset is called almost separable if it becomes linearly separable on removal of O(log n) properly chosen points. The second contribution is a sampling based algorithm, motivated by randomized algorithms, which solves an SVM problem by considering subsets of the dataset which are greater in size than the number of support vectors for the problem.
These two ideas are combined to obtain an algorithm for SVM classification problems which performs the learning by considering only O(log n) points at a time. Experiments on synthetic and real life datasets show that the algorithm scales up state of the art SVM solvers in terms of memory required and execution time without loss in accuracy. The algorithm presented here complements existing large scale SVM learning approaches, as it can be used to scale up any SVM solver.\n\n1 Introduction\n\nConsider a training dataset $D = \\{(x_i, y_i)\\}$, $i = 1, \\ldots, n$, with $y_i \\in \\{+1, -1\\}$, where $x_i \\in \\mathbb{R}^d$ are data points and $y_i$ specify the class labels. The problem of learning the classifier, $y = \\mathrm{sign}(w^T x + b)$, can be narrowed down to computing $\\{w, b\\}$ such that it has good generalization ability. The SVM formulation for classification, which will be called C-SVM, for determining $\\{w, b\\}$ is given by [1]\n\nC-SVM-1:\n\n$$\\min_{(w, b, \\xi)} \\; \\frac{1}{2}\\|w\\|^2 + C \\sum_{i=1}^{n} \\xi_i$$\n\n$$\\text{subject to: } y_i(w \\cdot x_i + b) \\geq 1 - \\xi_i, \\; \\xi_i \\geq 0, \\; i = 1, \\ldots, n$$\n\nAt optimality $w$ is given by\n\n$$w = \\sum_{i: \\alpha_i > 0} \\alpha_i y_i x_i, \\quad 0 \\leq \\alpha_i \\leq C$$\n\nConsider the set $S = \\{x_i \\mid \\alpha_i > 0\\}$; the elements of this set are called the support vectors. Note that $S$ completely determines the solution of C-SVM. The set $S$ may not be unique, though $w$ is. Define a parameter $\\Delta$ to be the minimum cardinality over all $S$. Note that $\\Delta$ does not change with the number of examples, $n$, and is often much less than $n$.\n\nMore generally, the C-SVM problem can be seen as an instance of an Abstract Optimization Problem (AOP) [2, 3, 4].
An AOP is defined as follows:\n\nAn AOP is a triple $(H, <, \\Phi)$ where $H$ is a finite set, $<$ a total ordering on $2^H$, and $\\Phi$ an oracle that, for a given $F \\subseteq G \\subseteq H$, either reports $F = \\min_{<}\\{F' \\mid F' \\subseteq G\\}$ or returns a set $F' \\subseteq G$ with $F' < F$.\n\nMany SVM learning problems are AOPs; algorithms developed for AOPs can be used for solving SVM problems. Every AOP has a combinatorial dimension associated with it; the combinatorial dimension captures the notion of the number of free variables for that AOP. An AOP can be solved by a randomized algorithm by selecting subsets of size greater than the combinatorial dimension of the problem [2].\n\nFor SVM, $\\Delta$ is the combinatorial dimension of the problem; by iterating over subsets of size greater than $\\Delta$, the subsets chosen using random sampling, the problem can be solved efficiently [3, 4]; this algorithm was called RandSVM by the authors. A priori the value of $\\Delta$ is not known, but for linearly separable classification problems the following holds: $2 \\leq \\Delta \\leq d + 1$. This follows from the fact that the dual problem is the minimum distance between 2 non-overlapping convex hulls [5]. When the problem is not linearly separable, the authors use the reduced convex hull formulation [5] to come up with an estimate of the combinatorial dimension; this estimate is not very clear and is much higher than $d$ (see footnote 1).
The algorithm RandSVM (see footnote 2) iterates over subsets of size proportional to $\\Delta^2$.\n\nRandSVM is not practical for the following reasons: the sample size is too large in the case of high dimensional datasets, the dimension of the feature space is usually unknown when using kernels, and the reduced convex hull method used to calculate the combinatorial dimension, when the data is not separable in the feature space, is not really useful as the number obtained is very large.\n\nThis work overcomes the above problems using ideas from random projections [6, 7] and randomized algorithms [8, 9, 2, 10]. As mentioned by the authors of RandSVM, the biggest bottleneck in their algorithm is the value of $\\Delta$, as it is too large. The main contribution is, using ideas from random projections, the conjecture that if RandSVM is solved using $\\Delta$ equal to O(log n), then the solution obtained is close to optimal with high probability (Theorem 3), in particular for almost separable datasets. Almost separable datasets are those which become linearly separable when a small number of properly chosen data points are deleted from them. The second contribution is an algorithm which, using ideas from randomized algorithms for Linear Programming (LP), solves the SVM problem by using samples of size linear in $\\Delta$. This work also shows that the theory can be applied to non-linear kernels.\n\n2 A NEW RANDOMIZED ALGORITHM FOR CLASSIFICATION\n\nThis section uses results from random projections, and randomized algorithms for linear programming, to develop a new algorithm for learning large scale SVM problems. In Section 2.1, we discuss the case of linearly separable data and estimate the number of support vectors required such that the margin is preserved with high probability, and show that this number is much smaller than the data dimension $d$, using ideas from random projections.
In Section 2.2, we look at how the analysis applies to almost separable data and present the main result of the paper (Theorem 3). The section ends with a discussion on the application of the theory to non-linear kernels. In Section 2.3, we present the randomized algorithm for SVM learning.\n\n2.1 Linearly separable data\n\nWe start by determining the dimension $k$ of the target space such that, on performing a random projection to the space, the Euclidean distances and dot products are preserved. The appendix contains a few results from random projections which will be used in this section.\n\nFootnote 1: Details of this calculation are present in the supplementary material.\nFootnote 2: Presented in the supplementary material.\n\nFor a linearly separable dataset $D = \\{(x_i, y_i), i = 1, \\ldots, n\\}$, $x_i \\in \\mathbb{R}^d$, $y_i \\in \\{+1, -1\\}$, the C-SVM formulation is the same as C-SVM-1 with $\\xi_i = 0$, $i = 1, \\ldots, n$. By dividing all the constraints by $\\|w\\|$, the problem can be reformulated as follows:\n\nC-SVM-2a:\n\n$$\\max_{(\\hat{w}, \\hat{b}, l)} \\; l \\quad \\text{subject to: } y_i(\\hat{w} \\cdot x_i + \\hat{b}) \\geq l, \\; i = 1, \\ldots, n, \\; \\|\\hat{w}\\| = 1$$\n\nwhere $\\hat{w} = \\frac{w}{\\|w\\|}$, $\\hat{b} = \\frac{b}{\\|w\\|}$, and $l = \\frac{1}{\\|w\\|}$. $l$ is the margin induced by the separating hyperplane, that is, it is the distance between the 2 supporting hyperplanes, $h_1: y_i(w \\cdot x_i + b) = 1$ and $h_2: y_i(w \\cdot x_i + b) = -1$.\n\nThe determination of $k$ proceeds as follows. First, for any given value of $k$, we show the change in the margin as a function of $k$, if the data points are projected onto the $k$ dimensional subspace and the problem solved. From this, we determine the value $k$ ($k \\ll d$) which will preserve the margin with a very high probability.
In a $k$ dimensional subspace, there are at most $k + 1$ support vectors. Using the idea of orthogonal extensions (the definition appears later in this section), we prove that when the problem is solved in the original space, using an estimate of $k + 1$ on the number of support vectors, the margin is preserved with a very high probability.\n\nLet $w'$ and $x'_i$, $i = 1, \\ldots, n$, be the projections of $\\hat{w}$ and $x_i$, $i = 1, \\ldots, n$, respectively onto a $k$ dimensional subspace (as in Lemma 2, Appendix A). The classification problem in the projected space with the dataset being $D' = \\{(x'_i, y_i), i = 1, \\ldots, n\\}$, $x'_i \\in \\mathbb{R}^k$, $y_i \\in \\{+1, -1\\}$ can be written as follows:\n\nC-SVM-2b:\n\n$$\\max_{(w', \\hat{b}, l')} \\; l' \\quad \\text{subject to: } y_i(w' \\cdot x'_i + \\hat{b}) \\geq l', \\; i = 1, \\ldots, n, \\; \\|w'\\| \\leq 1$$\n\nwhere $l' = l(1 - \\gamma)$, $\\gamma$ is the distortion and $0 < \\gamma < 1$. The following theorem predicts, for a given value of $\\gamma$, the $k$ such that the margin is preserved with a high probability upon projection.\n\nTheorem 1. Let $L = \\max_i \\|x_i\\|$ and $(w^*, b^*, l^*)$ be the optimal solution for C-SVM-2a. Let $R$ be a random $d \\times k$ matrix as given in Lemma 2 (Appendix A). Let $\\tilde{w} = \\frac{R^T w^*}{\\sqrt{k}}$ and $x'_i = \\frac{R^T x_i}{\\sqrt{k}}$, $i = 1, \\ldots, n$. If $k \\geq \\frac{8}{\\gamma^2}\\left(1 + \\frac{1 + L^2}{2 l^*}\\right)^2 \\log\\frac{4n}{\\delta}$, $0 < \\gamma < 1$, $0 < \\delta < 1$, then the following bound holds on the optimal margin $l_P$ obtained by solving the problem C-SVM-2b:\n\n$$P(l_P \\geq l^*(1 - \\gamma)) \\geq 1 - \\delta$$\n\nProof. From Corollary 1 of Lemma 2 (Appendix A), we have\n\n$$w^* \\cdot x_i - \\frac{\\epsilon}{2}(1 + L^2) \\leq \\tilde{w} \\cdot x'_i \\leq w^* \\cdot x_i + \\frac{\\epsilon}{2}(1 + L^2)$$\n\nwhich holds with probability at least $1 - 4e^{-\\epsilon^2 \\frac{k}{8}}$, for some $\\epsilon > 0$. Consider some example $x_i$ with $y_i = 1$. Then the following holds with probability at least $1 - 2e^{-\\epsilon^2 \\frac{k}{8}}$:\n\n$$\\tilde{w} \\cdot x'_i + b^* \\geq w^* \\cdot x_i - \\frac{\\epsilon}{2}(1 + L^2) + b^* \\geq l^* - \\frac{\\epsilon}{2}(1 + L^2)$$\n\nDividing the above by $\\|\\tilde{w}\\|$, we have $\\frac{\\tilde{w} \\cdot x'_i + b^*}{\\|\\tilde{w}\\|} \\geq \\frac{l^* - \\frac{\\epsilon}{2}(1 + L^2)}{\\|\\tilde{w}\\|}$. Note that from Lemma 1 (Appendix A), we have $(1 - \\epsilon)\\|w^*\\|^2 \\leq \\|\\tilde{w}\\|^2 \\leq (1 + \\epsilon)\\|w^*\\|^2$, with probability at least $1 - 2e^{-\\epsilon^2 \\frac{k}{8}}$. Since $\\|w^*\\| = 1$, we have $\\sqrt{1 - \\epsilon} \\leq \\|\\tilde{w}\\| \\leq \\sqrt{1 + \\epsilon}$. Hence we have\n\n$$\\frac{\\tilde{w} \\cdot x'_i + b^*}{\\|\\tilde{w}\\|} \\geq \\frac{l^* - \\frac{\\epsilon}{2}(1 + L^2)}{\\sqrt{1 + \\epsilon}}$$\n\n$$\\geq \\left(l^* - \\frac{\\epsilon}{2}(1 + L^2)\\right)\\sqrt{1 - \\epsilon} = l^*\\left(1 - \\frac{\\epsilon}{2 l^*}(1 + L^2)\\right)\\sqrt{1 - \\epsilon}$$\n\n$$\\geq l^*\\left(\\sqrt{1 - \\epsilon} - \\frac{\\epsilon}{2 l^*}(1 + L^2)\\right) \\geq l^*\\left(1 - \\epsilon\\left(1 + \\frac{1 + L^2}{2 l^*}\\right)\\right)$$\n\nThis holds with probability at least $1 - 4e^{-\\epsilon^2 \\frac{k}{8}}$. A similar result can be derived for a point $x_j$ for which $y_j = -1$. The above analysis guarantees that by projecting onto a $k$ dimensional space, there exists at least one hyperplane $\\left(\\frac{\\tilde{w}}{\\|\\tilde{w}\\|}, \\frac{b^*}{\\|\\tilde{w}\\|}\\right)$ which guarantees a margin of $l^*(1 - \\gamma)$, where\n\n$$\\gamma \\leq \\epsilon\\left(1 + \\frac{1 + L^2}{2 l^*}\\right) \\quad (1)$$\n\nwith probability at least $1 - 4n e^{-\\epsilon^2 \\frac{k}{8}}$. The margin $l_P$ obtained by solving the problem C-SVM-2b can only be better than this. So the value of $k$ is given by:\n\n$$4n e^{-\\frac{\\gamma^2}{\\left(1 + \\frac{1 + L^2}{2 l^*}\\right)^2} \\frac{k}{8}} \\leq \\delta \\;\\Rightarrow\\; k \\geq \\frac{8\\left(1 + \\frac{1 + L^2}{2 l^*}\\right)^2}{\\gamma^2} \\log\\frac{4n}{\\delta} \\quad (2)$$\n\nAs seen above, by randomly projecting the points onto a $k$ dimensional subspace, the margin is preserved with a high probability. This result is similar to the results obtained in work on random projections [7]. But there are fundamental differences between the method proposed in this paper and the previous methods: no random projection is actually done here, and no black box access to the data distribution is required. We use Theorem 1 to determine an estimate on the number of support vectors such that the margin is preserved with a high probability, when the problem is solved in the original space. This is given in Theorem 2 and is the main contribution of this section. The theorem is based on the following fact: in a $k$ dimensional space, the number of support vectors is upper bounded by $k + 1$. We show that this $k + 1$ can be used as an estimate of the number of support vectors in the original space such that the solution obtained preserves the margin with a high probability. We start with the following definition.\n\nDefinition. An orthogonal extension of a $(k-1)$-dimensional flat (a $(k-1)$-dimensional flat is a $(k-1)$-dimensional affine space) $h_p = (w_p, b)$, where $w_p = (w_1, \\ldots, w_k)$, in a subspace $S_k$ of dimension $k$ to a $(d-1)$-dimensional hyperplane $h = (\\tilde{w}, b)$ in $d$-dimensional space, is defined as follows. Let $R \\in \\mathbb{R}^{d \\times d}$ be a random projection matrix as in Lemma 2 (Appendix A). Let $\\hat{R} \\in \\mathbb{R}^{d \\times k}$ be another random projection matrix which consists of only the first $k$ columns of $R$.
Let $\\hat{x}_i = R^T x_i$ and $x'_i = \\frac{\\hat{R}^T x_i}{\\sqrt{k}}$, and proceed as follows: Let $w_p = (w_1, \\ldots, w_k)$ be the optimal hyperplane classifier with margin $l_P$ for the points $x'_1, \\ldots, x'_n$ in the $k$ dimensional subspace. Now define $\\tilde{w}$ to be all 0\u2019s in the last $d - k$ coordinates and identical to $w_p$ in the first $k$ coordinates, that is, $\\tilde{w} = (w_1, \\ldots, w_k, 0, \\ldots, 0)$. Orthogonal extensions have the following key property. If $(w_p, b)$ is a separator with margin $l_P$ for the projected points, then its orthogonal extension $(\\tilde{w}, b)$ is a separator with margin $l_P$ for the original points, that is,\n\nif $y_i(w_p \\cdot x'_i + b) \\geq l$, $i = 1, \\ldots, n$, then $y_i(\\tilde{w} \\cdot \\hat{x}_i + b) \\geq l$, $i = 1, \\ldots, n$.\n\nAn important point to note, which will be required when extending orthogonal extensions to non-linear kernels, is that dot products between the points are preserved upon doing orthogonal projections, that is, ${x'_i}^T x'_j = \\hat{x}_i^T \\hat{x}_j$.\n\nLet $L$, $l^*$, $\\gamma$, $\\delta$ and $n$ be as defined in Theorem 1. The following is the main result of this section.\n\nTheorem 2. Given $k \\geq \\frac{8}{\\gamma^2}\\left(1 + \\frac{1 + L^2}{2 l^*}\\right)^2 \\log\\frac{4n}{\\delta}$ and $n$ training points with maximum norm $L$ in $d$ dimensional space and separable by a hyperplane with margin $l^*$, there exists a subset of $k'$ training points $x_{1'}, \\ldots, x_{k'}$ where $k' \\leq k$ and a hyperplane $h$ satisfying the following conditions:\n\n1. $h$ has margin at least $l^*(1 - \\gamma)$ with probability at least $1 - \\delta$\n2. $x_{1'}, \\ldots, x_{k'}$ are the only training points which lie either on $h_1$ or on $h_2$\n\nProof. Let $w^*, b^*$ denote the normal to a separating hyperplane with margin $l^*$, that is, $y_i(w^* \\cdot x_i + b^*) \\geq l^*$ for all $x_i$ and $\\|w^*\\| = 1$. Consider a random projection of $x_1, \\ldots, x_n$ to a $k$ dimensional space and let $w', z_1, \\ldots, z_n$ be the projections of $w^*, x_1, \\ldots, x_n$, respectively, scaled by $1/\\sqrt{k}$.
By Theorem 1, $y_i(w' \\cdot z_i + b^*) \\geq l^*(1 - \\gamma)$ holds for all $z_i$ with probability at least $1 - \\delta$. Let $h$ be the orthogonal extension of $w', b^*$ to the full $d$ dimensional space. Then $h$ has margin at least $l^*(1 - \\gamma)$, as required. This shows the first part of the claim.\n\nTo prove the second part, consider the projected training points which lie on $w', b^*$ (that is, they lie on either of the two sandwiching hyperplanes). Barring degeneracies, there are at most $k$ such points. Clearly, these will be the only points which lie on the orthogonal extension $h$, by definition. \u25a1\n\nFrom the above analysis, it is seen that if $k \\ll d$, then we can estimate that the number of support vectors is $k + 1$, and the algorithm RandSVM would take on average $O(k \\log n)$ iterations to solve the problem [3, 4].\n\n2.2 Almost separable data\n\nIn this section, we look at how the above analysis can be applied to almost separable datasets. We call a dataset almost separable if by removing a fraction $\\kappa = O(\\frac{\\log n}{n})$ of the points, the dataset becomes linearly separable.\n\nThe C-SVM formulation when the data is not linearly separable (and almost separable) was given in C-SVM-1. This problem can be reformulated as follows:\n\n$$\\min_{(w, b, \\xi)} \\; \\sum_{i=1}^{n} \\xi_i \\quad \\text{subject to: } y_i(w \\cdot x_i + b) \\geq l - \\xi_i, \\; \\xi_i \\geq 0, \\; i = 1, \\ldots, n, \\; \\|w\\| \\leq \\frac{1}{l}$$\n\nThis formulation is known as the Generalized Optimal Hyperplane formulation. Here $l$ depends on the value of $C$ in the C-formulation. At optimality, the margin $l^* = l$. The following theorem proves a result for almost separable data similar to the one proved in Theorem 2 for separable data.\n\nTheorem 3.
Given $k \\geq \\frac{8}{\\gamma^2}\\left(1 + \\frac{1 + L^2}{2 l^*}\\right)^2 \\log\\frac{4n}{\\delta} + \\kappa n$, $l^*$ being the margin at optimality, $l$ the lower bound on $l^*$ as in the Generalized Optimal Hyperplane formulation and $\\kappa = O(\\frac{\\log n}{n})$, there exists a subset of $k'$ training points $x_{1'}, \\ldots, x_{k'}$, $k' \\leq k$, and a hyperplane $h$ satisfying the following conditions:\n\n1. $h$ has margin at least $l(1 - \\gamma)$ with probability at least $1 - \\delta$\n2. At most $\\frac{8\\left(1 + \\frac{1 + L^2}{2 l^*}\\right)^2}{\\gamma^2} \\log\\frac{4n}{\\delta}$ points lie on the planes $h_1$ or $h_2$\n3. $x_{1'}, \\ldots, x_{k'}$ are the only points which define the hyperplane $h$, that is, they are the support vectors of $h$.\n\nProof. Let the optimal solution for the generalized optimal hyperplane formulation be $(w^*, b^*, \\xi^*)$, with $w^* = \\sum_{i: \\alpha_i > 0} \\alpha_i y_i x_i$ and $l^* = \\frac{1}{\\|w^*\\|}$ as mentioned before. The set of support vectors can be split into 2 disjoint sets, $SV_1 = \\{x_i : \\alpha_i > 0 \\text{ and } \\xi^*_i = 0\\}$ (unbounded SVs), and $SV_2 = \\{x_i : \\alpha_i > 0 \\text{ and } \\xi^*_i > 0\\}$ (bounded SVs).\n\nNow, consider removing the points in $SV_2$ from the dataset. Then the dataset becomes linearly separable with margin $l^*$. Using an analysis similar to Theorem 1, and the fact that $l^* \\geq l$, we have the proof for the first 2 conditions.\n\nWhen all the points in $SV_2$ are added back, at most all these points are added to the set of support vectors and the margin does not change. The margin not changing is guaranteed by the fact that for proving conditions 1 and 2, we have assumed the worst possible margin, and any value lower than this would violate the constraints of the problem. This proves condition 3.
\u25a1\n\nHence the number of support vectors, such that the margin is preserved with high probability, can be upper bounded by\n\n$$k + 1 = \\frac{8}{\\gamma^2}\\left(1 + \\frac{1 + L^2}{2 l^*}\\right)^2 \\log\\frac{4n}{\\delta} + \\kappa n + 1 = \\frac{8}{\\gamma^2}\\left(1 + \\frac{1 + L^2}{2 l^*}\\right)^2 \\log\\frac{4n}{\\delta} + O(\\log n) \\quad (3)$$\n\nUsing a non-linear kernel. Consider a mapping function $\\Phi : \\mathbb{R}^d \\to \\mathbb{R}^{d'}$, $d' > d$, which projects a point $x_i \\in \\mathbb{R}^d$ to a point $z_i \\in \\mathbb{R}^{d'}$, where $\\mathbb{R}^{d'}$ is a Euclidean space. Let the points be projected onto a random $k$ dimensional subspace as before. Then, as in the case of linear kernels, the lemmata in the appendix are applicable to these random projections [11]. The orthogonal extensions can be considered as a projection from the $k$ dimensional space to the $\\Phi$-space, such that the kernel function values are preserved. Then it can be shown that Theorem 3 applies when using non-linear kernels also.\n\n2.3 A Randomized Algorithm\n\nThe reduction in the sample size from $6d^2$ to $6k^2$ is not enough to make RandSVM useful in practice as $6k^2$ is still a large number. This section presents another randomized algorithm which only requires that the sample size be greater than the number of support vectors. Hence a sample size linear in $k$ can be used in the algorithm.
This algorithm was first proposed to solve large scale LP problems [10]; it has been adapted for solving large scale SVM problems.\n\nAlgorithm 1 RandSVM-1(D, k, r)\nRequire: D - The dataset.\nRequire: k - The estimate of the number of support vectors.\nRequire: r - Sample size = ck, c > 0.\nS = randomsubset(D, r); // Pick a random subset, S, of size r from the dataset D\nSV = svmlearn($\\Phi$, S); // SV - set of support vectors obtained by solving the problem S\nV = $\\{x \\in D - S \\mid$ violates(x, SV)$\\}$; // violator - nonsampled point not satisfying KKT conditions\nwhile $|V| > 0$ and $|SV| < k$ do\n    R = randomsubset(V, $r - |SV|$); // Pick a random subset from the set of violators\n    SV = svmlearn(SV, R); // SV - set of support vectors obtained by solving the problem $SV \\cup R$\n    V = $\\{x \\in D - (SV \\cup R) \\mid$ violates(x, SV)$\\}$; // Determine violators from nonsampled set\nend while\nreturn SV\n\nProof of Convergence: Let $SV$ be the current set of support vectors. The condition $|SV| < k$ comes from Theorem 3. Hence if the condition is violated, then the algorithm terminates with a solution which is near optimal with a very high probability.\n\nNow consider the case where $|SV| < k$ and $|V| > 0$. Let $x_i$ be a violator ($x_i$ is a non-sampled point such that $y_i(w^T x_i + b) < 1$). Since SVM is an instance of AOP, solving the problem with the set of constraints $SV \\cup x_i$ can only result in an increase (decrease) of the objective function of the primal (dual). As there are only a finite number of bases for an AOP, the algorithm is bound to terminate; also, if termination happens with the number of violators equal to zero, then the solution obtained is optimal.\n\nDetermination of k. The value of $k$ depends on $l$, which is not available in the case of C-SVM and nu-SVM. This can be handled only by solving for $k$ as a function of $\\epsilon$, where $\\epsilon$ is the maximum allowed distortion in the L2 norms of the vectors upon projection.
If all the data points are normalized to length 1, that is, $L = 1$, then Equation 1 becomes $\\epsilon \\geq \\gamma / \\left(1 + \\frac{1 + L^2}{2 l^*}\\right)$. Combining this with the result from Theorem 2, the value of $k$ can be determined in terms of $\\epsilon$ as follows:\n\n$$k \\geq \\frac{8}{\\gamma^2}\\left(1 + \\frac{1 + L^2}{2 l^*}\\right)^2 \\log\\frac{4n}{\\delta} + O(\\log n) \\geq \\frac{16}{\\gamma^2}\\left(1 + \\frac{1 + L^2}{2 l^*}\\right)^2 \\log\\frac{4n}{\\delta} \\geq \\frac{16}{\\epsilon^2} \\log\\frac{4n}{\\delta} \\quad (4)$$\n\n3 Experiments\n\nThis section discusses the performance of RandSVM in practice. The experiments were performed on 3 synthetic datasets and 1 real world dataset. RandSVM was used with LibSVM as the solver when using a non-linear kernel, and with SVMLight for a linear kernel. This choice was made because it was observed that SVMLight is much faster than LibSVM when using a linear kernel, and vice-versa when using non-linear kernels. RandSVM has been compared with state of the art SVM solvers: LibSVM for non-linear kernels, and SVMPerf and SVMLin for linear kernels.\n\nSynthetic datasets\n\nThe twonorm dataset is a 2 class problem where each class is drawn from a multivariate normal distribution with unit variance. Each vector is a 20 dimensional vector. One class has mean $(a, a, \\ldots, a)$, and the other class has mean $(-a, -a, \\ldots, -a)$, where $a = 2/\\sqrt{20}$.\n\nThe ringnorm dataset is a 2 class problem with each vector consisting of 20 dimensions.
Each class is drawn from a multivariate normal distribution. One class has mean 1, and covariance 4 times the identity. The other class has mean $(a, a, \\ldots, a)$ and unit covariance, where $a = 2/\\sqrt{20}$.\n\nThe checkerboard dataset consists of vectors in a 2 dimensional space. The points are generated in a $4 \\times 4$ grid. Both the classes are generated from a multivariate uniform distribution; each point is $(x_1 = U(0, 4), x_2 = U(0, 4))$. The points are labelled as follows: if $\\lceil x_1 \\rceil \\% 2 \\neq \\lceil x_2 \\rceil \\% 2$, then the point is labelled negative, else the point is labelled positive.\n\nCategory | Kernel | RandSVM | LibSVM | SVMPerf | SVMLin\ntwonorm1 | Gaussian | 300 (94.98%) | 8542 (96.48%) | X | X\ntwonorm2 | Gaussian | 437 (94.71%) | - | X | X\nringnorm1 | Gaussian | 2637 (70.66%) | 256 (70.31%) | X | X\nringnorm2 | Gaussian | 4982 (65.74%) | 85124 (65.34%) | X | X\ncheckerboard1 | Gaussian | 406 (93.70%) | 1568.93 (96.90%) | X | X\ncheckerboard2 | Gaussian | 814 (94.10%) | - | X | X\nCCAT* | Linear | 345 (94.37%) | X | 148 (94.38%) | 429 (95.1913%)\nC11* | Linear | 449 (96.57%) | X | 120 (97.53%) | 295 (97.71%)\n\nTable 1: The table gives the execution time (in seconds) and the classification accuracy (in brackets). The subscripts 1 and 2 indicate that the corresponding training set sizes are $10^5$ and $10^6$ respectively. A \u2019-\u2019 indicates that the solver did not finish execution even after running for a day. An \u2019X\u2019 indicates that the experiment is not applicable for the corresponding solver. The \u2019*\u2019 indicates that the solver used with RandSVM was SVMLight; otherwise it was LibSVM.\n\nFor each of the synthetic datasets, a training set of 1,000,000 points and a test set of 10,000 points was generated. A smaller subset of 100,000 points was chosen from the training set for parameter tuning.
From now on, the smaller training set will have a subscript of 1 and the larger training set will have a subscript of 2, for example, ringnorm1 and ringnorm2.\n\nReal world dataset\n\nThe RCV1 dataset consists of 804,414 documents, with each document consisting of 47,236 features. Experiments were performed using 2 categories of the dataset, CCAT and C11. The dataset was split into a training set of 700,000 documents and a test set of 104,414 documents.\n\nTable 1 shows the kernels which were used for each of the datasets. The parameters used for the Gaussian kernels, $\\sigma$ and $C$, were obtained using grid search based tuning. The parameter for the linear kernel, $C$, for CCAT and C11 was obtained from previous work [12].\n\nSelection of k for RandSVM: The values of $\\epsilon$ and $\\delta$ were fixed to 0.2 and 0.9 respectively, for all the datasets. For linearly separable datasets, $k$ was set to $(16 \\log(4n/\\delta))/\\epsilon^2$. For the others, $k$ was set to $(32 \\log(4n/\\delta))/\\epsilon^2$.\n\nDiscussion of results: Table 1, which has the timing and classification accuracy comparisons, shows that RandSVM can scale up SVM solvers for very large datasets. Using just a small wrapper around the solvers, RandSVM has scaled up SVMLight so that its performance is comparable to that of state of the art solvers such as SVMPerf and SVMLin. Similarly, LibSVM has been made capable of quickly solving problems which it could not solve before, even after executing for a day. Furthermore, the experiments on the synthetic datasets show that the execution times for training with $10^5$ examples and $10^6$ examples are not too far apart; this indicates that the execution time does not increase rapidly with the dataset size.\n\nAll the runs of RandSVM terminated with the condition $|SV| < k$ being violated.
Since the classification accuracies obtained by using RandSVM and the baseline solvers are very close, it is clear that Theorem 2 holds in practice.\n\n4 Further Research\n\nIt is clear from the experimental evaluations that randomized algorithms can be used to scale up SVM solvers to large scale classification problems. If an estimate of the number of support vectors is obtained then algorithm RandSVM-1 can be used for other SVM learning problems also, as they are usually instances of an AOP. Future work would be to apply the work done here to such problems.\n\nA Some Results from Random Projections\n\nHere we review a few lemmas from random projections [7]. The following lemma discusses how the L2 norm of a vector is preserved when it is projected on a random subspace.\n\nLemma 1. Let $R = (r_{ij})$ be a random $d \\times k$ matrix, such that each entry $r_{ij}$ is chosen independently according to $N(0, 1)$. For any fixed vector $u \\in \\mathbb{R}^d$, and any $\\epsilon > 0$, let $u' = \\frac{R^T u}{\\sqrt{k}}$. Then $E[\\|u'\\|^2] = \\|u\\|^2$ and the following bound holds:\n\n$$P\\left((1 - \\epsilon)\\|u\\|^2 \\leq \\|u'\\|^2 \\leq (1 + \\epsilon)\\|u\\|^2\\right) \\geq 1 - 2e^{-(\\epsilon^2 - \\epsilon^3)\\frac{k}{4}}$$\n\nThe following lemma and its corollary show the change in the Euclidean distance between 2 points and the dot products when they are projected onto a lower dimensional space [7].\n\nLemma 2. Let $u, v \\in \\mathbb{R}^d$. Let $u' = \\frac{R^T u}{\\sqrt{k}}$ and $v' = \\frac{R^T v}{\\sqrt{k}}$ be the projections of $u$ and $v$ to $\\mathbb{R}^k$ via a random matrix $R$ whose entries are chosen independently from $N(0, 1)$ or $U(-1, 1)$. Then for any $\\epsilon > 0$, the following bounds hold:\n\n$$P\\left((1 - \\epsilon)\\|u - v\\|^2 \\leq \\|u' - v'\\|^2\\right) \\geq 1 - e^{-(\\epsilon^2 - \\epsilon^3)\\frac{k}{4}}, \\text{ and}$$\n\n$$P\\left(\\|u' - v'\\|^2 \\leq (1 + \\epsilon)\\|u - v\\|^2\\right) \\geq 1 - e^{-(\\epsilon^2 - \\epsilon^3)\\frac{k}{4}}$$\n\nA corollary of the above lemma shows how well the dot products are preserved upon projection (this is a slight modification of the corollary given in [7]).\n\nCorollary 1. Let $u, v$ be vectors in $\\mathbb{R}^d$ such that $\\|u\\| \\leq L_1$, $\\|v\\| \\leq L_2$. Let $R$ be a random matrix whose entries are chosen independently from either $N(0, 1)$ or $U(-1, 1)$. Define $u' = \\frac{R^T u}{\\sqrt{k}}$ and $v' = \\frac{R^T v}{\\sqrt{k}}$. Then for any $\\epsilon > 0$, the following holds with probability at least $1 - 4e^{-\\epsilon^2 \\frac{k}{8}}$:\n\n$$u \\cdot v - \\frac{\\epsilon}{2}(L_1^2 + L_2^2) \\leq u' \\cdot v' \\leq u \\cdot v + \\frac{\\epsilon}{2}(L_1^2 + L_2^2)$$\n\nReferences\n\n[1] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.\n\n[2] Bernd Gartner. A subexponential algorithm for abstract optimization problems. In Proceedings 33rd Symposium on Foundations of Computer Science, IEEE CS Press, 1992.\n\n[3] Jose L. Balcazar, Yang Dai, and Osamu Watanabe. A random sampling technique for training support vector machines. In ALT. Springer, 2001.\n\n[4] Jose L. Balcazar, Yang Dai, and Osamu Watanabe. Provably fast training algorithms for support vector machines. In ICDM, pages 43\u201350, 2001.\n\n[5] K. P. Bennett and E. J. Bredensteiner. Duality and geometry in SVM classifiers. In P. Langley, editor, ICML, pages 57\u201364, San Francisco, California, 2000.\n\n[6] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz maps into a Hilbert space. Contemporary Mathematics, 1984.\n\n[7] R. I. Arriaga and S. Vempala. An algorithmic theory of learning: Random concepts and random projections.
In Proceedings of the 40th Foundations of Computer Science, 1999.\n\n[8] Kenneth L. Clarkson. Las Vegas algorithms for linear and integer programming when the dimension is small. Journal of the ACM, 42(2):488\u2013499, 1995.\n\n[9] B. Gartner and E. Welzl. A simple sampling lemma: analysis and application in geometric optimization. In Proceedings of the 16th annual ACM symposium on Computational Geometry, 2000.\n\n[10] M. Pellegrini. Randomizing combinatorial algorithms for linear programming when the dimension is moderately high. In SODA \u201901, pages 101\u2013108, Philadelphia, PA, USA, 2001.\n\n[11] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. On kernels, margins and low-dimensional mappings. In Proc. of the 15th Conf. Algorithmic Learning Theory, 2004.\n\n[12] T. Joachims. Training linear SVMs in linear time. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), 2006.", "award": [], "sourceid": 524, "authors": [{"given_name": "Krishnan", "family_name": "Kumar", "institution": null}, {"given_name": "Chiru", "family_name": "Bhattacharya", "institution": null}, {"given_name": "Ramesh", "family_name": "Hariharan", "institution": null}]}