{"title": "Asynchronous Parallel Greedy Coordinate Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 4682, "page_last": 4690, "abstract": "In this paper, we propose and study an Asynchronous parallel Greedy Coordinate Descent (Asy-GCD) algorithm for minimizing a smooth function with bounded constraints. At each iteration, workers asynchronously conduct greedy coordinate descent updates on a block of variables.  In the first part of the paper, we analyze the theoretical behavior of Asy-GCD and prove a linear convergence rate.  In the second part, we develop an efficient kernel SVM solver based on Asy-GCD in the shared memory multi-core setting.  Since our algorithm is fully asynchronous---each core does not need to idle and wait for the other cores---the  resulting algorithm enjoys good speedup and outperforms existing multi-core kernel SVM solvers including asynchronous stochastic coordinate descent and multi-core LIBSVM.", "full_text": "Asynchronous Parallel Greedy Coordinate Descent\n\nYang You \u21e7, + XiangRu Lian\u2020, + Ji Liu \u2020 Hsiang-Fu Yu \u2021\nInderjit S. Dhillon \u2021 James Demmel \u21e7 Cho-Jui Hsieh \u21e4\n\n+ equally contributed\n\n\u21e4 University of California, Davis\n\n\u2020 University of Rochester\n\n\u2021 University of Texas, Austin\n\n\u21e7 University of California, Berkeley\n\nyouyang@cs.berkeley.edu, xiangru@yandex.com,\n\njliu@cs.rochester.edu\n\n{rofuyu,inderjit}@cs.utexas.edu,\n\ndemmel@eecs.berkeley.edu\n\nchohsieh@cs.ucdavis.edu\n\nAbstract\n\nIn this paper, we propose and study an Asynchronous parallel Greedy Coordinate\nDescent (Asy-GCD) algorithm for minimizing a smooth function with bounded\nconstraints. At each iteration, workers asynchronously conduct greedy coordinate\ndescent updates on a block of variables. In the first part of the paper, we analyze the\ntheoretical behavior of Asy-GCD and prove a linear convergence rate. 
In the second\npart, we develop an efficient kernel SVM solver based on Asy-GCD in the shared\nmemory multi-core setting. Since our algorithm is fully asynchronous\u2014each core\ndoes not need to idle and wait for the other cores\u2014the resulting algorithm enjoys\ngood speedup and outperforms existing multi-core kernel SVM solvers including\nasynchronous stochastic coordinate descent and multi-core LIBSVM.\n\n1 Introduction\nAsynchronous parallel optimization has recently become a popular way to speed up machine learning\nalgorithms using multiple processors. The key idea of asynchronous parallel optimization is to allow\nmachines to work independently without waiting at synchronization points. It has many successful\napplications including linear SVM [13, 19], deep neural networks [7, 15], matrix completion [19, 31],\nand linear programming [26], and its theoretical behavior has been studied in depth in the past few\nyears [1, 9, 16].\nThe most widely used asynchronous optimization algorithms are the stochastic gradient method (SG) [7,\n9, 19] and coordinate descent (CD) [1, 13, 16], where the workers keep selecting a sample or a\nvariable randomly and conduct the corresponding update asynchronously. Although these stochastic\nalgorithms have been studied deeply, in some important machine learning problems a \u201cgreedy\u201d\napproach can achieve much faster convergence. A well-known example is greedy coordinate\ndescent: instead of randomly choosing a variable, at each iteration the algorithm selects the most\nimportant variable to update. If this selection step can be implemented efficiently, greedy coordinate\ndescent can often make more progress per iteration than stochastic coordinate descent, leading to\nfaster convergence. For example, the decomposition method (a variant of greedy coordinate\ndescent) is widely known as the best solver for kernel SVM [14, 21], and is implemented in LIBSVM\nand SVMLight. 
Other successful applications can be found in [8, 11, 29].\nIn this paper, we study an asynchronous greedy coordinate descent framework. The variables are\npartitioned into blocks, and each worker asynchronously conducts greedy coordinate descent in one\nof the blocks. To our knowledge, this is the first paper to present a theoretical analysis or practical\napplications of this asynchronous parallel algorithm. In the first part of the paper, we formally define\nthe asynchronous greedy coordinate descent procedure, and prove a linear convergence rate under\nmild assumptions. In the second part of the paper, we discuss how to apply this algorithm to solve the\nkernel SVM problem on multi-core machines. Our algorithm achieves linear speedup with the number of\ncores, and performs better than other multi-core SVM solvers.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nThe rest of the paper is outlined as follows. The related work is discussed in Section 2. We propose\nthe asynchronous greedy coordinate descent algorithm in Section 3 and derive the convergence rate\nin the same section. In Section 4 we show in detail how to apply this algorithm to training kernel\nSVM, and the experimental comparisons are presented in Section 5.\n2 Related Work\nCoordinate Descent. Coordinate descent (CD) has been extensively studied in the optimization\ncommunity [2], and has become widely used in machine learning. At each iteration, only one variable\nis chosen and updated while all the other variables remain fixed. CD can be classified into stochastic\ncoordinate descent (SCD), cyclic coordinate descent (CCD) and greedy coordinate descent (GCD)\nbased on the variable selection scheme. In SCD, variables are chosen randomly based on some\ndistribution, and this simple approach has been successfully applied in solving many machine learning\nproblems [10, 25]. 
The theoretical analysis of SCD has been discussed in [18, 22]. Cyclic coordinate\ndescent updates variables in a cyclic order, and has also been applied to several applications [4, 30].\nGreedy Coordinate Descent (GCD). The idea of GCD is to select a good, instead of random,\ncoordinate that can yield a larger reduction of the objective function value. This can often be measured by\nthe magnitude of the gradient, projected gradient (for constrained minimization) or proximal gradient\n(for composite minimization). Since the variable is carefully selected, at each iteration GCD can\nreduce the objective function more than SCD or CCD, which leads to faster convergence in practice.\nUnfortunately, selecting a variable with a larger gradient is often time-consuming, so one needs to\ncarefully organize the computation to avoid the overhead, and this is often problem dependent.\nThe most famous application of GCD is the decomposition method [14, 21] used in kernel SVM.\nBy exploiting the structure of quadratic programming, selecting the variable with the largest gradient\nmagnitude can be done without any overhead; as a result GCD has become the dominant technique for\nsolving kernel SVM, and is implemented in LIBSVM [5] and SVMLight [14]. There are also other\napplications of GCD, such as non-negative matrix factorization [11] and large-scale linear SVM [29],\nand [8] proposed an approximate way to select variables in GCD. Recently, [20] proved an improved\nconvergence bound for greedy coordinate descent. We focus on parallelizing the GS-r rule in this\npaper but our analysis can potentially be extended to the GS-q rule mentioned in that paper.\nTo the best of our knowledge, the only literature discussing how to parallelize GCD is [23, 24].\nThread-greedy/block-greedy coordinate descent is a synchronized parallel GCD for L1-regularized\nempirical risk minimization. 
At each iteration, each thread randomly selects a block of coordinates\nfrom a pre-computed block partition and proposes the best coordinate from this block along with its\nincrement (i.e., step size). Then all the threads are synchronized to perform the actual update to the\nvariables. However, the method can potentially diverge; indeed, [23] mentions the\npotential divergence when the number of threads is large. [24] establishes sub-linear convergence for\nthis algorithm.\nAsynchronous Parallel Optimization Algorithms. In a synchronous algorithm each worker conducts\nlocal updates, and at the end of each round they have to stop and communicate to get the new\nparameters. This is not efficient when scaling to large problems due to the curse of the last reducer (all\nthe workers have to wait for the slowest one). In contrast, in asynchronous algorithms there is no\nsynchronization point, so the throughput will be much higher than in a synchronized system. As a result,\nmany recent works focus on developing asynchronous parallel algorithms for machine learning as well\nas providing theoretical guarantees for those algorithms [1, 7, 9, 13, 15, 16, 19, 28, 31].\nIn distributed systems, asynchronous algorithms are often implemented using the concept of parameter\nservers [7, 15, 28]. In such a setting, each machine asynchronously communicates with the server to\nread or write the parameters. In our experiments, we focus on a different setting, multi-core shared\nmemory, where multiple cores in a single machine conduct updates independently and asynchronously,\nand the communication is implicitly done by reading/writing to the parameters stored in the shared\nmemory space. This was first discussed in [19] for the stochastic gradient method, and was recently\nproposed for parallelizing stochastic coordinate descent [13, 17].\nThis is the first work proposing an asynchronous greedy coordinate descent framework. 
The closest\nwork to ours is [17] for asynchronous stochastic coordinate descent (ASCD). In their algorithm, each\nworker asynchronously conducts the following updates: (1) randomly select a variable; (2) compute\nthe update and write it to memory or the server. In our Asy-GCD algorithm, each worker selects the best\nvariable to update in a block, which leads to faster convergence. We also compare with the ASCD\nalgorithm in the experimental results for solving the kernel SVM problem.\n\n3 Asynchronous Greedy Coordinate Descent\nWe consider the following constrained minimization problem:\n\n$$\\min_{x \\in \\Omega} f(x), \\qquad (1)$$\n\nwhere $f$ is convex and smooth, $\\Omega \\subset \\mathbb{R}^N$ is the constraint set, $\\Omega = \\Omega_1 \\times \\Omega_2 \\times \\cdots \\times \\Omega_N$, and each\n$\\Omega_i$, $i = 1, 2, \\ldots, N$, is a closed subinterval of the real line.\nNotation: We denote by $S$ the optimal solution set for (1) and by $P_S(x)$, $P_\\Omega(x)$ the Euclidean\nprojections of $x$ onto $S$ and $\\Omega$, respectively. We also denote by $f^*$ the optimal objective function value\nfor (1).\nWe propose the following Asynchronous parallel Greedy Coordinate Descent (Asy-GCD) for solving\n(1). Assume the $N$ coordinates are divided into $n$ non-overlapping sets $S_1 \\cup \\ldots \\cup S_n$. Let $k$ be the\nglobal counter of the total number of updates. 
In Asy-GCD, each processor repeatedly runs the following\nGCD updates:\n\u2022 Randomly select a set $S_k \\in \\{S_1, \\ldots, S_n\\}$ and pick the coordinate $i_k \\in S_k$ where the projected\ngradient (defined in (2)) has the largest absolute value.\n\u2022 Update the parameter by\n$$x_{k+1} \\leftarrow P_\\Omega(x_k - \\gamma \\nabla_{i_k} f(x_k)),$$\nwhere $\\gamma$ is the step size.\n\nHere the projected gradient, defined by\n\n$$\\nabla^+_{i_k} f(\\hat{x}_k) := x_k - P_\\Omega(x_k - \\nabla_{i_k} f(\\hat{x}_k)), \\qquad (2)$$\n\nis a measure of optimality for each variable, where $\\hat{x}_k$ is the current point stored in memory used\nto calculate the update. The processors run concurrently without synchronization. In order to\nanalyze Asy-GCD, we capture the system-wise global view in Algorithm 1.\n\nAlgorithm 1 Asynchronous Parallel Greedy Coordinate Descent (Asy-GCD)\nInput: $x_0 \\in \\Omega$, $\\gamma$, $K$\nOutput: $x_{K+1}$\n1: Initialize $k \\leftarrow 0$;\n2: while $k \\le K$ do\n3: Choose $S_k$ from $\\{S_1, \\ldots, S_n\\}$ with equal probability;\n4: Pick $i_k = \\arg\\max_{i \\in S_k} \\|\\nabla^+_i f(\\hat{x}_k)\\|$;\n5: $x_{k+1} \\leftarrow P_\\Omega(x_k - \\gamma \\nabla_{i_k} f(\\hat{x}_k))$;\n6: $k \\leftarrow k + 1$;\n7: end while\n\nThe update in the $k$th iteration is\n\n$$x_{k+1} \\leftarrow P_\\Omega(x_k - \\gamma \\nabla_{i_k} f(\\hat{x}_k)),$$\n\nwhere $i_k$ is the selected coordinate in the $k$th iteration, $\\hat{x}_k$ is the point used to calculate the gradient and\n$\\nabla_{i_k} f(\\hat{x}_k)$ is a zero vector where the $i_k$th coordinate is set to the corresponding coordinate of the\ngradient of $f$ at $\\hat{x}_k$. Note that $\\hat{x}_k$ may not be equal to the current value of the optimization variable\n$x_k$ due to asynchrony. Later in the theoretical analysis we will need to assume $\\hat{x}_k$ is close to $x_k$ using\nthe bounded delay assumption.\nIn the following we analyze the convergence behavior of Asy-GCD. 
We first make some commonly\nused assumptions:\nAssumption 1.\n\n1. (Bounded Delay) There is a set $J(k) \\subset \\{k - 1, \\ldots, k - T\\}$ for each iteration $k$ such that\n\n$$\\hat{x}_k := x_k - \\sum_{j \\in J(k)} (x_{j+1} - x_j), \\qquad (3)$$\n\nwhere $T$ is the upper bound of the staleness. In this \u201cinconsistent read\u201d model, we assume\nsome of the latest $T$ updates are not yet written back to memory. This is also used in some\nprevious papers [1, 17], and is more general than the \u201cconsistent read\u201d model that assumes\n$\\hat{x}_k$ is equal to some previous iterate.\n\n2. For simplicity, we assume each set $S_i$, $i \\in \\{1, \\ldots, n\\}$, has $m$ coordinates.\n3. (Lipschitzian Gradient) The gradient function of the objective $\\nabla f(\\cdot)$ is Lipschitzian. That\nis to say,\n$$\\|\\nabla f(x) - \\nabla f(y)\\| \\le L \\|x - y\\| \\quad \\forall x, \\forall y. \\qquad (4)$$\nUnder the Lipschitzian gradient assumption, we can define three more constants $L_{\\mathrm{res}}$, $L_s$ and\n$L_{\\max}$. Define $L_{\\mathrm{res}}$ to be the restricted Lipschitz constant satisfying the following inequality:\n$\\forall i \\in \\{1, 2, \\ldots, N\\}$ and $t \\in \\mathbb{R}$ with $x, x + t e_i \\in \\Omega$,\n$$\\|\\nabla f(x) - \\nabla f(x + t e_i)\\| \\le L_{\\mathrm{res}} |t|. \\qquad (5)$$\nLet $\\nabla_i$ be the operator calculating a zero vector where the $i$th coordinate is set to the $i$th\ncoordinate of the gradient. Define $L_{(i)}$ for $i \\in \\{1, 2, \\ldots, N\\}$ as the minimum constant that\nsatisfies:\n$$\\|\\nabla_i f(x) - \\nabla_i f(x + t e_i)\\| \\le L_{(i)} |t|. \\qquad (6)$$\nDefine $L_{\\max} := \\max_{i \\in \\{1, \\ldots, N\\}} L_{(i)}$. It can be seen that $L_{\\max} \\le L_{\\mathrm{res}} \\le L$.\nLet $s$ be any positive integer bounded by $N$. 
Define $L_s$ to be the minimal constant satisfying\nthe following inequality: $\\forall x \\in \\Omega$, $\\forall S \\subset \\{1, 2, \\ldots, N\\}$ where $|S| \\le s$:\n$$\\left\\|\\nabla f(x) - \\nabla f\\left(x + \\sum_{i \\in S} \\alpha_i e_i\\right)\\right\\| \\le L_s \\left\\|\\sum_{i \\in S} \\alpha_i e_i\\right\\|.$$\n\n4. (Global Error Bound) We assume that our objective $f$ has the following property: when\n$\\gamma = \\frac{1}{3 L_{\\max}}$, there exists a constant $\\kappa$ such that\n$$\\|x - P_S(x)\\| \\le \\kappa \\|\\tilde{x} - x\\|, \\quad \\forall x \\in \\Omega, \\qquad (7)$$\nwhere $\\tilde{x}$ is defined by $\\arg\\min_{x' \\in \\Omega} \\left( \\langle \\nabla f(x), x' - x \\rangle + \\frac{1}{2\\gamma} \\|x' - x\\|^2 \\right)$. This is satisfied by\nstrongly convex objectives and some weakly convex objectives. For example, it is proved\nin [27] that the kernel SVM problem (9) satisfies the global error bound even when the\nkernel is not strictly positive definite.\n5. (Independence) All random variables in $\\{S_k\\}_{k=0,1,\\cdots,K}$ in Algorithm 1 are independent of\neach other.\n\nWe then have the following convergence result:\nTheorem 2 (Convergence). Choose $\\gamma = 1/(3 L_{\\max})$ in Algorithm 1. 
Suppose $n \\ge 6$ and that the\nupper bound for staleness $T$ satisfies the following condition\n$$T(T + 1) \\le \\frac{\\sqrt{n} L_{\\max}}{4 e L_{\\mathrm{res}}}. \\qquad (8)$$\nUnder Assumption 1, we have the following convergence rate for Algorithm 1:\n$$\\mathbb{E}(f(x_k) - f^*) \\le \\left(1 - \\frac{2 L_{\\max} b}{L \\kappa^2 n}\\right)^k (f(x_0) - f^*),$$\nwhere $b$ is defined as\n$$b = \\left(\\frac{L_T^2}{18 \\sqrt{n} L_{\\max} L_{\\mathrm{res}}} + 2\\right)^{-1}.$$\nThis theorem indicates a linear convergence rate under the global error bound and the condition\n$T^2 \\le O(\\sqrt{n})$. Since $T$ is usually proportional to the total number of cores involved in the computation,\nthis result suggests that one can have linear speedup as long as the total number of cores is smaller\nthan $O(n^{1/4})$. Note that for $n = N$ Algorithm 1 reduces to the standard asynchronous coordinate\ndescent algorithm (ASCD) and our result is essentially consistent with the one in [17], although they\nuse the optimally strong convexity assumption for $f(\\cdot)$. The optimally strong convexity is a similar\ncondition to the global error bound assumption [32].\nHere we briefly discuss the constants involved in the convergence rate. Using Gaussian kernel SVM on\ncovtype as a concrete example, $L_{\\max} = 1$ for the Gaussian kernel, $L_{\\mathrm{res}}$ is the maximum norm of the columns\nof the kernel matrix ($\\approx 3.5$), $L$ is the 2-norm of $Q$ (21.43 for covtype), and the condition number $\\kappa \\approx 1190$.\nAs the number of samples increases, the condition number $\\kappa$ becomes the dominant term, and this\nalso appears in the rate of serial greedy coordinate descent. 
In terms of speedup when increasing the\nnumber of threads ($T$), although $L_T$ may grow, it only appears in $b = \\left(\\frac{L_T^2}{18 \\sqrt{n} L_{\\max} L_{\\mathrm{res}}} + 2\\right)^{-1}$, where\nthe first term inside $b$ is usually small since there is a $\\sqrt{n}$ in the denominator. Therefore, $b \\approx 2^{-1}$ in\nmost cases, which means the convergence rate does not slow down too much when we increase $T$.\n\n4 Application to Multi-core Kernel SVM\nIn this section, we demonstrate how to apply asynchronous parallel greedy coordinate descent to\nsolve kernel SVM [3, 6]. We follow the conventional notations for kernel SVM, where the variables\nfor the dual form are $\\alpha \\in \\mathbb{R}^\\ell$ (instead of $x$ in the previous section). Given training samples $\\{a_i\\}_{i=1}^\\ell$\nwith corresponding labels $y_i \\in \\{+1, -1\\}$, kernel SVM solves the following quadratic minimization\nproblem:\n\n$$\\min_{\\alpha \\in \\mathbb{R}^\\ell} \\left\\{ \\frac{1}{2} \\alpha^T Q \\alpha - e^T \\alpha := f(\\alpha) \\right\\} \\quad \\text{s.t. } 0 \\le \\alpha \\le C, \\qquad (9)$$\n\nwhere $Q$ is an $\\ell$ by $\\ell$ symmetric matrix with $Q_{ij} = y_i y_j K(a_i, a_j)$ and $K(a_i, a_j)$ is the kernel\nfunction. The Gaussian kernel is a widely used kernel function, where $K(a_i, a_j) = e^{-\\gamma \\|a_i - a_j\\|^2}$.\nGreedy coordinate descent is the most popular way to solve kernel SVM. 
In the following, we first\nintroduce greedy coordinate descent for kernel SVM, and then discuss the detailed update rule and\nimplementation issues when applying our proposed Asy-GCD algorithm on multi-core machines.\n4.1 Kernel SVM and greedy coordinate descent\nWhen we apply coordinate descent to solve the dual form of kernel SVM (9), the one variable update\nrule for any index $i$ can be computed by:\n\n$$\\delta_i^* = P_{[0, C]}\\left(\\alpha_i - \\nabla_i f(\\alpha) / Q_{ii}\\right) - \\alpha_i, \\qquad (10)$$\nwhere $P_{[0, C]}$ is the projection onto the interval $[0, C]$ and the gradient is $\\nabla_i f(\\alpha) = (Q\\alpha)_i - 1$. Note\nthat this update rule is slightly different from (2) by setting the step size to be $\\gamma = 1/Q_{ii}$. 
For\nquadratic functions this step size leads to faster convergence because $\\delta_i^*$ obtained by (10) is the closed-form\nsolution of\n\n$$\\delta^* = \\arg\\min_{\\delta} f(\\alpha + \\delta e_i),$$\n\nand $e_i$ is the $i$-th indicator vector.\nAs in Algorithm 1, we choose the best coordinate based on the magnitude of the projected gradient. In\nthis case, by definition\n\n$$\\nabla^+_i f(\\alpha) = \\alpha_i - P_{[0, C]}\\left(\\alpha_i - \\nabla_i f(\\alpha)\\right). \\qquad (11)$$\n\nThe success of GCD in solving kernel SVM is mainly due to the maintenance of the gradient\n\n$$g := \\nabla f(\\alpha) = Q\\alpha - 1.$$\n\nConsider the update rule (10): it requires $O(\\ell)$ time to compute $(Q\\alpha)_i$, which is the cost for stochastic\ncoordinate descent or cyclic coordinate descent. However, in the following we show that GCD has\nthe same time complexity per update by using the trick of maintaining $g$ during the whole procedure.\nIf $g$ is available in memory, each element of the projected gradient (11) can be computed in $O(1)$\ntime, so selecting the variable with maximum magnitude of projected gradient only costs $O(\\ell)$ time.\nThe single variable update (10) can be computed in $O(1)$ time. After the update $\\alpha_i \\leftarrow \\alpha_i + \\delta$, the\ngradient $g$ has to be updated by $g \\leftarrow g + \\delta q_i$, where $q_i$ is the $i$-th column of $Q$. This also costs $O(\\ell)$ time.\nTherefore, each GCD update only costs $O(\\ell)$ using this trick of maintaining $g$.\nAs a result, for solving kernel SVM, GCD is faster than SCD and CCD since there is no additional\ncost for selecting the best variable to update. Note that in the above discussion we assume $Q$ can be\nstored in memory. Unfortunately, this is not the case for large scale problems because $Q$ is an $\\ell$ by $\\ell$\ndense matrix, where $\\ell$ can be in the millions. 
We will discuss how to deal with this issue in Section 4.3.\nWith the trick of maintaining $g = Q\\alpha - 1$, the GCD for solving (9) can be summarized in Algorithm 2.\nAlgorithm 2 Greedy Coordinate Descent (GCD) for Dual Kernel SVM\n1: Initialize $g = -1$, $\\alpha = 0$\n2: For $k = 1, 2, \\cdots$\n3: step 1: Pick $i = \\arg\\max_i |\\nabla^+_i f(\\alpha)|$ using $g$ (see eq (11))\n4: step 2: Compute $\\delta_i^*$ by eq (10)\n5: step 3: $g \\leftarrow g + \\delta^* q_i$\n6: step 4: $\\alpha_i \\leftarrow \\alpha_i + \\delta^*$\n\n4.2 Asynchronous greedy coordinate descent\nSuppose we have $n$ threads in a multi-core shared memory machine, and the dual variables (or the corresponding\ntraining samples) are partitioned into the same number of blocks:\n\n$$S_1 \\cup S_2 \\cup \\cdots \\cup S_n = \\{1, 2, \\cdots, \\ell\\} \\text{ and } S_i \\cap S_j = \\emptyset \\text{ for all } i \\ne j.$$\n\nNow we apply the Asy-GCD algorithm to solve (9). For better memory allocation of the kernel cache\n(see Section 4.3), we bind each thread to a partition. The behavior of our algorithm still follows\nAsy-GCD because the sequence of updates is asynchronously random. The algorithm is summarized\nin Algorithm 3.\n\nAlgorithm 3 Asy-GCD for Dual Kernel SVM\n1: Initialize $g = -1$, $\\alpha = 0$\n2: Each thread $t$ repeatedly performs the following updates in parallel:\n3: step 1: Pick $i = \\arg\\max_{i \\in S_t} |\\nabla^+_i f(\\alpha)|$ using $g$ (see eq (11))\n4: step 2: Compute $\\delta_i^*$ by eq (10)\n5: step 3: For $j = 1, 2, \\cdots, \\ell$\n6: $g_j \\leftarrow g_j + \\delta^* Q_{j,i}$ using atomic update\n7: step 4: $\\alpha_i \\leftarrow \\alpha_i + \\delta^*$\n\nNote that each thread will read the $\\ell$-dimensional vector $g$ in steps 1 and 2 and update $g$ in step 3 in the\nshared memory. For the reads, we do not use any atomic operations. 
For the writes, we maintain the\ncorrectness of $g$ by atomic writes; otherwise some updates to $g$ might be overwritten by others, and\nthe algorithm cannot converge to the optimal solution. Theorem 2 suggests a linear convergence rate\nof our algorithm, and in the experimental results we will see the algorithm is much faster than the\nwidely used Asynchronous Stochastic Coordinate Descent (Asy-SCD) algorithm [17].\n4.3 Implementation Issues\nIn addition to the main algorithm, there are some practical issues we need to handle in order to\nmake the Asy-GCD algorithm scale to large-scale kernel SVM problems. Here we discuss these\nimplementation issues.\nKernel Caching. The main difficulty for scaling kernel SVM to large datasets is the memory requirement\nfor storing the $Q$ matrix, which takes $O(\\ell^2)$ memory. In the GCD algorithm, step 2 (see eq (10))\nrequires a diagonal element of $Q$, which can be pre-computed and stored in memory. However, the\nmain difficulty is to conduct step 3, where a column of $Q$ (denoted by $q_i$) is needed. If $q_i$ is in\nmemory, the algorithm only takes $O(\\ell)$ time; however, if $q_i$ is not in memory, re-computing it\nfrom scratch takes $O(d\\ell)$ time. As a result, how to keep the most important columns of $Q$ in memory\nis an important implementation issue in SVM software.\nIn LIBSVM, the user can specify the size of the memory they want to use for storing columns of $Q$. The\ncolumns of $Q$ are stored in a linked list in memory, and when memory space is not enough the\nLeast Recently Used column is evicted (the LRU technique).\nIn our implementation, instead of sharing the same LRU cache among all the cores, we create an individual\nLRU cache for each core, and place the memory used by a core in a contiguous memory space. As\na result, remote memory access happens less often in a NUMA system when there is more than one\nCPU in the same computer. 
Using this technique, our algorithm is able to scale up on a multi-socket\nmachine (see Figure 2).\nVariable Partitioning. The theory of Asy-GCD allows any non-overlapping partition of the dual\nvariables. 
However, we observe that a better partition that minimizes the between-cluster connections can\noften lead to faster convergence. This idea has been used in a divide-and-conquer SVM algorithm [12],\nand we use the same idea to obtain the partition. More specifically, we partition the data by running the\nkmeans algorithm on a subset of 20000 training samples to obtain cluster centers $\\{c_r\\}_{r=1}^n$, and then\nassign each $i$ to the nearest center: $\\pi(i) = \\arg\\min_r \\|c_r - a_i\\|$. This step can be easily parallelized,\nand costs less than 3 seconds on all the datasets used in the experiments. Note that we include this\nkmeans time in all our experimental comparisons.\n5 Experimental Results\nWe conduct experiments to show that the proposed method Asy-GCD achieves good speedup in\nparallelizing kernel SVM in multi-core systems. We consider three datasets: ijcnn1, covtype and\nwebspam (see Table 1 for detailed information). We follow the parameter settings in [12], where $C$\nand $\\gamma$ are selected by cross validation. All the experiments are run on the same system with 20 CPU cores\nand 256GB memory, where the machine has two sockets, each with 10 cores. We allocate 64GB for kernel\ncaching for all the algorithms. In our algorithm, the 64GB is distributed among the cores; for example,\nfor Asy-GCD with 20 cores, each core has a 3.2GB kernel cache.\n\nTable 1: Data statistics. $\\ell$ is the number of training samples, $d$ is the dimensionality, $\\ell_t$ is the number of testing\nsamples.\n\ndataset | $\\ell$ | $\\ell_t$ | $d$ | $C$ | $\\gamma$\nijcnn1 | 49,990 | 91,701 | 22 | 32 | 2\ncovtype | 464,810 | 116,202 | 54 | 32 | 32\nwebspam | 280,000 | 70,000 | 254 | 8 | 32\n\nFigure 1: Comparison of Asy-GCD with 1\u201320 threads on ijcnn1, covtype and webspam datasets. 
Panels: (a) ijcnn1 time vs obj; (b) webspam time vs obj; (c) covtype time vs obj.\n\nWe include the following algorithms/implementations in our comparison:\n\n1. Asy-GCD: Our proposed method implemented in C++ with OpenMP. Note that the preprocessing\ntime for computing the partition is included in all the timing results.\n\n2. PSCD: We implement the asynchronous stochastic coordinate descent [17] approach for\nsolving kernel SVM. Instead of forming the whole kernel matrix in the beginning (which\ncannot scale to all the datasets we are using), we use the same kernel caching technique as\nAsy-GCD to scale up PSCD.\n\n3. LIBSVM (OMP): In LIBSVM, there is an option to speed up the algorithm in a multi-core environment\nusing OpenMP (see http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f432). This approach uses multiple cores when computing a column of the kernel\nmatrix ($q_i$ used in step 3 of Algorithm 2).\n\nAll the implementations are modified from LIBSVM (e.g., they share a similar LRU cache class),\nso the comparison is fair. We conduct the following two sets of experiments. Note that another\nrecently proposed DC-SVM solver [12] is currently not parallelizable; however, since it is a meta\nalgorithm and requires training a series of SVM problems, our algorithm can naturally serve as a\nbuilding block of DC-SVM.\n5.1 Scaling with number of cores\nIn the first set of experiments, we test the speedup of our algorithm with varying number of cores.\nThe results are presented in Figure 1 and Figure 2. We have the following observations:\n\n\u2022 Time vs obj (for 1, 2, 4, 10, 20 cores). From Fig. 1 (a)-(c), we observe that when we use\nmore CPU cores, the objective decreases faster.\n\u2022 Cores vs speedup. From Fig. 2, we can observe that we get good strong scaling when we\nincrease the number of threads. Note that our computer has two sockets, each with 10 cores,\nand our algorithm can often achieve 13-15 times speedup. 
This suggests our algorithm can\nscale to multiple sockets in a Non-Uniform Memory Access (NUMA) system. 
Previous\nasynchronous parallel algorithms such as HogWild [19] or PASSCoDe [13] often struggle\nwhen scaling to multiple sockets.\n\n5.2 Comparison with other methods\nNow we compare the efficiency of our proposed algorithm with other multi-core parallel kernel SVM\nsolvers on real datasets in Figure 3. All the algorithms in this comparison use 20 cores and\n64GB memory space for kernel caching. Note that LIBSVM solves the kernel SVM problem with\nthe bias term, so its objective function value is not shown in the figures.\n\nFigure 2: The scalability of Asy-GCD with up to 20 threads. Panels: (a) ijcnn1 cores vs speedup; (b) webspam cores vs speedup; (c) covtype cores vs speedup.\n\nFigure 3: Comparison among multi-core kernel SVM solvers. All the solvers use 20 cores and the\nsame amount of memory. Panels: (a) ijcnn1 time vs accuracy; (b) covtype time vs accuracy; (c) webspam time vs accuracy; (d) ijcnn1 time vs objective; (e) covtype time vs objective; (f) webspam time vs objective.\n\nWe have the following observations:\n\n\u2022 Our algorithm achieves much faster convergence in terms of objective function value\ncompared with PSCD. This is not surprising because, using the trick of maintaining $g$ (see\ndetails in Section 4), the greedy approach can select the best variable to update, while the stochastic\napproach just chooses variables randomly. In terms of accuracy, PSCD is sometimes good in\nthe beginning, but converges very slowly to the best accuracy. For example, on the covtype data\nthe accuracy of PSCD remains 93% after 4000 seconds, while our algorithm can achieve\n95% accuracy after 1500 seconds.\n\u2022 LIBSVM (OMP) is slower than our method. 
The main reason is that it only uses multiple\ncores when computing kernel values, so the computational power is wasted when the needed column\nof the kernel matrix ($q_i$) is already available in memory.\n\nConclusions In this paper, we propose an Asynchronous parallel Greedy Coordinate Descent (Asy-GCD)\nalgorithm, and prove a linear convergence rate under mild conditions. We show our algorithm\nis useful for parallelizing the greedy coordinate descent method for solving kernel SVM, and the\nresulting algorithm is much faster than existing multi-core SVM solvers.\nAcknowledgement XL and JL are supported by the NSF grant CNS-1548078. HFY and ISD\nare supported by the NSF grants CCF-1320746, IIS-1546459 and CCF-1564000. YY and JD are\nsupported by the U.S. Department of Energy Office of Science, Office of Advanced Scientific\nComputing Research, Applied Mathematics program under Award Number DE-SC0010200; by the\nU.S. Department of Energy Office of Science, Office of Advanced Scientific Computing Research\nunder Award Numbers DE-SC0008700 and AC02-05CH11231; by DARPA Award Number HR0011-12-2-0016,\nIntel, Google, HP, Huawei, LGE, Nokia, NVIDIA, Oracle and Samsung, Mathworks\nand Cray. CJH also thanks XSEDE and Nvidia for their support.\n\nReferences\n[1] H. Avron, A. Druinsky, and A. Gupta. Revisiting asynchronous linear solvers: Provable convergence rate\nthrough randomization. In IEEE International Parallel and Distributed Processing Symposium, 2014.\n[2] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA 02178-9998, second edition,\n1999.\n[3] B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In COLT, 1992.\n[4] A. Canutescu and R. Dunbrack. Cyclic coordinate descent: A robotics algorithm for protein loop closure.\nProtein Science, 2003.\n[5] C.-C. Chang and C.-J. Lin. LIBSVM: Introduction and benchmarks. 
Technical report, Department of\nComputer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, 2000.\n[6] C. Cortes and V. Vapnik. Support-vector network. Machine Learning, 20:273\u2013297, 1995.\n[7] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al.\nLarge scale distributed deep networks. In NIPS, 2012.\n[8] I. S. Dhillon, P. Ravikumar, and A. Tewari. Nearest neighbor based greedy coordinate descent. In NIPS,\n2011.\n[9] J. C. Duchi, S. Chaturapruek, and C. R\u00e9. Asynchronous stochastic convex optimization. arXiv preprint\narXiv:1508.00882, 2015.\n[10] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method\nfor large-scale linear SVM. In ICML, 2008.\n[11] C.-J. Hsieh and I. S. Dhillon. Fast coordinate descent methods with variable selection for non-negative\nmatrix factorization. In KDD, 2011.\n[12] C. J. Hsieh, S. Si, and I. S. Dhillon. A divide-and-conquer solver for kernel support vector machines. In\nICML, 2014.\n[13] C.-J. Hsieh, H. F. Yu, and I. S. Dhillon. PASSCoDe: Parallel ASynchronous Stochastic dual Coordinate\nDescent. In International Conference on Machine Learning (ICML), 2015.\n[14] T. Joachims. Making large-scale SVM learning practical. In Advances in Kernel Methods - Support Vector\nLearning. MIT Press, 1998.\n[15] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y.\nSu. Scaling distributed machine learning with the parameter server. In OSDI, 2014.\n[16] J. Liu and S. J. Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence\nproperties. 2014.\n[17] J. Liu, S. J. Wright, C. Re, and V. Bittorf. An asynchronous parallel stochastic coordinate descent algorithm.\nIn ICML, 2014.\n[18] Y. E. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. 
SIAM Journal on Optimization, 22(2):341\u2013362, 2012.\n\n[19] F. Niu, B. Recht, C. R\u00e9, and S. J. Wright. HOGWILD!: a lock-free approach to parallelizing stochastic gradient descent. In NIPS, pages 693\u2013701, 2011.\n\n[20] J. Nutini, M. Schmidt, I. H. Laradji, M. Friedlander, and H. Koepke. Coordinate descent converges faster with the Gauss-Southwell rule than random selection. In ICML, 2015.\n\n[21] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Sch\u00f6lkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, 1998. MIT Press.\n\n[22] P. Richt\u00e1rik and M. Tak\u00e1\u010d. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144:1\u201338, 2014.\n\n[23] C. Scherrer, M. Halappanavar, A. Tewari, and D. Haglin. Scaling up coordinate descent algorithms for large l1 regularization problems. In ICML, 2012.\n\n[24] C. Scherrer, A. Tewari, M. Halappanavar, and D. Haglin. Feature clustering for accelerating parallel coordinate descent. In NIPS, 2012.\n\n[25] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567\u2013599, 2013.\n\n[26] S. Sridhar, S. Wright, C. R\u00e9, J. Liu, V. Bittorf, and C. Zhang. An approximate, efficient LP solver for LP rounding. In NIPS, 2013.\n\n[27] P.-W. Wang and C.-J. Lin. Iteration complexity of feasible descent methods for convex optimization. Journal of Machine Learning Research, 15:1523\u20131548, 2014.\n\n[28] E. P. Xing, W. Dai, J. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu. Petuum: A new platform for distributed machine learning on big data. In KDD, 2015.\n\n[29] I. Yen, C.-F. Chang, T.-W. Lin, S.-W. Lin, and S.-D. Lin. 
Indexed block coordinate descent for large-scale linear classification with limited memory. In KDD, 2013.\n\n[30] H.-F. Yu, C.-J. Hsieh, S. Si, and I. S. Dhillon. Parallel matrix factorization for recommender systems. KAIS, 2013.\n\n[31] H. Yun, H.-F. Yu, C.-J. Hsieh, S. Vishwanathan, and I. S. Dhillon. Nomad: Non-locking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion. In VLDB, 2014.\n\n[32] H. Zhang. The restricted strong convexity revisited: Analysis of equivalence to error bound and quadratic growth. ArXiv e-prints, 2015.", "award": [], "sourceid": 2340, "authors": [{"given_name": "Yang", "family_name": "You", "institution": "UC Berkeley"}, {"given_name": "Xiangru", "family_name": "Lian", "institution": "University of Rochester"}, {"given_name": "Ji", "family_name": "Liu", "institution": "University of Rochester"}, {"given_name": "Hsiang-Fu", "family_name": "Yu", "institution": "University of Texas at Austin"}, {"given_name": "Inderjit", "family_name": "Dhillon", "institution": "University of Texas at Austin"}, {"given_name": "James", "family_name": "Demmel", "institution": "UC Berkeley"}, {"given_name": "Cho-Jui", "family_name": "Hsieh", "institution": "UC Davis"}]}