{"title": "Feature Clustering for Accelerating Parallel Coordinate Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 28, "page_last": 36, "abstract": "Large scale $\\ell_1$-regularized loss minimization problems arise in numerous applications such as compressed sensing and high dimensional supervised learning, including classification and regression problems.  High performance algorithms and implementations are critical to efficiently solving these problems.  Building upon previous work on coordinate descent algorithms for $\\ell_1$ regularized problems, we introduce a novel family of algorithms called block-greedy coordinate descent that includes, as special cases, several existing algorithms such as SCD, Greedy CD, Shotgun, and Thread-greedy.  We give a unified convergence analysis for the family of block-greedy algorithms.  The analysis suggests that block-greedy coordinate descent can better exploit parallelism if features are clustered so that the maximum inner product between features in different blocks is small.  Our theoretical convergence analysis is supported with experimental results using data from diverse real-world applications.  
We hope that the algorithmic approaches and convergence analysis we provide will not only advance the field, but will also encourage researchers to systematically explore the design space of algorithms for solving large-scale $\\ell_1$-regularization problems.", "full_text": "Feature Clustering for Accelerating Parallel Coordinate Descent\n\nChad Scherrer\nIndependent Consultant\nYakima, WA\nchad.scherrer@gmail.com\n\nAmbuj Tewari\nDepartment of Statistics\nUniversity of Michigan\nAnn Arbor, MI\ntewaria@umich.edu\n\nMahantesh Halappanavar\nPacific Northwest National Laboratory\nRichland, WA\nmahantesh.halappanavar@pnnl.gov\n\nDavid J Haglin\nPacific Northwest National Laboratory\nRichland, WA\ndavid.haglin@pnnl.gov\n\nAbstract\n\nLarge-scale ℓ1-regularized loss minimization problems arise in applications such as compressed sensing and high-dimensional supervised learning, including classification and regression problems. High-performance algorithms and implementations are critical to efficiently solving these problems. Building upon previous work on coordinate descent algorithms for ℓ1-regularized problems, we introduce a novel family of algorithms called block-greedy coordinate descent that includes, as special cases, several existing algorithms such as SCD, Greedy CD, Shotgun, and Thread-Greedy. We give a unified convergence analysis for the family of block-greedy algorithms. The analysis suggests that block-greedy coordinate descent can better exploit parallelism if features are clustered so that the maximum inner product between features in different blocks is small. Our theoretical convergence analysis is supported with experimental results using data from diverse real-world applications.
We hope that the algorithmic approaches and convergence analysis we provide will not only advance the field, but will also encourage researchers to systematically explore the design space of algorithms for solving large-scale ℓ1-regularization problems.\n\n1 Introduction\n\nConsider the ℓ1-regularized loss minimization problem\n\n    min_w  (1/n) Σ_{i=1}^n ℓ(y_i, (Xw)_i) + λ ‖w‖₁ ,    (1)\n\nwhere X ∈ ℝ^{n×p} is the design matrix, w ∈ ℝ^p is a weight vector to be estimated, and the loss function ℓ is such that ℓ(y, ·) is a convex differentiable function for each y. This formulation includes ℓ1-regularized least squares (Lasso) (when ℓ(y, t) = (1/2)(y − t)²) and ℓ1-regularized logistic regression (when ℓ(y, t) = log(1 + exp(−y t))). In recent years, coordinate descent (CD) algorithms have been shown to be efficient for this class of problems [Friedman et al., 2007; Wu and Lange, 2008; Shalev-Shwartz and Tewari, 2011; Bradley et al., 2011].\n\nMotivated by the need to solve large-scale ℓ1-regularized problems, researchers have begun to explore parallel algorithms. For instance, Bradley et al. [2011] developed the Shotgun algorithm. More recently, Scherrer et al. [2012] have developed “GenCD”, a generic framework for expressing parallel coordinate descent algorithms. Special cases of GenCD include Greedy CD [Li and Osher, 2009; Dhillon et al., 2011], the Shotgun algorithm of Bradley et al. [2011], and Thread-Greedy CD [Scherrer et al., 2012].\n\nIn fact, the connection between these three special cases of GenCD is much deeper, and more fundamental, than is obvious under the GenCD abstraction. As our first contribution, we describe a general randomized block-greedy algorithm that includes all three as special cases. The block-greedy algorithm has two parameters: B, the total number of feature blocks, and P, the size of the random subset of the B blocks that is chosen at every time step.
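To make the family concrete, here is a minimal NumPy sketch (our own illustration, not the authors' implementation) of one iteration of randomized block-greedy CD applied to the Lasso instance of (1); the soft-thresholding update, the step size 1/β, and the dense-matrix representation are simplifying assumptions:

```python
import numpy as np

def soft_threshold(z, t):
    # Shrink z toward zero by t (prox operator of t * |.|)
    return np.sign(z) * max(abs(z) - t, 0.0)

def block_greedy_step(X, y, w, blocks, P, lam, beta, rng):
    # One iteration of randomized block-greedy CD for
    # (1/2n)||Xw - y||^2 + lam * ||w||_1.
    #   blocks : partition of {0, ..., p-1} into B lists of indices
    #   P      : number of blocks selected this iteration
    #   beta   : second-order upper-bound constant (step size 1/beta)
    n = X.shape[0]
    residual = X @ w - y
    chosen = rng.choice(len(blocks), size=P, replace=False)
    updates = []
    for b in chosen:                           # conceptually parallel
        best_j, best_eta = None, 0.0
        for j in blocks[b]:
            g = X[:, j] @ residual / n         # gradient coordinate g_j
            eta = soft_threshold(w[j] - g / beta, lam / beta) - w[j]
            if abs(eta) > abs(best_eta):       # greedy choice within block
                best_j, best_eta = j, eta
        if best_j is not None:
            updates.append((best_j, best_eta))
    for j, eta in updates:                     # concurrent weight updates
        w[j] += eta
    return w
```

With B = p and P = 1 this reduces to a stochastic CD update, and with P = B to a thread-greedy-style update; the cross-block interference controlled by the convergence analysis appears when several accepted updates touch correlated features.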
For each of these P blocks, we greedily choose, in parallel, a single feature weight to be updated.\n\nSecond, we present a non-asymptotic convergence rate analysis for the randomized block-greedy coordinate descent algorithms for general values of B ∈ {1, . . . , p} (as the number of blocks cannot exceed the number of features) and P ∈ {1, . . . , B}. This result therefore applies to stochastic CD, greedy CD, Shotgun, and thread-greedy CD. Indeed, we build on the analysis and insights in all of these previous works. Our general convergence result, and in particular its instantiation to thread-greedy CD, is novel.\n\nThird, based on the convergence rate analysis for block-greedy, we optimize a certain “block spectral radius” associated with the design matrix. This parameter is a direct generalization of a similar spectral parameter that appears in the analysis of Shotgun. We show that the block spectral radius can be upper bounded by the maximum inner product (or correlation, if features have mean zero) between features in distinct blocks. This motivates the use of correlation-based feature clustering to accelerate the convergence of the thread-greedy algorithm.\n\nFinally, we conduct an experimental study using a simple clustering heuristic. We observe dramatic acceleration due to clustering for smaller values of the regularization parameter, and identify characteristics that demand particularly close attention for heavily regularized problems and that can be improved upon in future work.\n\n2 Block-Greedy Coordinate Descent\n\nScherrer et al. [2012] describe “GenCD”, a generic framework for parallel coordinate descent algorithms, in which a parallel coordinate descent algorithm is determined by specifying a select step and an accept step. At each iteration, features chosen by select are evaluated, and a proposed increment is generated for each corresponding feature weight.
Using this, the accept step then determines which proposals are to be updated.\n\nIn these terms, we consider the block-greedy algorithm that takes as part of the input a partition of the features into B blocks. Given this, each iteration selects features corresponding to a set of P randomly selected blocks, and accepts a single feature from each block, based on an estimate of the resulting reduction in the objective function.\n\nThe pseudocode for randomized block-greedy coordinate descent is given by Algorithm 1. The algorithm can be applied to any function of the form F + R where F is smooth and convex, and R is convex and separable across coordinates. Our objective function (1) satisfies these conditions. The greedy step chooses a feature within a block for which the guaranteed descent in the objective function (if that feature alone were updated) is maximized. This descent is quantified by |η_j|, which is defined precisely in the next section. To arrive at a heuristic understanding, it is best to think of |η_j| as being proportional to the absolute value |∇_j F(w)| of the jth entry in the gradient of the smooth part F. In fact, if R is zero (no regularization) then this heuristic is exact.\n\nAlgorithm 1 Block-Greedy Coordinate Descent\n\nParameters: B (no. of blocks) and P ≤ B (degree of parallelism)\nwhile not converged do\n    Select a random subset of size P from the B available blocks\n    Set J to be the features in the selected blocks\n    Propose increments η_j, j ∈ J    // parallel\n    Accept J′ = {j : η_j has maximal absolute value in its block}\n    Update weight w_j ← w_j + η_j for all j ∈ J′    // parallel\n\nThe two parameters, B and P, of the block-greedy CD algorithm have the ranges B ∈ {1, . . . , p} and P ∈ {1, . . . , B}. Setting these to specific values gives many existing algorithms. For instance, when B = p, each feature is a block on its own. Then, setting P = 1 amounts to randomly choosing a single coordinate and updating it, which gives us the stochastic CD algorithm of Shalev-Shwartz and Tewari [2011]. Shotgun [Bradley et al., 2011] is obtained when B is still p but P ≥ 1. Another extreme is the case when all the features constitute a single block. That is, B = 1. Then block-greedy CD is a deterministic algorithm and becomes the greedy CD algorithm of Li and Osher [2009]; Dhillon et al. [2011]. Finally, we can choose non-trivial values of B that lie strictly between 1 and p. When this is the case, and we choose to update all blocks in parallel each time (P = B), we arrive at the thread-greedy algorithm of Scherrer et al. [2012]. Figure 1 shows a schematic representation of the parameterization of these special cases.\n\n[Figure 1 appears here: schematic of the (B, P) design space.]\n\nFigure 1: The design space of block-greedy coordinate descent.\n\n3 Convergence Analysis\n\nOf course, there is no reason to expect block-greedy CD to converge for all values of B and P. In this section, we give a sufficient condition for convergence and derive a convergence rate assuming this condition.\n\nBradley et al. express the convergence criterion for the Shotgun algorithm in terms of the spectral radius (maximal eigenvalue) ρ(XᵀX). For block-greedy, the corresponding quantity is a bit more complicated. We define\n\n    ρ_block = max_{M ∈ 𝓜} ρ(M)\n\nwhere 𝓜 is the set of all B × B submatrices that we can obtain from XᵀX by selecting exactly one index from each of the B blocks. The intuition is that if features from different blocks are almost orthogonal, then the matrices M in 𝓜 will be close to the identity and will therefore have small ρ(M). Highly correlated features within a block do not increase ρ_block.\n\nAs we said above, we will assume that we are minimizing a “smooth plus separable” convex function F + R, where the convex differentiable function F : ℝᵖ → 
ℝ satisfies a second-order upper bound\n\n    F(w + Δ) ≤ F(w) + ∇F(w)ᵀΔ + (β/2) ΔᵀXᵀXΔ .\n\nIn our case, this inequality will hold as soon as ℓ″(y, t) ≤ β for any y, t (where differentiation is w.r.t. t). The function R is separable across coordinates: R(w) = Σ_{j=1}^p r(w_j). The function λ‖w‖₁ is clearly separable.\n\nThe quantity η_j appearing in Algorithm 1 serves to quantify the guaranteed descent (based on the second-order upper bound) if feature j alone is updated, and is obtained as the solution of the one-dimensional minimization problem\n\n    η_j = argmin_η  ∇_j F(w) η + (β/2) η² + r(w_j + η) − r(w_j) .\n\nNote that if there is no regularization, then η_j is simply −∇_j F(w)/β = −g_j/β (if we denote ∇_j F(w) by g_j for brevity). In the general case, by the first-order optimality conditions for the above one-dimensional convex optimization problem, we have g_j + βη_j + ν_j = 0, where ν_j is a subgradient of r at w_j + η_j. That is, ν_j ∈ ∂r(w_j + η_j). This implies that r(w_j + η_j) − r(w′) ≤ ν_j (w_j + η_j − w′) for any w′.\n\nTheorem 1. Let P be chosen so that\n\n    ε = (P − 1)(ρ_block − 1) / (B − 1)\n\nis less than 1. Suppose the randomized block-greedy coordinate descent algorithm is run on a smooth plus separable convex function f = F + R to produce the iterates {w_k}_{k≥1}. Then the expected accuracy after k steps is bounded as\n\n    E[f(w_k) − f(w⋆)] ≤ (C B R₁²) / ((1 − ε) P) · (1/k) .\n\nHere the constant C only depends on (Lipschitz and smoothness constants of) the function F, R₁ is an upper bound on the norms {‖w_k − w⋆‖₁}_{k≥1}, and w⋆ is any minimizer of f.\n\nProof. 
We first calculate the expected change in objective function following the Shotgun analysis. We will use w_b to denote w_{j_b}, where j_b is the feature greedily chosen within block b (and similarly for ν_b, g_b, etc.), and A_j to denote the jth column of X:\n\n    E[f(w′) − f(w)] = P E_b[ η_b g_b + (β/2) η_b² + r(w_b + η_b) − r(w_b) ] + (β/2) P(P−1) E_{b≠b′}[ η_b · η_{b′} · A_{j_b}ᵀ A_{j_b′} ]\n\nDefine the B × B matrix M (that depends on the current iterate w) with entries M_{b,b′} = A_{j_b}ᵀ A_{j_b′}. Then, using r(w_b + η_b) − r(w_b) ≤ ν_b η_b, we continue\n\n    ≤ (P/B) [ ηᵀg + (β/2) ηᵀη + νᵀη ] + (β P(P−1) / (2B(B−1))) [ ηᵀMη − ηᵀη ]\n\nAbove (with some abuse of notation), η, ν and g are B-length vectors with components η_b, ν_b and g_b respectively. By definition of ρ_block, we have ηᵀMη ≤ ρ_block ηᵀη. So, we continue\n\n    ≤ (P/B) [ ηᵀg + (β/2) ηᵀη + νᵀη ] + (β P(P−1) / (2B(B−1))) (ρ_block − 1) ηᵀη\n\nUsing ν = −g − βη, the first bracket equals ηᵀg + (β/2) ηᵀη − gᵀη − β ηᵀη = −(β/2) ηᵀη. Simplifying, we get\n\n    E[f(w′) − f(w)] ≤ −(βP/(2B)) (1 − ε) ‖η‖₂²    where    ε = (P − 1)(ρ_block − 1) / (B − 1)\n\nshould be less than 1.\n\nNow note that ‖η‖₂² = Σ_b |η_{j_b}|² = ‖η‖²_{∞,2}, where the “infinity-2” norm ‖·‖_{∞,2} of a p-vector is, by definition, as follows: take the ℓ∞ norm within each block and take the ℓ2 norm of the resulting values. Note that in the second step above, we moved from a B-length η to a p-length η. This gives us\n\n    E[f(w′) − f(w)] ≤ −(β(1 − ε)P/(2B)) ‖η‖²_{∞,2} .\n\nFor the rest of the proof, assume λ = 0. In that case η = −g/β. Thus, convexity and the fact that the dual norm of the “infinity-2” norm is the “1-2” norm give\n\n    f(w) − f(w⋆) ≤ ∇f(w)ᵀ(w − w⋆) ≤ ‖∇f(w)‖_{∞,2} · ‖w − w⋆‖_{1,2} .\n\nSince η = −g/β, we have ‖∇f(w)‖_{∞,2} = β‖η‖_{∞,2}, and putting the last two inequalities together gives (for any upper bound R₁ on ‖w − w⋆‖₁ ≥ ‖w − w⋆‖_{1,2})\n\n    E[f(w′) − f(w)] ≤ −((1 − ε)P / (2βB R₁²)) (f(w) − f(w⋆))² .\n\nDefining the accuracy α_k = f(w_k) − f(w⋆), we translate the above into the recurrence\n\n    E[α_{k+1} − α_k] ≤ −((1 − ε)P / (2βB R₁²)) E[α_k²]\n\nand by Jensen’s inequality we have (E[α_k])² ≤ E[α_k²], and therefore\n\n    E[α_{k+1}] − E[α_k] ≤ −((1 − ε)P / (2βB R₁²)) (E[α_k])²\n\nwhich solves to (up to a universal constant factor)\n\n    E[α_k] ≤ (2βB R₁²) / ((1 − ε)P) · (1/k) .\n\nEven when λ > 0, we can still relate ‖η‖_{∞,2} to f(w) − f(w⋆), but the argument is a little more involved. We refer the reader to the supplementary material for more details.\n\nIn particular, consider the case where all blocks are updated in parallel, as in the thread-greedy coordinate descent algorithm of Scherrer et al. [2012]. Then P = B and there is no randomness in the algorithm, yielding the following corollary.\n\nCorollary 2. Suppose the block-greedy coordinate descent algorithm with P = B (thread-greedy) is run on a smooth plus separable convex function f = F + R to produce the iterates {w_k}_{k≥1}. If ρ_block < 2, then\n\n    f(w_k) − f(w⋆) = O( 1 / ((2 − ρ_block) k) ) .\n\n4 Feature Clustering\n\nThe convergence analysis of Section 3 shows that we need to minimize the block spectral radius. Directly finding a clustering that minimizes ρ_block is a computationally daunting task. Even with equal-sized blocks, the number of possible partitions is p!/((p/B)!)^B. In the absence of an efficient search strategy for this enormous space, we find it convenient to work instead in terms of the inner products of features from distinct blocks. The following proposition makes the connection between these approaches precise.\n\nProposition 3. 
Let S ∈ ℝ^{B×B} be positive semidefinite, with S_ii = 1 and |S_ij| < ε for i ≠ j. Then the spectral radius of S has the upper bound\n\n    ρ(S) ≤ 1 + (B − 1) ε .\n\nProof. Let x be the eigenvector corresponding to the largest eigenvalue of S, scaled so that ‖x‖₁ = 1. Then\n\n    ρ(S) = ‖Sx‖₁ = Σ_i | x_i + Σ_{j≠i} S_ij x_j | ≤ Σ_i ( |x_i| + ε Σ_{j≠i} |x_j| ) = 1 + (B − 1) ε .\n\nProposition 3 tells us that we can partition the features into clusters using a heuristic approach that strives to minimize the maximum absolute inner product between the features (columns of the design matrix) X_i and X_j, where i and j are features in different blocks.\n\n4.1 Clustering Heuristic\n\nGiven p features and B blocks, we wish to distribute the features evenly among the blocks, attempting to minimize the absolute inner products between features across blocks. Moreover, we require an approach that is efficient, since any time spent clustering could instead have been used for iterations of the main algorithm. We describe a simple heuristic that builds uniform-sized clusters of features. To construct a given block, we select a feature as a “seed”, and assign the features nearest to the seed (in terms of absolute inner product) to the same block. Because inner products with very sparse features result in a large number of zeros, we choose at each step the most dense unassigned feature as the seed. Algorithm 2 provides a detailed description. This heuristic requires computation of O(Bp) inner products. 
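As a concrete illustration (our own dense-NumPy sketch of this heuristic, not the authors' code; a production version would operate on sparse columns and parallelize the inner-product loop):

```python
import numpy as np

def cluster_features(X, B):
    # Seed-based clustering of the p columns of X into B blocks of
    # (nearly) equal size ceil(p / B): seed each block with the densest
    # unassigned feature, then add the unassigned features with the
    # largest absolute inner product against the seed.
    p = X.shape[1]
    size = -(-p // B)                             # ceil(p / B)
    nnz = (X != 0).sum(axis=0)                    # density of each feature
    unassigned = list(range(p))
    blocks = []
    for _ in range(B - 1):
        s = max(unassigned, key=lambda j: nnz[j])             # densest seed
        score = {j: abs(float(X[:, s] @ X[:, j])) for j in unassigned}
        take = sorted(unassigned, key=lambda j: -score[j])[:size]
        blocks.append(take)
        unassigned = [j for j in unassigned if j not in take]
    blocks.append(unassigned)                     # last block gets the rest
    return blocks
```

Since the seed scores highest against itself it lands in its own block, and each block costs O(p) inner products, giving the O(Bp) total noted above.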
In practice it is very fast — less than three seconds for even the large KDDA dataset.\n\nAlgorithm 2 A heuristic for clustering p features into B blocks, based on correlation\n\nU ← {1, · · · , p}\nfor b = 1 to B − 1 do\n    s ← argmax_{j ∈ U} NNZ(X_j)\n    for j ∈ U do    // parallel\n        c_j ← |⟨X_s, X_j⟩|\n    J_b ← {j yielding the ⌈p/B⌉ largest values of c_j}\n    U ← U \\ J_b\nJ_B ← U\nreturn {J_b | b = 1, · · · , B}\n\nName       # Features   # Samples    # Nonzeros   Source\nNEWS20      1,355,191      19,996     9,097,916   Keerthi and DeCoste [2005]\nREUTERS        47,237      23,865     1,757,800   Lewis et al. [2004]\nREALSIM        20,958      72,309     3,709,083   RealSim\nKDDA       20,216,830   8,407,752   305,613,510   Lo et al. [2011]\n\nTable 1: Summary of input characteristics.\n\n5 Experimental Setup\n\nPlatform: All our experiments are conducted on a 48-core system comprising 4 sockets and 8 banks of memory. Each socket is an AMD Opteron processor codenamed Magny-Cours, which is a multichip processor with two 6-core chips on a single die. Each 6-core processor is equipped with a three-level memory hierarchy as follows: (i) 64 KB of L1 cache for data and 512 KB of L2 cache that are private to each core, and (ii) 12 MB of L3 cache that is shared among the 6 cores. Each 6-core processor is linked to a 32 GB memory bank with an independent memory controller, leading to a total system memory of 256 GB (32 × 8) that can be globally addressed from each core. The four sockets are interconnected using HyperTransport-3 technology¹.\n\nDatasets: A variety of datasets were chosen² for experimentation; these are summarized in Table 1. We consider four datasets: (i) NEWS20 contains about 20,000 UseNet postings from 20 newsgroups. The data was gathered by Ken Lang at Carnegie Mellon University circa 1995. (ii) REUTERS is the RCV1-v2/LYRL2004 Reuters text data described by Lewis et al. [2004]. 
In this term-document matrix, each example is a training document, and each feature is a term. Nonzero values of the matrix correspond to term frequencies that are transformed using a standard tf-idf normalization. (iii) REALSIM consists of about 73,000 UseNet articles from four discussion groups: simulated auto racing, simulated aviation, real auto racing, and real aviation. The data was gathered by Andrew McCallum while at Just Research circa 1997. We consider classifying real vs. simulated data, irrespective of auto/aviation. (iv) KDDA represents data from the KDD Cup 2010 challenge on educational data mining. The data represents a processed version of the training set of the first problem, algebra 2008 2009, provided by the winner from the National Taiwan University. These four inputs cover a broad spectrum of sizes and structural properties.\n\nImplementation: For the current work, our empirical results focus on thread-greedy coordinate descent with 32 blocks. At each iteration, a given thread must step through the nonzeros of each of its features to compute the proposed increment (the η_j of Section 3) and the estimated benefit of choosing that feature. Once this is complete, the thread (without waiting) enters the line search phase, where it remains until all threads are being updated by less than the specified tolerance. Finally, all updates are performed concurrently. We use OpenMP’s atomic directive to maintain consistency.\n\nTesting framework: We compare the effect of clustering to randomization (i.e., features are randomly assigned to blocks) for a variety of values of the regularization parameter λ. 
To test the effect of clustering for very sparse weights, we first let λ0 be the largest power of ten that leads to any nonzero weight estimates. This is followed by the next three consecutive powers of ten. For each run, we measure the regularized expected loss and the number of nonzeros at one-second intervals. Times required for clustering and randomization are negligible, and we do not report them here.\n\n¹Further details on AMD Opteron can be found at http://www.amd.com/us/products/embedded/processors/opteron/Pages/opteron-6100-series.aspx.\n²From http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/\n\n[Figure 2 appears here: for each dataset, plots of regularized expected loss (top) and number of nonzeros (bottom) against time, in panels (a) NEWS20, λ0 = 10^-4; (b) REUTERS, λ0 = 10^-4; (c) REALSIM, λ0 = 10^-4; (d) KDDA, λ0 = 10^-6.]\n\nFigure 2: Convergence results. For each dataset, we show the regularized expected loss (top) and number of nonzeros (bottom), using powers of ten as regularization parameters. Results for randomized features are shown in black, and those for clustered features are shown in red. Note that the allowed running time for KDDA was ten times that of the other datasets.\n\n                          λ = 10^-4              λ = 10^-5              λ = 10^-6\n                      Randomized  Clustered  Randomized  Clustered  Randomized  Clustered\nActive blocks                 32          6          32         32          32         32\nIterations per second        153       12.9         152       12.3         136       12.3\nNNZ @ 1K sec                 184        215         797       8592        1248      19473\nObjective @ 1K sec         0.472      0.591       0.264      0.321       0.206      0.136\nNNZ @ 10K iter                74        203          82       8812         110      19919\nObjective @ 10K iter       0.570      0.593       0.515      0.328       0.472      0.141\n\nTable 2: The effect of feature clustering, for REUTERS.\n\n6 Results\n\nFigure 2 shows the regularized expected loss (top) and number of nonzeros (bottom) for several values of the regularization parameter λ. Black and red curves indicate randomly-permuted features and clustered features, respectively. The starting value of λ was 10^-4 for all data except KDDA, which required λ = 10^-6 in order to yield any nonzero weights.\nIn the upper plots, within a color, the order of the four curves, top to bottom, corresponds to successively decreasing values of λ. Note that a larger value of λ results in a sparser solution, with greater regularized expected loss and a smaller number of nonzeros. Thus, for each subfigure of Figure 2, the order of the curves in the lower plot is reversed from that of the upper plot.\nOverall, results across datasets are very consistent. 
For large values of λ, the simple clustering heuristic results in slower convergence, while for smaller values of λ we see considerable benefit. Due to space limitations, we choose a single dataset for which to explore results in greater detail. Of the datasets we tested, REUTERS might reasonably lead to the greatest concern. Like the other datasets, clustered features lead to slow convergence for large λ and fast convergence for small λ. However, REUTERS is particularly interesting because for λ = 10^-5, clustered features seem to provide an initial benefit that does not last; after about 250 seconds the clustered run is overtaken by the run with randomized features.\n\n[Figure 3 appears here: (a) the number of nonzeros per block; (b) regularized expected loss against iterations; (c) number of nonzeros against iterations.]\n\nFigure 3: A closer look at performance characteristics for REUTERS.\n\nTable 2 gives a more detailed summary of the results for REUTERS, for the three largest values of λ. The first row of this table gives the number of active blocks, by which we mean the number of blocks containing any nonzeros. 
For an inactive block, the corresponding thread repeatedly confirms that all weights remain zero without contributing to convergence.\nIn the most regularized case λ = 10^-4, clustered data results in only six active blocks, while for the other cases every block is active. Thus in this case the features corresponding to nonzero weights are colocated within these few blocks, severely limiting the advantage of parallel updates.\nIn the second row, we see that for randomized features, the algorithm is able to get through over ten times as many iterations per second. To see why, note that the amount of work for a given thread is a linear function of the number of nonzeros over all of the features in its block. Thus, the block with the greatest number of nonzeros serves as a bottleneck.\nThe middle two rows of Table 2 summarize the state of each run after 1000 seconds. Note that for this test, randomized features result in faster convergence for the two largest values of λ.\nFor comparison, the final two rows of Table 2 provide a similar summary based instead on the number of iterations. In these terms, clustering is advantageous for all but the largest value of λ.\nFigure 3 shows the source of this problem. First, Figure 3a shows the number of nonzeros over all features in each of the 32 blocks. Clearly the simple heuristic results in poor load-balancing. For comparison, Figures 3b and 3c show convergence rates as a function of the number of iterations.\n\n7 Conclusion\n\nWe have presented convergence results for a family of randomized coordinate descent algorithms that we call block-greedy coordinate descent. This family includes Greedy CD, Thread-Greedy CD, Shotgun, and Stochastic CD. 
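For small problems, the block spectral radius ρ_block of Section 3 can be checked directly by enumerating every choice of one feature per block (our own brute-force illustration; the number of choices grows exponentially in B, so this is a sanity check rather than a practical tool):

```python
import itertools
import numpy as np

def rho_block(X, blocks):
    # Brute-force block spectral radius: the maximum, over all ways of
    # picking one feature index per block, of the largest eigenvalue of
    # the resulting B x B submatrix of X^T X.
    G = X.T @ X                                   # Gram matrix
    best = 0.0
    for picks in itertools.product(*blocks):      # one index per block
        M = G[np.ix_(picks, picks)]
        best = max(best, float(np.linalg.eigvalsh(M).max()))
    return best
```

With orthonormal features in different blocks, every submatrix is the identity and ρ_block = 1; duplicating a unit-norm feature across two blocks drives ρ_block to 2, the breaking point in Corollary 2.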
We have shown that convergence depends on ρ_block, the maximal spectral radius over submatrices of XᵀX resulting from the choice of one feature from each block. Even though a simple clustering heuristic helps for smaller values of the regularization parameter, our results also show the importance of considering issues of load-balancing and the distribution of weights for heavily-regularized problems.\nA clear next goal in this work is the development of a clustering heuristic that is relatively well load-balanced and distributes weights for heavily-regularized problems evenly across blocks, while maintaining good computational efficiency.\n\nAcknowledgments\n\nThe authors are grateful for the helpful suggestions of Ken Jarman, Joseph Manzano, and our anonymous reviewers.\nFunding for this work was provided by the Center for Adaptive Super Computing Software - MultiThreaded Architectures (CASS-MT) at the U.S. Department of Energy’s Pacific Northwest National Laboratory. PNNL is operated by Battelle Memorial Institute under Contract DE-ACO6-76RL01830.\n\nReferences\n\nJ Friedman, T Hastie, H Höfling, and R Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 1(2):302–332, 2007.\nT Wu and K Lange. Coordinate descent algorithms for lasso penalized regression. Annals of Applied Statistics, 2:224–244, 2008.\nS Shalev-Shwartz and A Tewari. Stochastic methods for ℓ1-regularized loss minimization. Journal of Machine Learning Research, 12:1865–1892, 2011.\nJ K Bradley, A Kyrola, D Bickson, and C Guestrin. Parallel coordinate descent for L1-regularized loss minimization. In Proceedings of the 28th International Conference on Machine Learning, pages 321–328, 2011.\nC Scherrer, A Tewari, M Halappanavar, and D Haglin. Scaling up parallel coordinate descent algorithms. In International Conference on Machine Learning, 2012.\nY Li and S Osher. 
Coordinate descent optimization for ℓ1 minimization with application to compressed sensing; a greedy algorithm solving the unconstrained problem. Inverse Problems and Imaging, 3:487–503, 2009.\nI S Dhillon, P Ravikumar, and A Tewari. Nearest neighbor based greedy coordinate descent. In Advances in Neural Information Processing Systems 24, pages 2160–2168, 2011.\nD Lewis, Y Yang, T Rose, and F Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.\nS S Keerthi and D DeCoste. A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6:341–361, 2005.\nRealSim. Document classification data gathered by Andrew McCallum, circa 1997. URL: http://people.cs.umass.edu/~mccallum/data.html.\nHung-Yi Lo, Kai-Wei Chang, Shang-Tse Chen, Tsung-Hsien Chiang, Chun-Sung Ferng, Cho-Jui Hsieh, Yi-Kuang Ko, Tsung-Ting Kuo, Hung-Che Lai, Ken-Yi Lin, Chia-Hsuan Wang, Hsiang-Fu Yu, Chih-Jen Lin, Hsuan-Tien Lin, and Shou-de Lin. Feature engineering and classifier ensemble for KDD Cup 2010, 2011. To appear in JMLR Workshop and Conference Proceedings.", "award": [], "sourceid": 26, "authors": [{"given_name": "Chad", "family_name": "Scherrer", "institution": null}, {"given_name": "Ambuj", "family_name": "Tewari", "institution": null}, {"given_name": "Mahantesh", "family_name": "Halappanavar", "institution": null}, {"given_name": "David", "family_name": "Haglin", "institution": null}]}