{"title": "SySCD: A System-Aware Parallel Coordinate Descent Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 592, "page_last": 602, "abstract": "In this paper we propose a novel parallel stochastic coordinate descent (SCD) algorithm with convergence guarantees that exhibits strong scalability. We start by studying a state-of-the-art parallel implementation of SCD and identify scalability as well as system-level performance bottlenecks of the respective implementation. We then take a principled approach to develop a new SCD variant which is designed to avoid the identified system bottlenecks, such as limited scaling due to coherence traffic of model sharing across threads, and inefficient CPU cache accesses. Our proposed system-aware parallel coordinate descent algorithm (SySCD) scales to many cores and across numa nodes, and offers a consistent bottom line speedup in training time of up to x12 compared to an optimized asynchronous parallel SCD algorithm and up to x42, compared to state-of-the-art GLM solvers (scikit-learn, Vowpal Wabbit, and H2O) on a range of datasets and multi-core CPU architectures.", "full_text": "SySCD: A System-Aware Parallel\nCoordinate Descent Algorithm\n\nNikolas Ioannou\u2217\nIBM Research\n\nZurich, Switzerland\n\nCelestine Mendler-D\u00fcnner\u2217\u2020\n\nUC Berkeley\n\nBerkeley, California\n\nThomas Parnell\nIBM Research\n\nZurich, Switzerland\n\nnio@zurich.ibm.com\n\nmendler@berkeley.edu\n\ntpa@zurich.ibm.com\n\nAbstract\n\nIn this paper we propose a novel parallel stochastic coordinate descent (SCD)\nalgorithm with convergence guarantees that exhibits strong scalability. 
We start by\nstudying a state-of-the-art parallel implementation of SCD and identify scalability\nas well as system-level performance bottlenecks of the respective implementation.\nWe then take a principled approach to develop a new SCD variant which is designed\nto avoid the identi\ufb01ed system bottlenecks, such as limited scaling due to coherence\ntraf\ufb01c of model sharing across threads, and inef\ufb01cient CPU cache accesses. Our\nproposed system-aware parallel coordinate descent algorithm (SySCD) scales to\nmany cores and across numa nodes, and offers a consistent bottom line speedup\nin training time of up to \u00d712 compared to an optimized asynchronous parallel\nSCD algorithm and up to \u00d742, compared to state-of-the-art GLM solvers (scikit-\nlearn, Vowpal Wabbit, and H2O) on a range of datasets and multi-core CPU\narchitectures.\n\n1\n\nIntroduction\n\nToday\u2019s individual machines offer dozens of cores and hundreds of gigabytes of RAM that can, if used\nef\ufb01ciently, signi\ufb01cantly improve the training performance of machine learning models. In this respect\nparallel versions of popular machine learning algorithms such as stochastic gradient descent (Recht\net al., 2011) and stochastic coordinate descent (Liu et al., 2015; Hsieh et al., 2015a; Richtarik &\nTakac, 2016b) have been developed. These methods either introduce asynchronicity to the sequential\nalgorithms, or they use a mini-batch approach, in order to enable parallelization and better utilization\nof compute resources. However, all of these methods treat machines as a simple, uniform, collection\nof cores. This is far from reality. While modern machines offer ample computation and memory\nresources, they are also elaborate systems with complex topologies, memory hierarchies, and CPU\npipelines. 
As a result, maximizing the performance of parallel training requires algorithms and implementations that are aware of these system-level characteristics and respect their bottlenecks.

Setup. In this work we focus on the training of generalized linear models (GLMs). Our goal is to efficiently solve the following partially separable convex optimization problem using the full compute power available in modern CPUs:

    min_{α ∈ R^n} F(α)   where   F(α) := f(Aα) + ∑_i gi(αi).    (1)

The model vector α ∈ R^n is learned from the training data A ∈ R^{d×n}, the function f is convex and smooth, and the gi are general convex functions. The objective (1) covers primal as well as dual formulations of many popular machine learning models which are widely deployed in industry (Kaggle, 2017). For developing such a system-aware training algorithm we will build on the popular stochastic coordinate descent (SCD) method (Wright, 2015; Shalev-Shwartz & Zhang, 2013). We first identify its performance bottlenecks and then propose several algorithmic optimizations to alleviate them.

∗Equal contribution.
†Work conducted while at IBM Research, Zurich.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Contributions. The main contributions of this work can be summarized as follows:

1.
We propose SySCD, the \ufb01rst system-aware coordinate descent algorithm that is optimized for\n\u2013 cache access patterns: We introduce buckets to design data access patterns that are well\n\naligned with the system architecture.\n\n\u2013 thread scalability: We increase data parallelism across worker threads to avoid data access\n\nbottlenecks and bene\ufb01t from the buckets to reduce permutation overheads.\n\n\u2013 numa-topology: We design a hierarchical numa-aware optimization pattern that respects\n\nnon-uniform data access delays of threads across numa-nodes.\n\n2. We give convergence guarantees for our optimized method and motivate a dynamic re-\n\npartitioning scheme to improve its sample ef\ufb01ciency.\n\n3. We evaluate the performance of SySCD on diverse datasets and across different CPU archi-\ntectures, and we show that SySCD drastically improves the implementation ef\ufb01ciency and the\nscalability when compared to state-of-the-art GLM solvers (scikit-learn Pedregosa et al. (2011),\nVowpal Wabbit Langford (2007), and H2O The H2O.ai team (2015)), resulting in \u00d712 faster\ntraining on average.\n\n2 Background\n\nStochastic coordinate descent (SCD) methods (Wright, 2015; Shalev-Shwartz & Zhang, 2013) have\nbecome one of the key tools for training GLMs, due to their ease of implementation, cheap iteration\ncost, and effectiveness in the primal as well as in the dual. Their popularity has been driving research\nbeyond sequential stochastic solvers and a lot of work has been devoted to map these methods\nto parallel hardware. We will give a short summary in the following, putting emphasis on the\nassumptions made on the underlying hardware.\nPrevious works on parallel coordinate descent (Hsieh et al., 2015a; Parnell et al., 2017; Richtarik &\nTakac, 2016b; Liu et al., 2015) assume that parallel processes are homogeneous and data as well as\nmodel information resides in shared memory which is accessible by all processes. 
Building on these assumptions, Hsieh et al. (2015a); Liu et al. (2015); Liu & Wright (2015) propose asynchronous methods for scaling up SCD: the model resides in shared memory and all processes simultaneously read and write this model vector. A fundamental limitation of such an approach is that its convergence relies on the fact that the model information used to compute each update is not too stale. Thus, asynchronous algorithms are prone to diverge when scaled up to a large number of processes. In addition, the heavy load on the model vector can cause significant runtime delays. Both limitations are more pronounced for dense data, thus we use a dense synthetic dataset to illustrate these effects in Fig 1a; the orange, dashed line shows that convergence suffers from staleness, the gray line shows the respective runtime assuming perfect thread scalability, and the yellow line depicts the measured runtime. The algorithm diverges when scaled across more than 8 threads. Taking another route, Richtarik & Takac (2016b); Bradley et al. (2011) propose a synchronous approach for parallelizing SCD. Such methods come with more robust convergence properties. However, depending on the inherent separability of f, the potential for acceleration can be small. For synthetic, well separable problems, mini-batch SDCA proposed by Richtarik & Takac (2016b) shows almost linear scaling, whereas for correlated objectives or dense datasets, the potential for acceleration given in their theory diminishes. In addition, updates to the shared vector in the synchronous setting are guaranteed to conflict across parallel threads – mini-batch SDCA uses atomic operations3 to serialize those updates; this does not scale as the thread count increases, and especially not in numa machines.
We have applied this method to the same synthetic example used in Fig 1 and we observed virtually no speedup (5%) when using 32 threads.

(a) PASSCoDe  (b) CoCoA
Figure 1: Scalability of existing methods: Training of a logistic regression classifier on a synthetic dense dataset with 100k training examples and 100 features – (a) training using PASSCoDe-wild (Hsieh et al., 2015a) and (b) training using CoCoA (Smith et al., 2018) deployed across threads. Details can be found in the appendix.

Orthogonal to parallel methods, distributed coordinate-based methods have also been the focus of many works, including (Yang, 2013; Ma et al., 2015; Richtarik & Takac, 2016a; Dünner et al., 2018; Smith et al., 2018; Lee & Chang, 2018). Here the standard assumption on the hardware is that processes are physically separate, data is partitioned across them, and communication is expensive. To this end, state-of-the-art distributed first- and second-order methods attempt to pair good convergence guarantees with efficient distributed communication. However, enabling this often means trading convergence for data parallelism (Kaufmann et al., 2018). We have illustrated this tradeoff in Fig 1b where we employ CoCoA Smith et al. (2018) across threads; using 32 threads the number of epochs is increased by ×8 resulting in a speedup of ×4 assuming perfect thread scalability.

3 Code available at https://code.google.com/archive/p/ac-dc/downloads
This small payback\nmakes distributed algorithms generally not well suited to achieving acceleration; they are primarily\ndesigned to enable training of large datasets that do not \ufb01t into a single machine (Smith et al., 2018).\nThe fundamental trade-off between statistical ef\ufb01ciency (how many iterations are needed to con-\nverge) and hardware ef\ufb01ciency (how ef\ufb01cient they can be executed) of deploying machine learning\nalgorithms on modern CPU architectures has previously been studied in Zhang & R\u00e9 (2014). The\nauthors identi\ufb01ed data parallelism as a critical tuning parameter and demonstrate that its choice can\nsigni\ufb01cantly affect performance of any given algorithm.\nThe goal of this work is to go one step further and enable better trade-offs by directly incorporate\nmitigations to critical system-level bottlenecks into the algorithm design. We exploit the shared\nmemory performance available to worker threads within modern individual machines to enable new\nalgorithmic features that improve scalability of parallel coordinate descent, while at the same time\nmaintaining statistical ef\ufb01ciency.\n\n3 Bottleneck Analysis\n\nWe start by analyzing state-of-the-art implementations of sequential and parallel coordinate descent to\nidentify bottlenecks and scalability issues. For the parallel case, we use an optimized implementation\nof PASSCoDe (Hsieh et al., 2015a) as the baseline for this study, which is vectorized and reasonably\nef\ufb01cient. The parallel algorithm operates in epochs and repeatedly divides the n shuf\ufb02ed coordinates\namong the P parallel threads. Each thread then operates asynchronously: reading the current state of\nthe model \u03b1, computing an update for this coordinate and writing out the update to the model \u03b1j\nas well as the shared vector v. The auxiliary vector v := A\u03b1 is kept in memory to avoid recurring\ncomputations. 
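As a minimal illustration of this update scheme, the following sketch instantiates objective (1) for ridge regression (an assumed special case, with f(v) = ½‖v − b‖² and gi(a) = ½λa²; names are hypothetical) and keeps the auxiliary vector v = Aα in sync incrementally instead of recomputing it:

```python
import numpy as np

# Minimal sequential SCD sketch for ridge regression, one assumed
# instantiation of objective (1).  The auxiliary vector v = A @ alpha
# is updated incrementally, which is exactly the caching trick
# described above.

def scd_epoch(A, b, alpha, v, lam, rng):
    n = A.shape[1]
    for j in rng.permutation(n):              # shuffled coordinate order
        x_j = A[:, j]
        # closed-form minimizer of
        #   0.5*||v + delta*x_j - b||^2 + 0.5*lam*(alpha_j + delta)^2
        delta = (x_j @ (b - v) - lam * alpha[j]) / (x_j @ x_j + lam)
        alpha[j] += delta
        v += delta * x_j                      # keep v = A @ alpha in sync
    return alpha, v
```

After enough epochs, alpha approaches the ridge solution (AᵀA + λI)⁻¹Aᵀb, and v equals Aα at every point in the sweep, so the residual never has to be recomputed from scratch.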
Write-contention on v is solved opportunistically in a wild fashion, which in practice is the preferred approach over expensive locking (Parnell et al., 2017; Hsieh et al., 2015a). The parallel SCD algorithm is stated in Appendix A.1 for completeness.

One would expect that, especially for large datasets (e.g., datasets that do not fit in the CPU caches), the runtime would be dominated by (a) the time to compute the inner product required for the coordinate update computation and (b) retrieving the data from memory. While these bottlenecks can generally not be avoided, our performance analysis identified four other bottlenecks that in some cases vastly dominate the runtime:

(B1) Access to model vector. When the model vector α does not fit in the CPU cache, a lot of time is spent in accessing the model. The origin of this overhead is the random nature of the accesses to α: there is very little cache line re-use, as a cache line is brought from memory (64B or 128B), out of which only 8B are used. This issue affects both the parallel and the sequential implementation. For the latter this bottleneck dominates and we found that, by accessing the model in a sequential manner, we can reduce the runtime by ×2.

(B2) Access to the shared vector. For the parallel implementation, we found that writing the updates to the shared vector v across the different threads was the main bottleneck. On top of dominating the runtime, staleness in the shared vector can also negatively impact convergence.

(B3) Non-uniform memory access. When the parallel implementation is deployed across multiple numa nodes, bottleneck (B2) becomes catastrophic, often leading to divergence of the algorithm (see Fig. 1a). This effect can be explained by the fact that the inter-node delay when writing updates is far more pronounced than the intra-node delay.

(B4) Shuffling coordinates. A significant amount of time is spent permuting the coordinates before each epoch in both the parallel and the sequential case. For the latter, we found that by removing the permutation, effectively performing cyclic coordinate descent, we could achieve a further 20% speed-up in runtime on top of removing (B1).

4 Algorithmic Optimizations

In this section we present the main algorithmic optimizations of our new training algorithm which are designed to simultaneously address the system performance bottlenecks (B1)-(B4) identified in the previous section as well as the scalability issue demonstrated in Fig. 1b.

Algorithm 1 SySCD for minimizing (1)
1: Input: Training data matrix A = [x1, ..., xn] ∈ R^{d×n}
2: Initialize model α and shared vector v = ∑_{i=1}^n αi xi.
3: Partition coordinates into buckets of size B.
4: Partition buckets across numa nodes according to {Pk}_{k=1}^K.
5: for t = 1, 2, . . . , T1 do
6:   parfor k = 1, 2, . . . , K across numa nodes do
7:     vk = v
8:     for t = 1, 2, . . . , T2 do
9:       create random partitioning of local buckets across threads {Pk,p}_{p=1}^P
10:      parfor p = 1, 2, . . . , P across threads do
11:        vp = vk
12:        for j = 1, 2, . . . , T3 do
13:          randomly select a bucket B ∈ Pk,p
14:          for i = 1, 2, . . . , T4 do
15:            randomly sample a coordinate j in bucket B
16:            δ = arg min_{δ∈R} f̄(vp + xj δ) + ḡj(αj + δ)
17:            αj = αj + δ
18:            vp = vp + δ xj
19:          end for
20:        end for
21:      end parfor
22:      vk = vk + ∑_p (vp − vk)
23:    end for
24:  end parfor
25:  v = v + ∑_k (vk − v)
26: end for
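To make the control flow of Alg. 1 concrete, the following is a sequential stand-in for its nested loops (an illustrative sketch only: the two parfor levels run concurrently in SySCD, the update callback stands in for the actual coordinate subproblem, and the sketch assumes the bucket count divides evenly across nodes and threads):

```python
import numpy as np

# Sequential stand-in for the nested parallel structure of Alg. 1
# (numa nodes -> threads -> buckets -> SDCA steps).  Names are
# hypothetical; real parfor loops run concurrently.

def syscd_sketch(update, n, K, P, B, T1, T2, T3, T4, rng):
    n_buckets = n // B
    # static partition of buckets across numa nodes
    node_parts = np.array_split(rng.permutation(n_buckets), K)
    for _ in range(T1):                        # global rounds
        for k in range(K):                     # "parfor" over numa nodes
            for _ in range(T2):                # local rounds per node
                # dynamic re-partitioning of this node's buckets
                thread_parts = np.array_split(
                    rng.permutation(node_parts[k]), P)
                for p in range(P):             # "parfor" over threads
                    for _ in range(T3):        # buckets per thread
                        bkt = rng.choice(thread_parts[p])
                        for _ in range(T4):    # SDCA steps per bucket
                            j = bkt * B + rng.integers(B)
                            update(j)          # coordinate update on j
```

With the defaults of Remark 1 (T4 = B, T3 = n/(PB), T2 = 1) each node performs exactly one epoch worth of coordinate updates between synchronizations.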
Our system-aware parallel training algorithm (SySCD) is summarized in Alg. 1 and its convergence properties are analyzed in Sec. 4.4. The following subsections will be accompanied by experimental results illustrating the effect of the individual optimizations. They show training of a logistic regression classifier on the criteo-kaggle dataset (Criteo-Labs, 2013) on a 4 node system with 8 threads per numa node (for the experimental setup, see Sec 5). Results for two additional datasets can be found in the appendix.

Figure 2: Bucket Optimization: Gain achieved by using buckets. Solid lines indicate time, and dashed lines depict number of epochs.

Figure 3: Sensitivity analysis on the bucket size w.r.t. training time and epochs for convergence.

4.1 Bucket Optimization

We have identified the cache line access pattern (B1) and the random shuffling computation (B4) as two critical bottlenecks in the sequential as well as the parallel coordinate descent implementation. To alleviate these in our new method, we introduce the concept of buckets: We partition the coordinates and the respective columns xi of A into buckets and then train a bucket of B consecutive coordinates at a time. Thus, instead of randomizing all coordinates at once, the order in which buckets are processed is randomized, as well as the order of coordinates within a bucket. This modification to the algorithm improves performance in several ways; (i) the model vector α is accessed in a cache-line efficient manner, (ii) the computation overhead of randomizing the coordinates is reduced by 1/B, and (iii) CPU prefetching efficiency on accessing the different coordinates of xi is implicitly improved. For our test case this optimization leads to an average speedup of 63% with only a small toll on convergence, as depicted in Fig.
2.\nThe bucket size B will appear in our convergence rate (Theorem 1) and can be used to control the\nscope of the randomization to trade-off between convergence speed and implementation ef\ufb01ciency.\nWe illustrate the sensitivity of our algorithm to the bucket size B in Fig. 3. We see that the bottom\nline training time decreases signi\ufb01cantly across the board by introducing buckets. The optimal bucket\nsize in Fig. 3 is eight which coincides with the cache line size of the CPU with respect to the model\nvector \u03b1 accesses. Thus we do not need to introduce an additional hyperparameter and can choose\nthe bucket size B at runtime based on the cache line size of the CPU, using linux sysfs.\n\n4.2\n\nIncreasing Data Parallelism\n\nOur second algorithmic optimization mitigates the main scalability bottleneck (B2) of the asyn-\nchronous implementation: write-contention on the shared vector v. We completely avoid this\nwrite-contention by replicating the shared vector across threads to increase data parallelism. To\nrealize this data parallelism algorithmically we transfer ideas from distributed learning. In particular,\nwe employ the CoCoA method (Smith et al., 2018) and map it to a parallel architecture where we\npartition the (buckets of) coordinates across the threads and replicate the shared vector in each one.\nThe global shared vector is therefore only accessed at coarse grain intervals (e.g., epoch boundaries),\nwhere it is updated based on the replicas and broadcasted anew to each thread. Similar to CoCoA we\ncan exploit the typical asymmetry of large datasets and map our problem such that the shared vector\nhas dimensionality d = min(#features, #examples).\nWe have seen in Sec 2 that distributed algorithms such as CoCoA are generally not suited to achieve\nsigni\ufb01cant acceleration with parallelism. 
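The bucketed randomization described in Sec. 4.1 can be sketched as follows (a minimal sketch; names are hypothetical, and the bucket size is derived from an assumed 64B cache line with 8B model entries, as read on Linux from sysfs):

```python
import numpy as np

# Sketch of the bucketed access order: coordinates are split into
# contiguous buckets of B entries; the order of buckets and the order
# within each bucket are randomized, but each bucket is still visited
# as one contiguous, cache-friendly region of the model vector.

CACHE_LINE_BYTES = 64        # e.g. from /sys/devices/system/cpu/cpu0/
                             #      cache/index0/coherency_line_size
B = CACHE_LINE_BYTES // 8    # 8-byte model entries -> bucket size 8

def bucketed_order(n, B, rng):
    order = []
    n_buckets = (n + B - 1) // B
    for k in rng.permutation(n_buckets):        # shuffle bucket order
        bucket = np.arange(k * B, min((k + 1) * B, n))
        order.extend(rng.permutation(bucket))   # shuffle inside bucket
    return np.array(order)

order = bucketed_order(20, B, np.random.default_rng(0))
```

Compared to a full permutation of n coordinates, only n/B bucket indices plus B entries per bucket are shuffled, which is where the reduced randomization overhead comes from.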
This behavior of distributed methods is caused by the static partitioning of the training data across workers which increases the epochs needed for convergence (Smith et al., 2018; Kaufmann et al., 2018) (e.g., see Fig 1b). To alleviate this issue, we propose to combine our multi-threaded implementation with a dynamic re-partitioning scheme. That is, we shuffle all the (buckets of) coordinates at the beginning of each local optimization round (Step 9 of Alg. 1), and then, each thread picks a different set of buckets each time. Such a re-partitioning approach is very effective for convergence when compared to a default static partitioning, as depicted in Fig. 4. It reduces the number of epochs by 54% at the cost of only a small implementation overhead. To the best of our knowledge we are the first to consider such a re-partitioning approach in combination with distributed methods and demonstrate a practical scenario where it pays off – in a classical distributed setting the cost of re-partitioning would be unacceptable.

Figure 4: Static and dynamic partitioning: Gain achieved by dynamic re-partitioning. Solid lines indicate time, and dashed lines depict number of epochs.

Figure 5: Numa-level Optimizations: Gain achieved by numa-awareness. Solid lines indicate time, and dashed lines depict number of epochs.

The intuition behind this approach is the following: In CoCoA (Smith et al., 2018) a block-separable auxiliary model of the objective is constructed. In this model the correlation matrix M = A⊤A is approximated by a block-diagonal version where the blocks are aligned with the partitioning of the data. This allows one to decouple local optimization tasks. However, this also means that correlations between data points on different worker nodes are not considered. A dynamic re-partitioning scheme has the effect of choosing a different block diagonal approximation of M in each step. By randomly re-partitioning coordinates, the off-diagonal elements of M are sampled uniformly at random and thus in expectation a good estimate of M is used. A rigorous analysis of this effect would be an interesting study for future work. However, note that SySCD inherits the strong convergence guarantees of the CoCoA method, independent of the partitioning, and can thus be scaled up safely to a large number of cores in contrast to our asynchronous reference implementation.

4.3 Numa-Level Optimizations

Subsequently, we focus on optimizations related to the numa topology in a multi-numa node system. Depending on the numa node where the data resides and the node on which a thread is running, data access performance can be non-uniform across threads. As we have seen in Fig. 1b and discussed in Sec. 3 this amplifies bottleneck (B3). To avoid this in SySCD, we add an additional level of parallelism and treat each numa node as an independent training node, in the distributed sense. We then deploy a hierarchical scheme: we statically partition the buckets across the numa nodes, and within the numa nodes we use the dynamic re-partitioning scheme introduced in Sec 4.2. We exploit the fact that the training dataset is read-only and thus it does not incur expensive coherence traffic across numa nodes. We do not replicate the training dataset across the nodes and the model vector α is local to each node which holds the coordinates corresponding to its partition Pk. Crucially, each node holds its own replica of the shared vector, which is reduced across nodes periodically. The frequency of synchronization can be steered in Alg.
1 by balancing the total number of updates\nbetween T1 and T2. This again offers a trade off between fast convergence (see Theorem 1) and\nimplementation ef\ufb01ciency. This hierarchical optimization pattern that re\ufb02ects the numa-topology\nresults in a speedup of 33% over a numa-oblivious implementation, as shown in Fig 5. To avoid\nadditional hyperparameters, we dynamically detect the numa topology of the system, as well as the\nnumber of physical cores per node, using libnuma and the sysfs interface. If the number of threads\nrequested by the user is less or equal to the number of cores in one node, we schedule a single node\nsolver. We detect the numa node on which the dataset resides using the move_pages system call.\n\n4.4 Convergence Analysis\n\nWe derive an end-to-end convergence rate for SySCD with all its optimizations as described in\nAlg. 1. We focus on strongly convex gi while every single component of SySCD is also guaranteed\nto converge for general convex gi, see Remark 2 in the Appendix.\nTheorem 1. Consider Algorithm 1 applied to (1). Assume f is \u03b3-smooth and gi are \u00b5-strongly\nconvex functions. Let K be the number of numa nodes and P the number of threads per numa node.\nLet B be the bucket size. Denote T4 the number of SDCA updates performed on each bucket, let T3 be\nthe number of buckets processed locally in each iteration and let T2 be the number of communication\nrounds performed independently on each numa node before global synchronization. 
Then, after T1 outer rounds the suboptimality ε = F(α) − min_α F(α) can be bounded as

    E[ε] ≤ [1 − ((γKcA + µ)/(γKP cA + µ)) (1 − (1 − (1 − θ) µ/(µ + KγcA))^{T2})]^{T1} ε0    (2)

where cA := ‖A‖op and

    θ = 1 − [1 − (B/n) (µ/(µ + cA γKP)) (1 − (1 − (1/n) µ/(µ + γKP))^{T4})]^{T3}.    (3)

Proof Sketch. To derive a convergence rate of Alg. 1 we start at the outermost level. We focus on the two nested for-loops in Step 6 and Step 10 of Alg. 1. They implement a nested version of CoCoA where the outer level corresponds to CoCoA across numa nodes and the inner level to CoCoA across threads. The number of inner iterations T2 is a hyper-parameter of our algorithm steering the accuracy to which the local subproblem assigned to each numa node is solved. Convergence guarantees for such a scheme can be derived from a nested application of (Smith et al., 2018, Theorem 3) similar to (Dünner et al., 2018, Appendix B). Subsequently, we combine this result with the convergence guarantees of the local solver used by each thread. This solver, implementing the bucketing optimization, can be analyzed as a randomized block coordinate descent method, similar to (Dünner et al., 2017, Theorem 1), where each block corresponds to a bucket of coordinates. Each block update is then computed using SDCA (Shalev-Shwartz & Zhang, 2013).
Again, the number of coordinate descent steps T4 forms a hyper-parameter to steer the accuracy of each block update. Combining all these results in a nested manner yields the convergence guarantee presented in Theorem 1. We refer to the Appendix A.3 for a detailed proof.

5 Evaluation

In this section, we evaluate the performance of SySCD in two different single-server multi numa-node environments. We first analyze the scalability of our method and the performance gains achieved over the reference implementation. Then, we compare SySCD with other state-of-the-art GLM solvers available in scikit-learn (Pedregosa et al., 2011) (0.19.2), H2O (The H2O.ai team, 2015) (3.20.0.8), and Vowpal Wabbit (VW) (Langford, 2007) (commit: 5b020c4). We take logistic regression with L2 regularization as a test case. We use two systems with different CPU architectures and numa topologies: a 4-node Intel Xeon (E5-4620) with 8 cores and 128GiB of RAM in each node, and a 2-node IBM POWER9 with 20 cores and 512GiB in each node, 1TiB total. We evaluate on three datasets: (i) the sparse dataset released by Criteo Labs as part of their 2014 Kaggle competition (Criteo-Labs, 2013) (criteo-kaggle), (ii) the dense HIGGS dataset (Baldi et al., 2014) (higgs), and (iii) the dense epsilon dataset from the PASCAL Large Scale Learning Challenge (Epsilon, 2008) (epsilon). Results on epsilon and additional details can be found in the appendix.

Remark 1 (Hyperparameters). The hyperparameters T2, T3, T4 in Alg 1 can be used to optimally tune SySCD to different CPU architectures. However, a good default choice is

    T4 = B,   T3 = n/(P B),   T2 = 1    (4)

such that one epoch (n coordinate updates) is performed across the threads before each synchronization step.
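A tiny helper makes the defaults in (4) concrete (hypothetical names; note that P · T3 · T4 = n, i.e., exactly one epoch per synchronization):

```python
# Hedged sketch of the default hyperparameter choice in eq. (4):
# T4 SDCA steps per bucket, T3 buckets per thread and round, and a
# single local round (T2 = 1) before global synchronization.
def default_hyperparams(n, P, B):
    return {"T4": B, "T3": n // (P * B), "T2": 1}
```

For example, n = 1024 coordinates, P = 8 threads, and bucket size B = 8 give T3 = 16, so each thread touches 16 buckets of 8 coordinates per round.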
We will use these values for all our experiments and did not further tune our method. Further, recall that the bucket size B is set to be equal to the cache line size of the CPU and the number of numa nodes K as well as the number of threads P is automatically detected.

5.1 Scalability

We first investigate the thread scalability of SySCD. Results, showing the speedup in time per epoch (an epoch corresponds to n coordinate updates) over the sequential version, are depicted in Fig 6. We see that SySCD scales almost linearly across the two systems and thus the main scalability bottleneck (B2) of our reference implementation is successfully mitigated. The 4 node system shows a slightly lower absolute speedup beyond 1-node (8 threads), which is expected due to the higher overhead when accessing memory on different numa nodes compared to the 2-node system.

(a) higgs  (b) criteo-kaggle
Figure 6: Strong thread scalability of SySCD w.r.t runtime per epoch with increasing thread counts for the two different systems: a 2 node (P9) and a 4 node (X86_64) machine.

(a) higgs  (b) criteo-kaggle
Figure 7: Training time w.r.t. thread count for the reference PASSCoDe and our optimized SySCD implementation on a 2 node (P9) and a 4 node (X86_64) machine.

Note that in our experiments we disable simultaneous multi-threading (SMT), since in practice we often find enabling SMT leads to worse overall performance. Therefore, the maximal thread count corresponds to the number of physical cores present in the machine. In order to illustrate how SySCD scales when the number of threads exceeds the number of physical cores, we enabled SMT4 (4 hardware threads per core) on the P9 machine and re-ran the experiment from Fig. 6b. The results are shown in Figure 16 in the appendix.
As expected, we see linear scaling up to the number of physical\nCPU cores (in this case 40), after which we start to see diminishing returns due to the inherent\ninef\ufb01ciency of SMT4 operation. We thus recommend disabling SMT when deploying SySCD.\n\n5.2 Bottom Line Performance\n\nSecond, we compare the performance of our new SySCD algorithm to the PASSCoDe baseline\nimplementation. Convergence is declared if the relative change in the learned model across iterations\nis below a threshold. We have veri\ufb01ed that all implementations exhibit the same test loss after training,\napart from the PASSCoDe implementation which can converge to an incorrect solution when using\nmany threads (Hsieh et al., 2015b). Fig 7 illustrates the results for two different systems. Comparing\nagainst PASSCoDe with the best performing thread count, SySCD achieves a speedup of \u00d75.4 (P9)\nand \u00d74.8 (X86_64) on average across datasets. The larger performance improvement observed for\nthe 2-node system relative to the 4-node system, in particular on the higgs dataset, can be attributed\nto the increased memory bandwidth.\n\n5.3 Comparison with sklearn, VW, and H2O\n\nWe \ufb01nally compare the performance of our new solver against widely used frameworks for training\nGLMs. We compare with scikit-learn (Pedregosa et al., 2011), using different solvers (liblinear,\nlbfgs, sag), with H2O (The H2O.ai team, 2015), using its multi-threaded auto solver and with\nVW (Langford, 2007), using its default solver. Care was taken to ensure that the regularization\nstrength was equivalent across all experiments, and that the reported time did not include parsing\nof text and loading of data. Results showing training time against test loss for the different solvers,\non the two systems, are depicted in Fig 8. We add results for SySCD with single (SySCD 1T) and\nmaximum (SySCD MT) thread counts. 
Overall, SySCD MT is over ×10 faster, on average, than the best-performing alternative solver. The best competitor is VW for criteo-kaggle and H2O for higgs. H2O results are not shown in Figs. 8a and 8b because we could not train the criteo-kaggle dataset in a reasonable amount of time (> 16 hours), even when using the max_active_predictors option.

Figure 8: Comparing single- and multi-threaded implementations of SySCD against state-of-the-art GLM solvers available in scikit-learn, VW, and H2O. Panels: (a) criteo-kaggle - x86_64, (b) criteo-kaggle - P9, (c) higgs - x86_64, (d) higgs - P9.

6 Conclusion

We have shown that the performance of existing parallel coordinate descent algorithms, which assume a simplistic model of the parallel hardware, suffers significantly from system bottlenecks that prevent them from taking full advantage of modern CPUs. In this light, we have proposed SySCD, a new system-aware parallel coordinate descent algorithm that respects the cache structures, data access patterns, and numa topology of modern systems to improve implementation efficiency, and that exploits fast data access by all parallel threads to reshuffle data and improve convergence. Our new algorithm achieves a gain of up to ×12 compared to a state-of-the-art system-agnostic parallel coordinate descent algorithm.
In addition, SySCD enjoys strong scalability and convergence guarantees and is thus well suited for deployment in production.

References

Baldi, P., Sadowski, P., and Whiteson, D. O. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5:4308, 2014.

Bradley, J. K., Kyrola, A., Bickson, D., and Guestrin, C. Parallel coordinate descent for l1-regularized loss minimization. In Proceedings of the 28th International Conference on Machine Learning, ICML'11, pp. 321–328, 2011. ISBN 978-1-4503-0619-5.

Criteo-Labs. Terabyte click logs dataset. http://labs.criteo.com/2013/12/download-terabyte-click-logs/, 2013. Online; accessed: 2018-01-25.

Dünner, C., Forte, S., Takac, M., and Jaggi, M. Primal-dual rates and certificates. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 783–792, New York, New York, USA, 20–22 Jun 2016. PMLR.

Dünner, C., Parnell, T., and Jaggi, M. Efficient use of limited-memory accelerators for linear learning on heterogeneous systems. In Advances in Neural Information Processing Systems 30, pp. 4258–4267. 2017.

Dünner, C., Lucchi, A., Gargiani, M., Bian, A., Hofmann, T., and Jaggi, M. A distributed second-order algorithm you can trust.
In Proceedings of the 35th International Conference on Machine Learning, volume 80, pp. 1358–1366, 2018.

Dünner, C., Parnell, T., Sarigiannis, D., Ioannou, N., Anghel, A., Ravi, G., Kandasamy, M., and Pozidis, H. Snap ML: A hierarchical framework for machine learning. In Advances in Neural Information Processing Systems 31, pp. 250–260. 2018.

Epsilon. Pascal large scale learning challenge. http://www.k4all.org/project/large-scale-learning-challenge, 2008.

Hsieh, C.-J., Yu, H.-F., and Dhillon, I. PASSCoDe: Parallel asynchronous stochastic dual co-ordinate descent. In International Conference on Machine Learning, pp. 2370–2379, 2015a.

Hsieh, C.-J., Yu, H.-F., and Dhillon, I. PASSCoDe: Parallel asynchronous stochastic dual co-ordinate descent. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 2370–2379, Lille, France, 07–09 Jul 2015b. PMLR.

Kaggle. Kaggle machine learning and data science survey, 2017. https://www.kaggle.com/surveys/2017.

Kaggle. Libsvm data: Classification, regression, and multi-label, 2019. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

Kaufmann, M., Parnell, T. P., and Kourtis, K. Elastic CoCoA: Scaling in to improve convergence. arXiv:1811.02322, 2018.

Langford, J. Vowpal Wabbit. https://github.com/JohnLangford/vowpal_wabbit/wiki, 2007.

Lee, C.-p. and Chang, K.-W. Distributed block-diagonal approximation methods for regularized empirical risk minimization. arXiv:1709.03043, 2018.

Liu, J. and Wright, S. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal on Optimization, 25(1):351–376, 2015.

Liu, J., Wright, S. J., Ré, C., Bittorf, V., and Sridhar, S. An asynchronous parallel stochastic coordinate descent algorithm.
The Journal of Machine Learning Research, 16(1):285–322, 2015.

Ma, C., Smith, V., Jaggi, M., Jordan, M. I., Richtárik, P., and Takáč, M. Adding vs. averaging in distributed primal-dual optimization. In Proceedings of the 32nd International Conference on Machine Learning, ICML, pp. 1973–1982, 2015.

Parnell, T. P., Dünner, C., Atasu, K., Sifalakis, M., and Pozidis, H. Large-scale stochastic learning using GPUs. 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 419–428, 2017.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Recht, B., Re, C., Wright, S., and Niu, F. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 693–701, 2011.

Richtarik, P. and Takac, M. Distributed coordinate descent method for learning with big data. Journal of Machine Learning Research, 17:1–25, 2016a.

Richtarik, P. and Takac, M. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156:433–484, 2016b.

Shalev-Shwartz, S. and Zhang, T. Stochastic dual coordinate ascent methods for regularized loss. JMLR, 14(1):567–599, 2013. ISSN 1532-4435.

Smith, V., Forte, S., Ma, C., Takáč, M., Jordan, M., and Jaggi, M. CoCoA: A general framework for communication-efficient distributed optimization. Journal of Machine Learning Research, 18, 2018.

The H2O.ai team. h2o: Python interface for H2O. http://www.h2o.ai, 2015. Python package version 3.20.0.8.

Wright, S. J. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.

Yang, T.
Trading computation for communication: Distributed stochastic dual coordinate ascent. In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 26, pp. 629–637. Curran Associates, Inc., 2013.

Zhang, C. and Ré, C. DimmWitted: A study of main-memory statistical analytics. Proc. VLDB Endow., 7(12):1283–1294, August 2014. ISSN 2150-8097.