{"title": "Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 56, "page_last": 65, "abstract": "Due to their simplicity and excellent performance, parallel asynchronous variants of stochastic gradient descent have become popular methods to solve a wide range of large-scale optimization problems on multi-core architectures. Yet, despite their practical success, support for nonsmooth objectives is still lacking, making them unsuitable for many problems of interest in machine learning, such as the Lasso, group Lasso or empirical risk minimization with convex constraints. In this work, we propose and analyze ProxASAGA, a fully asynchronous sparse method inspired by SAGA, a variance reduced incremental gradient algorithm. The proposed method is easy to implement and significantly outperforms the state of the art on several nonsmooth, large-scale problems. We prove that our method achieves a theoretical linear speedup with respect to the sequential version under assumptions on the sparsity of gradients and block-separability of the proximal term. Empirical benchmarks on a multi-core architecture illustrate practical speedups of up to 12x on a 20-core machine.", "full_text": "Breaking the Nonsmooth Barrier: A Scalable Parallel\n\nMethod for Composite Optimization\n\nFabian Pedregosa\n\nINRIA/ENS\u2217\nParis, France\n\nR\u00b4emi Leblond\nINRIA/ENS\u2217\nParis, France\n\nSimon Lacoste-Julien\n\nMILA and DIRO\n\nUniversit\u00b4e de Montr\u00b4eal, Canada\n\nAbstract\n\nDue to their simplicity and excellent performance, parallel asynchronous variants\nof stochastic gradient descent have become popular methods to solve a wide range\nof large-scale optimization problems on multi-core architectures. 
Yet, despite their practical success, support for nonsmooth objectives is still lacking, making them unsuitable for many problems of interest in machine learning, such as the Lasso, group Lasso or empirical risk minimization with convex constraints. In this work, we propose and analyze PROXASAGA, a fully asynchronous sparse method inspired by SAGA, a variance reduced incremental gradient algorithm. The proposed method is easy to implement and significantly outperforms the state of the art on several nonsmooth, large-scale problems. We prove that our method achieves a theoretical linear speedup with respect to the sequential version under assumptions on the sparsity of gradients and block-separability of the proximal term. Empirical benchmarks on a multi-core architecture illustrate practical speedups of up to 12x on a 20-core machine.

1 Introduction

The widespread availability of multi-core computers motivates the development of parallel methods adapted for these architectures. One of the most popular approaches is HOGWILD (Niu et al., 2011), an asynchronous variant of stochastic gradient descent (SGD). In this algorithm, multiple threads run the update rule of SGD asynchronously in parallel. Like SGD, it only requires visiting a small batch of random examples per iteration, which makes it ideally suited for large scale machine learning problems. 
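To make the update pattern concrete, the following is a toy Hogwild-style sketch (not the authors' implementation): several threads apply SGD updates to a shared parameter vector without any locks. The least-squares problem, step size and thread count are made up for illustration.

```python
import threading
import numpy as np

# Toy Hogwild-style asynchronous SGD for least squares (illustrative only).
rng = np.random.default_rng(0)
n, p = 200, 10
A = rng.standard_normal((n, p))
x_true = rng.standard_normal(p)
b = A @ x_true                      # noiseless targets: the optimum is x_true

x = np.zeros(p)                     # shared parameter vector

def worker(seed, n_steps=4000, step_size=0.01):
    local_rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        i = local_rng.integers(n)
        grad = (A[i] @ x - b[i]) * A[i]   # stochastic gradient of 1/2 (a_i^T x - b_i)^2
        x[:] = x - step_size * grad       # lock-free write to the shared vector

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the targets here are noiseless, all stochastic gradients vanish at the optimum, so the shared iterate converges despite the unsynchronized writes; this is the regime in which lock-free parallelism is most benign.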
Due to its simplicity and excellent performance, this parallelization approach has recently been extended to other variants of SGD with better convergence properties, such as SVRG (Johnson & Zhang, 2013) and SAGA (Defazio et al., 2014).
Despite their practical success, existing parallel asynchronous variants of SGD are limited to smooth objectives, making them inapplicable to many problems in machine learning and signal processing. In this work, we develop a sparse variant of the SAGA algorithm and consider its parallel asynchronous variants for general composite optimization problems of the form:

arg min_{x ∈ ℝ^p} f(x) + h(x) , with f(x) := (1/n) Σ_{i=1}^n f_i(x) ,   (OPT)

where each f_i is convex with L-Lipschitz gradient, the average function f is µ-strongly convex and h is convex but potentially nonsmooth. We further assume that h is "simple" in the sense that we have access to its proximal operator, and that it is block-separable, that is, it can be decomposed block coordinate-wise as h(x) = Σ_{B ∈ B} h_B([x]_B), where B is a partition of the coefficients into subsets which we will call blocks and h_B only depends on coordinates in block B. Note that there is no loss of generality in this last assumption as a unique block covering all coordinates is a valid partition, though in this case, our sparse variant of the SAGA algorithm reduces to the original SAGA algorithm and no gain from sparsity is obtained.

*DI École normale supérieure, CNRS, PSL Research University

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

This template models a broad range of problems arising in machine learning and signal processing: the finite-sum structure of f includes the least squares or logistic loss functions; the proximal term h includes penalties such as the ℓ1 or group lasso penalty. 
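As a concrete instance of the template (OPT), the sketch below evaluates a composite objective with a least-squares f and a block-separable group lasso h; all data, block structure and constants are synthetic, chosen only to illustrate the decomposition.

```python
import numpy as np

# An instance of (OPT): f is an average of least-squares losses and h is a
# block-separable group lasso penalty. All values below are synthetic.
rng = np.random.default_rng(0)
n, p = 50, 6
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)
blocks = [[0, 1, 2], [3, 4], [5]]       # B: a partition of the coordinates
lam = 0.1

def f(x):
    # f(x) = (1/n) sum_i f_i(x), with f_i(x) = 1/2 (a_i^T x - b_i)^2
    return 0.5 * np.mean((A @ x - b) ** 2)

def h(x):
    # h(x) = sum_{B in blocks} h_B([x]_B), with h_B = lam * Euclidean norm
    return lam * sum(np.linalg.norm(x[B]) for B in blocks)

x = rng.standard_normal(p)
objective = f(x) + h(x)
```

The ℓ1 penalty is the special case where every block is a single coordinate, in which case h_B reduces to lam * |x_b|.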
Furthermore, this term can be extended-valued, thus allowing for convex constraints through the indicator function.

Contributions. This work presents two main contributions. First, in §2 we describe Sparse Proximal SAGA, a novel variant of the SAGA algorithm which features a reduced cost per iteration in the presence of sparse gradients and a block-separable penalty. Like other variance reduced methods, it enjoys a linear convergence rate under strong convexity. Second, in §3 we present PROXASAGA, a lock-free asynchronous parallel version of the aforementioned algorithm that does not require consistent reads. Our main result states that PROXASAGA obtains (under assumptions) a theoretical linear speedup with respect to its sequential version. Empirical benchmarks reported in §4 show that this method dramatically outperforms state-of-the-art alternatives on large sparse datasets, while the empirical speedup analysis illustrates the practical gains as well as its limitations.

1.1 Related work

Asynchronous coordinate descent. For composite objective functions of the form (OPT), most of the existing literature on asynchronous optimization has focused on variants of coordinate descent. Liu & Wright (2015) proposed an asynchronous variant of (proximal) coordinate descent and proved a near-linear speedup in the number of cores used, given a suitable step size. This approach has been recently extended to general block-coordinate schemes by Peng et al. (2016), to greedy coordinate descent schemes by You et al. (2016) and to non-convex problems by Davis et al. (2016). However, as illustrated by our experiments, in the large sample regime coordinate descent compares poorly against incremental gradient methods like SAGA.

Variance reduced incremental gradient methods and their asynchronous variants. Initially proposed in the context of smooth optimization by Le Roux et al. 
(2012), variance reduced incremental gradient methods have since been extended to minimize composite problems of the form (OPT) (see the table below). Smooth variants of these methods have also recently been extended to the asynchronous setting, where multiple threads run the update rule asynchronously and in parallel. Interestingly, none of the existing methods achieves both simultaneously, i.e., asynchronous optimization of composite problems. Since variance reduced incremental gradient methods have shown state-of-the-art performance in both settings, this generalization is of key practical interest.

Objective    Sequential Algorithm                        Asynchronous Algorithm
Smooth       SVRG (Johnson & Zhang, 2013)                SVRG (Reddi et al., 2015)
             SDCA (Shalev-Shwartz & Zhang, 2013)         PASSCODE (Hsieh et al., 2015, SDCA variant)
             SAGA (Defazio et al., 2014)                 ASAGA (Leblond et al., 2017, SAGA variant)
Composite    PROXSDCA (Shalev-Shwartz et al., 2012)      This work: PROXASAGA
             SAGA (Defazio et al., 2014)
             ProxSVRG (Xiao & Zhang, 2014)

On the difficulty of a composite extension. Two key issues explain the paucity in the development of asynchronous incremental gradient methods for composite optimization. The first issue is related to the design of such algorithms. Asynchronous variants of SGD are most competitive when the updates are sparse and have a small overlap, that is, when each update modifies a small and different subset of the coefficients. This is typically achieved by updating only coefficients for which the partial gradient at a given iteration is nonzero,² but existing schemes such as the lagged updates technique (Schmidt et al., 2016) are not applicable in the asynchronous setting. The second difficulty is related to the analysis of such algorithms. All convergence proofs crucially use the Lipschitz condition on the gradient to bound the noise terms derived from asynchrony. However, in the composite case, the gradient mapping term (Beck & Teboulle, 2009), which replaces the gradient in proximal-gradient methods, does not have a bounded Lipschitz constant. Hence, the traditional proof technique breaks down in this scenario.

²Although some regularizers are sparsity inducing, large scale datasets are often extremely sparse and leveraging this property is crucial for the efficiency of the method.

Other approaches. Recently, Meng et al. (2017); Gu et al. (2016) independently proposed a doubly stochastic method to solve the problem at hand. Following Meng et al. (2017) we refer to it as Async-PROXSVRCD. This method performs coordinate descent-like updates in which the true gradient is replaced by its SVRG approximation. It hence features a doubly-stochastic loop: at each iteration we select a random coordinate and a random sample. Because the selected coordinate block is uncorrelated with the chosen sample, the algorithm can be orders of magnitude slower than SAGA in the presence of sparse gradients. Appendix F contains a comparison of these methods.

1.2 Definitions and notations

By convention, we denote vectors and vector-valued functions in lowercase boldface (e.g. x) and matrices in uppercase boldface (e.g. D). The proximal operator of a convex lower semicontinuous function h is defined as prox_h(x) := arg min_{z ∈ ℝ^p} {h(z) + (1/2)∥x − z∥²}. A function f is said to be L-smooth if it is differentiable and its gradient is L-Lipschitz continuous. A function f is said to be µ-strongly convex if f − (µ/2)∥·∥² is convex. 
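For a concrete example of the proximal operator just defined, the prox of gamma * ||.||_1 has the well-known soft-thresholding closed form; the snippet below checks it numerically against the definition on toy data (numpy only, values illustrative).

```python
import numpy as np

# prox of gamma * ||.||_1 is coordinate-wise soft thresholding; we verify it
# against the definition prox_h(x) = argmin_z { h(z) + 1/2 ||x - z||^2 }.
def prox_l1(x, gamma):
    # sign(x) * max(|x| - gamma, 0), applied coordinate-wise
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def prox_objective(z, x, gamma):
    # the function minimized in the definition of the proximal operator
    return gamma * np.abs(z).sum() + 0.5 * np.sum((x - z) ** 2)

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
z_star = prox_l1(x, 0.3)
best = prox_objective(z_star, x, 0.3)

# no perturbation of z_star should improve on the prox objective
for _ in range(1000):
    z = z_star + 0.1 * rng.standard_normal(5)
    assert prox_objective(z, x, 0.3) >= best
```

The strict convexity of the quadratic term guarantees the minimizer is unique, which is why the random-perturbation check above never finds a better point.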
We use the notation κ := L/µ to denote the condition number for an L-smooth and µ-strongly convex function.³ I_p denotes the p-dimensional identity matrix, 1{cond} the characteristic function, which is 1 if cond evaluates to true and 0 otherwise. The average of a vector or matrix is denoted ᾱ := (1/n) Σ_{i=1}^n α_i. We use ∥·∥ for the Euclidean norm. For a positive semi-definite matrix D, we define its associated distance as ∥x∥²_D := ⟨x, Dx⟩. We denote by [x]_b the b-th coordinate in x. This notation is overloaded so that for a collection of blocks T = {B₁, B₂, ...}, [x]_T denotes the vector x restricted to the coordinates in the blocks of T. For convenience, when T consists of a single block B we use [x]_B as a shortcut of [x]_{B}. Finally, we distinguish 𝔼, the full expectation taken with respect to all the randomness in the system, from E, the conditional expectation of a random i_t (the random index sampled at each iteration by SGD-like algorithms) conditioned on all the "past", which the context will clarify.

2 Sparse Proximal SAGA

Original SAGA algorithm. The original SAGA algorithm (Defazio et al., 2014) maintains two moving quantities: the current iterate x and a table (memory) of historical gradients (α_i)_{i=1}^n. At every iteration, it samples an index i ∈ {1, ..., n} uniformly at random, and computes the next iterate (x⁺, α⁺) according to the following recursion:

u_i = ∇f_i(x) − α_i + ᾱ ;  x⁺ = prox_{γh}(x − γu_i) ;  α_i⁺ = ∇f_i(x) .   (1)

On each iteration, this update rule requires visiting all coefficients even if the partial gradients ∇f_i are sparse. Sparse partial gradients arise in a variety of practical scenarios: for example, in generalized linear models the partial gradients inherit the sparsity pattern of the dataset. Given that large-scale datasets are often sparse,⁴ leveraging this sparsity is crucial for the success of the optimizer.

³Since we have assumed that each individual f_i is L-smooth, f itself is L-smooth – but it could have a smaller smoothness constant. Our rates are in terms of this bigger L/µ, as is standard in the SAGA literature.
⁴For example, in the LibSVM datasets suite, 8 out of the 11 datasets (as of May 2017) with more than a million samples have a density between 10⁻⁴ and 10⁻⁶.

Sparse Proximal SAGA algorithm. We will now describe an algorithm that leverages sparsity in the partial gradients by only updating those blocks that intersect with the support of the partial gradients. Since in this update scheme some blocks might appear more frequently than others, we will need to counterbalance this undesirable effect with a well-chosen block-wise reweighting of the average gradient and the proximal term.
In order to make this block-wise reweighting precise, we define the following quantities. We denote by T_i the extended support of ∇f_i, which is the set of blocks that intersect the support of ∇f_i, formally defined as T_i := {B : supp(∇f_i) ∩ B ≠ ∅, B ∈ B}. For totally separable penalties such as the ℓ1 norm, the blocks are individual coordinates and so the extended support covers the same coordinates as the support. Let d_B := n/n_B, where n_B := Σ_i 1{B ∈ T_i} is the number of times that B ∈ T_i. For simplicity we assume n_B > 0, as otherwise the problem can be reformulated without block B. The update rule in (1) requires computing the proximal operator of h, which involves a full pass on the coordinates. 
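For reference, recursion (1) can be implemented directly in a few lines. The sketch below runs the dense proximal SAGA update on a synthetic lasso problem; the data, regularization strength and iteration budget are made up for illustration, and the step size follows the a/5L schedule of Theorem 1.

```python
import numpy as np

# Dense proximal SAGA, recursion (1), on a synthetic lasso problem:
# f_i(x) = 1/2 (a_i^T x - b_i)^2 and h = lam * ||.||_1.
rng = np.random.default_rng(0)
n, p = 100, 5
A = rng.standard_normal((n, p))
b = A @ rng.standard_normal(p)
lam = 0.1

def grad_i(x, i):
    return (A[i] @ x - b[i]) * A[i]          # gradient of f_i

def prox_l1(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

L = np.sum(A ** 2, axis=1).max()             # smoothness constant of the f_i
gamma = 1.0 / (5 * L)                        # step size a/5L with a = 1

x = np.zeros(p)
alpha = np.array([grad_i(x, i) for i in range(n)])   # memory of past gradients
alpha_bar = alpha.mean(axis=0)

for _ in range(30000):
    i = rng.integers(n)
    g = grad_i(x, i)
    u = g - alpha[i] + alpha_bar                 # u_i = grad f_i(x) - alpha_i + mean(alpha)
    x = prox_l1(x - gamma * u, gamma * lam)      # x+ = prox_{gamma h}(x - gamma u_i)
    alpha_bar += (g - alpha[i]) / n              # keep the running average up to date
    alpha[i] = g                                 # alpha_i+ = grad f_i(x)

# fixed-point residual of the proximal-gradient map (zero at the optimum)
full_grad = A.T @ (A @ x - b) / n
residual = np.linalg.norm(x - prox_l1(x - full_grad, lam))
```

Note how each iteration touches every coordinate through ᾱ and the prox, which is precisely the cost that the sparse variant below removes.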
In our proposed algorithm, we replace h in (1) with the function ϕ_i(x) := Σ_{B ∈ T_i} d_B h_B(x), whose form is justified by the following three properties. First, this function is zero outside T_i, allowing for sparse updates. Second, because of the block-wise reweighting d_B, the function ϕ_i is an unbiased estimator of h (i.e., E ϕ_i = h), a property which will be crucial to prove the convergence of the method. Third, ϕ_i inherits the block-wise structure of h and its proximal operator can be computed from that of h as [prox_{γϕ_i}(x)]_B = [prox_{(d_B γ)h_B}(x)]_B if B ∈ T_i and [prox_{γϕ_i}(x)]_B = [x]_B otherwise. Following Leblond et al. (2017), we will also replace the dense gradient estimate u_i by the sparse estimate v_i := ∇f_i(x) − α_i + D_i ᾱ, where D_i is the diagonal matrix defined block-wise as [D_i]_{B,B} = d_B 1{B ∈ T_i} I_{|B|}. It is easy to verify that the vector D_i ᾱ is a weighted projection onto the support of T_i and E D_i ᾱ = ᾱ, making v_i an unbiased estimate of the gradient.
We now have all necessary elements to describe the Sparse Proximal SAGA algorithm. As the original SAGA algorithm, it maintains two moving quantities: the current iterate x ∈ ℝ^p and a table of historical gradients (α_i)_{i=1}^n, α_i ∈ ℝ^p. At each iteration, the algorithm samples an index i ∈ {1, ..., n} and computes the next iterate (x⁺, α⁺) as:

v_i = ∇f_i(x) − α_i + D_i ᾱ ;  x⁺ = prox_{γϕ_i}(x − γv_i) ;  α_i⁺ = ∇f_i(x) ,   (SPS)

where in a practical implementation the vector ᾱ is updated incrementally at each iteration.
The above algorithm is sparse in the sense that it only requires to visit and update blocks in the extended support: if B ∉ T_i, by the sparsity of v_i and prox_{ϕ_i}, we have [x⁺]_B = [x]_B. 
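The reweighted penalty ϕ_i and its block-wise proximal operator are straightforward to express in code. This sketch uses an ℓ1 penalty where the blocks are single coordinates and the supports T_i are synthetic, and it checks the unbiasedness property E ϕ_i = h numerically.

```python
import numpy as np

# Sketch of the reweighted penalty phi_i for an l1 penalty (blocks = single
# coordinates). The supports T_i below are synthetic toy data.
rng = np.random.default_rng(0)
n, p = 8, 5
lam = 0.7

# extended supports T_i: random coordinate subsets (coordinate i % p is forced
# in so that every block is covered, i.e. n_B > 0)
T = [set(rng.choice(p, size=3, replace=False).tolist()) | {i % p} for i in range(n)]
n_B = np.array([sum(b in Ti for Ti in T) for b in range(p)])  # occurrence counts
d = n / n_B                                                   # reweighting d_B

def h(x):
    return lam * np.abs(x).sum()

def phi(x, i):
    # phi_i(x) = sum_{B in T_i} d_B h_B(x); zero outside T_i
    return lam * sum(d[b] * abs(x[b]) for b in T[i])

def prox_phi(x, i, gamma):
    # [prox_{gamma phi_i}(x)]_B = [prox_{(d_B gamma) h_B}(x)]_B for B in T_i,
    # identity elsewhere: here, coordinate-wise soft thresholding
    out = x.copy()
    for b in T[i]:
        out[b] = np.sign(x[b]) * max(abs(x[b]) - d[b] * gamma * lam, 0.0)
    return out

x = rng.standard_normal(p)
# unbiasedness: averaging phi_i over i recovers the full penalty h
assert np.isclose(np.mean([phi(x, i) for i in range(n)]), h(x))
```

The unbiasedness check works because each coordinate b appears in exactly n_B of the T_i, and the weight d_B = n/n_B cancels that frequency.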
Hence, when the extended support T_i is sparse, this algorithm can be orders of magnitude faster than the naive SAGA algorithm. The extended support is sparse for example when the partial gradients are sparse and the penalty is separable, as is the case of the ℓ1 norm or the indicator function over a hypercube, or when the penalty is block-separable in a way such that only a small subset of the blocks overlap with the support of the partial gradients. Initialization of variables and a reduced storage scheme for the memory are discussed in the implementation details section of Appendix E.
Relationship with existing methods. This algorithm can be seen as a generalization of both the Standard SAGA algorithm and the Sparse SAGA algorithm of Leblond et al. (2017). When the proximal term is not block-separable, then d_B = 1 (for a unique block B) and the algorithm defaults to the Standard (dense) SAGA algorithm. In the smooth case (i.e., h = 0), the algorithm defaults to the Sparse SAGA method. Hence we note that the sparse gradient estimate v_i in our algorithm is the same as the one proposed in Leblond et al. (2017). However, we emphasize that a straightforward combination of this sparse update rule with the proximal update from the Standard SAGA algorithm results in a nonconvergent algorithm: the block-wise reweighting of h is a surprisingly simple but crucial change. We now give the convergence guarantees for this algorithm.
Theorem 1. Let γ = a/5L for any a ≤ 1 and let f be µ-strongly convex (µ > 0). Then Sparse Proximal SAGA converges geometrically in expectation with a rate factor of at least ρ = (1/5) min{1/n, a/κ}. That is, for x_t obtained after t updates, we have the following bound:

E∥x_t − x*∥² ≤ (1 − ρ)^t C₀ , with C₀ := ∥x₀ − x*∥² + (1/5L²) Σ_{i=1}^n ∥α_i⁰ − ∇f_i(x*)∥² .

Remark. 
For the step size γ = 1/5L, the convergence rate is (1 − (1/5) min{1/n, 1/κ}). We can thus identify two regimes: the "big data" regime, n ≥ κ, in which the rate factor is bounded by 1/5n, and the "ill-conditioned" regime, κ ≥ n, in which the rate factor is bounded by 1/5κ. This rate roughly matches the rate obtained by Defazio et al. (2014). While the step size bound of 1/5L is slightly smaller than the 1/3L one obtained in that work, this can be explained by their stronger assumptions: each f_i is strongly convex in their setting, whereas in this work they are only strongly convex on average. All proofs for this section can be found in Appendix B.

Algorithm 1 PROXASAGA (analyzed)
1: Initialize shared variables x and (α_i)_{i=1}^n
2: keep doing in parallel
3:   x̂ = inconsistent read of x
4:   α̂ = inconsistent read of α
5:   Sample i uniformly in {1, ..., n}
6:   S_i := support of ∇f_i
7:   T_i := extended support of ∇f_i in B
8:   [ᾱ]_{T_i} = (1/n) Σ_{j=1}^n [α̂_j]_{T_i}
9:   [δα]_{S_i} = [∇f_i(x̂)]_{S_i} − [α̂_i]_{S_i}
10:  [v̂]_{T_i} = [δα]_{T_i} + [D_i ᾱ]_{T_i}
11:  [δx]_{T_i} = [prox_{γϕ_i}(x̂ − γv̂)]_{T_i} − [x̂]_{T_i}
12:  for B in T_i do
13:    for b ∈ B do
14:      [x]_b ← [x]_b + [δx]_b    ▷ atomic
15:      if b ∈ S_i then
16:        [α_i]_b ← [∇f_i(x̂)]_b
17:      end if
18:    end for
19:  end for
20:  // ('←' denotes shared memory update.)
21: end parallel loop

Algorithm 2 PROXASAGA (implemented)
1: Initialize shared variables x, (α_i)_{i=1}^n, ᾱ
2: keep doing in parallel
3:   Sample i uniformly in {1, ..., n}
4:   S_i := support of ∇f_i
5:   T_i := extended support of ∇f_i in B
6:   [x̂]_{T_i} = inconsistent read of x on T_i
7:   α̂_i = inconsistent read of α_i
8:   [ᾱ]_{T_i} = inconsistent read of ᾱ on T_i
9:   [δα]_{S_i} = [∇f_i(x̂)]_{S_i} − [α̂_i]_{S_i}
10:  [v̂]_{T_i} = [δα]_{T_i} + [D_i ᾱ]_{T_i}
11:  [δx]_{T_i} = [prox_{γϕ_i}(x̂ − γv̂)]_{T_i} − [x̂]_{T_i}
12:  for B in T_i do
13:    for b in B do
14:      [x]_b ← [x]_b + [δx]_b    ▷ atomic
15:      if b ∈ S_i then
16:        [ᾱ]_b ← [ᾱ]_b + (1/n)[δα]_b    ▷ atomic
17:      end if
18:    end for
19:  end for
20:  α_i ← ∇f_i(x̂)    (scalar update)    ▷ atomic
21: end parallel loop

3 Asynchronous Sparse Proximal SAGA

We introduce PROXASAGA – the asynchronous parallel variant of Sparse Proximal SAGA. In this algorithm, multiple cores update a central parameter vector using the Sparse Proximal SAGA update introduced in the previous section, and updates are performed asynchronously. The algorithm parameters are read and written without vector locks, i.e., the vector content of the shared memory can potentially change while a core is reading or writing to main memory coordinate by coordinate. These operations are typically called inconsistent (at the vector level).
The full algorithm is described in Algorithm 1 for its theoretical version (on which our analysis is built) and in Algorithm 2 for its practical implementation. The practical implementation differs from the analyzed algorithm in three points. First, in the implemented algorithm, index i is sampled before reading the coefficients to minimize memory access since only the extended support needs to be read. Second, since our implementation targets generalized linear models, the memory α_i can be compressed into a single scalar in line 20 (see Appendix E). 
Third, ᾱ is stored in memory and updated incrementally instead of recomputed at each iteration.
The rest of the section is structured as follows: we start by describing our framework of analysis; we then derive essential properties of PROXASAGA along with a classical delay assumption. Finally, we state our main convergence and speedup result.

3.1 Analysis framework

As in most of the recent asynchronous optimization literature, we build on the hardware model introduced by Niu et al. (2011), with multiple cores reading and writing to a shared memory parameter vector. These operations are asynchronous (lock-free) and inconsistent:⁵ x̂_t, the local copy of the parameters of a given core, does not necessarily correspond to a consistent iterate in memory.

"Perturbed" iterates. To handle this additional difficulty, contrary to most contributions in this field, we choose the "perturbed iterate framework" proposed by Mania et al. (2017) and refined by Leblond et al. (2017). This framework can analyze variants of SGD which obey the update rule:

x_{t+1} = x_t − γ v(x_t, i_t) , where v verifies the unbiasedness condition E v(x, i_t) = ∇f(x)

and the expectation is computed with respect to i_t. In the asynchronous parallel setting, cores are reading inconsistent iterates from memory, which we denote x̂_t. As these inconsistent iterates are affected by various delays induced by asynchrony, they cannot easily be written as a function of their previous iterates. To alleviate this issue, Mania et al. (2017) choose to introduce an additional quantity for the purpose of the analysis:

x_{t+1} := x_t − γ v(x̂_t, i_t) ,   the "virtual iterate" – which is never actually computed .   (2)

Note that this equation is the definition of this new quantity x_t. This virtual iterate is useful for the convergence analysis and makes for much easier proofs than in the related literature.

⁵This is an extension of the framework of Niu et al. (2011), where consistent updates were assumed.

"After read" labeling. How we choose to define the iteration counter t to label an iterate x_t matters in the analysis. In this paper, we follow the "after read" labeling proposed in Leblond et al. (2017), in which we update our iterate counter, t, as each core finishes reading its copy of the parameters (in the specific case of PROXASAGA, this includes both x̂_t and α̂_t). This means that x̂_t is the (t + 1)-th fully completed read. One key advantage of this approach compared to the classical choice of Niu et al. (2011) – where t is incremented after each successful update – is that it guarantees both that the i_t are uniformly distributed and that i_t and x̂_t are independent. This property is not verified when using the "after write" labeling of Niu et al. (2011), although it is still implicitly assumed in the papers using this approach; see Leblond et al. (2017, Section 3.2) for a discussion of issues related to the different labeling schemes.

Generalization to composite optimization. Although the perturbed iterate framework was designed for gradient-based updates, we can extend it to proximal methods by remarking that in the sequential setting, proximal stochastic gradient descent and its variants can be characterized by the following similar update rule:

x_{t+1} = x_t − γ g(x_t, v_{i_t}, i_t) , with g(x, v, i) := (1/γ)(x − prox_{γϕ_i}(x − γv)) ,   (3)

where as before v verifies the unbiasedness condition E v = ∇f(x). The Proximal Sparse SAGA iteration can be easily written within this template by using ϕ_i and v_i as defined in §2. Using this definition of g, we can define the PROXASAGA virtual iterates as:

x_{t+1} := x_t − γ g(x̂_t, v̂_{i_t}^t, i_t) , with v̂_{i_t}^t := ∇f_{i_t}(x̂_t) − α̂_{i_t}^t + D_{i_t} ᾱ^t ,   (4)

where as in the sequential case, the memory terms are updated as α̂_{i_t}^t = ∇f_{i_t}(x̂_t). Our theoretical analysis of PROXASAGA will be based on this definition of the virtual iterate x_{t+1}.

3.2 Properties and assumptions

Now that we have introduced the "after read" labeling for proximal methods in Eq. (4), we can leverage the framework of Leblond et al. (2017, Section 3.3) to derive essential properties for the analysis of PROXASAGA. We describe below three useful properties arising from the definition of Algorithm 1, and then state a central (but standard) assumption that the delays induced by the asynchrony are uniformly bounded.
Independence: Due to the "after read" global ordering, i_r is independent of x̂_t for all r ≥ t. We enforce the independence for r = t by having the cores read all the shared parameters before their iterations.
Unbiasedness: The term v̂_{i_t}^t is an unbiased estimator of the gradient of f at x̂_t. This property is a consequence of the independence between i_t and x̂_t.
Atomicity: The shared parameter coordinate update of [x]_b on Line 14 is atomic. This means that there are no overwrites for a single coordinate even if several cores compete for the same resources. Most modern processors have support for atomic operations with minimal overhead.
Bounded overlap assumption. We assume that there exists a uniform bound, τ, on the maximum number of overlapping iterations. This means that every coordinate update from iteration t is successfully written to memory before iteration t + τ + 1 starts. Our result will give us conditions on τ to obtain linear speedups.
Bounding x̂_t − x_t. 
The delay assumption of the previous paragraph allows us to express the difference between the real and the virtual iterate using the gradient mapping g_u := g(x̂_u, v̂_{i_u}^u, i_u) as:

x̂_t − x_t = γ Σ_{u=(t−τ)₊}^{t−1} G_u^t g_u , where the G_u^t are p × p diagonal matrices with entries in {0, +1}.   (5)

0 represents instances where both x̂_u and x_u have received the corresponding updates. +1, on the contrary, represents instances where x̂_u has not yet received an update that is already in x_u by definition. This bound will prove essential to our analysis.

3.3 Analysis

In this section, we state our convergence and speedup results for PROXASAGA. The full details of the analysis can be found in Appendix C. Following Niu et al. (2011), we introduce a sparsity measure (generalized to the composite setting) that will appear in our results.
Definition 1. Let Δ := max_{B∈B} |{i : T_i ∋ B}|/n. This is the normalized maximum number of times that a block appears in the extended support. For example, if a block is present in all T_i, then Δ = 1. If no two T_i share the same block, then Δ = 1/n. We always have 1/n ≤ Δ ≤ 1.
Theorem 2 (Convergence guarantee of PROXASAGA). Suppose τ ≤ 1/(10√Δ). For any step size γ = a/L with a ≤ a*(τ) := (1/36) min{1, 6κ/τ}, the inconsistent read iterates of Algorithm 1 converge in expectation at a geometric rate factor of at least ρ(a) = (1/5) min{1/n, a/κ}, i.e., E∥x̂_t − x*∥² ≤ (1 − ρ)^t C̃₀, where C̃₀ is a constant independent of t (≈ (nκ/a) C₀ with C₀ as defined in Theorem 1).
This last result is similar to the original SAGA convergence result and our own Theorem 1, with both an extra condition on τ and on the maximum allowable step size. In the best sparsity case, Δ = 1/n, and we get the condition τ ≤ √n/10. We now compare the geometric rate above to the one of Sparse Proximal SAGA to derive the necessary conditions under which PROXASAGA is linearly faster.
Corollary 1 (Speedup). Suppose τ ≤ 1/(10√Δ). If κ ≥ n, then using the step size γ = 1/36L, PROXASAGA converges geometrically with rate factor Ω(1/κ). If κ < n, then using the step size γ = 1/36nµ, PROXASAGA converges geometrically with rate factor Ω(1/n). In both cases, the convergence rate is the same as for Sparse Proximal SAGA. Thus PROXASAGA is linearly faster than its sequential counterpart up to a constant factor. Note that in both cases the step size does not depend on τ. Furthermore, if τ ≤ 6κ, we can use a universal step size of Θ(1/L) to obtain a similar rate for PROXASAGA as for Sparse Proximal SAGA, thus making it adaptive to local strong convexity since the knowledge of κ is not required.
These speedup regimes are comparable with the best ones obtained in the smooth case, including Niu et al. (2011); Reddi et al. (2015), even though unlike these papers, we support inconsistent reads and nonsmooth objective functions. The one exception is Leblond et al. (2017), where the authors prove that their algorithm, ASAGA, can obtain a linear speedup even without sparsity in the well-conditioned regime. In contrast, PROXASAGA always requires some sparsity. Whether this property for smooth objective functions could be extended to the composite case remains an open problem.
Relative to ASYSPCD, in the best case scenario (where the components of the gradient are uncorrelated, a somewhat unrealistic setting), ASYSPCD can get a near-linear speedup for τ as big as p^{1/4}. Our result states that τ = O(1/√Δ) is necessary for a linear speedup. This means that when Δ ≤ 1/√p 
This means in case \u0394 \u2264 1/\u221ap\nour bound is better than the one obtained for ASYSPCD. Recalling that 1/n \u2264 \u0394 \u2264 1, it appears\nthat PROXASAGA is favored when n is bigger than \u221ap whereas ASYSPCD may have a better bound\notherwise, though this comparison should be taken with a grain of salt given the assumptions we\nhad to make to arrive at comparable quantities. An extended comparison with the related work can\nbe found in Appendix D.\n\n4 Experiments\n\nIn this section, we compare PROXASAGA with related methods on different datasets. Although\nPROXASAGA can be applied more broadly, we focus on \ufffd1 + \ufffd2-regularized logistic regression, a\nmodel of particular practical importance. The objective function takes the form\n\nwhere ai \u2208 Rp and bi \u2208 {\u22121, +1} are the data samples. Following Defazio et al. (2014), we set\n\u03bb1 = 1/n. The amount of \ufffd1 regularization (\u03bb2) is selected to give an approximate 1/10 nonzero\n\n2 \ufffdx\ufffd2\n\n2 + \u03bb2\ufffdx\ufffd1\n\n,\n\n(6)\n\n1\nn\n\nn\ufffdi=1\n\nlog\ufffd1 + exp(\u2212bia\ufffd\n\ni x)\ufffd + \u03bb1\n\n7\n\n\fTable 1: Description of datasets.\n\nDataset\nKDD 2010 (Yu et al., 2010)\nKDD 2012 (Juan et al., 2016)\nCriteo (Juan et al., 2016)\n\nn\n\np\n\n19,264,097\n149,639,105\n45,840,617\n\n1,163,024\n54,686,452\n1,000,000\n\ndensity\n10\u22126\n2 \u00d7 10\u22127\n4 \u00d7 10\u22125\n\nL\n\n28.12\n1.25\n1.25\n\n\u0394\n0.15\n0.85\n0.89\n\nFigure 1: Convergence for asynchronous stochastic methods for \ufffd1 + \ufffd2-regularized logistic\nregression. Top: Suboptimality as a function of time for different asynchronous methods using 1\nand 10 cores. Bottom: Running time speedup as function of the number of cores. PROXASAGA\nachieves signi\ufb01cant speedups over its sequential version while being orders of magnitude faster than\ncompeting methods. ASYSPCD achieves the highest speedups but it also the slowest overall method.\n\ncoef\ufb01cients. 
Implementation details are available in Appendix E. We chose the 3 datasets described in Table 1.

Results. We compare three parallel methods on the aforementioned datasets: PROXASAGA (this work),⁶ ASYSPCD, the asynchronous proximal coordinate descent method of Liu & Wright (2015), and the (synchronous) FISTA algorithm (Beck & Teboulle, 2009), in which the gradient computation is parallelized by splitting the dataset into equal batches. We aim to benchmark these methods in the most realistic scenario possible; to this end we use the following step sizes: 1/(2L) for PROXASAGA and 1/Lc for ASYSPCD, where Lc is the coordinate-wise Lipschitz constant of the gradient, while FISTA uses backtracking line search. The results can be seen in Figure 1 (top) with both one (thus sequential) and ten processors. Two main observations can be made from this figure. First, PROXASAGA is significantly faster on these problems. Second, its asynchronous version offers a significant speedup over its sequential counterpart.
In Figure 1 (bottom) we present the speedup with respect to the number of cores, where speedup is computed as the time to achieve a suboptimality of 10⁻¹⁰ with one core divided by the time to achieve the same suboptimality using several cores. While our theoretical speedups (with respect to the number of iterations) are almost linear, as our theory predicts (see Appendix F), we observe a different story for our running-time speedups. This can be attributed to memory access overhead, which our model does not take into account.
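To make the interplay between the step size and the proximal term concrete, the following is a plain sequential proximal SGD pass on objective (6): a stochastic gradient step on the smooth logistic + ℓ2 part, followed by soft-thresholding for the ℓ1 part. This is an illustrative baseline only, not the PROXASAGA update (which adds variance reduction and asynchrony); all names are hypothetical.

```python
import numpy as np

def prox_sgd_epoch(x, A, b, lam1, lam2, step, rng):
    """One epoch of sequential proximal SGD on objective (6).

    Each iteration takes a stochastic gradient step on the smooth part
    (logistic loss + l2 penalty), then applies the l1 proximal operator.
    """
    n = A.shape[0]
    for i in rng.permutation(n):
        margin = b[i] * (A[i] @ x)
        # gradient of log(1 + exp(-b_i a_i^T x)) + (lam1 / 2) * ||x||^2
        grad = -b[i] * A[i] * (1.0 / (1.0 + np.exp(margin))) + lam1 * x
        x = x - step * grad
        # soft-thresholding = prox of step * lam2 * ||.||_1
        x = np.sign(x) * np.maximum(np.abs(x) - step * lam2, 0.0)
    return x
```

When step * lam2 dominates the magnitude of the gradient steps, every coordinate is thresholded back to zero, mirroring how λ₂ controls the fraction of nonzero coefficients.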
⁶A reference C++/Python implementation is available at https://github.com/fabianp/ProxASAGA

As predicted by our theoretical results, we observe
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\fa high correlation between the \u0394 dataset sparsity measure and the empirical speedup: KDD 2010\n(\u0394 = 0.15) achieves a 11x speedup, while in Criteo (\u0394 = 0.89) the speedup is never above 6x.\nNote that although competitor methods exhibit similar or sometimes better speedups, they remain\norders of magnitude slower than PROXASAGA in running time for large sparse problems. In fact,\nour method is between 5x and 80x times faster (in time to reach 10\u221210 suboptimality) than FISTA\nand between 13x and 290x times faster than ASYSPCD (see Appendix F.3).\n\n5 Conclusion and future work\n\nIn this work, we have described PROXASAGA, an asynchronous variance reduced algorithm with\nsupport for composite objective functions. 
This method builds upon a novel sparse variant of the (proximal) SAGA algorithm that takes advantage of sparsity in the individual gradients. We have proven that this algorithm is linearly convergent under a condition on the step size and that it is linearly faster than its sequential counterpart given a bound on the delay. Empirical benchmarks show that PROXASAGA is orders of magnitude faster than existing state-of-the-art methods.
This work can be extended in several ways. First, we have focused on the SAGA method as the basic iteration loop, but this approach can likely be extended to other proximal incremental schemes such as SGD or ProxSVRG. Second, as mentioned in §3.3, it is an open question whether it is possible to obtain convergence guarantees without any sparsity assumption, as was done for ASAGA.

Acknowledgements

The authors would like to thank their colleagues Damien Garreau, Robert Gower, Thomas Kerdreux, Geoffrey Negiar, Konstantin Mishchenko and Kilian Fatras for their feedback on this manuscript, and Jean-Baptiste Alayrac for support managing the computational resources.
This work was partially supported by a Google Research Award. FP acknowledges support from the chaire Économie des nouvelles données with the data science joint research initiative with the fonds AXA pour la recherche.

References

Bauschke, Heinz and Combettes, Patrick L. Convex analysis and monotone operator theory in Hilbert spaces. Springer, 2011.

Beck, Amir and Teboulle, Marc. Gradient-based algorithms with applications to signal recovery. Convex Optimization in Signal Processing and Communications, 2009.

Davis, Damek, Edmunds, Brent, and Udell, Madeleine. The sound of APALM clapping: faster nonsmooth nonconvex optimization with stochastic asynchronous PALM. In Advances in Neural Information Processing Systems 29, 2016.

Defazio, Aaron, Bach, Francis, and Lacoste-Julien, Simon.
SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

Gu, Bin, Huo, Zhouyuan, and Huang, Heng. Asynchronous stochastic block coordinate descent with variance reduction. arXiv preprint arXiv:1610.09447v3, 2016.

Hsieh, Cho-Jui, Yu, Hsiang-Fu, and Dhillon, Inderjit S. PASSCoDe: parallel asynchronous stochastic dual coordinate descent. In ICML, 2015.

Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

Juan, Yuchin, Zhuang, Yong, Chin, Wei-Sheng, and Lin, Chih-Jen. Field-aware factorization machines for CTR prediction. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016.

Le Roux, Nicolas, Schmidt, Mark, and Bach, Francis R. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, 2012.

Leblond, Rémi, Pedregosa, Fabian, and Lacoste-Julien, Simon. ASAGA: asynchronous parallel SAGA. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

Liu, Ji and Wright, Stephen J. Asynchronous stochastic coordinate descent: parallelism and convergence properties. SIAM Journal on Optimization, 2015.

Mania, Horia, Pan, Xinghao, Papailiopoulos, Dimitris, Recht, Benjamin, Ramchandran, Kannan, and Jordan, Michael I. Perturbed iterate analysis for asynchronous stochastic optimization. SIAM Journal on Optimization, 2017.

Meng, Qi, Chen, Wei, Yu, Jingcheng, Wang, Taifeng, Ma, Zhi-Ming, and Liu, Tie-Yan. Asynchronous stochastic proximal optimization algorithms with variance reduction. In AAAI, 2017.

Nesterov, Yurii. Introductory lectures on convex optimization.
Springer Science & Business Media, 2004.

Nesterov, Yurii. Gradient methods for minimizing composite functions. Mathematical Programming, 2013.

Niu, Feng, Recht, Benjamin, Re, Christopher, and Wright, Stephen. Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, 2011.

Peng, Zhimin, Xu, Yangyang, Yan, Ming, and Yin, Wotao. ARock: an algorithmic framework for asynchronous parallel coordinate updates. SIAM Journal on Scientific Computing, 2016.

Reddi, Sashank J, Hefny, Ahmed, Sra, Suvrit, Poczos, Barnabas, and Smola, Alexander J. On variance reduction in stochastic gradient descent and its asynchronous variants. In Advances in Neural Information Processing Systems, 2015.

Schmidt, Mark, Le Roux, Nicolas, and Bach, Francis. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 2016.

Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 2013.

Shalev-Shwartz, Shai et al. Proximal stochastic dual coordinate ascent. arXiv preprint arXiv:1211.2717, 2012.

Xiao, Lin and Zhang, Tong. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 2014.

You, Yang, Lian, Xiangru, Liu, Ji, Yu, Hsiang-Fu, Dhillon, Inderjit S, Demmel, James, and Hsieh, Cho-Jui. Asynchronous parallel greedy coordinate descent. In Advances in Neural Information Processing Systems, 2016.

Yu, Hsiang-Fu, Lo, Hung-Yi, Hsieh, Hsun-Ping, Lou, Jing-Kai, McKenzie, Todd G, Chou, Jung-Wei, Chung, Po-Han, Ho, Chia-Hua, Chang, Chun-Fu, Wei, Yin-Hsuan, et al. Feature engineering and classifier ensemble for KDD Cup 2010. In KDD Cup, 2010.

Zhao, Tuo, Yu, Mo, Wang, Yiming, Arora, Raman, and Liu, Han. Accelerated mini-batch randomized block coordinate descent method.
In Advances in Neural Information Processing Systems, 2014.