{"title": "Sparsified SGD with Memory", "book": "Advances in Neural Information Processing Systems", "page_first": 4447, "page_last": 4458, "abstract": "Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by only sending the most significant entries of the stochastic gradient (top-k sparsification). Whilst such schemes showed very promising performance in practice, they have eluded theoretical analysis so far.\n\nIn this work we analyze Stochastic Gradient Descent (SGD) with k-sparsification or compression (for instance top-k or random-k) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in memory).  That is, communication can be reduced by a factor of the dimension of the problem (sometimes even more) whilst still converging at the same rate. We present numerical experiments to illustrate the theoretical findings and the good scalability for distributed applications.", "full_text": "Sparsi\ufb01ed SGD with Memory\n\nSebastian U. Stich\n\nJean-Baptiste Cordonnier\n\nMartin Jaggi\n\nMachine Learning and Optimization Laboratory (MLO)\n\nEPFL, Switzerland\n\nAbstract\n\nHuge scale machine learning problems are nowadays tackled by distributed op-\ntimization algorithms, i.e. algorithms that leverage the compute power of many\ndevices for training. The communication overhead is a key bottleneck that hinders\nperfect scalability. 
Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by only sending the most significant entries of the stochastic gradient (top-k sparsification). Whilst such schemes showed very promising performance in practice, they have eluded theoretical analysis so far.
In this work we analyze Stochastic Gradient Descent (SGD) with k-sparsification or compression (for instance top-k or random-k) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in memory). That is, communication can be reduced by a factor of the dimension of the problem (sometimes even more) whilst still converging at the same rate. We present numerical experiments to illustrate the theoretical findings and the good scalability for distributed applications.

1 Introduction

Stochastic Gradient Descent (SGD) [29] and variants thereof (e.g. [10, 16]) are among the most popular optimization algorithms in machine- and deep-learning [5]. SGD consists of iterations of the form

x_{t+1} := x_t − η_t g_t,   (1)

for iterates x_t, x_{t+1} ∈ R^d, stepsize (or learning rate) η_t > 0, and stochastic gradient g_t with the property E[g_t] = ∇f(x_t), for a loss function f : R^d → R. SGD addresses the computational bottleneck of full gradient descent, as the stochastic gradients can in general be computed much more efficiently than a full gradient ∇f(x_t). However, note that in general both g_t and ∇f(x_t) are dense vectors¹ of size d, i.e. SGD does not address the communication bottleneck of gradient descent, which occurs as a roadblock both in distributed as well as parallel training. 
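In code, one step of iteration (1) is a single line; the following minimal sketch (the quadratic toy objective and all names here are our own illustration, not from the paper) makes the notation concrete:

```python
import numpy as np

def sgd(grad_sample, x0, eta, T, rng):
    """Vanilla SGD: x_{t+1} = x_t - eta_t * g_t, with E[g_t] = grad f(x_t)."""
    x = x0.copy()
    for t in range(T):
        g = grad_sample(x, rng)      # stochastic gradient, unbiased estimate
        x = x - eta(t) * g           # iteration (1)
    return x

# Toy problem: f(x) = (1/n) sum_i 0.5 * ||x - a_i||^2, minimized at mean(a_i).
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
grad_sample = lambda x, rng: x - A[rng.integers(len(A))]
x_T = sgd(grad_sample, np.zeros(5), lambda t: 1.0 / (t + 10), 5000, rng)
```

Note that g_t here is a dense vector of size d; the compression schemes discussed next act on exactly this quantity.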
In the setting of distributed training, communicating the stochastic gradients to the other workers is a major limiting factor for many large scale (deep) learning applications, see e.g. [3, 21, 33, 44]. The same bottleneck can also appear in parallel training, e.g. in the increasingly common setting of a single multi-core machine or device, where locking and bandwidth of memory write operations for the common shared parameter x_t often form the main bottleneck, see e.g. [14, 18, 25].
A remedy to these issues is to apply smaller and more efficient updates comp(g_t) instead of g_t, where comp : R^d → R^d generates a compression of the gradient, for instance by lossy quantization or sparsification. We discuss different schemes below. However, too aggressive compression can hurt the performance, unless it is implemented in a clever way: 1Bit-SGD [33, 37] combines gradient quantization with an error compensation technique, i.e. a memory or feedback mechanism. In this work we leverage this key mechanism, but apply it in the more general setting of SGD. We now sketch how the algorithm uses feedback to correct for errors accumulated in previous iterations. Roughly speaking, the method keeps track of a memory vector m which contains the sum of the information that has been suppressed thus far, i.e. m_{t+1} := m_t + g_t − comp(g_t), and injects this information back in the next iteration, by transmitting comp(m_{t+1} + g_{t+1}) instead of only comp(g_{t+1}). 

¹Note that the stochastic gradients g_t are dense vectors for the setting of training neural networks. The g_t themselves can be sparse for generalized linear models under the additional assumption that the data is sparse.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
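A minimal sketch of this feedback mechanism (our own illustration; `comp` stands for an arbitrary compressor): the key invariant, which follows directly from the update rule, is that the applied updates plus the memory always sum to the full gradient information, so information is only delayed, never lost.

```python
import numpy as np

def step_with_memory(m, g, comp):
    """One communication round: transmit comp(m + g), keep the rest in memory."""
    update = comp(m + g)     # what is actually sent / applied to the iterate
    m_new = m + g - update   # suppressed information, fed back next round
    return update, m_new

# Example compressor: keep only the 2 entries of largest magnitude (top-2).
def top2(v):
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-2:]
    out[idx] = v[idx]
    return out

rng = np.random.default_rng(1)
m = np.zeros(6)
total_sent, total_grad = np.zeros(6), np.zeros(6)
for _ in range(50):
    g = rng.normal(size=6)
    update, m = step_with_memory(m, g, top2)
    total_sent += update
    total_grad += g
# invariant: total_sent + m == total_grad (exactly, up to rounding)
```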
Note that updates of this kind are not unbiased (even if comp(g_{t+1}) would be) and there is also no control over the delay after which the single coordinates are applied. These are some of the (technical) reasons why no theoretical analysis of this scheme existed up to now.
In this paper we give a concise convergence rate analysis for SGD with memory and k-compression operators², such as (but not limited to) top-k sparsification. Our analysis also supports ultra-sparsification operators for which k < 1, i.e. where less than one coordinate of the stochastic gradient is applied on average in (1). We not only provide the first convergence result for this method, but the result also shows that the method converges at the same rate as vanilla SGD.

1.1 Related Work

There are several ways to reduce the communication in SGD, for instance by simply increasing the amount of computation before communication, i.e. by using large mini-batches (see e.g. [12, 43]), or by designing communication-efficient schemes [45]. These approaches are largely orthogonal to the methods we consider in this paper, which focus on quantization or sparsification of the gradient.
Several papers consider approaches that limit the number of bits used to represent floating point numbers [13, 24, 31]. Recent work proposes adaptive tuning of the compression ratio [7]. Unbiased quantization operators not only limit the number of bits, but quantize the stochastic gradients in such a way that they are still unbiased estimators of the gradient [3, 41]. The ZipML framework also applies this technique to the data [44]. Sparsification methods reduce the number of non-zero entries in the stochastic gradient [3, 40].
A very aggressive sparsification method is to keep only very few coordinates of the stochastic gradient by considering only the coordinates with the largest magnitudes [1, 9]. 
In contrast to the unbiased schemes it is clear that such methods can only work by using some kind of error accumulation or feedback procedure, similar to the one we have already discussed [33, 37], as otherwise certain coordinates could simply never be updated. However, in certain applications no feedback mechanism is needed [38]. More elaborate sparsification schemes have also been introduced [21].
Asynchronous updates provide an alternative way to hide the communication overhead to a certain extent [19]. However, those methods usually rely on a sparsity assumption on the updates [25, 31], which is not realistic e.g. in deep learning. We would like to advocate that combining gradient sparsification with those asynchronous schemes seems to be a promising approach, as it combines the best of both worlds. Other scenarios that could profit from sparsification are heterogeneous systems or specialized hardware, e.g. accelerators [11, 44].
Convergence proofs for SGD [29] typically rely on averaging the iterates [23, 27, 30], though convergence of the last iterate can also be proven [34]. For our convergence proof we rely on averaging techniques that give more weight to more recent iterates [17, 28, 34], as well as the perturbed iterate framework from Mania et al. [22] and techniques from [18, 36].
Simultaneously to our work, [4, 39] at NeurIPS 2018 propose related schemes. Whilst Tang et al. [39] only consider unbiased stochastic compression schemes, Alistarh et al. [4] study biased top-k sparsification. Their scheme also uses a memory vector to compensate for the errors, but their analysis suffers from a slowdown proportional to k, which we can avoid here. Another simultaneous analysis of Wu et al. [42] at ICML 2018 is restricted to unbiased gradient compression. 
This scheme also critically relies on an error compensation technique, but in contrast to our work the analysis is restricted to quadratic functions and the scheme introduces two additional hyperparameters that control the feedback mechanism.

²See Definition 2.1.

1.2 Contributions

We consider finite-sum convex optimization problems f : R^d → R of the form

f(x) = (1/n) Σ_{i=1}^n f_i(x),  x* := argmin_{x∈R^d} f(x),  f* := f(x*),   (2)

where each f_i is L-smooth³ and f is µ-strongly convex⁴. We consider a sequential sparsified SGD algorithm with an error accumulation technique and prove convergence for k-compression operators, 0 < k ≤ d (for instance the sparsification operators top-k or random-k). For appropriately chosen stepsizes and an averaged iterate x̄_T after T steps we show convergence

E f(x̄_T) − f* = O(G²/(µT)) + O((d²/k²) G²κ/(µT²)) + O((d³/k³) G²/(µT³)),   (3)

for κ = L/µ and G² ≥ E‖∇f_i(x_t)‖². Not only is this, to the best of our knowledge, the first convergence result for sparsified SGD with memory, but the result also shows that the leading term O(G²/(µT)) in the convergence rate is the same term as in the convergence rate of vanilla SGD.
We introduce the method formally in Section 2 and show a sketch of the convergence proof in Section 3. In Section 4 we include a few numerical experiments for illustrative purposes. The experiments highlight that top-k sparsification yields a very effective compression method and does not hurt convergence. 
We also report results for a parallel multi-core implementation of SGD with memory, which show that the algorithm scales as well as asynchronous SGD and drastically decreases the communication cost without sacrificing the rate of convergence. We like to stress that the effectiveness of SGD variants with sparsification techniques has already been demonstrated in practice [1, 9, 21, 33, 37].
Although we do not yet provide convergence guarantees for parallel and asynchronous variants of the scheme, this is the main application of this method. For instance, we like to highlight that asynchronous SGD schemes [2, 25] could profit from gradient sparsification. To demonstrate this use-case, we include in Section 4 a set of experiments for a multi-core implementation.

2 SGD with Memory

In this section we present the sparsified SGD algorithm with memory. First we introduce the sparsification and quantization operators which allow us to drastically reduce the communication cost in comparison with vanilla SGD.

2.1 Compression and Sparsification Operators

We consider compression operators that satisfy the following contraction property:
Definition 2.1 (k-contraction). For a parameter 0 < k ≤ d, a k-contraction operator is a (possibly randomized) operator comp : R^d → R^d that satisfies the contraction property

E‖x − comp(x)‖² ≤ (1 − k/d) ‖x‖²,  ∀x ∈ R^d.   (4)

The contraction property is sufficient to obtain all mathematical results derived in this paper. However, note that (4) does not imply that comp(x) is necessarily a sparse vector; dense vectors can satisfy (4) as well. 
One of the main goals of this work is to derive communication efficient schemes, thus we are particularly interested in operators that also ensure that comp(x) can be encoded much more efficiently than the original x.
The following two operators are examples of k-contraction operators whose output additionally is a k-sparse vector:

³f_i(y) ≤ f_i(x) + ⟨∇f_i(x), y − x⟩ + (L/2)‖y − x‖², ∀x, y ∈ R^d, i ∈ [n].
⁴f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (µ/2)‖y − x‖², ∀x, y ∈ R^d.

Definition 2.2. For a parameter 1 ≤ k ≤ d, the operators top_k : R^d → R^d and rand_k : R^d × Ω_k → R^d, where Ω_k = ([d] choose k) denotes the set of all k-element subsets of [d], are defined for x ∈ R^d as

(top_k(x))_{π(i)} := (x)_{π(i)} if i ≤ k, and 0 otherwise;  (rand_k(x, ω))_i := (x)_i if i ∈ ω, and 0 otherwise,   (5)

where π is a permutation of [d] such that (|x|)_{π(i)} ≥ (|x|)_{π(i+1)} for i = 1, …, d − 1. We abbreviate rand_k(x) whenever the second argument is chosen uniformly at random, ω ∼_{u.a.r.} Ω_k.
It is easy to see that both operators satisfy Definition 2.1 of being a k-contraction. For completeness the proof is included in Appendix A.1.
We note that our setting is more general than simply measuring sparsity in terms of the cardinality, i.e. the non-zero elements of vectors in R^d. Instead, Definition 2.1 can also be considered for quantization or e.g. floating point representation of each entry of the vector. In this setting we would for instance measure sparsity in terms of the number of bits that are needed to encode the vector. 
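The two operators of Definition 2.2 translate directly into numpy. The sketch below (our own illustration, not the authors' published code) also checks the contraction property (4) empirically: for top_k the bound holds deterministically, for rand_k it is attained exactly in expectation.

```python
import numpy as np

def top_k(x, k):
    """Keep the k entries of largest magnitude, zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def rand_k(x, k, rng):
    """Keep k coordinates chosen uniformly at random."""
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    out[idx] = x[idx]
    return out

# Empirical check of (4): E||x - comp(x)||^2 <= (1 - k/d) ||x||^2.
rng = np.random.default_rng(0)
d, k = 20, 5
x = rng.normal(size=d)
bound = (1 - k / d) * float(x @ x)
top_err = float(np.sum((x - top_k(x, k)) ** 2))                   # deterministic
rand_err = float(np.mean([np.sum((x - rand_k(x, k, rng)) ** 2)
                          for _ in range(20000)]))                # Monte Carlo
```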
In this way, we can also use stochastic rounding operators (similar to the ones used in [3], but with different scaling) as compression operators according to (4). Also gradient dropping [1] trivially satisfies (4), though with a different parameter k in each iteration.
Remark 2.3 (Ultra-sparsification). We like to highlight that many other operators satisfy Definition 2.1, not only the two examples given in Definition 2.2. A notable variant is to pick a random coordinate of a vector with probability k/d, for 0 < k ≤ 1; property (4) then holds even if k < 1. I.e. it suffices to transmit on average less than one coordinate per iteration (this would then correspond to a mini-batch update).

2.2 Variance Blow-up for Unbiased Updates

Before introducing SGD with memory we first discuss a motivating example. Consider the following variant of SGD, where (d − k) random coordinates of the stochastic gradient are dropped:

g_t := (d/k) · rand_k(∇f_i(x_t)),  x_{t+1} := x_t − η_t g_t,   (6)

where i ∼_{u.a.r.} [n]. It is important to note that the update is unbiased, i.e. E g_t = ∇f(x_t). For carefully chosen stepsizes η_t this algorithm converges at rate O(σ²/(µT)) on strongly convex and smooth functions f, where σ² is an upper bound on the variance, see for instance [46]. We have

σ² = E‖(d/k) rand_k(∇f_i(x)) − ∇f(x)‖² ≤ E‖(d/k) rand_k(∇f_i(x))‖² ≤ (d/k) E_i‖∇f_i(x)‖² ≤ (d/k) G²,

where we used the variance decomposition E‖X − E X‖² = E‖X‖² − ‖E X‖² and the standard assumption E_i‖∇f_i(x)‖² ≤ G². Hence, when k is small this algorithm requires up to a factor d/k more iterations to achieve the same error guarantee as vanilla SGD with k = d.
It is well known that by using mini-batches the variance of the gradient estimator can be reduced. 
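The d/k variance blow-up of estimator (6) is easy to observe numerically. A small sketch on a fixed toy vector (sizes and names are our own choices): the rescaled operator is unbiased, but its second moment grows by exactly d/k.

```python
import numpy as np

def scaled_rand_k(g, k, rng):
    """Unbiased sparsifier (d/k) * rand_k(g): E[output] = g."""
    d = g.size
    out = np.zeros_like(g)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = (d / k) * g[idx]
    return out

rng = np.random.default_rng(0)
d, k = 50, 5
g = rng.normal(size=d)                 # a fixed "stochastic gradient"
samples = np.stack([scaled_rand_k(g, k, rng) for _ in range(20000)])
mean_est = samples.mean(axis=0)        # close to g: the estimator is unbiased
second_moment = float(np.mean(np.sum(samples ** 2, axis=1)))
# E||(d/k) rand_k(g)||^2 = (d/k) ||g||^2, i.e. a blow-up by d/k = 10 here
```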
If we consider in (6) the estimator g_t := (d/k) · rand_k((1/τ) Σ_{i∈I_τ} ∇f_i(x_t)) instead, for τ = ⌈d/k⌉ and I_τ ∼_{u.a.r.} ([n] choose τ), we have

σ² = E‖g_t − ∇f(x_t)‖² ≤ E‖(d/k) · rand_k((1/τ) Σ_{i∈I_τ} ∇f_i(x_t))‖² ≤ (d/(kτ)) E_i‖∇f_i(x_t)‖² ≤ G².   (7)

This shows that, when using mini-batches of appropriate size, the sparsification of the gradient does not hurt convergence. However, by increasing the mini-batch size, we increase the computation by a factor of d/k.
These two observations seem to indicate that the factor d/k is inevitably lost, either by an increased number of iterations or by increased computation. However, this is no longer true when the information in (6) is not dropped, but kept in memory. To illustrate this, assume k = 1 and that index i has not been selected by the rand_1 operator in iterations t = t_0, …, t_{s−1}, but is selected in iteration t_s. Then the memory m_{t_s} ∈ R^d contains this past information: (m_{t_s})_i = Σ_{t=t_0}^{t_s−1} (∇f_{i_t}(x_t))_i. Intuitively, we would expect that the variance of this estimator is now reduced by a factor of s compared to the naïve estimator in (6), similar to the mini-batch update in (7). Indeed, SGD with memory converges at the same rate as vanilla SGD, as we will demonstrate below.

Algorithm 1 MEM-SGD
1: Initialize variables x_0 and m_0 = 0
2: for t in 0 . . . 
T − 1 do
3:   Sample i_t uniformly in [n]
4:   g_t ← comp_k(m_t + η_t ∇f_{i_t}(x_t))
5:   x_{t+1} ← x_t − g_t
6:   m_{t+1} ← m_t + η_t ∇f_{i_t}(x_t) − g_t
7: end for

Algorithm 2 PARALLEL-MEM-SGD
1: Initialize shared variable x and m_0^w = 0, ∀w ∈ [W]
2: parallel for w in 1 . . . W do
3:   for t in 0 . . . T − 1 do
4:     Sample i_t^w uniformly in [n]
5:     g_t^w ← comp_k(m_t^w + η_t ∇f_{i_t^w}(x))
6:     x ← x − g_t^w    ▷ shared memory
7:     m_{t+1}^w ← m_t^w + η_t ∇f_{i_t^w}(x) − g_t^w
8:   end for
9: end parallel for

Figure 1: Left: The MEM-SGD algorithm. Right: Implementation for multi-core experiments.

2.3 SGD with Memory: Algorithm and Convergence Results

We consider the following algorithm for parameter 0 < k ≤ d, using a compression operator comp_k : R^d → R^d which is a k-contraction (Definition 2.1):

x_{t+1} := x_t − g_t,  g_t := comp_k(m_t + η_t ∇f_{i_t}(x_t)),  m_{t+1} := m_t + η_t ∇f_{i_t}(x_t) − g_t,   (8)

where i_t ∼_{u.a.r.} [n], m_0 := 0 and {η_t}_{t≥0} denotes a sequence of stepsizes. The pseudocode is given in Algorithm 1. Note that the gradients get multiplied with the stepsize η_t at the timestep t when they are put into memory, and not when they are (partially) retrieved from the memory.
We state the precise convergence result for Algorithm 1 in Theorem 2.4 below. In Remark 2.6 we give a simplified statement in big-O notation for a specific choice of the stepsizes η_t.
Theorem 2.4. Let f_i be L-smooth, f be µ-strongly convex, 0 < k ≤ d, E_i‖∇f_i(x_t)‖² ≤ G² for t = 0, …, T − 1, where {x_t}_{t≥0} are generated according to (8) for stepsizes η_t = 8/(µ(a + t)) and shift parameter a > 1. 
Then for α > 4 such that ((α + 1)(d/k) + ρ)/(ρ + 1) ≤ a, with ρ := 4α/((α − 4)(α + 1)²), it holds

E f(x̄_T) − f* ≤ (µa³/(8 S_T)) ‖x_0 − x*‖² + (4T(T + 2a)/(µ S_T)) G² + (64T/(µ S_T)) (1 + 2L/µ) (4α/(α − 4)) (d²/k²) G²,   (9)

where x̄_T = (1/S_T) Σ_{t=0}^{T−1} w_t x_t, for w_t = (a + t)², and S_T = Σ_{t=0}^{T−1} w_t ≥ T³/3.

Remark 2.5 (Choice of the shift a). Theorem 2.4 says that for any shift a > 1 there is a parameter α(a) > 4 such that (9) holds. However, for the choice a = O(1) one has to set α such that α/(α − 4) = Ω(d/k), and the last term in (9) will be of order O(d³/(k³T²)), thus requiring T = Ω(d^{1.5}/k^{1.5}) steps to yield convergence. For α ≥ 5 we have α/(α − 4) = O(1) and the last term is only of order O(d²/(k²T²)) instead. However, this typically requires a large shift. Observe ((α + 1)(d/k) + ρ)/(ρ + 1) ≤ 1 + (α + 1)(d/k) ≤ (α + 2)(d/k), i.e. setting a = (α + 2)(d/k) is enough. We like to stress that in general it is not advisable to set a ≫ (α + 2)(d/k), as the first two terms in (9) depend on a. In practice, it often suffices to set a = d/k, as we will discuss in Section 4.

Remark 2.6. As discussed in Remark 2.5 above, setting α = 5 and a = (α + 2)(d/k) is feasible. With this choice, equation (9) simplifies to

E f(x̄_T) − f* ≤ O(G²/(µT)) + O((d²/k²) G²κ/(µT²)) + O((d³/k³) G²/(µT³)),   (10)

for κ = L/µ. To estimate the second term in (9) we used the property E µ‖x_0 − x*‖ ≤ 2G for µ-strongly convex f, as derived in [28, Lemma 2]. We observe that for large T the first term, O(G²/(µT)), dominates the rate. This is the same term as in the convergence rate of vanilla SGD [17].

3 Proof Outline

We now give an outline of the proof. The proofs of the lemmas are given in Appendix A.2.

Perturbed iterate analysis. Inspired by the perturbed iterate framework in [22] and [18] we first define a virtual sequence {x̃_t}_{t≥0} in the following way:

x̃_0 = x_0,  x̃_{t+1} = x̃_t − η_t ∇f_{i_t}(x_t),   (11)

where the sequences {x_t}_{t≥0}, {η_t}_{t≥0} and {i_t}_{t≥0} are the same as in (8). Notice that

x_t − x̃_t = (x_0 − Σ_{j=0}^{t−1} g_j) − (x_0 − Σ_{j=0}^{t−1} η_j ∇f_{i_j}(x_j)) = m_t.   (12)

Lemma 3.1. Let {x_t}_{t≥0} and {x̃_t}_{t≥0} be defined as in (8) and (11) and let f_i be L-smooth and f be µ-strongly convex with E_i‖∇f_i(x_t)‖² ≤ G². Then

E‖x̃_{t+1} − x*‖² ≤ (1 − µη_t/2) E‖x̃_t − x*‖² + η_t² G² − η_t e_t + η_t (µ + 2L) E‖m_t‖²,   (13)

where e_t := E f(x_t) − f*.

Bounding the memory. From equation (13) it becomes clear that we should derive an upper bound on E‖m_t‖². For this we will use the contraction property (4) of the compression operators.

Lemma 3.2. Let {x_t}_{t≥0} be as defined in (8) for 0 < k ≤ d, E_i‖∇f_i(x_t)‖² ≤ G² and stepsizes η_t = 8/(µ(a + t)) with a, α > 4 as in Theorem 2.4. Then

E‖m_t‖² ≤ η_t² (4α/(α − 4)) (d²/k²) G².   (14)

Optimal averaging. 
Similarly to the discussion in [17, 28, 34], we have to define a suitable averaging scheme for the iterates {x_t}_{t≥0} to get the optimal convergence rate. In contrast to [17], which uses linearly increasing weights, we use quadratically increasing weights, as for instance [34, 36].

Lemma 3.3. Let {a_t}_{t≥0}, a_t ≥ 0, and {e_t}_{t≥0}, e_t ≥ 0, be sequences satisfying

a_{t+1} ≤ (1 − µη_t/2) a_t + η_t² A + η_t³ B − η_t e_t,   (15)

for η_t = 8/(µ(a + t)) and constants A, B ≥ 0, µ > 0, a > 1. Then

(1/S_T) Σ_{t=0}^{T−1} w_t e_t ≤ (µa³/(8 S_T)) a_0 + (4T(T + 2a)/(µ S_T)) A + (64T/(µ² S_T)) B,   (16)

for w_t = (a + t)² and S_T := Σ_{t=0}^{T−1} w_t = (T/6)(2T² + 6aT − 3T + 6a² − 6a + 1) ≥ T³/3.

Proof of Theorem 2.4. The proof of the theorem immediately follows from the three lemmas presented in this section and the convexity of f: we have E f(x̄_T) − f* ≤ (1/S_T) Σ_{t=0}^{T−1} w_t e_t in (16), for constants A = G² and B = (µ + 2L) (4α/(α − 4)) (d²/k²) G². □

4 Experiments

We present numerical experiments to illustrate the excellent convergence properties and communication efficiency of MEM-SGD. As the usefulness of SGD with sparsification techniques has already been shown in practical applications [1, 9, 21, 33, 37], we focus here on a few particular aspects. First, we verify the impact of the initial learning rate that came up in the statement of Theorem 2.4. We then compare our method with QSGD [3], which decreases the communication cost in SGD by using random quantization operators, but without memory. 
Finally, we show the performance of the\nparallel SGD depicted in Algorithm 2 in a multi-core setting with shared memory and compare the\nspeed-up to asynchronous SGD.\n\n4.1 Experimental Setup\n\nModels. The experiments focus on the performance of MEM-SGD applied to logistic regression.\n2(cid:107)x(cid:107)2, where ai \u2208 Rd and\nThe associated objective function is 1\nn\nbi \u2208 {\u22121, +1} are the data samples, and we employ a standard L2-regularizer. The regularization\nparameter is set to \u03bb = 1/n for both datasets following [32].\n\n(cid:80)n\ni=1 log(1 + exp(\u2212bia(cid:62)i x)) + \u03bb\n\n6\n\n\fepsilon\nRCV1-test\n\nn\n\n400\u2019000\n677\u2019399\n\nd\n\n2\u2019000\n47\u2019236\n\ndensity\n100%\n0.15%\n\nepsilon\n\nRCV1-test\n\nparameter\n\n\u03b3\na\n\u03b3\na\n\nvalue\n\n2\nd/k\n2\n\n10d/k\n\nTable 1: Datasets statistics.\n\nTable 2: Learning rate \u03b7t = \u03b3/(\u03bb(t + a)).\n\nDatasets. We consider a dense dataset, epsilon [35], as well as a sparse dataset, RCV1 [20] where\nwe train on the larger test set. Statistics on the datasets are listed in Table 1.\n\nImplementation. We use Python3 and the numpy library [15]. Our code is open-source and\npublicly available at github.com/epfml/sparsifiedSGD. We emphasize that our high level\nimplementation is not optimized for speed per iteration but for readability and simplicity. We only\nreport convergence per iteration and relative speedups, but not wall-clock time because unequal efforts\nhave been made to speed up the different implementations. Plots additionally show the baseline\ncomputed with the standard optimizer LogisticSGD of scikit-learn [26]. Experiments were run on\nan Ubuntu 18.04 machine with a 24 cores processor Intel\u00ae Xeon\u00ae CPU E5-2680 v3 @ 2.50GHz.\n\n4.2 Verifying the Theory\n\nWe study the convergence of the method using the stepsizes \u03b7t = \u03b3/(\u03bb(t+a)) and hyperparameters \u03b3\nand a set as in Table 2. 
We compute the final estimate x̄ as a weighted average of all iterates x_t with weights w_t = (t + a)², as indicated by Theorem 2.4. The results are depicted in Figure 2. We use k ∈ {1, 2, 3} for epsilon and k ∈ {10, 20, 30} for RCV1, which has a much larger number of features. The top_k variant consistently outperforms rand_k and sometimes outperforms vanilla SGD, which is surprising and might come from feature characteristics of the datasets. We also evaluate the impact of the delay a in the learning rate: setting it to 1 instead of order O(d/k) dramatically hurts the memory and requires time to recover from the high initial learning rate (labeled "without delay" in Figure 2).
We experimentally verified the convergence properties of MEM-SGD for different sparsification operators and stepsizes, but we want to further evaluate its fundamental benefits in terms of sparsity enforcement and reduction of the communication bottleneck. The gain in communication cost of SGD with memory is very high for dense datasets: using the top_1 strategy on the epsilon dataset reduces the amount of communication by a factor of 10³ compared to SGD. For the sparse dataset, SGD can readily use the given sparsity of the gradients. Nevertheless, the improvement for top_10 on RCV1 is of approximately an order of magnitude.

Figure 2: Convergence of MEM-SGD using different sparsification operators compared to full SGD with theoretical learning rates (parameters in Table 2).

Figure 3: MEM-SGD and QSGD convergence comparison. Top row: convergence in number of iterations. Bottom row: cumulated size of the communicated gradients during training. 
We compute the loss 10 times per epoch and remove the point at 0 MB for clarity.

4.3 Comparison with QSGD

Now we compare MEM-SGD with the QSGD compression scheme [3], which reduces communication cost by random quantization. The accuracy (and the compression ratio) in QSGD is controlled by a parameter s, corresponding to the number of quantization levels. Ideally, we would like to set the quantization precision in QSGD such that the numbers of bits transmitted by QSGD and MEM-SGD are identical and compare their convergence properties. However, even for the lowest precision, QSGD needs to send the sign and index of O(√d) coordinates. It is therefore not possible to reach the compression level of sparsification operators such as top-k or random-k, which only transmit a constant number of bits per iteration (up to logarithmic factors).⁵ Hence, we did not enforce this condition and resorted to picking reasonable levels of quantization in QSGD (s = 2^b with b ∈ {2, 4, 8}). Note that b bits stands for the number of bits used to encode s = 2^b levels, but the number of bits transmitted in QSGD can be reduced using Elias coding. For a fair comparison in practice, we chose a standard learning rate η_t = γ_0 (1 + γ_0 λ t)^{−1} [6] and tuned the hyperparameter γ_0 on a subset of each dataset (see Appendix B). Figure 3 shows that MEM-SGD with top_1 on epsilon and RCV1 converges as fast as QSGD in terms of iterations for 8 and 4 bits. As shown in the bottom of Figure 3, we transmit two orders of magnitude fewer bits with the top_1 sparsifier, concluding that sparsification offers a much more aggressive and performant strategy than quantization.

4.4 Multicore experiment

We implement a parallelized version of MEM-SGD, as depicted in Algorithm 2. The enforced sparsity allows us to do the update in shared memory using a lock-free mechanism as in [25]. 
For this experiment we evaluate the final iterate x_T instead of the weighted average x̄_T above, and use the learning rate η_t ≡ (1 + t)^{−1}.
Figure 4 shows the speed-up obtained when increasing the number of cores. We see that both sparsified SGD and vanilla SGD have a linear speed-up; the slopes depend on the implementation details. But we observe that PARALLEL-MEM-SGD with a reasonable sparsification parameter k does not suffer from having multiple independent memories. The experiment is run on a single machine with a 24 core processor, hence no inter-node communication is used. The main advantage of our method, overcoming the communication bottleneck, would be even more visible in a multi-node setup. In this asynchronous setup, SGD with memory computes gradients on stale iterates that differ only by a few coordinates. It encounters fewer inconsistent read/write operations than lock-free asynchronous SGD and exhibits better scaling properties on the RCV1 dataset. 

⁵Encoding the indices of the top-k or random-k elements can be done with additional O(k log d) bits. Note that log d ≤ 32 ≤ √d for both our examples.

Figure 4: Multicore wall-clock time speed-up comparison between MEM-SGD and lock-free SGD. The colored area depicts the best and worst results of 3 independent runs for each dataset.
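The lock-free update pattern of Algorithm 2 can be sketched with Python threads. This is a schematic illustration only (our own code and parameter choices; CPython's GIL prevents real speed-up here, so the sketch demonstrates the independent per-worker memories and lock-free writes, not the scaling):

```python
import numpy as np
import threading

def worker(x, A, b, k, T, seed):
    """One PARALLEL-MEM-SGD worker: own memory, lock-free writes to shared x."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    m = np.zeros(d)
    for t in range(T):
        i = rng.integers(n)
        x_stale = x.copy()                 # possibly stale read of shared iterate
        margin = b[i] * (A[i] @ x_stale)
        grad = -b[i] * A[i] / (1.0 + np.exp(margin))   # logistic loss gradient
        v = m + grad / (1.0 + t)           # stepsize eta_t = (1 + t)^{-1}
        idx = np.argsort(np.abs(v))[-k:]   # top-k compression
        x[idx] -= v[idx]                   # lock-free sparse update of shared x
        m = v.copy()
        m[idx] = 0.0                       # suppressed part stays in local memory

rng = np.random.default_rng(0)
n, d, W = 400, 10, 4
x_true = rng.normal(size=d)
A = rng.normal(size=(n, d))
b = np.sign(A @ x_true)                    # separable toy classification data
x = np.zeros(d)                            # shared parameter vector
threads = [threading.Thread(target=worker, args=(x, A, b, 2, 3000, s))
           for s in range(W)]
for th in threads:
    th.start()
for th in threads:
    th.join()
```

Each worker keeps its own memory vector while all workers write sparse updates to the same shared x, mirroring the structure of Algorithm 2.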
The topk operator performs better than randk in the sequential setup, but this is not the case in the parallel setup.

5 Conclusion

We provide the first concise convergence analysis of sparsified SGD [1, 9, 33, 37]. This extremely communication-efficient variant of SGD enforces sparsity of the applied updates by only updating a constant number of coordinates in every iteration. This way, the method overcomes the communication bottleneck of SGD, while still enjoying the same convergence rate in terms of stochastic gradient computations.

Our experiments verify the drastic reduction in communication cost by demonstrating that MEM-SGD requires one to two orders of magnitude fewer bits to be communicated than QSGD [3] while converging to the same accuracy. The experiments show an advantage for top-k sparsification over random sparsification in the serial setting, but not in the multi-core shared memory implementation. There, both schemes are on par and show better scaling than a simple shared memory implementation that just writes the unquantized updates in a lock-free asynchronous fashion (like Hogwild! [25]).

The theoretical insights into MEM-SGD developed here should facilitate the analysis of the same scheme in the parallel (as developed in [8]) and the distributed setting. It has already been shown in practice that gradient sparsification can be efficiently applied to bandwidth- and memory-limited systems such as multi-GPU training for neural networks [1, 9, 21, 33, 37]. By delivering sparse updates regardless of whether the original gradients were sparse, our scheme is not only communication-efficient, but also better suited for asynchronous implementations. While those were so far limited by strict sparsity assumptions (as e.g.
in [25]), our approach might make such methods much more widely applicable.

Acknowledgments

We would like to thank Dan Alistarh for insightful discussions in the early stages of this project and Frederik Künstner for his useful comments on the various drafts of this manuscript. We acknowledge funding from SNSF grant 200021_175796, Microsoft Research JRC project ‘Coltrain’, as well as a Google Focused Research Award.

References

[1] Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 440–445. Association for Computational Linguistics, 2017.

[2] Dan Alistarh, Christopher De Sa, and Nikola Konstantinov. The convergence of stochastic gradient descent in asynchronous shared memory. In Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing, PODC '18, pages 169–178, New York, NY, USA, 2018. ACM.

[3] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, NIPS - Advances in Neural Information Processing Systems 30, pages 1709–1720. Curran Associates, Inc., 2017.

[4] Dan Alistarh, Torsten Hoefler, Mikael Johansson, Sarit Khirirat, Nikola Konstantinov, and Cédric Renggli. The convergence of sparsified gradient methods. In NeurIPS 2018, to appear, and CoRR abs/1809.10505, 2018.

[5] Léon Bottou. Large-scale machine learning with stochastic gradient descent.
In Yves Lechevallier and Gilbert Saporta, editors, Proceedings of COMPSTAT'2010, pages 177–186, Heidelberg, 2010. Physica-Verlag HD.

[6] Léon Bottou. Stochastic Gradient Descent Tricks, volume 7700, pages 430–445. Springer, January 2012.

[7] Chia-Yu Chen, Jungwook Choi, Daniel Brand, Ankur Agrawal, Wei Zhang, and Kailash Gopalakrishnan. AdaComp: Adaptive residual gradient compression for data-parallel distributed training. In Sheila A. McIlraith and Kilian Q. Weinberger, editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018. AAAI Press, 2018.

[8] Jean-Baptiste Cordonnier. Convex optimization using sparsified stochastic gradient descent with memory. Master's thesis, EPFL, Lausanne, Switzerland, 2018.

[9] Nikoli Dryden, Sam Ade Jacobs, Tim Moon, and Brian Van Essen. Communication quantization for data-parallel training of deep neural networks. In Proceedings of the Workshop on Machine Learning in High Performance Computing Environments, MLHPC '16, pages 1–8, Piscataway, NJ, USA, 2016. IEEE Press.

[10] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121–2159, August 2011.

[11] Celestine Dünner, Thomas Parnell, and Martin Jaggi. Efficient use of limited-memory accelerators for linear learning on heterogeneous systems. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, NIPS - Advances in Neural Information Processing Systems 30, pages 4258–4267. Curran Associates, Inc., 2017.

[12] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He.
Accurate, large minibatch SGD: training ImageNet in 1 hour. CoRR, abs/1706.02677, 2017.

[13] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, pages 1737–1746. JMLR.org, 2015.

[14] Cho-Jui Hsieh, Hsiang-Fu Yu, and Inderjit Dhillon. PASSCoDe: Parallel asynchronous stochastic dual co-ordinate descent. In International Conference on Machine Learning, pages 2370–2379, 2015.

[15] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001–.

[16] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[17] Simon Lacoste-Julien, Mark W. Schmidt, and Francis R. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. CoRR, abs/1212.2002, 2012.

[18] Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. ASAGA: Asynchronous parallel SAGA. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 46–54, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR.

[19] Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. Improved asynchronous parallel optimization analysis for stochastic incremental methods. CoRR, abs/1801.03749, January 2018.

[20] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.

[21] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and Bill Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training.
In ICLR 2018 - International Conference on Learning Representations, 2018.

[22] Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, and Michael I. Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. SIAM Journal on Optimization, 27(4):2202–2229, 2017.

[23] Eric Moulines and Francis R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, NIPS - Advances in Neural Information Processing Systems 24, pages 451–459. Curran Associates, Inc., 2011.

[24] Taesik Na, Jong Hwan Ko, Jaeha Kung, and Saibal Mukhopadhyay. On-chip training of recurrent neural networks with limited numerical precision. 2017 International Joint Conference on Neural Networks (IJCNN), pages 3716–3723, 2017.

[25] Feng Niu, Benjamin Recht, Christopher Re, and Stephen J. Wright. HOGWILD!: A lock-free approach to parallelizing stochastic gradient descent. In NIPS - Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS'11, pages 693–701, USA, 2011. Curran Associates Inc.

[26] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.

[27] Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

[28] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on International Conference on Machine Learning, ICML'12, pages 1571–1578, USA, 2012.
Omnipress.

[29] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, September 1951.

[30] David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988.

[31] Christopher De Sa, Ce Zhang, Kunle Olukotun, and Christopher Ré. Taming the wild: A unified analysis of HOGWILD!-style algorithms. In NIPS - Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS'15, pages 2674–2682, Cambridge, MA, USA, 2015. MIT Press.

[32] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Math. Program., 162(1-2):83–112, March 2017.

[33] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Haizhou Li, Helen M. Meng, Bin Ma, Engsiong Chng, and Lei Xie, editors, INTERSPEECH, pages 1058–1062. ISCA, 2014.

[34] Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 71–79, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.

[35] Sören Sonnenburg, Vojtěch Franc, E. Yom-Tov, and M. Sebag. PASCAL large scale learning challenge. 10:1937–1953, 2008.

[36] Sebastian U. Stich. Local SGD converges fast and communicates little. CoRR, abs/1805.09767, May 2018.

[37] Nikko Strom. Scalable distributed DNN training using commodity GPU cloud computing. In INTERSPEECH, pages 1488–1492.
ISCA, 2015.

[38] Xu Sun, Xuancheng Ren, Shuming Ma, and Houfeng Wang. meProp: Sparsified back propagation for accelerated deep learning with reduced overfitting. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3299–3308, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[39] Hanlin Tang, Shaoduo Gan, Ce Zhang, and Ji Liu. Communication compression for decentralized training. In NeurIPS 2018, to appear, and CoRR abs/1803.06443, 2018.

[40] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. In NeurIPS 2018, to appear, and CoRR abs/1710.09854, 2018.

[41] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, NIPS - Advances in Neural Information Processing Systems 30, pages 1509–1519. Curran Associates, Inc., 2017.

[42] Jiaxiang Wu, Weidong Huang, Junzhou Huang, and Tong Zhang. Error compensated quantized SGD and its applications to large-scale distributed optimization. In ICML 2018 - Proceedings of the 35th International Conference on Machine Learning, pages 5321–5329, July 2018.

[43] Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32K for ImageNet training. CoRR, abs/1708.03888, 2017.

[44] Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning.
In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 4035–4043, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[45] Yuchen Zhang, Martin J. Wainwright, and John C. Duchi. Communication-efficient algorithms for statistical optimization. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, NIPS - Advances in Neural Information Processing Systems 25, pages 1502–1510. Curran Associates, Inc., 2012.

[46] Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1–9, Lille, France, 07–09 Jul 2015. PMLR.