{"title": "Communication-Efficient Distributed Blockwise Momentum SGD with Error-Feedback", "book": "Advances in Neural Information Processing Systems", "page_first": 11450, "page_last": 11460, "abstract": "Communication overhead is a major bottleneck hampering the scalability of distributed machine learning systems. Recently, there has been a surge of interest in \nusing gradient compression to improve the communication efficiency of distributed neural network training. Using 1-bit quantization, signSGD with majority vote achieves a 32x reduction in communication cost. However, its convergence is based on unrealistic assumptions and can diverge in practice. In this paper, we propose a general distributed compressed SGD with Nesterov's momentum. We consider two-way compression, which compresses the gradients both to and from workers. Convergence analysis on nonconvex problems for general gradient compressors is provided. By partitioning the gradient into blocks, a blockwise compressor is introduced such that each gradient block is compressed and transmitted in 1-bit format with a scaling factor, leading to a nearly 32x reduction on communication. Experimental results show that the proposed method converges as fast as full-precision distributed momentum SGD and achieves the same testing accuracy. In particular, on distributed ResNet training with 7 workers on the ImageNet, the proposed algorithm achieves the same testing accuracy as momentum SGD using full-precision gradients, but with $46\\%$ less wall clock time.", "full_text": "Communication-Ef\ufb01cient Distributed Blockwise\n\nMomentum SGD with Error-Feedback\n\nShuai Zheng\u21e41,2, Ziyue Huang1, James T. 
Kwok1\n\nshzheng@amazon.com, {zhuangbq, jamesk}@cse.ust.hk\n\n1Department of Computer Science and Engineering, Hong Kong University of Science and Technology\n\n2Amazon Web Services\n\nAbstract\n\nCommunication overhead is a major bottleneck hampering the scalability of distributed machine learning systems. Recently, there has been a surge of interest in using gradient compression to improve the communication efficiency of distributed neural network training. Using 1-bit quantization, signSGD with majority vote achieves a 32x reduction in communication cost. However, its convergence is based on unrealistic assumptions and can diverge in practice. In this paper, we propose a general distributed compressed SGD with Nesterov's momentum. We consider two-way compression, which compresses the gradients both to and from workers. Convergence analysis on nonconvex problems for general gradient compressors is provided. By partitioning the gradient into blocks, a blockwise compressor is introduced such that each gradient block is compressed and transmitted in 1-bit format with a scaling factor, leading to a nearly 32x reduction in communication. Experimental results show that the proposed method converges as fast as full-precision distributed momentum SGD and achieves the same testing accuracy. In particular, on distributed ResNet training with 7 workers on ImageNet, the proposed algorithm achieves the same testing accuracy as momentum SGD using full-precision gradients, but with 46% less wall clock time.\n\n1 Introduction\n\nDeep neural networks have been highly successful in recent years [9, 10, 17, 22, 27]. To achieve state-of-the-art performance, they often have to leverage the computing power of multiple machines during training [8, 26, 28, 6]. Popular approaches include distributed synchronous SGD and its momentum variant SGDM, in which the computational load for evaluating a mini-batch gradient is distributed among the workers. 
Each worker performs local computation, and the local results are then merged by the server for the final update of the model parameters. However, scalability is limited by the possibly overwhelming cost of communicating the gradients and model parameters [12]. Let $d$ be the gradient/parameter dimensionality, and $M$ be the number of workers. Then $64Md$ bits need to be transferred between the workers and server in each iteration.\n\n\u21e4The work was done before Shuai Zheng joined Amazon Web Services.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nTo mitigate this communication bottleneck, the two common approaches are gradient sparsification and gradient quantization. Gradient sparsification only sends the most significant, information-preserving gradient entries. A heuristic algorithm was first introduced in [16], in which only the large entries are transmitted. On training a neural machine translation model with 4 GPUs, this greatly reduces the communication overhead and achieves a 22% speedup [1]. Deep gradient compression [13] is another heuristic method that combines gradient sparsification with other techniques such as momentum correction, local gradient clipping, and momentum factor masking, achieving a significant reduction in communication cost. Recently, a stochastic sparsification method was proposed in [23] that balances sparsity and variance by solving a constrained linear program. MEM-SGD [18] combines top-k sparsification with error correction. By keeping track of the accumulated errors, these can be added back to the gradient estimator before each transmission. 
MEM-SGD converges at the same rate as SGD on convex problems, whilst reducing the communication overhead by a factor equal to the problem dimensionality.\n\nOn the other hand, gradient quantization mitigates the communication bottleneck by lowering the gradient's floating-point precision with a smaller bit width. 1-bit SGD achieves state-of-the-art results on acoustic modeling while dramatically reducing the communication cost [16, 19]. TernGrad [24] quantizes the gradients to ternary levels $\\{-1, 0, 1\\}$. QSGD [2] employs stochastic randomized rounding to ensure unbiasedness of the estimator. Error-compensated quantized SGD (ECQ-SGD) was proposed in [25], wherein a stochastic quantization function similar to that of QSGD is employed, and an error bound is obtained for quadratic loss functions. Different from the error-feedback mechanism proposed in MEM-SGD, ECQ-SGD requires two more hyper-parameters and its quantization errors are decayed exponentially. Thus, error feedback is limited to a small number of iterations. Also, ECQ-SGD uses all-to-all broadcast (which may involve large network traffic and idle time), while we consider a parameter-server architecture. Recently, Bernstein et al. proposed signSGD with majority vote [3], which only transmits the 1-bit gradient sign between workers and server. A variant using momentum, called signum with majority vote, is also introduced, though without convergence analysis [4]. Using the majority vote, signSGD achieves a notion of Byzantine fault tolerance [4]. Moreover, it converges at the same rate as distributed SGD, though it has to rely on the unrealistic assumptions of having a large mini-batch and unimodal symmetric gradient noise. Indeed, signSGD can diverge in some simple cases when these assumptions are violated [11]. 
With only a single worker, this divergence issue can be fixed by using the error correction technique in MEM-SGD, leading to SGD with error-feedback (EF-SGD) [11].\n\nWhile only a single worker is considered in EF-SGD, we study in this paper the more interesting distributed setting. An extension of MEM-SGD and EF-SGD with parallel computing was proposed in [7] for all-to-all broadcast. Another related architecture is allreduce. Compression at the server can be implemented between the reduce and broadcast steps in tree allreduce, or between the reduce-scatter and allgather steps in ring allreduce. However, allreduce requires repeated gradient aggregations, and the compressed gradients need to be first decompressed before they are summed. Hence, heavy overheads may be incurred.\n\nIn this paper, we study the distributed setting with a parameter server architecture. To ensure efficient communication, we consider two-way gradient compression, in which gradients in both directions (server to/from workers) are compressed. Note that existing works (except signSGD/signum with majority vote [3, 4]) do not compress the aggregated gradients before sending them back to workers. Moreover, as gradients in a deep network typically have similar magnitudes in each layer, each layer-wise gradient can be sufficiently represented using a sign vector and its average $\\ell_1$-norm. This layer-wise (or blockwise in general) compressor achieves nearly 32x reduction in communication cost. The resultant procedure is called communication-efficient distributed SGD with error-feedback (dist-EF-SGD). Analogous to SGDM, we also propose a stochastic variant dist-EF-SGDM with Nesterov's momentum [14]. The convergence properties of dist-EF-SGD(M) are studied theoretically.\n\nOur contributions are: (i) We provide a bound on dist-EF-SGD with general stepsize schedule for a class of compressors (including the commonly used sign-operator and top-k sparsification). 
In particular, without relying on the unrealistic assumptions in [3, 4], we show that dist-EF-SGD with constant/decreasing/increasing stepsize converges at an $O(1/\\sqrt{MT})$ rate, which matches that of distributed synchronous SGD; (ii) We study gradient compression with Nesterov's momentum in a parameter server. For dist-EF-SGDM with constant stepsize, we obtain an $O(1/\\sqrt{MT})$ rate. To the best of our knowledge, these are the first convergence results on two-way gradient compression with Nesterov's momentum; (iii) We propose a general blockwise compressor and show its theoretical properties. Experimental results show that the proposed algorithms are efficient without losing prediction accuracy. After our paper appeared, we noted that a similar idea was independently proposed in [21]. Different from ours, they do not consider changing stepsize, blockwise compressor and Nesterov's momentum.\n\nNotations. For a vector $x$, $\\|x\\|_1$ and $\\|x\\|_2$ are its $\\ell_1$- and $\\ell_2$-norms, respectively. $\\mathrm{sign}(x)$ outputs a vector in which each element is the sign of the corresponding entry of $x$. For two vectors $x, y$, $\\langle x, y \\rangle$ denotes the dot product. For a function $f$, its gradient is $\\nabla f$.\n\n2 Related Work: SGD with Error-Feedback\n\nIn machine learning, one is often interested in minimizing the expected risk $F(x) = \\mathbb{E}_{\\xi}[f(x, \\xi)]$, which directly measures the generalization error [5]. Here, $x \\in \\mathbb{R}^d$ is the model parameter, $\\xi$ is drawn from some unknown distribution, and $f(x, \\xi)$ is the possibly nonconvex risk due to $x$. When the expectation is taken over a training set of size $n$, the expected risk reduces to the empirical risk.\n\nRecently, Karimireddy et al. [11] introduced SGD with error-feedback (EF-SGD), which combines gradient compression with error correction (Algorithm 1). A single machine is considered, which keeps the gradient difference that is not used for parameter update in the current iteration. 
In the next iteration $t$, the accumulated residual $e_t$ is added to the current gradient. The corrected gradient $p_t$ is then fed into a $\\delta$-approximate compressor.\n\nDefinition 1. [11] An operator $C : \\mathbb{R}^d \\to \\mathbb{R}^d$ is a $\\delta$-approximate compressor for $\\delta \\in (0, 1]$ if $\\|C(x) - x\\|_2^2 \\le (1 - \\delta)\\|x\\|_2^2$.\n\nExamples of $\\delta$-approximate compressors include the scaled sign operator $C(v) = \\|v\\|_1/d \\cdot \\mathrm{sign}(v)$ [11] and the top-k operator (which only preserves the $k$ coordinates with the largest absolute values) [18]. One can also have randomized compressors that only satisfy Definition 1 in expectation. Obviously, it is desirable to have a large $\\delta$ while achieving low communication cost.\n\nAlgorithm 1 SGD with Error-Feedback (EF-SGD) [11]\n1: Input: stepsize $\\eta$; compressor $C(\\cdot)$.\n2: Initialize: $x_0 \\in \\mathbb{R}^d$; $e_0 = 0 \\in \\mathbb{R}^d$\n3: for $t = 0, \\dots, T - 1$ do\n4:   $p_t = \\eta g_t + e_t$ {stochastic gradient $g_t = \\nabla f(x_t, \\xi_t)$}\n5:   $\\Delta_t = C(p_t)$ {compressed value output}\n6:   $x_{t+1} = x_t - \\Delta_t$\n7:   $e_{t+1} = p_t - \\Delta_t$\n8: end for\n\nEF-SGD achieves the same $O(1/\\sqrt{T})$ rate as SGD. To obtain this convergence guarantee, an important observation is that the error-corrected iterate $\\tilde{x}_t = x_t - e_t$ satisfies the recurrence $\\tilde{x}_{t+1} = \\tilde{x}_t - \\eta g_t$, which is similar to that of SGD. This allows utilizing the convergence proof of SGD to bound the gradient difference $\\|\\nabla F(\\tilde{x}_t) - \\nabla F(x_t)\\|_2$.\n\n3 Distributed Blockwise Momentum SGD with Error-Feedback\n\n3.1 Distributed SGD with Error-Feedback\n\nThe proposed procedure, which extends EF-SGD to the distributed setting, is shown in Algorithm 2. The computational workload is distributed over $M$ workers. A local accumulated error vector $e_{t,i}$ and a local corrected gradient vector $p_{t,i}$ are stored in the memory of worker $i$. At iteration $t$, worker $i$ pushes the compressed signal $\\Delta_{t,i} = C(p_{t,i})$ to the parameter server. On the server side, all workers' $\\Delta_{t,i}$'s are aggregated and used to update its global error-corrected vector $\\tilde{p}_t$. 
Before sending back the final update direction $\\tilde{p}_t$ to each worker, compression is performed to ensure a comparable amount of communication cost between the push and pull operations. Due to gradient compression on the server, we also employ a global accumulated error vector $\\tilde{e}_t$. Unlike EF-SGD in Algorithm 1, we do not multiply the gradient $g_{t,i}$ by the stepsize $\\eta_t$ before compression. The two cases make no difference when $\\eta_t$ is constant. However, when the stepsize is changing over time, this would affect convergence. We also rescale the local accumulated error $e_{t,i}$ by $\\eta_{t-1}/\\eta_t$. This modification, together with the use of error correction on both workers and server, allows us to obtain Lemma 1. Because of these differences, note that dist-EF-SGD does not reduce to EF-SGD when $M = 1$. When $C(\\cdot)$ is the identity mapping, dist-EF-SGD reduces to full-precision distributed SGD.\n\nAlgorithm 2 Distributed SGD with Error-Feedback (dist-EF-SGD)\n1: Input: stepsize sequence $\\{\\eta_t\\}$ with $\\eta_{-1} = 0$; number of workers $M$; compressor $C(\\cdot)$.\n2: Initialize: $x_0 \\in \\mathbb{R}^d$; $e_{0,i} = 0 \\in \\mathbb{R}^d$ on each worker $i$; $\\tilde{e}_0 = 0 \\in \\mathbb{R}^d$ on server\n3: for $t = 0, \\dots, T - 1$ do\n4:   on each worker $i$\n5:     $p_{t,i} = g_{t,i} + \\frac{\\eta_{t-1}}{\\eta_t} e_{t,i}$ {stochastic gradient $g_{t,i} = \\nabla f(x_t, \\xi_{t,i})$}\n6:     push $\\Delta_{t,i} = C(p_{t,i})$ to server\n7:     $x_{t+1} = x_t - \\eta_t \\tilde{\\Delta}_t$ {$\\tilde{\\Delta}_t$ is pulled from server}\n8:     $e_{t+1,i} = p_{t,i} - \\Delta_{t,i}$\n9:   on server\n10:     pull $\\Delta_{t,i}$ from each worker $i$ and $\\tilde{p}_t = \\frac{1}{M}\\sum_{i=1}^M \\Delta_{t,i} + \\frac{\\eta_{t-1}}{\\eta_t} \\tilde{e}_t$\n11:     push $\\tilde{\\Delta}_t = C(\\tilde{p}_t)$ to each worker\n12:     $\\tilde{e}_{t+1} = \\tilde{p}_t - \\tilde{\\Delta}_t$\n13: end for\n\nIn the following, we investigate the convergence of dist-EF-SGD. We make the following assumptions, which are common in the stochastic approximation literature.\n\nAssumption 1. $F$ is lower-bounded (i.e., $F_* = \\inf_{x \\in \\mathbb{R}^d} F(x) > -\\infty$) and $L$-smooth (i.e., $F(x) \\le F(y) + \\langle \\nabla F(y), x - y \\rangle + \\frac{L}{2}\\|x - y\\|_2^2$ for $x, y \\in \\mathbb{R}^d$).\n\nAssumption 2. The stochastic gradient $g_{t,i}$ has bounded variance: $\\mathbb{E}_t[\\|g_{t,i} - \\nabla F(x_t)\\|_2^2] \\le \\sigma^2$.\n\nAssumption 3. The full gradient $\\nabla F$ is uniformly bounded: $\\|\\nabla F(x_t)\\|_2^2 \\le \\omega^2$.\n\nThis implies the second moment is bounded, i.e., $\\mathbb{E}_t[\\|g_{t,i}\\|_2^2] \\le G^2 \\equiv \\sigma^2 + \\omega^2$.\n\nLemma 1. Consider the error-corrected iterate $\\tilde{x}_t = x_t - \\eta_{t-1}\\left(\\tilde{e}_t + \\frac{1}{M}\\sum_{i=1}^M e_{t,i}\\right)$, where $x_t$, $\\tilde{e}_t$, and $e_{t,i}$'s are generated from Algorithm 2. It satisfies the recurrence: $\\tilde{x}_{t+1} = \\tilde{x}_t - \\eta_t \\frac{1}{M}\\sum_{i=1}^M g_{t,i}$.\n\nThe above Lemma shows that $\\tilde{x}_t$ is very similar to the distributed SGD iterate except that the stochastic gradients are evaluated at $x_t$ instead of $\\tilde{x}_t$. This connection allows us to utilize the analysis of full-precision distributed SGD. In particular, we have the following Lemma.\n\nLemma 2. $\\mathbb{E}\\left[\\left\\|\\tilde{e}_t + \\frac{1}{M}\\sum_{i=1}^M e_{t,i}\\right\\|_2^2\\right] \\le \\frac{8(1 - \\delta)G^2}{\\delta^2}\\left[1 + \\frac{16}{\\delta^2}\\right]$ for any $t \\ge 0$.\n\nThis implies that $\\nabla F(\\tilde{x}_t) \\approx \\nabla F(x_t)$ by Assumption 1. Given the above results, we can prove convergence of the proposed method by utilizing tools used on the full-precision distributed SGD.\n\nTheorem 1. Suppose that Assumptions 1-3 hold. Assume that $0 < \\eta_t < 3/(2L)$ for all $t$. For the $\\{x_t\\}$ sequence generated from Algorithm 2, we have\n\n$$\\mathbb{E}[\\|\\nabla F(x_o)\\|_2^2] \\le \\frac{4}{\\sum_{k=0}^{T-1} \\eta_k(3 - 2L\\eta_k)}[F(x_0) - F_*] + \\frac{2L\\sigma^2}{M}\\sum_{t=0}^{T-1} \\frac{\\eta_t^2}{\\sum_{k=0}^{T-1} \\eta_k(3 - 2L\\eta_k)} + \\frac{32L^2(1 - \\delta)G^2}{\\delta^2}\\left[1 + \\frac{16}{\\delta^2}\\right]\\sum_{t=0}^{T-1} \\frac{\\eta_t \\eta_{t-1}^2}{\\sum_{k=0}^{T-1} \\eta_k(3 - 2L\\eta_k)},$$\n\nwhere $o \\in \\{0, \\dots, T - 1\\}$ is an index such that $P(o = k) = \\frac{\\eta_k(3 - 2L\\eta_k)}{\\sum_{t=0}^{T-1} \\eta_t(3 - 2L\\eta_t)}$, $\\forall k = 0, \\dots, T - 1$.\n\nThe first term on the RHS shows decay of the initial value. 
The second term is related to the variance, and the proposed algorithm enjoys variance reduction with more workers. The last term is due to gradient compression. A large $\\delta$ (less compression) makes this term smaller and thus gives faster convergence. Similar to the results in [11], our bound also holds for unbiased compressors (e.g., QSGD [2]) of the form $C(\\cdot) = cU(\\cdot)$, where $\\mathbb{E}[U(x)] = x$ and $\\mathbb{E}[\\|U(x)\\|_2^2] \\le \\frac{1}{c}\\|x\\|_2^2$ for some $0 < c < 1$. Then, $cU(\\cdot)$ is a $c$-approximate compressor in expectation.\n\nThe following Corollary shows that dist-EF-SGD has a convergence rate of $O(1/\\sqrt{MT})$, leading to an $O(1/(M\\epsilon^4))$ iteration complexity for satisfying $\\mathbb{E}[\\|\\nabla F(x_o)\\|_2^2] \\le \\epsilon^2$.\n\nAlgorithm 3 Distributed Blockwise SGD with Error-Feedback (dist-EF-blockSGD)\n1: Input: stepsize sequence $\\{\\eta_t\\}$ with $\\eta_{-1} = 0$; number of workers $M$; block partition $\\{\\mathcal{G}_1, \\dots, \\mathcal{G}_B\\}$.\n2: Initialize: $x_0 \\in \\mathbb{R}^d$; $e_{0,i} = 0 \\in \\mathbb{R}^d$ on each worker $i$; $\\tilde{e}_0 = 0 \\in \\mathbb{R}^d$ on server\n3: for $t = 0, \\dots, T - 1$ do\n4:   on each worker $i$\n5:     $p_{t,i} = g_{t,i} + \\frac{\\eta_{t-1}}{\\eta_t} e_{t,i}$ {stochastic gradient $g_{t,i} = \\nabla f(x_t, \\xi_{t,i})$}\n6:     push $\\Delta_{t,i} = \\left[\\frac{\\|p_{t,i,\\mathcal{G}_1}\\|_1}{d_1}\\mathrm{sign}(p_{t,i,\\mathcal{G}_1}), \\dots, \\frac{\\|p_{t,i,\\mathcal{G}_B}\\|_1}{d_B}\\mathrm{sign}(p_{t,i,\\mathcal{G}_B})\\right]$ to server\n7:     $x_{t+1} = x_t - \\eta_t \\tilde{\\Delta}_t$ {$\\tilde{\\Delta}_t$ is pulled from server}\n8:     $e_{t+1,i} = p_{t,i} - \\Delta_{t,i}$\n9:   on server\n10:     pull $\\Delta_{t,i}$ from each worker $i$ and $\\tilde{p}_t = \\frac{1}{M}\\sum_{i=1}^M \\Delta_{t,i} + \\frac{\\eta_{t-1}}{\\eta_t} \\tilde{e}_t$\n11:     push $\\tilde{\\Delta}_t = \\left[\\frac{\\|\\tilde{p}_{t,\\mathcal{G}_1}\\|_1}{d_1}\\mathrm{sign}(\\tilde{p}_{t,\\mathcal{G}_1}), \\dots, \\frac{\\|\\tilde{p}_{t,\\mathcal{G}_B}\\|_1}{d_B}\\mathrm{sign}(\\tilde{p}_{t,\\mathcal{G}_B})\\right]$ to each worker\n12:     $\\tilde{e}_{t+1} = \\tilde{p}_t - \\tilde{\\Delta}_t$\n13: end for\n\nCorollary 1. Let stepsize $\\eta = \\min\\left(\\frac{1}{2L}, \\frac{\\gamma}{\\sqrt{T}/\\sqrt{M} + (1 - \\delta)^{1/3}(1/\\delta^2 + 16/\\delta^4)^{1/3} T^{1/3}}\\right)$ for some $\\gamma > 0$. 
Then,\n\n$$\\mathbb{E}[\\|\\nabla F(x_o)\\|_2^2] \\le \\frac{4L}{T}[F(x_0) - F_*] + \\left[\\frac{2}{\\gamma}[F(x_0) - F_*] + L\\gamma\\sigma^2\\right]\\frac{1}{\\sqrt{MT}} + \\frac{2(1 - \\delta)^{1/3}\\left[\\frac{1}{\\gamma}[F(x_0) - F_*] + 8L^2\\gamma^2 G^2\\right]}{\\delta^{2/3} T^{2/3}}\\left[1 + \\frac{16}{\\delta^2}\\right]^{1/3}.$$\n\nIn comparison, under the same assumptions, distributed synchronous SGD achieves\n\n$$\\mathbb{E}[\\|\\nabla F(x_o)\\|_2^2] \\le \\frac{8L}{3T}[F(x_0) - F_*] + \\left[\\frac{2}{\\gamma}[F(x_0) - F_*] + L\\gamma\\sigma^2\\right]\\frac{2}{3\\sqrt{MT}}.$$\n\nThus, the convergence rate of dist-EF-SGD matches that of distributed synchronous SGD (with full-precision gradients) after $T \\ge O(1/\\delta^2)$ iterations, even though gradient compression is used. Moreover, more workers (larger $M$) leads to faster convergence. Note that the bound above does not reduce to that of EF-SGD when $M = 1$, as we have two-way compression. When $M = 1$, our bound also differs from Remark 4 in [11] in that our last term is $O((1 - \\delta)^{1/3}/(\\delta^{4/3} T^{2/3}))$, while theirs is $O((1 - \\delta)/(\\delta^2 T))$ (which is for a single machine with one-way compression). Ours is worse by a factor of $O(T^{1/3}\\delta^{2/3}/(1 - \\delta)^{2/3})$, which is the price to pay for two-way compression and a linear speedup of using $M$ workers. Moreover, unlike signSGD with majority vote [3], we achieve a convergence rate of $O(1/\\sqrt{MT})$ without assuming a large mini-batch size ($= T$) and unimodal symmetric gradient noise.\n\nTheorem 1 only requires $0 < \\eta_t < 3/(2L)$ for all $t$. This thus allows the use of any decreasing, increasing, or hybrid stepsize schedule. In particular, we have the following Corollary.\n\nCorollary 2. Let $\\eta_t = \\frac{\\gamma}{((t+1)T)^{1/4}/\\sqrt{M} + (1 - \\delta)^{1/3}(1/\\delta^2 + 16/\\delta^4)^{1/3} T^{1/3}}$ (decreasing stepsize) with $T \\ge 16L^4\\gamma^4 M^2$, or $\\eta_t = \\frac{\\gamma\\sqrt{t+1}}{T/\\sqrt{M} + (1 - \\delta)^{1/3}(1/\\delta^2 + 16/\\delta^4)^{1/3} T^{5/6}}$ (increasing stepsize) with $T \\ge 4L^2\\gamma^2 M$. Then, dist-EF-SGD converges to a stationary point at a rate of $O(1/\\sqrt{MT})$.\n\nTo the best of our knowledge, this is the first such result for distributed compressed SGD with decreasing/increasing stepsize on nonconvex problems. These two stepsize schedules can also be used together. 
For example, one can use an increasing stepsize at the beginning of training as warm-up, and then a decreasing stepsize afterwards.\n\n3.2 Blockwise Compressor\n\nA commonly used compressor is [11]:\n\n$$C(v) = \\|v\\|_1/d \\cdot \\mathrm{sign}(v). \\quad (1)$$\n\nCompared to using only the sign operator as in signSGD, the factor $\\|v\\|_1/d$ can preserve the gradient's magnitude. However, as shown in [11], its $\\delta$ in Definition 1 is $\\|v\\|_1^2/(d\\|v\\|_2^2)$, and can be particularly small when $v$ is sparse. When $\\delta$ is closer to 1, the bound in Corollary 1 becomes smaller and thus convergence is faster. In this section, we achieve this by proposing a blockwise extension of (1). Specifically, we partition the compressor input $v$ into $B$ blocks, where each block $b$ has $d_b$ elements indexed by $\\mathcal{G}_b$. Block $b$ is then compressed with scaling factor $\\|v_{\\mathcal{G}_b}\\|_1/d_b$ (where $v_{\\mathcal{G}_b}$ is the subvector of $v$ with elements in block $b$), leading to: $C_B(v) = [\\|v_{\\mathcal{G}_1}\\|_1/d_1 \\cdot \\mathrm{sign}(v_{\\mathcal{G}_1}), \\dots, \\|v_{\\mathcal{G}_B}\\|_1/d_B \\cdot \\mathrm{sign}(v_{\\mathcal{G}_B})]$. A similar compression scheme, with each layer being a block, is considered in the experiments of [11]. However, they provide no theoretical justification. The following Proposition first shows that $C_B(\\cdot)$ is also an approximate compressor.\n\nProposition 1. Let $[B] = \\{1, 2, \\dots, B\\}$. $C_B$ is a $\\phi(v)$-approximate compressor, where $\\phi(v) = \\min_{b \\in [B]} \\frac{\\|v_{\\mathcal{G}_b}\\|_1^2}{d_b\\|v_{\\mathcal{G}_b}\\|_2^2} \\ge \\min_{b \\in [B]} \\frac{1}{d_b}$.\n\nThe resultant algorithm will be called dist-EF-blockSGD (Algorithm 3) in the sequel. As can be seen, this is a special case of Algorithm 2. By replacing $\\delta$ with $\\phi(v)$ in Proposition 1, the convergence results of dist-EF-SGD in Section 3.1 can be directly applied.\n\nThere are many ways to partition the gradient into blocks. In practice, one can simply consider each parameter tensor/matrix/vector in the deep network as a block. 
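As a rough illustration of the blockwise compressor just described, the following sketch applies the scaled sign operator block by block and compares the whole-vector ratio (squared l1-norm over d times squared l2-norm) with the minimum per-block ratio of Proposition 1. It is a minimal pure-Python sketch, not the authors' implementation: the function names, the toy vector and the two-block partition are all assumptions made for this example.

```python
def scaled_sign(block):
    # Scaled sign operator on one block: (||v||_1 / d) * sign(v).
    d = len(block)
    scale = sum(abs(x) for x in block) / d
    return [scale * ((x > 0) - (x < 0)) for x in block]


def blockwise_compress(v, blocks):
    # blocks: list of index lists G_1, ..., G_B that partition range(len(v)).
    out = [0.0] * len(v)
    for idx in blocks:
        compressed = scaled_sign([v[i] for i in idx])
        for i, c in zip(idx, compressed):
            out[i] = c
    return out


def l1_l2_ratio(block):
    # ||v||_1^2 / (d * ||v||_2^2); assumes the block is not all zeros.
    l1 = sum(abs(x) for x in block)
    l2_sq = sum(x * x for x in block)
    return l1 * l1 / (len(block) * l2_sq)


def min_block_ratio(v, blocks):
    # Minimum per-block ratio, i.e. the quantity bounded in Proposition 1.
    return min(l1_l2_ratio([v[i] for i in idx]) for idx in blocks)


# Toy input (assumed): two blocks whose entries share a magnitude within each block.
v = [4.0, -4.0, 0.5, 0.5]
blocks = [[0, 1], [2, 3]]

compressed = blockwise_compress(v, blocks)
whole = l1_l2_ratio(v)                   # ratio of the non-block compressor (1)
per_block = min_block_ratio(v, blocks)   # ratio of the blockwise compressor
```

In this toy case every block has entries of equal magnitude, so the blockwise compressor reproduces the input exactly and its ratio is 1, whereas a single scaling factor over the whole vector gives a strictly smaller ratio.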
The intuition is that (i) gradients in the same parameter tensor/matrix/vector typically have similar magnitudes, and (ii) the corresponding scaling factors can thus be tighter than the scaling factor obtained on the whole parameter, leading to a larger $\\delta$. As an illustration of (i), Figure 1(a) shows the coefficient of variation (which is defined as the ratio of the standard deviation to the mean) of $\\{|g_{t,i}|\\}_{i \\in \\mathcal{G}_b}$ averaged over all blocks and iterations in an epoch, obtained from ResNet-20 on the CIFAR-100 dataset (with a mini-batch size of 16 per worker).2 A value smaller than 1 indicates that the absolute gradient values in each block concentrate around the mean. As for point (ii) above, consider the case where all the blocks are of the same size ($d_b = \\tilde{d}, \\forall b$), elements in the same block have the same magnitude ($\\forall i \\in \\mathcal{G}_b, |v_i| = c_b$ for some $c_b$), and the magnitude is increasing across blocks ($c_b/c_{b+1} = \\alpha$ for some $\\alpha < 1$). For the standard compressor in (1), $\\delta = \\frac{\\|v\\|_1^2}{d\\|v\\|_2^2} = \\frac{(1 + \\alpha)(1 - \\alpha^B)}{B(1 - \\alpha)(1 + \\alpha^B)} \\approx \\frac{1 + \\alpha}{B(1 - \\alpha)}$ for a sufficiently large $B$; whereas for the proposed blockwise compressor, $\\phi(v) = 1 \\ge \\frac{1 + \\alpha}{B(1 - \\alpha)}$. Figure 1(b) shows the empirical estimates of $\\|v\\|_1^2/(d\\|v\\|_2^2)$ and $\\phi(v)$ in the ResNet-20 experiment. As can be seen, $\\phi(v) \\gg \\|v\\|_1^2/(d\\|v\\|_2^2)$.\n\n(a) Coefficient of variation of $\\{|g_{t,i}|\\}_{i \\in \\mathcal{G}_b}$. (b) $\\delta$ for blockwise and non-block versions.\n\nFigure 1: Illustrations using the ResNet-20 in Section 4.1. Left: Averaged coefficient of variation of $\\{|g_{t,i}|\\}_{i \\in \\mathcal{G}_b}$. Right: Empirical estimates of $\\delta$ for the blockwise ($\\phi(v)$ in Proposition 1) and non-block versions ($\\|v\\|_1^2/(d\\|v\\|_2^2)$). Each point is the minimum among all iterations in an epoch. The lower bounds, $\\min_{b \\in [B]} 1/d_b$ and $1/d$, are also shown. 
Note that the ordinate is in log scale.\n\nThe per-iteration communication costs of the various distributed algorithms are shown in Table 1. Compared to signSGD with majority vote [3], dist-EF-blockSGD requires an extra $64MB$ bits for transmitting the blockwise scaling factors (each factor $\\|v_{\\mathcal{G}_b}\\|_1/d_b$ is stored in float32 format and transmitted twice in each iteration). By treating each vector/matrix/tensor parameter as a block, $B$ is typically in the order of hundreds. For most problems of interest, $64MB/(2Md) < 10^{-3}$. The reduction in communication cost compared to full-precision distributed SGD is thus nearly 32x.\n\n2The detailed experimental setup is in Section 4.1.\n\nTable 1: Communication costs of the various distributed gradient compression algorithms and SGD.\n\nalgorithm | #bits per iteration\nfull-precision SGD | $64Md$\nsignSGD with majority vote | $2Md$\ndist-EF-blockSGD | $2Md + 64MB$\n\n3.3 Nesterov's Momentum\n\nMomentum has been widely used in deep networks [20]. Standard distributed SGD with Nesterov's momentum [14] and full-precision gradients uses the update: $m_{t,i} = \\mu m_{t-1,i} + g_{t,i}, \\forall i \\in [M]$ and $x_{t+1} = x_t - \\eta_t \\frac{1}{M}\\sum_{i=1}^M (\\mu m_{t,i} + g_{t,i})$, where $m_{t,i}$ is a local momentum vector maintained by each worker $i$ at time $t$ (with $m_{0,i} = 0$), and $\\mu \\in [0, 1)$ is the momentum parameter. In this section, we extend the proposed dist-EF-SGD with momentum. Instead of sending the compressed $g_{t,i} + \\frac{\\eta_{t-1}}{\\eta_t} e_{t,i}$ to the server, the compressed $\\mu m_{t,i} + g_{t,i} + \\frac{\\eta_{t-1}}{\\eta_t} e_{t,i}$ is sent. The server merges all the workers' results and sends the merged result back to each worker. 
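The round just described can be sketched in a few lines. The following is a hedged, pure-Python illustration of one synchronous round of two-way compressed momentum SGD with error feedback; it is not the authors' code. For brevity it uses the non-block scaled sign operator on the whole vector, and the function names, the toy per-worker gradients and the hyper-parameter values are assumptions chosen only for this example. The loop also records how far the error-corrected iterate (the iterate minus the previous stepsize times the sum of the server error and the averaged worker errors) deviates from a plain full-precision momentum step.

```python
def scaled_sign(v):
    # Scaled sign operator: (||v||_1 / d) * sign(v).
    d = len(v)
    scale = sum(abs(x) for x in v) / d
    return [scale * ((x > 0) - (x < 0)) for x in v]


def dist_ef_sgdm_round(x, grads, m, e, e_server, eta, eta_prev, mu):
    # grads, m, e hold one vector per worker; e_server is the server error.
    M, d = len(grads), len(x)
    ratio = eta_prev / eta
    deltas = []
    for i in range(M):
        # Worker side: momentum, error-corrected vector, compression.
        m[i] = [mu * m[i][j] + grads[i][j] for j in range(d)]
        p = [mu * m[i][j] + grads[i][j] + ratio * e[i][j] for j in range(d)]
        delta = scaled_sign(p)  # pushed to the server
        e[i] = [p[j] - delta[j] for j in range(d)]
        deltas.append(delta)
    # Server side: average, add rescaled server error, compress again.
    p_server = [sum(dl[j] for dl in deltas) / M + ratio * e_server[j]
                for j in range(d)]
    delta_server = scaled_sign(p_server)  # pulled by every worker
    e_server = [p_server[j] - delta_server[j] for j in range(d)]
    x = [x[j] - eta * delta_server[j] for j in range(d)]
    return x, m, e, e_server


def corrected(x, e, e_server, eta_last):
    # Error-corrected iterate: x - eta_last * (server error + mean worker error).
    M = len(e)
    return [x[j] - eta_last * (e_server[j] + sum(ei[j] for ei in e) / M)
            for j in range(len(x))]


# Toy run (all values assumed): 2 workers, 3 parameters, constant stepsize.
x = [1.0, -2.0, 0.5]
M, mu, eta = 2, 0.9, 0.1
m = [[0.0] * 3 for _ in range(M)]
e = [[0.0] * 3 for _ in range(M)]
e_server = [0.0] * 3
eta_prev = 0.0  # plays the role of the stepsize before the first iteration
gaps = []
for t in range(5):
    grads = [[(i + 1) * xj for xj in x] for i in range(M)]  # toy gradients
    before = corrected(x, e, e_server, eta_prev)
    x, m, e, e_server = dist_ef_sgdm_round(x, grads, m, e, e_server,
                                           eta, eta_prev, mu)
    step = [sum(mu * m[i][j] + grads[i][j] for i in range(M)) / M
            for j in range(3)]
    after = corrected(x, e, e_server, eta)
    gaps.append(max(abs(after[j] - (before[j] - eta * step[j]))
                    for j in range(3)))
    eta_prev = eta
```

In this toy run the recorded gaps stay at floating-point rounding level, which matches the recurrence for the error-corrected iterate: although the actual iterate moves along compressed directions, the corrected iterate follows the uncompressed momentum update.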
The resultant procedure with blockwise compressor is called dist-EF-blockSGDM (Algorithm 4), and has the same communication cost as dist-EF-blockSGD. The corresponding non-block variant is analogous.\n\nAlgorithm 4 Distributed Blockwise Momentum SGD with Error-Feedback (dist-EF-blockSGDM)\n1: Input: stepsize sequence $\\{\\eta_t\\}$ with $\\eta_{-1} = 0$; momentum parameter $0 \\le \\mu < 1$; number of workers $M$; block partition $\\{\\mathcal{G}_1, \\dots, \\mathcal{G}_B\\}$.\n2: Initialize: $x_0 \\in \\mathbb{R}^d$; $m_{-1,i} = e_{0,i} = 0 \\in \\mathbb{R}^d$ on each worker $i$; $\\tilde{e}_0 = 0 \\in \\mathbb{R}^d$ on server\n3: for $t = 0, \\dots, T - 1$ do\n4:   on each worker $i$\n5:     $m_{t,i} = \\mu m_{t-1,i} + g_{t,i}$ {stochastic gradient $g_{t,i} = \\nabla f(x_t, \\xi_{t,i})$}\n6:     $p_{t,i} = \\mu m_{t,i} + g_{t,i} + \\frac{\\eta_{t-1}}{\\eta_t} e_{t,i}$\n7:     push $\\Delta_{t,i} = \\left[\\frac{\\|p_{t,i,\\mathcal{G}_1}\\|_1}{d_1}\\mathrm{sign}(p_{t,i,\\mathcal{G}_1}), \\dots, \\frac{\\|p_{t,i,\\mathcal{G}_B}\\|_1}{d_B}\\mathrm{sign}(p_{t,i,\\mathcal{G}_B})\\right]$ to server\n8:     $x_{t+1} = x_t - \\eta_t \\tilde{\\Delta}_t$ {$\\tilde{\\Delta}_t$ is pulled from server}\n9:     $e_{t+1,i} = p_{t,i} - \\Delta_{t,i}$\n10:   on server\n11:     pull $\\Delta_{t,i}$ from each worker $i$ and $\\tilde{p}_t = \\frac{1}{M}\\sum_{i=1}^M \\Delta_{t,i} + \\frac{\\eta_{t-1}}{\\eta_t} \\tilde{e}_t$\n12:     push $\\tilde{\\Delta}_t = \\left[\\frac{\\|\\tilde{p}_{t,\\mathcal{G}_1}\\|_1}{d_1}\\mathrm{sign}(\\tilde{p}_{t,\\mathcal{G}_1}), \\dots, \\frac{\\|\\tilde{p}_{t,\\mathcal{G}_B}\\|_1}{d_B}\\mathrm{sign}(\\tilde{p}_{t,\\mathcal{G}_B})\\right]$ to each worker\n13:     $\\tilde{e}_{t+1} = \\tilde{p}_t - \\tilde{\\Delta}_t$\n14: end for\n\nSimilar to Lemma 1, the following Lemma shows that the error-corrected iterate $\\tilde{x}_t$ is very similar to Nesterov's accelerated gradient iterate, except that the momentum is computed based on $\\{x_t\\}$.\n\nLemma 3. The error-corrected iterate $\\tilde{x}_t = x_t - \\eta_{t-1}\\left(\\tilde{e}_t + \\frac{1}{M}\\sum_{i=1}^M e_{t,i}\\right)$, where $x_t$, $\\tilde{e}_t$, and $e_{t,i}$'s are generated from Algorithm 4, satisfies the recurrence: $\\tilde{x}_{t+1} = \\tilde{x}_t - \\eta_t \\frac{1}{M}\\sum_{i=1}^M (\\mu m_{t,i} + g_{t,i})$.\n\nAs in Section 3.1, it can be shown that $\\|\\tilde{e}_t + \\frac{1}{M}\\sum_{i=1}^M e_{t,i}\\|_2$ is bounded and $\\nabla F(\\tilde{x}_t) \\approx \\nabla F(x_t)$. The following Theorem shows the convergence rate of the proposed dist-EF-blockSGDM.\n\nTheorem 2. Suppose that Assumptions 1-3 hold. Let $\\eta_t = \\eta$ for some $\\eta > 0$. For any $\\eta \\le \\frac{(1 - \\mu)^2}{2L}$, and the $\\{x_t\\}$ sequence generated from Algorithm 4, we have\n\n$$\\mathbb{E}[\\|\\nabla F(x_o)\\|_2^2] \\le \\frac{4(1 - \\mu)}{\\eta T}[F(x_0) - F_*] + \\frac{2L\\eta\\sigma^2}{(1 - \\mu)M}\\left[1 + \\frac{2L\\eta\\mu^4}{(1 - \\mu)^3}\\right] + \\frac{32L^2\\eta^2(1 - \\delta)G^2}{\\delta^2(1 - \\mu)^2}\\left[1 + \\frac{16}{\\delta^2}\\right]. \\quad (2)$$\n\nCompared to Theorem 1, using a larger momentum parameter $\\mu$ makes the first term (which depends on the initial condition) smaller, but worsens the variance term (second term) and the error term due to gradient compression (last term). Similar to Theorem 1, a larger $\\eta$ makes the third term larger. The following Corollary shows that the proposed dist-EF-blockSGDM achieves a convergence rate of $O(((1 - \\mu)[F(x_0) - F_*] + \\sigma^2/(1 - \\mu))/\\sqrt{MT})$.\n\n(a) Mini-batch size: 8 per worker. (b) Mini-batch size: 16 per worker. (c) Mini-batch size: 32 per worker.\n\nFigure 2: Testing accuracy on CIFAR-100. Top: No momentum; Bottom: With momentum. The solid curve is the mean accuracy over five repetitions. The shaded region spans one standard deviation.\n\nCorollary 3. 
Let $\\eta = \\frac{\\gamma}{\\sqrt{T}/\\sqrt{M} + (1 - \\delta)^{1/3}(1/\\delta^2 + 16/\\delta^4)^{1/3} T^{1/3}}$ for some $\\gamma > 0$. For any $T \\ge \\frac{4\\gamma^2 L^2 M}{(1 - \\mu)^4}$,\n\n$$\\mathbb{E}[\\|\\nabla F(x_o)\\|_2^2] \\le \\left[\\frac{2(1 - \\mu)}{\\gamma}[F(x_0) - F_*] + \\frac{L\\gamma\\sigma^2}{1 - \\mu}\\right]\\frac{2}{\\sqrt{MT}} + \\frac{4L^2\\gamma^2\\mu^4\\sigma^2}{(1 - \\mu)^4 T} + \\frac{4(1 - \\delta)^{1/3}\\left[\\frac{1 - \\mu}{\\gamma}[F(x_0) - F_*] + 8L^2\\gamma^2 G^2/(1 - \\mu)^2\\right]}{\\delta^{2/3} T^{2/3}}\\left[1 + \\frac{16}{\\delta^2}\\right]^{1/3}.$$\n\n4 Experiments\n\n4.1 Multi-GPU Experiment on CIFAR-100\n\nIn this experiment, we demonstrate that the proposed dist-EF-blockSGDM and dist-EF-blockSGD ($\\mu = 0$ in Algorithm 4), though using fewer bits for gradient transmission, still have good convergence. For faster experimentation, we use a single node with multiple GPUs (an AWS P3.16 instance with 8 Nvidia V100 GPUs, each GPU being a worker) instead of a distributed setting.\n\nThe experiment is performed on the CIFAR-100 dataset, with 50K training images and 10K test images. We use a 20-layer ResNet [10]. Each parameter tensor/matrix/vector is treated as a block in dist-EF-blockSGD(M). They are compared with (i) distributed synchronous SGD (with full-precision gradient); (ii) distributed synchronous SGD (full-precision gradient) with momentum (SGDM); (iii) signSGD with majority vote [3]; and (iv) signum with majority vote [4]. All the algorithms are implemented in MXNet. We vary the mini-batch size per worker in {8, 16, 32}. Results are averaged over 5 repetitions. More details of the experiments are shown in Appendix A.1.\n\nFigure 2 shows convergence of the testing accuracy w.r.t. the number of epochs. As can be seen, dist-EF-blockSGD converges as fast as SGD and has slightly better accuracy, while signSGD performs poorly. In particular, dist-EF-blockSGD is robust to the mini-batch size, while the performance of signSGD degrades with smaller mini-batch size (which agrees with the results in [3]). Momentum makes SGD and dist-EF-blockSGD faster with mini-batch size of 16 or 32 per worker, particularly before epoch 100. 
At epoch 100, the learning rate is reduced, and the difference is less obvious. This is because a larger mini-batch means smaller variance $\\sigma^2$, so the initial optimality gap $F(x_0) - F_*$ in (2) is more dominant. Use of momentum ($\\mu > 0$) is then beneficial. On the other hand, momentum significantly improves signSGD. However, signum is still much worse than dist-EF-blockSGDM.\n\n4.2 Distributed Training on ImageNet\n\nIn this section, we perform distributed optimization on ImageNet [15] using a 50-layer ResNet. Each worker is an AWS P3.2 instance with 1 GPU, and the parameter server is housed in one node. We use the publicly available code3 in [4], and the default communication library Gloo in PyTorch. As in [4], we use its allreduce implementation for SGDM, which is faster.\n\n(a) Test accuracy w.r.t. epoch. (b) Test accuracy w.r.t. time. (c) Workload breakdown. (d) Test accuracy w.r.t. epoch. (e) Test accuracy w.r.t. time. (f) Workload breakdown.\n\nFigure 3: Distributed training results on the ImageNet dataset. Top: 7 workers; Bottom: 15 workers.\n\nAs momentum accelerates the training for large mini-batch size in Section 4.1, we only compare the momentum variants here. The proposed dist-EF-blockSGDM is compared with (i) distributed synchronous SGD with momentum (SGDM); and (ii) signum with majority vote [4]. The number of workers $M$ is varied in {7, 15}. With an odd number of workers, a majority vote will not produce zero, and so signum does not lose accuracy by using 1-bit compression. More details of the setup are in Appendix A.2.\n\nFigure 3 shows the testing accuracy w.r.t. the number of epochs and wall clock time. As in Section 4.1, SGDM and dist-EF-blockSGDM have comparable accuracies, while signum is inferior. When 7 workers are used, dist-EF-blockSGDM has higher accuracy than SGDM (76.77% vs 76.27%). 
dist-EF-blockSGDM reaches SGDM\u2019s highest accuracy in around 13 hours, while SGDM takes 24 hours\n(Figure 3(b)), i.e., about 46% less wall clock time. With 15 machines, the improvement is smaller (Figure 3(e)).\nThis is because the burden on the parameter server is heavier. We expect that a speedup comparable to the\n7-worker setting can be obtained by using more parameter servers. In both cases, signum converges\nfast but its test accuracies are about 4% worse.\nFigures 3(c) and 3(f) show a breakdown of wall clock time into computation and communication\ntime.\u2074 All methods have comparable computation costs, but signum and dist-EF-blockSGDM\nhave lower communication costs than SGDM. The communication costs for signum and\ndist-EF-blockSGDM are comparable for 7 workers, but for 15 workers signum is lower. We speculate that this\nis because the sign vectors and scaling factors are sent separately to the server in our implementation,\nwhich causes more latency on the server with more workers. This may be alleviated if the two\noperations are fused.\n\n5 Conclusion\n\nIn this paper, we proposed a distributed blockwise SGD algorithm with error feedback and\nmomentum. By partitioning the gradients into blocks, we can transmit each block of gradient using\n1-bit quantization with its average \u2113_1-norm. The proposed methods are communication-efficient and\nhave the same convergence rates as full-precision distributed SGD/SGDM for nonconvex objectives.\nExperimental results show that the proposed methods have fast convergence and achieve the same\ntest accuracy as SGD/SGDM, while signSGD and signum achieve much worse accuracies.\n\n\u00b3 https://github.com/PermiJW/signSGD-with-Majority-Vote\n\u2074 Following [4], communication time includes the extra computation time for error feedback and compression.\n\nReferences\n[1] A. F. Aji and K. Heafield. Sparse communication for distributed gradient descent. 
In Proceedings\nof the Conference on Empirical Methods in Natural Language Processing, pages 440\u2013445,\n2017.\n\n[2] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-efficient\nSGD via gradient quantization and encoding. In Proceedings of the Neural Information\nProcessing Systems Conference, pages 1709\u20131720, 2017.\n\n[3] J. Bernstein, Y. Wang, K. Azizzadenesheli, and A. Anandkumar. signSGD: Compressed\noptimisation for non-convex problems. In Proceedings of the International Conference on\nMachine Learning, pages 560\u2013569, 2018.\n\n[4] J. Bernstein, J. Zhao, K. Azizzadenesheli, and A. Anandkumar. signSGD with majority vote\nis communication efficient and fault tolerant. In Proceedings of the International Conference\non Learning Representations, 2019.\n\n[5] L. Bottou and Y. LeCun. Large scale online learning. In Proceedings of the Neural Information\nProcessing Systems Conference, pages 217\u2013224, 2004.\n\n[6] J. Chen, X. Pan, R. Monga, S. Bengio, and R. Jozefowicz. Revisiting distributed synchronous\nSGD. Preprint arXiv:1604.00981, 2016.\n\n[7] J. Cordonnier. Convex optimization using sparsified stochastic gradient descent with memory.\nTechnical report, 2018.\n\n[8] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, and A. Ng. Large scale\ndistributed deep networks. In Proceedings of the Neural Information Processing Systems\nConference, pages 1223\u20131231, 2012.\n\n[9] A. Graves. Generating sequences with recurrent neural networks. Preprint arXiv:1308.0850,\n2013.\n\n[10] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In\nProceedings of the European Conference on Computer Vision, pages 630\u2013645, 2016.\n\n[11] S. P. Karimireddy, Q. Rebjock, S. U. Stich, and M. Jaggi. Error feedback fixes signSGD\nand other gradient compression schemes. 
In Proceedings of the International Conference on\nMachine Learning, pages 3252\u20133261, 2019.\n\n[12] M. Li, D. G. Andersen, A. J. Smola, and K. Yu. Communication efficient distributed machine\nlearning with the parameter server. In Proceedings of the Neural Information Processing\nSystems Conference, pages 19\u201327, 2014.\n\n[13] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally. Deep gradient compression: Reducing\nthe communication bandwidth for distributed training. In Proceedings of the International\nConference on Learning Representations, 2018.\n\n[14] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate\nO(1/k^2). In Dokl. Akad. Nauk SSSR, volume 269, pages 543\u2013547, 1983.\n\n[15] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,\nA. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition\nchallenge. International Journal of Computer Vision, 115(3):211\u2013252, 2015.\n\n[16] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application\nto data-parallel distributed training of speech DNNs. In Proceedings of the Annual Conference of\nthe International Speech Communication Association, 2014.\n\n[17] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser,\nI. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner,\nI. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering\nthe game of Go with deep neural networks and tree search. Nature, 529(7587):484\u2013489, 2016.\n\n[18] S. U. Stich, J. Cordonnier, and M. Jaggi. Sparsified SGD with memory. In Proceedings of the\nNeural Information Processing Systems Conference, pages 4452\u20134463, 2018.\n\n[19] N. Strom. 
Scalable distributed DNN training using commodity GPU cloud computing. In\nProceedings of the Annual Conference of the International Speech Communication Association,\n2015.\n\n[20] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and\nmomentum in deep learning. In Proceedings of the International Conference on Machine\nLearning, pages 1139\u20131147, 2013.\n\n[21] H. Tang, C. Yu, X. Lian, T. Zhang, and J. Liu. DoubleSqueeze: Parallel stochastic gradient\ndescent with double-pass error-compensated compression. In Proceedings of the International\nConference on Machine Learning, pages 6155\u20136165, 2019.\n\n[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, \u0141. Kaiser, and\nI. Polosukhin. Attention is all you need. In Proceedings of the Neural Information Processing\nSystems Conference, pages 5998\u20136008, 2017.\n\n[23] J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communication-efficient\ndistributed optimization. In Proceedings of the Neural Information Processing Systems\nConference, pages 1306\u20131316, 2018.\n\n[24] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. TernGrad: Ternary gradients to\nreduce communication in distributed deep learning. In Proceedings of the Neural Information\nProcessing Systems Conference, pages 1509\u20131519, 2017.\n\n[25] J. Wu, W. Huang, J. Huang, and T. Zhang. Error compensated quantized SGD and its applications\nto large-scale distributed optimization. In Proceedings of the International Conference on\nMachine Learning, pages 5321\u20135329, 2018.\n\n[26] E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu.\nPetuum: A new platform for distributed machine learning on big data. IEEE Transactions on\nBig Data, 1(2):49\u201367, 2015.\n\n[27] W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. Preprint\narXiv:1409.2329, 2014.\n\n[28] M. 
Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent.\nIn Proceedings of the Neural Information Processing Systems Conference, pages 2595\u20132603,\n2010.\n", "award": [], "sourceid": 6099, "authors": [{"given_name": "Shuai", "family_name": "Zheng", "institution": "Hong Kong University of Science and Technology / Amazon Web Services"}, {"given_name": "Ziyue", "family_name": "Huang", "institution": "Hong Kong University of Science and Technology"}, {"given_name": "James", "family_name": "Kwok", "institution": "Hong Kong University of Science and Technology"}]}