{"title": "TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1509, "page_last": 1519, "abstract": "High network communication cost for synchronizing gradients and parameters is the well-known bottleneck of distributed training. In this work, we propose TernGrad that uses ternary gradients to accelerate distributed deep learning in data parallelism. Our approach requires only three numerical levels {-1,0,1}, which can aggressively reduce the communication time. We mathematically prove the convergence of TernGrad under the assumption of a bound on gradients. Guided by the bound, we propose layer-wise ternarizing and gradient clipping to improve its convergence. Our experiments show that applying TernGrad on AlexNet does not incur any accuracy loss and can even improve accuracy. The accuracy loss of GoogLeNet induced by TernGrad is less than 2% on average. Finally, a performance model is proposed to study the scalability of TernGrad. Experiments show significant speed gains for various deep neural networks. Our source code is available.", "full_text": "TernGrad: Ternary Gradients to Reduce\n\nCommunication in Distributed Deep Learning\n\nWei Wen1, Cong Xu2, Feng Yan3, Chunpeng Wu1, Yandan Wang4, Yiran Chen1, Hai Li1\n\n1Duke University, 2Hewlett Packard Labs, 3University of Nevada \u2013 Reno, 4University of Pittsburgh\n\n1{wei.wen, chunpeng.wu, yiran.chen, hai.li}@duke.edu\n\n2cong.xu@hpe.com, 3fyan@unr.edu, 4yaw46@pitt.edu\n\nAbstract\n\nHigh network communication cost for synchronizing gradients and parameters\nis the well-known bottleneck of distributed training. In this work, we propose\nTernGrad that uses ternary gradients to accelerate distributed deep learning in data\nparallelism. Our approach requires only three numerical levels {\u22121, 0, 1}, which\ncan aggressively reduce the communication time. 
We mathematically prove the convergence of TernGrad under the assumption of a bound on gradients. Guided by the bound, we propose layer-wise ternarizing and gradient clipping to improve its convergence. Our experiments show that applying TernGrad on AlexNet does not incur any accuracy loss and can even improve accuracy. The accuracy loss of GoogLeNet induced by TernGrad is less than 2% on average. Finally, a performance model is proposed to study the scalability of TernGrad. Experiments show significant speed gains for various deep neural networks. Our source code is available¹.

1 Introduction
The remarkable advances in deep learning are driven by the explosion of data and the increase of model size. The training of large-scale models with huge amounts of data is often carried out on distributed systems [1][2][3][4][5][6][7][8][9], where data parallelism is adopted to exploit the compute capability empowered by multiple workers [10]. Stochastic Gradient Descent (SGD) is usually selected as the optimization method because of its high computation efficiency. In realizing the data parallelism of SGD, model copies on computing workers are trained in parallel by applying different subsets of data. A centralized parameter server performs gradient synchronization by collecting all gradients and averaging them to update parameters. The updated parameters are then sent back to the workers, i.e., parameter synchronization. Increasing the number of workers helps to reduce the computation time dramatically. However, as the scale of distributed systems grows, the extensive gradient and parameter synchronizations prolong the communication time and can even offset the savings in computation time [4][11][12]. A common approach to overcome such a network bottleneck is asynchronous SGD [1][4][7][12][13][14], which continues computation using stale values without waiting for synchronization to complete.
The inconsistency of parameters across computing workers, however, can degrade training accuracy and incur occasional divergence [15][16]. Moreover, its workload dynamics make the training nondeterministic and hard to debug.
From the perspective of inference acceleration, sparse and quantized Deep Neural Networks (DNNs) have been widely studied, such as [17][18][19][20][21][22][23][24][25]. However, these methods generally aggravate the training effort. Studies on sparse logistic regression and Lasso optimization problems [4][12][26] took advantage of the sparsity inherent in the models and achieved

1https://github.com/wenwei202/terngrad

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

remarkable speedup for distributed training. A more generic and important topic is how to accelerate the distributed training of dense models by utilizing sparsity and quantization techniques. For instance, Aji and Heafield [27] proposed to heuristically sparsify dense gradients by dropping small values in order to reduce gradient communication. For the same purpose, quantizing gradients to low-precision values with smaller bit widths has also been extensively studied [22][28][29][30]. Our work belongs to the category of gradient quantization, which is an orthogonal approach to sparsity methods. We propose TernGrad, which quantizes gradients to the ternary levels {−1, 0, 1} to reduce the overhead of gradient synchronization. Furthermore, we propose scaler sharing and parameter localization, which can replace parameter synchronization with low-precision gradient pulling.
Compared with previous works, our major contributions include: (1) we use ternary values for gradients to reduce communication; (2) we mathematically prove the convergence of TernGrad in general by proposing a statistical bound on gradients; (3) we propose layer-wise ternarizing and gradient clipping to move this bound closer toward the bound of standard SGD, and these simple techniques successfully improve the convergence; (4) we build a performance model to evaluate the speed of training methods with compressed gradients, like TernGrad.

2 Related work
Gradient sparsification. Aji and Heafield [27] proposed a heuristic gradient sparsification method that truncates the smallest gradients and transmits only the remaining large ones. The method greatly reduced the gradient communication and achieved a 22% speed gain on 4 GPUs for neural machine translation, without impacting the translation quality. An earlier study by Garg et al. [31] adopted a similar approach, but targeted sparsity recovery instead of training acceleration. Our proposed TernGrad is orthogonal to these sparsity-based methods.
Gradient quantization. DoReFa-Net [22], derived from AlexNet, reduced the bit widths of weights, activations and gradients to 1, 2 and 6, respectively. However, DoReFa-Net showed a 9.8% accuracy loss, as it targeted acceleration on a single worker. S. Gupta et al. [30] successfully trained neural networks on the MNIST and CIFAR-10 datasets using 16-bit numerical precision for an energy-efficient hardware accelerator. Our work, instead, aims to speed up distributed training by decreasing the communicated gradients to three numerical levels {−1, 0, 1}. F. Seide et al. [28] applied 1-bit SGD to accelerate distributed training and empirically verified its effectiveness in speech applications. As the gradient quantization is conducted by columns, a floating-point scaler per column is required.
Thus it cannot yield a speed benefit on convolutional neural networks [29]. Moreover, the "cold start" of the method [28] requires floating-point gradients to converge to a good initial point before the following 1-bit SGD. More importantly, it is unknown what conditions can guarantee its convergence. In comparison, our TernGrad can start DNN training from scratch, and we prove the conditions that guarantee the convergence of TernGrad. A. T. Suresh et al. [32] proposed stochastic rotated quantization of gradients and reduced the gradient precision to 4 bits for the MNIST and CIFAR datasets. However, TernGrad achieves lower precision on a larger dataset (e.g., ImageNet), and has more efficient computation for quantization in each computing node.
A parallel work by D. Alistarh et al. [29] presented QSGD, which explores the trade-off between accuracy and gradient precision. The effectiveness of gradient quantization was justified and the convergence of QSGD was provably guaranteed. Compared to QSGD, which was developed simultaneously, our TernGrad shares the same concept but advances it in the following three aspects: (1) we prove the convergence from the perspective of a statistical bound on gradients; the bound also explains why multiple quantization buckets are necessary in QSGD; (2) the bound is used to guide practice and inspires the techniques of layer-wise ternarizing and gradient clipping; (3) TernGrad using only 3-level gradients achieves a 0.92% top-1 accuracy improvement for AlexNet, while a 1.73% top-1 accuracy loss is observed in QSGD with 4 levels. The accuracy loss in QSGD can be eliminated only by paying the cost of increasing the precision to 4 bits (16 levels) and beyond.

3 Problem Formulation and Our Approach
3.1 Problem Formulation and TernGrad
Figure 1 formulates the distributed training problem of synchronous SGD using data parallelism.
At iteration t, a mini-batch of training samples is split and fed into multiple workers (i ∈ {1, ..., N}). Worker i computes the gradients g(i)t of the parameters w.r.t. its input samples z(i)t. All gradients are first synchronized and averaged at the parameter server, and then sent back to update the workers. Note that the parameter server in most implementations [1][12] is used to preserve shared parameters, while here we utilize it in a slightly different way: maintaining shared gradients. In Figure 1, each worker keeps a copy of the parameters locally. We name this technique parameter localization. The parameter consistency among workers can be maintained by random initialization with an identical seed. Parameter localization changes the communication of parameters in floating-point form into the transfer of quantized gradients, which requires much lighter traffic. Note that our proposed TernGrad can be integrated with many settings like asynchronous SGD [1][4], even though the scope of this paper only focuses on the distributed SGD in Figure 1.

Algorithm 1 formulates the t-th iteration of the TernGrad algorithm according to Figure 1. Most steps of TernGrad remain the same as in traditional distributed training, except that gradients shall be quantized into ternary precision before being sent to the parameter server. More specifically, ternarize(·) aims to reduce the communication volume of gradients. It randomly quantizes the gradient gt² to a ternary vector with values ∈ {−1, 0, +1}. Formally, with a random binary vector bt, gt is ternarized as

˜gt = ternarize(gt) = st · sign(gt) ◦ bt,   (1)

where

st ≜ max(abs(gt)) ≜ ||gt||∞   (2)

is a scaler, e.g., the maximum norm, that can shrink ±1 to a much smaller amplitude. ◦ is the Hadamard product. sign(·) and abs(·) respectively return the sign and absolute value of each element. Given a gt, each element of bt independently follows the Bernoulli distribution

P(btk = 1 | gt) = |gtk|/st,
P(btk = 0 | gt) = 1 − |gtk|/st,   (3)

where btk and gtk are the k-th elements of bt and gt, respectively. This stochastic rounding, instead of a deterministic one, is chosen by both our study and QSGD [29], as stochastic rounding has an unbiased expectation and has been successfully studied for low-precision processing [20][30].

Theoretically, ternary gradients can reduce the worker-to-server traffic at least by a factor of 32/log2(3) = 20.18×. Even using 2 bits to encode a ternary gradient, the reduction factor is still 16×. In this work, we compare TernGrad with 32-bit gradients, considering that 32-bit is the default precision in modern deep learning frameworks. Although a lower precision (e.g., 16-bit) may be enough in some scenarios, this does not undervalue TernGrad. As aforementioned, parameter localization reduces server-to-worker traffic by pulling quantized gradients from servers. However, summing up ternary values in Σi ˜g(i)t produces more possible levels, and thereby the final averaged gradient gt is no longer ternary, as shown in Figure 2(d). This emerges as a critical issue when workers use different scalers s(i)t. To minimize the number of levels, we propose a shared scaler

st = max({s(i)t} : i = 1...N)   (4)

across all the workers. We name this technique scaler sharing. The sharing process has a small overhead of transferring 2N floating scalars. By integrating parameter localization and scaler sharing, the maximum number of levels in gt decreases to 2N + 1.
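To make Eqs. (1)-(4) concrete, here is a minimal NumPy sketch of stochastic ternarization with an optional shared scaler; the function name and the two-worker example are our own illustration, not the released TernGrad code.

```python
import numpy as np

def ternarize(g, rng, s=None):
    """Stochastic ternarization, Eqs. (1)-(3).

    s defaults to this gradient's own scaler ||g||_inf (Eq. 2); passing
    the shared scaler max_i s_i of Eq. (4) keeps the averaged gradient
    on a small set of levels (at most 2N + 1 for N workers)."""
    if s is None:
        s = np.max(np.abs(g))                 # s_t = ||g_t||_inf
    if s == 0:
        return np.zeros_like(g)
    b = rng.random(g.shape) < np.abs(g) / s   # b_tk ~ Bernoulli(|g_tk|/s_t)
    return s * np.sign(g) * b                 # values in {-s_t, 0, +s_t}

# Two-worker example with scaler sharing: the server averages the
# ternarized gradients; stochastic rounding keeps the average unbiased.
rng = np.random.default_rng(0)
grads = [np.array([0.4, -0.1, 0.0]), np.array([0.2, -0.3, 0.1])]
s_shared = max(np.max(np.abs(g)) for g in grads)   # Eq. (4)
avg = np.mean([ternarize(g, rng, s_shared) for g in grads], axis=0)
```

Because each element is kept with probability |gtk|/st and scaled back by st, the expectation of the ternarized vector equals the original gradient, which is the unbiasedness used in the convergence analysis below.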
As a result, the server-to-worker communication reduces by a factor of 32/log2(1 + 2N), unless N ≥ 2^30.

Algorithm 1 TernGrad: distributed SGD training using ternary gradients.
Worker: i = 1, ..., N
1: Input z(i)t, a part of a mini-batch of training samples zt
2: Compute gradients g(i)t under z(i)t
3: Ternarize gradients to ˜g(i)t = ternarize(g(i)t)
4: Push ternary ˜g(i)t to the server
5: Pull averaged gradients gt from the server
6: Update parameters wt+1 ← wt − η · gt
Server:
7: Average ternary gradients gt = Σi ˜g(i)t / N

Figure 1: Distributed SGD with data parallelism.

²Here, the superscript of gt is omitted for simplicity.

Convexity is a subset of Assumption 1, and we can easily find non-convex functions satisfying it.
Assumption 2. The learning rate γt is positive and constrained as

Σ(t=0 to +∞) γt² < +∞  and  Σ(t=0 to +∞) γt = +∞.   (12)

3.2 Convergence Analysis and Gradient Bound
We analyze the convergence of TernGrad in the framework of online learning systems. An online learning system adapts its parameter w to a sequence of observations to maximize performance. Each observation z is drawn from an unknown distribution, and a loss function Q(z, w) is used to measure the performance of the current system with parameter w and input z.
The minimization target then is the loss expectation

C(w) ≜ E{Q(z, w)}.   (5)

In the General Online Gradient Algorithm (GOGA) [33], the parameter is updated at learning rate γt as

wt+1 = wt − γt·gt = wt − γt · ∇wQ(zt, wt),   (6)

where

g ≜ ∇wQ(z, w)   (7)

and the subscript t denotes observing step t. In GOGA, E{g} is the gradient of the minimization target in Eq. (5).
According to Eq. (1), the parameter in TernGrad is updated as

wt+1 = wt − γt (st · sign(gt) ◦ bt),   (8)

where st ≜ ||gt||∞ is a random variable depending on zt and wt. As gt is known for given zt and wt, Eq. (3) is equivalent to

P(btk = 1 | zt, wt) = |gtk|/st,
P(btk = 0 | zt, wt) = 1 − |gtk|/st.   (9)

At any given wt, the expectation of the ternary gradient satisfies

E{st · sign(gt) ◦ bt} = E{st · sign(gt) ◦ E{bt|zt}} = E{gt} = ∇wC(wt),   (10)

which is an unbiased gradient of the minimization target in Eq. (5); elementwise, E{st · sign(gtk) · btk | zt, wt} = st · sign(gtk) · |gtk|/st = gtk.
The convergence analysis of TernGrad is adapted from the convergence proof of GOGA presented in [33]. We adopt two assumptions, which were used in the analysis of the convergence of standard GOGA in [33]. Without explicit mention, vectors indicate column vectors here.
Assumption 1. C(w) has a single minimum w∗ and the gradient −∇wC(w) always points to w∗, i.e.,

∀ε > 0,  inf over ||w − w∗||² > ε of (w − w∗)T ∇wC(w) > 0.   (11)

(The two constraints in Assumption 2 ensure that γt decreases neither too fast nor too slow, respectively.)
We define the square of the distance between the current parameter wt and the minimum w∗ as

ht ≜ ||wt − w∗||²,   (13)

where ||·|| is the ℓ2 norm. We also define the set of all random variables before step t as

Xt ≜ (z1...t−1, b1...t−1).   (14)

Under Assumption 1 and Assumption 2, using the Lyapunov process and the Quasi-Martingale convergence theorem, L. Bottou [33] proved
Lemma 1. If ∃A, B > 0 s.t.

E{(ht+1 − (1 + γt²B) ht) | Xt} ≤ −2γt(wt − w∗)T ∇wC(wt) + γt²A,   (15)

then C(z, w) converges almost surely toward the minimum w∗, i.e., P(limt→+∞ wt = w∗) = 1.

We further make an assumption on the gradient:
Assumption 3 (Gradient Bound). The gradient g is bounded as

E{||g||∞ · ||g||1} ≤ A + B ||w − w∗||²,   (16)

where A, B > 0 and ||·||1 is the ℓ1 norm.
With Assumption 3 and Lemma 1, we prove Theorem 1 (in Supplementary Material):
Theorem 1. When online learning systems update as

wt+1 = wt − γt (st · sign(gt) ◦ bt)   (17)

using stochastic ternary gradients, they converge almost surely toward the minimum w∗, i.e., P(limt→+∞ wt = w∗) = 1.
Comparing with the gradient bound of standard GOGA [33],

E{||g||²} ≤ A + B ||w − w∗||²,   (18)

the bound in Assumption 3 is stronger because

||g||∞ · ||g||1 ≥ ||g||².   (19)

We propose layer-wise ternarizing and gradient clipping to make the two bounds closer, which shall be explained in Section 3.3. A side benefit of our work is that, by following a similar proof procedure, we can prove the convergence of GOGA when Gaussian noise N(0, σ²) is added to gradients [34], under the gradient bound

E{||g||²} ≤ A + B ||w − w∗||² − σ².   (20)

Although this bound is also stronger, Gaussian noise encourages active exploration of the parameter space and improves accuracy, as was empirically studied in [34].
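The inequality in Eq. (19) holds term by term: gk² = |gk| · |gk| ≤ ||g||∞ · |gk|, and summing over k gives ||g||² ≤ ||g||∞ · ||g||1, with equality when all nonzero entries share one magnitude, which is exactly the shape of a ternarized vector. A quick numerical sketch of this check, assuming NumPy (not from the paper):

```python
import numpy as np

# Verify ||g||_inf * ||g||_1 >= ||g||_2^2 on random vectors (Eq. 19).
rng = np.random.default_rng(1)
for _ in range(1000):
    g = rng.normal(size=64)
    assert np.max(np.abs(g)) * np.sum(np.abs(g)) >= np.dot(g, g) - 1e-12

# Equality case: entries in {-s, 0, +s}, as produced by ternarize(.).
t = 0.3 * np.array([1.0, -1.0, 0.0, 1.0])
assert np.isclose(np.max(np.abs(t)) * np.sum(np.abs(t)), np.dot(t, t))
```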
Similarly, the randomness of ternary gradients also encourages exploration of the parameter space and improves accuracy for some models, as shall be presented in Section 4.

3.3 Feasibility Considerations

The gradient bound of TernGrad in Assumption 3 is stronger than the bound in standard GOGA. Pushing the two bounds closer can improve the convergence of TernGrad. In Assumption 3, ||g||∞ is the maximum absolute value of all the gradients in the DNN. So, in a large DNN, ||g||∞ could be relatively much larger than most gradients, implying that the bound in TernGrad becomes much stronger. Considering this situation, we propose layer-wise ternarizing and gradient clipping to reduce ||g||∞ and therefore shrink the gap between these two bounds.
Layer-wise ternarizing is proposed based on the observation that the range of gradients in each layer changes as gradients are back-propagated. Instead of adopting a large global maximum scaler,

Figure 2: Histograms of (a) original floating gradients, (b) clipped gradients, (c) ternary gradients and (d) final averaged gradients. Visualization by TensorBoard. The DNN is AlexNet distributed on two workers, and the vertical axis is the training iteration. As examples, the top row visualizes the third convolutional layer and the bottom one the first fully-connected layer.

we independently ternarize the gradients in each layer using layer-wise scalers. More specifically, we separately ternarize the gradients of biases and weights by using Eq. (1), where gt can be the gradients of the biases or weights in each layer. To approach the standard bound more closely, we could split gradients into more buckets and ternarize each bucket independently, as D. Alistarh et al. [29] do. However, this would introduce more floating scalers and increase communication.
When the size of a bucket is one, it degenerates to floating gradients.
Layer-wise ternarizing can shrink the bound gap resulting from the dynamic ranges of the gradients across layers. However, the dynamic range within a layer still remains a problem. We propose gradient clipping, which limits the magnitude of each gradient gi in g as

f(gi) = gi             if |gi| ≤ cσ,
        sign(gi) · cσ  if |gi| > cσ,   (21)

where σ is the standard deviation of the gradients in g. In distributed training, gradient clipping is applied on every worker before ternarizing. c is a hyper-parameter to select, but we cross-validated it only once and use the same constant in all our experiments. Specifically, we used a CNN [35] trained on CIFAR-10 by momentum SGD with a staircase learning rate and obtained the optimal c = 2.5. Supposing the distribution of gradients is close to a Gaussian distribution, as shown in Figure 2(a), very few gradients fall outside [−2.5σ, 2.5σ]. Clipping these gradients, as in Figure 2(b), significantly reduces the scaler but only slightly changes the length and direction of the original g. Numerical analysis shows that gradient clipping with c = 2.5 only changes the length of g by 1.0%-1.5% and its direction by 2°-3°. In our experiments, c = 2.5 remains valid across multiple databases (MNIST, CIFAR-10 and ImageNet), various network structures (LeNet, CifarNet, AlexNet, GoogLeNet, etc.) and training schemes (momentum, vanilla SGD, Adam, etc.).
The effectiveness of layer-wise ternarizing and gradient clipping can also be explained as follows. When the scaler st in Eq. (1) and Eq. (3) is very large, most gradients have a high probability of being ternarized to zeros, leaving only a few gradients at large-magnitude values. This scenario raises a severe parameter update pattern: most parameters keep unchanged while others likely overshoot. This introduces large training variance.
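The two techniques can be sketched together in a few lines, assuming NumPy (the helper names are ours, and c = 2.5 is the constant the paper cross-validates once): clip each layer's gradients to ±cσ per Eq. (21), then ternarize the layer with its own scaler.

```python
import numpy as np

def clip_gradient(g, c=2.5):
    """Gradient clipping, Eq. (21): limit each component to [-c*sigma, c*sigma],
    where sigma is the standard deviation of the gradients in g."""
    sigma = np.std(g)
    return np.clip(g, -c * sigma, c * sigma)

def layerwise_ternarize(layer_grads, rng, c=2.5):
    """Clip, then ternarize each layer with its own scaler ||g||_inf,
    instead of one global maximum over the whole network."""
    out = {}
    for name, g in layer_grads.items():
        g = clip_gradient(g, c)
        s = np.max(np.abs(g))                     # layer-wise scaler
        if s == 0:
            out[name] = np.zeros_like(g)
            continue
        b = rng.random(g.shape) < np.abs(g) / s   # Bernoulli keep mask
        out[name] = s * np.sign(g) * b
    return out
```

Clipping shrinks the layer scaler s, so fewer gradients are rounded to zero and the severe keep-nothing-or-overshoot update pattern described above is avoided.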
Our experiments on AlexNet show that by applying both the layer-wise ternarizing and gradient clipping techniques, TernGrad can converge to the same accuracy as standard SGD. Removing either of the two techniques can result in accuracy degradation, e.g., 3% top-1 accuracy loss without applying gradient clipping, as we shall show in Table 2.

4 Experiments
We first investigate the convergence of TernGrad under various training schemes on relatively small databases and show the results in Section 4.1. Then the scalability of TernGrad to large-scale distributed deep learning is explored and discussed in Section 4.2. The experiments are performed with TensorFlow [2]. We maintain an exponential moving average of parameters by employing an exponential decay of 0.9999 [15]. The accuracy is evaluated with the final averaged parameters, which gives slightly better accuracy in our experiments. For a fair comparison, in each pair of comparative experiments using either floating or ternary gradients, all the other training hyper-parameters are the same unless differences are explicitly pointed out. In the experiments, when SGD with momentum is adopted, a momentum value of 0.9 is used. When polynomial decay is applied to decay the learning rate (LR), a power of 0.5 is used to decay the LR from the base LR to zero.

4.1 Integrating with Various Training Schemes
We study the convergence of TernGrad using LeNet on MNIST and a ConvNet [35] (named CifarNet) on CIFAR-10. LeNet is trained without data augmentation. While training CifarNet, images

Figure 3: Accuracy vs. worker number for baseline and TernGrad, trained with (a) momentum SGD or (b) vanilla SGD.
In all experiments, the total mini-batch size is 64 and the maximum iteration is 10K.

Table 1: Results of TernGrad on CifarNet.

SGD    base LR   total mini-batch size   iterations   gradients   workers   accuracy
Adam   0.0002    128                     300K         floating    2         86.56%
                                                      TernGrad    2         85.64% (-0.92%)
Adam   0.0002    2048                    18.75K       floating    16        83.19%
                                                      TernGrad    16        82.80% (-0.39%)

are randomly cropped to 24 × 24 images and mirrored. Brightness and contrast are also randomly adjusted. During the testing of CifarNet, only the center crop is used. Our experiments cover the scope of SGD optimizers over vanilla SGD, SGD with momentum [36] and Adam [37].
Figure 3 shows the results of LeNet. All are trained using polynomial LR decay with a weight decay of 0.0005. The base learning rates of momentum SGD and vanilla SGD are 0.01 and 0.1, respectively. Given the total mini-batch size M and the worker number N, the mini-batch size per worker is M/N. Without explicit mention, mini-batch size refers to the total mini-batch size in this work. Figure 3 shows that TernGrad can converge to a similar accuracy within the same number of iterations, using momentum SGD or vanilla SGD. The maximum accuracy gain is 0.15% and the maximum accuracy loss is 0.22%. Very importantly, the communication time per iteration can be reduced. The figure also shows that TernGrad generalizes well to distributed training with large N. No degradation is observed even for N = 64, which corresponds to one training sample per iteration per worker.
Table 1 summarizes the results of CifarNet, where all trainings terminate after the same number of epochs. Adam SGD is used for training. Instead of keeping the total mini-batch size unchanged, we maintain the mini-batch size per worker. Therefore, the total mini-batch size increases linearly as the number of workers grows.
Though the base learning rate of 0.0002 seems small, it achieves better accuracy than larger ones like 0.001 for the baseline. In each pair of experiments, TernGrad converges to an accuracy level with less than 1% degradation. The accuracy degrades under a large mini-batch size in both the baseline and TernGrad. This is because parameters are updated less frequently and large-batch training tends to converge to poorer sharp minima [38]. However, the noise inherent in TernGrad can help convergence to better flat minimizers [38], which could explain the smaller accuracy gap between the baseline and TernGrad when the mini-batch size is 2048. In our experiments on AlexNet in Section 4.2, TernGrad even improves the accuracy in the large-batch scenario. This attribute is beneficial for distributed training, as a large mini-batch size is usually required.

4.2 Scaling to Large-scale Deep Learning
We also evaluate TernGrad with AlexNet and GoogLeNet trained on ImageNet. It is more challenging to apply TernGrad to large-scale DNNs. Simply replacing the floating gradients with ternary gradients while keeping the other hyper-parameters unchanged may result in some accuracy loss. However, we are able to train large-scale DNNs with TernGrad successfully after making some or all of the following changes: (1) decreasing the dropout ratio to keep more neurons; (2) using a smaller weight decay; and (3) disabling ternarizing in the last classification layer. Dropout regularizes DNNs by adding randomness, while TernGrad also introduces randomness. Thus, dropping fewer neurons helps avoid over-randomness. Similarly, as the randomness of TernGrad introduces regularization, a smaller weight decay may be adopted. We suggest not applying ternarizing to the last layer, considering that the one-hot encoding of labels generates a skewed distribution of gradients and the symmetric ternary encoding {−1, 0, 1} is not optimal for such a skewed distribution.
Though asymmetric ternary levels could be an option, we decide to stick to floating gradients in the last layer for simplicity. The overhead of communicating these floating gradients is small, as the last layer occupies only a small percentage of the total parameters, e.g., 6.7% in AlexNet and 3.99% in ResNet-152 [39].
All DNNs are trained by momentum SGD with Batch Normalization [40] on convolutional layers. AlexNet is trained with the hyper-parameters and data augmentation described in Caffe. GoogLeNet is trained with polynomial LR decay and the data augmentation in [41]. Our implementation of GoogLeNet does not utilize any auxiliary classifiers; that is, the loss from the last softmax layer is the total loss. More training hyper-parameters are reported in the corresponding tables and the published source code. Validation accuracy is evaluated using only the central crops of images.
The results of AlexNet are shown in Table 2. The mini-batch size per worker is fixed to 128. For fast development, all DNNs are trained through the same number of epochs of images. In this setting, when there are

Table 2: Accuracy comparison for AlexNet.

base LR   mini-batch size   workers   iterations   gradients           weight decay   DR†    top-1    top-5
0.01      256               2         370K         floating            0.0005         0.5    57.33%   80.56%
                                                   TernGrad            0.0005         0.2    57.61%   80.47%
                                                   TernGrad-noclip‡    0.0005         0.2    54.63%   78.16%
0.02      512               4         185K         floating            0.0005         0.5    57.32%   80.73%
                                                   TernGrad            0.0005         0.2    57.28%   80.23%
0.04      1024              8         92.5K        floating            0.0005         0.5    56.62%   80.28%
                                                   TernGrad            0.0005         0.2    57.54%   80.25%

† DR: dropout ratio, the ratio of dropped neurons. ‡ TernGrad without gradient clipping.

Table 3: Accuracy comparison for GoogLeNet.

base LR   mini-batch size   workers   iterations   gradients   weight decay   DR     top-5
0.04      128               2         600K         floating    4e-5           0.2    88.30%
                                                   TernGrad    1e-5           0.08   86.77%
0.08      256               4         300K         floating    4e-5           0.2    87.82%
                                                   TernGrad    1e-5           0.08   85.96%
0.10      512               8         300K         floating    4e-5           0.2    89.00%
                                                   TernGrad    2e-5           0.08   86.47%

more workers, the number of iterations becomes smaller and parameters are less frequently updated. To overcome this problem, we increase the learning rate for the large-batch scenario [10]. Using this scheme, SGD with floating gradients successfully trains AlexNet to similar accuracy for mini-batch sizes of 256 and 512. However, when the mini-batch size is 1024, the top-1 accuracy drops by 0.71% for the same reason as we point out in Section 4.1.
TernGrad converges to approximately the same accuracy levels regardless of mini-batch size. Notably, it improves the top-1 accuracy by 0.92% when the mini-batch size is 1024, because its inherent randomness encourages escaping from poorer sharp minima [34][38]. Figure 4 plots training details vs. iteration when the mini-batch size is 512. Figure 4(a) shows that the convergence curve of TernGrad matches well with the baseline's, demonstrating the effectiveness of TernGrad. The training efficiency can be further improved by reducing communication time, as shall be discussed in Section 5. The training data loss in Figure 4(b) shows that TernGrad converges to a slightly lower level, which further proves the capability of TernGrad to minimize the target function even with ternary gradients. The smaller dropout ratio in TernGrad can be another reason for the lower loss. Figure 4(c) simply illustrates that on average 71.32% of the gradients of a fully-connected layer (fc6) are ternarized to zeros.
Finally, we summarize the results of GoogLeNet in Table 3.
On average, the accuracy loss is less than 2%. In TernGrad, we adopted all the hyper-parameters (except the dropout ratio and weight decay) that were well tuned for the baseline [42]. Tuning these hyper-parameters specifically for TernGrad could further optimize it and obtain higher accuracy.

5 Performance Model and Discussion
Our proposed TernGrad requires only three numerical levels {−1, 0, 1}, which can aggressively reduce the communication time. Moreover, our experiments in Section 4 demonstrate that within the

Figure 4: AlexNet trained on 4 workers with mini-batch size 512: (a) top-1 validation accuracy, (b) training data loss and (c) sparsity of gradients in the first fully-connected layer (fc6) vs. iteration.

Figure 5: Training throughput on two different GPU clusters: (a) a 128-node GPU cluster with 1 Gbps Ethernet, where each node has 4 NVIDIA GTX 1080 GPUs and one PCI switch; (b) a 128-node GPU cluster with 100 Gbps InfiniBand network connections, where each node has 4 NVIDIA Tesla P100 GPUs connected via NVLink. The mini-batch size per GPU of AlexNet, GoogLeNet and VggNet-A is 128, 64 and 32, respectively.

same iterations, TernGrad can converge to approximately the same accuracy as its corresponding baseline. Consequently, a dramatic throughput improvement in distributed DNN training is expected. Due to resource and time constraints, unfortunately, we were not able to perform the training of more DNN models like VggNet-A [43] or distributed training beyond 8 workers. We plan to continue these experiments in our future work. We opt for using a performance model to conduct the scalability analysis of DNN models when utilizing up to 512 GPUs, with and without applying TernGrad.
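As a toy illustration only (the cost split and all numbers below are our assumptions, not the paper's profiled performance model), modeling a worker's iteration time as compute plus gradient traffic over bandwidth already shows why a 2-bit ternary encoding helps bandwidth-bound models most:

```python
# Toy cost model: t_iter = t_compute + pushed_bits / bandwidth.
# Ternary gradients are assumed to be encoded with 2 bits per value,
# versus 32-bit floats for the baseline.
def iteration_time(n_params, t_compute_s, bandwidth_bps, bits_per_value):
    traffic_bits = n_params * bits_per_value   # worker-to-server push
    return t_compute_s + traffic_bits / bandwidth_bps

# Hypothetical AlexNet-scale example: ~61M parameters, 0.1 s of compute
# per iteration, 1 Gbps Ethernet.
t_fp32 = iteration_time(61e6, 0.1, 1e9, 32)   # bandwidth-bound
t_tern = iteration_time(61e6, 0.1, 1e9, 2)
speedup = t_fp32 / t_tern
```

Under these assumed numbers the communication term dominates the FP32 case, so shrinking it by 16× yields a large end-to-end speedup; with a fast interconnect or a compute-heavy model the same reduction matters far less, matching the communication-to-computation-ratio trend discussed below.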
Three neural network models—AlexNet, GoogLeNet and VggNet-A—are investigated. In the following discussion, performance refers to training speed. Here, we extend the performance model that was initially developed for CPU-based deep learning systems [44] to estimate the performance of distributed GPUs/machines. The key idea is to combine lightweight profiling on a single machine with analytical modeling for accurate performance estimation. In the interest of space, please refer to the Supplementary Material for details of the performance model.
Figure 5 presents the training throughput on two different GPU clusters. Our results show that TernGrad effectively increases the training throughput for all three DNNs. The speedup depends on the communication-to-computation ratio of the DNN, the number of GPUs, and the communication bandwidth. DNNs with larger communication-to-computation ratios (e.g., AlexNet and VggNet-A) benefit more from TernGrad than those with smaller ratios (e.g., GoogLeNet). Even on a very high-end HPC system with InfiniBand and NVLink, TernGrad is still able to double the training speed of VggNet-A on 128 nodes, as shown in Figure 5(b). Moreover, TernGrad becomes more efficient as the bandwidth decreases: with the 1 Gbps Ethernet and PCI switch in Figure 5(a), TernGrad achieves a 3.04× training speedup for AlexNet on 8 GPUs.

Acknowledgments
This work was supported in part by NSF CCF-1744082 and DOE SC0017030. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF, DOE, or their contractors. We thank Ali Taylan Cemgil at Bogazici University for valuable suggestions on this work.

References

[1] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, and Andrew Y. Ng.
Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.

[2] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint:1603.04467, 2016.

[3] Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Andrew Ng. Deep learning with cots hpc systems. In International Conference on Machine Learning, pages 1337–1345, 2013.

[4] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.

[5] Trishul M Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. Project adam: Building an efficient and scalable deep learning training system. In OSDI, volume 14, pages 571–582, 2014.

[6] Eric P Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. Petuum: A new platform for distributed machine learning on big data. IEEE Transactions on Big Data, 1(2):49–67, 2015.

[7] Philipp Moritz, Robert Nishihara, Ion Stoica, and Michael I Jordan. Sparknet: Training deep networks in spark.
arXiv preprint:1511.06051, 2015.\n\n[8] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan\nZhang, and Zheng Zhang. Mxnet: A \ufb02exible and ef\ufb01cient machine learning library for heterogeneous\ndistributed systems. arXiv preprint:1512.01274, 2015.\n\n[9] Sixin Zhang, Anna E Choromanska, and Yann LeCun. Deep learning with elastic averaging sgd. In\n\nAdvances in Neural Information Processing Systems, pages 685\u2013693, 2015.\n\n[10] Mu Li. Scaling Distributed Machine Learning with System and Algorithm Co-design. PhD thesis, Carnegie\n\nMellon University, 2017.\n\n[11] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long,\nEugene J Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In\nOSDI, volume 14, pages 583\u2013598, 2014.\n\n[12] Mu Li, David G Andersen, Alexander J Smola, and Kai Yu. Communication ef\ufb01cient distributed machine\nlearning with the parameter server. In Advances in Neural Information Processing Systems, pages 19\u201327,\n2014.\n\n[13] Qirong Ho, James Cipar, Henggang Cui, Seunghak Lee, Jin Kyu Kim, Phillip B Gibbons, Garth A Gibson,\nGreg Ganger, and Eric P Xing. More effective distributed ml via a stale synchronous parallel parameter\nserver. In Advances in neural information processing systems, pages 1223\u20131231, 2013.\n\n[14] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. Parallelized stochastic gradient descent.\n\nIn Advances in neural information processing systems, pages 2595\u20132603, 2010.\n\n[15] Xinghao Pan, Jianmin Chen, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. Revisiting distributed\n\nsynchronous sgd. arXiv preprint:1702.05800, 2017.\n\n[16] Wei Zhang, Suyog Gupta, Xiangru Lian, and Ji Liu. Staleness-aware async-sgd for distributed deep\nlearning. 
In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI'16, pages 2350–2356. AAAI Press, 2016. ISBN 978-1-57735-770-4. URL http://dl.acm.org/citation.cfm?id=3060832.3060950.

[17] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.

[18] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.

[19] J Park, S Li, W Wen, PTP Tang, H Li, Y Chen, and P Dubey. Faster cnns with direct sparse convolutions and guided pruning. In International Conference on Learning Representations (ICLR), 2017.

[20] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems, pages 4107–4115, 2016.

[21] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.

[22] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.

[23] Wei Wen, Yuxiong He, Samyam Rajbhandari, Wenhan Wang, Fang Liu, Bin Hu, Yiran Chen, and Hai Li. Learning intrinsic sparse structures within long short-term memory. arXiv:1709.05027, 2017.

[24] Joachim Ott, Zhouhan Lin, Ying Zhang, Shih-Chii Liu, and Yoshua Bengio. Recurrent neural networks with limited numerical precision. arXiv:1608.06902, 2016.

[25] Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural networks with few multiplications.
arXiv:1510.03009, 2015.\n\n[26] Joseph K Bradley, Aapo Kyrola, Danny Bickson, and Carlos Guestrin. Parallel coordinate descent for\n\nl1-regularized loss minimization. arXiv preprint arXiv:1105.5379, 2011.\n\n[27] Alham Fikri Aji and Kenneth Hea\ufb01eld. Sparse communication for distributed gradient descent. arXiv\n\npreprint:1704.05021, 2017.\n\n[28] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its\napplication to data-parallel distributed training of speech dnns. In Interspeech, pages 1058\u20131062, 2014.\n\n[29] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. Qsgd: Communication-\nef\ufb01cient sgd via gradient quantization and encoding. In Advances in Neural Information Processing\nSystems, pages 1707\u20131718, 2017.\n\n[30] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited\n\nnumerical precision. In ICML, pages 1737\u20131746, 2015.\n\n[31] Rahul Garg and Rohit Khandekar. Gradient descent with sparsi\ufb01cation: an iterative algorithm for sparse\nrecovery with restricted isometry property. In Proceedings of the 26th Annual International Conference on\nMachine Learning, pages 337\u2013344. ACM, 2009.\n\n[32] Ananda Theertha Suresh, Felix X Yu, H Brendan McMahan, and Sanjiv Kumar. Distributed mean\n\nestimation with limited communication. arXiv:1611.00429, 2016.\n\n[33] L\u00e9on Bottou. Online learning and stochastic approximations. On-line learning in neural networks, 17(9):\n\n142, 1998.\n\n[34] Arvind Neelakantan, Luke Vilnis, Quoc V Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James\nMartens. Adding gradient noise improves learning for very deep networks. arXiv preprint:1511.06807,\n2015.\n\n[35] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classi\ufb01cation with deep convolutional\n\nneural networks. In Advances in Neural Information Processing Systems, pages 1097\u20131105. 
2012.\n\n[36] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):\n\n145\u2013151, 1999.\n\n[37] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint:1412.6980,\n\n2014.\n\n[38] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter\nTang. On large-batch training for deep learning: Generalization gap and sharp minima. In International\nConference on Learning Representations, 2017.\n\n[39] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.\nIn Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770\u2013778,\n2016.\n\n[40] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing\n\ninternal covariate shift. arXiv preprint:1502.03167, 2015.\n\n[41] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the\ninception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision\nand Pattern Recognition, pages 2818\u20132826, 2016.\n\n[42] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Du-\nmitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv\npreprint:1409.4842, 2015.\n\n[43] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recogni-\n\ntion. arXiv preprint:1409.1556, 2014.\n\n[44] Feng Yan, Olatunji Ruwase, Yuxiong He, and Trishul M. Chilimbi. Performance modeling and scalability\noptimization of distributed deep learning systems. In Proceedings of the 21th ACM SIGKDD International\nConference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, pages\n1355\u20131364, 2015. doi: 10.1145/2783258.2783270. 
URL http://doi.acm.org/10.1145/2783258.2783270.