{"title": "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism", "book": "Advances in Neural Information Processing Systems", "page_first": 103, "page_last": 112, "abstract": "Scaling up deep neural network capacity has been known as an effective approach to improving model quality for several different machine learning tasks. In many cases, increasing model capacity beyond the memory limit of a single accelerator has required developing special algorithms or infrastructure. These solutions are often architecture-specific and do not transfer to other machine learning tasks. To address the need for efficient and task-independent model parallelism, we introduce GPipe, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers. By pipelining different sub-sequences of layers on separate accelerators, GPipe provides the flexibility of scaling a variety of different networks to gigantic sizes efficiently. Moreover, GPipe utilizes a novel batch-splitting pipelining algorithm, resulting in almost linear speedup when a model is partitioned across multiple accelerators. We demonstrate the advantages of GPipe by training large-scale neural networks on two different tasks with distinct network architectures: (i) Image Classification: We train a 557-million-parameter AmoebaNet model and attain a top-1 accuracy of 84.4% on ImageNet-2012, (ii) Multilingual Neural Machine Translation: We train a single 6-billion-parameter, 128-layer Transformer model on a corpus spanning over 100 languages and achieve better quality than all bilingual models.", "full_text": "GPipe: Ef\ufb01cient Training of Giant Neural Networks\n\nusing Pipeline Parallelism\n\nYanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen,\n\nHyoukJoong Lee, Jiquan Ngiam, Quoc V. 
Le, Yonghui Wu, Zhifeng Chen\n\n{huangyp,ylc,ankurbpn,orhanf,miachen,dehao\n\nhyouklee,jngiam,qvl,yonghui,zhifengc}\n\n@google.com\n\nAbstract\n\nScaling up deep neural network capacity has been known as an effective approach\nto improving model quality for several different machine learning tasks. In many\ncases, increasing model capacity beyond the memory limit of a single accelera-\ntor has required developing special algorithms or infrastructure. These solutions\nare often architecture-speci\ufb01c and do not transfer to other tasks. To address the\nneed for ef\ufb01cient and task-independent model parallelism, we introduce GPipe, a\npipeline parallelism library that allows scaling any network that can be expressed\nas a sequence of layers. By pipelining different sub-sequences of layers on sep-\narate accelerators, GPipe provides the \ufb02exibility of scaling a variety of different\nnetworks to gigantic sizes ef\ufb01ciently. Moreover, GPipe utilizes a novel batch-\nsplitting pipelining algorithm, resulting in almost linear speedup when a model\nis partitioned across multiple accelerators. We demonstrate the advantages of\nGPipe by training large-scale neural networks on two different tasks with distinct\nnetwork architectures: (i) Image Classi\ufb01cation: We train a 557-million-parameter\nAmoebaNet model and attain a top-1 accuracy of 84.4% on ImageNet-2012, (ii)\nMultilingual Neural Machine Translation: We train a single 6-billion-parameter,\n128-layer Transformer model on a corpus spanning over 100 languages and achieve\nbetter quality than all bilingual models.\n\n1\n\nIntroduction\n\nDeep learning has seen great progress over the last decade, partially thanks to the development of\nmethods that have facilitated scaling the effective capacity of neural networks. This trend has been\nmost visible for image classi\ufb01cation, as demonstrated by the accuracy improvements on ImageNet\nwith the increase in model capacity (Figure 1a). 
A similar phenomenon can also be observed in\nthe context of natural language processing (Figure 1b) where simple shallow models of sentence\nrepresentations [1, 2] are outperformed by their deeper and larger counterparts [3, 4].\nWhile larger models have brought remarkable quality improvements to several \ufb01elds, scaling neural\nnetworks introduces signi\ufb01cant practical challenges. Hardware constraints, including memory\nlimitations and communication bandwidths on accelerators (GPU or TPU), force users to divide larger\nmodels into partitions and to assign different partitions to different accelerators. However, ef\ufb01cient\nmodel parallelism algorithms are extremely hard to design and implement, which often requires the\npractitioner to make dif\ufb01cult choices among scaling capacity, \ufb02exibility (or speci\ufb01city to particular\ntasks and architectures) and training ef\ufb01ciency. As a result, most ef\ufb01cient model-parallel algorithms\nare architecture and task-speci\ufb01c. With the growing number of applications of deep learning, there is\nan ever-increasing demand for reliable and \ufb02exible infrastructure that allows researchers to easily\nscale neural networks for a large variety of machine learning tasks.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: (a) Strong correlation between top-1 accuracy on ImageNet 2012 validation dataset [5]\nand model size for representative state-of-the-art image classi\ufb01cation models in recent years [6, 7, 8,\n9, 10, 11, 12]. There has been a 36\u00d7 increase in the model capacity. Red dot depicts 84.4% top-1\naccuracy for the 550M parameter AmoebaNet model. (b) Average improvement in translation quality\n(BLEU) compared against bilingual baselines on our massively multilingual in-house corpus, with\nincreasing model size. 
Each point, T (L, H, A), depicts the performance of a Transformer with L\nencoder and L decoder layers, a feed-forward hidden dimension of H and A attention heads. Red dot\ndepicts the performance of a 128-layer 6B parameter Transformer.\n\nTo address these challenges, we introduce GPipe, a \ufb02exible library that enables ef\ufb01cient training of\nlarge neural networks. GPipe allows scaling arbitrary deep neural network architectures beyond the\nmemory limitations of a single accelerator by partitioning the model across different accelerators and\nsupporting re-materialization on every accelerator [13, 14]. With GPipe, each model can be speci\ufb01ed\nas a sequence of layers, and consecutive groups of layers can be partitioned into cells. Each cell is\nthen placed on a separate accelerator. Based on this partitioned setup, we propose a novel pipeline\nparallelism algorithm with batch splitting. We \ufb01rst split a mini-batch of training examples into\nsmaller micro-batches, then pipeline the execution of each set of micro-batches over cells. We apply\nsynchronous mini-batch gradient descent for training, where gradients are accumulated across all\nmicro-batches in a mini-batch and applied at the end of a mini-batch. Consequently, gradient updates\nusing GPipe are consistent regardless of the number of partitions, allowing researchers to easily train\nincreasingly large models by deploying more accelerators. GPipe can also be complemented with\ndata parallelism to further scale training.\nWe demonstrate the \ufb02exibility and ef\ufb01ciency of GPipe on image classi\ufb01cation and machine translation.\nFor image classi\ufb01cation, we train the AmoebaNet model on 480\u00d7 480 input from the ImageNet 2012\ndataset. By increasing the model width, we scale up the number of parameters to 557 million and\nachieve a top-1 validation accuracy of 84.4%. 
On machine translation, we train a single 128-layer\n6-billion-parameter multilingual Transformer model on 103 languages (102 languages to English).\nWe show that this model is capable of outperforming the individually trained 350-million-parameter\nbilingual Transformer Big [15] models on all 102 language pairs.\n\n2 The GPipe Library\n\nWe now describe the interface and the main design features of GPipe. This open-source library is\nimplemented under the Lingvo [16] framework. The core design features of GPipe are generally\napplicable and can be implemented for other frameworks [17, 18, 19].\n\n2.1 Interface\n\nAny deep neural network can be de\ufb01ned as a sequence of L layers. Each layer L_i is composed of\na forward computation function f_i, and a corresponding set of parameters w_i. GPipe additionally\nallows the user to specify an optional computation cost estimation function, c_i. With a given number\nof partitions K, the sequence of L layers can be partitioned into K composite layers, or cells. Let p_k\nconsist of consecutive layers between layers i and j. The set of parameters corresponding to p_k is\nequivalent to the union of w_i, w_{i+1}, ..., w_j, and its forward function would be\nF_k = f_j \u25e6 ... \u25e6 f_{i+1} \u25e6 f_i.\nThe corresponding back-propagation function B_k can be computed from F_k using automatic symbolic\ndifferentiation. The cost estimator, C_k, is set to \u03a3_{l=i}^{j} c_l.\nThe GPipe interface is extremely simple and intuitive, requiring the user to specify: (i) the number of\nmodel partitions K, (ii) the number of micro-batches M, and (iii) the sequence and de\ufb01nitions of L\nlayers that de\ufb01ne the model. Please refer to supplementary material for examples.\n\nFigure 2: (a) An example neural network with sequential layers is partitioned across four accelerators.\nF_k is the composite forward computation function of the k-th cell. B_k is the back-propagation\nfunction, which depends on both B_{k+1} from the upper layer and F_k. (b) The naive model parallelism\nstrategy leads to severe under-utilization due to the sequential dependency of the network. (c) Pipeline\nparallelism divides the input mini-batch into smaller micro-batches, enabling different accelerators to\nwork on different micro-batches simultaneously. Gradients are applied synchronously at the end.\n\n2.2 Algorithm\n\nOnce the user de\ufb01nes the sequence of layers in their network in terms of model parameters w_i, forward\ncomputation function f_i, and the cost estimation function c_i, GPipe partitions the network into K\ncells and places the k-th cell on the k-th accelerator. Communication primitives are automatically\ninserted at partition boundaries to allow data transfer between neighboring partitions. The partitioning\nalgorithm minimizes the variance in the estimated costs of all cells in order to maximize the ef\ufb01ciency\nof the pipeline by syncing the computation time across all partitions.\nDuring the forward pass, GPipe \ufb01rst divides every mini-batch of size N into M equal micro-batches,\nwhich are pipelined through the K accelerators. During the backward pass, gradients for each\nmicro-batch are computed based on the same model parameters used for the forward pass. At the end\nof each mini-batch, gradients from all M micro-batches are accumulated and applied to update the\nmodel parameters across all accelerators. This sequence of operations is illustrated in Figure 2c.\nIf batch normalization [20] is used in the network, the suf\ufb01cient statistics of inputs during training\nare computed over each micro-batch and over replicas if necessary [21]. 
We also track the moving\naverage of the suf\ufb01cient statistics over the entire mini-batch to be used during evaluation.\n\n2.3 Performance Optimization\n\nIn order to reduce activation memory requirements, GPipe supports re-materialization [14]. During\nforward computation, each accelerator only stores output activations at the partition boundaries.\nDuring the backward pass, the k-th accelerator recomputes the composite forward function F_k. As a\nconsequence, peak activation memory requirement is reduced to O(N + (L/K) \u00d7 (N/M)), where N/M is the\nmicro-batch size and L/K is the number of layers per partition. In comparison, memory requirement\nwithout re-materialization and partitioning would be O(N \u00d7 L), since computing the gradients b_i\nrequires both the upper layer gradients b_{i+1} and the cached activations f_i(x).\n\nTable 1: Maximum model size of AmoebaNet supported by GPipe under different scenarios. Naive-1\nrefers to the sequential version without GPipe. Pipeline-k means k partitions with GPipe on k\naccelerators. AmoebaNet-D (L, D): AmoebaNet model with L normal cell layers and \ufb01lter size D.\nTransformer-L: Transformer model with L layers, 2048 model and 8192 hidden dimensions. 
Each\nmodel parameter needs 12 bytes since we applied RMSProp during training.\n\nNVIDIA GPUs (8GB each) | Naive-1 | Pipeline-1 | Pipeline-2 | Pipeline-4 | Pipeline-8\nAmoebaNet-D (L, D) | (18, 208) | (18, 416) | (18, 544) | (36, 544) | (72, 512)\n# of Model Parameters | 82M | 318M | 542M | 1.05B | 1.8B\nTotal Model Parameter Memory | 1.05GB | 3.8GB | 6.45GB | 12.53GB | 24.62GB\nPeak Activation Memory | 6.26GB | 3.46GB | 8.11GB | 15.21GB | 26.24GB\n\nCloud TPUv3 (16GB each) | Naive-1 | Pipeline-1 | Pipeline-8 | Pipeline-32 | Pipeline-128\nTransformer-L | 3 | 13 | 103 | 415 | 1663\n# of Model Parameters | 282.2M | 785.8M | 5.3B | 21.0B | 83.9B\nTotal Model Parameter Memory | 11.7G | 8.8G | 59.5G | 235.1G | 937.9G\nPeak Activation Memory | 3.15G | 6.4G | 50.9G | 199.9G | 796.1G\n\nAs illustrated in Figure 2c, partitioning introduces some idle time per accelerator, which we refer to\nas the bubble overhead. This bubble time is O((K \u2212 1)/(M + K \u2212 1)) amortized over the number of micro-steps\nM. In our experiments, we found the bubble overhead to be negligible when M \u2265 4 \u00d7 K. This\nis also partly because re-computation during the backward pass can be scheduled earlier, without\nwaiting for the gradients from earlier layers.\nGPipe also introduces low communication overhead, given that we only need to pass activation\ntensors at the partition boundaries between accelerators. Therefore, we can achieve ef\ufb01cient scaling\nperformance even on accelerators without high-speed interconnects.\nFigure 2c assumes partitions are evenly balanced. However, memory requirements and computation\n\ufb02ops at different layers are often quite imbalanced. In such scenarios, imperfect partitioning\nalgorithms might lead to load imbalance. 
Better partitioning algorithms can potentially improve the\nperformance over our heuristic approach.\n\n3 Performance Analyses\n\nWe evaluate GPipe performance with two very different types of model architectures: an Amoe-\nbaNet [12] convolutional model and a Transformer [15] sequence-to-sequence model. We ran\nexperiments to study their scalability, ef\ufb01ciency and communication cost.\nWe expect both re-materialization and pipeline parallelism to bene\ufb01t memory utilization and thus\nmake \ufb01tting giant models feasible. We report the biggest model size GPipe can support under\nreasonably large input size in Table 1. For AmoebaNet, we ran the experiments on Cloud TPUv2s\nwith 8GB memory per accelerator. We used a \ufb01xed input image size of 224 \u00d7 224 and mini-batch\nsize of 128. Without GPipe, a single accelerator can train up to an 82M-parameter AmoebaNet,\nconstrained by device memory limits. Owing to re-materialization in back-propagation and batch\nsplitting, GPipe reduces the intermediate activation memory requirements from 6.26GB to 3.46GB,\nenabling a 318M-parameter model on a single accelerator. With model parallelism, we were able to\nscale AmoebaNet to 1.8 billion parameters on 8 accelerators, 25x more than what is possible without\nGPipe. In this case, the maximum model size did not scale perfectly linearly due to the imbalanced\ndistribution of model parameters over different layers in AmoebaNet.\nWe next trained Transformer models using Cloud TPUv3s with 16GB memory per accelerator core.\nWe used a \ufb01xed vocabulary size of 32k, sequence length 1024 and batch size 32. Each Transformer\nlayer has 2048 for model dimension, 8192 for feed-forward hidden dimension and 32 attention heads.\nWe scaled the model by varying the number of layers. Re-materialization allows training a 2.7\u00d7\nlarger model on a single accelerator. 
With 128 partitions, GPipe allows scaling Transformer up to\n83.9B parameters, a 298\u00d7 increase over what is possible on a single accelerator. Different from\nAmoebaNet, the maximum model size scales linearly with the number of accelerators for Transformer,\nsince each layer has the same number of parameters and input sizes.\n\nTable 2: Normalized training throughput using GPipe with different # of partitions K and different\n# of micro-batches M on TPUs. Performance increases with more micro-batches. There is an\nalmost linear speedup with the number of accelerators for Transformer model when M \u226b K. Batch\nsize was adjusted to \ufb01t memory if necessary.\n\nTPU | AmoebaNet K = 2 | AmoebaNet K = 4 | AmoebaNet K = 8 | Transformer K = 2 | Transformer K = 4 | Transformer K = 8\nM = 1 | 1 | 1.13 | 1.38 | 1 | 1.07 | 1.3\nM = 4 | 1.07 | 1.26 | 1.72 | 1.7 | 3.2 | 4.8\nM = 32 | 1.21 | 1.84 | 3.48 | 1.8 | 3.4 | 6.3\n\nTo evaluate ef\ufb01ciency, we report the normalized training throughput of AmoebaNet-D (18, 256)\nand Transformer-48 using GPipe with different numbers of partitions and different numbers of\nmicro-batches in Table 2. Each partition is assigned to a separate accelerator. We observe\nthat when the number of micro-batches M is at least 4\u00d7 the number of partitions, the bubble\noverhead is almost negligible. For Transformer model, there is a 3.5\u00d7 speedup when it is\npartitioned across four times more accelerators. Furthermore, training throughput scales almost\nlinearly with the number of devices, thanks to the computation being evenly distributed across\nTransformer layers. In contrast, the AmoebaNet\nmodel achieves sub-linear speedup due to its imbalanced computation distribution. When M is\nrelatively small, the bubble overhead is no longer negligible. When M is 1, there is effectively\nno pipeline parallelism. We observe relatively constant throughput regardless of the number of\naccelerators used, indicating only one device is actively computing at any given time.\nTo measure the effect of communication overhead with GPipe, we ran our experiments on a single\nhost with multiple NVIDIA P100 GPUs but without NVLinks. Data transfer across GPUs then has to\ninvolve the relatively slow device-to-host and host-to-device transfers through PCI-E. The number of\nmicro-batches was \ufb01xed at 32. As shown in Table 3, we observe 2.7\u00d7 speedup for AmoebaNet-D\n(18, 128) when we increase the number of partitions from 2 to 8. For the 24-layer Transformer,\nthe speedup is 3.3\u00d7. This is similar to the linear speedup we observe on TPUs, which are\nequipped with high-speed interconnects. The communication bandwidth between devices is\nno longer a bottleneck for model parallelism since GPipe only transfers activation tensors at\nthe boundaries of partitions.\n\nTable 3: Normalized training throughput using GPipe on GPUs without high-speed interconnect.\n\nGPU | AmoebaNet K = 2 | AmoebaNet K = 4 | AmoebaNet K = 8 | Transformer K = 2 | Transformer K = 4 | Transformer K = 8\nM = 32 | 1 | 1.7 | 2.7 | 1 | 1.8 | 3.3\n\n3.1 Performance Overhead Breakdown\n\nTable 4: Time step breakdown\n\nTo study opportunities for future performance improvements, we identi\ufb01ed the key factors\nthat affect the performance of GPipe on Cloud TPUs. We measured the time spent on different\nactivities listed in Table 4. We found that re-computation time was the main contributor\nto GPipe overhead, taking up to 23% of the total step time. Another source of overhead\nwas load imbalance. With two partitions, overhead caused by load imbalance was only 3.2%.\nThe theoretical bubble overhead is O((K \u2212 1)/(M + K \u2212 1)),\nwhere K is the number of partitions and M\nis the number of micro-batches in each mini-batch. 
The observed bubble overhead was\nslightly lower than the theoretical value partly\nbecause re-computation was scheduled early to overlap with the bubble. Weight update time for\ngradient aggregation at the end of pipeline was also small, thanks to high-speed interconnections\nbetween the accelerators.\n\n4 Image Classi\ufb01cation\n\nTable 5: Image classi\ufb01cation accuracy using AmoebaNet-B (18, 512) \ufb01rst trained on ImageNet 2012\nthen \ufb01ne-tuned on others. Please refer to the supplementary material for a detailed description of our\ntraining setup. Our \ufb01ne-tuned results were averaged across 5 \ufb01ne-tuning runs. Baseline results from\nReal et al. [12] and Cubuk et al. [26] were directly trained from scratch. *Mahajan et al.\u2019s model [27]\nachieved 85.4% top-1 accuracy but it was pretrained on non-public Instagram data. Ngiam et al. [28]\nachieved better results by pre-training with data from a private dataset (JFT-300M).\n\nDataset | # Train | # Test | # Classes | Accuracy (%) | Previous Best (%)\nImageNet-2012 | 1,281,167 | 50,000 | 1000 | 84.4 | 83.9 [12] (85.4\u2217 [27])\nCIFAR-10 | 50,000 | 10,000 | 10 | 99.0 | 98.5 [26]\nCIFAR-100 | 50,000 | 10,000 | 100 | 91.3 | 89.3 [26]\nStanford Cars | 8,144 | 8,041 | 196 | 94.6 | 94.8\u2217 [26]\nOxford Pets | 3,680 | 3,369 | 37 | 95.9 | 93.8\u2217 [29]\nFood-101 | 75,750 | 25,250 | 101 | 93.0 | 90.4\u2217 [30]\nFGVC Aircraft | 6,667 | 3,333 | 100 | 92.7 | 92.9\u2217 [31]\nBirdsnap | 47,386 | 2,443 | 500 | 83.6 | 80.2\u2217 [32]\n\nAs a proof of concept, we \ufb01rst used GPipe to scale AmoebaNet. We increased the number of channels\nin an AmoebaNet and scaled the input image size to 480\u00d7480. We trained this 557-million-parameter\nAmoebaNet-B (18, 512) on the ImageNet 2012 dataset, using the same hyper-parameters as described\nin [12]. The network was divided into 4 partitions. 
This single model achieves 84.4% top-1 and 97%\ntop-5 validation accuracy with single-crop.\nWe further demonstrate the effectiveness of giant convolution networks on other image datasets\nthrough transfer learning [22, 23]. Speci\ufb01cally, we used the pre-trained ImageNet model to \ufb01ne-tune\non a variety of target datasets ranging from general to \ufb01ne-grained classi\ufb01cation. We changed the\nnumber of output units in the last softmax classi\ufb01cation layer to the number of classes in the target\ndataset and initialized the new softmax layer randomly. All the other layers were initialized from\nImageNet pre-training. Input images to the network during training were resized to 480 \u00d7 480,\nhorizontally \ufb02ipped randomly and augmented using cutout [24]. Training hyper-parameters were\nthe same as those used for ImageNet (a detailed description of our training setup is provided in\nsupplementary material). In Table 5, we report the average single-crop test accuracy over 5 \ufb01ne-tuning\nruns for each dataset. Our giant models obtain competitive results on all target datasets. For example,\nCIFAR-10 error rate is reduced to 1% and CIFAR-100 error rate to 8.7%. These results corroborate\nthe \ufb01ndings by Kornblith et al. [25], i.e., better ImageNet models transfer better.\n\n5 Massive Massively Multilingual Machine Translation\n\nNext, we demonstrate the \ufb02exibility of GPipe by scaling up models used for Natural Language\nProcessing (NLP). Due to an abundance of available parallel corpora, neural machine translation\n(NMT) has become a benchmark task for any architecture used for NLP [33, 15, 34, 35, 36]. For\nthis reason, we continue our GPipe experiments on a large-scale multilingual NMT task. We use a\ncorpus of parallel documents over 102 languages and English, containing a total of 25 billion training\nexamples, ranging from 10^4 to 10^9 per language [37]. 
This dataset creates a realistic test bed for\nexperiments on scalability by spanning a diverse set of languages from data-scarce (low-resource) to\ndata-rich (high-resource). For the \ufb01rst time in machine translation, we show that a large enough NMT\nmodel can learn the mapping between more than 100 language pairs simultaneously, while achieving\nbetter than bilingual model performance for all languages. This further brings out the importance of\nhaving ef\ufb01cient and \ufb02exible model-parallelism tools.\nOur comparison is based on the performance of a single Transformer [15] trained on all language\npairs in this corpus. We scale the architecture along two dimensions to stress the \ufb02exibility of GPipe:\n(i) along the depth by increasing the number of layers in the model and (ii) along the width by\nincreasing the hidden dimension in the feed-forward layers and the number of attention heads (as well\nas # attention channels) in multi-head attention layers similar to Shazeer et al. [34]. Please refer to\nthe supplementary material for a detailed description of our dataset, baselines, training con\ufb01guration\nand optimization hyper-parameters.\nWe start with a standard 400M-parameter Transformer Big model, T (6, 8192, 16)1, as described in\nChen et al. [35], with a vocabulary size of 64k. In Figure 3, we compare its performance against a\n\n1T (L, H, A) is a Transformer model with L encoder layers and L decoder layers, a feed-forward hidden\n\ndimension of H and A attention heads. The model dimension is \ufb01xed to 1024.\n\n6\n\n\f1.3B-parameter deep model, T (24, 8192, 16), a 1.3B-parameter wide model, T (12, 16384, 32), a 3B-\nparameter model, T (32, 16384, 32) and a 6B-parameter model, T (64, 16384, 32). All of the models\nare trained on all language pairs simultaneously, using temperature-based sampling as employed for\nmultilingual BERT2 [3]. 
T (12, 16384, 32), T (24, 8192, 16), T (32, 16384, 32) and T (64, 16384, 32)\nare partitioned over 2, 4, 8 and 16 accelerators respectively.\nFrom Figure 3, we can observe that increasing the model capacity from 400M to 1.3B parameters\nsigni\ufb01cantly improves performance across all languages. Scaling up the model from 1.3B parameters\nto 6B parameters shows further improvement, especially for high-resource languages. Below we\ndiscuss some of our empirical \ufb01ndings based on these large-scale experiments.\n\nFigure 3: Translation quality across all languages with increasing multilingual model capacity.\nLanguages are arranged in the order of decreasing training dataset size from left to right. T (L, H, A),\ndepicts the performance of a Transformer with L encoder and L decoder layers, a feed-forward hidden\ndimension of H and A attention heads. We notice that increasing the model capacity, from 400M\nparams (T (6, 8192, 16)) to 1.3B (T (24, 8192, 16)), and further, to 6B (T (64, 16384, 32)), leads to\nsigni\ufb01cant quality improvements across all languages. We also notice huge quality improvements\nfor low-resource languages (right side of the plot), when compared against bilingual baselines,\nhighlighting the signi\ufb01cant transfer gains resulting from training a multilingual model.\n\nDepth-Width Trade-off: We study the trade-off between depth and width in our multilingual\nsetup and compare the performance of 1.3B wide model T (12, 16384, 32) and 1.3B deep model\nT (24, 8192, 16). While the quality of these two models on high-resource languages (left of Figure 3)\nis very similar, the deeper model outperforms by huge margins on low-resource languages, suggesting\nthat increasing model depth might be better for generalization. 
Further, the quality improvements for\nlow-resource languages (right side of Figure 3), when comparing the 1.3B deep model against the\n400M model, are almost as large as the improvements for high-resource languages, indicating that\nincreasing depth might potentially increase the extent of transfer to low-resource tasks.\nTrainability Challenges with Deep Models: Although depth increases the representational capacity\nof neural networks, it also complicates the optimization problem. In our large-scale experiments,\nwe encountered severe trainability issues arising from a combination of sharp activations (positive\nkurtosis) and dataset noise. We observed that after training for a few thousand steps, the model\npredictions would become extremely peaky and vulnerable to noise, which frequently resulted\nin non-\ufb01nite or large gradients that eventually destroyed the learning progress. To counter these\nproblems, we apply two methods: (i) Following Zhang et al. [38], we scale down the initialization\nof all transformer feed-forward layers by the number of layers. (ii) We clip the logit predictions\n(softmax pre-activations) whenever their magnitude exceeds a certain value. A combination of these\ntwo approaches allows us to mitigate the training instability posed by scaling model depth.\n\n6 Design Features and Trade-Offs\n\nSeveral approaches have been proposed to enable ef\ufb01cient large-scale model parallelism. However,\neach approach chooses its own set of trade-offs, making it suitable for scaling speci\ufb01c architectures\n\n2https://github.com/google-research/bert/blob/master/multilingual.md\n\n7\n\n\funder particular hardware constraints. The core idea of model parallelism involves partitioning a\nnetwork into different computational units, which are then placed on different devices [39, 40, 41, 42].\nConceptually this supports scaling a large spectrum of models to huge capacities. 
However these\napproaches typically suffer from low hardware utilization and communication bottlenecks. Single\nProgram Multiple Data (SPMD) and pipeline parallelism have been proposed as solutions to counter\nthese challenges.\nMesh-Tensor\ufb02ow [34] follows the SPMD paradigm, which extends the Single Instruction Multiple\nData (SIMD) approach used for data parallelism to other tensor dimensions. SPMD allows splitting\nevery computation across multiple devices, allowing the user to scale the size of individual matrix\nmultiplications (and thus, the model parameters of individual layers) linearly with the number of\naccelerators. However, this also introduces high communication overhead between the accelerators\ndue to an abundance of AllReduce-like operations used to combine the outputs of each parallelized\nmatrix multiplication. This limits the applicability of the approach to scenarios where accelerators\nare connected with high speed interconnects. Further, SPMD limits the type of operations that can be\nef\ufb01ciently scaled, restricting its use to a speci\ufb01c set of network architectures and machine learning\ntasks. For example, splitting along the channel dimension of convolution layers under this paradigm\nis not ef\ufb01cient given that channels are effectively fully connected, whereas splitting along the spatial\ndimension requires sophisticated techniques for the halo regions. While SPMD allows scaling the\nmodel depth by making each operation smaller, it requires splitting each layer over a larger number\nof accelerators, which in turn further increases the communication overhead across devices.\nOther approaches have attempted to utilize pipeline-parallelism-based approaches to scale neural\nnetworks [43, 44]. The most recent iteration of pipeline parallelism applied to neural network\ntraining is PipeDream [45], which targets reducing the communication overhead for parameter\nservers [46]. 
PipeDream pipelines the execution of forward passes and intersperses them with\nbackward passes in an attempt to maximize hardware utilization. This design suffers from weight\nstaleness introduced by asynchronous backward updates. To avoid optimization issues stemming\nfrom the weight staleness, PipeDream requires maintaining multiple versioned copies of the model\nparameters on each accelerator in order to compute the gradient updates accurately, preventing users\nfrom scaling to bigger models.\nGPipe introduces a new brand of pipeline parallelism that pipelines the execution of micro-batches\nbefore applying a single synchronous gradient update for the entire mini-batch. Our novel batch-\nsplitting pipeline parallelism algorithm, when combined with re-materialization, allows scaling\nto a large number of micro-batches. This minimizes the bubble overhead without the need for\nasynchronous gradient updates. GPipe enables the user to scale model size linearly with the number\nof accelerators used. Unlike SPMD, pipeline parallelism introduces little additional communication\noverhead when scaling the model. Inter-device communication only takes place at partition boundaries\nfor every micro-batch and the introduced communication overhead is marginal, extending the utility\nof GPipe to situations where high-speed device interconnects are not available. However, GPipe\ncurrently assumes that a single layer \ufb01ts within the memory requirements of a single accelerator3.\nAdditionally, micro-batch splitting requires complicated strategies to support layers that require\ncomputations across the batch (for example, BatchNorm uses statistics over the micro-batch during\ntraining, but accumulates mini-batch statistics for evaluation).\n\n7 Conclusion\n\nIn this work, we introduce GPipe, a scalable model-parallelism library for training giant networks. 
We\npropose a novel batch-splitting pipeline-parallelism algorithm that uses synchronous gradient updates,\nallowing model parallelism with high hardware utilization and training stability. We leverage GPipe to\ntrain large-scale convolutional and transformer-based models and demonstrate strong empirical results\non both image classification and multilingual machine translation. We highlight three key attributes\nof GPipe: 1) Efficiency: Using a novel batch-splitting pipelining algorithm, GPipe achieves almost\nlinear speedup with the number of devices. 2) Flexibility: GPipe supports any sequential neural\nnetwork. 3) Reliability: GPipe utilizes synchronous gradient descent and guarantees consistent\ntraining regardless of the number of partitions.\n\n3 One possible way around this limitation is splitting a single matrix multiplication into smaller ones and\nspreading them sequentially across multiple layers.\n\nReferences\n\n[1] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. CoRR, abs/1708.00107, 2017.\n\n[2] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In ACL, 2018.\n\n[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.\n\n[4] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.\n\n[5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR. IEEE, 2009.\n\n[6] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, et al. Going deeper with convolutions.
In CVPR, pages 1\u20139, 2015.\n\n[7] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, pages 2818\u20132826, 2016.\n\n[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630\u2013645. Springer, 2016.\n\n[9] Saining Xie, Ross Girshick, Piotr Doll\u00e1r, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.\n\n[10] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. CVPR, 2018.\n\n[11] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. CVPR, 2018.\n\n[12] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.\n\n[13] Andreas Griewank and Andrea Walther. Algorithm 799: revolve: an implementation of checkpointing for the reverse or adjoint mode of computational differentiation. ACM Transactions on Mathematical Software (TOMS), 26(1):19\u201345, 2000.\n\n[14] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.\n\n[15] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neurips, pages 5998\u20136008, 2017.\n\n[16] Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia Xu Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, et al. Lingvo: a modular and scalable framework for sequence-to-sequence modeling.
arXiv preprint arXiv:1902.08295, 2019.\n\n[17] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675\u2013678. ACM, 2014.\n\n[18] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.\n\n[19] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.\n\n[20] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015.\n\n[21] Chao Peng, Tete Xiao, Zeming Li, Yuning Jiang, Xiangyu Zhang, Kai Jia, Gang Yu, and Jian Sun. Megdet: A large mini-batch object detector. CVPR, 7, 2017.\n\n[22] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. Cnn features off-the-shelf: An astounding baseline for recognition. In CVPR Workshops, pages 512\u2013519, 2014.\n\n[23] Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):640\u2013651, 2017.\n\n[24] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.\n\n[25] Simon Kornblith, Jonathon Shlens, and Quoc V. Le. Do better imagenet models transfer better? CoRR, abs/1805.08974, 2018.\n\n[26] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation policies from data.
arXiv preprint arXiv:1805.09501, 2018.\n\n[27] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. ECCV, 2018.\n\n[28] Jiquan Ngiam, Daiyi Peng, Vijay Vasudevan, Simon Kornblith, Quoc Le, and Ruoming Pang. Domain adaptive transfer learning. 2018.\n\n[29] Yuxin Peng, Xiangteng He, and Junjie Zhao. Object-part attention model for fine-grained image classification. IEEE Transactions on Image Processing, 27(3):1487\u20131500, 2018.\n\n[30] Yin Cui, Yang Song, Chen Sun, Andrew Howard, and Serge Belongie. Large scale fine-grained categorization and domain-specific transfer learning. In CVPR, 2018.\n\n[31] Fisher Yu, Dequan Wang, and Trevor Darrell. Deep layer aggregation. In CVPR, 2018.\n\n[32] Xiu-Shen Wei, Chen-Wei Xie, Jianxin Wu, and Chunhua Shen. Mask-cnn: Localizing parts and selecting descriptors for fine-grained bird species categorization. Pattern Recognition, 76:704\u2013714, 2018.\n\n[33] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. CoRR, abs/1705.03122, 2017.\n\n[34] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, et al. Mesh-tensorflow: Deep learning for supercomputers. In Neurips, pages 10414\u201310423, 2018.\n\n[35] Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. The best of both worlds: Combining recent advances in neural machine translation. CoRR, abs/1804.09849, 2018.\n\n[36] Felix Wu, Angela Fan, Alexei Baevski, Yann N. Dauphin, and Michael Auli. Pay less attention with lightweight and dynamic convolutions.
CoRR, abs/1901.10430, 2019.\n\n[37] Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019, 2019.\n\n[38] Hongyi Zhang, Yann N Dauphin, and Tengyu Ma. Fixup initialization: Residual learning without normalization. arXiv preprint arXiv:1901.09321, 2019.\n\n[39] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.\n\n[40] Seunghak Lee, Jin Kyu Kim, Xun Zheng, Qirong Ho, Garth A Gibson, and Eric P Xing. On model parallelization and scheduling strategies for distributed machine learning. In Neurips, pages 2834\u20132842, 2014.\n\n[41] Azalia Mirhoseini, Hieu Pham, Quoc V Le, Benoit Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy Bengio, and Jeff Dean. Device placement optimization with reinforcement learning. arXiv preprint arXiv:1706.04972, 2017.\n\n[42] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, and Andrew Y. Ng. Large scale distributed deep networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Neurips 25, pages 1223\u20131231. Curran Associates, Inc., 2012.\n\n[43] A. Petrowski, G. Dreyfus, and C. Girault. Performance analysis of a pipelined backpropagation parallel algorithm. IEEE Transactions on Neural Networks, 4(6):970\u2013981, Nov 1993.\n\n[44] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google\u2019s neural machine translation system: Bridging the gap between human and machine translation.
Transactions of the Association for Computational Linguistics, 2017.\n\n[45] Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. Pipedream: Fast and efficient pipeline parallel dnn training. arXiv preprint arXiv:1806.03377, 2018.\n\n[46] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pages 583\u2013598, 2014.\n", "award": [], "sourceid": 59, "authors": [{"given_name": "Yanping", "family_name": "Huang", "institution": "Google Brain"}, {"given_name": "Youlong", "family_name": "Cheng", "institution": "Google"}, {"given_name": "Ankur", "family_name": "Bapna", "institution": "Google"}, {"given_name": "Orhan", "family_name": "Firat", "institution": "Google"}, {"given_name": "Dehao", "family_name": "Chen", "institution": "Google"}, {"given_name": "Mia", "family_name": "Chen", "institution": "Google Brain"}, {"given_name": "HyoukJoong", "family_name": "Lee", "institution": "Google"}, {"given_name": "Jiquan", "family_name": "Ngiam", "institution": "Google Brain"}, {"given_name": "Quoc", "family_name": "Le", "institution": "Google"}, {"given_name": "Yonghui", "family_name": "Wu", "institution": "Google"}, {"given_name": "zhifeng", "family_name": "Chen", "institution": "Google Brain"}]}