NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 2952
Title: Ouroboros: On Accelerating Training of Transformer-Based Language Models

Reviewer 1

The authors propose a model-parallel gradient descent algorithm for speeding up the training of Transformer-based language models. Overall, the paper is well written, and the experiments convincingly demonstrate the validity of the approach. My main question is whether the authors have tried using their algorithm to train much larger Transformer models that do not fit on a single GPU.

Reviewer 2

This paper provides a way to divide very large Transformer-based deep neural networks into several modules for efficient training. Transformer-based language models require a huge amount of computation and time, so model parallelism is required when a model is too large to fit on a single computing device. However, in standard training each layer must wait for the gradients propagated back from the layers after it before computing its own gradients, so some GPUs become idle when the model is split across multiple GPUs. The proposed training method enables model parallelism for Transformer-based language models by avoiding this backward locking. The novelty of the idea and the contribution of the theoretical analysis are somewhat limited, but overall this paper shows great progress on parallelizing the training of Transformer-based language models.
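To make the backward-locking point concrete, the following is a minimal toy sketch (my own illustration, not the paper's Ouroboros algorithm): a two-module linear model in which the earlier module updates using the gradient signal from the previous iteration instead of waiting for the current one. The toy problem, step size, and variable names are assumptions chosen purely for illustration.

# Hypothetical sketch of backward locking vs. a one-step-stale gradient update.
# Not the authors' method; a toy two-module model fit to y = 2x with SGD.
import numpy as np

rng = np.random.default_rng(0)

w1, w2 = 0.5, 0.5   # module 1 (device 1) and module 2 (device 2)
lr = 0.01
stale_signal = 0.0  # dL/dh from the previous step, reused by module 1

for step in range(500):
    x = rng.normal()
    y = 2.0 * x

    # Forward pass through both modules.
    h = w1 * x          # module 1
    y_hat = w2 * h      # module 2
    err = y_hat - y     # squared-error loss L = 0.5 * err**2

    # Module 2 computes its own gradient and the signal dL/dh for module 1.
    grad_w2 = err * h
    fresh_signal = err * w2

    # "Unlocked" update: module 1 uses last step's signal instead of waiting,
    # so the device holding module 1 never has to sit idle.
    grad_w1 = stale_signal * x
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2
    stale_signal = fresh_signal

print(f"w1*w2 = {w1 * w2:.3f} (target 2.0, despite the one-step-stale gradient)")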

Reviewer 3

The paper introduces a new method for model-parallel training, in which the layers of a model are distributed across multiple accelerators. The method avoids locking in the backward pass by using stale gradients during back-propagation. I am not aware of any prior work that has taken such an approach. Furthermore, the authors provide theoretical claims and empirical results demonstrating that, despite using stale gradients, their method has convergence properties similar to conventional SGD (see the toy sketch after the minor comments below). The lack of effective model-parallel training is a major roadblock for scaling up model sizes, and the proposed approach promises to overcome this issue.

Minor:
74: the notation grad f_{l_x_{i(t)}} is not used in equation (6). It would also be useful to restate what this notation means next to equation (8).
84: should be "any ... method"
116 or 127: it would be good to state that the proofs are in the supplementary material, rather than leaving it unclear whether the authors have proven the key theorems.
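As a toy illustration of the convergence point above (again my own sketch, not the paper's analysis), the snippet below runs gradient descent on a simple quadratic with exact gradients and with gradients that are a few steps out of date; the objective, step size, and delay are arbitrary assumptions.

# Hypothetical comparison: exact vs. delayed ("stale") gradients on f(w) = 0.5*||w||^2.
import numpy as np

def run(delay, lr=0.05, steps=300, dim=5, seed=1):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=dim)
    history = [w.copy()]                # past iterates, so we can read w_{t - delay}
    for t in range(steps):
        w_stale = history[max(0, t - delay)]
        w = w - lr * w_stale            # gradient of 0.5*||w||^2 taken at the stale iterate
        history.append(w.copy())
    return np.linalg.norm(w)

print("final ||w||, exact gradients :", run(delay=0))
print("final ||w||, 3-step staleness:", run(delay=3))

With a small step size, both runs drive the iterates toward zero at a comparable rate, which is the qualitative behaviour the reviewer's remark about stale-gradient convergence refers to.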