{"title": "Lookahead Optimizer: k steps forward, 1 step back", "book": "Advances in Neural Information Processing Systems", "page_first": 9597, "page_last": 9608, "abstract": "The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate schemes, such as AdaGrad and Adam and (2) accelerated schemes, such as heavy-ball and Nesterov momentum. In this paper, we propose a new optimization algorithm, Lookahead, that is orthogonal to these previous approaches and iteratively updates two sets of weights. Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of ``fast weights\" generated by another optimizer. We show that Lookahead improves the learning stability and lowers the variance of its inner optimizer with negligible computation and memory cost. We empirically demonstrate Lookahead can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings on ImageNet, CIFAR-10/100, neural machine translation, and Penn Treebank.", "full_text": "Lookahead Optimizer: k steps forward, 1 step back\n\nMichael R. Zhang, James Lucas, Geoffrey Hinton, Jimmy Ba\n\nDepartment of Computer Science, University of Toronto, Vector Institute\n\n{michael, jlucas, hinton,jba}@cs.toronto.edu\n\nAbstract\n\nThe vast majority of successful deep neural networks are trained using variants of\nstochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD\ncan be broadly categorized into two approaches: (1) adaptive learning rate schemes,\nsuch as AdaGrad and Adam, and (2) accelerated schemes, such as heavy-ball and\nNesterov momentum. In this paper, we propose a new optimization algorithm,\nLookahead, that is orthogonal to these previous approaches and iteratively updates\ntwo sets of weights. 
Intuitively, the algorithm chooses a search direction by looking\nahead at the sequence of \u201cfast weights\" generated by another optimizer. We\nshow that Lookahead improves the learning stability and lowers the variance of\nits inner optimizer with negligible computation and memory cost. We empirically\ndemonstrate Lookahead can signi\ufb01cantly improve the performance of SGD and\nAdam, even with their default hyperparameter settings on ImageNet, CIFAR-\n10/100, neural machine translation, and Penn Treebank.\n\n1\n\nIntroduction\n\nDespite their simplicity, SGD-like algorithms remain competitive for neural network training against\nadvanced second-order optimization methods. Large-scale distributed optimization algorithms\n[10, 45] have shown impressive performance in combination with improved learning rate scheduling\nschemes [42, 35], yet variants of SGD remain the core algorithm in the distributed systems. The\nrecent improvements to SGD can be broadly categorized into two approaches: (1) adaptive learning\nrate schemes, such as AdaGrad [7] and Adam [18], and (2) accelerated schemes, such as Polyak\nheavy-ball [33] and Nesterov momentum [29]. Both approaches make use of the accumulated past\ngradient information to achieve faster convergence. However, to obtain their improved performance\nin neural networks often requires costly hyperparameter tuning [28].\nIn this work, we present Lookahead, a new optimization method, that is orthogonal to these previous\napproaches. Lookahead \ufb01rst updates the \u201cfast weights\u201d [12] k times using any standard optimizer in\nits inner loop before updating the \u201cslow weights\u201d once in the direction of the \ufb01nal fast weights. We\nshow that this update reduces the variance. We \ufb01nd that Lookahead is less sensitive to suboptimal\nhyperparameters and therefore lessens the need for extensive hyperparameter tuning. 
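The two-loop update just described can be written in a few lines. Below is a minimal sketch with plain SGD as the inner optimizer A; the function names and constants are illustrative and this is not our released implementation:

```python
import numpy as np

def lookahead(grad, phi0, inner_lr=0.1, alpha=0.5, k=5, outer_steps=100):
    """Minimal Lookahead sketch with plain SGD as the inner optimizer A.

    `grad(x)` returns a (possibly stochastic) gradient of the loss at x.
    All names and default values here are illustrative.
    """
    phi = np.asarray(phi0, dtype=float).copy()   # slow weights
    for _ in range(outer_steps):
        theta = phi.copy()                       # sync fast weights to slow weights
        for _ in range(k):                       # k fast-weight (inner-loop) updates
            theta -= inner_lr * grad(theta)
        phi += alpha * (theta - phi)             # one slow-weight step toward theta
    return phi
```

Any optimizer step can replace the SGD line; the outer loop only needs the final fast weights of each inner loop.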
By using\nLookahead with inner optimizers such as SGD or Adam, we achieve faster convergence across\ndifferent deep learning tasks with minimal computational overhead.\nEmpirically, we evaluate Lookahead by training classifiers on the CIFAR [19] and ImageNet datasets\n[5], observing faster convergence on the ResNet-50 and ResNet-152 architectures [11]. We also\ntrained LSTM language models on the Penn Treebank dataset [24] and Transformer-based [42]\nneural machine translation models on the WMT 2014 English-to-German dataset. For all tasks, using\nLookahead leads to improved convergence over the inner optimizer and often improved generalization\nperformance while being robust to hyperparameter changes. Our experiments demonstrate that\nLookahead is robust to changes in the inner loop optimizer, the number of fast weight updates, and\nthe slow weights learning rate.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fAlgorithm 1 Lookahead Optimizer:\nRequire: Initial parameters \u03c60, objective function L\nRequire: Synchronization period k, slow weights step size \u03b1, optimizer A\nfor t = 1, 2, . . . do\n  Synchronize parameters \u03b8t,0 \u2190 \u03c6t\u22121\n  for i = 1, 2, . . . , k do\n    Sample minibatch of data d \u223c D\n    \u03b8t,i \u2190 \u03b8t,i\u22121 + A(L, \u03b8t,i\u22121, d)\n  end for\n  Perform outer update \u03c6t \u2190 \u03c6t\u22121 + \u03b1(\u03b8t,k \u2212 \u03c6t\u22121)\nend for\nreturn parameters \u03c6\n\nFigure 1: (Left) Visualizing Lookahead (k = 10) through a ResNet-32 test accuracy surface at epoch\n100 on CIFAR-100. We project the weights onto a plane defined by the first, middle, and last fast\n(inner-loop) weights. The fast weights are along the blue dashed path. All points that lie on the plane\nare represented as solid, including the entire Lookahead slow weights path (in purple). Lookahead\n(middle, bottom right) quickly progresses closer to the minima than SGD (middle, top right) is able\nto. 
(Right) Pseudocode for Lookahead.\n\n2 Method\n\nIn this section, we describe the Lookahead algorithm and discuss its properties. Lookahead maintains\na set of slow weights \u03c6 and fast weights \u03b8, which get synced with the fast weights every k updates.\nThe fast weights are updated through applying A, any standard optimization algorithm, to batches of\ntraining examples sampled from the dataset D. After k inner optimizer updates using A, the slow\nweights are updated towards the fast weights by linearly interpolating in weight space, \u03b8 \u2212 \u03c6. We\ndenote the slow weights learning rate as \u03b1. After each slow weights update, the fast weights are reset\nto the current slow weights value. Pseudocode is provided in Algorithm 1.\u00b9\nStandard optimization methods typically require carefully tuned learning rates to prevent oscillation\nand slow convergence. This is even more important in the stochastic setting [25, 43]. Lookahead,\nhowever, benefits from a larger learning rate in the inner loop. When oscillating in the high curvature\ndirections, the fast weights updates make rapid progress along the low curvature directions. The slow\nweights help smooth out the oscillations through the parameter interpolation. The combination of\nfast weights and slow weights improves learning in high curvature directions, reduces variance, and\nenables Lookahead to converge rapidly in practice.\nFigure 1 shows the trajectory of both the fast weights and slow weights during the optimization of a\nResNet-32 model on CIFAR-100. While the fast weights explore around the minima, the slow weight\nupdate pushes Lookahead aggressively towards an area of improved test accuracy, a region which\nremains unexplored by SGD after 20 updates.\n\nSlow weights trajectory We can characterize the trajectory of the slow weights as an exponential\nmoving average (EMA) of the final fast weights within each inner-loop, regardless of the inner\noptimizer. After k inner-loop steps we have:\n\n\u03c6t+1 = \u03c6t + \u03b1(\u03b8t,k \u2212 \u03c6t)   (1)\n= \u03b1[\u03b8t,k + (1 \u2212 \u03b1)\u03b8t\u22121,k + . . . + (1 \u2212 \u03b1)^t \u03b80,k] + (1 \u2212 \u03b1)^{t+1} \u03c60   (2)\n\nIntuitively, the slow weights heavily utilize recent proposals from the fast weight optimization but\nmaintain some influence from previous fast weights. We show that this has the effect of reducing\nvariance in Section 3.1. While a Polyak-style average has further theoretical guarantees, our results\nmatch the claim that \u201can exponentially-decayed moving average typically works much better in\npractice\u201d [25].\n\n\u00b9Our open source implementation is available at https://github.com/michaelrzhang/lookahead.\n\n2\n\n\fFigure 2: CIFAR-10 training loss with fixed and adaptive \u03b1. The adaptive \u03b1 is clipped between\n[\u03b1low, 1]. (Left) Adam learning rate = 0.001. (Right) Adam learning rate = 0.003.\n\nFast weights trajectory Within each inner-loop, the trajectory of the fast weights depends on the\nchoice of underlying optimizer. Given an optimization algorithm A that takes in an objective function\nL and the current mini-batch training examples d, we have the update rule for the fast weights:\n\n\u03b8t,i = \u03b8t,i\u22121 + A(L, \u03b8t,i\u22121, d).   (3)\n\nWe have the choice of maintaining, interpolating, or resetting the internal state (e.g. momentum) of\nthe inner optimizer. We evaluate this tradeoff on the CIFAR dataset (where every choice improves\nconvergence) in Appendix D.1 and maintain internal state for the other experiments.\n\nComputational complexity Lookahead has a constant computational overhead due to parameter\ncopying and basic arithmetic operations that is amortized across the k inner loop updates. The number\nof operations is O((k + 1)/k) times that of the inner optimizer. Lookahead maintains a single additional\ncopy of the number of learnable parameters in the model.\n\n2.1 Selecting the Slow Weights Step Size\nThe step size in the direction (\u03b8t,k \u2212 \u03b8t,0) is controlled by \u03b1. By taking a quadratic approximation of\nthe loss, we present a principled way of selecting \u03b1.\nProposition 1 (Optimal slow weights step size). For a quadratic loss function L(x) = (1/2)x^T Ax \u2212 b^T x,\nthe step size \u03b1* that minimizes the loss for two points \u03b8t,0 and \u03b8t,k is given by:\n\n\u03b1* = arg min_\u03b1 L(\u03b8t,0 + \u03b1(\u03b8t,k \u2212 \u03b8t,0)) = [(\u03b8t,0 \u2212 \u03b8*)^T A(\u03b8t,0 \u2212 \u03b8t,k)] / [(\u03b8t,0 \u2212 \u03b8t,k)^T A(\u03b8t,0 \u2212 \u03b8t,k)]\n\nwhere \u03b8* = A^{\u22121}b minimizes the loss.\n\nProof is in the appendix. Using quadratic approximations for the curvature, which is typical in second\norder optimization [7, 18, 26], we can derive an estimate for the optimal \u03b1 more generally. The\nfull Hessian is typically intractable so we instead use aforementioned approximations, such as the\ndiagonal approximation to the empirical Fisher used by the Adam optimizer [18]. This approximation\nworks well in our numerical experiments if we clip the magnitude of the step size. At each slow\nweight update, we compute:\n\n\u03b1\u0302* = clip([(\u03b8t,0 \u2212 (\u03b8t,k \u2212 \u00c2^{\u22121}\u2207\u0302L(\u03b8t,k)))^T \u00c2(\u03b8t,0 \u2212 \u03b8t,k)] / [(\u03b8t,0 \u2212 \u03b8t,k)^T \u00c2(\u03b8t,0 \u2212 \u03b8t,k)], \u03b1low, 1)\n\nwhere \u00c2 is the empirical Fisher approximation and \u03b8t,k \u2212 \u00c2^{\u22121}\u2207\u0302L(\u03b8t,k) approximates the optimum\n\u03b8*. We prove Proposition 1 and elaborate on assumptions in the appendix B.2. Setting \u03b1low > 0\nimproves the stability of our algorithm. We evaluate the performance of this adaptive scheme versus\na fixed scheme and standard Adam on a ResNet-18 trained on CIFAR-10 with two different learning\nrates and show the results in Figure 2. 
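Proposition 1 can be checked numerically in a few lines (a sketch on a random quadratic; all constants and names below are illustrative): the closed-form \u03b1* should minimize the loss along the segment between \u03b8t,0 and \u03b8t,k.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(x, A, b):
    # Quadratic loss L(x) = (1/2) x^T A x - b^T x from Proposition 1
    return 0.5 * x @ A @ x - b @ x

# Random positive-definite quadratic and two arbitrary points
# (theta0 plays theta_{t,0}, thetak plays theta_{t,k}).
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)
b = rng.standard_normal(4)
theta_star = np.linalg.solve(A, b)      # minimizer theta* = A^{-1} b
theta0 = rng.standard_normal(4)
thetak = rng.standard_normal(4)

# Closed-form optimal slow-weights step size from Proposition 1
d = theta0 - thetak
alpha_star = ((theta0 - theta_star) @ A @ d) / (d @ A @ d)

# alpha_star should minimize the loss along the interpolation line
f = lambda a: loss(theta0 + a * (thetak - theta0), A, b)
assert f(alpha_star) <= min(f(alpha_star - 1e-2), f(alpha_star + 1e-2))
```

Because the restriction of a positive-definite quadratic to a line is a convex parabola, the stationary point of the line search is its global minimum, which is what the final assertion checks.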
Additional hyperparameter details are given in appendix C.\nBoth the fixed and adaptive Lookahead offer improved convergence.\nIn practice, a fixed choice of \u03b1 offers similar convergence benefits and tends to generalize better.\nFixing \u03b1 avoids the need to maintain an estimate of the empirical Fisher, which incurs a memory and\ncomputational cost when the inner optimizer does not maintain such an estimate, e.g. SGD. We thus\nuse a fixed \u03b1 for the rest of our deep learning experiments.\n\n3\n\n\f3 Convergence Analysis\n\n3.1 Noisy quadratic analysis\n\nWe analyze Lookahead on a noisy quadratic model to better understand its convergence guarantees.\nWhile simple, this model is a proxy for neural network optimization and effectively optimizing it\nremains a challenging open problem [37, 26, 43, 47]. In this section, we will show under equal\nlearning rates that Lookahead will converge to a smaller steady-state risk than SGD. We will then\nshow through simulation of the expected dynamics that Lookahead is able to converge to this\nsteady-state risk more quickly than SGD for a range of hyperparameter settings.\n\nModel definition We use the same model as in Schaul et al. [37] and Wu et al. [43].\n\nL\u0302(x) = (1/2)(x \u2212 c)^T A(x \u2212 c),   (4)\n\nFigure 3: Comparing expected optimization progress between SGD and Lookahead (k = 5) on the\nnoisy quadratic model. Each vertical slice compares the convergence of optimizers with the same\nfinal loss values. For Lookahead, convergence rates for 100 evenly spaced \u03b1 values in the range\n(0, 1] are overlaid.\n\nwith c \u223c N(x*, \u03a3). We assume that both A and \u03a3 are diagonal and that, without loss of generality,\nx* = 0. While it is trivial to assume that A is diagonal,\u00b2 the co-diagonalizable noise assumption is\nnon-trivial but is common \u2014 see Wu et al. [43] and Zhang et al. [47] for further discussion. 
We use\na_i and \u03c3_i\u00b2 to denote the diagonal elements of A and \u03a3 respectively. Taking the expectation over c,\nthe expected loss of the iterates \u03b8^(t) is,\n\nL(\u03b8^(t)) = E[L\u0302(\u03b8^(t))] = (1/2) E[\u2211_i a_i((\u03b8_i^(t))\u00b2 + \u03c3_i\u00b2)] = (1/2) \u2211_i a_i(E[\u03b8_i^(t)]\u00b2 + V[\u03b8_i^(t)] + \u03c3_i\u00b2).   (5)\n\nAnalyzing the expected dynamics of the SGD iterates and the slow weights gives the following result.\nProposition 2 (Lookahead steady-state risk). Let 0 < \u03b3 < 2/L be the learning rate of SGD and\nLookahead where L = max_i a_i. In the noisy quadratic model, the iterates of SGD and Lookahead\nwith SGD as its inner optimizer converge to 0 in expectation and the variances converge to the\nfollowing fixed points:\n\nV*_SGD = \u03b3\u00b2A\u00b2\u03a3\u00b2 (I \u2212 (I \u2212 \u03b3A)\u00b2)^{\u22121}   (6)\n\nV*_LA = \u03b1\u00b2(I \u2212 (I \u2212 \u03b3A)^{2k}) [\u03b1\u00b2(I \u2212 (I \u2212 \u03b3A)^{2k}) + 2\u03b1(1 \u2212 \u03b1)(I \u2212 (I \u2212 \u03b3A)^k)]^{\u22121} V*_SGD   (7)\n\nRemarks For the Lookahead variance fixed point, the first product term is always smaller than 1\nfor \u03b1 \u2208 (0, 1), and thus Lookahead has a variance fixed point that is strictly smaller than that of the\nSGD inner-loop optimizer for the same learning rate. Evidence of this phenomenon is present in deep\nneural networks trained on the CIFAR dataset, shown in Figure 10.\nIn Proposition 2, we use the same learning rate for both SGD and Lookahead. To fairly evaluate the\nconvergence of the two methods, we compare the convergence rates under hyperparameter settings\nthat achieve the same steady-state risk. In Figure 3 we show the expected loss after 1000 updates\n(computed analytically) for both Lookahead and SGD. This shows that there exist (fixed) settings\nof the Lookahead hyperparameters that arrive at the same steady-state risk as SGD but do so more\nquickly. 
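The fixed points of Proposition 2 can be sanity-checked in the scalar case with a short simulation (a sketch; the constants a = 1, \u03c3 = 1, \u03b3 = 0.5, k = 5, \u03b1 = 0.5 are illustrative): the closed-form Lookahead variance lies strictly below the SGD one, and a Monte Carlo estimate of the steady-state variance of the slow weights matches the closed form.

```python
import numpy as np

# Scalar noisy quadratic: L_hat(x) = 0.5 * a * (x - c)^2 with c ~ N(0, sigma^2)
a, sigma, gamma, k, alpha = 1.0, 1.0, 0.5, 5, 0.5
r = (1 - gamma * a) ** k

# Closed-form steady-state variances (Proposition 2, scalar case)
v_sgd = gamma**2 * a**2 * sigma**2 / (1 - (1 - gamma * a) ** 2)
factor = (alpha**2 * (1 - r**2)) / (
    alpha**2 * (1 - r**2) + 2 * alpha * (1 - alpha) * (1 - r)
)
v_la = factor * v_sgd
assert v_la < v_sgd  # Lookahead's variance fixed point is strictly smaller

# Monte Carlo check: many independent Lookahead(SGD) chains run to steady state
rng = np.random.default_rng(0)
phi = np.zeros(200_000)                   # slow weights of each chain
for _ in range(60):                       # outer loop (slow-weight updates)
    theta = phi.copy()
    for _ in range(k):                    # inner loop (fast-weight SGD steps)
        c = rng.normal(0.0, sigma, phi.shape)
        theta -= gamma * a * (theta - c)  # SGD step on the sampled quadratic
    phi += alpha * (theta - phi)
assert abs(phi.var() - v_la) < 0.01
```

The empirical variance across chains converges to the Lookahead fixed point, while setting \u03b1 = 1 recovers the SGD fixed point.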
Moreover, Lookahead outperforms SGD across the broad spectrum of \u21b5 values we simulated.\nDetails, further simulation results, and additional discussion are presented in Appendix B.\n\n2Classical momentum\u2019s iterates are invariant to translations and rotations (see e.g. Sutskever et al. [41]) and\n\nLookahead\u2019s linear interpolation is also invariant to such changes.\n\n4\n\n\f3.2 Deterministic quadratic convergence\n\nIn the previous section we showed that on the noisy quadratic model, Lookahead is able to improve\nconvergence of the SGD optimizer under setting with equivalent convergent risk. Here we analyze\nthe quadratic model without noise using gradient descent with momentum [33, 9] and show that when\nthe system is under-damped, Lookahead is able to improve on the convergence rate.\nAs before, we restrict our attention to diagonal quadratic functions (which in this case is entirely\nwithout loss of generality). Given an initial point \u27130, we wish to \ufb01nd the rate of contraction, that is,\nthe smallest \u21e2 satisfying ||\u2713t  \u2713\u21e4|| \uf8ff \u21e2t||\u27130  \u2713\u21e4||. We follow the approach of [31] and model the\noptimization of this function as a linear dynamical system allowing us to compute the rate exactly.\nDetails are in Appendix B.\nAs in Lucas et al. [23], to better un-\nderstand the sensitivity of Lookahead\nto misspeci\ufb01ed conditioning we \ufb01x\nthe momentum coef\ufb01cient of classi-\ncal momentum and explore the con-\nvergence rate over varying condition\nnumber under the optimal learning\nrate. As expected, Lookahead has\nslightly worse convergence in the\nover-damped regime where momen-\ntum is set too low and CM is slowly,\nmonotonically converging to the op-\ntimum. 
However, when the system\nis under-damped (and oscillations oc-\ncur) Lookahead is able to signi\ufb01cantly\nimprove the convergence rate by skip-\nping to a better parameter setting during oscillation.\n\nFigure 4: Quadratic convergence rates (1  \u21e2) of classical\nmomentum versus Lookahead wrapping classical momentum.\nFor Lookahead, we \ufb01x k = 20 lookahead steps and \u21b5 =\n0.5 for the slow weights step size. Lookahead is able to\nsigni\ufb01cantly improve on the convergence rate in the under-\ndamped regime where oscillations are observed.\n\n4 Related work\n\nOur work is inspired by recent advances in understanding the loss surface of deep neural networks.\nWhile the idea of following the trajectory of weights dates back to Ruppert [36], Polyak and Juditsky\n[34], averaging weights in neural networks has not been carefully studied until more recently. Garipov\net al. [8] observe that the \ufb01nal weights of two independently trained neural networks can be connected\nby a curve with low loss. Izmailov et al. [14] proposes Stochastic Weight Averaging (SWA), which\naverages the weights at different checkpoints obtained during training. Parameter averaging schemes\nare used to create ensembles in natural language processing tasks [15, 27] and in training Generative\nAdversarial Networks [44]. In contrast to previous approaches, which generally focus on generating\na set of parameters at the end of training, Lookahead is an optimization algorithm which performs\nparameter averaging during the training procedure to achieve faster convergence. We elaborate on\ndifferences with SWA and present additional experimental results in appendix D.3.\nThe Reptile algorithm, proposed by Nichol et al. [30], samples tasks in its outer loop and runs an\noptimization algorithm on each task within the inner loop. The initial weights are then updated in the\ndirection of the new weights. 
While the functionality is similar, the application and setting are starkly\ndifferent. Reptile samples different tasks and aims to find parameters which act as good initial values\nfor new tasks sampled at test time. Lookahead does not sample new tasks for each outer loop and\naims to take advantage of the geometry of loss surfaces to improve convergence.\nKatyusha [1], an accelerated form of SVRG [17], also uses an outer and inner loop during optimization.\nKatyusha checkpoints parameters during optimization. Within each inner loop step, the parameters\nare pulled back towards the latest checkpoint. Lookahead computes the pullback only at the end\nof the inner loop and the gradient updates do not utilize the SVRG correction (though this would\nbe possible). While Katyusha has theoretical guarantees in the convex optimization setting, the\nSVRG-based update does not work well for neural networks [4].\nAnderson acceleration [2] and other related extrapolation techniques [3] have a similar flavor to\nLookahead. These methods keep track of all iterates within an inner loop and then compute some\nlinear combination which extrapolates the iterates towards their fixed point. This presents additional\n\n5\n\n\fTable 1: CIFAR Final Validation Accuracy.\n\nOPTIMIZER | CIFAR-10 | CIFAR-100\nSGD | 95.23 \u00b1 .19 | 78.24 \u00b1 .18\nPOLYAK | 95.26 \u00b1 .04 | 77.99 \u00b1 .42\nADAM | 94.84 \u00b1 .16 | 76.88 \u00b1 .39\nLOOKAHEAD | 95.27 \u00b1 .06 | 78.34 \u00b1 .05\n\nFigure 5: Performance comparison of the different optimization algorithms. (Left) Train Loss on\nCIFAR-100. (Right) CIFAR ResNet-18 validation accuracies with various optimizers. We do a grid\nsearch over learning rate and weight decay on the other optimizers (details in appendix C). Lookahead\nand Polyak are wrapped around SGD.\n\nchallenges first in the form of additional memory overhead as the number of inner-loop steps increases\nand also in finding the best linear combination. Scieur et al. 
[38, 39] propose a method by which to\n\ufb01nd a good linear combination and apply this approach to deep learning problems and report both\nimproved convergence and generalization. However, their method requires on the order of k times\nmore memory than Lookahead. Lookahead can be seen as a simple version of Anderson acceleration\nwherein only the \ufb01rst and last iterates are used.\n\n5 Experiments\n\nWe completed a thorough evaluation of the Lookahead optimizer on a variety of deep learning tasks\nagainst well-calibrated baselines. We explored image classi\ufb01cation on CIFAR-10/CIFAR-100 [19]\nand ImageNet [5]. We also trained LSTM language models on the Penn Treebank dataset [24] and\nTransformer-based [42] neural machine translation models on the WMT 2014 English-to-German\ndataset. For all of our experiments, every algorithm consumed the same amount of training data.\n\n5.1 CIFAR-10 and CIFAR-100\nThe CIFAR-10 and CIFAR-100 datasets for classi\ufb01cation consist of 32 \u21e5 32 color images, with\n10 and 100 different classes, split into a training set with 50,000 images and a test set with 10,000\nimages. We ran all our CIFAR experiments with 3 seeds and trained for 200 epochs on a ResNet-18\n[11] with batches of 128 images and decay the learning rate by a factor of 5 at the 60th, 120th, and\n160th epochs. Additional details are given in appendix C.\nWe summarize our results in Figure 5.3 We also elaborate on how Lookahead contrasts with SWA and\npresent results demonstrating lower validation error with Pre-ResNet-110 and Wide-ResNet-28-10\n[46] on CIFAR-100 in appendix D.3. Note that Lookahead achieves signi\ufb01cantly faster convergence\nthroughout training even though the learning rate schedule is optimized for the inner optimizer\u2014\nfuture work can involve building a learning rate schedule for Lookahead. 
This improved convergence\nis important for better anytime performance in new datasets where hyperparameters and learning rate\nschedules are not well-calibrated.\n\n5.2 ImageNet\n\nThe 1000-way ImageNet task [5] is a classification task that contains roughly 1.28 million training\nimages and 50,000 validation images. We use the official PyTorch implementation\u2074 and the ResNet-\n50 and ResNet-152 [11] architectures. Our baseline algorithm is SGD with an initial learning rate of\n0.1 and momentum value of 0.9. We train for 90 epochs and decay our learning rate by a factor of 10\nat the 30th and 60th epochs. For Lookahead, we set k = 5 and slow weights step size \u03b1 = 0.5.\nMotivated by the improved convergence we observed in our initial experiment, we tried a more\naggressive learning rate decay schedule where we decay the learning rate by a factor of 10 at the 30th,\n48th, and 58th epochs. Using such a schedule, we reach 75% single crop top-1 accuracy on ImageNet\nin just 50 epochs and reach 75.5% top-1 accuracy in 60 epochs. The results are shown in Figure 6.\n\n\u00b3We refer to SGD with heavy ball momentum [33] as SGD.\n\u2074Implementation available at https://github.com/pytorch/examples/tree/master/imagenet.\n\n6\n\n\fTable 2: Top-1 and Top-5 single crop validation accuracies on ImageNet.\n\nOPTIMIZER | LA | SGD\nEPOCH 50 - TOP 1 | 75.13 | 74.43\nEPOCH 50 - TOP 5 | 92.22 | 92.15\nEPOCH 60 - TOP 1 | 75.49 | 75.15\nEPOCH 60 - TOP 5 | 92.53 | 92.56\n\nFigure 6: ImageNet training loss. The asterisk denotes the aggressive learning rate decay schedule,\nwhere LR is decayed at iteration 30, 48, and 58. We report validation accuracies for this schedule.\n\n(a) Training perplexity of LSTM models trained\non the Penn Treebank dataset\n\n(b) Training Loss on Transformer. Adam and\nAdaFactor both use a linear warmup scheme described in Vaswani et al. 
[42].\n\nFigure 7: Optimization performance on Penn Treebank and WMT-14 machine translation task.\n\nTo test the scalability of our method, we ran Lookahead with the aggressive learning rate decay on\nResNet-152. We reach 77% single crop top-1 accuracy in 49 epochs (matching what is reported in He\net al. [11]) and 77.96% top-1 accuracy in 60 epochs. Other approaches for improving convergence on\nImageNet can require hundreds of GPUs, or tricks such as ramping up the learning rate and adaptive\nbatch-sizes [10, 16]. The fastest convergence we are aware of uses an approximate second-order\nmethod to train a ResNet-50 to 75% top-1 accuracy in 35 epochs with 1,024 GPUs [32]. In contrast,\nLookahead requires changing one single line of code and can easily scale to ResNet-152.\n\n5.3 Language modeling\n\nWe trained LSTMs [13] for language modeling on the Penn Treebank dataset. We followed the\nmodel setup of Merity et al. [27] and made use of their publicly available code in our experiments.\nWe did not include the \ufb01ne-tuning stages. We searched over hyperparameters for both Adam and\nSGD (without momentum) to \ufb01nd the model which gave the best validation performance. We then\nperformed an additional small grid search on each of these methods with Lookahead. Each model\nwas trained for 750 epochs. We show training curves for each model in Figure 7a.\nUsing Lookahead with Adam we were able to achieve the fastest convergence and best training,\nvalidation, and test perplexity. The models trained with SGD took much longer to converge (around\n700 epochs) and were unable to match the \ufb01nal performance of Adam. Using Polyak weight averaging\n[34] with SGD, as suggested by Merity et al. [27] and referred to as ASGD, we were able to improve\non the performance of Adam but were unable to match the performance of Lookahead. 
Full results\nare given in Table 3 and additional details are in appendix C.\n\n5.4 Neural machine translation\n\nWe trained Transformer based models [42] on the WMT2014 English-to-German translation task on\na single Tensor Processing Unit (TPU) node. We took the base model from Vaswani et al. [42] and\ntrained it using the proposed warmup-then-decay learning rate scheduling scheme and, additionally,\nthe same scheme wrapped with Lookahead. We found that Lookahead speeds up the early stage of the\ntraining over Adam and the later proposed AdaFactor [40] optimizer. All the methods converge to\nsimilar training loss and BLEU score at the end; see Figure 7b and Table 4.\n\n7\n\n\fTable 3: LSTM training, validation, and test perplexity on the Penn Treebank dataset.\n\nOPTIMIZER | TRAIN | VAL. | TEST\nSGD | 43.62 | 66.0 | 63.90\nLA(SGD) | 35.02 | 65.10 | 63.04\nADAM | 33.54 | 61.64 | 59.33\nLA(ADAM) | 31.92 | 60.28 | 57.72\nPOLYAK | - | 61.18 | 58.79\n\nTable 4: Transformer Base Model trained for 50k steps on WMT English-to-German. \u201cAdam-\u201d\ndenotes Adam without learning rate warm-up.\n\nOPTIMIZER | NEWSTEST13 | NEWSTEST14\nADAM | 24.6 | 24.6\nLA(ADAM) | 24.68 | 24.70\nLA(ADAM-) | 24.3 | 24.4\nADAFACTOR | 24.17 | 24.51\n\n(a) CIFAR-10 Train Loss: Different LR\n\n(b) CIFAR-10 Train Loss: Different momentum\n\nFigure 8: We fix Lookahead parameters and evaluate on different inner optimizers.\n\nOur NMT experiments further confirm that Lookahead improves the robustness of the inner loop\noptimizer. We found that Lookahead enables a wider range of learning rate choices {0.02, 0.04, 0.06}\nfor the Transformer model that all converge to similar final losses. 
Full details are given in Appendix C.4.\n\n5.5 Empirical analysis\n\nRobustness to inner optimization algorithm, k, and \u03b1 We demonstrate empirically on the CIFAR\ndataset that Lookahead consistently delivers fast convergence across different hyperparameter settings.\nWe fix slow weights step size \u03b1 = 0.5 and k = 5 and run Lookahead on inner SGD optimizers with\ndifferent learning rates and momentum; results are shown in Figure 8. In general, we observe that\nLookahead can train with higher learning rates on the base optimizer with little to no tuning on k and\n\u03b1. This agrees with our discussion of variance reduction in Section 3.1. We also evaluate robustness\nto the Lookahead hyperparameters by fixing the inner optimizer and evaluating runs with varying\nupdates k and step size \u03b1; these results are shown in Figure 9.\n\nInner loop and outer loop evaluation To get a better understanding of the Lookahead update, we\nalso plotted the test accuracy for every update on epoch 65 in Figure 10. We found that within each\ninner loop the fast weights may lead to substantial degradation in task performance\u2014this reflects\nour analysis of the higher variance of the inner loop update in section 3.1. The slow weights step\nrecovers the outer loop variance and restores the test accuracy.\n\nTable 5: All settings have higher validation accuracy than SGD (77.72%).\n\nk \\ \u03b1 | 0.5 | 0.8\n5 | 78.24 \u00b1 .02 | 78.27 \u00b1 .04\n10 | 78.19 \u00b1 .22 | 77.94 \u00b1 .22\n\nFigure 9: CIFAR-100 train loss and final test accuracy with various k and \u03b1.\n\n8\n\n\fFigure 10: Visualizing Lookahead accuracy for 60 fast weight updates. We plot the test accuracy\nafter every update (the training accuracy and loss behave similarly). 
The inner loop update tends to\ndegrade both the training and test accuracy, while the interpolation recovers the original performance.\n\n6 Conclusion\n\nIn this paper, we present Lookahead, an algorithm that can be combined with any standard optimiza-\ntion method. Our algorithm computes weight updates by looking ahead at the sequence of \u201cfast\nweights\" generated by another optimizer. We illustrate how Lookahead improves convergence by\nreducing variance and show strong empirical results on many deep learning benchmark datasets and\narchitectures.\n\n9\n\n\fAcknowledgements\n\nWe\u2019d like to thank Roger Grosse, Guodong Zhang, Denny Wu, Silviu Pitis, David Madras, Jackson\nWang, Harris Chan, and Mufan Li for helpful comments on earlier versions of this work. We are also\nthankful for the many helpful comments from anonymous reviewers.\n\nReferences\n[1] Zeyuan Allen-Zhu. Katyusha: The \ufb01rst direct acceleration of stochastic gradient methods. In\nProceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages\n1200\u20131205. ACM, 2017.\n\n[2] Donald G Anderson. Iterative procedures for nonlinear integral equations. Journal of the ACM\n\n(JACM), 12(4):547\u2013560, 1965.\n\n[3] Claude Brezinski and M Redivo Zaglia. Extrapolation methods: theory and practice, volume 2.\n\nElsevier, 2013.\n\n[4] Aaron Defazio and L\u00e9on Bottou. On the ineffectiveness of variance reduced optimization for\n\ndeep learning, 2018.\n\n[5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale\nhierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009.\nIEEE Conference on, pages 248\u2013255. Ieee, 2009.\n\n[6] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural\n\nnetworks with cutout. arXiv preprint arXiv:1708.04552, 2017.\n\n[7] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning\nand stochastic optimization. 
Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

[8] Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew Gordon Wilson. Loss surfaces, mode connectivity, and fast ensembling of DNNs. arXiv preprint arXiv:1802.10026, 2018.

[9] Gabriel Goh. Why momentum really works. Distill, 2017. doi: 10.23915/distill.00006. URL http://distill.pub/2017/momentum.

[10] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[12] Geoffrey E Hinton and David C Plaut. Using fast weights to deblur old memories. 1987.

[13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[14] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.

[15] Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007, 2014.

[16] Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, et al. Highly scalable deep learning training system with mixed-precision: Training ImageNet in four minutes. arXiv preprint arXiv:1807.11205, 2018.

[17] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction.
In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[18] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[19] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[20] Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.

[21] Ren-cang Li. Sharpness in rates of convergence for CG and symmetric Lanczos methods. Technical report, 2005.

[22] Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in Adam. arXiv preprint arXiv:1711.05101, 2017.

[23] James Lucas, Richard Zemel, and Roger Grosse. Aggregated momentum: Stability through passive damping. arXiv preprint arXiv:1804.00325, 2018.

[24] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

[25] James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.

[26] James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408–2417, 2015.

[27] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, 2017.

[28] Grégoire Montavon, Geneviève Orr, and Klaus-Robert Müller. Neural Networks: Tricks of the Trade, volume 7700. Springer, 2012.

[29] Yurii E Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.

[30] Alex Nichol, Joshua Achiam, and John Schulman.
Reptile: A scalable metalearning algorithm. arXiv preprint arXiv:1803.02999, 2018.

[31] Brendan O'Donoghue and Emmanuel Candes. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715–732, 2015.

[32] Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. Second-order optimization method for large mini-batch: Training ResNet-50 on ImageNet in 35 epochs, 2018.

[33] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

[34] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

[35] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018.

[36] David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988.

[37] Tom Schaul, Sixin Zhang, and Yann LeCun. No more pesky learning rates. In International Conference on Machine Learning, pages 343–351, 2013.

[38] Damien Scieur, Edouard Oyallon, Alexandre d'Aspremont, and Francis Bach. Nonlinear acceleration of deep neural networks. arXiv preprint arXiv:1805.09639, 2018.

[39] Damien Scieur, Edouard Oyallon, Alexandre d'Aspremont, and Francis Bach. Nonlinear acceleration of CNNs. arXiv preprint arXiv:1806.00370, 2018.

[40] Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. arXiv preprint arXiv:1804.04235, 2018.

[41] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton.
On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147, 2013.

[42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[43] Yuhuai Wu, Mengye Ren, Renjie Liao, and Roger Grosse. Understanding short-horizon bias in stochastic meta-optimization. arXiv preprint arXiv:1803.02021, 2018.

[44] Yasin Yazıcı, Chuan-Sheng Foo, Stefan Winkler, Kim-Hui Yap, Georgios Piliouras, and Vijay Chandrasekhar. The unusual effectiveness of averaging in GAN training. arXiv preprint arXiv:1806.04498, 2018.

[45] Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. ImageNet training in minutes. In Proceedings of the 47th International Conference on Parallel Processing, page 1. ACM, 2018.

[46] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.

[47] Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George E Dahl, Christopher J Shallue, and Roger Grosse. Which algorithmic choices matter at which batch sizes? Insights from a noisy quadratic model. arXiv preprint arXiv:1907.04164, 2019.
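Appendix: Lookahead update sketch

The "k steps forward, 1 step back" update analyzed above fits in a few lines of code. The following is a minimal sketch, not the paper's training setup: the 1-D quadratic objective, the inner learning rate, and the helper names `inner_sgd_step`, `lookahead`, and `grad_fn` are illustrative assumptions.

```python
# Minimal sketch of the Lookahead update rule: run k fast-weight steps with an
# inner optimizer, then move the slow weights a fraction alpha toward the final
# fast weights. The toy objective and hyperparameters here are illustrative.

def inner_sgd_step(theta, grad_fn, lr=0.1):
    """One fast-weight update using plain SGD as the inner optimizer."""
    return theta - lr * grad_fn(theta)

def lookahead(theta0, grad_fn, alpha=0.5, k=5, outer_steps=30):
    """k steps forward with the fast weights, 1 interpolation step back."""
    slow = theta0
    for _ in range(outer_steps):
        fast = slow                              # fast weights start at the slow weights
        for _ in range(k):                       # inner loop: k fast updates
            fast = inner_sgd_step(fast, grad_fn)
        slow = slow + alpha * (fast - slow)      # slow weights step toward final fast weights
    return slow

# Toy objective f(x) = (x - 3)^2 with gradient 2(x - 3); minimum at x = 3.
grad = lambda x: 2.0 * (x - 3.0)
x_star = lookahead(10.0, grad)
print(x_star)  # converges close to 3.0
```

With a quadratic, each inner step contracts the error by a constant factor and the slow-weight interpolation contracts it further, which is why the iterate above approaches the minimum; with stochastic gradients, the same interpolation is what damps the inner-loop variance discussed in Section 3.1.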