{"title": "Evolutionary Stochastic Gradient Descent for Optimization of Deep Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 6048, "page_last": 6058, "abstract": "We propose a population-based Evolutionary Stochastic Gradient Descent (ESGD) framework for optimizing deep neural networks. ESGD combines SGD and gradient-free evolutionary algorithms as complementary algorithms in one framework in which the optimization alternates between the SGD step and evolution step to improve the average fitness of the population. With a back-off strategy in the SGD step and an elitist strategy in the evolution step, it guarantees that the best fitness in the population will never degrade. In addition, individuals in the population optimized with various SGD-based optimizers using distinct hyper-parameters in the SGD step are considered as competing species in a coevolution setting such that the complementarity of the optimizers is also taken into account. The effectiveness of ESGD is demonstrated across multiple applications including speech recognition, image recognition and language modeling, using networks with a variety of deep architectures.", "full_text": "Evolutionary Stochastic Gradient Descent for\n\nOptimization of Deep Neural Networks\n\nXiaodong Cui, Wei Zhang, Zolt\u00e1n T\u00fcske and Michael Picheny\n\n{cuix, weiz, picheny}@us.ibm.com, {Zoltan.Tuske}@ibm.com\n\nIBM Research AI\n\nIBM T. J. Watson Research Center\nYorktown Heights, NY 10598, USA\n\nAbstract\n\nWe propose a population-based Evolutionary Stochastic Gradient Descent (ESGD)\nframework for optimizing deep neural networks. ESGD combines SGD and\ngradient-free evolutionary algorithms as complementary algorithms in one frame-\nwork in which the optimization alternates between the SGD step and evolution\nstep to improve the average \ufb01tness of the population. 
With a back-off strategy in the SGD step and an elitist strategy in the evolution step, it guarantees that the best fitness in the population will never degrade. In addition, individuals in the population optimized with various SGD-based optimizers using distinct hyper-parameters in the SGD step are considered as competing species in a coevolution setting such that the complementarity of the optimizers is also taken into account. The effectiveness of ESGD is demonstrated across multiple applications including speech recognition, image recognition and language modeling, using networks with a variety of deep architectures.

1 Introduction

Stochastic gradient descent (SGD) is the dominant technique in deep neural network optimization [1]. Over the years, a wide variety of SGD-based algorithms have been developed [2, 3, 4, 5]. SGD algorithms have proved to be effective in the optimization of large-scale deep learning models. Meanwhile, gradient-free evolutionary algorithms (EA) [6, 7, 8, 9, 10] have also been used in various applications. They represent another family of so-called black-box optimization techniques which are well suited for some non-linear, non-convex or non-smooth optimization problems. Biologically inspired, population-based EA make no assumptions about the optimization landscape. The population evolves based on genetic variation and selection towards better solutions of the problems of interest. In deep learning applications, EA such as genetic algorithms (GA), evolution strategies (ES) and neuroevolution have been used for optimizing neural network architectures [11, 12, 13, 14, 15] and tuning hyper-parameters [16, 17]. Applying EA to the direct optimization of deep neural networks is less common. In [18], a simple EA is shown to be competitive with SGD when optimizing a small neural network (around 1,000 parameters).
However, competitive performance on state-of-the-art deep neural networks with complex architectures and many more parameters is yet to be seen.
The complementarity between SGD and EA is worth investigating. While SGD optimizes objective functions based on their gradient or curvature information, gradient-free EA are sometimes advantageous when dealing with complex and poorly-understood optimization landscapes. Furthermore, EA are population-based, so their computation is intrinsically parallel and their implementation is well suited to large-scale distributed optimization. In this paper we propose Evolutionary Stochastic Gradient Descent (ESGD), a framework that combines the merits of SGD and EA by treating them as complementary optimization techniques.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Given an optimization problem, ESGD works with a population of candidate solutions as individuals. Each individual represents a set of model parameters to be optimized by an optimizer (e.g. conventional SGD, Nesterov accelerated SGD or ADAM) with a distinct set of hyper-parameters (e.g. learning rate and momentum). Optimization is carried out by alternating SGD and EA in a stage-wise manner in each generation of the evolution. Following the EA terminology [6, 19, 20], consider each individual in the population as a "species". Over the course of any single generation, each species evolves independently in the SGD step and then interacts with the others in the EA step. This has the effect of producing more promising candidate solutions for the next generation, which is coevolution in a broad sense. Therefore, ESGD not only integrates EA and SGD as complementary optimization strategies but also makes use of complementary optimizers under this coevolution mechanism. We evaluated ESGD in a variety of tasks.
Experimental results showed the effectiveness of ESGD across all of these tasks in improving performance.

2 Related Work

The proposed ESGD is pertinent to neuroevolution [21, 22], which consists of a broad family of techniques that evolve neural networks based on EA. A large amount of work in this domain is devoted to optimizing the networks with respect to their architectures and hyper-parameters [11, 12, 15, 16, 23, 24]. Recently, remarkable progress has been made in reinforcement learning (RL) using ES [22, 25, 26, 27, 28]. In the reported work, EA is utilized as an alternative approach to SGD and is able to compete with state-of-the-art SGD-based performance in RL with deep architectures. It shows that EA works surprisingly well in RL, where only an imperfect gradient with respect to the final performance is available. In our work, rather than treating EA as an alternative optimization paradigm to replace SGD, the proposed ESGD attempts to integrate the two as complementary paradigms to optimize the parameters of networks.
The ESGD proposed in this paper carries out population-based optimization which deals with a set of models simultaneously. Many of the neuroevolution approaches also belong to this category. Recently, population-based techniques have also been applied to optimize neural networks with deep architectures, most notably population-based training (PBT) in [17]. Although both ESGD and PBT are population-based optimization strategies whose motivations are similar in spirit, there are clear differences between the two. While evolution is only used for optimizing the hyper-parameters in PBT, ESGD treats EA and SGD as complementary optimizers to directly optimize model parameters and only indirectly optimize hyper-parameters. We investigate ESGD in the conventional setting of supervised learning of deep neural networks with a fixed architecture without explicit tuning of hyper-parameters.
More importantly, ESGD uses a model back-off and elitist strategy to give a theoretical guarantee that the best model in the population will never degrade.
The idea of coevolution is used in the design of ESGD, where candidates under different optimizers can be considered as competing species. Coevolution has been widely employed for improved neuroevolution [19, 20, 29, 30], but in cooperative coevolution schemes the species typically represent subcomponents of a solution in order to decompose difficult high-dimensional problems. In ESGD, the coevolution is carried out on competing optimizers to take advantage of their complementarity.

3 Evolutionary SGD

3.1 Problem Formulation

Consider the supervised learning problem. Suppose X ⊆ R^{d_x} is the input space and Y ⊆ R^{d_y} is the output (label) space. The goal of learning is to estimate a function h that maps from the input to the output

    h(x; θ) : X → Y    (1)

where x ∈ X and h comes from a family of functions parameterized by θ ∈ R^d. A loss function ℓ(h(x; θ), y) is defined on X × Y to measure the closeness between the prediction h(x; θ) and the label y ∈ Y. A risk function R(θ) for a given θ is defined as the expected loss over the underlying joint distribution p(x, y):

    R(θ) = E_{(x,y)}[ℓ(h(x; θ), y)]    (2)

We want to find a function h(x; θ*) that minimizes this expected risk. In practice, we only have access to a set of training samples {(x_i, y_i)}_{i=1}^{n} ⊆ X × Y which are independently drawn from p(x, y). Accordingly, we minimize the following empirical risk with respect to the n samples

    R_n(θ) = (1/n) Σ_{i=1}^{n} ℓ(h(x_i; θ), y_i) ≜ (1/n) Σ_{i=1}^{n} l_i(θ)    (3)

where l_i(θ) ≜ ℓ(h(x_i; θ), y_i).
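As a concrete illustration of Eq.3, the empirical risk is simply the average per-sample loss, and it is this quantity that ESGD later reuses as the fitness function. The following minimal NumPy sketch uses an illustrative linear model h(x; θ) = x·θ with squared loss; the model, loss and all names are assumptions made for the example, not the models used in the paper.

```python
import numpy as np

def empirical_risk(theta, X, Y, loss):
    """R_n(theta) = (1/n) * sum_i loss(h(x_i; theta), y_i), as in Eq. 3."""
    preds = X @ theta  # illustrative linear model h(x; theta) = x . theta
    return float(np.mean([loss(p, y) for p, y in zip(preds, Y)]))

def squared_loss(pred, y):
    # illustrative per-sample loss l(h(x; theta), y)
    return (pred - y) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
theta_true = np.arange(5, dtype=float)
Y = X @ theta_true  # noiseless labels, so theta_true attains zero empirical risk

assert empirical_risk(theta_true, X, Y, squared_loss) < 1e-12
assert empirical_risk(np.zeros(5), X, Y, squared_loss) > 1.0
```

Under ESGD's convention this empirical risk is the fitness f(θ) to be minimized for each individual θ in the population.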
Under stochastic programming, Eq.3 can be cast as

    R_n(θ) = E_ω[l_ω(θ)]    (4)

where ω ∼ Uniform{1, ···, n}. In the conventional SGD setting, at iteration k, a sample (x_{i_k}, y_{i_k}), i_k ∈ {1, ···, n}, is drawn at random and the stochastic gradient ∇l_{i_k} is then used to update θ with an appropriate stepsize α_k > 0:

    θ_{k+1} = θ_k − α_k ∇l_{i_k}(θ_k).    (5)

In conventional SGD optimization of Eq.3 or Eq.4, there is only one parameter vector θ under consideration. We further assume θ follows some distribution p(θ) and consider the expected empirical risk over p(θ)

    J = E_θ[R_n(θ)] = E_θ[E_ω[l_ω(θ)]]    (6)

In practice, a population of µ candidate solutions, {θ_j}_{j=1}^{µ}, is drawn and we deal with the following average empirical risk of the population

    J_µ = (1/µ) Σ_{j=1}^{µ} R_n(θ_j) = (1/µ) Σ_{j=1}^{µ} ( (1/n) Σ_{i=1}^{n} l_i(θ_j) )    (7)

Eq.6 and Eq.7 formulate the objective function of the proposed ESGD algorithm. Following the EA terminology, we interpret the empirical risk R_n(θ) given parameter θ as the fitness function of θ, which we want to minimize.¹ We want to choose a population of parameters {θ_j}_{j=1}^{µ} such that the whole population, or a selected subset of it, has the best average fitness value.
Definition 1 (m-elitist average fitness). Let Ψ_µ = {θ_1, ···, θ_µ} be a population with µ individuals θ_j and let f be a fitness function associated with each individual in the population. Rank the individuals in ascending order

    f(θ_{1:µ}) ≤ f(θ_{2:µ}) ≤ ··· ≤ f(θ_{µ:µ})    (8)

where θ_{k:µ} denotes the k-th best individual of the population [9].
The m-elitist average fitness of Ψ_µ is defined to be the average fitness of the first m best individuals (1 ≤ m ≤ µ):

    J_{m̄:µ} = (1/m) Σ_{k=1}^{m} f(θ_{k:µ})    (9)

Note that when m = µ, J_{m̄:µ} amounts to the average fitness of the whole population. When m = 1, J_{m̄:µ} = f(θ_{1:µ}), the fitness of the single best individual of the population.

3.2 Algorithm

ESGD iteratively optimizes the m-elitist average fitness of the population defined in Eq.9. The evolution inside each ESGD generation for improving J_{m̄:µ} alternates between the SGD step, where each individual θ_j is updated using the stochastic gradient of the fitness function R_n(θ_j), and the evolution step, where the gradient-free EA is applied using certain transformation and selection operators. The overall procedure is given in Algorithm 1.
To initialize ESGD, a parent population Ψ_µ with µ individuals is first created. This population evolves in generations. Each generation consists of an SGD step followed by an evolution step. In the SGD

¹ Conventionally one wants to increase the fitness.
But to keep the notation uncluttered we will define the fitness function here as the risk function, which we want to minimize.

Algorithm 1: Evolutionary Stochastic Gradient Descent (ESGD)
Input: generations K, SGD steps K_s, evolution steps K_v, parent population size µ, offspring population size λ and elitist level m.
Initialize population Ψ_µ^{(0)} ← {θ_1^{(0)}, ···, θ_µ^{(0)}};
for k = 1 : K do                               // K generations
    Update population Ψ_µ^{(k)} ← Ψ_µ^{(k−1)};
    for j = 1 : µ do                           // in parallel
        Pick an optimizer π_j^{(k)} for individual θ_j^{(k)};
        Select hyper-parameters of π_j^{(k)} and set a learning schedule;
        for s = 1 : K_s do                     // K_s SGD steps
            SGD update of individual θ_j^{(k)} using π_j^{(k)};
            If the fitness degrades, the individual backs off to the previous step s−1.
        end
    end
    for v = 1 : K_v do                         // K_v evolution steps
        Generate offspring population Ψ_λ^{(k)} ← {θ_1^{(k)}, ···, θ_λ^{(k)}};
        Ψ_{µ+λ}^{(k)} ← Ψ_µ^{(k)} ∪ Ψ_λ^{(k)};
        Sort the fitness of the parent and offspring population Ψ_{µ+λ}^{(k)};
        Select the top m (m ≤ µ) individuals with the best fitness (m-elitist);
        Update population Ψ_µ^{(k)} by combining the m-elitist and µ−m randomly selected non-m-elitist candidates;
    end
end

step, an SGD-based optimizer π_j with certain hyper-parameters and a learning schedule is selected for each individual θ_j, which is then updated for K_s epochs. In this step there is no interaction between the optimizers; from the EA perspective, their gene isolation as species is preserved. After each epoch, if the individual has a degraded fitness, θ_j backs off to the previous epoch. After the SGD step, the gradient-free evolution step follows.
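Putting Algorithm 1 into code, one ESGD generation can be sketched roughly as follows. This is a simplified illustration under several assumptions: parameters are plain NumPy vectors, fitness is evaluated exactly, the SGD step is reduced to a single deterministic gradient update per individual, and offspring are generated by intermediate recombination plus Gaussian mutation as described later in Section 3.3. All names (esgd_generation, lrs, etc.) are illustrative, not from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def esgd_generation(pop, fitness, grad, lrs, lam=8, rho=2, m=3, sigma=0.1):
    """One (simplified) ESGD generation: SGD step with back-off,
    then an evolution step with m-elitist selection."""
    mu = len(pop)
    # --- SGD step: each individual keeps its own optimizer hyper-parameters ---
    sgd_pop = []
    for theta, lr in zip(pop, lrs):
        cand = theta - lr * grad(theta)  # one plain gradient update
        # back-off: keep the update only if the fitness did not degrade
        sgd_pop.append(cand if fitness(cand) <= fitness(theta) else theta)
    # --- evolution step: lambda offspring via recombination + mutation ---
    f = np.array([fitness(t) for t in sgd_pop])
    w = f.max() - f + 1e-12  # roulette wheel: lower (better) fitness -> larger weight
    w /= w.sum()
    offspring = []
    for _ in range(lam):
        parents = [sgd_pop[i] for i in rng.choice(mu, size=rho, p=w)]
        child = np.mean(parents, axis=0) + rng.normal(0.0, sigma, size=pop[0].shape)
        offspring.append(child)
    # --- m-elitist: keep the m best, fill the rest randomly ---
    combined = sgd_pop + offspring
    order = np.argsort([fitness(t) for t in combined])
    keep = list(order[:m]) + list(rng.choice(order[m:], size=mu - m, replace=False))
    return [combined[i] for i in keep]

# Toy usage: minimize f(theta) = ||theta||^2 with a population of 6 individuals.
fitness = lambda t: float(np.sum(t * t))
grad = lambda t: 2.0 * t
pop = [rng.normal(size=4) for _ in range(6)]
lrs = [0.05 * (j + 1) for j in range(6)]  # a distinct learning rate per individual
best = min(fitness(t) for t in pop)
for _ in range(5):
    pop = esgd_generation(pop, fitness, grad, lrs)
    new_best = min(fitness(t) for t in pop)
    assert new_best <= best + 1e-12  # best fitness never degrades
    best = new_best
```

The assertion in the toy loop illustrates the non-degradation guarantee discussed next: back-off keeps each individual at least as fit after the SGD step, and m-elitist selection retains the best of the combined parent/offspring population.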
In this step, individuals in the parent population Ψ_µ start interacting via model combination and mutation to produce an offspring population Ψ_λ with λ offspring. An m-elitist strategy is applied to the combined population Ψ_{µ+λ} = Ψ_µ ∪ Ψ_λ: the m (m ≤ µ) individuals with the best fitness are selected, together with µ−m randomly selected individuals from the rest, to form the new parent population Ψ_µ for the next generation.
The following theorem shows that the proposed ESGD given in Algorithm 1 guarantees that the m-elitist average fitness will never degrade.
Theorem 1. Let Ψ_µ be a population with µ individuals {θ_j}_{j=1}^{µ}. Suppose Ψ_µ evolves according to the ESGD algorithm given in Algorithm 1 with back-off and m-elitist. Then for each generation k,

    J_{m̄:µ}^{(k)} ≤ J_{m̄:µ}^{(k−1)},  k ≥ 1    (10)

The proof of the theorem is given in the supplementary material. From the theorem, we also have the following corollary regarding the m-elitist average fitness.
Corollary 1. ∀ m′, 1 ≤ m′ ≤ m, we have

    J_{m̄′:µ}^{(k)} ≤ J_{m̄′:µ}^{(k−1)},  for k ≥ 1.

In particular, for m′ = 1, we have

    f^{(k)}(θ_{1:µ}) ≤ f^{(k−1)}(θ_{1:µ}),  for k ≥ 1.    (11)

That is, the fitness of the best individual in the population never degrades.

3.3 Implementation

In this section, we give the implementation details of ESGD. The initial population is created either by randomized initialization of the weights of the networks or by perturbing some existing networks. In the SGD step of each generation, a family of SGD-based optimizers (e.g. conventional SGD and ADAM) is considered. For each selected optimizer, a set of hyper-parameters (e.g.
learning rate, momentum, Nesterov acceleration and dropout rate) is chosen and a learning schedule is set. The hyper-parameters are randomly selected from a pre-defined range. In particular, an annealing schedule is applied to the range of the learning rate over generations.
In the evolution step there is a wide variety of evolutionary algorithms that can be considered. Despite following similar biological principles, these algorithms differ in their evolution schemes. In this work, we use the (µ/ρ+λ)-ES [6]. Specifically, we have the following transformation and selection schemes:

1. Encoding: Parameters are vectorized into a real-valued vector in the continuous space.

2. Selection, recombination and mutation: In generation k, ρ individuals are selected from the parent population Ψ_µ^{(k)} using roulette wheel selection, where the probability of selection is proportional to the fitness of an individual [7]; an individual with better fitness has a higher probability of being selected. λ offspring are generated to form the offspring population Ψ_λ^{(k)} by intermediate recombination followed by a perturbation with zero-mean Gaussian noise, as given in Eq.12:

    θ_i^{(k)} = (1/ρ) Σ_{j=1}^{ρ} θ_j^{(k)} + ε_i^{(k)}    (12)

where θ_i^{(k)} ∈ Ψ_λ^{(k)}, θ_j^{(k)} ∈ Ψ_µ^{(k)} and ε_i^{(k)} ∼ N(0, σ_k²). An annealing schedule may be applied to the mutation strength σ_k² over generations.

3. Fitness evaluation: After the offspring population is generated, the fitness value of each individual in Ψ_{µ+λ}^{(k)} = Ψ_µ^{(k)} ∪ Ψ_λ^{(k)} is evaluated.

4. m-elitist: m (1 ≤ m ≤ µ) individuals with the best fitness are first selected from Ψ_{µ+λ}^{(k)}.
The remaining µ−m individuals are then randomly selected from the other µ+λ−m candidates in Ψ_{µ+λ}^{(k)} to form the parent population Ψ_µ^{(k+1)} of the next generation.

After ESGD training is finished, the candidate with the best fitness in the population, θ_{1:µ}, is used as the final model for classification or regression tasks. All the SGD updates and fitness evaluations are carried out in parallel on a set of GPUs.

4 Experiments

We evaluate the performance of the proposed ESGD on large vocabulary continuous speech recognition (LVCSR), image recognition and language modeling. We compare ESGD with two baseline systems. The first baseline system, denoted "single baseline" when reporting experimental results, is a well-established single-model system for the application under investigation, trained using a certain SGD-based optimizer with appropriately selected hyper-parameters following a certain training schedule. The second baseline system, denoted "population baseline", is a population-based system with the same population size as ESGD. The optimizers considered are SGD and ADAM, except in image recognition where only SGD variants are considered. The optimizers together with their hyper-parameters are randomly decided at the beginning and then fixed for the rest of the training with a pre-determined training schedule. This baseline system is used to mimic the typical hyper-parameter tuning process when training deep neural network models. We also conducted ablation experiments where the evolution step is removed from ESGD to investigate the impact of evolution. The m-elitist strategy is applied to 60% of the parent population.

4.1 Speech Recognition

BN50 The 50-hour Broadcast News corpus is a widely used dataset for speech recognition [31]. The 50 hours of data consist of a 45-hour training set and a 5-hour validation set.
The test set comprises 3 hours of audio from 6 broadcasts. The acoustic models we used in the experiments are fully-connected feed-forward networks with 6 hidden layers and one softmax output layer with 5,000 states. There are 1,024 units in the first 5 hidden layers and 512 units in the last hidden layer. Sigmoid activation functions are used for all hidden units except the bottom 3 hidden layers, in which ReLU functions are used. The fundamental acoustic features are 13-dimensional Perceptual Linear Predictive (PLP) [32] coefficients. The input to the network is 9 consecutive 40-dimensional speaker-adapted PLP features after linear discriminant analysis (LDA) projection from adjacent frames.
SWB300 The 300-hour Switchboard dataset is another widely used dataset in speech recognition [31]. The test set is the Hub5 2000 evaluation set composed of two parts: 2.1 hours of switchboard (SWB) data from 40 speakers and 1.6 hours of call-home (CH) data from 40 different speakers. The acoustic models are bi-directional long short-term memory (LSTM [33]) networks with 4 LSTM layers. Each layer contains 1,024 cells, 512 in each direction. On top of the LSTM layers, there is a linear bottleneck layer with 256 hidden units followed by a softmax output layer with 32,000 units. The LSTMs are unrolled over 21 frames. The input dimensionality is 140, comprising 40-dimensional speaker-adapted PLP features after LDA projection and 100-dimensional speaker embedding vectors (i-vectors [34]).
The networks are optimized under the cross-entropy criterion. The single baseline is trained using SGD with a batch size of 128 without momentum for 20 epochs. The initial learning rate is 0.001 for BN50 and 0.025 for SWB300. The learning rate is annealed by 2x every time the validation loss of the current epoch is worse than that of the previous epoch, in which case the model is also backed off to the previous epoch.
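The anneal-and-back-off schedule just described can be sketched as a small training loop. The sketch below is a minimal illustration, assuming the caller supplies a run_epoch function (one epoch of SGD at a given learning rate) and a val_loss function; both names and the toy scalar example are assumptions for the sketch, not the paper's training code.

```python
def anneal_with_backoff(model, run_epoch, val_loss, n_epochs, lr):
    """Whenever an epoch worsens the validation loss, restore the
    previously kept model (back off) and halve the learning rate (anneal by 2x)."""
    kept, kept_loss = model, val_loss(model)
    for _ in range(n_epochs):
        cand = run_epoch(kept, lr)  # one epoch of SGD at the current rate
        cand_loss = val_loss(cand)
        if cand_loss < kept_loss:
            kept, kept_loss = cand, cand_loss
        else:
            lr *= 0.5  # anneal; 'kept' is unchanged, i.e. the model backs off

    return kept, kept_loss

# Toy usage: "train" a scalar to minimize f(x) = x^2. The initial rate 1.1
# overshoots, so the schedule must anneal before it can make progress.
final_x, final_loss = anneal_with_backoff(
    model=1.0,
    run_epoch=lambda x, lr: x - lr * 2.0 * x,  # gradient step on f(x) = x^2
    val_loss=lambda x: x * x,
    n_epochs=10,
    lr=1.1,
)
assert final_loss < 1e-6
```

Because a worsened epoch is always discarded, the retained model's validation loss is non-increasing over epochs, which is the same monotonicity property the generation-level back-off gives ESGD.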
The population sizes for both the population baseline and ESGD are 100. The offspring population of ESGD consists of 400 individuals. In ESGD, after 15 generations (K_s = 1), a 5-epoch fine-tuning is applied to each individual with a small learning rate. The details of the experimental configuration are given in the supplementary material.
Table 1 shows the results of the two baselines and ESGD on BN50 and SWB300. Both the validation losses and word error rates (WERs) are presented for the best individual and the top 15 individuals of the population. For the top 15 individuals, ranges of losses and WERs are presented. From the table, it can be observed that the best individual of the ESGD population significantly improves the losses and also improves the WERs over both the single baseline and the population baseline. Note that the model with the best loss may not give the best WER in some cases, although typically the two correlate well. The ablation experiment shows that the interaction between individuals in the evolution step of ESGD is helpful, and removing the evolution step hurts the performance in both cases.

4.2 Image Recognition

The CIFAR10 [35] dataset is a widely used image recognition benchmark. It contains a 50K-image training set and a 10K-image test set. Each image is a 32x32 3-channel color image. The model used in this paper is a depth-20 ResNet [36] with a 64x10 linear layer at the end. The ResNet is trained under the cross-entropy criterion with batch normalization. Note that CIFAR10 does not include a validation set. To be consistent with the training set used in the literature, we do not split a validation set from the training set.
Instead, we evaluate the training fitness over the entire training set. For the single-run baseline, we follow the recipe proposed in [37], in which the initial learning rate is 0.1 and is annealed by 10x after 81 epochs and by another 10x at epoch 122. Training finishes in 160 epochs. The model is trained by SGD using Nesterov acceleration with a momentum of 0.9. The classification error rate of the single baseline is 8.34%. In practice, we found that for this workload ESGD works best when only the SGD optimizer with randomized hyper-parameters (e.g., learning rate and momentum) is considered. We record the detailed experimental configuration in the supplementary material. The CIFAR10 results in Table 1 indicate that ESGD clearly outperforms the two baselines in both training fitness and classification error rate.

4.3 Language Modeling

The evaluation of the ESGD algorithm is also carried out on the standard Penn Treebank (PTB) language modeling dataset [38]. Hyper-parameters for this task have been massively optimized over the previous years; the current state-of-the-art results are reported in [39] and [40]. Hence, our focus is to investigate the effect of ESGD on top of a state-of-the-art 1-layer LSTM LM training recipe [40].
The results are summarized in Table 1. Starting from scratch, the single baseline model converged after 574 epochs and achieved perplexities of 67.3 and 64.6 on the validation and evaluation sets. The population baseline models are initialized by cloning the single baseline and generating offspring by mutation. Then optimizers (SGD and ADAM) are randomly picked and the models are trained for 300 epochs. Randomizing the optimizer includes additional hyper-parameters such as the various dropout ratios/models ([41, 42, 43, 40, 44, 45]), batch size, etc. For comparison, a warm restart of the single baseline gives a 0.2 perplexity improvement on the test set [46].
Using ESGD, we also relax the back-off to the initial model: it is applied with probability p_backoff = 0.7 in each generation. The single baseline model is always added to each generation without any update, which guarantees that the population cannot perform worse than the single baseline. The detailed parameter settings are provided in the supplementary material. ESGD without the evolution step clearly shows difficulties: the best model's "gene" cannot become prevalent in the successive generations in proportion to its fitness value. In summary, we observe a small but consistent gain from fine-tuning an existing, highly optimized model with ESGD.
Note that the above implementation for the PTB experiments can be viewed as another variant of ESGD: suppose we have a well-trained model (e.g. a competitive baseline model) which is always inserted into the population in each generation of the evolution. The m-elitist strategy will then guarantee that the best model in the population is no worse than this well-trained model, even if we relax the back-off in SGD with some probability.
In Fig.1 we show the fitness as a function of ESGD generations in the four investigated tasks.

Table 1: Performance of the single baseline, population baseline and ESGD on BN50, SWB300, CIFAR10 and PTB. For ESGD, the table shows the losses and classification error rates of the best individual as well as of the top 15 individuals in the population for the first three tasks. For PTB, the perplexities (ppl), i.e. the exponential of the loss, on the validation set and test set are presented.
The table also presents the results of the ablation experiments where the evolution step is removed from ESGD.

BN50                 | loss θ_{1:µ} | loss [θ_{1:µ}, θ_{15:µ}] | WER θ_{1:µ} | WER [θ_{1:µ}, θ_{15:µ}]
single baseline      | 2.082        | –                        | 17.4        | –
population baseline  | 2.029        | [2.029, 2.062]           | 17.1        | [16.9, 17.6]
ESGD w/o evolution   | 2.036        | [2.036, 2.075]           | 17.4        | [17.1, 17.7]
ESGD                 | 1.916        | [1.916, 1.920]           | 16.4        | [16.2, 16.4]

SWB300               | loss θ_{1:µ} | loss [θ_{1:µ}, θ_{15:µ}] | SWB WER θ_{1:µ} | SWB WER [θ_{1:µ}, θ_{15:µ}] | CH WER θ_{1:µ} | CH WER [θ_{1:µ}, θ_{15:µ}]
single baseline      | 1.648        | –                        | 10.4            | –                           | 18.5           | –
population baseline  | 1.645        | [1.645, 1.666]           | 10.4            | [10.3, 10.7]                | 18.2           | [18.2, 18.8]
ESGD w/o evolution   | 1.626        | [1.626, 1.641]           | 10.3            | [10.3, 10.7]                | 18.3           | [18.0, 18.6]
ESGD                 | 1.551        | [1.551, 1.557]           | 10.0            | [10.0, 10.1]                | 18.2           | [18.0, 18.3]

CIFAR10              | loss θ_{1:µ} | loss [θ_{1:µ}, θ_{15:µ}] | error rate θ_{1:µ} | error rate [θ_{1:µ}, θ_{15:µ}]
single baseline      | 0.0176       | –                        | 8.34               | –
population baseline  | 0.0151       | [0.0151, 0.0164]         | 8.24               | [7.90, 8.69]
ESGD w/o evolution   | 0.0147       | [0.0147, 0.0166]         | 8.49               | [7.86, 8.53]
ESGD                 | 0.0142       | [0.0142, 0.0159]         | 7.52               | [7.43, 8.10]

PTB                  | validation ppl θ_{1:µ} | validation ppl [θ_{1:µ}, θ_{15:µ}] | test ppl θ_{1:µ} | test ppl [θ_{1:µ}, θ_{15:µ}]
single baseline      | 67.27                  | –                                  | 64.58            | –
population baseline  | 66.58                  | [66.58, 68.04]                     | 63.96            | [63.96, 64.58]
ESGD w/o evolution   | 67.27                  | [67.27, 79.25]                     | 64.58            | [64.58, 76.64]
ESGD                 | 66.29                  | [66.29, 66.30]                     | 63.73            | [63.72, 63.74]

Figure 1: Fitness as a function of ESGD generations for BN50, SWB300, CIFAR10 and PTB.
The three curves represent the single baseline (red), the top 15 individuals of the population baseline (orange) and ESGD (green); the latter two are illustrated as bands. The lower bounds of the ESGD bands indicate the best fitness values in the populations, which are always non-increasing. Note that in the PTB case this monotonicity is violated, since the back-off strategy was only applied with some probability, which explains the increase of perplexity in some generations.

Figure 2: Percentage of offspring selected into the 60% m-elitist over generations of ESGD in BN50, SWB300, CIFAR10 and PTB.

4.4 Discussion

Population diversity It is important to maintain a good population diversity in EA to avoid premature convergence due to homogeneous fitness among individuals. In experiments, we find that the m-elitist strategy applied to the whole population, although it yields a better overall average fitness, can give rise to premature convergence in the early stage. Therefore, we set the m-elitist percentage to 60% of the population, and the remaining 40% of the population is generated by random selection. This m-elitist strategy is helpful in practice.
Population evolvement The SGD step of ESGD mimics a coevolution mechanism between competing species (individuals under different optimizers) where distinct species evolve independently. The evolution step of ESGD allows the species to interact with each other to hopefully produce promising candidate solutions for the next generation. Fig.2 shows the percentage of offspring selected into the 60% m-elitist for the next generation.
From the figure, in the early stage the population evolves dominantly based on SGD, since the offspring are worse than almost all the parents. However, in late generations the number of elite offspring increases, and the interaction between distinct optimizers starts to play an important role in selecting better candidate solutions.
Complementary optimizers In each generation of ESGD, an individual selects an optimizer from a pool of optimizers with certain hyper-parameters. In most of the experiments, the pool of optimizers consists of SGD variants and ADAM. It is often observed that ADAM tends to be aggressive in the early stage but plateaus quickly, whereas SGD starts slowly but can reach better local optima. ESGD can automatically choose optimizers and their appropriate hyper-parameters based on the fitness value during the evolution process, so that the merits of both SGD and ADAM can be combined to seek a better local optimal solution to the problem of interest. In the supplementary material, examples are given where we show the optimizers with their training hyper-parameters selected by the best individuals in ESGD in each generation. They indicate that over the generations different optimizers are automatically chosen by ESGD, giving rise to a better fitness value.
Parallel computation In the experiments of this paper, all SGD updates and EA fitness evaluations are carried out in parallel using multiple GPUs.
The SGD updates dominate the ESGD computation; the EA updates and fitness evaluations have a fairly small computational cost by comparison. Given sufficient computing resources (e.g. µ GPUs), ESGD should take about the same amount of time as one end-to-end vanilla SGD run. In practice, a trade-off has to be made between training time and performance under the constraint of the computational budget. In general, parallel computation is suitable and preferred for population-based optimization.

5 Conclusion

We have presented the population-based ESGD as an optimization framework that combines SGD and gradient-free evolutionary algorithms to exploit their complementarity. ESGD alternately optimizes the m-elitist average fitness of the population between an SGD step and an evolution step. The SGD step can be interpreted as a coevolution mechanism where individuals under distinct optimizers evolve independently and then interact with each other in the evolution step to hopefully create promising candidate solutions for the next generation. With an appropriate decision strategy, the fitness of the best individual in the population is guaranteed to be non-degrading. Extensive experiments have been carried out in three applications using various neural networks with deep architectures. The experimental results have demonstrated the effectiveness of ESGD.

References

[1] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.

[2] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

[3] D. P. Kingma and J. L. Ba. ADAM: a method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

[4] T. Tieleman and G. Hinton. Lecture 6.5,
RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

[5] Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Soviet Mathematics Doklady, 269:543–547, 1983.

[6] H.-G. Beyer and H.-P. Schwefel. Evolution strategies: a comprehensive introduction. Natural Computing, 1(1):3–52, 2002.

[7] D. E. Goldberg. Genetic algorithms in search, optimization and machine learning. Addison-Wesley Publishing Co., 1989.

[8] C. Igel. Neuroevolution for reinforcement learning using evolution strategies. In IEEE Congress on Evolutionary Computation (CEC), pages 2588–2595, 2003.

[9] N. Hansen. The CMA evolution strategy: a tutorial. arXiv preprint arXiv:1604.00772, 2016.

[10] I. Loshchilov. LM-CMA: an alternative to L-BFGS for large scale black-box optimization. Evolutionary Computation, 25(1):143–171, 2017.

[11] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin. Large-scale evolution of image classifiers. In International Conference on Machine Learning (ICML), pages 2902–2911, 2017.

[12] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.

[13] J. Liang, E. Meyerson, and R. Miikkulainen. Evolutionary architecture search for deep multitask networks. arXiv preprint arXiv:1803.03745, 2018.

[14] S. Ebrahimi, A. Rohrbach, and T. Darrell. Gradient-free policy architecture search and adaptation. In Conference on Robot Learning (CoRL), 2017.

[15] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, and B. Hodjat. Evolving deep neural networks. arXiv preprint arXiv:1703.00548, 2017.

[16] I. Loshchilov and F. Hutter.
CMA-ES for hyperparameter optimization of deep neural networks. In International Conference on Learning Representations (ICLR), workshop track, 2016.

[17] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, C. Fernando, and K. Kavukcuoglu. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.

[18] G. Morse and K. O. Stanley. Simple evolutionary optimization can rival stochastic gradient descent in neural networks. In The Genetic and Evolutionary Computation Conference (GECCO), pages 477–484, 2016.

[19] Z. Yang, K. Tang, and X. Yao. Large scale evolutionary optimization using cooperative coevolution. Information Sciences, 178(15):2985–2999, 2008.

[20] N. Garcia-Pedrajas, C. Hervas-Martinez, and J. Munoz-Perez. COVNET: a cooperative coevolutionary model for evolving artificial neural networks. IEEE Trans. on Neural Networks, 14(3):575–595, 2003.

[21] X. Yao. Evolving artificial neural networks. Proceedings of the IEEE, 87(9):1423–1447, 1999.

[22] F. P. Such, V. Madhavan, E. Conti, J. Lehman, K. O. Stanley, and J. Clune. Deep neuroevolution: genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567, 2017.

[23] M. Suganuma, S. Shirakawa, and T. Nagao. A genetic programming approach to designing convolutional neural network architectures. In The Genetic and Evolutionary Computation Conference (GECCO), pages 497–504, 2017.

[24] C. Liu, B. Zoph, J. Shlens, W. Hua, L.-J. Li, F.-F. Li, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. arXiv preprint arXiv:1712.00559, 2017.

[25] P. Chrabaszcz, I. Loshchilov, and F. Hutter. Back to basics: benchmarking canonical evolution strategies for playing Atari. arXiv preprint arXiv:1802.08842, 2018.

[26] T. Salimans, J. Ho, X.
Chen, S. Sidor, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

[27] X. Zhang, J. Clune, and K. O. Stanley. On the relationship between the OpenAI evolution strategy and stochastic gradient descent. arXiv preprint arXiv:1712.06564, 2017.

[28] J. Lehman, J. Chen, J. Clune, and K. O. Stanley. ES is more than just a traditional finite-difference approximator. arXiv preprint arXiv:1712.06568, 2017.

[29] F. Gomez, J. Schmidhuber, and R. Miikkulainen. Accelerated neural evolution through cooperatively coevolved synapses. Journal of Machine Learning Research, 9:937–965, 2008.

[30] N. Garcia-Pedrajas, C. Hervas-Martinez, and D. Ortiz-Boyer. Cooperative coevolution of artificial neural network ensembles for pattern recognition. IEEE Trans. on Evolutionary Computation, 9(3):271–302, 2005.

[31] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, pages 82–97, November 2012.

[32] H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 87(4):1738–1752, 1990.

[33] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[34] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):788–798, 2011.

[35] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

[36] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition.
In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[37] https://github.com/facebook/fb.resnet.torch.

[38] T. Mikolov, M. Karafiát, L. Burget, J. Cernocky, and S. Khudanpur. Recurrent neural network based language model. In Interspeech, pages 1045–1048, 2010.

[39] S. Merity, N. Keskar, and R. Socher. Regularizing and optimizing LSTM language models. In International Conference on Learning Representations (ICLR), 2018.

[40] K. Zolna, D. Arpit, D. Suhubdy, and Y. Bengio. Fraternal dropout. In International Conference on Learning Representations (ICLR), 2018.

[41] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

[42] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus. Regularization of neural networks using dropconnect. In International Conference on Machine Learning (ICML), volume 28, pages 1058–1066, 2013.

[43] Y. Gal and Z. Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In International Conference on Neural Information Processing Systems (NIPS), pages 1027–1035, 2016.

[44] X. Ma, Y. Gao, Z. Hu, Y. Yu, Y. Deng, and E. H. Hovy. Dropout with expectation-linear regularization. In International Conference on Learning Representations (ICLR), 2017.

[45] S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations (ICLR), 2017.

[46] I. Loshchilov and F. Hutter. SGDR: stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 2017.