{"title": "Evolution-Guided Policy Gradient in Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1188, "page_last": 1200, "abstract": "Deep Reinforcement Learning (DRL) algorithms have been successfully applied to a range of challenging control tasks. However, these methods typically suffer from three core difficulties: temporal credit assignment with sparse rewards, lack of effective exploration, and brittle convergence properties that are extremely sensitive to hyperparameters. Collectively, these challenges severely limit the applicability of these approaches to real world problems. Evolutionary Algorithms (EAs), a class of black box optimization techniques inspired by natural evolution, are well suited to address each of these three challenges. However, EAs typically suffer from high sample complexity and struggle to solve problems that require optimization of a large number of parameters. In this paper, we introduce Evolutionary Reinforcement Learning (ERL), a hybrid algorithm that leverages the population of an EA to provide diversified data to train an RL agent, and reinserts the RL agent into the EA population periodically to inject gradient information into the EA. ERL inherits EA's ability of temporal credit assignment with a fitness metric, effective exploration with a diverse set of policies, and stability of a population-based approach and complements it with off-policy DRL's ability to leverage gradients for higher sample efficiency and faster learning. 
Experiments in a range of challenging continuous control benchmarks demonstrate that ERL significantly outperforms prior DRL and EA methods.", "full_text": "Evolution-Guided Policy Gradient in Reinforcement\n\nLearning\n\nShauharda Khadka\n\nKagan Tumer\n\nCollaborative Robotics and Intelligent Systems Institute\n\nOregon State University\n\n{khadkas,kagan.tumer}@oregonstate.edu\n\nAbstract\n\nDeep Reinforcement Learning (DRL) algorithms have been successfully applied to\na range of challenging control tasks. However, these methods typically suffer from\nthree core dif\ufb01culties: temporal credit assignment with sparse rewards, lack of\neffective exploration, and brittle convergence properties that are extremely sensitive\nto hyperparameters. Collectively, these challenges severely limit the applicability\nof these approaches to real-world problems. Evolutionary Algorithms (EAs), a\nclass of black box optimization techniques inspired by natural evolution, are well\nsuited to address each of these three challenges. However, EAs typically suffer\nfrom high sample complexity and struggle to solve problems that require optimiza-\ntion of a large number of parameters. In this paper, we introduce Evolutionary\nReinforcement Learning (ERL), a hybrid algorithm that leverages the population of\nan EA to provide diversi\ufb01ed data to train an RL agent, and reinserts the RL agent\ninto the EA population periodically to inject gradient information into the EA. ERL\ninherits EA\u2019s ability of temporal credit assignment with a \ufb01tness metric, effective\nexploration with a diverse set of policies, and stability of a population-based ap-\nproach and complements it with off-policy DRL\u2019s ability to leverage gradients for\nhigher sample ef\ufb01ciency and faster learning. 
Experiments in a range of challenging continuous control benchmarks demonstrate that ERL significantly outperforms prior DRL and EA methods.

1 Introduction

Reinforcement learning (RL) algorithms have been successfully applied in a number of challenging domains, ranging from arcade games [35, 36] and board games [49] to robotic control tasks [3, 31]. A primary driving force behind the explosion of RL in these domains is its integration with powerful non-linear function approximators like deep neural networks. This partnership with deep learning, often referred to as Deep Reinforcement Learning (DRL), has enabled RL to successfully extend to tasks with high-dimensional input and action spaces. However, widespread adoption of these techniques to real-world problems is still limited by three major challenges: temporal credit assignment with long time horizons and sparse rewards, lack of diverse exploration, and brittle convergence properties.

First, associating actions with returns when a reward is sparse (only observed after a series of actions) is difficult. This is a common occurrence in most real-world domains and is often referred to as the temporal credit assignment problem [54]. Temporal Difference methods in RL use bootstrapping to address this issue but often struggle when the time horizons are long and the reward is sparse. Multi-step returns address this issue but are mostly effective in on-policy scenarios [10, 45, 46]. Off-policy multi-step learning [34, 48] has been demonstrated to be stable in recent work but requires complementary correction mechanisms like importance sampling, Retrace [37, 59] and V-trace [14], which can be computationally expensive and limiting.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Secondly, RL relies on exploration to find good policies and avoid converging prematurely to local optima. 
Effective exploration remains a key challenge for DRL operating on high-dimensional action and state spaces [41]. Many methods have been proposed to address this issue, ranging from count-based exploration [38, 55], intrinsic motivation [4] and curiosity [40] to variational information maximization [26]. A separate class of techniques emphasizes exploration by adding noise directly to the parameter space of agents [20, 41]. However, each of these techniques either relies on complex supplementary structures or introduces sensitive parameters that are task-specific. A general strategy for exploration that is applicable across domains and learning algorithms is an active area of research.

Finally, DRL methods are notoriously sensitive to the choice of their hyperparameters [25, 27] and often have brittle convergence properties [24]. This is particularly true for off-policy DRL methods that utilize a replay buffer to store and reuse past experiences [5]. The replay buffer is a vital component in enabling sample-efficient learning, but pairing it with a deep non-linear function approximator leads to extremely brittle convergence properties [13, 24].

One approach well suited to address these challenges in theory is evolutionary algorithms (EAs) [19, 50]. The use of a fitness metric that consolidates returns across an entire episode makes EAs indifferent to the sparsity of the reward distribution and robust to long time horizons [44, 53]. An EA's population-based approach also has the advantage of enabling diverse exploration, particularly when combined with explicit diversity maintenance techniques [9, 30]. Additionally, the redundancy inherent in a population also promotes robustness and stable convergence properties, particularly when combined with elitism [2]. A number of recent works have used EAs as an alternative to DRL with some success [8, 22, 44, 53]. 
However, EAs typically suffer from high sample complexity and often struggle to solve high-dimensional problems that require optimization of a large number of parameters. The primary reason behind this is EAs' inability to leverage the powerful gradient descent methods which are at the core of the more sample-efficient DRL approaches.

In this paper, we introduce Evolutionary Reinforcement Learning (ERL), a hybrid algorithm that incorporates EA's population-based approach to generate diverse experiences to train an RL agent, and transfers the RL agent into the EA population periodically to inject gradient information into the EA. The key insight here is that an EA can be used to address the core challenges within DRL without losing out on the ability to leverage gradients for higher sample efficiency. ERL inherits EA's ability to address temporal credit assignment by its use of a fitness metric that consolidates the return of an entire episode. ERL's selection operator, which operates based on this fitness, exerts a selection pressure towards regions of the policy space that lead to higher episode-wide return. This process biases the state distribution towards regions that have higher long-term returns. This is a form of implicit prioritization that is effective for domains with long time horizons and sparse rewards. Additionally, ERL inherits EA's population-based approach, leading to redundancies that serve to stabilize the convergence properties and make the learning process more robust. ERL also uses the population to combine exploration in the parameter space with exploration in the action space, which leads to diverse policies that explore the domain effectively. Figure 1 illustrates ERL's double-layered learning approach, where the same set of data (experiences) generated by the evolutionary population is used by the reinforcement learner. 
The recycling of the same data enables maximal information extraction from individual experiences, leading to improved sample efficiency. Experiments in a range of challenging continuous control benchmarks demonstrate that ERL significantly outperforms prior DRL and EA methods.

Figure 1: High level schematic of ERL highlighting the incorporation of EA's population-based learning with DRL's gradient-based optimization.

2 Background

A standard reinforcement learning setting is formalized as a Markov Decision Process (MDP) and consists of an agent interacting with an environment E over a number of discrete time steps. At each time step t, the agent receives a state st and maps it to an action at using its policy π. The agent receives a scalar reward rt and moves to the next state st+1. The process continues until the agent reaches a terminal state marking the end of an episode. The return Rt = Σ_{k=0}^∞ γ^k rt+k is the total accumulated return from time step t with discount factor γ ∈ (0, 1]. The goal of the agent is to maximize the expected return. The action-value function Qπ(s, a) describes the expected return from state s after taking action a and subsequently following policy π.

2.1 Deep Deterministic Policy Gradient (DDPG)

Policy gradient methods frame the goal of maximizing return as the minimization of a loss function L(θ) where θ parameterizes the agent. A widely used policy gradient method is Deep Deterministic Policy Gradient (DDPG) [31], a model-free RL algorithm developed for working with continuous, high-dimensional action spaces. DDPG uses an actor-critic architecture [54], maintaining a deterministic policy (actor) π : S → A, and an action-value function approximation (critic) Q : S × A → R. The critic's job is to approximate the actor's action-value function Qπ. 
Both the actor and the critic are parameterized by (deep) neural networks with weights θπ and θQ, respectively. Separate copies of the actor π′ and critic Q′ networks are kept as target networks for stability. These networks are updated periodically using the actor π and critic Q networks, modulated by a weighting parameter τ.

A behavioral policy is used to explore during training. The behavioral policy is simply a noisy version of the policy: πb(s) = π(s) + N(0, 1), where N is temporally correlated noise generated using the Ornstein-Uhlenbeck process [58]. The behavioral policy is used to generate experience in the environment. After each action, the tuple (st, at, rt, st+1), containing the current state, actor's action, observed reward and the next state, respectively, is saved into a cyclic replay buffer R. The actor and critic networks are updated by randomly sampling mini-batches from R. The critic is trained by minimizing the loss function:

L = 1/T Σi (yi − Q(si, ai|θQ))², where yi = ri + γQ′(si+1, π′(si+1|θπ′)|θQ′)

The actor is trained using the sampled policy gradient:

∇θπ J ≈ 1/T Σi ∇aQ(s, a|θQ)|s=si,a=ai ∇θπ π(s|θπ)|s=si

The sampled policy gradient with respect to the actor's parameters θπ is computed by backpropagation through the combined actor and critic network.

2.2 Evolutionary Algorithm

Evolutionary algorithms (EAs) are a class of search algorithms with three primary operators: new solution generation, solution alteration, and selection [19, 50]. These operations are applied on a population of candidate solutions to continually generate novel solutions while probabilistically retaining promising ones. 
The selection operation is generally probabilistic, where solutions with higher fitness values have a higher probability of being selected. Assuming higher fitness values are representative of good solution quality, the overall quality of solutions will improve with each passing generation. In this work, each individual in the evolutionary algorithm defines a deep neural network. Mutation represents random perturbations to the weights (genes) of these neural networks. The evolutionary framework used here is closely related to evolving neural networks, and is often referred to as neuroevolution [18, 33, 43, 52].

3 Motivating Example

Consider the standard Inverted Double Pendulum task from OpenAI gym [6], a classic continuous control benchmark. Here, an inverted double pendulum starts in a random position, and the goal of the controller is to keep it upright. The task has an 11-dimensional state space and a 1-dimensional action space, and is a fairly easy problem to solve for most modern algorithms. Figure 2 (left) shows the comparative performance of DDPG, EA and our proposed approach, Evolutionary Reinforcement Learning (ERL), which combines the mechanisms within EA and DDPG.

Figure 2: Comparative performance of DDPG, EA and ERL in a (left) standard and (right) hard Inverted Double Pendulum task. DDPG solves the standard task easily but fails at the hard task. Both tasks are equivalent for the EA. ERL is able to inherit the best of DDPG and EA, successfully solving both tasks similar to EA while leveraging gradients for greater sample efficiency similar to DDPG.

Unsurprisingly, both ERL and DDPG solve the task in under 3000 episodes. EA solves the task eventually but is much less sample efficient, requiring approximately 22000 episodes. 
ERL and DDPG are able to leverage gradients that enable faster learning, while EA, without access to gradients, is slower.

We introduce the hard Inverted Double Pendulum by modifying the original task such that the reward is disbursed to the controller only at the end of the episode. During an episode, which can consist of up to 1000 timesteps, the controller gets a reward of 0 at each step except for the last one, where the cumulative reward is given to the agent. Since the agent does not get regular feedback on its actions but has to wait a long time to receive it, the task poses an extremely difficult temporal credit assignment challenge.

Figure 2 (right) shows the comparative performance of the three algorithms in the hard Inverted Double Pendulum task. Since EA does not use intra-episode interactions and computes fitness based only on the cumulative reward of the episode, the hard Inverted Double Pendulum task is equivalent to its standard instance for an EA learner. EA retains its performance from the standard task and solves the task after 22000 episodes. DDPG, on the other hand, fails to solve the task entirely. The deceptiveness and sparsity of the reward, where the agent has to wait up to 1000 steps to receive a useful feedback signal, creates a difficult temporal credit assignment problem that DDPG is unable to deal with effectively. In contrast, ERL, which inherits the temporal credit assignment benefits of an encompassing fitness metric from EA, is able to successfully solve the task. Even though the reward is sparse and deceptive, ERL's selection operator provides a selection pressure for policies with high episode-wide return (fitness). This biases the distribution of states stored in the buffer towards states with higher long-term payoff, enabling ERL to successfully solve the task. 
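The hard variant amounts to deferring all reward to an episode's final step. A minimal sketch of such a modification, assuming a Gym-style step API (the wrapper name is illustrative, not from the paper):

```python
class EpisodicRewardWrapper:
    """Defers all reward to the final step of an episode: intermediate
    steps report 0, the terminal step reports the accumulated sum."""

    def __init__(self, env):
        self.env = env
        self._accumulated = 0.0

    def reset(self):
        self._accumulated = 0.0
        return self.env.reset()

    def step(self, action):
        state, reward, done, info = self.env.step(action)
        self._accumulated += reward
        # Zero reward until termination; the full return arrives at the end.
        return state, (self._accumulated if done else 0.0), done, info
```

From the EA's perspective nothing changes under this wrapper: its fitness is the same episode-wide sum either way. A per-step temporal-difference learner, however, loses its dense reward signal entirely.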
Additionally, ERL is able to leverage gradients, which allows it to solve the task within 10000 episodes, much faster than the 22000 episodes required by EA. This result highlights the key capability of ERL: combining mechanisms within EA and DDPG to achieve the best of both approaches.

4 Evolutionary Reinforcement Learning

The principal idea behind Evolutionary Reinforcement Learning (ERL) is to incorporate EA's population-based approach to generate a diverse set of experiences while leveraging powerful gradient-based methods from DRL to learn from them. In this work, we instantiate ERL by combining a standard EA with DDPG, but any off-policy reinforcement learner that utilizes an actor-critic architecture can be used.

A general flow of the ERL algorithm proceeds as follows: a population of actor networks is initialized with random weights. In addition to the population, one additional actor network (referred to as rlactor henceforth) is initialized alongside a critic network. The actors in the population (rlactor excluded) are then evaluated in an episode of interaction with the environment. The fitness for each actor is computed as the cumulative sum of the rewards it receives over the timesteps in that episode. A selection operator then selects a portion of the population for survival with probability commensurate with their relative fitness scores. The actors in the population are then probabilistically perturbed through mutation and crossover operations to create the next generation of actors. A select portion of actors with the highest relative fitness are preserved as elites and are shielded from the mutation step.

EA → RL: The procedure up to this point is reminiscent of a standard EA. 
However, unlike an EA, which only learns between episodes using a coarse feedback signal (the fitness score), ERL additionally learns from the experiences within episodes.

Algorithm 1 Evolutionary Reinforcement Learning
1: Initialize actor πrl and critic Qrl with weights θπ and θQ, respectively
2: Initialize target actor π′rl and critic Q′rl with weights θπ′ and θQ′, respectively
3: Initialize a population of k actors popπ and an empty cyclic replay buffer R
4: Define an Ornstein-Uhlenbeck noise generator O and a random number generator r() ∈ [0, 1)
5: for generation = 1, ∞ do
6:   for actor π ∈ popπ do
7:     fitness, R = Evaluate(π, R, noise=None, ξ)
8:   end for
9:   Rank the population based on fitness scores
10:  Select the first e actors π ∈ popπ as elites, where e = int(ψ*k)
11:  Select (k − e) actors π from popπ to form Set S using tournament selection with replacement
12:  while |S| < (k − e) do
13:    Use crossover between a randomly sampled π ∈ e and π ∈ S and append to S
14:  end while
15:  for Actor π ∈ Set S do
16:    if r() < mutprob then
17:      Mutate(θπ)
18:    end if
19:  end for
20:  _, R = Evaluate(πrl, R, noise = O, ξ = 1)
21:  Sample a random minibatch of T transitions (si, ai, ri, si+1) from R
22:  Compute yi = ri + γQ′rl(si+1, π′rl(si+1|θπ′)|θQ′)
23:  Update Qrl by minimizing the loss: L = 1/T Σi (yi − Qrl(si, ai|θQ))²
24:  Update πrl using the sampled policy gradient: ∇θπ J ≈ 1/T Σ ∇aQrl(s, a|θQ)|s=si,a=ai ∇θπ π(s|θπ)|s=si
25:  Soft update target networks: θπ′ ⇐ τθπ + (1 − τ)θπ′ and θQ′ ⇐ τθQ + (1 − τ)θQ′
26:  if generation mod ω = 0 then
27:    Copy the RL actor into the population: for the weakest π ∈ popπ: θπ ⇐ θπrl
28:  end if
29: end for

Algorithm 2 Function Evaluate
1: procedure EVALUATE(π, R, noise, ξ)
2:   fitness = 0
3:   for i = 1:ξ do
4:     Reset environment and get initial state s0
5:     while env is not done do
6:       Select action at = π(st|θπ) + noiset
7:       Execute action at and observe reward rt and new state st+1
8:       Append transition (st, at, rt, st+1) to R
9:       fitness ← fitness + rt and s = st+1
10:    end while
11:  end for
12:  Return fitness/ξ, R
13: end procedure

ERL stores each actor's experiences, defined by the tuple (current state, action, next state, reward), in its replay buffer. This is done for every interaction, at every timestep, for every episode, and for each of its actors. The critic samples a random minibatch from this replay buffer and uses it to update its parameters using gradient descent. The critic, alongside the minibatch, is then used to train the rlactor using the sampled policy gradient. 
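Algorithms 1 and 2 can be condensed into a schematic generation loop. The sketch below is illustrative only: policies are toy parameter vectors, crossover is omitted for brevity, and `evaluate` stands in for Algorithm 2 (an episode rollout that returns a fitness score and appends transitions to the shared buffer); all names are ours, not from the paper's code.

```python
import copy
import random

def erl_generation(pop, rl_actor, evaluate, mutate, replay,
                   elite_frac=0.1, sync=False):
    """One generation of the ERL outer loop (schematic): evaluate fitness,
    keep elites unmutated, refill the population via tournament selection
    plus mutation, and optionally overwrite the weakest member with the
    gradient-trained RL actor (the RL -> EA synchronization step)."""
    # Evaluate every actor; evaluate() also feeds the shared replay buffer.
    fitness = {id(a): evaluate(a, replay) for a in pop}
    ranked = sorted(pop, key=lambda a: fitness[id(a)], reverse=True)

    e = max(1, int(elite_frac * len(pop)))
    next_pop = [copy.deepcopy(a) for a in ranked[:e]]  # elites, shielded

    while len(next_pop) < len(pop):
        # Tournament selection with replacement (tournament size 3).
        winner = max(random.sample(ranked, 3), key=lambda a: fitness[id(a)])
        child = copy.deepcopy(winner)
        mutate(child)
        next_pop.append(child)

    if sync:
        # Every omega generations: inject the RL actor over the weakest slot;
        # selection in later generations decides whether it survives.
        next_pop[-1] = copy.deepcopy(rl_actor)
    return next_pop
```

The final step mirrors lines 26-27 of Algorithm 1: the injected policy is not privileged thereafter, so a poorly performing rlactor is simply selected against.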
This is similar to the learning procedure for DDPG, except that the replay buffer has access to the experiences from the entire evolutionary population.

Algorithm 3 Function Mutate
1: procedure MUTATE(θπ)
2:   for Weight Matrix M ∈ θπ do
3:     for iteration = 1, mutfrac * |M| do
4:       Randomly sample indices i and j from M's first and second axis, respectively
5:       if r() < supermutprob then
6:         M[i, j] = M[i, j] * N(0, 100 * mutstrength)
7:       else if r() < resetprob then
8:         M[i, j] = N(0, 1)
9:       else
10:        M[i, j] = M[i, j] * N(0, mutstrength)
11:      end if
12:    end for
13:  end for
14: end procedure

Data Reuse: The replay buffer is the central mechanism that enables the flow of information from the evolutionary population to the RL learner. In contrast to a standard EA, which would extract the fitness metric from these experiences and disregard them immediately, ERL retains them in the buffer and engages the rlactor and critic to learn from them repeatedly using powerful gradient-based methods. This mechanism allows for maximal information extraction from each individual experience, leading to improved sample efficiency.

Temporal Credit Assignment: Since fitness scores capture the episode-wide return of an individual, the selection operator exerts a strong pressure to favor individuals with higher episode-wide returns. As the buffer is populated by the experiences collected by these individuals, this process biases the state distribution towards regions that have higher episode-wide return. This serves as a form of implicit prioritization that favors experiences leading to higher long-term payoffs and is effective for domains with long time horizons and sparse rewards. 
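As an aside, the Mutate operator of Algorithm 3 can be sketched in numpy under the paper's reported settings (mutfrac = 0.1, supermutprob = resetprob = 0.05, mutstrength = 0.1). Treating the probability checks as a single draw per perturbed weight is our interpretation of the pseudocode:

```python
import numpy as np

def mutate(weights, mut_frac=0.1, super_prob=0.05, reset_prob=0.05,
           mut_strength=0.1, rng=None):
    """Schematic Algorithm 3: perturb a fraction of the entries of each
    weight matrix in place and return the mutated list of matrices."""
    rng = np.random.default_rng() if rng is None else rng
    for M in weights:
        for _ in range(int(mut_frac * M.size)):
            # Pick a random entry of the matrix.
            i = rng.integers(M.shape[0])
            j = rng.integers(M.shape[1])
            p = rng.random()
            if p < super_prob:
                # Rare "super" mutation: 100x stronger multiplicative noise.
                M[i, j] *= rng.normal(0.0, 100 * mut_strength)
            elif p < super_prob + reset_prob:
                # Rare reset: re-draw the weight from N(0, 1).
                M[i, j] = rng.normal(0.0, 1.0)
            else:
                # Common case: small multiplicative Gaussian perturbation.
                M[i, j] *= rng.normal(0.0, mut_strength)
    return weights
```

The occasional super mutations and resets let the population make large jumps in parameter space, while the common small perturbations provide local search around good solutions.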
An RL learner that learns from this state distribution (the replay buffer) is biased towards learning policies that optimize for higher episode-wide return.

Diverse Exploration: A noisy version of the rlactor, using the Ornstein-Uhlenbeck [58] process, is used to generate additional experiences for the replay buffer. In contrast to the population of actors, which explore via noise in their parameter space (neural weights), the rlactor explores through noise in its action space. The two processes complement each other and collectively lead to an effective exploration strategy that is able to better explore the policy space.

RL → EA: Periodically, the rlactor network's weights are copied into the evolving population of actors, a step referred to as synchronization. The frequency of synchronization controls the flow of information from the RL learner to the evolutionary population. This is the core mechanism that enables the evolutionary framework to directly leverage the information learned through gradient descent. The process of infusing the policy learned by the rlactor into the population also serves to stabilize learning and make it more robust to deception. If the policy learned by the rlactor is good, it will be selected to survive and extend its influence over the population in subsequent generations. However, if the rlactor is bad, it will simply be selected against and discarded. This mechanism ensures that the flow of information from the rlactor to the evolutionary population is constructive, and not disruptive. This is particularly relevant for domains with sparse rewards and deceptive local minima, to which gradient-based methods can be highly susceptible.

Algorithms 1, 2 and 3 provide detailed pseudocode of the ERL algorithm using DDPG as its policy gradient component. The Adam [29] optimizer with gradient clipping at 10 and learning rates of 5e−5 and 5e−4 was used for the rlactor and rlcritic, respectively. 
The size of the population k was set to 10, while the elite fraction ψ varied from 0.1 to 0.3 across tasks. The number of trials conducted to compute a fitness score, ξ, ranged from 1 to 5 across tasks. The size of the replay buffer and the batch size were set to 1e6 and 128, respectively. The discount rate γ and target weight τ were set to 0.99 and 1e−3, respectively. The mutation probability mutprob was set to 0.9, while the synchronization period ω ranged from 1 to 10 across tasks. The mutation strength mutstrength was set to 0.1, corresponding to 10% Gaussian noise. Finally, the mutation fraction mutfrac was set to 0.1, while the probabilities of super mutation supermutprob and reset resetprob were each set to 0.05.

Figure 3: Learning curves on the Mujoco-based continuous control benchmarks: (a) HalfCheetah, (b) Swimmer, (c) Reacher, (d) Ant, (e) Hopper, (f) Walker2D.

5 Experiments

Domain: We evaluated the performance of ERL1 agents on 6 continuous control tasks simulated using Mujoco [56]. These are benchmarks used widely in the field [13, 25, 53, 47] and are hosted through the OpenAI gym [6].

Compared Baselines: We compare the performance of ERL with a standard neuroevolutionary algorithm (EA), DDPG [31] and Proximal Policy Optimization (PPO) [47]. DDPG and PPO are state-of-the-art deep reinforcement learning algorithms of the off-policy and on-policy variety, respectively. PPO builds on the Trust Region Policy Optimization (TRPO) algorithm [45]. ERL is implemented using PyTorch [39], while OpenAI Baselines [11] was used to implement PPO and DDPG. The hyperparameters for both algorithms were set to match the original papers, except that a larger batch size of 128 was used for DDPG, which was shown to improve performance in [27].

Methodology for Reported Metrics: For DDPG and PPO, the actor network was periodically tested on 5 task instances without any exploratory noise. 
The average score was then logged as its performance. For ERL, during each training generation, the actor network with the highest fitness was selected as the champion. The champion was then tested on 5 task instances, and the average score was logged. This protocol was implemented to shield the reported metrics from any bias of the population size. Note that all scores are compared against the number of steps in the environment. Each step is defined as an instance where the agent takes an action and gets a reward back from the environment. To make the comparisons fair across single-agent and population-based algorithms, steps taken by all actors in the population are counted cumulatively. For example, one episode of HalfCheetah consists of 1000 steps. For a population of 10 actors, each generation consists of evaluating the actors in an episode, which would incur 10,000 steps. We conduct five independent statistical runs with varying random seeds, and report the average with error bars logging the standard deviation.

Results: Figure 3 shows the comparative performance of ERL, EA, DDPG and PPO. The performances of DDPG and PPO were verified to match the ones reported in their original papers [31, 47]. ERL significantly outperforms DDPG across all the benchmarks. Notably, ERL is able to learn on the 3D quadruped locomotion Ant benchmark, where DDPG normally fails to make any learning progress [13, 23, 24]. ERL also consistently outperforms EA across all but the Swimmer environment, where the two algorithms perform approximately equivalently. Considering that ERL is built primarily using the subcomponents of these two algorithms, this is an important result. 

1Code available at https://github.com/ShawK91/erl_paper_nips18
Additionally, ERL significantly outperforms PPO in 4 out of the 6 benchmark environments2. The two exceptions are Hopper and Walker2D, where ERL eventually matches and exceeds PPO's performance but is less sample efficient. A common theme in these two environments is early termination of an episode if the agent falls over. Both environments also disburse a constant small reward for each step of survival to encourage the agent to hold balance. Since EA selects for episode-wide return, this reward setup creates a strong local minimum for a policy that simply survives by balancing while staying still. This is the exact behavior EA converges to for both environments. However, while ERL is initially confined by the local minimum's strong basin of attraction, it eventually breaks free from it by virtue of its RL components: temporally correlated exploration in the action space and policy gradients based on experience batches sampled randomly from the replay buffer. This highlights the core aspect of ERL: incorporating the mechanisms within EA and policy gradient methods to achieve the best of both approaches.

Figure 4: Ablation experiments with the selection operator removed. NS indicates ERL without the selection operator.

Ablation Experiments: We use an ablation experiment to test the value of the selection operator, which is the core mechanism for experience selection within ERL. Figure 4 shows the comparative results in the HalfCheetah and Swimmer benchmarks. The performance for each benchmark was normalized by the best score achieved using the full ERL algorithm (Figure 3). Results demonstrate that the selection operator is a crucial part of ERL. 
Removing the selection operator (the NS variants) leads to significant degradation in learning performance (∼80%) across both benchmarks.

Interaction between RL and EA: To tease apart the system further, we ran additional experiments logging whether the rlactor that is periodically synchronized into the EA population was classified as an elite, just selected, or discarded during selection (see Table 1). The results vary across tasks, with HalfCheetah and Swimmer standing at the two extremes: the rlactor being the most and the least performant, respectively. The Swimmer's selection rate is consistent with the results in Figure 3b, where EA matched ERL's performance while the RL approaches struggled. The overall distribution of selection rates suggests tight integration between the rlactor and the evolutionary population as the driver for successful learning. Interestingly, even for HalfCheetah, which favors the rlactor most of the time, EA plays a critical role with 'critical interventions.' For instance, during the course of learning, the cheetah benefits from leaning forward to increase its speed, which gives rise to a strong gradient in this direction. However, if the cheetah leans too much, it falls over. The gradient-based methods often seem to fall into this trap and then fail to recover, as the gradient information from the new state has no guarantee of undoing the last gradient update. However, ERL with its population provides built-in redundancies which select against this deceptive trap, and it eventually finds a direction for learning which avoids it. Once this deceptive trap is avoided, gradient descent can take over again in regions with better reward landscapes. These critical interventions seem to be crucial for ERL's robustness and success in the HalfCheetah benchmark.

Note on runtime: On average, ERL took approximately 3% more time than DDPG to run. 
The majority of the added computation stems from the mutation operator, whose cost in comparison to gradient descent was minimal. Additionally, these comparisons are based on an implementation of ERL without any parallelization. We anticipate a parallelized implementation of ERL to run significantly faster, as corroborated by previous work in population-based approaches [8, 44, 53].

Table 1: Selection rate for the synchronized rlactor

Environment      Elite          Selected        Discarded
Half-Cheetah     83.8 ± 9.3%    14.3 ± 9.1%      2.3 ± 2.5%
Swimmer           4.0 ± 2.8%    20.3 ± 18.1%    76.0 ± 20.4%
Reacher          68.3 ± 9.9%    19.7 ± 6.9%      9.0 ± 6.9%
Ant              66.7 ± 1.7%    15.0 ± 1.4%     18.0 ± 0.8%
Hopper           28.7 ± 8.5%    33.7 ± 4.1%     37.7 ± 4.5%
Walker-2d        38.5 ± 1.5%    39.0 ± 1.9%     22.5 ± 0.5%

² Videos of learned policies available at https://tinyurl.com/erl-mujoco

6 Related Work

Using evolutionary algorithms to complement reinforcement learning, and vice versa, is not a new idea. Stafylopatis and Blekas combined the two using a Learning Classifier System for autonomous car control [51]. Whiteson and Stone used NEAT [52], an evolutionary algorithm that evolves both neural topology and weights, to optimize function approximators representing the value function in Q-learning [60]. More recently, Colas et al. used an evolutionary method (Goal Exploration Process) to generate diverse samples, followed by a policy gradient method for fine-tuning the policy parameters [7]. From an evolutionary perspective, combining RL with EA is closely related to the idea of incorporating learning with evolution [1, 12, 57]. Fernando et al.
leveraged a similar idea to tackle catastrophic forgetting in transfer learning [17] and to construct differentiable pattern producing networks capable of discovering CNN architectures automatically [16].
Recently, there has been a renewed push in the use of evolutionary algorithms as alternatives to (Deep) Reinforcement Learning [43]. Salimans et al. used a class of EAs called Evolution Strategies (ES) to achieve results competitive with DRL in Atari and robotic control tasks [44]. The authors achieved significant improvements in clock time by using over a thousand parallel workers, highlighting the scalability of ES approaches. Similar scalability and competitive results were demonstrated by Such et al. using a genetic algorithm with novelty search [53]. A companion paper applied novelty search [30] and Quality Diversity [9, 42] to ES to improve exploration [8]. EAs have also been widely used to optimize deep neural network architectures and hyperparameters [28, 32]. Conversely, ideas within RL have also been used to improve EAs. Gangwani and Peng devised a genetic algorithm using imitation learning and policy gradients as the crossover and mutation operators, respectively [22]. ERL provides a framework for combining these developments for potentially further improved performance. For instance, the crossover and mutation operators from [22] can be readily incorporated within ERL's EA module, while bias correction techniques such as [21] can be used to improve the policy gradient operations within ERL.

7 Discussion

We presented ERL, a hybrid algorithm that leverages the population of an EA to generate diverse experiences to train an RL agent, and reinserts the RL agent into the EA population periodically to inject gradient information into the EA.
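This hybrid loop can be summarized in a short sketch (an illustrative skeleton only: `evaluate`, `mutate`, and `rl_update` are hypothetical stand-ins for the fitness evaluation, mutation operator, and DDPG-style gradient update; the single-elite selection step is simplified relative to the full algorithm):

```python
import copy

def erl_loop(population, rl_actor, evaluate, rl_update, mutate,
             sync_period, generations):
    """Skeleton of the ERL hybrid: the EA population generates diverse
    experience for a shared replay buffer, an off-policy RL actor learns
    from that buffer, and the RL actor is periodically copied back into
    the population, where selection decides its fate."""
    replay_buffer = []
    for gen in range(1, generations + 1):
        # 1. Fitness evaluation; rollout experiences feed the shared buffer.
        fitnesses = [evaluate(p, replay_buffer) for p in population]
        # 2. RL step: gradient updates from experiences sampled off-policy.
        rl_actor = rl_update(rl_actor, replay_buffer)
        # 3. Lamarckian transfer: periodically inject a copy of the RL actor,
        #    replacing the weakest member of the population.
        if gen % sync_period == 0:
            weakest = min(range(len(population)), key=lambda i: fitnesses[i])
            population[weakest] = copy.deepcopy(rl_actor)
            fitnesses[weakest] = evaluate(population[weakest], replay_buffer)
        # 4. Simplified selection + mutation: keep the elite and refill the
        #    remaining slots with mutated copies of it.
        elite = population[max(range(len(population)),
                               key=lambda i: fitnesses[i])]
        population = [elite] + [mutate(copy.deepcopy(elite))
                                for _ in range(len(population) - 1)]
    return population, rl_actor
```

If the injected actor is competitive it survives selection and spreads its gradient-derived information through the population; otherwise it is discarded, shielding the EA from destructive updates.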
ERL inherits EA's invariance to sparse rewards with long time horizons, capacity for diverse exploration, and stability of a population-based approach, and complements these with DRL's ability to leverage gradients for lower sample complexity. Additionally, ERL recycles the data generated by the evolutionary population and leverages the replay buffer to learn from them repeatedly, allowing maximal information extraction from each experience and leading to improved sample efficiency. Results in a range of challenging continuous control benchmarks demonstrate that ERL outperforms state-of-the-art DRL algorithms, including PPO and DDPG.
From a reinforcement learning perspective, ERL can be viewed as a form of 'population-driven guide' that biases exploration towards states with higher long-term returns, promotes diversity of explored policies, and introduces redundancies for stability. From an evolutionary perspective, ERL can be viewed as a Lamarckian mechanism that enables the incorporation of powerful gradient-based methods to learn at the resolution of an agent's individual experiences. In general, RL methods learn from an agent's life (individual experience tuples collected by the agent), whereas EA methods learn from an agent's death (a fitness metric accumulated over a full episode). The principal mechanism behind ERL is the capability to incorporate both modes of learning: learning directly from the high resolution of individual experiences while remaining aligned to maximize long-term return by leveraging the low-resolution fitness metric.
In this paper, we used a standard EA as the evolutionary component of ERL. Incorporating more complex evolutionary sub-mechanisms is an exciting area of future work. Some examples include incorporating more informative crossover and mutation operators [22], adaptive exploration noise [20, 41], and explicit diversity maintenance techniques [8, 9, 30, 53].
Other areas of future work will incorporate implicit curriculum-based techniques like Hindsight Experience Replay [3] and information-theoretic techniques [15, 24] to further improve exploration. Another exciting thread of research is the extension of ERL into multiagent reinforcement learning settings, where a population of agents learns and acts within the same environment.

References

[1] D. Ackley and M. Littman. Interactions between learning and evolution. Artificial Life II, 10:487–509, 1991.

[2] C. W. Ahn and R. S. Ramakrishna. Elitism-based compact genetic algorithms. IEEE Transactions on Evolutionary Computation, 7(4):367–385, 2003.

[3] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.

[4] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.

[5] S. Bhatnagar, D. Precup, D. Silver, R. S. Sutton, H. R. Maei, and C. Szepesvári. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, pages 1204–1212, 2009.

[6] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

[7] C. Colas, O. Sigaud, and P.-Y. Oudeyer. GEP-PG: Decoupling exploration and exploitation in deep reinforcement learning algorithms. arXiv preprint arXiv:1802.05054, 2018.

[8] E. Conti, V. Madhavan, F. P. Such, J. Lehman, K. O. Stanley, and J. Clune. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents.
arXiv preprint arXiv:1712.06560, 2017.

[9] A. Cully, J. Clune, D. Tarapore, and J.-B. Mouret. Robots that can adapt like animals. Nature, 521(7553):503, 2015.

[10] K. De Asis, J. F. Hernandez-Garcia, G. Z. Holland, and R. S. Sutton. Multi-step reinforcement learning: A unifying algorithm. arXiv preprint arXiv:1703.01327, 2017.

[11] P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu. OpenAI Baselines. https://github.com/openai/baselines, 2017.

[12] M. M. Drugan. Reinforcement learning versus evolutionary computation: A survey on hybrid algorithms. Swarm and Evolutionary Computation, 2018.

[13] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.

[14] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.

[15] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.

[16] C. Fernando, D. Banarse, M. Reynolds, F. Besse, D. Pfau, M. Jaderberg, M. Lanctot, and D. Wierstra. Convolution by evolution: Differentiable pattern producing networks. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, pages 109–116. ACM, 2016.

[17] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra. PathNet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734, 2017.

[18] D. Floreano, P. Dürr, and C. Mattiussi. Neuroevolution: from architectures to learning. Evolutionary Intelligence, 1(1):47–62, 2008.

[19] D. B. Fogel.
Evolutionary Computation: Toward a New Philosophy of Machine Intelligence, volume 1. John Wiley & Sons, 2006.

[20] M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, et al. Noisy networks for exploration. arXiv preprint arXiv:1706.10295, 2017.

[21] S. Fujimoto, H. van Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.

[22] T. Gangwani and J. Peng. Genetic policy optimization. arXiv preprint arXiv:1711.01012, 2017.

[23] S. Gu, T. Lillicrap, R. E. Turner, Z. Ghahramani, B. Schölkopf, and S. Levine. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 3849–3858, 2017.

[24] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.

[25] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement learning that matters. arXiv preprint arXiv:1709.06560, 2017.

[26] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117, 2016.

[27] R. Islam, P. Henderson, M. Gomrokchi, and D. Precup. Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint arXiv:1708.04133, 2017.

[28] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.

[29] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[30] J.
Lehman and K. O. Stanley. Exploiting open-endedness to solve problems through the search for novelty. In ALIFE, pages 329–336, 2008.

[31] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[32] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436, 2017.

[33] B. Lüders, M. Schläger, A. Korach, and S. Risi. Continual and one-shot learning through neural networks with dynamic external memory. In European Conference on the Applications of Evolutionary Computation, pages 886–901. Springer, 2017.

[34] A. R. Mahmood, H. Yu, and R. S. Sutton. Multi-step off-policy learning without importance sampling ratios. arXiv preprint arXiv:1702.03006, 2017.

[35] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[36] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

[37] R. Munos. Q(λ) with off-policy corrections. In Algorithmic Learning Theory: 27th International Conference, ALT 2016, Bari, Italy, October 19–21, 2016, Proceedings, volume 9925, page 305. Springer, 2016.

[38] G. Ostrovski, M. G. Bellemare, A. v. d. Oord, and R. Munos. Count-based exploration with neural density models. arXiv preprint arXiv:1703.01310, 2017.

[39] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.

[40] D.
Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), volume 2017, 2017.

[41] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz. Parameter space noise for exploration. arXiv preprint arXiv:1706.01905, 2017.

[42] J. K. Pugh, L. B. Soros, and K. O. Stanley. Quality diversity: A new frontier for evolutionary computation. Frontiers in Robotics and AI, 3:40, 2016.

[43] S. Risi and J. Togelius. Neuroevolution in games: State of the art and open challenges. IEEE Transactions on Computational Intelligence and AI in Games, 9(1):25–41, 2017.

[44] T. Salimans, J. Ho, X. Chen, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

[45] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

[46] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

[47] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[48] C. Sherstan, B. Bennett, K. Young, D. R. Ashley, A. White, M. White, and R. S. Sutton. Directly estimating the variance of the λ-return using temporal-difference methods. arXiv preprint arXiv:1801.08287, 2018.

[49] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[50] W. M. Spears, K. A. De Jong, T. Bäck, D. B. Fogel, and H.
De Garis. An overview of evolutionary computation. In European Conference on Machine Learning, pages 442–459. Springer, 1993.

[51] A. Stafylopatis and K. Blekas. Autonomous vehicle navigation using evolutionary reinforcement learning. European Journal of Operational Research, 108(2):306–318, 1998.

[52] K. O. Stanley and R. Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127, 2002.

[53] F. P. Such, V. Madhavan, E. Conti, J. Lehman, K. O. Stanley, and J. Clune. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567, 2017.

[54] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

[55] H. Tang, R. Houthooft, D. Foote, A. Stooke, O. X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 2750–2759, 2017.

[56] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.

[57] P. Turney, D. Whitley, and R. W. Anderson. Evolution, learning, and instinct: 100 years of the Baldwin effect. Evolutionary Computation, 4(3):iv–viii, 1996.

[58] G. E. Uhlenbeck and L. S. Ornstein. On the theory of the Brownian motion. Physical Review, 36(5):823, 1930.

[59] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016.

[60] S. Whiteson and P. Stone.
Evolutionary function approximation for reinforcement learning. Journal of Machine Learning Research, 7(May):877–917, 2006.