{"title": "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments", "book": "Advances in Neural Information Processing Systems", "page_first": 6379, "page_last": 6390, "abstract": "We explore deep reinforcement learning methods for multi-agent domains. We begin by analyzing the difficulty of traditional algorithms in the multi-agent case: Q-learning is challenged by an inherent non-stationarity of the environment, while policy gradient suffers from a variance that increases as the number of agents grows. We then present an adaptation of actor-critic methods that considers action policies of other agents and is able to successfully learn policies that require complex multi-agent coordination. Additionally, we introduce a training regimen utilizing an ensemble of policies for each agent that leads to more robust multi-agent policies. We show the strength of our approach compared to existing methods in cooperative as well as competitive scenarios, where agent populations are able to discover various physical and informational coordination strategies.", "full_text": "Multi-Agent Actor-Critic for Mixed\n\nCooperative-Competitive Environments\n\nRyan Lowe\u2217\n\nMcGill University\n\nOpenAI\n\nJean Harb\n\nMcGill University\n\nOpenAI\n\nAviv Tamar\nUC Berkeley\n\nIgor Mordatch\n\nOpenAI\n\nYi Wu\u2217\n\nUC Berkeley\n\nPieter Abbeel\nUC Berkeley\n\nOpenAI\n\nAbstract\n\nWe explore deep reinforcement learning methods for multi-agent domains. We\nbegin by analyzing the dif\ufb01culty of traditional algorithms in the multi-agent case:\nQ-learning is challenged by an inherent non-stationarity of the environment, while\npolicy gradient suffers from a variance that increases as the number of agents grows.\nWe then present an adaptation of actor-critic methods that considers action policies\nof other agents and is able to successfully learn policies that require complex multi-\nagent coordination. 
Additionally, we introduce a training regimen utilizing an\nensemble of policies for each agent that leads to more robust multi-agent policies.\nWe show the strength of our approach compared to existing methods in cooperative\nas well as competitive scenarios, where agent populations are able to discover\nvarious physical and informational coordination strategies.\n\n1\n\nIntroduction\n\nReinforcement learning (RL) has recently been applied to solve challenging problems, from game\nplaying [23, 28] to robotics [18]. In industrial applications, RL is seeing use in large scale systems\nsuch as data center cooling [1]. Most of the successes of RL have been in single agent domains,\nwhere modelling or predicting the behaviour of other actors in the environment is largely unnecessary.\nHowever, there are a number of important applications that involve interaction between multiple\nagents, where emergent behavior and complexity arise from agents co-evolving together. For example,\nmulti-robot control [20], the discovery of communication and language [29, 8, 24], multiplayer games\n[27], and the analysis of social dilemmas [17] all operate in a multi-agent domain. Related problems,\nsuch as variants of hierarchical reinforcement learning [6] can also be seen as a multi-agent system,\nwith multiple levels of hierarchy being equivalent to multiple agents. Additionally, multi-agent\nself-play has recently been shown to be a useful training paradigm [28, 30]. Successfully scaling RL\nto environments with multiple agents is crucial to building arti\ufb01cially intelligent systems that can\nproductively interact with humans and each other.\nUnfortunately, traditional reinforcement learning approaches such as Q-Learning or policy gradient\nare poorly suited to multi-agent environments. 
One issue is that each agent\u2019s policy is changing as training progresses, and the environment becomes non-stationary from the perspective of any individual agent (in a way that is not explainable by changes in the agent\u2019s own policy). This presents learning stability challenges and prevents the straightforward use of past experience replay, which is\n\n\u2217Equal contribution. Corresponding authors: ryan.lowe@cs.mcgill.ca, jxwuyi@gmail.com, mordatch@openai.com.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fcrucial for stabilizing deep Q-learning. Policy gradient methods, on the other hand, usually exhibit very high variance when coordination of multiple agents is required. Alternatively, one can use model-based policy optimization which can learn optimal policies via back-propagation, but this requires a (differentiable) model of the world dynamics and assumptions about the interactions between agents. Applying these methods to competitive environments is also challenging from an optimization perspective, as evidenced by the notorious instability of adversarial training methods [11].\n\nIn this work, we propose a general-purpose multi-agent learning algorithm that: (1) leads to learned policies that only use local information (i.e. their own observations) at execution time, (2) does not assume a differentiable model of the environment dynamics or any particular structure on the communication method between agents, and (3) is applicable not only to cooperative interaction but to competitive or mixed interaction involving both physical and communicative behavior. The ability to act in mixed cooperative-competitive environments may be critical for intelligent agents; while competitive training provides a natural curriculum for learning [30], agents must also exhibit cooperative behavior (e.g.
with humans) at execution time.\nWe adopt the framework of centralized training with decentralized execution, allowing the policies\nto use extra information to ease training, so long as this information is not used at test time. It is\nunnatural to do this with Q-learning without making additional assumptions about the structure of the\nenvironment, as the Q function generally cannot contain different information at training and test\ntime. Thus, we propose a simple extension of actor-critic policy gradient methods where the critic is\naugmented with extra information about the policies of other agents, while the actor only has access\nto local information. After training is completed, only the local actors are used at execution phase,\nacting in a decentralized manner and equally applicable in cooperative and competitive settings. This\nis a natural setting for multi-agent language learning, as full centralization would not require the\ndevelopment of discrete communication protocols.\nSince the centralized critic function explicitly uses the decision-making policies of other agents, we\nadditionally show that agents can learn approximate models of other agents online and effectively use\nthem in their own policy learning procedure. We also introduce a method to improve the stability of\nmulti-agent policies by training agents with an ensemble of policies, thus requiring robust interaction\nwith a variety of collaborator and competitor policies. We empirically show the success of our\napproach compared to existing methods in cooperative as well as competitive scenarios, where agent\npopulations are able to discover complex physical and communicative coordination strategies.\n\n2 Related Work\n\nThe simplest approach to learning in multi-agent settings is to use independently learning agents.\nThis was attempted with Q-learning in [34], but does not perform well in practice [22]. As we will\nshow, independently-learning policy gradient methods also perform poorly. 
One issue is that each\nagent\u2019s policy changes during training, resulting in a non-stationary environment and preventing the\nna\u00efve application of experience replay. Previous work has attempted to address this by inputting\nother agent\u2019s policy parameters to the Q function [35], explicitly adding the iteration index to the\nreplay buffer, or using importance sampling [9]. Deep Q-learning approaches have previously been\ninvestigated in [33] to train competing Pong agents.\nThe nature of interaction between agents can either be cooperative, competitive, or both and many\nalgorithms are designed only for a particular nature of interaction. Most studied are cooperative\nsettings, with strategies such as optimistic and hysteretic Q function updates [15, 21, 25], which\nassume that the actions of other agents are made to improve collective reward. Another approach is to\nindirectly arrive at cooperation via sharing of policy parameters [12], but this requires homogeneous\nagent capabilities. These algorithms are generally not applicable in competitive or mixed settings.\nSee [26, 4] for surveys of multi-agent learning approaches and applications.\nConcurrently to our work, [7] proposed a similar idea of using policy gradient methods with a\ncentralized critic, and test their approach on a StarCraft micromanagement task. 
Their approach\ndiffers from ours in the following ways: (1) they learn a single centralized critic for all agents, whereas\nwe learn a centralized critic for each agent, allowing for agents with differing reward functions\nincluding competitive scenarios, (2) we consider environments with explicit communication between\nagents, (3) they combine recurrent policies with feed-forward critics, whereas our experiments\n\n2\n\n\fuse feed-forward policies (although our methods are applicable to recurrent policies), (4) we learn\ncontinuous policies whereas they learn discrete policies.\nRecent work has focused on learning grounded cooperative communication protocols between agents\nto solve various tasks [29, 8, 24]. However, these methods are usually only applicable when the\ncommunication between agents is carried out over a dedicated, differentiable communication channel.\nOur method requires explicitly modeling decision-making process of other agents. The importance\nof such modeling has been recognized by both reinforcement learning [3, 5] and cognitive science\ncommunities [10]. [13] stressed the importance of being robust to the decision making process of\nother agents, as do others by building Bayesian models of decision making. We incorporate such\nrobustness considerations by requiring that agents interact successfully with an ensemble of any\npossible policies of other agents, improving training stability and robustness of agents after training.\n\n3 Background\n\nMarkov Games\nIn this work, we consider a multi-agent extension of Markov decision processes\n(MDPs) called partially observable Markov games [19]. A Markov game for N agents is de\ufb01ned by a\nset of states S describing the possible con\ufb01gurations of all agents, a set of actions A1, ...,AN and\na set of observations O1, ...,ON for each agent. 
To choose actions, each agent i uses a stochastic policy \u03c0_{\u03b8_i} : O_i \u00d7 A_i \u21a6 [0, 1], which produces the next state according to the state transition function T : S \u00d7 A_1 \u00d7 ... \u00d7 A_N \u21a6 S.2 Each agent i obtains rewards as a function of the state and agent\u2019s action r_i : S \u00d7 A_i \u21a6 R, and receives a private observation correlated with the state o_i : S \u21a6 O_i. The initial states are determined by a distribution \u03c1 : S \u21a6 [0, 1]. Each agent i aims to maximize its own total expected return R_i = \u2211_{t=0}^{T} \u03b3^t r_i^t, where \u03b3 is a discount factor and T is the time horizon.\n\nQ-Learning and Deep Q-Networks (DQN). Q-Learning and DQN [23] are popular methods in reinforcement learning and have been previously applied to multi-agent settings [8, 35]. Q-Learning makes use of an action-value function for policy \u03c0 as Q^\u03c0(s, a) = E[R|s_t = s, a_t = a]. This Q function can be recursively rewritten as Q^\u03c0(s, a) = E_{s\u2032}[r(s, a) + \u03b3 E_{a\u2032\u223c\u03c0}[Q^\u03c0(s\u2032, a\u2032)]]. DQN learns the action-value function Q\u2217 corresponding to the optimal policy by minimizing the loss:\n\nL(\u03b8) = E_{s,a,r,s\u2032}[(Q\u2217(s, a|\u03b8) \u2212 y)^2],  where  y = r + \u03b3 max_{a\u2032} \u00afQ\u2217(s\u2032, a\u2032),  (1)\n\nwhere \u00afQ is a target Q function whose parameters are periodically updated with the most recent \u03b8, which helps stabilize learning. Another crucial component of stabilizing DQN is the use of an experience replay buffer D containing tuples (s, a, r, s\u2032).\n\nQ-Learning can be directly applied to multi-agent settings by having each agent i learn an independently optimal function Q_i [34].
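As a concrete illustration of the loss in Eq. 1, the target and squared Bellman error can be sketched with tabular Q-functions standing in for the networks (a minimal numpy sketch with hypothetical shapes; not the authors' implementation):

```python
import numpy as np

# Minimal sketch of the DQN update in Eq. 1, using Q-tables in place of
# neural networks (hypothetical data layout; for illustration only).
def dqn_targets(rewards, next_states, q_target, gamma=0.95, done=None):
    """y = r + gamma * max_a' Qbar(s', a') for a batch of transitions."""
    if done is None:
        done = np.zeros_like(rewards)
    best_next = q_target[next_states].max(axis=1)   # max_a' Qbar(s', a')
    return rewards + gamma * (1.0 - done) * best_next

def dqn_loss(q_online, states, actions, targets):
    """Mean squared Bellman error (Q(s, a | theta) - y)^2."""
    q_sa = q_online[states, actions]
    return np.mean((q_sa - targets) ** 2)
```

Note that `q_target` is held fixed while `q_online` is updated, mirroring the periodically-synchronized target network described above.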
However, because agents are independently updating their policies as learning progresses, the environment appears non-stationary from the view of any one agent, violating Markov assumptions required for convergence of Q-learning. Another difficulty observed in [9] is that the experience replay buffer cannot be used in such a setting, since in general P(s\u2032|s, a, \u03c0_1, ..., \u03c0_N) \u2260 P(s\u2032|s, a, \u03c0\u2032_1, ..., \u03c0\u2032_N) when any \u03c0_i \u2260 \u03c0\u2032_i.\n\nPolicy Gradient (PG) Algorithms. Policy gradient methods are another popular choice for a variety of RL tasks. The main idea is to directly adjust the parameters \u03b8 of the policy in order to maximize the objective J(\u03b8) = E_{s\u223cp^\u03c0, a\u223c\u03c0_\u03b8}[R] by taking steps in the direction of \u2207_\u03b8 J(\u03b8). Using the Q function defined previously, the gradient of the policy can be written as [32]:\n\n\u2207_\u03b8 J(\u03b8) = E_{s\u223cp^\u03c0, a\u223c\u03c0_\u03b8}[\u2207_\u03b8 log \u03c0_\u03b8(a|s) Q^\u03c0(s, a)],  (2)\n\nwhere p^\u03c0 is the state distribution. The policy gradient theorem has given rise to several practical algorithms, which often differ in how they estimate Q^\u03c0. For example, one can simply use a sample return R_t = \u2211_{i=t}^{T} \u03b3^{i\u2212t} r_i, which leads to the REINFORCE algorithm [37]. Alternatively, one could learn an approximation of the true action-value function Q^\u03c0(s, a) by e.g. temporal-difference learning [31]; this Q^\u03c0(s, a) is called the critic and leads to a variety of actor-critic algorithms [31].\n\nPolicy gradient methods are known to exhibit high variance gradient estimates.
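For intuition, the sample-return (REINFORCE) estimator of Eq. 2 can be sketched for a one-parameter Bernoulli policy; the sigmoid parameterization below is an assumption for illustration, not a construction from the paper:

```python
import numpy as np

# Sketch of the score-function estimator in Eq. 2 for a single agent with a
# Bernoulli policy P(a = 1) = sigmoid(theta). Illustrative only.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reinforce_grad(theta, actions, returns):
    """Average of grad_theta log pi(a) * R over sampled (a, R) pairs."""
    p = sigmoid(theta)
    # d/dtheta log pi(a) is (1 - p) if a == 1, and -p if a == 0.
    score = np.where(actions == 1, 1.0 - p, -p)
    return np.mean(score * returns)
```

Because the estimator multiplies a noisy score by a noisy return, its variance grows with the variability of the return, which is the issue the next section quantifies.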
This is exacerbated in multi-agent settings; since an agent\u2019s reward usually depends on the actions of many agents,\n\n2To minimize notation we will often omit \u03b8 from the subscript of \u03c0.\n\n\fthe reward conditioned only on the agent\u2019s own actions (when the actions of other agents are not considered in the agent\u2019s optimization process) exhibits much more variability, thereby increasing the variance of its gradients. Below, we show a simple setting where the probability of taking a gradient step in the correct direction decreases exponentially with the number of agents.\n\nProposition 1. Consider N agents with binary actions: P(a_i = 1) = \u03b8_i, where R(a_1, . . . , a_N) = 1_{a_1=\u00b7\u00b7\u00b7=a_N}. We assume an uninformed scenario, in which agents are initialized to \u03b8_i = 0.5 \u2200i. Then, if we are estimating the gradient of the cost J with policy gradient, we have:\n\nP(\u27e8 \u02c6\u2207J, \u2207J \u27e9 > 0) \u221d (0.5)^N,\n\nwhere \u02c6\u2207J is the policy gradient estimator from a single sample, and \u2207J is the true gradient.\n\nProof. See Appendix.\n\nThe use of baselines, such as value function baselines typically used to ameliorate high variance, is problematic in multi-agent settings due to the non-stationarity issues mentioned previously.\n\nDeterministic Policy Gradient (DPG) Algorithms. It is also possible to extend the policy gradient framework to deterministic policies \u00b5_\u03b8 : S \u21a6 A.
In particular, under certain conditions we can write the gradient of the objective J(\u03b8) = E_{s\u223cp^\u00b5}[R(s, a)] as:\n\n\u2207_\u03b8 J(\u03b8) = E_{s\u223cD}[\u2207_\u03b8 \u00b5_\u03b8(a|s) \u2207_a Q^\u00b5(s, a)|_{a=\u00b5_\u03b8(s)}].  (3)\n\nSince this theorem relies on \u2207_a Q^\u00b5(s, a), it requires that the action space A (and thus the policy \u00b5) be continuous.\n\nDeep deterministic policy gradient (DDPG) is a variant of DPG where the policy \u00b5 and critic Q^\u00b5 are approximated with deep neural networks. DDPG is an off-policy algorithm, and samples trajectories from a replay buffer of experiences that are stored throughout training. DDPG also makes use of a target network, as in DQN [23].\n\n4 Methods\n\n4.1 Multi-Agent Actor Critic\n\nWe have argued in the previous section that na\u00efve policy gradient methods perform poorly in simple multi-agent settings, and this is supported in our experiments in Section 5. Our goal in this section is to derive an algorithm that works well in such settings. However, we would like to operate under the following constraints: (1) the learned policies can only use local information (i.e. their own observations) at execution time, (2) we do not assume a differentiable model of the environment dynamics, unlike in [24], and (3) we do not assume any particular structure on the communication method between agents (that is, we don\u2019t assume a differentiable communication channel). Fulfilling the above desiderata would provide a general-purpose multi-agent learning algorithm that could be applied not just to cooperative games with explicit communication channels, but competitive games and games involving only physical interactions between agents.\n\nSimilarly to [8], we accomplish our goal by adopting the framework of centralized training with decentralized execution.
Thus, we allow the policies to use extra information to ease training, so long as this information is not used at test time. It is unnatural to do this with Q-learning, as the Q function generally cannot contain different information at training and test time. Thus, we propose a simple extension of actor-critic policy gradient methods where the critic is augmented with extra information about the policies of other agents.\n\nFigure 1: Overview of our multi-agent decentralized actor, centralized critic approach.\n\nMore concretely, consider a game with N agents with policies parameterized by \u03b8 = {\u03b8_1, ..., \u03b8_N}, and let \u03c0 = {\u03c0_1, ..., \u03c0_N} be the set of all agent policies. Then we can write the gradient of the expected return for agent i, J(\u03b8_i) = E[R_i], as:\n\n\u2207_{\u03b8_i} J(\u03b8_i) = E_{s\u223cp^\u00b5, a_i\u223c\u03c0_i}[\u2207_{\u03b8_i} log \u03c0_i(a_i|o_i) Q_i^\u03c0(x, a_1, ..., a_N)].  (4)\n\nHere Q_i^\u03c0(x, a_1, ..., a_N) is a centralized action-value function that takes as input the actions of all agents, a_1, . . . , a_N, in addition to some state information x, and outputs the Q-value for agent i. In the simplest case, x could consist of the observations of all agents, x = (o_1, ..., o_N); however, we could also include additional state information if available. Since each Q_i^\u03c0 is learned separately, agents can have arbitrary reward structures, including con\ufb02icting rewards in a competitive setting.\n\nWe can extend the above idea to work with deterministic policies. If we now consider N continuous policies \u00b5_{\u03b8_i} w.r.t. parameters \u03b8_i (abbreviated as \u00b5_i), the gradient can be written as:\n\n\u2207_{\u03b8_i} J(\u00b5_i) = E_{x,a\u223cD}[\u2207_{\u03b8_i} \u00b5_i(a_i|o_i) \u2207_{a_i} Q_i^\u00b5(x, a_1, ..., a_N)|_{a_i=\u00b5_i(o_i)}],  (5)\n\nHere the experience replay buffer D contains the tuples (x, x\u2032, a_1, . . . , a_N, r_1, . . . , r_N), recording experiences of all agents. The centralized action-value function Q_i^\u00b5 is updated as:\n\nL(\u03b8_i) = E_{x,a,r,x\u2032}[(Q_i^\u00b5(x, a_1, . . . , a_N) \u2212 y)^2],  y = r_i + \u03b3 Q_i^{\u00b5\u2032}(x\u2032, a\u2032_1, . . . , a\u2032_N)|_{a\u2032_j=\u00b5\u2032_j(o_j)},  (6)\n\nwhere \u00b5\u2032 = {\u00b5_{\u03b8\u2032_1}, ..., \u00b5_{\u03b8\u2032_N}} is the set of target policies with delayed parameters \u03b8\u2032_i. As shown in Section 5, we \ufb01nd the centralized critic with deterministic policies works very well in practice, and refer to it as multi-agent deep deterministic policy gradient (MADDPG). We provide the description of the full algorithm in the Appendix.\n\nA primary motivation behind MADDPG is that, if we know the actions taken by all agents, the environment is stationary even as the policies change, since P(s\u2032|s, a_1, ..., a_N, \u03c0_1, ..., \u03c0_N) = P(s\u2032|s, a_1, ..., a_N) = P(s\u2032|s, a_1, ..., a_N, \u03c0\u2032_1, ..., \u03c0\u2032_N) for any \u03c0_i \u2260 \u03c0\u2032_i. This is not the case if we do not explicitly condition on the actions of other agents, as done for most traditional RL methods.\n\nNote that we require the policies of other agents to apply an update in Eq. 6. Knowing the observations and policies of other agents is not a particularly restrictive assumption; if our goal is to train agents to exhibit complex communicative behaviour in simulation, this information is often available to all agents. However, we can relax this assumption if necessary by learning the policies of other agents from observations \u2014 we describe a method of doing this in Section 4.2.\n\n4.2 Inferring Policies of Other Agents\n\nTo remove the assumption of knowing other agents\u2019 policies, as required in Eq. 6, each agent i can additionally maintain an approximation \u02c6\u00b5_{\u03c6_i^j} (where \u03c6 are the parameters of the approximation; henceforth \u02c6\u00b5_i^j) to the true policy of agent j, \u00b5_j. This approximate policy is learned by maximizing the log probability of agent j\u2019s actions, with an entropy regularizer:\n\nL(\u03c6_i^j) = \u2212E_{o_j,a_j}[ log \u02c6\u00b5_i^j(a_j|o_j) + \u03bb H(\u02c6\u00b5_i^j) ],  (7)\n\nwhere H is the entropy of the policy distribution. With the approximate policies, y in Eq. 6 can be replaced by an approximate value \u02c6y calculated as follows:\n\n\u02c6y = r_i + \u03b3 Q_i^{\u00b5\u2032}(x\u2032, \u02c6\u00b5\u2032^1_i(o_1), . . . , \u00b5\u2032_i(o_i), . . . , \u02c6\u00b5\u2032^N_i(o_N)),  (8)\n\nwhere \u02c6\u00b5\u2032^j_i denotes the target network for the approximate policy \u02c6\u00b5_i^j. Note that Eq. 7 can be optimized in a completely online fashion: before updating Q_i^\u00b5, the centralized Q function, we take the latest samples of each agent j from the replay buffer to perform a single gradient step to update \u03c6_i^j. We also input the action log probabilities of each agent directly into Q, rather than sampling.\n\n4.3 Agents with Policy Ensembles\n\nA recurring problem in multi-agent reinforcement learning is the environment non-stationarity due to the agents\u2019 changing policies.
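Before turning to ensembles, note that the centralized target of Eq. 6 is simple to compute once the target policies are available: each agent's target uses the next joint observation together with every agent's target action. A minimal sketch with stand-in callables (assumed interfaces, not the paper's network code):

```python
# Sketch of the MADDPG critic target in Eq. 6: y_i depends on the actions
# of *all* agents under their target policies. `target_policies` and
# `target_critic_i` are stand-in callables with assumed interfaces.
def maddpg_target(r_i, x_next, obs_next, target_policies, target_critic_i,
                  gamma=0.95):
    # a'_j = mu'_j(o_j) for every agent j, then y_i = r_i + gamma * Q'_i(x', a')
    next_actions = [mu(o) for mu, o in zip(target_policies, obs_next)]
    return r_i + gamma * target_critic_i(x_next, next_actions)
```

Replacing each `mu` for j ≠ i with a learned approximation yields the approximate target of Eq. 8.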
This is particularly true in competitive settings, where agents can derive a strong policy by over\ufb01tting to the behavior of their competitors. Such policies are undesirable as they are brittle and may fail when the competitors alter their strategies.\n\n\fTo obtain multi-agent policies that are more robust to changes in the policy of competing agents, we propose to train a collection of K different sub-policies. At each episode, we randomly select one particular sub-policy for each agent to execute. Suppose that policy \u00b5_i is an ensemble of K different sub-policies, with sub-policy k denoted by \u00b5_{\u03b8_i^{(k)}} (abbreviated as \u00b5_i^{(k)}). For agent i, we are then maximizing the ensemble objective: J_e(\u00b5_i) = E_{k\u223cunif(1,K), s\u223cp^\u00b5, a\u223c\u00b5_i^{(k)}}[R_i(s, a)].\n\nSince different sub-policies will be executed in different episodes, we maintain a replay buffer D_i^{(k)} for each sub-policy \u00b5_i^{(k)} of agent i. Accordingly, we can derive the gradient of the ensemble objective with respect to \u03b8_i^{(k)} as follows:\n\n\u2207_{\u03b8_i^{(k)}} J_e(\u00b5_i) = (1/K) E_{x,a\u223cD_i^{(k)}}[\u2207_{\u03b8_i^{(k)}} \u00b5_i^{(k)}(a_i|o_i) \u2207_{a_i} Q^{\u00b5_i}(x, a_1, . . . , a_N)|_{a_i=\u00b5_i^{(k)}(o_i)}].  (9)\n\n5 Experiments2\n\n5.1 Environments\n\nTo perform our experiments, we adopt the grounded communication environment proposed in [24], which consists of N agents and L landmarks inhabiting a two-dimensional world with continuous space and discrete time2. Agents may take physical actions in the environment and communication actions that get broadcasted to other agents. Unlike [24], we do not assume that all agents have identical action and observation spaces, or act according to the same policy \u03c0.
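The ensemble scheme of Section 4.3 amounts to a small amount of bookkeeping: draw one sub-policy per agent at the start of every episode and route its transitions into that sub-policy's own replay buffer. A minimal sketch (class and method names are assumed for illustration, not taken from the released code):

```python
import random

# Sketch of ensemble bookkeeping (Section 4.3): each agent holds K
# sub-policies and one replay buffer per sub-policy; a sub-policy index is
# drawn uniformly per episode, matching the unif(1, K) expectation in the
# ensemble objective. Structures are hypothetical, for illustration.
class EnsembleAgent:
    def __init__(self, k_subpolicies, rng=random):
        self.K = k_subpolicies
        self.buffers = [[] for _ in range(k_subpolicies)]  # D_i^(k)
        self.active = None
        self.rng = rng

    def begin_episode(self):
        self.active = self.rng.randrange(self.K)           # k ~ unif(1, K)
        return self.active

    def store(self, transition):
        # Experience is only valid for the sub-policy that generated it.
        self.buffers[self.active].append(transition)
```

The per-sub-policy buffers are what make the gradient in Eq. 9 well-defined: each sub-policy is updated only on data it generated.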
We also consider\ngames that are both cooperative (all agents must maximize a shared return) and competitive (agents\nhave con\ufb02icting goals). Some environments require explicit communication between agents in order\nto achieve the best reward, while in other environments agents can only perform physical actions. We\nprovide details for each environment below.\n\nFigure 2: Illustrations of the experimental environment and some tasks we consider, including a)\nCooperative Communication b) Predator-Prey c) Cooperative Navigation d) Physical Deception. See\nwebpage for videos of all experimental results.\nCooperative communication. This task consists of two cooperative agents, a speaker and a listener,\nwho are placed in an environment with three landmarks of differing colors. At each episode, the\nlistener must navigate to a landmark of a particular color, and obtains reward based on its distance\nto the correct landmark. However, while the listener can observe the relative position and color\nof the landmarks, it does not know which landmark it must navigate to. Conversely, the speaker\u2019s\nobservation consists of the correct landmark color, and it can produce a communication output at\neach time step which is observed by the listener. Thus, the speaker must learn to output the landmark\ncolour based on the motions of the listener. Although this problem is relatively simple, as we show in\nSection 5.2 it poses a signi\ufb01cant challenge to traditional RL algorithms.\nCooperative navigation. In this environment, agents must cooperate through physical actions to\nreach a set of L landmarks. Agents observe the relative positions of other agents and landmarks, and\nare collectively rewarded based on the proximity of any agent to each landmark. In other words, the\nagents have to \u2018cover\u2019 all of the landmarks. Further, the agents occupy signi\ufb01cant physical space and\nare penalized when colliding with each other. 
Our agents learn to infer the landmark they must cover, and move there while avoiding other agents.\n\n2 Videos of our experimental results can be viewed at https://sites.google.com/site/multiagentac/\n2 The environments are publicly available: https://github.com/openai/multiagent-particle-envs\n\n\fFigure 3: Comparison between MADDPG and DDPG (left), and between single policy MADDPG and ensemble MADDPG (right) on the competitive environments. Each bar cluster shows the 0-1 normalized score for a set of competing policies (agent v adversary), where a higher score is better for the agent. In all cases, MADDPG outperforms DDPG when directly pitted against it, and similarly for the ensemble against the single MADDPG policies. Full results are given in the Appendix.\n\nKeep-away. This scenario consists of L landmarks including a target landmark, N cooperating agents who know the target landmark and are rewarded based on their distance to the target, and M adversarial agents who must prevent the cooperating agents from reaching the target. Adversaries accomplish this by physically pushing the agents away from the landmark, temporarily occupying it. While the adversaries are also rewarded based on their distance to the target landmark, they do not know the correct target; this must be inferred from the movements of the agents.\n\nPhysical deception. Here, N agents cooperate to reach a single target landmark from a total of N landmarks. They are rewarded based on the minimum distance of any agent to the target (so only one agent needs to reach the target landmark).
However, a lone adversary also desires to reach the target\nlandmark; the catch is that the adversary does not know which of the landmarks is the correct one.\nThus the cooperating agents, who are penalized based on the adversary distance to the target, learn to\nspread out and cover all landmarks so as to deceive the adversary.\nPredator-prey. In this variant of the classic predator-prey game, N slower cooperating agents\nmust chase the faster adversary around a randomly generated environment with L large landmarks\nimpeding the way. Each time the cooperative agents collide with an adversary, the agents are rewarded\nwhile the adversary is penalized. Agents observe the relative positions and velocities of the agents,\nand the positions of the landmarks.\nCovert communication. This is an adversarial communication environment, where a speaker agent\n(\u2018Alice\u2019) must communicate a message to a listener agent (\u2018Bob\u2019), who must reconstruct the message\nat the other end. However, an adversarial agent (\u2018Eve\u2019) is also observing the channel, and wants to\nreconstruct the message \u2014 Alice and Bob are penalized based on Eve\u2019s reconstruction, and thus\nAlice must encode her message using a randomly generated key, known only to Alice and Bob. This\nis similar to the cryptography environment considered in [2].\n\n5.2 Comparison to Decentralized Reinforcement Learning Methods\n\nWe implement MADDPG and evaluate it on the\nenvironments presented in Section 5.1. Unless\notherwise speci\ufb01ed, our policies are parameter-\nized by a two-layer ReLU MLP with 64 units\nper layer. To support discrete communication\nmessages, we use the Gumbel-Softmax estima-\ntor [14]. To evaluate the quality of policies\nlearned in competitive settings, we pitch MAD-\nDPG agents against DDPG agents, and compare\nthe resulting success of the agents and adver-\nsaries in the environment. 
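The Gumbel-Softmax estimator [14] that we use for discrete communication messages can be sketched as follows (a standard formulation of the technique, written in numpy for illustration; this is not our training code):

```python
import numpy as np

# Sketch of Gumbel-Softmax sampling [14], which lets gradients flow through
# discrete message choices: perturb the logits with Gumbel(0, 1) noise, then
# take a temperature-controlled softmax. As the temperature decreases, the
# sample approaches a one-hot message.
def gumbel_softmax_sample(logits, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng()
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))            # Gumbel(0, 1) noise
    z = (logits + gumbel) / temperature
    z = z - z.max()                         # stabilize the softmax
    e = np.exp(z)
    return e / e.sum()
```

The output is a relaxed (simplex-valued) message that can be fed to other agents during training while remaining differentiable with respect to the logits.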
We train our models until convergence, and then evaluate them by averaging various metrics for 1000 further iterations. We provide the tables and details of our results on all environments in the Appendix, and summarize them here.\n\nFigure 4: The reward of MADDPG against traditional RL approaches on cooperative communication after 25000 episodes.\n\n\f(a) MADDPG  (b) DDPG\n\nFigure 5: Comparison between MADDPG (left) and DDPG (right) on the cooperative communication (CC) and physical deception (PD) environments at t = 0, 5, and 25. Small dark circles indicate landmarks. In CC, the grey agent is the speaker, and the color of the listener indicates the target landmark. In PD, the blue agents are trying to deceive the red adversary, while covering the target landmark (in green). MADDPG learns the correct behavior in both cases: in CC the speaker learns to output the target landmark color to direct the listener, while in PD the agents learn to cover both landmarks to confuse the adversary. DDPG (and other RL algorithms) struggles in these settings: in CC the speaker always repeats the same utterance and the listener moves to the middle of the landmarks, and in PD one agent greedily pursues the green landmark (and is followed by the adversary) while the other agent scatters. See video for full trajectories.\n\nWe \ufb01rst examine the cooperative communication scenario. Despite the simplicity of the task (the speaker only needs to learn to output its observation), traditional RL methods such as DQN, Actor-Critic, a \ufb01rst-order implementation of TRPO, and DDPG all fail to learn the correct behaviour (measured by whether the listener is within a short distance from the target landmark). In practice we observed that the listener learns to ignore the speaker and simply moves to the middle of all observed landmarks.
We plot the learning curves over 25000 episodes for various approaches in Figure 4.\nWe hypothesize that a primary reason for the failure of traditional RL methods in this (and other)\nmulti-agent settings is the lack of a consistent gradient signal. For example, if the speaker utters\nthe correct symbol while the listener moves in the wrong direction, the speaker is penalized. This\nproblem is exacerbated as the number of time steps grows: we observed that traditional policy\ngradient methods can learn when the objective of the listener is simply to reconstruct the observation\nof the speaker in a single time step, or if the initial positions of agents and landmarks are \ufb01xed and\nevenly distributed. This indicates that many of the multi-agent methods previously proposed for\nscenarios with short time horizons (e.g. [16]) may not generalize to more complex tasks.\nConversely, MADDPG agents can learn coordinated behaviour more easily via the centralized critic.\nIn the cooperative communication environment, MADDPG is able to reliably learn the correct listener\nand speaker policies, and the listener is often (84.0% of the time) able to navigate to the target.\nA similar situation arises for the physical deception task: when the cooperating agents are trained\nwith MADDPG, they are able to successfully deceive the adversary by covering all of the landmarks\naround 94% of the time when L = 2 (Figure 5). Furthermore, the adversary success is quite low,\nespecially when the adversary is trained with DDPG (16.4% when L = 2). This contrasts sharply\nwith the behaviour learned by the cooperating DDPG agents, who are unable to deceive MADDPG\nadversaries in any scenario, and do not even deceive other DDPG agents when L = 4.\nWhile the cooperative navigation and predator-prey tasks have a less stark divide between success and\nfailure, in both cases the MADDPG agents outperform the DDPG agents. 
In cooperative navigation, MADDPG agents have a slightly smaller average distance to each landmark, but have almost half the average number of collisions per episode (when N = 2) compared to DDPG agents due to the ease of coordination. Similarly, MADDPG predators are far more successful at chasing DDPG prey (16.1 collisions/episode) than the converse (10.3 collisions/episode).
In the covert communication environment, we found that Bob trained with both MADDPG and DDPG outperforms Eve in terms of reconstructing Alice's message. However, Bob trained with MADDPG achieves a larger relative success rate compared with DDPG (52.4% to 25.1%). Further, only Alice trained with MADDPG can encode her message such that Eve achieves near-random reconstruction accuracy. The learning curve (a sample plot is shown in the Appendix) shows that the oscillation due to the competitive nature of the environment often cannot be overcome with common decentralized RL methods. We emphasize that we do not use any of the tricks required for the cryptography environment from [2], including modifying Eve's loss function, alternating agent and adversary training, and using a hybrid 'mix & transform' feed-forward and convolutional architecture.

Figure 6: Effectiveness of learning by approximating policies of other agents in the cooperative communication scenario. Left: plot of the reward over number of iterations; MADDPG agents quickly learn to solve the task when approximating the policies of others. Right: KL divergence between the approximate policies and the true policies.

5.3 Effect of Learning Policies of Other Agents

We evaluate the effectiveness of learning the policies of other agents in the cooperative communication environment, following the same hyperparameters as the previous experiments and setting λ = 0.001 in Eq. 7. The results are shown in Figure 6.
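Eq. 7 appears earlier in the paper and is not reproduced here; in outline, each agent fits a model of another agent's policy by maximizing the log likelihood of that agent's observed actions plus λ times the model's entropy. The sketch below is an illustrative reading of that objective with a toy linear-softmax model and a hand-derived gradient step, not our actual implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def approx_policy_step(W, obs, actions, lam=0.001, lr=0.1):
    """One gradient-ascent step fitting a linear-softmax model of another
    agent's discrete policy.

    Maximizes the mean log likelihood of the observed actions plus lam times
    the policy entropy (the entropy bonus discourages a degenerate model).
    """
    probs = softmax(obs @ W)                       # (batch, n_actions)
    onehot = np.eye(W.shape[1])[actions]
    # Gradient of mean log-likelihood w.r.t. the logits is (onehot - probs).
    grad_logits = onehot - probs
    # Gradient of entropy w.r.t. logit k is -p_k * (log p_k - sum_j p_j log p_j).
    logp = np.log(probs + 1e-12)
    ent_grad = -probs * (logp - (probs * logp).sum(axis=-1, keepdims=True))
    grad_W = obs.T @ (grad_logits + lam * ent_grad) / len(obs)
    return W + lr * grad_W
```

Repeated application of this step drives the model's predictions toward the empirical action distribution of the observed agent; the centralized critic can then use the approximate policy in place of the true one.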
We observe that despite not \ufb01tting the policies of other\nagents perfectly (in particular, the approximate listener policy learned by the speaker has a fairly\nlarge KL divergence to the true policy), learning with approximated policies is able to achieve the\nsame success rate as using the true policy, without a signi\ufb01cant slowdown in convergence.\n\n5.4 Effect of Training with Policy Ensembles\n\nWe focus on the effectiveness of policy ensembles in competitive environments, including keep-away,\ncooperative navigation, and predator-prey. We choose K = 3 sub-policies for the keep-away and\ncooperative navigation environments, and K = 2 for predator-prey. To improve convergence speed,\nwe enforce that the cooperative agents should have the same policies at each episode, and similarly\nfor the adversaries. To evaluate the approach, we measure the performance of ensemble policies\nand single policies in the roles of both agent and adversary. The results are shown on the right side\nof Figure 3. We observe that agents with policy ensembles are stronger than those with a single\npolicy. In particular, when pitting ensemble agents against single policy adversaries (second to left\nbar cluster), the ensemble agents outperform the adversaries by a large margin compared to when the\nroles are reversed (third to left bar cluster).\n\n6 Conclusions and Future Work\n\nWe have proposed a multi-agent policy gradient algorithm where agents learn a centralized critic\nbased on the observations and actions of all agents. Empirically, our method outperforms traditional\nRL algorithms on a variety of cooperative and competitive multi-agent environments. 
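The policy-ensemble regimen of Section 5.4 above can be sketched as follows: each agent maintains K sub-policies and plays one, drawn uniformly at random, for the duration of each episode. The names below are illustrative (in practice each sub-policy would be a trained actor with its own replay buffer), so this is an outline of the mechanism rather than our implementation:

```python
import random

class EnsembleAgent:
    """Agent holding K sub-policies; one is drawn uniformly per episode.

    `make_policy` is a placeholder for whatever policy constructor is used
    (e.g. a DDPG actor); here each sub-policy is just a callable.
    """
    def __init__(self, make_policy, K):
        self.sub_policies = [make_policy(k) for k in range(K)]
        self.active = None

    def reset_episode(self, rng=random):
        # Sample which sub-policy (and, with it, which replay buffer) to use.
        self.active = rng.randrange(len(self.sub_policies))

    def act(self, observation):
        return self.sub_policies[self.active](observation)
```

Opponents therefore face a mixture of behaviours rather than a single deterministic policy, which is what makes ensemble-trained agents harder to exploit.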
We can further improve the performance of our method by training agents with an ensemble of policies, an approach we believe to be generally applicable to any multi-agent algorithm.
One downside to our approach is that the input space of Q grows linearly (depending on what information is contained in x) with the number of agents N. This could be remedied in practice by, for example, having a modular Q function that only considers agents in a certain neighborhood of a given agent. We leave this investigation to future work.

Acknowledgements

The authors would like to thank Jacob Andreas, Smitha Milli, Jack Clark, Jakob Foerster, and others at OpenAI and UC Berkeley for interesting discussions related to this paper, as well as Jakub Pachocki, Yura Burda, and Joelle Pineau for comments on the paper draft. We thank Tambet Matiisen for providing the code base that was used for some early experiments associated with this paper. Ryan Lowe is supported in part by a Vanier CGS Scholarship and the Samsung Advanced Institute of Technology. Finally, we'd like to thank OpenAI for fostering an engaging and productive research environment.

References

[1] DeepMind AI reduces Google data centre cooling bill by 40%. https://deepmind.com/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-40/. Accessed: 2017-05-19.

[2] M. Abadi and D. G. Andersen. Learning to protect communications with adversarial neural cryptography. arXiv preprint arXiv:1610.06918, 2016.

[3] C. Boutilier. Learning conventions in multiagent stochastic domains using likelihood estimates. In Proceedings of the Twelfth International Conference on Uncertainty in Artificial Intelligence, pages 106–114.
Morgan Kaufmann Publishers Inc., 1996.

[4] L. Busoniu, R. Babuska, and B. De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156, 2008.

[5] G. Chalkiadakis and C. Boutilier. Coordination in multiagent reinforcement learning: a Bayesian approach. In Proceedings of the Second International Joint Conference on Autonomous Agents and Multiagent Systems, pages 709–716. ACM, 2003.

[6] P. Dayan and G. E. Hinton. Feudal reinforcement learning. In Advances in Neural Information Processing Systems, pages 271–271. Morgan Kaufmann Publishers, 1993.

[7] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson. Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926, 2017.

[8] J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson. Learning to communicate with deep multi-agent reinforcement learning. CoRR, abs/1605.06676, 2016.

[9] J. N. Foerster, N. Nardelli, G. Farquhar, P. H. S. Torr, P. Kohli, and S. Whiteson. Stabilising experience replay for deep multi-agent reinforcement learning. CoRR, abs/1702.08887, 2017.

[10] M. C. Frank and N. D. Goodman. Predicting pragmatic reasoning in language games. Science, 336(6084):998–998, 2012.

[11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.

[12] J. K. Gupta, M. Egorov, and M. Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. 2017.

[13] J. Hu and M. P. Wellman. Online learning about other agents in a dynamic multiagent system. In Proceedings of the Second International Conference on Autonomous Agents, AGENTS '98, pages 239–246, New York, NY, USA, 1998. ACM.

[14] E. Jang, S. Gu, and B. Poole.
Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.

[15] M. Lauer and M. Riedmiller. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 535–542. Morgan Kaufmann, 2000.

[16] A. Lazaridou, A. Peysakhovich, and M. Baroni. Multi-agent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182, 2016.

[17] J. Z. Leibo, V. F. Zambaldi, M. Lanctot, J. Marecki, and T. Graepel. Multi-agent reinforcement learning in sequential social dilemmas. CoRR, abs/1702.03037, 2017.

[18] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. arXiv preprint arXiv:1504.00702, 2015.

[19] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, volume 157, pages 157–163, 1994.

[20] L. Matignon, L. Jeanpierre, A.-I. Mouaddib, et al. Coordinated multi-robot exploration under communication constraints using decentralized Markov decision processes. In AAAI, 2012.

[21] L. Matignon, G. J. Laurent, and N. Le Fort-Piat. Hysteretic Q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams. In Intelligent Robots and Systems, 2007. IROS 2007. IEEE/RSJ International Conference on, pages 64–69. IEEE, 2007.

[22] L. Matignon, G. J. Laurent, and N. Le Fort-Piat. Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. The Knowledge Engineering Review, 27(01), 2012.

[23] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[24] I.
Mordatch and P. Abbeel. Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.04908, 2017.

[25] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. CoRR, abs/1703.06182, 2017.

[26] L. Panait and S. Luke. Cooperative multi-agent learning: The state of the art. Autonomous Agents and Multi-Agent Systems, 11(3):387–434, Nov. 2005.

[27] P. Peng, Q. Yuan, Y. Wen, Y. Yang, Z. Tang, H. Long, and J. Wang. Multiagent bidirectionally-coordinated nets for learning to play StarCraft combat games. CoRR, abs/1703.10069, 2017.

[28] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[29] S. Sukhbaatar, R. Fergus, et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pages 2244–2252, 2016.

[30] S. Sukhbaatar, I. Kostrikov, A. Szlam, and R. Fergus. Intrinsic motivation and automatic curricula via asymmetric self-play. arXiv preprint arXiv:1703.05407, 2017.

[31] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, Cambridge, 1998.

[32] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 2000.

[33] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente. Multiagent cooperation and competition with deep reinforcement learning. PloS one, 12(4):e0172395, 2017.

[34] M.
Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pages 330–337, 1993.

[35] G. Tesauro. Extending Q-learning to general adaptive multi-agent systems. In Advances in Neural Information Processing Systems, pages 871–878, 2004.

[36] P. S. Thomas and A. G. Barto. Conjugate Markov decision processes. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 137–144, 2011.

[37] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.