{"title": "Efficient Communication in Multi-Agent Reinforcement Learning via Variance Based Control", "book": "Advances in Neural Information Processing Systems", "page_first": 3235, "page_last": 3244, "abstract": "Multi-agent reinforcement learning (MARL) has recently received considerable attention due to its applicability to a wide range of real-world applications. However, achieving efficient communication among agents has always been an overarching problem in MARL. In this work, we propose Variance Based Control (VBC), a simple yet efficient technique to improve communication efficiency in MARL. By limiting the variance of the exchanged messages between agents during the training phase, the noisy component in the messages can be eliminated effectively, while the useful part can be preserved and utilized by the agents for better performance. \nOur evaluation using multiple MARL benchmarks indicates that our method achieves $2-10\\times$ lower in communication overhead than state-of-the-art MARL algorithms, while allowing agents to achieve better overall performance.", "full_text": "Ef\ufb01cient Communication in Multi-Agent\n\nReinforcement Learning via Variance Based Control\n\nSai Qian Zhang\nHarvard University\n\nQi Zhang\nAmazon Inc.\n\nJieyu Lin\n\nUniversity of Toronto\n\nAbstract\n\nMulti-agent reinforcement learning (MARL) has recently received considerable at-\ntention due to its applicability to a wide range of real-world applications. However,\nachieving ef\ufb01cient communication among agents has always been an overarching\nproblem in MARL. In this work, we propose Variance Based Control (VBC), a\nsimple yet ef\ufb01cient technique to improve communication ef\ufb01ciency in MARL. By\nlimiting the variance of the exchanged messages between agents during the training\nphase, the noisy component in the messages can be eliminated effectively, while the\nuseful part can be preserved and utilized by the agents for better performance. 
Our evaluation using multiple MARL benchmarks indicates that our method achieves 2\u221210\u00d7 lower communication overhead than state-of-the-art MARL algorithms, while allowing agents to achieve better overall performance.\n\n1 Introduction\n\nMany real-world applications (e.g., autonomous driving [16], game playing [12] and robotics control [9]) today require reinforcement learning tasks to be carried out in multi-agent settings. In MARL, multiple agents interact with each other in a shared environment. Each agent only has access to partial observations of the environment, and needs to make local decisions based on partial observations as well as both direct and indirect interactions with the other agents. This complex interaction model has introduced numerous challenges for MARL. In particular, during the training phase, each agent may dynamically change its strategy, causing dynamics in the surrounding environment and instability in the training process. Worse still, each agent can easily overfit its strategy to the behaviours of other agents [11], which may seriously deteriorate the overall performance.\nIn the research literature, there have been three lines of research that try to mitigate the instability and inefficiency caused by decentralized execution. The most common approach is independent Q-learning (IQL) [20], which breaks down a multi-agent learning problem into multiple independent single-agent learning problems, thus allowing each agent to learn and act independently. Unfortunately, this approach does not account for instability caused by environment dynamics, and therefore often suffers from poor convergence. The second approach adopts the centralized training and decentralized execution [18] paradigm, where a joint action value function is learned during the training phase to better coordinate the agents\u2019 behaviours. 
During execution, each agent acts independently without direct communication. The third approach introduces communication among agents during execution [17, 3]. This approach allows each agent to dynamically adjust its strategy based on its local observation along with the information received from the other agents. Nonetheless, it introduces additional communication overhead in terms of latency and bandwidth during execution, and its effectiveness is heavily dependent on the usefulness of the received information.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn this work, we leverage the advantages of both the second and third approaches. Specifically, we consider a fully cooperative scenario where multiple agents collaborate to achieve a common objective. The agents are trained in a centralized fashion within the multi-agent Q-learning framework, and are allowed to communicate with each other during execution. However, unlike previous work, we make a few key observations. First, for many applications, it is often superfluous for an agent to wait for feedback from all surrounding agents before making an action decision. For instance, when the front camera on an autonomous vehicle detects an obstacle within the dangerous distance limit, it triggers the \u2018brake\u2019 signal without waiting for the feedback from the other parts of the vehicle. Second, the feedback received from the other agents may not always provide useful information. For example, the navigation system of the autonomous vehicle should pay more attention to the messages sent by the perception system (e.g., camera, radar), and less attention to the entertainment system inside the vehicle before taking its action. 
The full (i.e., all-to-all) communication pattern among the agents can lead to a significant communication overhead in terms of both bandwidth and latency, which limits its practicality and effectiveness in real applications with strict latency requirements and bandwidth constraints (e.g., real-time traffic signal control, autonomous driving, etc.). In addition, as pointed out by Jiang et al. [7], an excessive amount of communication may introduce useless or even harmful information that can impair the convergence of the learning process.\nMotivated by these observations, we design a novel deep MARL architecture that can significantly improve inter-agent communication efficiency. Specifically, we introduce Variance Based Control (VBC), a simple yet efficient approach to reduce the amount of information transferred between agents. By inserting an extra loss term on the variance of the exchanged information, the meaningful part of the messages can be effectively extracted and utilized to benefit the training of each individual agent. Furthermore, unlike previous work, we do not require an extra decision module to dynamically adjust the communication pattern. This allows us to reduce the model complexity significantly. Instead, each agent first makes a preliminary decision based on its local information, and initiates communication only when its confidence level on this preliminary decision is low. Similarly, upon receiving the communication request, the agent replies to the request only when its message is informative. By only exchanging useful information among the agents, VBC not only improves agent performance, but also substantially reduces communication overhead during execution. 
Lastly, it can be theoretically shown that the resulting training algorithm provides guaranteed stability.\nFor evaluation, we test VBC on several MARL benchmarks, including StarCraft Multi-Agent Challenge [15], Cooperative Navigation (CN) [10] and Predator-prey (PP) [8]. For StarCraft Multi-Agent Challenge, VBC achieves a 20% higher winning rate and 2\u221210\u00d7 lower communication overhead on average compared with the other benchmark algorithms. For both CN and PP scenarios, VBC outperforms the existing algorithms and incurs much lower overhead than existing communication-enabled approaches. A video demo is available at [2] for a better illustration of the VBC performance. The code is available at https://github.com/saizhang0218/VBC.\n\n2 Related Work\n\nThe simplest training method for MARL is to make each agent learn independently using Independent Q-Learning (IQL) [20]. Although IQL is successful in solving simple tasks such as Pong [19], it ignores the environment dynamics arising from the interactions among the agents. As a result, it suffers from poor convergence, making it difficult to handle advanced tasks.\nGiven the recent success of deep Q-learning [12], some recent studies explore the scheme of centralized training and decentralized execution. Sunehag et al. [18] propose Value Decomposition Network (VDN), a method that acquires the joint action value function by summing up the action value functions of all the agents. All the agents are trained as a whole by updating the joint action value function iteratively. QMIX [14] builds on VDN, and utilizes a neural network to represent the joint action value function as a function of the individual action value functions and the global state information. The authors of [10] extend the actor-critic methods to the multi-agent scenario. 
By performing centralized training and decentralized execution over the agents, the agents can better adapt to the changes in the environment and collaborate with each other. Foerster et al. [5] propose counterfactual multi-agent policy gradient (COMA), which employs a centralized critic function to estimate the action value function of the joint action, and decentralized actor functions to make each agent execute independently. All the aforementioned methods assume no communication between the agents during the execution. As a result, many subsequent approaches, including ours, can be applied to improve the performance of these methods.\nLearning the communication pattern for MARL was first proposed by Sukhbaatar et al. [17]. The authors introduce CommNet, a framework that adopts continuous communication for fully cooperative tasks. During the execution, each agent takes its internal state as well as the mean of the internal states of the remaining agents as input to decide on its action. The BiCNet [13] uses a bidirectional coordinated network to connect the agents. However, both schemes require all-to-all communication among the agents, which can cause a significant communication overhead and latency.\nSeveral other proposals [3, 7, 8] use a selection module to dynamically adjust the communication pattern among the agents. In Differentiable Inter-Agent Learning (DIAL) [3], the messages produced by an agent are selectively sent to the neighboring agents through the discretize/regularise unit (DRU). By jointly training DRU with the agent network, the communication overhead can be efficiently reduced. Jiang et al. [7] propose an attentional communication model that learns when the communication is required and how to aggregate the shared information. However, an agent can only talk to the agents within its observable range at each timestep. 
This limits the speed of information propagation, and restricts the possible communication patterns when the local observable field is small. Kim et al. [8] propose a communication scheduling scheme for wireless environments, but only a fraction of the agents can broadcast their messages at each time. In comparison, our approach does not impose hard constraints on the communication pattern, which is beneficial to the learning process. Also, our method does not adopt an additional decision module for communication scheduling, which greatly reduces the model complexity.\n\n3 Background\n\nDeep Q-networks: We consider a standard reinforcement learning problem based on a Markov Decision Process (MDP). At each timestep $t$, the agent observes the state $s_t$, and chooses an action $a_t$. It then receives a reward $r_t$ for its action $a_t$ and proceeds to the next state $s_{t+1}$. The goal is to maximize the total expected discounted reward $R = \\sum_{t=1}^{T} \\gamma^t r_t$, where $\\gamma \\in [0, 1]$ is the discount factor. A Deep Q-Network (DQN) uses a deep neural network to represent the action value function $Q_\\theta(s, a) = \\mathbb{E}[R_t | s_t = s, a_t = a]$, where $\\theta$ represents the parameters of the neural network, and $R_t$ is the total rewards received at and after $t$. During the training phase, a replay buffer is used to store the transition tuples $\\langle s_t, a_t, s_{t+1}, r_t \\rangle$. The action value function $Q_\\theta(s, a)$ can be trained recursively by minimizing the loss $L = \\mathbb{E}_{s_t, a_t, r_t, s_{t+1}}[y_t - Q_\\theta(s_t, a_t)]^2$, where $y_t = r_t + \\gamma \\max_{a_{t+1}} Q_{\\theta'}(s_{t+1}, a_{t+1})$ and $\\theta'$ represents the parameters of the target network. An action is usually selected with the $\\epsilon$-greedy policy, namely selecting the action with maximum action value with probability $1 - \\epsilon$, and choosing a random action with probability $\\epsilon$.\nMulti-agent deep reinforcement learning: We consider an environment with $N$ agents working cooperatively to fulfill a given task. 
At timestep $t$, each agent $i$ ($1 \\leq i \\leq N$) receives a local observation $o^t_i$ and executes an action $a^t_i$. They then receive a joint reward $r^t$ and proceed to the next state. We use a vector $a^t = \\{a^t_i\\}$ to represent the joint actions taken by all the agents. The agents aim to maximize the joint reward by choosing the best joint actions $a^t$ at each timestep $t$.\nDeep recurrent Q-networks: Traditional DQNs generate action solely based on a limited number of local observations without considering the prior knowledge. Hausknecht et al. [6] introduce Deep Recurrent Q-Networks (DRQN), which models the action value function with a recurrent neural network (RNN). The DRQN leverages its recurrent structure to integrate the previous observations and knowledge for better decision-making. At each timestep $t$, the DRQN $Q_\\theta(o^t_i, h^{t-1}_i, a^t_i)$ takes the local observation $o^t_i$ and hidden state $h^{t-1}_i$ from the previous steps as input to yield action values.\nLearning the joint Q-function: Recent research effort has been made on the learning of joint action value function for multi-agent Q-learning. Two representative works are VDN [18] and QMIX [14]. In VDN, the joint action value function $Q_{tot}(o^t, h^{t-1}, a^t)$ is assumed to be the sum of all the individual action value functions, i.e. $Q_{tot}(o^t, h^{t-1}, a^t) = \\sum_i Q_i(o^t_i, h^{t-1}_i, a^t_i)$, where $o^t = \\{o^t_i\\}$, $h^t = \\{h^t_i\\}$ and $a^t = \\{a^t_i\\}$ are the collection of the observations, hidden states and actions of all the agents at timestep $t$ respectively. QMIX employs a neural network to represent the joint value function $Q_{tot}(o^t, h^{t-1}, a^t)$ as a nonlinear function of $Q_i(o^t_i, h^{t-1}_i, a^t_i)$ and global state $s^t$.\n\n4 Variance Based Control\n\nIn this section, we present the detailed design of VBC in the context of multi-agent Q-learning. 
The main idea of VBC is to improve agent performance and communication efficiency by limiting the variance of the transferred messages. During execution, each agent communicates with other agents only when its local decision is ambiguous. The degree of ambiguity is measured by the difference between the top two largest action values. Upon receiving the communication request from other agents, the agent replies only if its feedback is informative, namely the variance of the feedback is high.\n\nFigure 1: (a) Agent network structure of agent 1, which consists of a local action generator, a combiner and several message encoders. (b) The mixing network takes the output $Q_i(o^t, h^{t-1}, a^t_i)$ from the network of each agent $i$, and performs centralized training. $c^t_{-i}$ denotes all the $c^t_j$ with $j \\neq i$.\n\n4.1 Agent Network Design\n\nThe agent network consists of the following three networks: local action generator, message encoder and combiner. Figure 1(a) describes the network architecture for agent 1. The local action generator consists of a Gated Recurrent Unit (GRU) and a fully connected layer (FC). For agent $i$, the GRU takes the local observation $o^t_i$ and the hidden state $h^{t-1}_i$ as the inputs, and generates the intermediate results $c^t_i$. $c^t_i$ is then sent to the FC layer, which outputs the local action values $Q_i(o^t_i, h^{t-1}_i, a^t_i)$ for each action $a^t_i \\in A$, where $A$ is the set of possible actions. The message encoder, $f^{ij}_{enc}(.)$, is a multi-layer perceptron (MLP) which contains two FC layers and a leaky ReLU layer. The agent network involves multiple independent message encoders, each of which accepts $c^t_j$ from another agent $j$ ($j \\neq i$), and outputs $f^{ij}_{enc}(c^t_j)$. 
The outputs from the local action generator and the message encoders are then sent to the combiner, which produces the global action value function $Q_i(o^t, h^{t-1}, a^t_i)$ of agent $i$ by taking into account the global observation $o^t$ and global history $h^{t-1}$. To simplify the design and reduce model complexity, we do not introduce extra parameters for the combiner. Instead, we make the dimension of $f^{ij}_{enc}(c^t_j)$ the same as that of the local action values $Q_i(o^t_i, h^{t-1}_i, .)$, and hence the combiner can simply perform elementwise summation over its inputs, namely $Q_i(o^t, h^{t-1}, .) = Q_i(o^t_i, h^{t-1}_i, .) + \\sum_{j \\neq i} f^{ij}_{enc}(c^t_j)$. The combiner chooses the action with the $\\epsilon$-greedy policy $\\pi(.)$. Let $\\theta^i_{local}$ and $\\theta^{ij}_{enc}$ denote the set of parameters of the local action generators and the message encoders, respectively. To prevent the lazy agent problem [18] and decrease the model complexity, we make $\\theta^i_{local}$ the same for all $i$, and make $\\theta^{ij}_{enc}$ the same for all $i$ and $j$ ($j \\neq i$). Accordingly, we can drop the corner scripts and use $\\theta = \\{\\theta_{local}, \\theta_{enc}\\}$ and $f_{enc}(.)$ to denote the agent network parameters and the message encoder.\n\n4.2 Loss Function Definition\n\nDuring the training phase, the message encoder and local action generator jointly learn to generate the best estimation on the action values. More specifically, we employ a mixing network (shown in Figure 1(b)) to aggregate the global action value functions $Q_i(o^t, h^{t-1}, a^t_i)$ from each agent $i$, and yield the joint action value function, $Q_{tot}(o^t, h^{t-1}, a^t)$. To limit the variance of the messages from the other agents, we introduce an extra loss term on the variance of the outputs of the message encoders $f_{enc}(c^t_j)$. 
The loss function during the training phase is defined as:\n\n$L(\\theta_{local}, \\theta_{enc}) = \\sum_{b=1}^{B} \\sum_{t=1}^{T} \\left[ (y^b_{tot} - Q_{tot}(o^b_t, h^b_{t-1}, a^b_t; \\theta))^2 + \\lambda \\sum_{i=1}^{N} Var(f_{enc}(c^{t,b}_i)) \\right]$   (1)\n\nwhere $y^b_{tot} = r^b_t + \\gamma \\max_{a_{t+1}} Q_{tot}(o^b_{t+1}, h^b_t, a_{t+1}; \\theta^-)$, $\\theta^-$ is the parameter of the target network which is copied from $\\theta$ periodically, $Var(.)$ is the variance function and $\\lambda$ is the weight of the loss on it. $b$ is the batch index. The replay buffer is refreshed periodically by running each agent network and selecting the action which maximizes $Q_i(o^t, h^{t-1}, .)$.\n\nFigure 2: In (a), since the difference between the largest and second largest action values is greater than $\\delta_1$, no communication is required. In (b), agent 1 broadcasts a request to agents 2 and 3; only agent 2 replies to the request since the variance of $f_{enc}(c^t_2)$ is greater than $\\delta_2$.\n\nAlgorithm 1: Communication protocol at agent $i$\nInput: Confidence threshold of local actions $\\delta_1$, threshold on variance of message encoder output $\\delta_2$. Total number of agents $N$.\nfor $t \\in T$ do\n    // Decision on the action of itself:\n    Compute local action values $Q_i(o^t_i, h^{t-1}_i, .)$. Denote $m_1$, $m_2$ the top two largest values of $Q_i(o^t_i, h^{t-1}_i, .)$.\n    if $m_1 - m_2 \\geq \\delta_1$ then\n        Let $Q_i(o^t, h^{t-1}, .) = Q_i(o^t_i, h^{t-1}_i, .)$.\n    else\n        Broadcast a request to the other agents, and receive the $f_{enc}(c^t_j)$ from $N_{reply}$ ($N_{reply} \\leq N$) agents.\n        Let $Q_i(o^t, h^{t-1}, .) = Q_i(o^t_i, h^{t-1}_i, .) + \\sum_{j=1}^{N_{reply}} f_{enc}(c^t_j)$.\n    // Generating reply messages for the other agents:\n    Calculate variance of $f_{enc}(c^t_i)$; if $Var(f_{enc}(c^t_i)) \\geq \\delta_2$, store $f_{enc}(c^t_i)$ in the buffer.\n    if $Var(f_{enc}(c^t_i)) \\geq \\delta_2$ and a request is received from agent $j$ then\n        Reply to the request from agent $j$ with $f_{enc}(c^t_i)$.\n\n4.3 Communication Protocol Design\n\nDuring the execution phase, at every timestep $t$, agent $i$ first computes the local action value function $Q_i(o^t_i, h^{t-1}_i, .)$ and $f_{enc}(c^t_i)$. It then measures the confidence level on the local decision by computing the difference between the largest and the second largest element within the action values. An example is given in Figure 2(a). Assume agent 1 has three actions to select, and the output of the local action generator of agent 1 is $Q_1(o^t_1, h^{t-1}_1, .) = (0.1, 1.6, 3.8)$. The difference between the largest and the second largest action values is $3.8 - 1.6 = 2.2$, which is greater than the threshold $\\delta_1 = 1.0$. Given the fact that the variance of message encoder outputs $f_{enc}(c^t_j)$ from agents 2 and 3 is relatively small due to the additional penalty term on variance in equation (1), it is highly likely that the global action value function $Q_1(o^t, h^{t-1}, .)$ also has the largest value in its third element. Therefore agent 1 does not have to talk to other agents to acquire $f_{enc}(c^t_j)$. 
Otherwise, agent 1 broadcasts a request to ask for help if its confidence level on the local decision is low. Because the request does not contain any actual data, it consumes very low bandwidth. Upon receiving the request, only the agents whose message has a large variance reply (Figure 2(b)), because their messages may change the current action decision of agent 1. This protocol not only reduces the communication overhead considerably, but also eliminates noisy, less informative messages that may impair the overall performance. The detailed protocol and operations performed at an agent $i$ are summarized in Algorithm 1.\n\n[Figure 2 panels. (a) Agent 1 has action values (0.1, 1.6, 3.8); since $3.8 - 1.6 = 2.2 > \\delta_1 = 1.0$, there is no communication between the agents. (b) Agent 1 has action values (0.1, 2.2, 2.3); since $2.3 - 2.2 = 0.1 < \\delta_1 = 1.0$, agent 1 broadcasts a request to all the agents. One message has Var = 0.044 > $\\delta_2$ = 0.02 and the other has Var = 0.007 < $\\delta_2$ = 0.02; the panel also shows the action values (0.14, 2.6, 2.2).]\n\n5 Convergence Analysis\n\nIn this section, we analyze convergence of the learning process with the loss function defined in equation (1) under the tabular setting. For the sake of simplicity, we ignore the dependency of the action value function on the previous knowledge $h^t$. To minimize equation (1), given the initial state $Q^0$, at iteration $k$, the Q values in the table are updated according to the following rule:\n\n$Q^{k+1}_{tot}(o_t, a_t) = Q^k_{tot}(o_t, a_t) + \\eta_k \\left[ r_t + \\gamma \\max_a Q^k_{tot}(o_{t+1}, a) - Q^k_{tot}(o_t, a_t) - \\lambda \\sum_{i=1}^{N} \\frac{\\partial Var(f_{enc}(c^t_i))}{\\partial Q^k_{tot}(o_t, a_t)} \\right]$   (2)\n\nwhere $\\eta_k$, $Q^k_{tot}(.)$ are the learning rate and the joint action value function at iteration $k$ respectively. Let $Q^*_{tot}(.)$ denote the optimal joint action value function. We have the following result on the convergence of the learning process. A detailed proof is given in the supplementary materials.\n\nTheorem 1. Assume $0 \\leq \\eta_k \\leq 1$, $\\sum_k \\eta_k = \\infty$, $\\sum_k \\eta_k^2 < \\infty$. Also assume the number of possible actions and states are finite. By performing equation (2) iteratively, we have $||Q^k_{tot}(o_t, a_t) - Q^*_{tot}(o_t, a_t)|| \\leq \\lambda N G$ $\\forall o_t, a_t$, as $k \\to \\infty$, where $G$ satisfies $\\left|\\left| \\frac{\\partial Var(f_{enc}(c^t_i))}{\\partial Q^k_{tot}(o_t, a_t)} \\right|\\right| \\leq G$, $\\forall i, k, t, o_t, a_t$.\n\n6 Experiment\n\nWe evaluated the performance of VBC on the StarCraft Multi-Agent Challenge (SMAC) [15]. StarCraft II [1] is a real-time strategy (RTS) game that has recently been utilized as a benchmark by the reinforcement learning community [14, 5, 13, 4]. In this work, we focus on the decentralized micromanagement problem in StarCraft II, which involves two armies, one controlled by the user (i.e. a group of agents), and the other controlled by the built-in StarCraft II AI. The goal of the user is to control its allied units to destroy all enemy units, while minimizing received damage on each unit. We consider six different battle settings. Three of them are symmetrical battles, where both the user and the enemy groups consist of 2 Stalkers and 3 Zealots (2s3z), 3 Stalkers and 5 Zealots (3s5z), and 1 Medivac, 2 Marauders and 7 Marines (MMM) respectively. The other three are asymmetric battles, where the user and enemy groups have different army unit compositions, including: 3 Stalkers for user versus 4 Zealots for enemy (3s_vs_4z), 6 Hydralisks for user versus 8 Zealots for enemy (6h_vs_8z), and 6 Zealots for user versus 24 Zerglings for enemy (6z_vs_24zerg). The asymmetric battles are considered to be harder than the symmetrical battles because of the difference in army size.\nAt each timestep, each agent controls a single unit to perform an action, including move[direction], attack[enemy_id], stop and no-op. Each agent has a limited sight range and shooting range, where shooting range is less than the sight range. 
The attack operation is available only when the enemies are within the shooting range. The joint reward received by the allied units equals the total damage inflicted on enemy units. Additionally, the agents are rewarded 100 extra points after killing each enemy unit, and 200 extra points for killing the entire army. The user wins the battle only when the allied units kill all the enemies within the time limit. Otherwise the built-in AI wins. The input observation of each agent is a vector that consists of the following information of each allied unit and enemy unit in its sight range: relative x, y coordinates, relative distance and agent type. For the detailed game settings, hyperparameters, and additional experiment evaluation over other test environments, please refer to the supplementary materials.\n\n6.1 Results\n\nWe compare VBC and several benchmark algorithms, including VDN [18], QMIX [14] and SchedNet [8] for controlling allied units. We consider two types of VBCs by adopting the mixing networks of VDN and QMIX, denoted as VBC+VDN and VBC+QMIX. The mixing network of VDN simply computes the elementwise summation across all the inputs, and the mixing network of QMIX deploys a neural network whose weights are derived from the global state $s^t$. The detailed architecture of this mixing network can be found in [14]. Additionally, we create an algorithm FC (full communication) by removing the penalty in equation (1), and dropping the limit on variance during the execution phase (i.e., $\\delta_1 = \\infty$ and $\\delta_2 = -\\infty$). The agents are trained with the same network architecture shown in Figure 1, and the mixing network of VDN is used. For SchedNet, at every timestep only K out of N agents can broadcast their messages by using the Top(k) scheduling policy [8]. We usually set K close to 0.5N, that is, each time roughly half of the allied units can broadcast their messages. 
The VBC agents are trained for different numbers of episodes based on the difficulties of the battles, which we describe in detail next.\n\nFigure 3: Winning rates for the six tasks: (a) MMM, (b) 2s3z, (c) 3s5z, (d) 3s_vs_4z, (e) 6h_vs_8z, (f) 6z_vs_24zerg. The shaded regions represent the 95% confidence intervals.\n\nTo measure the convergence speed of each algorithm, we stop the training process and save the current model every 200 training episodes. We then run 20 test episodes and measure the winning rates for these 20 episodes. For VBC+VDN and VBC+QMIX, the winning rates are measured by running the communication protocol described in Algorithm 1. For easy tasks, namely MMM and 2s3z, we train the algorithms with 2 million and 4 million episodes respectively. For all the other tasks, we train the algorithms with 10 million episodes. Each algorithm is trained 15 times. Figure 3 shows the average winning rate and 95% confidence interval of each algorithm for all the six tasks. For hyperparameters used by VBC (i.e., $\\lambda$ used in equation (1), $\\delta_1$ and $\\delta_2$ in Algorithm 1), we first search for a coarse parameter range based on random trial, experience and message statistics. We then perform a random search within a smaller hyperparameter space. Best selections are shown in the legend of each figure.\nWe observe that the algorithms that involve communication (i.e., SchedNet, FC, VBC) outperform the algorithms without communication (i.e., VDN, QMIX) in all the six tasks. This is a clear indication that communication benefits the performance. Moreover, both VBC+VDN and VBC+QMIX achieve better winning rates than SchedNet, because SchedNet only allows a fixed number of agents to talk at every timestep, which prevents some key information from being exchanged in a timely fashion. Finally, VBC achieves similar performance to FC and even outplays FC for some tasks (e.g., 2s3z, 6h_vs_8z, 6z_vs_24zerg). 
This is because a fair amount of communication between the agents is noisy and redundant. By eliminating these undesired messages, VBC is able to achieve both communication efficiency and performance gain.\n\n6.2 Communication Overhead\n\nWe now evaluate the communication overhead of VBC. To quantify the amount of communication involved, we run Algorithm 1 and count the total number of pairs of agents $g_t$ that conduct communication for each timestep $t$, then divide by the total number of pairs of agents in the user group, $R$. In other words, the communication overhead $\\beta = \\sum_{t=1}^{T} g_t / (RT)$. As an example, for the task 3s_vs_4z, since the user controls 3 Stalkers, the total number of agent pairs is $R = 3 \\times 2 = 6$. Within these 6 pairs of agents, suppose that 2 pairs involve communication, then $g_t = 2$. Table 1 shows the $\\beta$ of VBC+VDN, VBC+QMIX and SchedNet across all the test episodes at the end of the training phase of each battle. For SchedNet, $\\beta$ simply equals the ratio between the number of allied agents that are allowed to talk and the total number of allied agents. 
As shown in Table 1, in contrast to SchedNet, VBC+VDN and VBC+QMIX produce 10\u00d7 lower communication overhead for MMM and 2s3z, and 2\u22126\u00d7 less traffic for the rest of the tasks.\n\n[Figure 3 plots: winning rate versus training episodes for VDN, QMIX, VBC+VDN, VBC+QMIX, FC and SchedNet on each task. Legend values per task, read as ($\\lambda$, $\\delta_1$, $\\delta_2$) for VBC and the number of broadcasting agents for SchedNet:\nMMM: VBC+VDN (5.0, 0.04, 0.02); VBC+QMIX (5.0, 0.04, 0.02); SchedNet (5 agents)\n2s3z: VBC+VDN (4.0, 0.03, 0.015); VBC+QMIX (4.0, 0.03, 0.015); SchedNet (3 agents)\n3s5z: VBC+VDN (5.0, 0.1, 0.15); VBC+QMIX (5.0, 0.1, 0.15); SchedNet (5 agents)\n3s_vs_4z: VBC+VDN (2.0, 0.06, 0.002); VBC+QMIX (2.0, 0.07, 0.004); SchedNet (1 agent)\n6h_vs_8z: VBC+VDN (1.7, 0.23, 0.1); VBC+QMIX (1.7, 0.11, 0.03); SchedNet (4 agents)\n6z_vs_24zerg: VBC+VDN (5.0, 0.04, 0.023); VBC+QMIX (5.0, 0.04, 0.019); SchedNet (3 agents)]\n\nFigure 4: Strategies and communication pattern for different scenarios: (a) Communication overhead, (b) VBC (6h_vs_8z), (c) QMIX and VDN (6h_vs_8z), (d) VBC (3s_vs_4z), (e) VBC (6z_vs_24zerg, stage 1), (f) VBC (6z_vs_24zerg, stage 2).\n\nTable 1: Communication overhead $\\beta$\nTask            VBC+VDN    VBC+QMIX    SchedNet\nMMM             5.25%      5.36%       50%\n2s3z            4.33%      4.68%       60%\n3s5z            27.70%     28.13%      62.5%\n3s_vs_4z        5.07%      5.19%       33.3%\n6h_vs_8z        35.93%     36.16%      66.7%\n6z_vs_24zerg    12.13%     13.35%      50%\n\n6.3 Learned Strategy\n\nIn this section, we examine the behaviors of the agents in order to better understand the strategies adopted by the different algorithms. 
We have made a video demo available at [2] for better illustration.\nFor asymmetric battles, the number of allied units is smaller than the number of enemy units, and therefore the\nagents are prone to being attacked by the enemies. This is exactly what happened to the QMIX and\nVDN agents on 6h_vs_8z, as shown in Figure 4(c). Figure 4(b) shows the strategy of VBC: all the\nHydralisks are placed in a row at the bottom margin of the map. Due to the limited size of the map,\nthe Zealots cannot go beyond the margin to surround the Hydralisks. The Hydralisks then focus their\nfire to kill each Zealot. Figure 4(a) shows the change in \u03b2 over a sample test episode. We observe\nthat most of the communication occurs at the beginning of the episode. This is because the\nHydralisks need to talk in order to arrange themselves in a row formation. After the arrangement is formed, no\ncommunication is needed until the arrangement is broken by the deaths of some Hydralisks, as\nindicated by the short spikes near the end of the episode. Finally, SchedNet and FC utilize a strategy\nsimilar to that of VBC. Nonetheless, due to the restriction on the communication pattern, the row formed by\nthe allied agents is usually not well formed, and can be easily broken by the enemies.\nIn the 3s_vs_4z scenario, the Stalkers have a larger attack range than the Zealots. All the algorithms adopt\na kiting strategy where the Stalkers form a group and attack the Zealots while kiting them. For\nVBC and FC, at each timestep only the agents that are far from the enemies attack, and the rest of\nthe agents (usually the healthier ones) are used as a shield to protect the firing agents (Figure 4(d)).\nCommunication only occurs when the group is broken and needs to realign. In contrast, VDN and\nQMIX do not have this attacking pattern: all the Stalkers always fire simultaneously, and therefore\nthe Stalkers closest to the Zealots are killed first.\n
SchedNet and FC also adopt a policy similar to that of\nVBC, but the attacking pattern of the Stalkers is less regular, i.e., the Stalkers close to the Zealots\nalso fire occasionally.\n\n[Figure 4(a): communication overhead over a sample test episode. x-axis: test episodes (shown in percentage); y-axis: communication overhead. Annotations: agents talk with each other to discuss the arrangement; once the arrangement is settled, no communication is needed; communication is needed for keeping the arrangement between the agents.]\n\n(b) Results on Cooperative Navigation (#agents = 6)\n\nMethods   Avg. dist  #collisions\nVBC+VDN   2.687      0.169\nSchedNet  2.798      0.176\nFC        2.990      0.161\nVDN       3.886      1.872\n\nFigure 5: (a) Results on PP with 3 predators and 3 prey. For SchedNet, we allow 1 predator/prey to\nbroadcast messages. (b) Results of CN. For SchedNet, we allow 3 agents to broadcast messages.\n\n6z_vs_24zerg is the toughest scenario in our experiment. For QMIX and VDN, the 6 Zealots are\nsurrounded and killed by the 24 Zerglings shortly after the episode starts. In contrast, VBC first separates\nthe agents into two groups, with two Zealots and four Zealots respectively (Figure 4(e)). The two\nZealots attract most of the Zerglings to a place far away from the remaining four Zealots, and are killed\nshortly afterwards. Due to the limited sight range of the Zerglings, they cannot find the remaining four Zealots.\nMeanwhile, the four Zealots easily kill the few Zerglings that were not lured away and then search for the rest of the Zerglings.\nThe four Zealots take advantage of the short sight range of the Zerglings. Each time, the four Zealots adjust\ntheir positions so that they can be seen by only a small number of the Zerglings; the\nbaited Zerglings are then killed easily (Figure 4(f)). For VBC, communication only occurs at\nthe beginning of the episode, when the Zealots are separated into two groups, and near the end of the\nepisode, when the four Zealots adjust their positions.\n
Both FC and SchedNet learn the strategy of splitting\nthe Zealots into two groups, but they fail to fine-tune their positions to kill the remaining Zerglings.\nFor symmetrical battles, the tasks are less challenging, and we see smaller disparities in the performance\nof the algorithms. For 2s3z and 3s5z, the VDN agents attack the enemies blindly without any\ncooperation. The QMIX agents learn to focus fire and protect the Stalkers. The agents of VBC, FC\nand SchedNet adopt a more aggressive policy, where the allied Zealots try to surround and kill the\nenemy Zealots first, and then attack the enemy Stalkers in collaboration with the allied Stalkers. This\nis extremely effective because Zealots counter Stalkers, so it is important to kill the enemy Zealots\nbefore they damage the allied Stalkers. For VBC, the communication occurs mostly when the allied\nZealots try to surround the enemy Zealots. For MMM, almost all the methods learn the optimal\npolicy, namely killing the Medivac first, then attacking the rest of the enemy units cooperatively.\n\n6.4 Evaluation on Cooperative Navigation and Predator-prey\n\nTo demonstrate the applicability of VBC in more general settings, we have tested VBC on two more\nscenarios: (1) Cooperative Navigation (CN), which is a cooperative scenario, and (2) Predator-prey\n(PP), which is a competitive scenario. The game settings are the same as those used in [10]\nand [8], respectively. We train each method until convergence and test the resulting models for 2000\nepisodes. For PP, we make the agents of VBC compete against the agents of the other methods, and\nreport the normalized score of Predator (Figure 5(a)). For CN, we report the average distance between\nagents and their destinations, and the average number of collisions (Figure 5(b)). We notice that methods\nwhich allow communication (i.e., SchedNet, FC, VBC) outperform the others on both tasks, and\nVBC achieves the best performance.\n
Moreover, in both scenarios, VBC incurs 10\u00d7 and 3\u00d7 lower\ncommunication overhead than FC and SchedNet, respectively. In CN, most of the communication of\nVBC occurs when the agents are close to each other, to prevent collisions. In PP, the communication of\nVBC occurs mainly to rearrange agent positions for better coordination. These observations confirm\nthat VBC can be applied to a variety of MARL scenarios with great effectiveness.\n\n[Figure 5(a): Performance on PP (3 predators, 3 prey). y-axis: normalized score of Predator; groups: VBC as predator, VBC as prey; matchups: VBC and SchedNet, VBC and FC, VBC and VDN.]\n\n7 Conclusion\n\nIn this work, we propose VBC, a simple and effective approach to achieve efficient communication\namong agents in MARL. By constraining the variance of the exchanged messages during the training\nphase, VBC improves communication efficiency while enabling better cooperation among the agents.\nThe test results on multiple MARL benchmarks indicate that VBC significantly outperforms the other state-of-the-\nart methods in terms of both performance and communication overhead.\n\nReferences\n[1] Starcraft official game site. https://starcraft2.com/.\n[2] VBC video demo. https://bit.ly/2VFkvCZ.\n[3] J. Foerster, I. A. Assael, N. de Freitas, and S. Whiteson. Learning to communicate with deep\nmulti-agent reinforcement learning. In Advances in Neural Information Processing Systems,\npages 2137\u20132145, 2016.\n\n[4] J. N. Foerster, C. A. S. de Witt, G. Farquhar, P. H. Torr, W. Boehmer, and S. Whiteson. Multi-agent\ncommon knowledge reinforcement learning. arXiv preprint arXiv:1810.11702, 2018.\n\n[5] J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson. Counterfactual multi-agent\npolicy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.\n\n[6] M. Hausknecht and P. Stone. Deep recurrent Q-learning for partially observable MDPs. In 2015\nAAAI Fall Symposium Series, 2015.\n\n[7] J.\n
Jiang and Z. Lu. Learning attentional communication for multi-agent cooperation. In\nAdvances in Neural Information Processing Systems, pages 7254\u20137264, 2018.\n\n[8] D. Kim, S. Moon, D. Hostallero, W. J. Kang, T. Lee, K. Son, and Y. Yi. Learning to schedule\ncommunication in multi-agent reinforcement learning. arXiv preprint arXiv:1902.01554, 2019.\n\n[9] J. Kober, J. A. Bagnell, and J. Peters. Reinforcement learning in robotics: A survey. The\nInternational Journal of Robotics Research, 2013.\n\n[10] R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch. Multi-agent actor-critic for\nmixed cooperative-competitive environments. In Advances in Neural Information Processing\nSystems, pages 6379\u20136390, 2017.\n\n[11] M. Lanctot, V. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. P\u00e9rolat, D. Silver, and T. Graepel.\nA unified game-theoretic approach to multiagent reinforcement learning. In Advances in\nNeural Information Processing Systems, 2017.\n\n[12] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller.\nPlaying Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.\n\n[13] P. Peng, Y. Wen, Y. Yang, Q. Yuan, Z. Tang, H. Long, and J. Wang. Multiagent bidirectionally-\ncoordinated nets: Emergence of human-level coordination in learning to play StarCraft combat\ngames. arXiv preprint arXiv:1703.10069, 2017.\n\n[14] T. Rashid, M. Samvelyan, C. S. de Witt, G. Farquhar, J. Foerster, and S. Whiteson. QMIX:\nMonotonic value function factorisation for deep multi-agent reinforcement learning. arXiv\npreprint arXiv:1803.11485, 2018.\n\n[15] M. Samvelyan, T. Rashid, C. S. de Witt, G. Farquhar, N. Nardelli, T. G. J. Rudner, C.-M. Hung,\nP. H. S. Torr, J. Foerster, and S. Whiteson. The StarCraft Multi-Agent Challenge. CoRR,\nabs/1902.04043, 2019.\n\n[16] S. Shalev-Shwartz, S. Shammah, and A. Shashua. Safe, multi-agent, reinforcement learning for\nautonomous driving.\n
arXiv preprint arXiv:1610.03295, 2016.\n\n[17] S. Sukhbaatar, R. Fergus, et al. Learning multiagent communication with backpropagation. In\nAdvances in Neural Information Processing Systems, pages 2244\u20132252, 2016.\n\n[18] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot,\nN. Sonnerat, J. Z. Leibo, K. Tuyls, et al. Value-decomposition networks for cooperative\nmulti-agent learning. arXiv preprint arXiv:1706.05296, 2017.\n\n[19] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, and R. Vicente.\nMultiagent cooperation and competition with deep reinforcement learning. PloS one,\n12(4):e0172395, 2017.\n\n[20] M. Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings\nof the Tenth International Conference on Machine Learning. IEEE, 1993.\n", "award": [], "sourceid": 1823, "authors": [{"given_name": "Sai Qian", "family_name": "Zhang", "institution": "Harvard University"}, {"given_name": "Qi", "family_name": "Zhang", "institution": "Amazon"}, {"given_name": "Jieyu", "family_name": "Lin", "institution": "University of Toronto"}]}