NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:7196
Title:Biases for Emergent Communication in Multi-agent Reinforcement Learning

Reviewer 1

The authors present two losses for improving emergent communication, addressing concerns laid out in previous work (Lowe 2019). One is the concern that the speaker agent may be communicating generic messages and not ones relevant to the particular situation. A loss here encourages the agent to send messages that are correlated with their observation, based on maximizing the mutual information between them. A second concern is that the listener agent may not be conditioning their behavior on the communication, and in this case an extra loss encourages the listener's actions to depend on the messages received. Both constraints are intuitive, and phrasing them as losses doesn't seem to be particularly challenging. However, as with many issues in emergent communication, such a judgement may gloss over hidden difficulties in a complex optimization problem, and this appears to be the case here, requiring some non-obvious sidestepping to provide losses with better convergence. We're not supplied a detailed analysis of what failed, but at face value, having the losses formulated in a way that has been shown to be useful for optimization is a useful contribution to all researchers working in this area (regardless of the conceptual simplicity of the idea).

There are some concerns about the long-term usefulness of these biases, especially of the positive listening loss. While this bias may make sense in simple optimization problems, in reality, a change in agent strategy is not necessarily indicative of listening, nor is the lack of such change an indication of a poor listener. I was a bit surprised not to see any discussion of how realistic these loss-motivating assumptions are in the bigger picture. But this is a minor concern, as we are still very much in the realm of toy examples. As for the evaluation, overall, strong improvements are shown for each of the losses individually over the baseline, and for both losses used together.
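To make the two biases described above concrete, here is a minimal sketch (hypothetical function names and tensor shapes, not the authors' exact formulation): positive signalling maximizes the mutual information between messages and observations, while positive listening penalizes the listener for ignoring the message.

```python
import numpy as np

def positive_signalling_loss(msg_probs):
    """Negative mutual information I(m; o) between messages and observations.

    msg_probs: array of shape (batch, vocab), the speaker's message
    distribution p(m | o_i) for each observation in the batch.
    (Hypothetical shapes; the paper's actual loss also involves
    per-message entropy terms not reproduced here.)
    """
    eps = 1e-8
    # H(m): entropy of the marginal (batch-averaged) message distribution.
    marginal = msg_probs.mean(axis=0)
    h_marginal = -np.sum(marginal * np.log(marginal + eps))
    # H(m | o): average entropy of the per-observation distributions.
    h_conditional = -np.mean(np.sum(msg_probs * np.log(msg_probs + eps), axis=1))
    mutual_info = h_marginal - h_conditional
    return -mutual_info  # minimizing this maximizes I(m; o)

def positive_listening_loss(policy_with_msg, policy_no_msg):
    """Negative L1 distance between the listener's action distributions
    computed with and without conditioning on the received message."""
    return -np.abs(policy_with_msg - policy_no_msg).sum(axis=-1).mean()
```

A speaker that sends a distinct message per observation gets a lower (better) signalling loss than one that sends the same message everywhere, and a listener whose policy shifts when a message arrives gets a lower listening loss than one that ignores it.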
I thought overall the experiments were well chosen -- one where the RL environment was reduced to a trivial degree, and one which more closely reflects the types of domains used in related work. Both seem well-controlled. The baselines seemed a bit too outmatched here, which raises the question of whether there are some optimization tricks that could have been utilized, rather than setting up difficult problems where the baseline is inclined to fail. But assuming the "worst case", being presented with such a problem where no curriculum learning or reward shaping options are available, the provided methods show improvements both in the percentage of good strategies reached and in the rewards collected from such policies. The paper is also very clearly written, and so while the technical difficulty of the methods presented here is not high, nor are they particularly novel, the paper represents a concise contribution of effective optimization strategies, broadly applicable to the emergent communication literature, and is certainly ready for acceptance at some venue. Perhaps its greatest shortcoming is simply whether it meets the NeurIPS bar in terms of its novelty/technicality.

Some in-line comments:
- L 127: (2) is formulated with trajectories, but it wasn't clear if this more general interpretation was useful. Does this have a noticeable effect over a single state? In general, states and trajectories feel a bit mixed together throughout S3/3.1.
- L 139/143.2: The loss notation here changes from L_S to L_ps
- L 143/Alg 1: Inconsistent period usage
- L 143/Alg 1: Lines 3-12 should be clearly marked as documentary/comments
- L 143.15: What is the purpose of having the hidden state passed to p_t^i? In Alg 2 this makes sense, as it is required for conditioning on future roll-outs, but in the positive signalling loss too?
- Fig 2: These are compelling plots, but presumably the data samples are chosen at random and from the full label size.
The losses are useful compared to baselines without them, but have you considered approaching the problem from the data side? I would be interested to know if simplifying the problem (in terms of label size) and gradually increasing it / setting a curriculum would have a positive effect on the baselines. It appears the baselines are just set up too poorly here, with little effort given to "make them work".
- L 222: Is Tab 1 referenced?
- L 242: typically not a context for a semi-colon

Reviewer 2

Summary: The paper proposes the use of two extra loss terms that encourage positive signalling and listening in multi-agent reinforcement learning settings where agents have access to a communication channel. The authors show that this leads to agents that learn to use the communication channel more robustly (across different runs) compared to agents that are not trained with these extra losses (or intrinsic rewards), or that use only one of these terms.

Strengths:
- The paper is generally clear and well-structured.
- The paper addresses an important problem in MARL and, in particular, attempts to tackle the more challenging and realistic setting of decentralized training and execution of the agents, without direct access to other agents’ parameters, rewards, actions, states, or internal beliefs and preferences.
- I like the various discussions throughout the paper that provide explanations and intuitions for the behavior of the agents or the effects of different losses on their performance (e.g. section 4.2).
- The paper appears to be technically correct.
- I also appreciated the fact that the authors openly acknowledged some of the limitations of their method (e.g. section 4.2).

Weaknesses:
- The paper could benefit from comparisons against stronger and more diverse baselines, such as the method proposed by Jaques et al. 2019 (which also uses decentralized training and execution), Foerster et al. 2016 (e.g. RIAL, which is concerned with the same setting), or Sukhbaatar et al. 2016.
- The paper lacks a large number of references to related work. The auxiliary losses proposed are a form of reward shaping for improving MARL algorithms (e.g. prosociality, curiosity, empowerment, optimistic Q-learning, etc.), on which there exists a large body of work which is not discussed in the paper. Some examples of related papers are: Peysakhovich & Lerer (2018), Devlin et al. (2014), Foerster et al. (2018), Oudeyer & Kaplan (2006), Oudeyer & Smith (2016), Forestier & Oudeyer (2017).
While the simplicity of the chosen tasks helps in understanding the agents’ behavior in greater detail, I think the paper requires evaluation on more complex tasks. In particular, I think many readers would be interested in a discussion of the scalability of this method to more than 2 agents. Moreover, the number of symbols used in the communication channel is quite small, and it would be good to see how the method performs as the vocabulary increases. Some of the results don’t seem that impressive: while the proposed method allows agents to use the communication channel more often than the alternatives, it does not improve performance by a very significant amount when comparing the runs that make use of communication. Numerous details about the algorithm used for optimization are missing. In particular, what is the total loss used for optimization in both tasks? How does that relate to RIAL or other previously proposed MARL algorithms?

Other Comments:
- It would be good to provide a plot or table with the final reward across all runs. I expect that number to show their method obtaining much better rewards on average, given the larger proportion of runs in which the communication channel is actually being used by the agents.
- It is not very clear to me why the agents still learn suboptimal communication protocols even when learning to use the communication channel. Have the authors tried a more powerful RL algorithm such as PPO, SAC or A3C?
- Why did the authors decide to use a multi-step version of the CIC algorithm? It would be useful to provide an ablation study in which they also compare against the single-step version.
- There are some missing details when going from equation (5) to (6) that should at least appear in the supplementary material.

The proposed method is novel (as far as I can tell) and the problem is of wide interest, so I believe a stronger version of this submission would be of interest to the community.
However, I do believe the paper as it stands right now requires more empirical evaluation against stronger baselines and on more complex environments.

------------- UPDATE: I have read the rebuttal and the other reviews. I appreciate the fact that the authors took into account the feedback, clarified some parts of the paper, and promised to add more comparisons with prior work, analysis of the learned communication protocols, and relevant references. Their rebuttal helped me understand why they made certain decisions and what the scope of the paper is. I now lean towards acceptance and I have updated my score to 6.

Reviewer 3

One of the most important open problems in emergent communication is the problem of discovery: how can two randomly-initialized agents stumble upon a communication policy that transmits useful information and helps solve a given task? Many previous approaches have used gradient information passed from the listener to the speaker in order to improve the learning of communication protocols; however, this has some drawbacks. This paper provides an alternate approach, which is to add additional rewards to the decentralized, discrete-message framework to encourage communication. The authors show that these biases improve performance on two separate communication games.

I quite like this paper. It is well-written, the methodology makes sense, and the results clearly show that the proposed biases (which incentivize the ‘speaker’ agent to send messages correlated with its state, and the ‘listener’ agent to take the speaker’s messages into account when acting) improve performance. The environments tested (MNIST adding and the new ‘treasure hunt’ environment) are fairly simple, but in my opinion interesting enough to show the benefit of the proposed approach. The implementation of these biases is also non-trivial, and the paper walks through how they are derived in detail.

My main concern about the paper is the comparison to Jaques et al. (2018). As the authors mention, incentivizing ‘positive listening’ was previously investigated in Jaques et al. by giving the *speaker* a reward for sending messages that had a large influence on the listener. In contrast, this paper rewards the *listener* for being influenced by the speaker. Given that these two objectives are fairly similar, it is surprising to me that the paper doesn’t compare to the influence formulation of Jaques et al. I think the experimental results would be much stronger if this comparison was included, along with an explanation of the pros / cons of each approach.
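To make the contrast concrete, the following is a hypothetical sketch (not either paper's exact formulation) of the two ways of assigning a "listening" signal: a speaker-side influence reward in the spirit of Jaques et al., versus a listener-side auxiliary loss in the spirit of this paper.

```python
import numpy as np

def _l1(p, q):
    """Per-example L1 distance between two action distributions."""
    return np.abs(p - q).sum(axis=-1)

# Speaker-side "influence" reward (in the spirit of Jaques et al.):
# the SPEAKER is rewarded when its message changes what the listener does,
# so the gradient of this quantity flows through the speaker's parameters.
def speaker_influence_reward(listener_policy_with_msg, listener_policy_no_msg):
    return _l1(listener_policy_with_msg, listener_policy_no_msg).mean()

# Listener-side auxiliary loss (in the spirit of this paper):
# the LISTENER is penalized when its policy ignores the message,
# so the gradient flows through the listener's parameters instead.
def listener_positive_listening_loss(policy_with_msg, policy_no_msg):
    return -_l1(policy_with_msg, policy_no_msg).mean()
```

Numerically the two quantities are mirror images of each other; the practical difference is which agent's parameters receive the gradient, which is exactly why a head-to-head comparison would be informative.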
Finally, given that a similar approach to the one in this paper was taken by Jaques et al., the novelty of the paper is more limited. Overall, I would recommend this paper for acceptance if there were a more extensive comparison to Jaques et al. For now, I will give this paper a weak accept, but am willing to update my review accordingly.

Small comments:
- L166: We use is the -> We use the
- L172: is new -> is a new
- L176: the results protocols -> the resulting protocols

-------------------------------------------------------

After reading the authors' rebuttal, I will also increase my score by 1 point, from a 6 to a 7. My main concern was the comparison to the similar formulation from Jaques et al., and I'm convinced by the authors' assertion that this does not work in the MNIST setting. I'm hopeful to see this comparison in the final version. My sole remaining concern is how Jaques et al.'s method will compare in the Treasure Hunt game (the authors state that they 'will run this'), and whether this result will be included even if it shows their method does worse. Despite this concern, I now reside more firmly in the 'accept' camp.