Review for NeurIPS paper: Multi-agent active perception with prediction rewards

NeurIPS 2020

Multi-agent active perception with prediction rewards

Review 1

Summary and Contributions: This paper describes an approach to map multiagent active perception problems that can be modeled as a Dec-\rho-POMDP to a standard Dec-POMDP problem model. This mapping is valuable in that it enables any standard Dec-POMDP solver to formulate an active sensing policy for all agents, opening up new possibilities for solving Dec-\rho-POMDPs. The key feature of this mapping is that agents individually choose predictive actions (similar to a POMDP-IR) in an extra step after the original problem horizon in a decentralized manner to approximate the single centralized predictive action of the Dec-\rho-POMDP. Theoretical analysis establishes a bound on the regret in the Dec-\rho-POMDP introduced through decentralized predictions (instead of centralized), and a sufficient condition is given demonstrating when this error bound can be 0. Empirical evaluations on the same benchmark problems studied previously with the Dec-\rho-POMDP demonstrate that the new approach enables a group of agents to plan with a longer horizon (by reducing the memory complexity of the solution) while achieving nearly as high of rewards as the state-of-the-art planning algorithm.

Strengths: I believe that this research contributes to the potential use of Dec-\rho-POMDP models in real world applications by both (1) enhancing the scalability of the model to problems requiring longer planning horizons than those possible with current state-of-the-art algorithms, and (2) by admitting the use of a greater number of solution algorithms (including those yet to be developed in the active community of Dec-POMDP researchers). Altogether, this advances the study of multiagent active perception problems. It is novel, albeit somewhat incremental, and will be of interest to the multiagent planning community at NeurIPS. I appreciated the inclusion of both theoretical and empirical evaluations of the approach, and the theoretical claims and proofs appeared to be sound.

Weaknesses: I did not find the empirical analysis to be as convincing as the theoretical analysis. Interestingly, in the Rovers domain, both version of the APAS algorithm actually performed *worse* as the horizon increased beyond horizons for which the NPGI could successfully plan (h > 5). Typically, we expect policy values to increase with greater horizons, which was the case for all algorithms until NPGI could no longer plan. This same phenomenon also occurred for the largest horizon (h = 8) in the MAV domain, although the policy values did keep improving for h = 6 and 7 after NPGI could no longer plan. I'm not sure if this decreased performance in both domains is an artifact of only running the experiment 10 times, which might not be enough to adequately determine average performance, or if there is something meaningful about those horizons that should cause planning to do worse. As stated above, one of the primary contributions of this research (in my opinion) is that it enables agents to plan for greater horizons than NPGI, but if it does worse over those horizons, then that contribution is less valuable.

Correctness: The claims and methods appear correct to me. A minor note about the proof for Theorem 1: you never establish that ||\tilde{\rho} - \tilde{R}_h||_\infty is bounded, so technically the regret is not necessarily bounded. For some choices of \rho (e.g., negative entropy which is bounded above and below by [-ln |S|, 0]), the regret is bounded, but is this true for all possible choices of \rho? Without working through all of the details, I would guess that it is sufficient to claim that \rho is bounded above and below, but I'm not 100% sure. Also, I would have liked to have seen a greater number of experimental runs conducted to better understand the true average performance (and better account for possible variance between runs), along with either standard deviations or standard errors presented so I could evaluate the range of performances that occurred. Also, since the computational complexity of APAS vs. NPGI was discussed, I would have also liked to have seen a sense of how long each algorithm took to plan (which does depend on the Dec-POMDP planner, as the authors pointed out). APAS is more memory efficient, but how does it do on time since it involves first creating a new problem model before solving?

Clarity: I found the paper to be well written and organized. It was enjoyable to read. I also appreciated the amount of extra details provided in the supplement.

Relation to Prior Work: The prior literature was appropriately cited (especially recent work), as far as I'm aware. There are many more examples of belief-based rewards being used in single agent POMDPs before the \rho-POMDP, but that is probably outside the scope of this particular paper.

Reproducibility: Yes

Additional Feedback: My questions for the authors include: 1) Do you have a sense of why APAS experienced decreased policy values as the horizon increased in both domains? What does this imply about the usefulness of the solution? 2) Since submitting the paper, have you run additional experiments so that you have more than 10 repetitions? If so, how have the results changed? Finally, a small suggestion: in the proof for Theorem 1, it would aid readability if you were to subscript the V functions with which particular problem they correspond -- the original \Dec-\rho-POMDP or the mapped Dec-POMDP. ##### Post Rebuttal ##### I thank the authors for their clarifying responses to my questions! I appreciate the additional empirical results from more runs, as well as the discussion of runtimes. I think both will improve the experiment section of the paper. The ability to solve the Dec-\rho-POMDP problem in a distributed fashion using pre-existing (and future) solutions to the Dec-POMDP should help make the use of Dec-\rho-POMDPs more accessible for multiagent active perception problems.

Review 2

Summary and Contributions: The work presents a decentralized model and a solution approach for the problem of multiagent active perception where a team of agents actively gather observation to compute a joint estimate of a hidden variable. Previous work for this problem presented mostly centralized approaches as the reward function for the last time depends on the centralized belief over the hidden variable. In the current work, the authors have provided an approximate mapping of this problem to a decentralized setting such that standard approaches for Dec-POMDPs (a standard model for multiagent sequential decision making becomes applicable). The resulting solution is also tested on a number of empirical benchmarks.

Strengths: The work presents a decentralized model and a solution approach for the problem of multiagent active perception where a team of agents actively gather observation to compute a joint estimate of a hidden variable. Previous work for this problem presented mostly centralized approaches as the reward function for the last time depends on the centralized belief over the hidden variable. In the current work, the authors have provided an approximate mapping of this problem to a decentralized setting such that standard approaches for Dec-POMDPs (a standard model for multiagent sequential decision making becomes applicable). The resulting solution is also tested on a number of empirical benchmarks.

Weaknesses: The paper is well written and easy to follow. The problem is active perception is also interesting. There are a few areas where more clarification is needed as pointed below: -- The authors have highlighted a number of previous models for the problem of active perception such as Dec-\rhoPOMDP, POMDP-IR etc. Given the focus on converting this problem to a decentralized framework, it is not clearly conveyed why decentralizing the problem is significant? There are hints available in the paper such as less communication overhead, but there is no empirical evidence presented towards justifying decentralized approaches over such previous approaches (e.g., how much communication overhead is reduced) -- The technical approach presented by the authors is elegant and simple, but it is essentially a heuristic approach. The bound provided in theorem 1 would seem to be loose in the worst case (and its values in experiments is not shown). As a result, one has to rely on empirical results to judge the effectiveness and scalability of the proposed approach. This is where convincing empirical support is lacking. The authors have tested on two domains both of them are relatively small (wrt to number of agents containing only two agents). Comparing against previous approaches show that the authors approach provide worst solution quality when the horizon is small against previous centralized approaches (which is expected), but the current approach scales better wrt to the problem horizon. However, critically, no comparisons against any other multiagent/centralized approach are provided for larger horizons (and the scalability of the authors’ approach is only shown until horizon 8, which is relatively small). Without comparisons against other approaches, there is no baseline against which the number provided by the authors approach can be interpreted. Indeed, it is not difficult to come up with solid baselines to compare against on this problem. One can simply take any recent multiagent RL approach (such as MADDPG/COMA/QMIX), develop a simulator to generate samples from domain, and use either the original reward function or the newly developed linear rewards the authors have used. Other baselines can be centralized RL approaches such as by Satsangi et al (2020) as pointed in related work by the authors, or RL methods for solving POMDPs that can model the centralized information reward. Lacking such comparisons, the significance of the scalability aspect (wrt the planning horizon) is not clear. Another interesting dimension where additional discussion is needed is the scalability wrt number of agents. Current method and solution approach seems to be tailored towards a 2-agent setting. It is not clear how would the current methods scale when there are n-agents, which is a more realistic setting where a large number of agents are acting for information gathering. Similarly, based on Lauri et al 2019, the problems addressed are quite small (8 states for MAV domain, and 256 for rover domain). Showing Scalability wrt to state and observation space is also critical to demonstrate for showing the significance of the developed methods. It is also not clear why the authors have decided to use the approach by Pajarinen and Peltonen (2011) as their approach is for infinite horizon controllers. The approach below has shown much more scalability when solving finite horizon Dec-POMDP for 2-agent problems: Optimally Solving Dec-POMDPs as Continuous-State MDPs. JAIR 2016.

Correctness: looks correct

Clarity: Mostly well written

Relation to Prior Work: Seems clear.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: The paper addresses the problem of cooperative active perception, in which a group of agents must coordinate to measure (in a distributed fashion) some quantity of interest. Such problem can be addressed using Dec-rho-POMDPs, a multiagent sequential decision-making model where the agents optimize, simultaneously, the sum of rewards and as final reward computed as a convex function of a belief about the quantity of interest computed from the combined observations of all agents. The main contribution of the paper is a reduction from a Dec-rho-POMDP to a standard Dec-POMDP. The reduction proceeds by creating a Dec-POMDP that is, in all respects, similar to the Dec-rho-POMDP, and artificially increasing the decision horizon of the original Dec-rho-POMDP by one, where the last decision epoch consist of a distributed estimation (and optimization) of the final reward term. This reduction enables the Dec-rho-POMDP to be solved using standard Dec-POMDP solution methods. The paper provides a bound for the difference in performance obtained by such reduction.

Strengths: The paper addresses a relevant problem (cooperative active perception) and, to the extent of my knowledge, provides a novel contribution to the state of the art. The proposed approach offers computational savings, enabling the solution of longer-horizon problems, at a bounded cost in terms of performance.

Weaknesses: The notation is somewhat heavy and unintuitive, making it hard, at times, hard to parse what is being said. Also, the impact of the selection of alpha-vectors---although illustrated in the results---could be better discussed (more on this below).

Correctness: The claims in Lemmas 1-4 are established in the supplementary material, which I went over only superficially. However, I would expect these results to be true. I went over the proof of Theorem 1 (the bound in the performance loss) carefully and found no problem. Overall, I believe that the paper is technically sound.

Clarity: Yes, although notational issues could be improved.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: Overall, I liked the paper. The problem addressed is interesting and the proposed approach in the paper is, to the extent of my knowledge, an improvement over existing state of the art: the reduction proposed in the paper---although losing in terms of performance to existing approaches---does provide significant computational savings and enables addressing active cooperative perception problems with longer horizons. Additionally, the paper provides a bound for the loss in performance (even if the bound is in the worst case and, for most cases, not very tight). There are some aspects which, I believe, could help improve the final version of the paper (I number those to help author's response): 1. The notation is somewhat heavy and unintuitive, which makes the paper somewhat hard to parse at times. For example, the theoretical results often discuss and compare the value functions for the extended Dec-POMDP M+ and the Dec-rho-POMDP <M, Gamma> resulting from the policies from both models, but there is no way to differentiate when referring to the policy of M+ and <M, Gamma>, requiring the reader to infer so from the context---something that, at times, is not trivial. 2. Regarding the selection of the alpha-vectors used to estimate the final reward, I would like to better understand the impact that such selection has in the performance of the proposed approach. While the results suggest that the adaptation of the alpha-vectors is advantageous, these do impact directly the distributed reward estimation in M+ as well as the performance bound, so additional discussion on this topic would be important. 3. Still on the issue above, the results compare the performance of the algorithm with alpha-vector adaptation vs. having a (fixed, I assume) set of alpha-vectors computed at beliefs sampled a priori. How are these sampled? Uniformly at random? Using some heuristic? I could not figure this out from the paper and I believe it would be important to make clear---also because I guess that the difference in performance observed may be more or less sharp depending on how the alpha-vectors are selected, no? -- Update after author's rebuttal -- Thank you for the detailed rebuttal. I am happy with the responses to the issues I've raised.

Review 4

Summary and Contributions: This paper proposes an approach to convert a multi-agent active perception problem into a standard Dec-POMDP with a decentralized prediction reward. This is done by introducing individual prediction actions for each agent. The authors proved that the loss due to decentralization is bounded (zero in some condition). They demonstrated the empirical usefulness of our results by applying standard Dec-POMDP algorithms to multi-agent active perception problems with state-of-the-art performance.

Strengths: The main strength of this paper is that the authors give theoretical proofs about the loss due to decentralization. The experiments confirm the claims of the paper.

Weaknesses: It is not clear for me why the decentralization is important for multi-agent active perception. The authors claim that the advantage of doing so is to make the standard Dec-POMDP algorithms applicable for such problems. However, standard Dec-POMDP algorithms usually have scalability issues for large problems. It should have more benefits beyond that but I cannot see from the paper.

Correctness: They seems correct.

Clarity: The motivation of this paper is not clear for me.

Relation to Prior Work: Discussions on related work seems to be sufficiently covered.

Reproducibility: Yes

Additional Feedback: After rebuttal: I have read the authors' feedback and the others' reviewers. The reviewers also have an active discussion about this paper. However, the author feedback does not convince me why converting Dec-\rhoPOMDP to standard Dec-POMDP is significant. I agree that decentralized active perception has many applications. Here, the key different between Dec-\rhoPOMDP to standard Dec-POMDP is that Dec-\rhoPOMDP has a centralized reward function. However, planning for standard Dec-POMDP is usually centralized so a centralized reward function won't hurt (they both have a decentralize policy). Beside, converting Dec-\rhoPOMDP to standard Dec-POMDP will introduce loss. Although the loss is bounded, it can be very large (Eq.9) due to oversimplified reward decomposition (Eq. 7). Therefore, I am not clear why in practice people would like to do the convection. Apart from that, I think this is paper is technical sound.