NeurIPS 2020

### Review 1

Summary and Contributions: The manuscript proposes an attention network that models the relationship between object regions and applies it to semantic segmentation. The authors show that this attention mechanism improves the contextual representation.

Strengths: 1) The operations in the proposed method are mathematically grounded rather than ad hoc. 2) Experiments show that the proposed method is effective and brings useful improvements on multiple datasets. It performs very well compared with SPP-based and attention-based methods.

Weaknesses: However, some concerns remain: 1) The authors describe the operations of RCB and RIB step by step but do not give a clear starting point. Although the final experiments show that these operations can be effective, the paper needs a more explicit motivation. 2) The comparison experiments contrast the effect of different backbone networks, but a backbone cannot be used directly for segmentation. It is unclear what structure the authors add after the backbone to complete the segmentation, and this subsequent structure will have a significant impact on the segmentation results. The authors should explain this clearly. 3) The formulas are typeset very densely, and the authors could further adjust the layout.

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: After the rebuttal: The authors responded to my concerns well, so I have updated my score to "7: accept".

### Review 2

Summary and Contributions: This paper introduces a Region Attention Network (RANet) to model the relationship between object regions for image semantic segmentation. A Region Construction Block (RCB) is designed to jointly analyze the boundary score map and the semantic score maps. A Region Interaction Block (RIB) is designed to select the representative pixels in each region for exchanging context information.

Strengths: The proposed blocks are somewhat novel and the experimental results are good. Ablation analyses are conducted to validate the efficacy of the proposed modules and the whole network.

Weaknesses: The motivation and illustrations are not clear. The details are as follows. 1. Why can RANet capture more context information than SPP and previous attention models? The authors claim that RANet naturally provides the spatial and category relationships of pixels to construct the contextual representations, but the category information in RANet may not be accurate, which could result in erroneous guidance. 2. After obtaining the contextual representation $O$ as described by Eq. (8), how is the final segmentation map obtained? 3. In Eq. (1), $B_{i \rightarrow j}$ denotes a set of pixels on the line. What is the direction of the line: vertical, horizontal, or oblique?

Correctness: correct

Clarity: The writing and illustrations should be improved.

Relation to Prior Work: Could be improved, especially the motivation of RANet, i.e., why can it capture more context information than SPP and previous attention models?

Reproducibility: No

Additional Feedback: 1. I would like to ask the authors to clarify the motivation of RANet, i.e., why can it capture more context information than SPP and previous attention models? 2. How is the final segmentation map obtained from the $O$ given by Eq. (8)? 3. In Eq. (1), $B_{i \rightarrow j}$ denotes a set of pixels on the line. What is the direction of the line?
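
To make the third question concrete, here is a minimal Python sketch of one possible reading, not the paper's confirmed definition. It assumes $B_{i \rightarrow j}$ is the set of pixels on the straight segment joining pixels $i$ and $j$, so the direction depends on the pair and can be horizontal, vertical, or oblique. The function name `line_pixels` and the use of Bresenham's line algorithm are assumptions made only for illustration.

```python
from typing import List, Tuple

Pixel = Tuple[int, int]  # (x, y) integer coordinates on the feature map


def line_pixels(start: Pixel, end: Pixel) -> List[Pixel]:
    """Enumerate pixels on the segment from start to end (Bresenham).

    Hypothetical helper: it illustrates one candidate meaning of
    B_{i -> j}; the paper may define the set differently.
    """
    x0, y0 = start
    x1, y1 = end
    dx, dy = abs(x1 - x0), abs(y1 - y0)
    step_x = 1 if x0 < x1 else -1
    step_y = 1 if y0 < y1 else -1
    err = dx - dy
    pixels = []
    while True:
        pixels.append((x0, y0))
        if (x0, y0) == (x1, y1):
            break
        doubled = 2 * err
        if doubled > -dy:  # advance along x
            err -= dy
            x0 += step_x
        if doubled < dx:   # advance along y
            err += dx
            y0 += step_y
    return pixels


if __name__ == "__main__":
    print(line_pixels((0, 0), (3, 0)))  # horizontal pair
    print(line_pixels((0, 0), (0, 3)))  # vertical pair
    print(line_pixels((0, 0), (2, 2)))  # oblique pair
```

If this reading is the intended one, stating it explicitly next to Eq. (1) would resolve the ambiguity.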

### Review 3

Summary and Contributions: The paper introduces a novel attention network for the semantic segmentation task. There are two main steps: dividing the image into regions and modeling the relations between regions. The authors argue that SPP-based methods use only the spatial information between pixels to capture contextual information, while other attention methods use only the category information between pixels to exchange context. Compared to these methods, the new network can better use the correlation between spatial and category information to enrich the contextual information by exchanging regional information. In addition, detailed experiments on various datasets show the performance improvement.

Strengths: 1) It is a novel idea to use object regions to construct the contextual representations through region interaction. 2) The region construction block and the region interaction block are clearly explained, and the visualization is also very good. 3) Extensive experiments are conducted, and the results are state of the art.

Weaknesses: 1) The RCB and RIB blocks both seem to be very time-consuming. The RCB block includes the boundary score map and the category score map, and each map requires calculating the similarity between each pair of pixels. The RIB block needs to find the representative pixels for each region and aggregate information between regions. All of this must be computed in every training iteration, so it may seriously slow down training. 2) In line 18, the authors are encouraged to give a more intuitive description of the differences between RANet and SPP or other attention mechanisms. This may be the main motivation of the new network. 3) In line 135, in the RIB block, why choose representative pixels rather than use all the pixels in each region? The authors need to give a short explanation. I guess there may be two reasons: the first is computational efficiency and the second is segmentation accuracy, as illustrated in Table 2.
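
The time-cost concern above can be illustrated with rough arithmetic. The sketch below is hypothetical: the 64×64 feature map, the 19 regions, the 8 representative pixels per region, and both function names are illustrative assumptions, not values or code from the paper. It only counts ordered pixel pairs, which is the quantity a dense pairwise similarity scales with.

```python
def dense_pair_count(height: int, width: int) -> int:
    """Ordered pixel pairs when every pixel is compared with every pixel."""
    num_pixels = height * width
    return num_pixels * num_pixels


def representative_pair_count(num_regions: int, reps_per_region: int) -> int:
    """Ordered pairs when only representative pixels interact.

    Assumes a fixed number of representatives per region, which is a
    simplification made for this estimate.
    """
    num_reps = num_regions * reps_per_region
    return num_reps * num_reps


if __name__ == "__main__":
    dense = dense_pair_count(64, 64)             # assumed feature map size
    sparse = representative_pair_count(19, 8)    # assumed region settings
    print(f"dense pairs: {dense}")
    print(f"representative pairs: {sparse}")
    print(f"ratio: {dense / sparse:.1f}")
```

Under these assumed sizes, restricting the interaction to representative pixels cuts the number of pairs by a factor of several hundred, which is why reporting the measured training time of RCB and RIB would help.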

Correctness: YES

Clarity: YES

Relation to Prior Work: YES

Reproducibility: No

Additional Feedback: In Figure 5, why do these representative pixels gradually separate from each other as the iterations progress? Update: The authors have answered my questions well, so I changed my score. I suggest that the authors update the motivation of this method in the paper.