NeurIPS 2020

RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder


Review 1

Summary and Contributions: This paper proposes a method to use different object representations (i.e. anchor/proposal boxes, center points, corner points) simultaneously in a single object detection framework, combining the strengths of each representation. Similar to how features interact with each other in an attention module (such as in Transformers), the different representations reinforce each other, thereby improving the object detector's accuracy. The method is applied to many different detectors (RetinaNet, FCOS, Faster RCNN, ATSS). Take RetinaNet as an example: it uses the anchor representation by default. The proposed method adds a lightweight network to RetinaNet, which predicts center and corner points. An anchor of interest is taken as the query, and estimated (center and corner) points in its local neighborhood are taken as "keys". Through the usual attention formulation, the features of the query are enhanced using information from the keys. The method is widely applicable and improves all baselines. By improving ATSS, the method achieves 52.7 AP on COCO test-dev.
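To make this concrete, here is a minimal sketch of the attention step described above (the reviewer's own illustration, not the authors' code: the feature and projection dimensions, the single-head form, and the omission of the paper's geometric term are all assumptions):

```python
import torch
import torch.nn.functional as F

d_model, d_k = 256, 64                    # feature / projection dims (assumed)
W_q = torch.nn.Linear(d_model, d_k)       # query projection
W_k = torch.nn.Linear(d_model, d_k)       # key projection
W_v = torch.nn.Linear(d_model, d_model)   # value projection

def enhance(anchor_feat, point_feats):
    """anchor_feat: (d_model,) query; point_feats: (K, d_model) keys/values."""
    q = W_q(anchor_feat)                          # (d_k,)
    k = W_k(point_feats)                          # (K, d_k)
    v = W_v(point_feats)                          # (K, d_model)
    attn = F.softmax(k @ q / d_k ** 0.5, dim=0)   # (K,) similarity weights
    return anchor_feat + attn @ v                 # residual enhancement

anchor = torch.randn(d_model)    # one anchor's feature (the query)
points = torch.randn(9, d_model) # features of nearby center/corner points (keys)
enhanced = enhance(anchor, points)
```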

Strengths: + Very well-written paper. + Proposed method significantly improves a variety of strong baselines. It is widely applicable. + Proposed method achieves SOTA performance on COCO test-dev by improving ATSS.

Weaknesses: -

Correctness: Claims made in the abstract and introduction are backed by experiments. Method seems to be correct. Empirical methodology is correct.

Clarity: Yes, very well written. A grammatical error at line 173: "is consist of"

Relation to Prior Work: To the best of my knowledge, all related work is present and sufficiently discussed.

Reproducibility: Yes

Additional Feedback: From the authors' response to the "4. ML Reproducibility -- Code" part of the "Submission Questions", I get the impression that the authors provide code; however, it is not included in the supplementary material. Below I provide some comments/suggestions: * The "strength" of each representation type is mentioned without any data (lines 30-31). It would be great to back these claims with data. * You say that "the method works in-place". Different people might interpret this differently. Please be more specific and describe what you mean. * Why do you call the corners "part corners"? Aren't they box corners? They are not corners of object parts. * I find the term "evolution flow" a bit exaggerated. We are talking about only one stage, or at most two stages. Evolution? * Section 3.1 describes Eq. 1 and Eq. 2 only technically and symbolically. Although these equations come from the well-known attention module, an intuitive description of what is going on would be highly appreciated by the reader. * This method could be even more interesting if it could efficiently handle long-range interactions. Currently, at least for RetinaNet, it only uses "keys" in a local neighborhood. The authors might want to check a related, very recent paper: HoughNet, ECCV 2020.


Review 2

Summary and Contributions: This paper analyzes three representations of objects: anchor/proposal rectangle boxes, center points, and corner points. The authors claim that different representations have different benefits, and exploit self-attention to integrate them together. The method, named Bridging Visual Representations (BVR), is broadly effective in current detection frameworks, including RetinaNet, Faster R-CNN, FCOS, and ATSS.

Strengths: 1. This paper is well written and easy to follow. 2. The experiments are thorough and validate the effectiveness of the proposed method. 3. The idea of integrating different object representations to boost performance is good; it is simple and effective. Overall I think this is a good paper. It has a reasonable motivation: current object detection systems use different representations for objects, and this paper aims to integrate them together. The methodology proposed in this paper tackles the problems the authors identify. The experimental part is thorough and analyzes different aspects in detail. The authors have done experiments that address my concerns; for example, they analyze the relations between the proposed module and non-local blocks and relation networks, which are also based on self-attention blocks. The authors also analyze the computational cost of their method, which addresses another of my concerns. I think this paper does a good job in academic writing and is self-contained. I recommend accepting it to NeurIPS.

Weaknesses: As I mentioned under strengths, I think this paper does a good job in academic writing and presents thorough experiments. Most of the concerns I had while reading the main paper are addressed in the experimental section. However, the idea does not surprise me. Self-attention blocks are widely used in vision tasks; this paper finds a good application for them and tackles the stated problem. The authors use these mechanisms to "hard"-integrate the different representations, so there is still a master representation for the objects. It would be great if follow-up work could truly integrate the representations together.

Correctness: Yes.

Clarity: Yes. The paper is well written.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: Comments after rebuttal: The rebuttal addresses most of the reviewers' concerns. I keep my rating as accept.


Review 3

Summary and Contributions: In this work, the authors analyzed the representation formats of current object detectors and proposed an attention-based mechanism to combine the advantages of different representations. The proposed method is compared with various state-of-the-art methods and is shown to be better.

Strengths: 1. The authors summarized the advantages and disadvantages of different object representations. 2. To combine the strengths of different object representations, the authors developed an attention-based module to bridge them. 3. The authors optimized the computation and memory complexity of the proposed attention module. 4. The proposed method achieved good performance.

Weaknesses: 1. Some implementation details are not clear. a) In the point head network, are the predictions (including the top-k strategies) of top-left points and bottom-right points made independently? b) How do the features of the two corner points enhance the single regression feature? There seems to be no detailed explanation of this in the paper, and it would be better if figures were provided to illustrate it. 2. The proposed method has an extra point head and two pixel-wise attention layers, which are time-consuming. a) What is the input size used in the FLOPs calculation? b) What is the time cost of BVR? Can you provide a detailed time-cost analysis of BVR?

Correctness: The method is clearly described except for some implementation details, and the empirical methodology is correct.

Clarity: Most parts of the paper are well written.

Relation to Prior Work: The authors have discussed the differences from previous works.

Reproducibility: No

Additional Feedback:


Review 4

Summary and Contributions: This paper enhances existing object detectors by adding a center/corner point branch as an auxiliary task. This additional point branch provides an attention mask for both the existing classification and regression branches. Experiments show the proposed method brings ~2 mAP improvement on the COCO dataset across different detectors with ~10% more computation. The best configuration achieves 52.7 mAP.

Strengths: + The experimental results are strong (~2 mAP improvement across different detectors), especially the best performance of 52.7 mAP. + The authors have experimented with different detectors, both one-stage detectors and Faster RCNN, showing that the proposed contribution is general.

Weaknesses: - While the overall performance is strong, the reviewer is not excited about the technical novelty. The good performance feels like it comes from putting existing output modalities together. Training CornerNet/CenterNet-style point heads in an FPN structure is new, but this part is not well explained in the paper: what is the training loss for the point head? Is it the CornerNet-style focal loss or standard cross-entropy? How are different points assigned to different FPN levels? - All ablation studies are done on the weak RetinaNet baseline, where the numbers are not exciting at all. It would be much better if the ablation experiments were run on the strongest baseline. The authors claimed RetinaNet and ATSS/FCOS are the same (L217); however, it is unclear to the reviewer how the centerness branch in ATSS and FCOS works. Does the centerness branch play a similar role to the center point head? - Here is one interesting baseline: can we add the point head only as an auxiliary task, without applying it as attention? This would help ablate whether the improvement comes from the auxiliary task or from the attention. - The paper writing is unsatisfactory. See the Clarity box. - L274 provides the additional computation in FLOPs; however, runtime would be more straightforward. The authors are encouraged to report the runtime of both the baselines and the proposed method on their own machines. - The authors claim "a general viewpoint to understand current object detection frameworks" as their first contribution (L56). Unfortunately, the reviewer didn't find any new insights in Section 2. The analysis reads more like standard related work than a contribution.

Correctness: The main claims are fair as far as the reviewer can assess. There seems to be a misuse of big-O notation at L195.

Clarity: There are a few unsatisfactory parts in the writing: - None of the figures has a comprehensive caption. E.g., what does Fig. 2 want to convey? It seems Fig. 2 is very similar to Fig. 1. - Figures 3 and 4 are unclear. First, they have no captions. It would also be much more helpful if the authors illustrated the data shapes going into and out of each block. - The query-keys description in Section 3.1 is not a familiar concept for the general object detection audience. It seems to be just softmax attention (written out for reference after this list). - The authors mention that "corners are more accurate in localization" at least twice (L92, L127). This seems intuitive, but are there any experiments or references that can back up this point? - L119: "FCOS uses center points as its representation". As far as the reviewer understands, FCOS uses all points inside a box; center sampling is not the core idea of FCOS.
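For reference, the standard single-query softmax attention that Section 3.1 appears to instantiate can be written as follows; this is the reviewer's paraphrase under the assumption of scaled dot-product attention, and the paper's exact equations may add further terms (e.g. a geometric one):

$$\mathrm{Attn}\bigl(q, \{k_i\}, \{v_i\}\bigr) = \sum_i \operatorname{softmax}_i\!\left(\frac{q^\top k_i}{\sqrt{d}}\right) v_i,$$

i.e., the query feature is enhanced with a similarity-weighted average of the key/value features.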

Relation to Prior Work: Yes.

Reproducibility: No

Additional Feedback: If there were a neutral borderline rating, I would choose it. The only reason to support accepting this paper is its performance. However, the contribution is not yet clear to me (see paper weaknesses), and the presentation needs many improvements. I initially leave my rating at 5 and hope the authors can address my concerns in the rebuttal. Comments after rebuttal: My complaint about the technical novelty is resolved by the "evidence for strength of different object representations" section of the rebuttal. The fact that different representations are good at different aspects is very interesting and motivates the paper well. The authors also satisfied my curiosity about the multi-task-only baseline, and promised to fix the writing issues. Based on the rebuttal and the reviews from other reviewers, I gladly raise my rating to 7.