Paper ID: 1207
Title: Learning to Segment Object Candidates
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Quality and Originality: This is a nice paper consisting of a clean approach and convincing state-of-the-art results. The paper presents a simple and novel end-to-end CNN model for proposing object candidates; I like the two-branch design for mask and score prediction.

Significance: I believe this work will have a positive impact on the object recognition community, since the solution is clean and improves the current state-of-the-art by a large margin, and given the importance of object proposal generation to many computer vision tasks (e.g., object detection). I highly recommend that the authors share their source code with the public.

Questions: 1. For the generalization experiment: While the pre-trained ImageNet VGG network has not learned to localize objects, it has still learned semantic information about 1000 labeled categories.

So to truly test generalization ability, I think the approach should also be tested on objects that appear in neither ImageNet nor PASCAL (L360-362).

Could the authors please comment on this?

2. What is the reason for the first and second fully-connected layers having 1024 and 2048 hidden units, respectively, and in that order?

3. In L145, how is "in the center" quantified?
Q2: Please summarize your review in 1-2 sentences
The paper presents an end-to-end deep convolutional neural network that learns to generate object proposals and score them given raw image pixels as input. The solution is clean and shows convincing qualitative and quantitative results for this task.


Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The proposed neural network takes an image as input and generates a set of proposal segmentation masks, with an objectness score for each one. It is trained on the COCO dataset using the ground-truth segmentation masks.

It is great that the model works well on object categories that it has not seen during training.

Why do we want to solve this problem? Reducing the number of candidates can speed up object detection, but I am not sure it can improve the average precision of detection, as the total recall for previous approaches like "selective search" is already really high. I think it would be helpful to apply an off-the-shelf object detector to the output of this method to see the improvement in the speed vs. accuracy trade-off.

Moreover, this method produces segmentation masks, which is great. However, I think it should be compared with proper segmentation baselines so that we can assess its performance.

Table 1, Table 2, and Figure 3 show recall at a maximum of 1000 proposals, which is not a large number (some papers use 2000 candidates for selective search). It would be helpful to extend this to a larger number, say 5000, to see whether the different approaches converge. For instance, if the recall of this method converges to selective search in 2000 proposals, what is the benefit of this method? I know that this method generates segmentation masks as well, but then there are quite a few baselines that could be considered for evaluating the segmentation masks.

Line 269: it would be helpful to discuss why downsampling the ground truth, instead of upsampling the network output, degrades accuracy.

The colors in Fig. 3 are not easy to distinguish. Varying the line patterns (e.g., dashed or dotted lines) would help.
Q2: Please summarize your review in 1-2 sentences
A deep learning approach is proposed for generating object proposal masks and confidence values. The idea is interesting with great results.

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents an approach for generating object proposals, i.e., regions in an image that are likely to contain an object. In contrast to several previous methods, which first compute the proposals with one method (typically a non-deep-learning approach) and then use another method to compute the likelihood/score of each proposal region containing an object, the proposed approach introduces a discriminative convolutional network to address both tasks. In essence, the core of the approach is a network which jointly predicts the regions/masks and their object scores. The model is trained on the MS COCO dataset and tested on the PASCAL VOC and MS COCO datasets. Like several recent works using CNNs, this paper shows improved results over non-CNN methods.

The paper needs to build on this promising start by improving the empirical evaluation and details.
- The qualitative/quantitative evaluation needs to be improved by showing more comparisons/results.
- The method should be compared with [17,26] by converting the segmentation result into a bounding box (as done for the comparison with other methods, cf. L.300-1). Indeed, the proposed method is more powerful in that it provides a more precise localization of the object, but this comparison with more closely related methods at the bounding-box level would be interesting.
- The variants "DeepMaskFast" and "DeepMaskZoom" are not clearly defined.
- The failure cases of the method need to be discussed.

Other comments:
- L.047: regions should *contain*
- L.051: three?
- L.098: we do not rely *on*

Q2: Please summarize your review in 1-2 sentences
The paper adapts a well-known CNN architecture for the joint task of learning proposals as well as their object scores. This is incremental overall, which is fine, but given the rather low technical novelty, the method needs stronger experimental evaluation.

Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their detailed and helpful reviews.

*Originality: R2 raises concerns about originality: "The paper adapts a well-known CNN architecture for the joint task of learning proposals as well as their object scores. This is incremental overall ... low technical novelty..." We indeed adopt a common CNN architecture (specifically VGG) as is now typical in vision tasks. However, we adapt the network to a new problem domain: generating segmentation object proposals. For this problem domain, all previous approaches operate by merging or grouping superpixels or edges (with no or only minor data-driven components). Instead, we formulate the problem directly as a learning problem: a single network jointly predicts a segmentation mask and an objectness score for each image patch.
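To make this formulation concrete, here is a minimal sketch of the two-branch design (illustrative PyTorch-style code; the layer sizes are hypothetical placeholders, not our exact configuration):

    import torch
    import torch.nn as nn

    class TwoBranchHead(nn.Module):
        """Illustrative two-branch head: shared trunk features feed
        a mask branch and a score branch (sizes are hypothetical)."""

        def __init__(self, feat_dim=512 * 14 * 14, mask_size=56):
            super().__init__()
            # Mask branch: predicts a class-agnostic segmentation mask.
            self.mask_branch = nn.Sequential(
                nn.Linear(feat_dim, 512), nn.ReLU(),
                nn.Linear(512, mask_size * mask_size))
            # Score branch: predicts a single objectness score.
            self.score_branch = nn.Sequential(
                nn.Linear(feat_dim, 512), nn.ReLU(),
                nn.Linear(512, 1))

        def forward(self, feats):
            x = feats.flatten(1)  # (N, feat_dim) shared trunk features
            return self.mask_branch(x), self.score_branch(x)

Both branches sit on top of the same shared convolutional trunk and are trained jointly, which is what lets a single forward pass produce a mask and its score.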

Note that the other five reviewers tend to agree that the method is novel: R1-"nice paper consisting of a clean approach", R4-"idea is interesting with great results", R5-"nice idea", R6-"the paper is original", "the model is clear and interesting", R7-"new approach".

*Experiments: R2: "The qualitative / quantitative evaluation needs to be improved by showing more comparisons/results." Recently [12] performed a thorough evaluation of all previous object proposal methods with code available and identified the 5 top methods from among over a dozen tested. We compare to precisely these 5 top methods. Note that we cannot compare to the bounding box methods in [17,26] as R2 suggests because, unfortunately, these methods do not have source code available (and [26] is a nearly concurrent submission). R4 states we should additionally compare to segmentation methods. We apologize for the confusion: we do perform experiments using both bounding boxes (table 1 left on COCO, table 2 on PASCAL, figure 3a-b) and segmentation proposals (table 1 right on COCO, fig 3c-f). We feel that we did not effectively communicate the breadth of our experiments; we will clarify this if the paper is accepted.

*Speed: R7 brings up an important criticism: speed. At the time of submission our method ran at ~7.7s per image. Since submission we have sped up our method by over 4x, to ~1.6s on COCO images (1.2s on the smaller PASCAL images). The primary change was to replace the last large linear layer of the mask branch with a low-rank approximation (two linear layers with no non-linearity between them). Incidentally, as the model has fewer parameters, training is also faster and the final model achieves slightly improved accuracy.
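A minimal sketch of this low-rank replacement (Python; the dimensions below are illustrative, not the exact sizes in our network):

    import torch.nn as nn

    d_in, d_out, rank = 512, 56 * 56, 128  # hypothetical sizes

    # Original: one large linear layer with d_in * d_out weights (~1.6M).
    full = nn.Linear(d_in, d_out)

    # Low-rank replacement: two linear layers with no non-linearity in
    # between, so the composition is still linear but has rank <= 128.
    low_rank = nn.Sequential(
        nn.Linear(d_in, rank, bias=False),  # d_in * rank weights
        nn.Linear(rank, d_out))             # rank * d_out weights

    # Parameter count drops from d_in*d_out (~1.6M) to
    # rank*(d_in + d_out) (~0.47M), which is where the speedup comes from.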

*Open source: we pledge to release all source code upon publication as R1 suggests.

R4: "I am not sure it can improve the average precision of detection as the total recall for previous approaches like selective search is already really high." While selective search has high recall at IoU=.5, its recall is much lower at higher IoU. [12] showed that detectors (such as RCNN) benefit from high recall at higher IoU (the AR metric captures this). Moreover, recall is much lower for all methods on COCO (even at 1000 proposals and IoU=.5 recall is only 50% for previous methods). Hence there is *substantial* room for improvement. We will emphasize this in the text. We will also add experiments with R-CNN as R4 suggests.

R4: "if the recall of this method converges to selective search in 2000 proposals, what is the benefit of this method?" In the recent Fast R-CNN paper [8], the authors show that detection performance actually begins to *decrease* beyond a certain number of proposals (see fig 3 in [8]) even though proposal recall continues to increase, presumably because additional proposals are mostly false positives. Hence, in addition to improving detection speed, using fewer proposals (while maintaining recall) improves final detection accuracy.

We address some, but not all, of the remaining questions raised by the reviewers here. If accepted, we will attempt to address all remaining reviewer concerns in the final manuscript.
R1: Generalization?: Good point. While pre-training on 1000 categories should not lead to significant overfitting (we only use the pre-trained convolutional layers which are fairly generic), we will do additional experiments to explore this.
R1: Number of hidden units?: chosen via parameter sweeps on a small validation set.
R1: How is "in the center" quantified?: defined as center of bounding box enclosing object
R2: "The variants "DeepMaskFast" and "DeepMaskZoom" are not clearly defined": Thank you for pointing this out. We briefly discuss in L264-265 and L355-L356, respectively, but admittedly this is currently not entirely clear.
R2: "The failure cases of the method need to be discussed.": Good suggestion, we will do so.