Review for NeurIPS paper: Grasp Proposal Networks: An End-to-End Solution for Visual Learning of Robotic Grasps

NeurIPS 2020

Review 1

Summary and Contributions: The paper proposes learning a grasp proposal network for 6-DOF grasping, trained on a synthetic dataset of objects generated in simulation. The task is grasping a singulated object on a flat surface with a parallel-jaw gripper. The main distinction from prior grasp annotation / grasp proposal methods is that they aim for 6-DOF grasping rather than planar grasps. For each object, a large number of grasps are attempted in simulation. This creates a large set of grasp annotations, and a neural net is trained to take the object point cloud and generate candidate grasps with a high success rate. The network works by partitioning 3D space into r x r x r grid cells, treating each grid cell as an anchor for the grasp center. For each anchor, the network predicts an angle and offset to apply. At training time, the grasp proposals are filtered to keep only those near the ground-truth grasps and objects, to reduce training needs. The proposals are then scored based on the annotation label. The models are then evaluated in three ways: (1) error relative to the ground-truth grasp annotation, (2) grasp performance in simulation, and (3) grasp performance on a real robot.
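To make the anchor scheme summarized above concrete, here is a minimal sketch of how grid-cell centers could serve as grasp-center anchors that are then shifted by predicted offsets. It is an illustration only, not the authors' implementation: the function names `grid_anchors` and `apply_offsets`, the unit-cube workspace, and the zero offsets standing in for network outputs are all assumptions.

```python
import numpy as np


def grid_anchors(lower, upper, r):
    """Return the (r**3, 3) cell centers of an r x r x r partition of a box.

    `lower` and `upper` are the box corners (hypothetical workspace bounds).
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    cell = (upper - lower) / r
    # Center of cell i along axis d is lower[d] + (i + 0.5) * cell[d].
    axes = [lower[d] + (np.arange(r) + 0.5) * cell[d] for d in range(3)]
    gx, gy, gz = np.meshgrid(*axes, indexing="ij")
    return np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)


def apply_offsets(anchors, offsets):
    """Shift each anchor by its predicted offset to get a grasp center."""
    return np.asarray(anchors) + np.asarray(offsets)


# Example: r = 10 over an assumed unit-cube workspace gives 1000 anchors.
anchors = grid_anchors([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], r=10)
# Placeholder for network predictions (zeros here, purely for illustration).
centers = apply_offsets(anchors, np.zeros_like(anchors))
```

The number of anchors grows as r^3, which is why filtering proposals before scoring matters for larger r.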

Strengths: The generated dataset sounds useful for pushing 6-DOF grasping. The claims are all very reasonable, and their results suggest that the anchor-based architecture makes the problem easier to learn. The work is very thorough in describing its dataset generation and network architecture.

Weaknesses: The paper primarily feels like a dataset paper. There is nothing inherently wrong with this, but there have been several grasp datasets in prior work, including 6-DOF grasp datasets from 6-DOF GraspNet. The difference between that dataset and this one is that this dataset is 22.6M grasps instead of 10.8M grasps, and the grasps are over a slightly wider set of ShapeNet objects (226 objects instead of 206 objects, 8 categories instead of 6 categories). So, on the dataset side, the question I'm considering is how useful this 2x larger generation is. Given these are simulated grasps, which are less costly to generate, my intuition is that this dataset is useful but not significantly so compared to existing ones. If there is an argument for why the 2x larger dataset is a big deal, I could change my mind here. As for the anchor-based architecture, I think the idea of using models based on 2D object detection is a good idea, and the results for it seem good, but I'm very suspicious of the GPNet-Naive results. This row of the table suggests that removing the anchor coordinates from the grasp proposals completely ruins performance. This seems like too strong a result, given that prior baselines still have reasonable performance on this dataset. Could the authors explain why the naive performance is so bad? I also do not understand why r,b are still mentioned for GPNet-Naive - if the anchor centers are not given as features, then grasps proposed in each of the r x r x r grid cells are not distinguishable, and if those aren't distinguishable, then the baseline feels worthless.

Correctness: The claims look correct to me.

Clarity: Some wording issues, like "We totally obtain" should be "in total we obtain", but otherwise the paper was clear.

Relation to Prior Work: The paper mentions relevant prior work, but is not very clear on the difference between prior work and this work, especially when comparing to prior 6-DOF synthetic datasets (I believe the difference between this work and planar grasp work is fairly clear.) The visual grasp learning section would be better if it explained how their anchor-based architecture differs from prior work.

Reproducibility: Yes

Additional Feedback: This paper feels like much more of a robotics paper than a NeurIPS paper. I feel that overall this paper is valuable and written well but I'm not sure it adds much to prior work. Edit: I have read the other reviews and the rebuttal. Thank you for the explanation about GPNet-Naive. I feel the authors have addressed some of my concerns about the dataset, but I'm not sure if antipodal labels are an important enough distinction, so I plan to keep the same score.

Review 2

Summary and Contributions: Generating a set of grasp proposals for various objects is a tough problem and is of interest in the field of robotics. The paper proposes a network to deal with this problem. Apart from generating a set of proposals, it also gives a confidence for each proposal. Specifically, the authors use PointNet++ to encode object shape features at each surface point, then feed all pairs of one grip center and one surface point as initial proposals to three headers that perform antipodal-based proposal pruning, complete grasp parameter generation, and confidence evaluation. The training data is generated using physics simulation. The proposed method is evaluated in several experiments, showing state-of-the-art performance.

Strengths: The proposed method is able to generate a large set of grasp proposals along with confidences. Experiments show the state-of-the-art performance of the proposed method. This work should be of interest to some people in the robotics field.

Weaknesses: 1. Lack of theoretical novelty. It's a task-specific solution and has a limited impact on other areas. This could potentially be a high-impact paper in a robotics journal or conference, though. 2. Though an antipodal-based classifier can prune most of the proposals, pruning and evaluating every possible combination of gripper center and surface point is not very efficient. It would be better if only promising proposals were generated at the beginning and passed on for further processing. For example, an environment-and-object-based surface point proposal would be more interesting.

Correctness: Overall the method and claims are valid. Please refer to the first point in the Additional Feedback part.

Clarity: Yes

Relation to Prior Work: Yes. If it is helpful, I suggest the authors also discuss some human hand grasping work such as 'ContactDB: Analyzing and Predicting Grasp Contact via Thermal Imaging'.

Reproducibility: Yes

Additional Feedback: 1. l.109: the rule used for the gripper angle is not very convincing. To measure the distance between two unit quaternions p and q, I suggest something like either pq* or the length of the shortest path on the 4D unit sphere between p and q. 2. Minor. In Fig 1, the FIRST header, the Antipodal Classifier, is in the bottom right corner, which is not intuitive and may cause confusion if the figure is not read in detail. It would be better if the three headers were arranged from top to bottom or from left to right rather than from bottom to top.
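As an illustration of the two quaternion distances suggested in point 1, the sketch below computes the relative quaternion pq* and the arc length between p and q on the unit 3-sphere. It assumes the (w, x, y, z) component order, and taking the absolute value of the dot product (so that q and -q, which encode the same rotation, have zero distance) is an added assumption rather than something stated in the review. All function names are hypothetical.

```python
import numpy as np


def quat_conj(q):
    """Conjugate of a quaternion given as (w, x, y, z)."""
    w, x, y, z = q
    return np.array([w, -x, -y, -z], dtype=float)


def quat_mul(a, b):
    """Hamilton product a * b for quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ])


def sphere_arc_length(p, q):
    """Shortest arc between unit quaternions on the 3-sphere, sign-invariant."""
    p = np.asarray(p, dtype=float) / np.linalg.norm(p)
    q = np.asarray(q, dtype=float) / np.linalg.norm(q)
    # The scalar part of p * conj(q) equals the 4D dot product of p and q.
    dot = abs(quat_mul(p, quat_conj(q))[0])
    return float(np.arccos(np.clip(dot, 0.0, 1.0)))


def rotation_angle(p, q):
    """Angle of the relative rotation, which is twice the arc length."""
    return 2.0 * sphere_arc_length(p, q)


identity = np.array([1.0, 0.0, 0.0, 0.0])
# Rotation of 90 degrees about the z axis.
quarter_turn_z = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
angle = rotation_angle(identity, quarter_turn_z)
```

Either quantity gives a single scalar that is zero exactly when the two orientations coincide, which is the property a gripper-angle error measure needs.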

Review 3

Summary and Contributions: The paper tackles the task of 6DOF grasping. It proposes a grasp proposal network (GPNet) that uses PointNet++ style architecture to classify grasp proposals (at grasp anchor locations), and regress to the 6DOF grasp parameters. The paper conducts experiments on datasets derived out of the ShapeNet dataset, and on a real robotic platform.

Strengths: 1. Paper tackles an important problem, that of 6DOF grasping. 2. Paper designs a new architecture for 6DOF grasp prediction. It adopts the standard object detection paradigm, and proposes to score grasp candidates (and additionally output grasp angles, and offsets). Grasp candidates are predicted from a set of densely sampled anchor locations. The grasp proposal network uses the PointNet++ architecture. Overall, I think the proposed architecture is novel, and I haven't seen specifically this architecture being used for 6DOF grasp prediction. The specific formulation makes intuitive sense. 3. The paper constructs a dataset using objects from the ShapeNet dataset for training, and evaluating the proposed Grasp Proposal Net. Paper reports favorable comparison to recent work on this problem [23].

Weaknesses: 1. While the exact architecture may not have been tried on this problem (or even otherwise), the architecture builds upon existing ideas in the field (casting grasp prediction as grasp proposal classification [24], sim2real for grasping using depth data [17], use of anchors [30], PointNet++ [27]). Thus, I won't consider the novelty of the proposed architecture super high. 2. At the same time, experimental evaluation could be stronger. Given limited novelty, I would expect a thorough empirical evaluation of the different aspects of the proposed model, e.g.: is a PointNet++ style architecture necessary, or would a more natural choice such as a 3D CNN (or a 2D depth CNN as in DexNet) work better? 3. Why does GPNet-Naive only report r = 10, b = 22, while a more thorough validation has been done for GPNet? 4. Use of non-standard evaluation metrics. Past work in [23] seems to use plots of success vs coverage. The current paper only reports 4 points on this curve. Why not include plots similar to the ones in [23]? 5. Past papers in learning-based grasping (albeit in 2D) employ a more thorough real-world experimental evaluation. By that measure, the real-world evaluation is weak.

Correctness: Empirical methodology is correct, though it could be more thorough in a) equivalent validation for baselines (see 3. in Weaknesses), b) use of standard metrics as used in past work, c) comparison to more ablations and baselines.

Clarity: The paper doesn't clearly identify its contribution. L48 to L58 mention that the proposed GPNet is novel, but fails to crisply identify what individual aspects are novel and in what way.

Relation to Prior Work: No, the paper does not clearly establish relationship to past work. It does not identify what aspects of the proposed approach are novel.

Reproducibility: No

Additional Feedback: == After Rebuttal == Thanks for your response, and providing additional experiments.

Review 4

Summary and Contributions: The focus of the work is computing 6DOF grasps for a parallel-jaw gripper over point clouds of previously unseen objects. With respect to related literature, the claim is that this approach can compute a relatively more diverse set of grasps on objects. This is based on a grasp proposal module that defines anchors of grasp centers at regular 3D grid positions.

Strengths: - The motivation of generating a diverse set of robust grasps is very relevant, as diverse grasps are useful for high-level task planning and increase the chance of success under kinematic constraints and collisions. The evaluation is appropriately designed to back the claim. - The idea of using 3D grid centers as anchors for regressing grasps over a point cloud is novel (although derived from similar ideas in 2D scenarios and in object detection). - The design of grasp pruning -> grasp regression -> grasp classification, with a feature extraction neighborhood based on the grasp proposal, is novel.

Weaknesses: See Additional feedback and questions section.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: - The advantage of this approach over [33][34] as mentioned in Line 40 is mostly computational. However, no computational analysis is done to support this claim. Do these approaches achieve a diverse set of robust grasps when given enough time, and if so, how much time does it take? The code for these approaches is publicly available. - Why do the grasps from [23] tend to focus on a single mode? Is there a theoretical limitation of the approach, and how does the proposed approach address it? It might be more insightful to elaborate on this. - The grasp proposals are first pruned based on antipodal criteria and then passed on to the regression module. Given the grasps are already antipodal (within a threshold), how important is the regression module (especially for high-resolution scenarios)? 1. Is the main objective of this module to estimate the free rotation parameter? 2. Is it only useful for low-resolution scenarios? In that case, what is the computational benefit of this over high resolution? More discussion regarding this aspect in the experiment section could be insightful. - The training and test data are from the same object categories. So it is not a fair claim to say that the learning can be generalized to completely unseen objects. This should be specified in the introduction. - It was claimed in the paper that the method is flexible enough to support either more precise or more diverse grasps by focusing or spreading the anchors of grasp centers in 3D space. This claim was not backed empirically. ****************** After Rebuttal ****************** I appreciate the response, and I think these discussions would be very useful in the revised version of the paper.