Review for NeurIPS paper: Canonical 3D Deformer Maps: Unifying parametric and non-parametric methods for dense weakly-supervised category reconstruction

NeurIPS 2020

Review 1

Summary and Contributions: This paper addresses the problem of inferring shape and texture given a collection of images where each image depicts a different instance of a shape category with known object masks and sparse keypoints during training. This paper incorporates a lot of different pieces, but the main technical contribution is to combine predicting (spherical) surface canonical coordinates with the C3DPO nonrigid SfM pipeline. The paper evaluates on several datasets and quantitatively compares against CMR [29] for shape reconstruction and qualitatively for shape texture transfer.

Strengths: This paper addresses a hard problem. While the main components were demonstrated in previous work (image => canonical coordinates with cycle loss for mapping to a single canonical shape [34], C3DPO for nonrigid shape reconstruction with sparse keypoints [43]), their combination for dense shape reconstruction and texture transfer makes it interesting and timely. The paper also incorporates other technical details for improving the results - appearance loss (used in prior work [32]), soft occlusion regularization, and deformation model for generating texture. The quantitative and qualitative improvement over CMR [29] (state of the art) for shape reconstruction is compelling and model ablations with respect to shape reconstruction look reasonable.

Weaknesses: Suggestions for improvement:
1. The main weakness for me is that the contribution with respect to texture transfer is not fully evaluated. While the textures look qualitatively better than CMR's in Figure 4, it's not clear if the difference is solely due to better shape reconstruction. Missing is a comparison where CMR's texture transfer method is applied to this paper's proposed shape reconstruction pipeline. Also, it would be great to have a quantitative evaluation of texture transfer. A few potential ideas come to mind for quantitative evaluation: run a user study, evaluate a pre-trained classifier on the rendered shape with texture, or compute an FID score.
2. It would be good to see an experiment on how well the proposed method works on keypoint transfer (as is evaluated in the CSM paper [34]) to see whether the C3DPO shape model helps with this task.
3. Both this method and CMR use ground truth masks and keypoints during training. It would be great to see how well the method performs when both the masks and keypoints are computed automatically during training and test (L147-148).
4. It would be good to have a more detailed discussion of the differences with respect to the shape and texture model of CMR (either in the related work or in Section 3).
5. While this paper just appeared at CVPR 2020, it may be worth mentioning anyway in the related work: Articulation-aware Canonical Surface Mapping. Nilesh Kulkarni, Abhinav Gupta, David Fouhey, Shubham Tulsiani. CVPR, 2020.
6. For B(k) and C(k), it may be good to try a sinusoidal positional encoding, similar to transformers and NeRF.
7. It would be good to see F-score reported for shape reconstruction (see the paper "What Do Single-view 3D Reconstruction Networks Learn?" for setting up F-score).
In light of these suggestions, I'm still leaning positive on this paper. However, for me, points (1) and (2) are the main limitations of the current paper draft.
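To make suggestion 7 concrete, here is a minimal sketch of an F-score between a predicted and a ground-truth point set. It follows the usual definition (harmonic mean of precision and recall at a distance threshold). The function name `f_score`, the brute-force distance computation, and the choice of threshold are assumptions for illustration, not the protocol of the cited paper.

```python
import numpy as np

def f_score(pred, gt, threshold):
    """F-score between point sets pred (N, 3) and gt (M, 3) at a distance threshold."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    # Pairwise Euclidean distances, shape (N, M). Brute force; fine for small sets.
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    # Precision: fraction of predicted points within `threshold` of some GT point.
    precision = float(np.mean(dists.min(axis=1) <= threshold))
    # Recall: fraction of GT points within `threshold` of some predicted point.
    recall = float(np.mean(dists.min(axis=0) <= threshold))
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

In practice both sets would be sampled from the reconstructed and ground-truth surfaces after alignment, and the score would typically be reported at one or more thresholds.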

Correctness: I did not find any correctness issues.

Clarity: The paper is well written and was a pleasure to read. Some small comments: L113 "most surfaces" => maybe be more precise and say that the shapes are assumed to be genus-0. Section 3 notation: Please be careful with the notation and variable overloading. For example, L115 X(k) and Eq (1) X(y). Equation (2), maybe call left-most "y" as "y_hat" (as is done later). L130 "I(y)". L204 "where K consists of…" - This sentence was not clear to me. Why are random samples from training images returned for a test image? There are small typos throughout (e.g., L3 "an novel") - please proofread.

Relation to Prior Work: The references are good.

Reproducibility: Yes

Additional Feedback: Final feedback: After a very healthy discussion, and after considering the rebuttal and other reviews, I'm still positive on this paper. Please see the meta review for a summary of the key exchanges during the discussion and a list of highly recommended requested changes.

Review 2

Summary and Contributions: This paper tackles the problem of single-view 3D reconstruction and learning to do so from image collections with 2D annotations, the same setup as CMR. This paper combines the recent CSM objective and an implicit shape and texture basis (similar to AtlasNet-sphere) instead of an explicit shape representation such as the meshes used in CMR. Simply put, this paper is CMR + CSM initialized with C3DPO, with an implicit shape basis instead of a mesh. The results look nice qualitatively and improve upon CMR quantitatively. However, it's unclear if the improvement comes from the change in shape representation. A key ablative experiment is missing, without which it is challenging for the community to decipher what the lessons are.

Strengths: - Good quantitative metrics on faces and cars - Results look nice, especially those of texture transfer.

Weaknesses:
1. How much of the improvement is coming from the implicit shape representation over meshes? The proposed approach of combining local and global information via CSM's consistency loss could have been done with meshes (CSM was mesh based). What does the result look like with meshes? Or CMR could have also used this implicit shape representation. This key ablative study is missing.
2. In the face results in Figure 4, the shortcomings of CMR mainly seem to lie in the mesh representation. How was CMR initialized given the half-sphere nature of the faces? This should be discussed.
3. The paper says it adopts the evaluation protocol of CMR and compares qualitatively; however, CMR includes a quantitative evaluation on the CUB dataset that can be done via the mask IoU and PCK metrics.
4. The paper claims to have a significant gain over CMR on Pascal3D Chairs; however, none of the results from Plane, Chair, and Bus are shown in the paper or the supplemental. This is rather questionable. It's not surprising that the spherical nature of the representation does not work on Chairs. I suspect much of the difference comes from the representation (as an implicit sphere is easier to deform than an explicit mesh), again calling for the need to evaluate the proposed approach using the same representation, since the proposed idea of combining CMR and CSM can be applied to either representation.
5. Also, there are rather ad hoc losses, such as the min-k perceptual loss and L_emb-align, which were not used in CMR.
6. Limitations of the proposed approach are not discussed.
7. The paper should cite Groueix et al. AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation, CVPR 2018, as the implicit 3D surface representation is very similar to AtlasNet-sphere.
8. The need for an explicit basis is not clearly motivated -- the ablation seems to indicate it is needed; however, in principle, can't the non-linear MLP basis learn all of this? Furthermore, it would be interesting to visualize the space of dense bases learned after this process.
Unclear points:
a. At test time, is the silhouette used to visualize the results in Figure 3? What does the canonical map output in the background? This should be clearer.
b. Line 193. Why does CUB have this problem of "reconstructed surface tends to be noisy due to some parts .. overfitting to specific images"? What exactly is meant by this? And why does the mask reprojection loss help with this? I don't see the connection between the issue of surface noise and the silhouette loss; besides, this loss was also used in CMR.
c. Why is the section at line 189 called soft occlusion? The actual name of the loss is emb-align. The section name is not very intuitive.
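For point 3 above, the two metrics could be computed along the following lines. This is a minimal sketch rather than CMR's evaluation code: the function names, the use of a single reference size `ref_size` (for example, the larger side of the object bounding box), and the default `alpha=0.1` are assumptions made for illustration.

```python
import numpy as np

def mask_iou(pred_mask, gt_mask):
    """Intersection over union of two binary masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return float(np.logical_and(pred, gt).sum() / union)

def pck(pred_kp, gt_kp, visible, ref_size, alpha=0.1):
    """Fraction of visible keypoints predicted within alpha * ref_size of ground truth."""
    pred_kp = np.asarray(pred_kp, dtype=float)
    gt_kp = np.asarray(gt_kp, dtype=float)
    visible = np.asarray(visible, dtype=bool)
    if not visible.any():
        return float("nan")  # no visible keypoints to score
    err = np.linalg.norm(pred_kp - gt_kp, axis=-1)
    return float(np.mean(err[visible] <= alpha * ref_size))
```

Both numbers would be averaged over the test set, with predicted masks and keypoints obtained by reprojecting the reconstructed shape into the image.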

Correctness: Yes

Clarity:
- Much of the core detail of the paper is in the supplemental. The writing in the intro and related work (e.g., the sparse NR-SFM section) could be shortened to dedicate more space to the core content of the paper.
- For one, the ablation study is not well explained due to lack of space. What exactly is "basis" in Table 1? The paper lists the equations 5, 7, 8, 4, but they don't correspond in the same order in the table, which is repro (5), basis ?, min-k percep (7), emb-align (8). So basis must be the first term in (4) one-by-one, but it's not clear what this means, and this should be discussed in the paper, not just in the table caption. In particular, this is an important ablation, as it's not clear to me why you need the explicit basis when the spherical MLP mapping could in fact capture the deformation as well.
- Paragraph from line 115 and Figure 1 -- at first it was not clear why B has to be a function of k, i.e., why the expression needs to be written as $B(k)\alpha$ instead of $B\alpha(k)$ (taking the index before the linear combination vs. after). I believe this is due to B being implicit; if the representation were explicit, this indexing could have happened after taking the linear combination. I think this should be explained more.
- Line 204. k is just a point on the sphere, so why not just sample from the sphere instead of sampling from the output on the training images?
- The sentence in Line 174 is confusing, as in this single-view collection setup there are no two views of the same object. Later in the paragraph it says that the images are of different instances.
- The paragraph at line 163 should cite CSM and attribute the idea to it.
- Which 2D loss is required (silhouettes and landmarks) should be made explicit in the introduction.
- The notation of beta around line 131 is confusing, since B is used for the deformation basis. I think \tau would be better.
- In the intro it's unclear what is meant by intrinsic quantity; giving a more concrete example would help.
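The $B(k)\alpha$ versus $B\alpha(k)$ point can be illustrated with a toy sketch. Everything here is hypothetical: the sizes, the random table `B_table`, and the one-layer stand-in `B(k)` for a learned basis network are assumptions chosen only to show the order of operations, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4                          # number of basis shapes (hypothetical)
alpha = rng.normal(size=D)     # per-image shape coefficients

# Explicit (mesh-like) basis: a fixed table with one 3 x D block per vertex.
V = 5
B_table = rng.normal(size=(V, 3, D))
all_vertices = B_table @ alpha  # combine first, giving shape (V, 3)
i = 2
# Indexing after the linear combination equals indexing before it.
assert np.allclose(all_vertices[i], B_table[i] @ alpha)

# Implicit basis: a function of a continuous point k on the unit sphere.
W = rng.normal(size=(3 * D, 3))  # stand-in for learned network weights

def B(k):
    """Toy stand-in for a basis network: maps k on the sphere to a 3 x D matrix."""
    return np.tanh(W @ k).reshape(3, D)

def X(k, alpha):
    """3D point for canonical coordinate k: the basis is evaluated at k first."""
    return B(k) @ alpha

k = np.array([0.0, 0.0, 1.0])
point = X(k, alpha)  # shape (3,)
```

With the table, the full shape can be formed once and then indexed by vertex. With the implicit basis there is no finite table to combine, so the basis has to be evaluated at each queried k before the linear combination.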

Relation to Prior Work: The paper could mention AtlasNet-sphere, and the exact difference with CMR in related work.

Reproducibility: Yes

Additional Feedback: The results look nice; however, the paper should ablate the improvements that come from the implicit representation and provide the ablation as discussed above. Also, the method seems more complicated, with lots of losses that weren't used in CMR even though the setup is similar. The approach combines several existing works together: CMR, C3DPO, and the CSM consistency loss with an implicit surface representation. For publication, a better ablative study and significant improvements in writing are necessary. ==================== Post-rebuttal feedback: Thanks for addressing some of my concerns. After an extensive discussion + rebuttal, I am increasing my rating on this paper given that showing how local and global 3D information may be combined is an interesting, very reasonable direction. However, there still remain key experiments and concerns that, if addressed, would make this paper stronger. We have written these out as hard requirements if the paper is to be accepted. Given this I am raising my score to 6: marginally above. ====================

Review 3

Summary and Contributions: This paper proposes the Canonical 3D Deformer Map, a representation of category-specific shapes from multi-view images. This representation provides correspondence across different category instances by "anchoring" different objects in a common space. The approach combines a parametric and a non-parametric representation into canonical maps. Specifically, the parametric representation is provided by non-rigid SfM, and the non-parametric representation comes from the depth prediction. The advantage of the proposed approach is that it provides not only correspondence across different objects, but also a dense description of the shape.

Strengths: The method is able to provide shape estimations of non-rigid shapes, which are known to be notoriously hard to reconstruct. In addition, such shape estimations provide correspondence across different objects within the category of interest, as in the original CMR work. The paper is well-written, specifying each loss used and justifying their use by ablation studies. The comparison against CMR shows the model outperforms CMR.

Weaknesses: I am mostly concerned about the contribution of this paper. Building upon CMR, this work incorporates additional non-rigid SfM cues, but it remains unclear to me how this is crucial. Intuitively, this should be helpful when the input images are of the same bird/car/person that is deforming non-rigidly. However, from what the authors present, I cannot figure out how this non-rigid addition is helpful, since what non-rigid motion is present and how it makes CMR fail are unclear. Although the losses are elaborated clearly, a combination of so many losses is a bit alarming to me as to how well this method actually works. This is not a major concern, though, provided that the authors can use better non-rigid input to demonstrate how CMR fails in such cases, and how this work succeeds. The handling of view-dependent effects by relaxing the loss to be perceptual is unsatisfying. To properly handle such effects, the authors may consider modeling viewing directions explicitly, since the camera parameters have been estimated already anyway. Finally, the reliance on non-rigid SfM as a preprocessing step makes me wonder what the subsequent model will then do. The fact that subsequent network training relies on successful non-rigid SfM seems like a major limitation. ==================== Post-rebuttal feedback: Thanks for clarifying some of my concerns. After extensive discussion, we settled on a few hard requirements for the paper to be accepted (please see the metareview). I'm not raising my score here as it's hard for me to predict if those requirements could be met. If they were, my score could be regarded as a 6: marginally above. ====================

Correctness: Yes.

Clarity: Mostly yes, but I still don't get why depth prediction by a network is considered non-parametric. I would call that parametric.

Relation to Prior Work: Looks mostly complete.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: This paper extends three existing techniques for image-based shape reconstruction: canonical surface mapping, non-rigid structure from motion, and depth prediction. Instead of predicting depth, it predicts a mapping to a canonical domain (i.e., a spherical parameterization of a deformable shape). This canonical mapping provides corresponding 3D points on all deformable shapes, which are further used in 3D reconstruction with a known map to the image. This work nicely extends the current family of research projects on canonical mapping in surface reconstruction. It also leverages various consistency terms and regularizations that are unique to the particular set of representations used in this paper.

Strengths: This paper builds on several state-of-the-art techniques to deliver a robust technique for surface reconstruction with weak supervision. It shows compelling empirical results and an interesting high-level contribution: combining parametric and non-parametric models.

Weaknesses: The novelty of this work is medium (i.e., above the NeurIPS bar, but maybe not ground-breaking). It mostly builds on existing well-known loss functions and representations; however, the combination it provides is technically sound, and it is an improvement over prior work.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: