NeurIPS 2020

Online Adaptation for Consistent Mesh Reconstruction in the Wild

Review 1

Summary and Contributions: The paper proposes reconstructing a mesh of a single object from a given single-view image, and from a sequence of images, thus encouraging coherency. The paper presents an improvement in the former case, and a novel approach and significant improvement to the latter.

Strengths: - The paper demonstrates improvements for both tasks in a convincing manner. - The ideas of ARAP energy and the temporal coherency update are not novel, but they are used in an interesting and effective manner.

Weaknesses: - The paper could benefit from more comparisons. It essentially compares against only one method, which is not the dominant state of the art in all cases. The field is flooded with other works that could be compared against. In the rebuttal, the authors seem to resist making an effort to evaluate the method better, even though all reviewers agree that this is missing. I find that somewhat disappointing, and I have slightly reduced the score accordingly, since I assumed that experiments required for the quality of the exposition would simply be added upon request. That said, my general impression of what this method can achieve has not changed, so I am still recommending acceptance.

Correctness: yes

Clarity: yes

Relation to Prior Work: yes, assuming code is provided.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This paper presents a method to reconstruct 3D meshes for the known category of objects from unlabeled video data in the wild. The paper considers the case that no 3D supervision is available, and relies on weaker cues (silhouette and 2D keypoint) to train a neural network model. The core idea is to leverage temporal information of the videos, enforcing temporal coherency in texture and base shapes. Performances are evaluated on bird and zebra categories from public datasets, showing better performance than previous SOTA.

Strengths: Overall the paper is in good shape, with sufficient methodological novelty and good performance. The idea of leveraging temporal information at test time looks novel. The other parts, such as enforcing consistency in the texture map and shape base, look convincing. Removing the symmetry constraint by using shape bases and an as-rigid-as-possible constraint makes sense, although I think this is not always superior to a symmetry constraint. The paper is well written and easy to follow. The experiments support the strength of the paper, showing better performance than previous work.

Weaknesses: Only a few categories are considered (bird and zebra). Neither category has much shape variation (e.g., compared to humans and monkeys). Even within them, results on large pose variations are not shown (e.g., a flying condor, jumping zebras). This makes me conclude that the proposed method does not make a big breakthrough over previous work, showing incremental improvements at most. The shape-based model may only work for objects with limited motion variation (e.g., birds). For objects with large articulated body motion (humans or monkeys), it is not sufficient, and a skeletal structure or other approaches would be required (e.g., parametric 3D models such as SMPL).

Correctness: Looks good to me

Clarity: Overall the paper is well organized.

Relation to Prior Work: Looks good to me

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: The paper presents a method for 3D mesh reconstruction from images that can also leverage videos of an articulated object at test time. For training, it considers two modes: 1) still images of an object category with keypoint+silhouette supervision, 2) just images. In the training stage, it builds upon the CMR model [13] but learns a linear parametric model for the base shape, thus decreasing the role of deformations, and drops the symmetry constraint. At test time, it proposes to continue training the model on the video frames with a set of novel loss functions that enforce temporal consistency. The paper visually demonstrates that having videos indeed improves the consistency and quality of reconstruction. UPD. On the one hand, the rebuttal addressed my concerns; in particular, I think reporting the 3D metrics is crucial for this paper, even if on synthetic data (hopefully the authors will share the data or the protocol to generate them). On the other hand, reviewers raised reasonable concerns that the paper does not ablate all the contributions. Most notably, we agreed that it needs to compare to simple baselines based on temporal smoothing in coordinate or parameter space. I believe that this comparison is easy to add and will make the paper stronger.
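To make the requested baseline concrete, temporal smoothing in coordinate space could be as simple as the sketch below. It assumes per-frame meshes with shared topology stacked into one array; the function name `smooth_vertices` and the `window` parameter are illustrative and do not come from the paper.

```python
import numpy as np


def smooth_vertices(verts, window=5):
    """Centered moving average over time for per-frame mesh vertices.

    verts:  array of shape (T, N, 3), one independently predicted mesh
            per frame (shared topology is an assumption of this sketch).
    window: odd number of frames to average over.
    """
    verts = np.asarray(verts, dtype=float)
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    # Repeat the first and last frames so the output keeps all T frames.
    padded = np.pad(verts, ((half, half), (0, 0), (0, 0)), mode="edge")
    num_frames = verts.shape[0]
    out = np.zeros_like(verts)
    # Sum the time-shifted copies of the sequence, then normalise.
    for k in range(window):
        out += padded[k:k + num_frames]
    return out / window
```

The same filter applied to predicted shape or pose parameters instead of vertices would give the parameter-space variant of the baseline.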

Strengths: The paper proposes a new, reasonable setting for the task of mesh reconstruction for articulated objects, in which videos are available at test time, thus competing with SLAM-type methods but allowing non-rigid articulations. Having videos can potentially help resolve the ambiguities of single-view reconstruction, and the paper indeed shows how to execute on it. The novelty is strong, and if the paper had proper evaluation (see below), I would put it in the best 50% accepted bucket. Learning-based 3D reconstruction is relevant to NeurIPS. The paper proposes a set of interesting improvements over CMR that help to learn 3D reconstruction in an unsupervised way without assuming a hand-crafted base model or parametric linear model. Specifically, 1) CMR learns a keypoint-to-vertex assignment matrix and then learns a texture model on top of the 3D model without backpropagating the gradients to the 3D branch, while the proposed method uses the texture mapping for keypoint correspondences; 2) representation of the shape as a linear combination of blendshapes, decreasing the role of the deformations; 3) regularisation of the predicted shape with an as-rigid-as-possible (ARAP) constraint relative to the base shape. To be able to learn on the videos at test time to enforce temporal consistency, it proposes new ideas: 4) a part correspondence constraint based on random part segmentations and propagation with optical flow; 5) texture-flow and base-shape invariance constraints for learning on videos. Another contribution is a new dataset of 22 bird videos. It is quite small and annotated only with keypoints every 5 frames (no 3D annotation), but supplements the paper nicely as a secondary contribution.
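For reference, the ARAP regulariser in item 3 is commonly written in the standard as-rigid-as-possible form from the geometry-processing literature, shown below. The paper's exact weighting and norm may differ, and all symbols here are assumptions chosen for illustration: $\bar{V}_i$ are base-shape vertices, $V_i$ the deformed vertices, $\mathcal{N}(i)$ the one-ring neighbours of vertex $i$, $w_{ij}$ edge weights, and $R_i$ the best-fitting local rotation.

```latex
E_{\mathrm{ARAP}}(V, \bar{V})
  = \sum_{i=1}^{N} \sum_{j \in \mathcal{N}(i)}
    w_{ij} \,\bigl\| (V_i - V_j) - R_i \,(\bar{V}_i - \bar{V}_j) \bigr\|_2^2
```

The energy is zero when every local neighbourhood moves rigidly, which is why it tolerates articulation while penalising the spiky, locally stretched meshes that can still score well on 2D metrics.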

Weaknesses: The main weakness is that the paper does not evaluate the quality of 3D reconstruction. It claims to improve the quality of mesh reconstruction, but this is supported only by a few visual examples. All quantitative evaluation uses metrics related to silhouette and keypoint accuracy. However, it is possible to learn to predict those correctly without any 3D reasoning at all! The task itself in theory allows evaluating the 3D quality if ground-truth depth/3D or multiple views are available. In order to track progress in the field, we should use datasets that allow measuring it. See below for more details. As much as I like the novelty, the lack of proper evaluation makes me doubt the paper. On the clarity side, the paper describes two different methods within one framework (see below for details as well). Learning unsupervised mesh reconstruction is notoriously difficult, while the paper frames it as a minor modification. The progress in this task is interesting, so the paper would be much stronger if it discussed it in more detail (maybe it should be a separate paper).

Correctness: The paper claims to improve 3D mesh reconstruction, however it is not proven. The paper reports only 2D metrics like PCK for keypoints or IOU for masks. The methods that don’t reason about 3D are quite good at keypoint detection and instance segmentation, so one can improve those metrics while predicting a bogus mesh. The paper itself notes this flaw in line 325, where it looks at the poor meshes learned without ARAP constraints that produce better 2D metrics. I understand that the datasets used in the paper do not have 3D annotation or multiple views. This means the method may need other datasets to show the empirical improvement. Getting the numbers on a synthetic dataset (e.g. based on ShapeNet) in addition to visualisation on the real data may be a first step. There are datasets of humans obtained in the controlled environment that can also enable getting 3D metrics. Finally, the ubiquity of depth sensors makes creating a new dataset of videos of articulated objects quite viable.
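The concern about 2D metrics can be illustrated with a minimal sketch of PCK and mask IoU. The function names, the `alpha` threshold, and the orthographic projection are assumptions made for illustration rather than the paper's evaluation code.

```python
import numpy as np


def orthographic_project(points3d):
    """Drop the depth coordinate: (K, 3) -> (K, 2). Assumed camera model."""
    return np.asarray(points3d, dtype=float)[:, :2]


def pck(pred_kp, gt_kp, visible, bbox_size, alpha=0.1):
    """Fraction of visible keypoints within alpha * bbox_size of ground truth.

    pred_kp, gt_kp: (K, 2) image coordinates; visible: (K,) boolean flags.
    alpha is a hypothetical default; papers report several thresholds.
    """
    pred_kp = np.asarray(pred_kp, dtype=float)
    gt_kp = np.asarray(gt_kp, dtype=float)
    visible = np.asarray(visible, dtype=bool)
    dist = np.linalg.norm(pred_kp - gt_kp, axis=-1)
    correct = (dist <= alpha * bbox_size) & visible
    # Guard against frames with no visible keypoints.
    return correct.sum() / max(visible.sum(), 1)


def mask_iou(pred_mask, gt_mask):
    """Intersection over union of two binary masks of equal shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # Both masks empty: treat as perfect agreement.
    return np.logical_and(pred, gt).sum() / union
```

Because `orthographic_project` discards depth, two meshes whose keypoints differ only along the viewing axis receive identical PCK scores, so a high score says nothing about whether the predicted depth is plausible.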

Clarity: The paper is generally well written. Elaborating on the overlooked unsupervised case, the paper describes two related but quite different systems: a weakly-supervised one that learns an asymmetric parametric shape in the first stage and then adapts it at test time; and an unsupervised one that learns a symmetric fixed shape in the first stage and then adapts it at test time. There should probably be two separate papers about them. The description of the unsupervised method seems rushed (I would have difficulty reproducing it). In particular, Section 3.3 describes the first stage as a variant of ACMR, while Section 2 in the supplementary says it is based on [7], which is similar to the simplified ACMR (without blendshapes though) but, I think, uses a few more constraints. The results section in the supplementary is confusing, as it is often unclear which of the two systems is evaluated. Lines 112–113: I think CMR uses orthographic projection. How can you use perspective projection if the CUB dataset does not provide camera intrinsics? I did not find the number of blendshapes Nb used in the model. It would be good to have an idea of how big the space is. Lines 323–325: “our full ACMR model does not quantitatively outperform the model trained without the ARAP constraint”: according to Table 1, it does? I am not sure how to match that to the numbers.

Relation to Prior Work: Related work is complete and the distinction is described clearly.

Reproducibility: No

Additional Feedback: Not really a weakness, but using the random part segmentation (e.g. stripes) seems arbitrary, and it is unclear why it should work better than the appearance losses. It would be interesting to dig deeper into that. Typos: * Fig. 1 caption: bbox ← box, * Fig. 2: \Nabla V ← \Delta V ? (4 times) * line 123: render ← renderer, * line 222: are not strictly one-to-one corresponded ← do not correspond / cannot be bijectively mapped, * line 278: YoutTube ← YouTube.

Review 4

Summary and Contributions: This paper presents a method to reconstruct temporally consistent asymmetric meshes from in-the-wild animal videos given a categorical model. The shape is represented by asymmetric parts that can be explained by minimal deformation from the linear blend shape. The temporal consistency is enforced on a UV-space texture map and part maps. The work is evaluated on real bird and synthetic zebra data, showing the ability to model diverse asymmetric poses and their stabilized motion. They compare it with existing work including CMR.

Strengths: + (Contribution) The main contribution of this paper is a new way to reconstruct temporally consistent 3D shape without 3D supervision. The key idea is to enforce geometric and appearance coherence by using the following facts: (1) the base shape that describes object identity needs to be similar between two frames; (2) the texture in canonical coordinates must be the same; (3) predicted part semantics must be consistent in 3D. This makes stabilized reconstruction possible. This is novel and the result is compelling. Beyond this, the paper makes the following contributions: + A linear blend shape representation for the base shape + ARAP deformation to express asymmetric body parts + Strong performance compared to frame-based reconstruction (CMR) + (Clarity) The method section is well written, with clear exposition of the framework.

Weaknesses: (Validation of novelty) Leveraging texture and shape consistency for online refinement has been used in prior literature, e.g., [3, 4], in a different context and application. I believe the most significant contribution is the way texture and shape invariance is applied, i.e., swapping pose and texture across time to enforce consistency. However, its novelty is not empirically validated at all, i.e., why is this design choice reasonable and better than the existing idea? For instance, they argue that their idea can address blurry texture and low-resolution shape, while the results look blurred and low-resolution anyway. Without a rigorous ablation study in isolation, it is not clear how the core novelty makes an impact. Regarding static shape learning, the work is by and large built upon previous work [1, 2]: the keypoint reprojection idea is similar to geometric cycle consistency [1], ARAP is used in many places in graphics for mesh deformation, and so on. The paper should clearly state and validate the critical gap produced by the novelty on static shape learning as well. (Evaluation) This is not the first work that addresses the temporal consistency of video reconstruction. For instance, [5, 6] learn dynamics from videos that produce highly stabilized reconstruction. Further, various metrics have been used to evaluate the performance of video reconstruction, such as the keypoint acceleration loss [4]; other ways to measure it could be a Chamfer loss on the shape/base shape from neighboring frames and L1/L2 losses on texture maps. Regarding the static shape representation (linear blend shape + ARAP), an articulated joint constraint is used to model variable object shapes [7]. (Generalizability) While the method is agnostic to the object category, the paper includes limited validation only on bird data (the zebra data are synthetic). It is hard to predict the generalizability of the method. Prior works have evaluated on diverse quadrupedal animals (e.g., horse, hippo, lion) and deformable objects (e.g., cars, humans, and faces). (Degree of deformation) The shape modeling shares strong similarities with the SMPL/SMPL+D model used to model (clothed) humans. As a simplified version, their shape bases are supposed to model base shapes (identities) across the database, while deformations are modeled by offsets. However, visualization of the bases (supp. Fig. 10) reveals that deformation is also included in the bases, e.g., a flying duck. Base shapes are thus inadequately constrained and their semantics rather loose. This will probably fail to model challenging scenarios, e.g., a clip in which a resting duck starts to fly (i.e., topological change). Further, how multimodal is the base shape over the learnt bases? How about including a video showing the base shape deforming over time in addition to the complete shape? Missing reference: [7].
[1] Canonical Surface Mapping via Geometric Cycle Consistency
[2] Learning Category-Specific Mesh Reconstruction from Image Collections
[3] High Fidelity Facial Performance Tracking In-the-wild, CVPR 2019
[4] DeepCap: Monocular Human Performance Capture Using Weak Supervision, CVPR 2020
[5] Learning 3D Human Dynamics from Video
[6] VIBE: Video Inference for Human Body Pose and Shape Estimation
[7] Articulation-aware Canonical Surface Mapping, CVPR 2020
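The temporal-consistency metrics suggested under (Evaluation) above could be sketched as follows. This is a minimal illustration, not the protocol of any cited work: the function names are hypothetical, the Chamfer variant uses squared distances averaged in both directions, and the acceleration measure is a ground-truth-free second finite difference.

```python
import numpy as np


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Uses mean squared nearest-neighbour distance in both directions;
    other variants (unsquared, max instead of mean) also exist.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise squared distances via broadcasting, shape (N, M).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()


def temporal_chamfer(verts_seq):
    """Mean Chamfer distance between consecutive frames; verts_seq (T, N, 3)."""
    verts_seq = np.asarray(verts_seq, dtype=float)
    pairs = zip(verts_seq[:-1], verts_seq[1:])
    return float(np.mean([chamfer_distance(p, q) for p, q in pairs]))


def mean_acceleration(traj):
    """Mean magnitude of the second finite difference of a trajectory.

    traj: (T, K, D) keypoints or vertices over T >= 3 frames. Lower values
    mean smoother motion; no ground truth is required for this variant.
    """
    traj = np.asarray(traj, dtype=float)
    accel = traj[2:] - 2.0 * traj[1:-1] + traj[:-2]
    return float(np.linalg.norm(accel, axis=-1).mean())
```

Applied to the predicted base shapes of neighboring frames, `temporal_chamfer` should be close to zero if the identity is truly held constant, while `mean_acceleration` on reprojected keypoints exposes frame-to-frame jitter.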

Correctness: Reasonable.

Clarity: Reasonably well.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: See above.