NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID: 413
Title: Single-Image Depth Perception in the Wild

Reviewer 1

Summary

This paper addresses the problem of depth estimation given a single 2D image as input. The paper has two contributions: (i) a new large dataset of human pairwise relative depth judgements, and (ii) a novel ranking loss function that encourages predicted depths to agree with the ground truth. The ranking loss is built on top of a variant of the "hourglass" network of [38]. The proposed approach is compared against the recent approach of Zoran et al. [14] on the tasks of ordinal relation and metric depth prediction on the NYU dataset, and outperforms it on both tasks. The proposed approach outperforms Eigen and Fergus [8] on ordinal relation prediction, but underperforms on the harder task of metric depth prediction.
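For concreteness, here is a minimal NumPy sketch of a pairwise ranking loss of the kind described above; the function name, the +1/-1/0 label convention, and the exact form of the terms are illustrative assumptions, and the paper's actual formulation may differ.

```python
import numpy as np

def pairwise_ranking_loss(pred_depth, pairs, labels):
    """Ranking loss over ordinal depth annotations (illustrative sketch).

    pred_depth : (H, W) array of predicted depths for one image.
    pairs      : (K, 4) integer array of pixel coordinates (y1, x1, y2, x2).
    labels     : (K,) array; +1 if point 1 is annotated as farther,
                 -1 if closer, 0 if the two points are roughly equal in depth
                 (a convention chosen for this sketch).
    """
    z1 = pred_depth[pairs[:, 0], pairs[:, 1]]
    z2 = pred_depth[pairs[:, 2], pairs[:, 3]]
    diff = z1 - z2
    ordered = labels != 0
    # Ordered pairs: a soft penalty log(1 + exp(-r * (z1 - z2))) that is small
    # when the predicted ordering agrees with the annotation.
    rank_term = np.logaddexp(0.0, -labels[ordered] * diff[ordered])
    # Pairs annotated as equal: a squared difference pulls the two depths together.
    equal_term = diff[~ordered] ** 2
    return rank_term.sum() + equal_term.sum()
```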

Qualitative Assessment

I appreciate the novel dataset and the combination of the hourglass network with the ranking loss. The latter insight appears to yield an improvement over Zoran et al. [14], which estimates the ordinal relationships first and then, conditioned on the estimated relationships, optimizes a constrained quadratic problem. The paper also investigates whether superpixel sampling is necessary (it was used in [14]), and shows improved performance over [14] without it. The writing is clear, the references are good, and the experiments are thorough. Minor comment: it seems that performance has not yet saturated with respect to the random sampling experiment in Table 2. Is it possible to try with even more samples?

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 2

Summary

This paper tackles the problem of estimating depth for images "in the wild". Specifically, current depth datasets have certain biases, e.g., only indoor scenes [4] or urban street environments [5]. The paper introduces a new dataset with a much larger variety of scenes, where depth annotation comes as the relative depth between pairs of random points. In addition, the authors introduce a method that uses the provided annotations to make pixel-wise depth predictions with a neural network architecture.

Qualitative Assessment

Overall, I find the paper interesting and with high potential for future work. The main contributions are a) the dataset and b) the loss for using the provided annotations (see more details below). Also, the experimental evaluation is thorough and captures well the comparison with other methods and the relations to other depth datasets. The dataset itself is very useful, as relative relations between points have already been used in practice for different tasks [18, 14, 29, 30]. Moreover, a depth dataset with diverse scenes is a good step towards more general single-image depth estimation methods. Another positive of the paper is the introduction of the loss function. With this loss, it is possible to obtain direct pixel-wise depth estimates with a CNN architecture. It would be interesting to see this loss applied to other tasks as well, such as intrinsic image decomposition. Only some minor comments: In the experiments section, it is confusing to present results from the same table in different parts of the text. For example, the rand_* results are discussed separately from the rows above and below them, making the text difficult to follow. In several cases, the authors use questions to make a point (lines 24, 102, 109, 157, 217). Overdoing this makes the text too informal.

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)


Reviewer 3

Summary

In this paper, the authors collected a new dataset for training 3D reconstruction from a single image. Unlike previous datasets that recorded the true 3D structure of the scenes, this dataset only records the relative depth of a pair of points in each image. This is because, compared to annotating the absolute depth of the environment, it is easier for humans to judge the relative depth of two given points in a scene. In addition, the authors designed a new method that uses annotations of relative depth to learn a model for single-image 3D reconstruction.

Qualitative Assessment

First, since the new dataset is one of the key contributions of this paper, I suggest that the authors provide more statistics about the dataset. For example, among the 495K images in the dataset, how many images describe large scenes? How many images focus on small objects? Among all the scene images, how many describe indoor scenes, outdoor urban scenes, and outdoor wild scenes? Among all the object images, how many describe man-made objects and how many animals? Based on such statistics, we could evaluate the generality of the new dataset.

Second, the new dataset can only be applied to methods that can learn from ordinal relations between two points in an image. Thus, most existing methods cannot use this dataset.

Third, I do not understand why the authors only label a single pair of points in each image. I admit that labeling two pairs of points in two different images may provide more information than labeling two pairs of points in a single image, because two pairs of points in the same image may share similar relative depth. However, there are also some advantages to annotating more pairs of points per image. 1) Labeling more points may greatly decrease the structural uncertainty of the image, thus providing more reliable training images. It is possible that a small number of reliable training images contributes more to model learning than a large number of unreliable training images. 2) Labeling more points in a single image may make the dataset applicable to more methods, because a single pair of annotations per image is a very strict constraint.

Fourth, 50% of the point pairs for annotation were randomly selected, and the other 50% were placed symmetrically. Such a design for point annotation is quite arbitrary. I suggest that the authors design a loss to drive an active framework for point annotation, which uses the loss to identify the most informative point pairs in an image (a sketch of one possible selection rule is given below). This would be a more efficient way to construct the dataset.

Fifth, the proposed method for learning from ordinal point relations has only minor novelty.
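As a rough illustration of the active-annotation idea above, here is a minimal NumPy sketch that ranks candidate pairs by how ambiguous the current model finds them. The function name and the ambiguity criterion (smallest predicted depth gap) are assumptions made for illustration, not anything proposed in the paper.

```python
import numpy as np

def most_informative_pairs(pred_depth, candidate_pairs, top_k=1):
    """Active selection of point pairs to annotate (illustrative sketch).

    Ranks candidate pairs by how ambiguous the current model finds them:
    pairs whose predicted depths are nearly equal are assumed to carry the
    most information once a human supplies the ordinal label.
    """
    z1 = pred_depth[candidate_pairs[:, 0], candidate_pairs[:, 1]]
    z2 = pred_depth[candidate_pairs[:, 2], candidate_pairs[:, 3]]
    gap = np.abs(z1 - z2)
    order = np.argsort(gap)               # smallest predicted gap first
    return candidate_pairs[order[:top_k]]
```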

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 4

Summary

This paper proposes a novel network structure to estimate a depth map from a single image input. The main contribution of this work is a network that can directly output a dense depth map while using only annotations of relative depth in the training stage. Previous work either requires a full depth map at training time, or can only predict the relative depth between pixels, so that an additional post-processing step is required to create the dense depth map. The authors also propose a new dataset, "depth in the wild", which consists of more challenging test images compared with previous RGB-D datasets. Experiments show that the proposed algorithm outperforms state-of-the-art single-image depth estimation algorithms on the NYU dataset and the new "depth in the wild" dataset.

Qualitative Assessment

Overall, this paper is of high quality. The proposed network trained on relative depth annotations is novel. Experimental results, both quantitative and qualitative, show superiority over previous methods. The presentation of the work is very clear and the results should be easy to reproduce. My main concern is the number of point pairs required to train the network. The authors claim that only one pair per image might be enough to train the network. I am not convinced that, from just one relative depth label per image, the network can predict a decent depth map that is smooth inside objects and has sharp boundaries between objects. Specifically, on the "depth in the wild" (DIW) dataset, the authors only show the qualitative result of Ours_NYU_DIW, which is pre-trained on the NYU dataset using all pairs of training points in all training images (L195-L197). Also, Table 3 shows that when training directly on DIW (which only has sparse depth labels), the Weighted Human Disagreement Rate (WHDR) is roughly 9% higher than for the network pre-trained on the fully annotated NYU dataset. This raises the question of whether a single label per image is enough to learn a good dense depth map, or whether it mostly helps to adapt a network pre-trained on a fully labeled dataset (like NYU) to a new dataset with only sparse labels (DIW). If that is the case, the authors should make this point clearer in the paper. It would also be great if the authors could show the quantitative result of Ours_DIW and compare it with Ours_Full and Ours_NYU_DIW. I gave this work a poster-level rating mainly due to this concern.
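For reference, here is a minimal NumPy sketch of how a disagreement rate like WHDR might be computed; the ratio-based decision rule, the equality threshold, and the per-pair weights are illustrative assumptions, and the paper's exact definition may differ.

```python
import numpy as np

def whdr(pred_depth, pairs, human_labels, weights=None, tau=0.02):
    """Weighted Human Disagreement Rate (illustrative sketch).

    An ordinal relation is derived from the predicted depths of each point
    pair and counted as a disagreement when it differs from the human label
    (+1: point 1 farther, -1: point 1 closer, 0: roughly equal).
    `tau` is a placeholder equality threshold on the depth ratio.
    """
    z1 = pred_depth[pairs[:, 0], pairs[:, 1]]
    z2 = pred_depth[pairs[:, 2], pairs[:, 3]]
    ratio = z1 / np.maximum(z2, 1e-8)
    pred_labels = np.zeros_like(human_labels)
    pred_labels[ratio > 1.0 + tau] = 1    # point 1 predicted farther
    pred_labels[ratio < 1.0 - tau] = -1   # point 1 predicted closer
    if weights is None:
        weights = np.ones(len(human_labels), dtype=float)
    disagree = (pred_labels != human_labels).astype(float)
    return float((weights * disagree).sum() / weights.sum())
```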

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 5

Summary

This paper proposes training a deep neural network to estimate scene depth (up to monotonic transformations) for single images in the wild. The authors achieve this by building an extensive dataset of relative depths for points in images. Human crowdsourcing was used to obtain relations between pairs of points, indicating which of the two points is closer. A given image would be annotated with many such pairs, and ultimately a large dataset of such images is created. A ranking-based loss function that considers both the annotated relative depths and the estimated depth is proposed. This loss function is used to train the neural network to output pixel-wise depth estimates. The authors show that their framework allows for using a much larger amount of data from the wild than is possible with existing frameworks, which rely on specialized depth sensors for building training datasets.

Qualitative Assessment

I like the potential to scale up the approach to larger "in the wild" training datasets. I think this work has the potential to be picked up by many researchers in academia and industry, as scaling it up is straightforward. My only issue with the work is that it is unclear how the loss function would enforce the degree of smoothness in the estimated depth between neighboring pixels. How does the density of human annotations for point pairs in scenes affect the results? I suspect the denser the annotations, the better. This would explain why Table 3 shows that pretraining on the NYU dataset (which has ground-truth Kinect depth) gave the best results. Otherwise, I think this is a solid paper that (to my knowledge) is the first to make use of such a large (495K images) in-the-wild dataset for depth estimation. Another minor point is that from earlier parts of the paper, I had the impression that each training image contained only a single pair of annotations. But on reading about the loss function, it seems that multiple annotations are used in each image; it is only that each point pair annotation was done by a single person.

Confidence in this Review

2-Confident (read it all; understood it all reasonably well)


Reviewer 6

Summary

This paper introduces an approach to leveraging ordinal depth relationships to predict a full depth map from a single image. To this end, a deep network is trained with a loss specifically designed to encode depth ordering. The network, however, still outputs a full depth map, and thus avoids the two-stage procedure that would consist of first predicting depth orderings and then optimizing a depth map to satisfy these relationships. The paper also introduces a new dataset of images in the wild with ground-truth ordinal depth relationships.

Qualitative Assessment

In general, I quite like the paper, as I think that having an end-to-end framework to predict depth from weak annotations is valuable. The paper is sometimes a bit misleading:
- From the introduction, I was under the impression that the proposed method would outperform [8] on NYUv2 when trained with a single pair per image coming from the new dataset. This would have been truly remarkable, but is not the case.
- The paper suggests that the model can predict metric depth by just being trained using ordinal relationships. This is not entirely true, since, as mentioned in the experiments, the predicted depth maps need to be rescaled to match the training data statistics (a sketch of one such rescaling is given after this list). In other words, some additional information (although quite weak) is still required to predict metric depth. I would suggest that the authors rephrase their statements to clarify this throughout the paper.
Experiments:
- Since the authors use an architecture that has not been employed before for depth estimation (the hourglass network), I think it would be interesting to also evaluate this architecture in the fully-supervised case (with full depth maps). While the comparison with [8] is interesting, it is unclear how much of the benefit comes from the use of a different network, and how much truly comes from using the ordinal relationships.
- Using between 800 pairs and all of them on NYUv2 is not very realistic, although I acknowledge that it corresponds to what [14] did. In practice, one can only expect people to label a much smaller number of pairs. It would be interesting to study the robustness of the method when decreasing the number of pairs. As a matter of fact, I think that this experiment would be more valuable than the one done by removing the superpixels.
- It would also be interesting to see how well the network trained on DIW only performs on NYUv2 for metric depth prediction.
- On lines 184-186, the authors mention that [14] uses a ratio-based rule to determine whether two points have the same depth, but here the difference is employed. Why not use the same rule as in [14]?
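Regarding the rescaling point above, here is a minimal NumPy sketch of one plausible way a scale-free depth prediction could be aligned with metric statistics (matching mean and standard deviation in log-depth space); the function name and the normalization are assumptions for illustration, and the paper's exact procedure may differ.

```python
import numpy as np

def rescale_to_metric(pred_depth, train_mean_log, train_std_log):
    """Align a scale-free depth prediction with metric training statistics.

    Normalizes the prediction in log-depth space and then restores the
    mean and standard deviation measured on the (metric) training set.
    """
    log_d = np.log(np.maximum(pred_depth, 1e-8))
    log_d = (log_d - log_d.mean()) / max(log_d.std(), 1e-8)
    log_d = log_d * train_std_log + train_mean_log
    return np.exp(log_d)
```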

Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)