Paper ID: 354
Title: Higher Order Priors for Joint Intrinsic Image, Objects, and Attributes Estimation
Reviews

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper is very exciting. The authors present a technique for jointly inferring the intrinsic properties of a scene (shapes, colors, illuminations) and a semantic segmentation of the scene (object labels, attributes). The technique basically works by gluing together Barron & Malik's SIRFS work with the DenseCRF work, where the "glue" holding the two models together is a dual-decomposition scheme built around a segmentation technique. Doing the two tasks jointly seems to help both, not by a great deal, but probably enough.
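Roughly, as I understand it, the coupling is of the following form (my own illustrative notation, not the paper's): the joint energy splits into a SIRFS term and a DenseCRF term that share a subset of variables, and dual decomposition enforces agreement through Lagrange multipliers,

\min_{x,y}\; E_{\mathrm{SIRFS}}(x) + E_{\mathrm{CRF}}(y) \quad \text{s.t.}\quad x_s = y_s
\quad\Longrightarrow\quad
\max_{\lambda}\; \Big[ \min_{x}\big(E_{\mathrm{SIRFS}}(x) + \lambda^{\top} x_s\big) + \min_{y}\big(E_{\mathrm{CRF}}(y) - \lambda^{\top} y_s\big) \Big],

with each subproblem handled by its native solver and \lambda updated by subgradient steps.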

The goal of this paper is great. Combining semantics and intrinsics is definitely something the computer vision community should focus on more, and I can see the machine learning community getting excited about the ideas here. The algorithm, though necessarily very complicated as it's a superset of complicated models, is presented well. The idea is novel, as is the overall model, though the machinery itself is not very novel.

I do have some concerns regarding clarity and evaluation. First, it's not clear to me how to contrast this model with Barron & Malik's work. Though the authors cite [4] (the scene-based version of the model), they don't seem to have used it. Instead they've made a new "scene" generalization out of what is presumably some past version of SIRFS. Can the authors clarify exactly what they did? It seems like the authors used the technique from [2] on different segments of the image, while sharing illumination across the entire image. This seems to be a technique for generalizing SIRFS to scenes, but it is not [4]. How does this technique work by itself for the task in [4], of just doing intrinsic reconstruction? And why is only one illumination used? In [4] they get a lot out of modeling a mixture of lights, and many of the mistakes made by this model seem to be due to this incorrect assumption of a single global model of illumination. This needs to be explained well, because the authors' "scene" version of [2] is used as a baseline by this paper, when it seems to be a novel algorithm of its own. Why not actually benchmark against [4]? The output here looks significantly worse than that of [4].

One completely unmentioned and very worrisome aspect of this paper is that the NYU experiments seem to use a Kinect depth map as input. I assume this is true just because the output depth maps look too good to have been produced blind, by either this paper or by SIRFS. How is the Kinect being used? And what is being done in the PASCAL experiments, in which there is no input depth? What sort of shape reconstructions are produced on those images? Some discussion of this is very necessary.

I'm confused by the comparison to [10]. The improvement of this paper with respect to [10] seems somewhat small. I appreciate the authors' honesty in submitting a comparison to what is presumably another work of theirs, but why was this done? Are components of this model shared between these two works? If so, what are they? And does [10] use extra "verbal" information, as the attached paper suggests? Thankfully, the comparison against [12] seems correct, and this paper does improve the results by a medium-sized amount, which is enough to convince me that this technique works.

My greatest concern is that the improvement of this paper over [12] may be entirely due to the use of the Kinect depth input. It's possible that SIRFS is serving to do nothing except indirectly inform the attribute/object CRF components of the shape of the scene, which may be the only driving force behind the improved numbers. To verify that this is not the case, a simple baseline in which the DenseCRF also has access to the Kinect depth map (in some simple featurized form, of which several now exist) should also be presented.

Overall, I think this paper deserves credit for attempting an extremely ambitious task. The qualitative improvement (looking at the intrinsic output) is not very large, and the quantitative semantic improvement looks okay, but not very significant. I would really like my concerns regarding evaluation and the Kinect input to be addressed, as I feel like this will completely determine how I feel about the true contribution of this paper.
Q2: Please summarize your review in 1-2 sentences
A very cool, ambitious paper about trying to solve all of computer vision. However, it has some potentially serious issues regarding evaluation and clarity with respect to experiments and past work.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a model which estimates both the intrinsic properties of images and object and attribute labelings. The intrinsic image estimation component comes from the SIRFS work of Barron et al., and the object/attribute labeling component comes from [10]. A probabilistic framework is proposed, and a dual-decomposition strategy is used during function minimization. Experiments are shown on both the NYU RGB-D and Pascal datasets, showing an improvement over several baselines.

Despite having many mathematical details, the paper is easy to read, well-motivated, and will likely excite the object recognition community. Showing that intrinsic image decomposition helps object class labeling is one of the strong points of the paper. While many of the ideas in this paper are not new, the combination of two separate lines of research makes this paper exciting to read.
Q2: Please summarize your review in 1-2 sentences
This paper combines two popular research avenues in computer vision, namely intrinsic image estimation and object/attribute labeling. The paper shows promising results, and introduces an easy-to-follow probabilistic framework using the language of modern machine learning.

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors propose a technique for jointly estimating object class labels, attribute labels, and the intrinsic properties of an image, such as depth, illumination and reflectance. The model builds heavily on two previous works: [2] for estimating intrinsic properties, and [10] for estimating object and attribute labels. The proposed model extends [10] with new terms that accommodate intrinsic properties, which are modelled as in [2]. Hence, it is a cross-over between the models in [2] and [10]. The authors' main motivation is to exploit synergistic effects that might arise between all these properties (e.g. estimating depth might help object labelling).


Quality and clarity:
the paper is rather hard to read because:
(a) it is very dense, with lots of definitions and equations following in close succession at a varying level of detail (e.g. some acronyms are not even spelled out, such as SIRFS; the learning paragraph at the end of sec. 4 is very vague);
(b) it lacks figures to illustrate the presented concepts;
(c) it is not self-contained, as parts of the model are deferred to the supplementary material (e.g. section 2.2).


Originality:
Attempting to perform all the above estimation tasks at the same time is, in itself, rather original. However, the technique is incremental over [2,10].


Significance:
Given that essentially all model components are from [2,10] and that the proposed joint model does not perform much better than [10] on object and attribute labelling, this paper has limited significance.


Experiments:
On the positive side, the authors experiment on two reasonably sized datasets and compare to several previous works, including their base model [10]. On the negative side, however, there are two important points:

- on the object and attribute labelling task, the proposed method performs barely better than [10]. Depending on dataset and performance measure, the 'improvement' is between +0.2% and +1.8%. This is quite weak and I wonder if we really need another paper to explain how this minor effect was obtained (i.e. by also jointly estimating intrinsic properties).

- even more importantly, the authors do not report quantitative evaluation for the other side of the story: estimating intrinsic properties such as depth, illumination and reflectance (e.g. by comparing to [2]). This is a major shortcoming of the experiments, as the main story of the paper is about the synergistic effects of estimating everything at the same time. The paper only evaluates whether intrinsic properties help labelling objects and attributes, but not the other way around.

All in all, the experiments do not convince of the alleged synergy, as the effect is very small in one direction, and not evaluated in the other direction.
Q2: Please summarize your review in 1-2 sentences
Overall this is an ok paper which combines two rather mature lines of research: estimating object and attribute labels [10], and estimating intrinsic properties of the image, such as depth, illumination and reflectance [2]. Unfortunately, novelty is incremental, the paper is hard to read, and, importantly, the experiments do not support the main hope of the paper, i.e. to gain from the synergistic effects of estimating everything at the same time.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We would like to thank the reviewers for their time and valuable comments on our paper, especially R5 and R6 for their positive responses. R5 and R6 acknowledge the novelty in the idea of exploring the synergy between the intrinsic properties and the objects and attributes present in a scene. We address the main concerns of R5 and R7 as follows.

R5: “it's not clear to me how to contrast this model with Barron & Malik's work. Though the authors cite [4] (the scene-based version of the model) they don't seem to have used it. Instead they've made a new "scene" generalization out of what is presumably some past version of SIRFS.”

Ans: Note that we found out about [4] only a few weeks before the NIPS submission deadline (the CVPR camera-ready deadline was just over a month before the NIPS submission deadline), and no code was provided by the authors of [4]. We extended SIRFS [2] to our Scene-SIRFS using a mixture of reflectance and depth maps and a single illumination model. These mixtures of reflectance and depth maps are embedded in a soft segmentation of the scene generated using the approach of Felzenszwalb et al. [21]. We also initialized the Scene-SIRFS model with the Kinect depths on the NYU dataset. We call this model “Barron and Malik [2,4]”; see e.g. Fig. 4. We will make this point clearer in a final version. Finally, since the Scene-SIRFS approach is not limited to a single global illumination model, we could also use a mixture of illumination models as done in [4] to further improve the robustness of our method, but this is a future direction to be explored (see line 431).
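To illustrate what this mixture amounts to (the notation below is for illustration only and simplified with respect to the paper): given soft segment weights w_k(p) from [21], per-segment reflectance and depth maps R_k and Z_k, and one shared illumination L, the scene-level maps are blended as

R(p) = \sum_k w_k(p)\, R_k(p), \qquad Z(p) = \sum_k w_k(p)\, Z_k(p),

and the SIRFS objective of [2] is optimized over the per-segment maps while every segment shares the single illumination L.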

R5: “One completely unmentioned and very worrisome aspect of this paper is that the NYU experiments seem to use a Kinect depth map as input. I assume this is true just because the output depth maps look too good to have been produced blind, by either this paper or by SIRFS. How is the Kinect being used? And what is being done in the PASCAL experiments, in which there is no input depth? What sort of shape reconstructions are produced on those images? Some discussion of this is very necessary.”

Ans: Please see lines 318-319: “we use them to initialize the depth maps Z for our joint estimation approach.” For the Pascal dataset we do not have any initial depth map, so we start with a depth map in which every point is given the same constant depth value. We also apologize that we did not include depth-map results for the Pascal dataset; in a final version we will add the recovered shapes for the results in Fig. 3. The shapes are certainly not as good as on NYU, but in some cases (around 30%) the results look qualitatively acceptable. We will make these points clearer in a final version.

R5: “I'm confused by the comparison to [10]…Are components of this model shared between these two works? If so, what are they? And does [10] use extra "verbal" information, as the attached paper suggests? Thankfully, the comparison against [12] seems correct, and this paper does improve the results by a medium-sized amount, which is enough to convince me that this technique works.”

Ans: Both [10] and [12] predict per-pixel object and attribute labels using a mean-field approach; however, [10] additionally contains object-attribute and attribute-attribute correlation terms. The use of such correlation terms helped [10] improve object and attribute prediction (see Sec. 2.2), which is why we use it as our baseline. We use all parts of [10] apart from the extra “verbal” information. Note that our approach is completely automatic.
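Schematically (the notation here is simplified for illustration), both baselines run mean-field inference on a dense CRF over per-pixel object labels o_i and attribute labels a_i with unary and pairwise potentials,

E(\mathbf{o}, \mathbf{a}) = \sum_i \psi^{O}_i(o_i) + \sum_i \psi^{A}_i(a_i) + \sum_{i<j} \psi^{OO}_{ij}(o_i, o_j) + \sum_{i<j} \psi^{AA}_{ij}(a_i, a_j),

while [10] additionally includes per-pixel object-attribute potentials \psi^{OA}_i(o_i, a_i) and potentials coupling pairs of attributes at the same pixel; it is these correlation terms that account for the improvement of [10] over [12].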

R5: “My greatest concern is that the improvement of this paper over [12] may be entirely due to the use of the Kinect depth input…A simple baseline in which the DenseCRF also has access to the Kinect depth map (in some simple featurized form, of which several now exist) should also be presented.”

Ans: First, note that we also get an improvement over [12] on the Pascal dataset, where no depth map is used. Secondly, thank you for the suggestion; we will add this experiment to a final version of this work. We suspect the following would happen: besides the RGB-based appearance features, we would add geometric features based on the depth maps, and these trained models would be input to both [12] and our model. Thus, the extra features would likely boost the performance of both [12] and our model.
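As a minimal sketch of the kind of geometric features we have in mind for this baseline (the function and parameter values below are illustrative only, assuming a Kinect-style metric depth map and a pinhole camera):

import numpy as np

def depth_features(Z, fy=525.0):
    # Illustrative sketch only; the exact featurization to be used may differ.
    # Z: H x W depth map in metres. Returns an H x W x 5 feature map with
    # absolute depth, a height-like camera-space Y coordinate, and an
    # approximate surface normal (3 components) from depth gradients.
    H, W = Z.shape
    v = np.arange(H)[:, None]            # pixel row index
    Y = (v - H / 2.0) * Z / fy           # back-projection, principal point at the image centre
    dZdx = np.gradient(Z, axis=1)
    dZdy = np.gradient(Z, axis=0)
    normals = np.dstack([-dZdx, -dZdy, np.ones_like(Z)])
    normals /= np.linalg.norm(normals, axis=2, keepdims=True) + 1e-8
    return np.dstack([Z, Y, normals])

Per-pixel features of this kind would simply be appended to the existing RGB-based appearance features before training the unary classifiers of both [12] and our model.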

R7 is concerned about the quality and clarity of the text.

Ans: To improve clarity in a final version, we plan to include a table of definitions and notation. Note that R6 says: “Despite having many mathematical details, the paper is easy to read, well-motivated”.

R7: “the authors do not report quantitative evaluation for the other side of the story: estimating intrinsic properties such as depth, illumination and reflectance”

Ans: We are aware of this concern and tried our best. We could not produce quantitative results since there is no ground truth available for reflectance, depth and illumination on these datasets (see lines 302-304). However, we have qualitatively compared these properties against the models of Retinex [17], Gehler et al. [5] and our extended Scene-SIRFS model based on Barron and Malik [2]. Note that [2] reported results on the MIT Intrinsic Images dataset, which does provide ground truth for the intrinsic properties; however, it contains only around 20 images for 16 object classes, which is not enough to train the generic object and attribute models.

R7: “-on the object and attribute labelling task, the proposed method performs barely better than [10].”

Ans: Since the early days of computer vision, people have speculated about the synergy between intrinsic image properties and object recognition; however, few papers have addressed this topic. We believe that now is the right time to do so. Note that R7 says that both lines of research are rather mature. We see our work as a first step towards this grand goal rather than the final step.