Paper ID: 102
Title: Where are they looking?
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors propose a vision method to estimate where within an image a pictured person is looking.

Given the image and the cropped-out head, the system returns a saliency map of confidence values over grid cells, indicating how likely each position is to be the target of that person's gaze.

The technique uses a CNN with two pathways, one for the head/gaze and one for the full image/saliency of the scene.
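
For concreteness, here is a minimal sketch (PyTorch) of how such a two-pathway network might be wired; the layer sizes, the multiplicative fusion, and all names are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TwoPathwayGazeNet(nn.Module):
    """Illustrative two-pathway gaze-following net (not the paper's exact model)."""
    def __init__(self, grid=13):
        super().__init__()
        # Saliency pathway: full image -> coarse spatial saliency map.
        self.saliency = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
            nn.AdaptiveAvgPool2d(grid),
        )
        # Gaze pathway: cropped head image + head position -> spatial gaze mask.
        self.head_enc = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.gaze_mask = nn.Sequential(
            nn.Linear(32 + 2, grid * grid), nn.Sigmoid(),
        )
        # Final classifier over grid cells (a single grid here; the paper's
        # shifted grids would add several such heads at shifted offsets).
        self.classifier = nn.Linear(grid * grid, grid * grid)

    def forward(self, image, head_crop, head_xy):
        sal = self.saliency(image).flatten(1)                 # (B, grid*grid)
        head = self.head_enc(head_crop)                       # (B, 32)
        mask = self.gaze_mask(torch.cat([head, head_xy], 1))  # (B, grid*grid)
        fused = sal * mask                    # element-wise fusion (an assumption)
        return self.classifier(fused)         # logits over gaze grid cells
```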

The authors also contribute a dataset, pulling relevant images from SUN/COCO/Actions40/PASCAL/ImageNet and having them annotated for gaze points by Mechanical Turk workers.

This dataset contains in total 36K people.

The method is compared to a few reasonable baselines that represent alternative approaches one might implement and sanity checks.

(To my knowledge existing methods in this space are more restrictive than the proposed method, so they are not easily comparable without taking subsets of the test data that satisfy those methods' requirements.)

Quantitative results show promise, and the qualitative analysis and figures are interesting and well composed.

The paper tackles a very practical and yet under-explored problem.

The solution is simple but effective.

It has low technical novelty; it is largely a straightforward application of CNNs with simple input descriptors for the head and scene.

There is also the detail of the shifted grids, which was important to the results.

Nonetheless, the dataset collection effort and systematic results (including reasonable baselines) make it valuable.

Others are likely to build upon this work.

Paper clarity is very good.

It is an enjoyable paper to read with many good illustrative figures.

Unclear points/ questions:

*More detail could be given about the annotation process and instructions etc.

For example, how were the annotators told to decide on a gaze point?

Is it literally the point of fixation as they perceive it from that person's line of sight?

Or the center of the object they are fixating on?

What rules were used to decide what constitutes a poor annotation for discarding?

*How many images are in the labeled dataset, after pruning?

There are 36K person instances.

*It is good that the results compare against human agreement on the test set, where 10 people labeled each test image.

Still, why not get multiple gaze annotations for the training data as well, to find a consensus for the ground truth?

* In Sec 3.3 this sentence is unclear "Since we only supervise...subproblems."

*The text could explain why a uniform distribution of gazes was sought in the test set.

Presumably this is to avoid bias.

*The SVM baseline description (Lines 333-335) is not clear.

Related work:

*This sentence about related work is unclear in the intro:

"Only [17] tackles the unrestricted..."

What is meant by "pre-built components" and why can't it handle people looking away from the camera?

If it only handles people looking at the camera, then how is it doing gaze estimation at all?

*The paper notes that [7] uses an eye tracker for gaze in egocentric video.

The same authors have a more recent paper that removes the need for the gaze tracker: Learning to Predict Gaze in Egocentric Video, Yin Li, Alireza Fathi, James M. Rehg, ICCV 2013

*The proposed work seems also related to the interactee prediction work of Chen & Grauman (ACCV 2014).

It too tries to produce a "saliency map from the point of view of the person inside the picture" (Line 75), uses similar features (but not CNNs), and produces a multi-modal distribution as its estimate (but using a mixture density neural network instead of a classifier).

The proposed work is still distinct, mostly because it cares strictly about gaze and treats the interacting/gazed upon objects only implicitly, but I think that work can be cited and the differences explained.

Perhaps the insight of the ACCV work about representing the scale and position of the gazed upon object would be relevant to improve this system as well.

Predicting the Location of "Interactees" in Novel Human-Object Interactions.

C-Y. Chen and K. Grauman.

In Proceedings of the Asian Conference on Computer Vision (ACCV), Singapore, Nov 2014.

Results:

*The results would be more well-rounded with the inclusion of failure cases and a discussion of failure modes.

*I can guess why the Places-CNN and Imagenet-CNN were used for the saliency and gaze components, respectively.

But this motivation could be stated in the text (Sec 3.3).

I am also curious whether it matters in practice to use scenes/objects for the two, or whether results would be similar if, e.g., both were initialized with ImageNet or Places.

* A possible enhanced baseline over Fixed Bias: take a subset of training images for which there is a similar distribution of heads.

And how is "same head location" determined for this baseline?

Typo: Line 353 " the in the"
Q2: Please summarize your review in 1-2 sentences
The paper tackles the very practical and yet under-explored problem of estimating where people are looking in images.

The solution is simple but effective.

It has low technical novelty, yet the dataset collection effort and systematic results (including reasonable baselines) make it valuable.

Others are likely to build upon this work.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Missing citations:

Hierarchical temporal graphical model for head pose estimation and subsequent attribute classification in real-world videos, Meltem Demirkus, Doina Precup, James J. Clark, Tal Arbel
Q2: Please summarize your review in 1-2 sentences
This paper argues that gaze prediction is understudied, proposes a new dataset, and builds a neural network for gaze prediction.

Head pose estimation is a very well studied related area that seems to be ignored a bit by the paper.

Results compare a number of baselines to [12], which is from 2009.

Related approaches include [15] (and citations therein), which actually performs a more complicated task of which gaze prediction is a subtask.

See missing citations below (though this is a "light review")

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper deals with the problem of estimating gaze directions of people in a given image, and proposes a method based on convolutional networks and saliency. The problem has already been addressed in several previous studies, such as [Fathi+ ECCV12], [Borji+ JoV14] and [Parks+ VS2014]; however, this paper tries to handle more general cases with fewer constraints.

The architecture of the proposed model follows a de facto standard pipeline combining Caffe CNNs. It is technically sound, but no technical contributions can be found in the model.

I am afraid that the use of saliency prediction for gaze following might not be appropriate, since saliency describes visual distinctiveness from the viewpoint of observers (i.e., viewers of the image), not of the target person in the image. Namely, saliency computed from a given image can estimate where observers may focus, but it cannot estimate where people in the image focus.

Q2: Please summarize your review in 1-2 sentences
The problem dealt with in this paper is interesting and significant, but the novelty of the proposed method is limited, and the method might contain a critical problem.

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Using images containing people "attending to" regions (i.e., objects, other people), the authors propose a CNN-based method combining the image saliency map (pixels), a person's head appearance (pixels), and the head location in the image to predict the gaze "attentional" region in an image.

Quality: The overall quality of the paper is good. The authors build an architecture of deep convolutional neural networks (CNNs), which learns features describing image saliency, face gaze, and face location in parallel. In a forward pass, the CNN then predicts the location of the gaze fixation. Technically, the algorithm is straightforward. The authors train the whole model at once using backpropagation. They then compare their predictions with other baselines, achieving the best results.

Originality: The concept of the work is original. Nevertheless, fully supervised location "detectors" using combinations of CNNs are not really novel.

The pioneering work of Ross B. Girshick (R-CNN) and others embodies this concept.

Clarity: The paper is well written and it is easy to follow. To improve the quality, the authors could relax the use of the word "predict" (e.g., line 25 and others). Prediction is used for generative models (e.g., LSTMs). This paper is fully supervised and is analogous to object detection papers, so "detection" should be used instead.

Significance: The problem is very interesting and important. Good gaze-following algorithms can help in other tasks such as fine-grained object detection. The authors will release to the community the collected and curated dataset of face-gaze fixations.
Q2: Please summarize your review in 1-2 sentences
The paper proposes a method to predict and follow the gaze of people in unconstrained images. The framework combines image saliency and human gaze to learn the predicted location of the person's "attention". The learning is supervised and performed end-to-end. The results outperform baseline methods.


Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
[this is a light review]

minor: supplemental: line 102, red and yellow have to be exchanged.
Q2: Please summarize your review in 1-2 sentences
The clearly written paper contributes a large-scale dataset and a novel model, which combines gaze direction and saliency in a deep learning framework. The approach is convincingly evaluated using different ablations and related work.

== post rebuttal ==

I keep my judgment.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper presents a framework for predicting where people in images are looking. This is a novel twist on the established problem of saliency / eye-gaze prediction in images. The authors collect and annotate a new dataset containing a set of images as well as annotations of the gaze of each person in an image. The eye-gaze model is then learnt using convolutional networks.

At test time, gaze is predicted by fusing the results of fixation prediction (similar to what is done in regular saliency prediction) and the actual gaze-following prediction. Experimental results demonstrate that this approach provides superior results compared to regular fixation-only gaze prediction.

Quality: The paper is interesting and theoretically sound, but a few things could be improved:

1) The learning part should be better defined (mathematical formulation or at least a diagram).

2) This paper should also compare results with approaches where gaze is learnt not just from free-viewing a scene, but when people are given a task while viewing a scene, e.g., S. Mathe and C. Sminchisescu, Action from Still Image Dataset and Inverse Optimal Control to Learn Task Specific Visual Scanpaths, NIPS, 2013.

3) It would be interesting to analyze the overlap between gaze fixations and gaze following.

4) This approach requires the location of the human head for gaze prediction. It would be interesting to see results with off-the-shelf head detection.

Clarity: Paper is overall well written.

Originality: Novelty is incremental.

Significance: The paper covers an interesting aspect of predicting where people are looking, and thus has large potential in various computer vision applications that require a better understanding of human-object and human-human interactions.

Q2: Please summarize your review in 1-2 sentences
The paper presents a novel approach for gaze-following prediction in images. Results are promising, but it would be interesting to see it working with an off-the-shelf head detector, as well as to understand the correlation between gaze following and gaze fixations.

Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their careful consideration. We greatly appreciate the positive comments and address major concerns below.

R4 Missing citations
Note that we do cite the work by Parks et al. as [17] and discuss it in L090-100. We further clarify the distinctions below. We will include the work by Demirkus et al. in the next version and discuss head pose estimation below.

R4 Head pose estimation (HPE)
While we agree that HPE is an important and well-studied problem that definitely deserves attention here, we would like to highlight certain subtle but important distinctions. Using HPE, we can estimate the angle of gaze, but image evidence is required to accurately estimate the magnitude of the gaze vector to identify exactly where one is looking. To better illustrate this, we highlight our result in Tbl 1 - the comparison between (1) 'No image' and (2) 'Our Full'. (1) uses head location and pose to infer gaze without using any image evidence (unlike (2)) - the angular error is very similar to (2) while the distance error is markedly higher resulting in a lower AUC (0.78 vs 0.83). Thus, while head pose is certainly important, it is not sufficient.
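
To make the distinction concrete (notation ours, for illustration only): writing the gaze point as g = p + r * (cos θ, sin θ), where p is the head position, head pose estimation constrains the direction θ but not the length r, and it is r that requires image evidence - hence the similar angular error but larger distance error above.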

R4, R1 Comparison to [17]
Firstly, note that the goal of [17] is not gaze following, but saliency prediction using gaze following. Thus, while they propose a gaze following system, they do not evaluate it directly. More importantly, the automatic system from [17] relies on the 'face' detector by Zhu & Ramanan which is not designed for 'head' detection and hence does not perform well on people facing away from the camera. Further, [17] does not use image evidence to identify gaze - they use a prior that combines HPE with its scale. This is similar to our 'No image' method explained above. While not being explicitly trained to use the face scale, we found that our 'No image' model was sensitive to the scale of the face (i.e., higher gaze length for larger faces). Lastly, the code and dataset from [17] are not available making a direct comparison difficult.

R2 Meaning of saliency - "critical problem"
We completely agree with you - 'saliency' typically refers to what observers of an image fixate on (as in [12]). In our case, we re-define saliency to be from the perspective of the person in the image e.g., in an image with a person looking at a car, typical saliency would likely result in the person's face being salient, while our re-definition would result in the car being salient. We highlight this distinction in Fig 4b. We apologize for overloading this commonly used term and will update it to 'gaze-saliency' to avoid further confusion.

R3 Head detector
We did not use a head detector as the primary focus of our work is gaze following. We did not want to confound the results by combining two imperfect solutions. Given the convolutional nature of our approach, we do not expect somewhat imprecise head detections to lead to a significant performance drop.

R3 Learning better defined
We will add details of the mathematical formulation, such as the loss equation.

R1 Annotation process
We told the turkers to make their 'best guess for where a person is looking' instead of giving more explicit instructions (related to objects, etc) to avoid biasing them in any particular way. To test them, we selected images where the gaze location was obvious and allowed for reasonable margins of error. This prevented random clicking.

R1 Image count
27,967 after pruning (Fig 2, right col)

R1 Failure cases
Fig 5 (images 3&4, 1&4 and 4 in rows 1, 2, 3) shows some failures. We believe the main failure mode is from the lack of 3D understanding e.g., image 1 in row 2, Fig 5: the farthest person from the camera is predicted to be looking at a stove that is behind him. We will add more discussion of failures to the paper.

R1 Both paths with ImageNet or Places
If both pathways are initialized with ImageNet, the AUC is 82.2; if both are initialized with Places, it is 81.5, as compared to 83.0 with our approach.

R1 Unclear sentence, Sec 3.3
We show that the saliency pathway is computing a gaze-following saliency map and the gaze pathway is computing the head-pose map for the target person. However, we did not explicitly enforce that each of these pathways solve the specific subproblems. Instead, we only supervised the final output of the network (gaze location) and the network naturally learned to factorize the problem in this way.
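
As a minimal illustration of this point, the following sketch assumes a model that maps (image, head crop, head position) to logits over gaze grid cells; the single grid and cross-entropy loss are assumptions for illustration, not the exact shifted-grid formulation in the paper:

```python
import torch.nn.functional as F

def training_step(model, image, head_crop, head_xy, gaze_cell_idx):
    # The only supervision is the index of the grid cell containing the
    # annotated gaze point; neither pathway gets an intermediate target.
    logits = model(image, head_crop, head_xy)      # (B, num_cells)
    loss = F.cross_entropy(logits, gaze_cell_idx)
    loss.backward()
    return loss
```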

R1 Uniform test distribution
This distribution was used to avoid bias in the evaluation

R1 clarify L333-335
In the SVM baseline, the input elements are the same as our final model, but instead of a CNN, we used an SVM classifier. Shifted grids are also used to ensure a fair comparison.
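
For illustration, a rough sketch of such a baseline (scikit-learn), assuming precomputed feature vectors and a single grid; the actual features, grid size, and shifted-grid averaging are not reproduced here:

```python
import numpy as np
from sklearn.svm import LinearSVC

GRID = 5  # a single grid; the shifted grids would repeat this at offsets

def cell_index(gaze_xy, grid=GRID):
    # Map a gaze point normalized to [0, 1]^2 to a grid-cell label.
    col = min(int(gaze_xy[0] * grid), grid - 1)
    row = min(int(gaze_xy[1] * grid), grid - 1)
    return row * grid + col

# X: (N, D) feature vectors combining image, head-crop, and head-position
# descriptors; y: grid-cell label of the annotated gaze point.
X = np.random.rand(200, 64)                                   # placeholder features
y = np.array([cell_index(p) for p in np.random.rand(200, 2)])

clf = LinearSVC().fit(X, y)     # multi-class linear SVM over grid cells
pred_cell = clf.predict(X[:1])  # predicted gaze grid cell for one example
```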

R1 Fixed bias baseline
Similar to your suggestion, we use a 13x13 grid for head location

We apologize for not clarifying all questions given the limited space and many reviews. We will fix all typos and add missing references in the next revision.