Paper ID: 102
Title: Where are they looking?
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors propose a vision method to estimate where within an image a pictured person is looking.

Given the image and the cropped-out head, the system returns a saliency map of confidence values over grid cells, indicating how likely each position is to be the target of that person's gaze.

The technique uses a CNN with two pathways, one for the head/gaze and one for the full image/saliency of the scene.
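
For concreteness, here is a minimal sketch (PyTorch) of how such a two-pathway network might be wired; the layer sizes, the multiplicative fusion, and all names are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TwoPathwayGazeNet(nn.Module):
    """Illustrative two-pathway gaze-following net (not the paper's exact model)."""
    def __init__(self, grid=13):
        super().__init__()
        # Saliency pathway: full image -> coarse spatial saliency map.
        self.saliency = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
            nn.AdaptiveAvgPool2d(grid),
        )
        # Gaze pathway: cropped head image + head position -> spatial gaze mask.
        self.head_enc = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.gaze_mask = nn.Sequential(
            nn.Linear(32 + 2, grid * grid), nn.Sigmoid(),
        )
        # Final classifier over grid cells (a single grid here; the paper's
        # shifted grids would add several such heads at shifted offsets).
        self.classifier = nn.Linear(grid * grid, grid * grid)

    def forward(self, image, head_crop, head_xy):
        sal = self.saliency(image).flatten(1)                 # (B, grid*grid)
        head = self.head_enc(head_crop)                       # (B, 32)
        mask = self.gaze_mask(torch.cat([head, head_xy], 1))  # (B, grid*grid)
        fused = sal * mask                    # element-wise fusion (an assumption)
        return self.classifier(fused)         # logits over gaze grid cells
```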

The authors also contribute a dataset, pulling relevant images from SUN/COCO/Actions40/PASCAL/ImageNet and having them annotated for gaze points by Mechanical Turk workers.

This dataset contains in total 36K people.

The method is compared to a few reasonable baselines that represent alternative approaches one might implement and sanity checks.

(To my knowledge existing methods in this space are more restrictive than the proposed method, so they are not easily comparable without taking subsets of the test data that satisfy those methods' requirements.)

Quantitative results show promise, and the qualitative analysis and figures are interesting and well composed.

The paper tackles a very practical and yet under-explored problem.

The solution is simple but effective.

It has low technical novelty; it is largely a straightforward application of CNNs with simple input descriptors for the head and scene.

There is also the detail of the shifted grids, which was important to the results.

Nonetheless, the dataset collection effort and systematic results (including reasonable baselines) make it valuable.

Others are likely to build upon this work.

Paper clarity is very good.

It is an enjoyable paper to read with many good illustrative figures.

Unclear points/ questions:

*More detail could be given about the annotation process and instructions etc.

For example, how were the annotators told to decide on a gaze point?

Is it literally the point of fixation as they perceive it from that person's line of sight?

Or the center of the object they are fixating on?

What rules were used to decide what constitutes a poor annotation for discarding?

*How many images are in the labeled dataset, after pruning?

There are 36K person instances.

*It is good that the results compare against human agreement on the test set, where 10 people labeled each test image.

Still, why not get multiple gaze annotations for the training data as well, to find a consensus for the ground truth?

* In Sec 3.3 this sentence is unclear "Since we only supervise...subproblems."

*The text could explain why a uniform distribution of gazes was sought in the test set.

Presumably this is to avoid bias.

*The SVM baseline description (Lines 333-335) is not clear.

Related work:

*This sentence about related work is unclear in the intro:

"Only [17] tackles the unrestricted..."

What is meant by "pre-built components" and why can't it handle people looking away from the camera?

If it only handles people looking at the camera, then how is it doing gaze estimation at all?

*The paper notes that [7] uses an eye tracker for gaze in egocentric video.

The same authors have a more recent paper that removes the need for the gaze tracker: Learning to Predict Gaze in Egocentric Video, Yin Li, Alireza Fathi, James M. Rehg, ICCV 2013

*The proposed work seems also related to the interactee prediction work of Chen & Grauman (ACCV 2014).

It too tries to produce a "saliency map from the point of view of the person inside the picture" (Line 75), uses similar features (but not CNNs), and produces a multi-modal distribution as its estimate (but using a mixture density neural network instead of a classifier).

The proposed work is still distinct, mostly because it cares strictly about gaze and treats the interacting/gazed upon objects only implicitly, but I think that work can be cited and the differences explained.

Perhaps the insight of the ACCV work about representing the scale and position of the gazed upon object would be relevant to improve this system as well.

Predicting the Location of "Interactees" in Novel Human-Object Interactions.

C-Y. Chen and K. Grauman.

In Proceedings of the Asian Conference on Computer Vision (ACCV), Singapore, Nov 2014.

Results:

*The results would be more well-rounded with the inclusion of failure cases and a discussion of failure modes.

*I can guess why the Places-CNN and Imagenet-CNN were used for the saliency and gaze components, respectively.

But this motivation could be stated in the text (Sec 3.3).

I am also curious whether it matters in practice to use scenes/objects for the two, or whether results would be similar if, e.g., both were initialized with ImageNet or Places.

* A possible enhanced baseline over Fixed Bias: take a subset of training images for which there is a similar distribution of heads.

And how is "same head location" determined for this baseline?

Typo: Line 353 " the in the"
Q2: Please summarize your review in 1-2 sentences
The paper tackles the very practical and yet under-explored problem of estimating where people are looking in images.

The solution is simple but effective.

It has low technical novelty, yet the dataset collection effort and systematic results (including reasonable baselines) make it valuable.

Others are likely to build upon this work.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Missing citations:

Hierarchical temporal graphical model for head pose estimation and subsequent attribute classification in real-world videos, Meltem Demirkus, Doina Precup, James J. Clark, Tal Arbel
Q2: Please summarize your review in 1-2 sentences
This paper argues that gaze prediction is understudied, proposes a new dataset, and builds a neural network for gaze prediction.

Head pose estimation is a very well studied related area that seems to be ignored a bit by the paper.

Results compare a number of baselines to [12], which is from 2009.

Related approaches include [15] (and citations therein), which actually performs a more complicated task of which gaze prediction is a subtask.

See missing citations below (though this is a "light review")

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper deals with the problem of estimating gaze directions of people in a given image, and proposes a method based on convolutional networks and saliency. The problem has already been addressed in several previous studies, such as [Fathi+ ECCV12], [Borji+ JoV14] and [Parks+ VS2014]; however, this paper tries to handle more general cases with fewer constraints.

The architecture of the proposed model follows a de facto standard pipeline combining Caffe CNNs. It is technically sound, but no technical contributions can be found in the model.

I am afraid that the use of saliency prediction for gaze following might not be appropriate, since saliency describes visual distinctiveness from the viewpoint of observers (i.e., viewers of the image), not of the target person in the image. Namely, saliency computed from a given image can estimate where observers may focus, but it cannot estimate where people in the image focus.

Q2: Please summarize your review in 1-2 sentences
The problem dealt with in this paper is interesting and significant, but the novelty of the proposed method is limited, and the method might contain a critical problem.

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Using images containing people "attending to" regions (i.e., objects, other people), the authors propose a CNN-based method combining the image saliency map (pixels), a person's head appearance (pixels), and the head location in the image to predict the gaze "attentional" region in an image.

Quality: The overall quality of the paper is good. The authors build an architecture of deep convolutional neural networks (CNNs), which learns features describing image saliency, face gaze, and face location in parallel. In a forward pass, the CNN then predicts the location of the gaze fixation. Technically, the algorithm is straightforward. The authors train the whole model at once using backpropagation. They then compare their predictions with other baselines, achieving the best results.

Originality: The concept of the work is original. Nevertheless, fully supervised location "detectors" using combinations of CNNs are not really novel.

The pioneering work of Ross B. Girshick (R-CNN) and others embodies this concept.

Clarity: The paper is well written and it is easy to follow. To improve the quality, the authors could relax the use of the word "predict" (e.g., line 25 and others). Prediction is used for generative models (e.g., LSTMs). This paper is fully supervised and is analogous to object detection papers, so "detection" should be used instead.

Significance: The problem is very interesting and important. Good gaze-following algorithms can help in other tasks such as fine-grained object detection. The authors will release to the community the collected and curated dataset of face-gaze fixations.
Q2: Please summarize your review in 1-2 sentences
The paper proposes a method to predict and follow the gaze of people in unconstrained images. The framework combines image saliency and human gaze to learn the predicted location of the person's "attention". The learning is supervised and performed end-to-end. The results outperform baseline methods.


Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
[this is a light review]

minor: supplemental: line 102, red and yellow have to be exchanged.
Q2: Please summarize your review in 1-2 sentences
The clearly written paper contributes a large-scale dataset and a novel model, which combines gaze direction and saliency in a deep learning framework. The approach is convincingly evaluated using different ablations and related work.

== post rebuttal ==

I keep my judgment.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper presents a framework for predicting where people in images are looking. This is a novel twist on the established problem of saliency / eye-gaze prediction in images. The authors collect and annotate a new dataset containing a set of images as well as annotations of the gaze of each person in an image. The eye-gaze model is then learnt using convolutional networks.

At test time, gaze is predicted by fusing the results of fixation prediction (similar to what is done in regular saliency prediction) and the actual gaze-following prediction. Experimental results demonstrate that this approach provides superior results compared to regular fixation-only gaze prediction.

Quality: The paper is interesting and theoretically sound, but a few things could be improved:

1) The learning part should be better defined (mathematical formulation or at least a diagram).

2) This paper should also compare results with approaches where gaze is learnt not just from free-viewing a scene, but when people are given a task while viewing a scene, e.g., S. Mathe and C. Sminchisescu, Action from Still Image Dataset and Inverse Optimal Control to Learn Task Specific Visual Scanpaths, NIPS, 2013.

3) It would be interesting to analyze the overlap between gaze fixations and gaze following.

4) This approach requires the location of the human head for gaze prediction. It would be interesting to see results with off-the-shelf head detection.

Clarity: Paper is overall well written.

Originality: Novelty is incremental.

Significance: The paper covers an interesting aspect of predicting where people are looking, and thus has large potential in various computer vision applications that require a better understanding of human-object and human-human interactions.

Q2: Please summarize your review in 1-2 sentences
The paper presents a novel approach for gaze-following prediction in images. Results are promising, but it would be interesting to see it working with an off-the-shelf head detector, as well as to understand the correlation between gaze following and gaze fixations.

Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their careful consideration. We greatly appreciate the positive comments and address major concerns below.

R4 Missing citations
Note that we do cite the work by Parks et al. as [17] and discuss it in L090-100. We further clarify the distinctions below. We will include the work by Demirkus et al. in the next version and discuss head pose estimation below.

R4 Head pose estimation (HPE)
While we agree that HPE is an important and well-studied problem that definitely deserves attention here, we would like to highlight certain subtle but important distinctions. Using HPE, we can estimate the angle of gaze, but image evidence is required to accurately estimate the magnitude of the gaze vector to identify exactly where one is looking. To better illustrate this, we highlight our result in Tbl 1 - the comparison between (1) 'No image' and (2) 'Our Full'. (1) uses head location and pose to infer gaze without using any image evidence (unlike (2)) - the angular error is very similar to (2) while the distance error is markedly higher resulting in a lower AUC (0.78 vs 0.83). Thus, while head pose is certainly important, it is not sufficient.
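
To make the distinction concrete (notation ours, for illustration only): writing the gaze point as g = p + r * (cos θ, sin θ), where p is the head position, head pose estimation constrains the direction θ but not the length r, and it is r that requires image evidence - hence the similar angular error but larger distance error above.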

R4, R1 Comparison to [17]
Firstly, note that the goal of [17] is not gaze following, but saliency prediction using gaze following. Thus, while they propose a gaze following system, they do not evaluate it directly. More importantly, the automatic system from [17] relies on the 'face' detector by Zhu & Ramanan which is not designed for 'head' detection and hence does not perform well on people facing away from the camera. Further, [17] does not use image evidence to identify gaze - they use a prior that combines HPE with its scale. This is similar to our 'No image' method explained above. While not being explicitly trained to use the face scale, we found that our 'No image' model was sensitive to the scale of the face (i.e., higher gaze length for larger faces). Lastly, the code and dataset from [17] are not available making a direct comparison difficult.

R2 Meaning of saliency - "critical problem"
We completely agree with you - 'saliency' typically refers to what observers of an image fixate on (as in [12]). In our case, we re-define saliency to be from the perspective of the person in the image e.g., in an image with a person looking at a car, typical saliency would likely result in the person's face being salient, while our re-definition would result in the car being salient. We highlight this distinction in Fig 4b. We apologize for overloading this commonly used term and will update it to 'gaze-saliency' to avoid further confusion.

R3 Head detector
We did not use a head detector as the primary focus of our work is gaze following. We did not want to confound the results by combining two imperfect solutions. Given the convolutional nature of our approach, we do not expect somewhat imprecise head detections to lead to a significant performance drop.

R3 Learning better defined
We will add details of the mathematical formulation, such as the loss equation.

R1 Annotation process
We told the turkers to make their 'best guess for where a person is looking' instead of giving more explicit instructions (related to objects, etc) to avoid biasing them in any particular way. To test them, we selected images where the gaze location was obvious and allowed for reasonable margins of error. This prevented random clicking.

R1 Image count
27,967 after pruning (Fig 2, right col)

R1 Failure cases
Fig 5 (images 3&4, 1&4 and 4 in rows 1, 2, 3) shows some failures. We believe the main failure mode is from the lack of 3D understanding e.g., image 1 in row 2, Fig 5: the farthest person from the camera is predicted to be looking at a stove that is behind him. We will add more discussion of failures to the paper.

R1 Both paths with ImageNet or Places
If both pathways are initialized with ImageNet, the AUC is 82.2; if both are initialized with Places, it is 81.5, as compared to 83.0 with our approach.

R1 Unclear sentence, Sec 3.3
We show that the saliency pathway is computing a gaze-following saliency map and the gaze pathway is computing the head-pose map for the target person. However, we did not explicitly enforce that each of these pathways solve the specific subproblems. Instead, we only supervised the final output of the network (gaze location) and the network naturally learned to factorize the problem in this way.
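
As a minimal illustration of this point, the following sketch assumes a model that maps (image, head crop, head position) to logits over gaze grid cells; the single grid and cross-entropy loss are assumptions for illustration, not the exact shifted-grid formulation in the paper:

```python
import torch.nn.functional as F

def training_step(model, image, head_crop, head_xy, gaze_cell_idx):
    # The only supervision is the index of the grid cell containing the
    # annotated gaze point; neither pathway gets an intermediate target.
    logits = model(image, head_crop, head_xy)      # (B, num_cells)
    loss = F.cross_entropy(logits, gaze_cell_idx)
    loss.backward()
    return loss
```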

R1 Uniform test distribution
This distribution was used to avoid bias in the evaluation

R1 clarify L333-335
In the SVM baseline, the input elements are the same as our final model, but instead of a CNN, we used an SVM classifier. Shifted grids are also used to ensure a fair comparison.
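
For illustration, a rough sketch of such a baseline (scikit-learn), assuming precomputed feature vectors and a single grid; the actual features, grid size, and shifted-grid averaging are not reproduced here:

```python
import numpy as np
from sklearn.svm import LinearSVC

GRID = 5  # a single grid; the shifted grids would repeat this at offsets

def cell_index(gaze_xy, grid=GRID):
    # Map a gaze point normalized to [0, 1]^2 to a grid-cell label.
    col = min(int(gaze_xy[0] * grid), grid - 1)
    row = min(int(gaze_xy[1] * grid), grid - 1)
    return row * grid + col

# X: (N, D) feature vectors combining image, head-crop, and head-position
# descriptors; y: grid-cell label of the annotated gaze point.
X = np.random.rand(200, 64)                                   # placeholder features
y = np.array([cell_index(p) for p in np.random.rand(200, 2)])

clf = LinearSVC().fit(X, y)     # multi-class linear SVM over grid cells
pred_cell = clf.predict(X[:1])  # predicted gaze grid cell for one example
```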

R1 Fixed bias baseline
Similar to your suggestion, we use a 13x13 grid for head location

We apologize for not clarifying all questions given the limited space and many reviews. We will fix all typos and add missing references in the next revision.