Reviews: Deep Structured Prediction for Facial Landmark Detection

Reviewer 1

The integration of convnets with conditional random fields to model the structural dependencies of facial landmarks during face alignment is a nice contribution. Previously proposed methods in this direction were hybrid systems (e.g., the OpenFace versions) and not fully integrated. The authors evaluate on multiple datasets (300W, 300W-Video, Menpo and COFW-68) and compare results with other methods. Both intra- and cross-dataset performance are provided. The writing is good and easy to follow. I have a few concerns, mainly about the evaluation protocol:

1) The authors compare to methods with available code, and they re-ran that code on the benchmark datasets. The reported results (in Table 1) are in some cases lower than the ones reported in the original papers (see the results for FAN [5]). This leaves the reader wondering what the source of this discrepancy is. Were different parameters used during the reproduction? This needs more discussion, and it would be important to state the original performance scores.

2) The evaluation metric was taken from [5], which is a modification of the original 300W benchmark metric. The original metric from the 300W dataset normalizes landmark errors by the inter-ocular distance and is widely accepted and used in the computational face community. The study of [5] changed the normalization to the bounding box size, and this variant was adopted in only a few papers. The two normalizations are not compatible, so the scores are not comparable to those of methods that use the original protocol. See for example (Dong et al., 2018) for a list of recent SotA methods. Without scores reported using the original metric, it is difficult to judge how the proposed technique compares to the state of the art. Also, among the compared methods there is only one recent study, from 2017; the others are techniques that are 4-5 years old.

3) It is not clear where the 3D model used in the study comes from. Eq. 4 refers to a linear 3D subspace (\Phi) that provides the rigid (S, R) and non-rigid (q) parameters. Was this subspace learned from a different 3D dataset or inferred from 2D using structure from motion?

Dong, Xuanyi, et al. "Style aggregated network for facial landmark detection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
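To make the incompatibility raised in concern 2 concrete, here is a minimal sketch of the normalized mean error under both normalizations. Several details are assumptions and not taken from the paper: the function names are hypothetical, the inter-ocular distance is taken between the outer eye corners of the 68-point markup (0-based indices 36 and 45), the bounding-box normalizer is taken as sqrt(width * height) of the tight ground-truth box, and the landmarks are synthetic.

```python
import numpy as np

def nme(pred, gt, norm):
    """Mean point-to-point landmark error divided by a normalizing length."""
    per_point = np.linalg.norm(pred - gt, axis=1)
    return per_point.mean() / norm

def interocular_distance(gt, left_outer=36, right_outer=45):
    # Outer eye corners in the 68-point markup (0-based indices; an assumption).
    return np.linalg.norm(gt[left_outer] - gt[right_outer])

def bbox_size(gt):
    # sqrt(width * height) of the tight ground-truth box (assumed form of the
    # bounding-box normalization).
    w, h = gt.max(axis=0) - gt.min(axis=0)
    return np.sqrt(w * h)

# Synthetic example: 68 random ground-truth points, every prediction shifted
# by (3, 4) pixels, so each point-to-point error is exactly 5 pixels.
rng = np.random.default_rng(0)
gt = rng.uniform(0.0, 200.0, size=(68, 2))
pred = gt + np.array([3.0, 4.0])

nme_iod = nme(pred, gt, interocular_distance(gt))
nme_box = nme(pred, gt, bbox_size(gt))
print(f"NME (inter-ocular): {nme_iod:.4f}")
print(f"NME (bounding box): {nme_box:.4f}")
```

The same prediction yields two different scores, and the ratio between them depends on each face's geometry, which is why numbers obtained under one protocol cannot simply be rescaled into the other.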

Reviewer 2

Originality: The paper proposes a new method that combines a CNN with a fully-connected CRF, which is completely learned and inferred, for structured prediction of facial landmarks. Its novelty is incremental, as variants of CRFs have been widely used and combined with CNNs previously, including for landmark prediction, albeit using different formulations than the one proposed.

Quality: The authors show improvements over several existing methods and compare on 4 datasets, where their method performs the best. However, they omit comparisons against many of the latest state-of-the-art methods. I would like to see how their method compares against these.

Clarity: The paper is well written and clear.

Significance: The paper is of incremental value in a niche application area.

Reviewer 3

The paper presents an approach for facial landmark detection/alignment based on a combination of a Convolutional Neural Network and a Conditional Random Field, to impose structure on the predicted landmarks.

Strengths:
1. The work is novel.
2. The paper is well presented and clear.
3. Results are very promising.

Weaknesses:
1. Unclear experimental methodology. The paper states that 300W-LP is used to train the model, but later claims that the same procedure is used as for the baselines. Most baselines do not use the 300W-LP dataset in their training. Is 300W-LP used in all experiments or just some? If it is used in all of them, this would give the proposed method an unfair advantage.
2. Missing link to similar work on Continuous Conditional Random Fields [Ristovski 2013] and Continuous Conditional Neural Fields [Baltrusaitis 2014], which have a similar CRF structure and the ability to perform exact inference.
3. What is Gaussian NLL? It seems to come out of nowhere and is not mentioned anywhere in the paper besides the ablation study.

Trivia: Consider replacing "difference mean" with "expected difference" between two landmarks (I believe it would be clearer).
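Regarding weakness 3, "Gaussian NLL" conventionally denotes the negative log-likelihood of a target under a Gaussian distribution. Whether the ablation study uses exactly this form is an assumption, since the paper does not define the term. A minimal sketch of the standard multivariate definition, with a hypothetical function name:

```python
import numpy as np

def gaussian_nll(y, mu, cov):
    """Negative log-likelihood of y under N(mu, cov), standard textbook form.

    This is the generic definition, not necessarily the loss used in the paper.
    """
    d = y.shape[0]
    diff = y - mu
    # slogdet returns (sign, log|det|); it is more stable than log(det(cov)).
    _, logdet = np.linalg.slogdet(cov)
    # Mahalanobis term, computed with a linear solve instead of an explicit inverse.
    maha = diff @ np.linalg.solve(cov, diff)
    return 0.5 * (maha + logdet + d * np.log(2.0 * np.pi))

# With a perfect prediction and identity covariance in 2D, only the
# normalization constant remains: 0.5 * 2 * log(2*pi) = log(2*pi).
value = gaussian_nll(np.zeros(2), np.zeros(2), np.eye(2))
print(value)
```

Stating the loss in this explicit form in the paper, together with how the covariance is parameterized, would resolve the ambiguity.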

Paper ID:	1429
Title:	Deep Structured Prediction for Facial Landmark Detection
