{"title": "Deep Structured Prediction for Facial Landmark Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 2450, "page_last": 2460, "abstract": "Existing deep learning based facial landmark detection methods have achieved excellent performance. These methods, however, do not explicitly embed the structural dependencies among landmark points. They hence cannot preserve the geometric relationships between landmark points or generalize well to challenging conditions or unseen data. This paper proposes a method for deep structured facial landmark detection based on combining a deep Convolutional Network with a Conditional Random Field. We demonstrate its superior performance to existing state-of-the-art techniques in facial landmark detection, especially a better generalization ability on challenging datasets that include large pose and occlusion.", "full_text": "Deep Structured Prediction for Facial Landmark\n\nDetection\n\nLisha Chen1, Hui Su1,2, Qiang Ji1\n\n1Rensselaer Polytechnic Institute, 2IBM Research\n\nchenl21@rpi.edu, huisuibmres@us.ibm.com, jiq@rpi.edu\n\nAbstract\n\nExisting deep learning based facial landmark detection methods have achieved\nexcellent performance. These methods, however, do not explicitly embed the\nstructural dependencies among landmark points. They hence cannot preserve the\ngeometric relationships between landmark points or generalize well to challenging\nconditions or unseen data. This paper proposes a method for deep structured\nfacial landmark detection based on combining a deep Convolutional Network\nwith a Conditional Random Field. 
We demonstrate its superior performance over\nexisting state-of-the-art techniques in facial landmark detection, especially its better\ngeneralization ability on challenging datasets that include large pose and occlusion.\n\n1 Introduction\n\nFacial landmark detection aims to automatically localize the fiducial facial landmark points around the facial\ncomponents and facial contour. It is essential for various facial analysis tasks such as facial expression\nanalysis, head pose estimation, and face recognition. With the development of deep learning techniques,\ntraditional facial landmark detection approaches that rely on hand-crafted low-level features have\nbeen outperformed by deep feature based approaches. Purely deep learning based methods,\nhowever, cannot effectively capture the structural dependencies among landmark points. They hence\ncannot perform well under challenging conditions such as large head pose, occlusion, and large\nexpression variation. Probabilistic graphical models, such as Conditional Random Fields (CRFs), have\nbeen widely applied to various computer vision tasks. They can systematically capture the structural\nrelationships among random variables and perform structured prediction. Recently, there have been\nworks that combine deep models with CRFs to simultaneously leverage convolutional neural networks'\n(CNNs) representation power and CRFs' structure modeling power [10, 9, 51]. Their combination has\nyielded significant performance improvements over methods that use either a CNN or a CRF alone. These\nworks have so far mainly been applied to classification tasks such as semantic image segmentation. Besides\nclassification, some works apply the CNN-CRF model to human pose estimation [41, 12, 11] and facial\nlandmark detection [2, 44]. To reduce computational complexity, the CRF models are typically\nof special structure (e.g. tree structure); moreover, they employ approximate learning and inference\ncriteria. 
In this work, we propose to combine a CNN with a fully-connected CRF to jointly perform\nfacial landmark detection in a regression framework.\nCompared to existing works, the contributions of our work are summarized as follows:\n1) We introduce the fully-connected CNN-CRF that produces structured probabilistic predictions of\nfacial landmark locations.\n2) Our model explicitly captures the structural relationship variations caused by pose and deformation,\nunlike some previous works that combine CNN with CRF using a fixed pairwise relationship.\n3) We use an alternating method and derive closed-form solutions in the alternating steps for learning\nand inference, unlike previous works that use approximate methods such as energy minimization\n(which ignores the partition function) for learning and mean-field for inference. Instead of using a\ndiscriminative criterion or other approximate loss functions, we employ the negative log likelihood (NLL)\nloss function, without any approximating assumption.\n4) Experiments on benchmark face alignment datasets demonstrate the advantages of the proposed\nmethod in achieving better prediction accuracy and generalization to challenging or unseen data than\ncurrent state-of-the-art (SoA) models.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n2 Related Work\n\n2.1 Facial Landmark Detection\n\nClassic facial landmark detection methods, including Active Shape Model (ASM) [14, 28], Active\nAppearance Model (AAM) [13, 24, 27, 36], Constrained Local Model (CLM) [25, 37], and Cascade\nRegression [8, 6, 53, 7, 46], rely on hand-crafted shallow image features and are usually sensitive to\ninitialization. They are outperformed by modern deep learning based methods.\nUsing deep learning for face alignment was first proposed in [39] and achieved better performance\nthan classic methods. 
This purely deep appearance based approach uses a deep cascade convolutional\nnetwork and coordinate regression in each cascade level. Later on, more work using purely deep\nappearance based framework for coordinate regression has been explored. Tasks-constrained deep\nconvolutional network (TCDCN) [50] was proposed to jointly optimize facial landmark detection\nwith correlated tasks such as head pose estimation and facial attribute inference. Mnemonic Descent\nMethod (MDM) [42], an end-to-end trainable deep convolutional Recurrent Neural Network (RNN),\nwas proposed where the cascade regression was implemented by RNN. Recently, heatmap learning\nbased methods established new state-of-the-art for face alignment and body pose estimation [41,\n30, 43]. And most of these face alignment methods [5, 44] follow the architecture of Stacked\nHourglass [30]. The stacked modules re\ufb01ne the network predictions after each stack. Different from\ndirect coordinate regression, it predicts a heatmap with the same size as the input image. Hybrid\ndeep methods combine deep models with face shape models. One strategy is to directly predict 3D\ndeformable parameters instead of landmark locations in a cascaded deep regression framework, e.g.\n3D Dense Face Alignment (3DDFA) [54] and Pose-Invariant Face Alignment (PIFA) [23]. Another\nstrategy is to use the deformable model as a constraint to limit the face shape search space thus to\nre\ufb01ne the predictions from the appearance features, e.g. Convolutional Experts Constrained Local\nModel (CE-CLM) [48].\n\n2.2 Structured Deep Models\n\nTo produce structured predictions, some works combine deep models with graphical models. Early\nworks like [31] jointly train a CNN and a graphical model for image segmentation. Do et al.[16]\nintroduced NeuralCRF for sequence labeling. And various works are explored for other tasks. For\ninstance, Jain et al. [22] and Eigen et al. [18]\u2019s work for image restoration, Yao et al. 
and Morin et\nal.\u2019s work [47, 29] for language understanding, Yoshua et al., Peng et al. and Jaderberg et al.\u2019s work\n[3, 32, 21] for handwriting or text recognition. Recently, for human body pose estimation, Chen et\nal.[10] use CNN to output image dependent part presence as the unary term and spatial relationship\nas the pairwise potential in a tree-structured CRF and uses Dynamic Programming for inference.\nTompson et al. [41, 40] jointly trained a CNN and a fully-connected MRF by using the convolution\nkernel to capture pairwise relationships among different body joints and an iterative convolution\nprocess to implement the belief propagation. The idea of using convolution to implement message\npassing has also been explored in [12], where structure relationships at the body joint feature level\nrather than the output level are captured in a bi-directional tree structured model. And the work\nof Chu et al.[12] is applied to face alignment [44] to pass messages between facial part boundary\nfeature maps. As an extension to [12], [11] models structures in both output and hidden feature\nlayers in CNN. Similarly, for image segmentation, DeepLab [9] uses fully connected CRF with\nbinary cliques and mean-\ufb01eld inference, and [26] uses ef\ufb01cient piecewise training to avoid repeated\ninference during training. In [51], the CRF mean-\ufb01eld inference is implemented by RNN and the\nnetwork is end-to-end trainable by directly optimizing the performance of the mean-\ufb01eld inference.\nUsing RNN to implement message passing has also been applied to facial action unit recognition\n[15]. In [20], the MRF deformable part model is implemented as a layer in a CNN.\nComparison. Compared to previous models serving similar purposes such as [12, 11, 44] that\nassume a tree structured model with belief propagation as inference method, we use a fully-connected\n\n2\n\n\fmodel. 
With a fully connected model, we don\u2019t need to specify a certain tree structured model, letting\nthe model learn the strong or weak relationships from data, thus this method is more generalizable to\ndifferent tasks. And the works [41, 12, 11, 44, 51] use convolution to implement the pairwise term\nand the message passing process. The pairwise term, once trained, is independent of the input image,\nthus cannot capture the pairwise constraint variations across different conditions like target object\nrotation and object shape. However, we explicitly capture the object pose, deformation variations.\nMoreover, they employ approximate methods such as energy minimization ignoring the partition\nfunction for learning and mean-\ufb01eld for inference. In this paper we do exact learning and inference,\ncapturing the full covariance of the joint distribution of facial landmarks given deformable parameters.\nLastly, compared to the traditional CRF models [33, 34], the weights for each unary terms in our\nmodel are also outputs of the neural network whose inverse quanti\ufb01es heteroscedastic aleatoric\nuncertainty of the unary prediction.\n\n3 Method\n\nThis section presents the proposed structured deep probabilistic facial landmark detection model. In\nthis model, the joint probability distribution of facial landmark locations and deformable parameters\nare captured by a conditional random \ufb01eld model.\n\n3.1 Model de\ufb01nition\n\nDenote the face image as x, the 2D facial landmark lo-\ncations as y, each landmark is yi, i = 1, . . . , N. The\ndeformable model parameters that capture pose, identity\nand expression variation are denoted as \u03b6. The model\nparameter we want to learn is denoted as \u0398. Assuming\n\u03b6 is marginally dependent on x but conditionally inde-\npendent of x given y, the graphical model is shown in\nFig. 
1.\nBased on this definition and assumption, the joint distribution of landmarks y and deformable parameters ζ\nconditioned on the face image x can be formulated in a CRF framework and written as\n\np_Θ(y, ζ | x) = (1 / Z_Θ(x)) exp{ −∑_{i=1}^N φ_{θ_1}(y_i | x) − ∑_{i=1}^N ∑_{j=i+1}^N ψ_{C_ij}(y_i, y_j, ζ) }    (1)\n\nwhere Θ = [θ_1, C_ij], θ_1 is the neural network parameter, and C_ij is a 2 × 2 symmetric positive definite\nmatrix that captures the spatial relationship between a pair of landmark points y_i and y_j. Z_Θ(x) is the\npartition function, φ_{θ_1}(y_i | x) is the unary energy function with parameter θ_1, and ψ_{C_ij}(y_i, y_j, ζ)\nis the triple-wise energy function with parameter C_ij.\n\nFigure 1: The graphical model. Dashed, dotted, and solid lines represent dependencies between pairs of\nlandmarks, between landmarks and deformable parameters, and between landmarks and the face image,\nrespectively.\n\n3.2 Energy functions\n\nWe define the unary and triple-wise energies in Eq. (2) and Eq. (3) respectively:\n\nφ_{θ_1}(y_i | x) = (1/2) [y_i − μ_i(x, θ_1)]^T Σ_i^{−1}(x, θ_1) [y_i − μ_i(x, θ_1)]    (2)\n\nψ_{C_ij}(y_i, y_j, ζ) = [y_i − y_j − μ_ij(ζ)]^T C_ij [y_i − y_j − μ_ij(ζ)]    (3)\n\nwhere μ_i(x, θ_1) and Σ_i(x, θ_1) are the outputs of the CNN that represent the mean and covariance matrix\nof each landmark given the image x, and μ_ij(ζ) represents the expected difference between two landmark\nlocations. 
It is fully determined by the 3D deformable face shape parameters ζ, which contain the rigid parameters,\nrotation R and scale S, and the non-rigid parameters q:\n\n[ μ_ij(ζ) ; 1 ] = (1/λ) S R (ȳ_i^3d + Φ_i q − ȳ_j^3d − Φ_j q)\n\nwhere ȳ^3d is the 3D mean face shape and Φ contains the bases of the deformable model; both are learned\nfrom data. The deformable parameters ζ = [S, R, q] are jointly estimated with the 2D landmark locations\nduring inference. In this work, we assume a weak perspective projection model. S is a 3 × 3 diagonal\nmatrix that contains 2 independent parameters s_x, s_y as the scaling factors (encoding the camera intrinsic\nparameters) for column and row respectively, while R is a 3 × 3 orthonormal matrix with 3 independent\nparameters γ_1, γ_2, γ_3 as the pitch, yaw, and roll rotation angles. Note that the translation vector is\ncanceled by taking the difference of two landmark points.\n\n3.3 Learning and Inference\n\nWe propose to implement the conditional probability distribution in Eq. (1) with a CNN-CRF model.\nAs shown in Fig. 2, the CNN with parameter θ_1 outputs the mean μ_i(x, θ_1) and covariance matrix\nΣ_i(x, θ_1) for each facial landmark y_i, which together form the unary energy function φ_{θ_1}(y_i | x).\nA fully-connected (FC) graph with parameters C_ij ≻ 0 gives the triple-wise energy ψ_{C_ij}(y_i, y_j, ζ);\ngiven ζ as well as the output from the unary, the FC can output E(x, ζ, Θ) and Λ_p(x, ζ, Θ), the mean\nand precision matrix of the conditional distribution p_Θ(y | ζ, x). The FC can be implemented as\nanother layer following the CNN. Combining the unary and the triple-wise energy, we obtain the joint\ndistribution p_Θ(y, ζ | x). 
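As a concrete illustration, the expected offset μ_ij(ζ) under the weak perspective model above can be computed as follows. This is a minimal numpy sketch in the paper's notation, not the authors' code: the function and argument names are our own, λ is taken as a plain scalar, and the Euler-angle composition order (roll·yaw·pitch) is an assumption.

```python
import numpy as np

def rotation_matrix(pitch, yaw, roll):
    """Compose a 3x3 rotation R from the angles gamma_1..gamma_3 (radians).
    The Rz @ Ry @ Rx composition order is an assumption for illustration."""
    cx, sx = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cz, sz = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def expected_offset(i, j, mean_shape, Phi, q, sx, sy, angles, lam=1.0):
    """mu_ij(zeta): expected 2D offset between landmarks i and j.
    mean_shape: (N, 3) 3D mean face; Phi: (N, 3, K) deformation bases;
    q: (K,) non-rigid coefficients; angles = (pitch, yaw, roll)."""
    S = np.diag([sx, sy, 1.0])              # scale (camera intrinsics), weak perspective
    R = rotation_matrix(*angles)            # rigid rotation
    diff3d = (mean_shape[i] + Phi[i] @ q) - (mean_shape[j] + Phi[j] @ q)
    proj = (S @ R @ diff3d) / lam           # translation cancels in the difference
    return proj[:2]                         # keep the 2D (x, y) components
```

With zero rotation and unit scale the offset is just the (x, y) difference of the deformed 3D landmarks, which matches the intuition that μ_ij encodes the expected relative placement of the pair.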
However, direct inference of y*, ζ* from p_Θ(y, ζ | x) is difficult; therefore we iteratively infer from the\nconditional distributions p_Θ(y | ζ, x) and p_Θ(ζ | y).\n\nFigure 2: Overall flowchart of the proposed CNN-CRF model.\n\nMean and Precision matrix\nDuring learning and inference, we need to compute the conditional probability p_Θ(y | ζ, x). By using the\nquadratic unary and triple-wise energy functions, the distribution p_Θ(y | ζ, x) is a multivariate Gaussian\ndistribution that can be written as\n\np_Θ(y | ζ, x) = (1 / Z'_Θ(x)) exp{ −∑_{i=1}^N φ_{θ_1}(y_i | x) − ∑_{i=1}^N ∑_{j=i+1}^N ψ_{C_ij}(y_i, y_j, ζ) }\n             = exp{ (1/2) ln|Λ_p(x, Θ, ζ)| − (1/2) [y − E(x, Θ, ζ)]^T Λ_p(x, Θ, ζ) [y − E(x, Θ, ζ)] }    (4)\n\nwhere Z'_Θ(x) is the partition function, and E(x, Θ, ζ) and Λ_p(x, Θ, ζ) are the mean and precision matrix\nof the multivariate Gaussian distribution. They are computed exactly during learning and inference.\nThe mean E can be computed by solving the linear system of equations Λ_p E = b, where Λ_p, the precision\nmatrix, is a symmetric positive definite matrix that can be directly computed from the coefficients in the\nunary and pairwise terms, and b can likewise be computed, as shown in Eq. (5):\n\nΛ_p = [ Λ_p11 ... Λ_p1N ; ... ; Λ_pN1 ... Λ_pNN ],  Λ_pii = Σ_i^{−1} + ∑_{j≠i} C_ij,  Λ_pij = −C_ij (i ≠ j);\nb = [ b_1 ; b_2 ; ... ; b_N ],  b_i = Σ_i^{−1} μ_i + ∑_{j≠i} C_ij μ_ij    (5)\n\nFrom Eq. (5) we can see that the final inference result E_i is a combination of μ_i and μ_j + μ_ij, j ∈\n{1, . . . , N}, j ≠ i. To solve this linear system of equations, we use a direct method for an exact solution,\nwith a fast implementation by Cholesky factorization that requires O(N^3) FLOPs. For a practical\nimplementation of the determinant that avoids numerical issues, we again use the Cholesky factorization\nof Λ_p to get L L^T = Λ_p, and then compute the log determinant by ln|Λ_p| = 2 ∑ ln diag(L), where\ndiag(·) takes the diagonal elements of a matrix.\n\nLearning\nDuring learning, our goal is to optimize Θ given training data D = {x_m, y_m, m = 1, . . . , M}.\nWe directly optimize the inference performance. Note that we do not have ground truth labels for ζ,\nwhere ζ = {ζ_1, . . . , ζ_M}. 
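The assembly of Λ_p and b in Eq. (5), the Cholesky-based solve for the mean E, and the log-determinant computation can be sketched as follows. This is an illustrative numpy/scipy snippet, not the authors' code; it assumes a symmetric pairwise array with C[i, j] = C[j, i] and antisymmetric offsets μ_ji = −μ_ij, and all names are ours.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def conditional_gaussian(Sigma, C, mu, mu_off):
    """Assemble the precision matrix Lambda_p and vector b of Eq. (5) for
    p(y | zeta, x), then recover the mean E and log-determinant exactly.
    Sigma:  (N, 2, 2) unary covariances Sigma_i
    C:      (N, N, 2, 2) pairwise matrices, C[i, j] == C[j, i], diagonal unused
    mu:     (N, 2) unary means mu_i
    mu_off: (N, N, 2) expected offsets mu_ij (mu_off[j, i] == -mu_off[i, j])"""
    N = mu.shape[0]
    Lam = np.zeros((2 * N, 2 * N))
    b = np.zeros(2 * N)
    for i in range(N):
        Sinv = np.linalg.inv(Sigma[i])
        # Lambda_p_ii = Sigma_i^{-1} + sum_{j != i} C_ij
        Lam[2*i:2*i+2, 2*i:2*i+2] = Sinv + sum(C[i, j] for j in range(N) if j != i)
        # b_i = Sigma_i^{-1} mu_i + sum_{j != i} C_ij mu_ij
        b[2*i:2*i+2] = Sinv @ mu[i] + sum(C[i, j] @ mu_off[i, j] for j in range(N) if j != i)
        for j in range(N):
            if j != i:
                Lam[2*i:2*i+2, 2*j:2*j+2] = -C[i, j]  # off-diagonal blocks: -C_ij
    # Exact mean via Cholesky factorization (O(N^3) direct solve)
    factor = cho_factor(Lam, lower=True)
    E = cho_solve(factor, b)
    # ln|Lambda_p| = 2 * sum(ln diag(L)) with L L^T = Lambda_p
    logdet = 2.0 * np.sum(np.log(np.diag(factor[0])))
    return E.reshape(N, 2), logdet
```

As a sanity check, when the offsets μ_ij are consistent with the unary means (μ_ij = μ_i − μ_j), the CRF mean E coincides with the unary means, since the pairwise constraints are already satisfied.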
We use an alternating method. Based on the current Θ^t and ŷ^t = E(x, Θ^t, ζ^t), we optimize ζ by\n\nζ_m^{t+1} = arg min_{ζ_m} − ln p_{Θ^t}(ŷ_m^t, ζ_m | x_m) = arg min_{ζ_m} ∑_{i=1}^N ∑_{j=i+1}^N ψ_{C^t_ij}(ŷ^t_mi, ŷ^t_mj, ζ_m)    (6)\n\nThen, based on the current ζ^t, we optimize Θ by\n\nΘ^{t+1} = arg min_Θ Loss = arg min_Θ −∑_{m=1}^M ln p_Θ(y_m, ζ^t_m | x_m) = arg min_Θ −∑_{m=1}^M ln p_Θ(y_m | ζ^t_m, x_m)\n        = arg min_Θ ∑_{m=1}^M { −(1/2) ln|Λ_p(x_m, Θ, ζ^t_m)| + (1/2) [y_m − E(x_m, Θ, ζ^t_m)]^T Λ_p(x_m, Θ, ζ^t_m) [y_m − E(x_m, Θ, ζ^t_m)] }    (7)\n\nThe algorithm for this problem is designed to first set C_ij = 0 and optimize θ_1, the CNN parameter;\nthen set C_ij = 0.01 I and optimize ζ; then fix a subset of parameters from Θ and optimize the others\nalternately. Its pseudo code is shown in Algorithm 1.\n\nAlgorithm 1: Learning CNN-CRF\nInput: training data {x_m, y_m, m = 1, . . . , M};\nInitialization: parameters Θ^0 = {θ_1^0 = randn, C_ij^0 = 0}, t = 0;\nwhile not converged do\n    θ_1^{t+1} = θ_1^t − η_1^t ∂Loss/∂θ_1; t = t + 1;\nend\nŷ_m^t = E(x_m, Θ^t, ζ^t), C_ij^t = 0.01 I;\nwhile not converged do\n    Stage 1: Fix parameters Θ = Θ^t, optimize ζ by Eq. (6);    // optimize deformable parameters\n    while not converged do\n        ζ_m^{t+1} = arg min_{ζ_m} ∑_{i<j} ψ_{C^t_ij}(ŷ^t_mi, ŷ^t_mj, ζ_m), ŷ^{t+1} = E(x, Θ, ζ^{t+1}), t = t + 1;\n    end\n    Θ^t = Θ;\n    Stage 2: Fix ζ = ζ^t and C_ij = C_ij^t, update θ_1 using Eq. (7);    // update CNN parameters\n    while not converged do\n        θ_1^{t+1} = θ_1^t − η_1^t ∂Loss/∂θ_1; t = t + 1;\n    end\n    [ζ^t, C_ij^t] = [ζ, C_ij];\n    Stage 3: Fix ζ = ζ^t and θ_1 = θ_1^t, update C_ij using Eq. (7);    // update CRF parameters\n    while not converged do\n        C_ij^{t+1} = C_ij^t − η_2^t ∂Loss/∂C_ij; t = t + 1;\n    end\n    [ζ^t, θ_1^t] = [ζ, θ_1];\nend\n\nInference\nThe inference problem is a joint inference of ζ and y for each input face image x, defined in Eq. (8):\n\ny*, ζ* = arg max_{y,ζ} ln p_Θ(y, ζ | x)    (8)\n\nWe use an alternating method. Based on the current y^t, we optimize ζ^t by (see supplementary):\n\nζ^t = arg max_ζ ln p_Θ(y^t, ζ | x) = arg min_ζ ∑_{i=1}^N ∑_{j=i+1}^N ψ_{C_ij}(y^t_i, y^t_j, ζ)    (9)\n\nThen, based on the current ζ^t, we optimize y^{t+1} by:\n\ny^{t+1} = arg max_y ln p_Θ(y, ζ^t | x) = arg max_y ln p_Θ(y | ζ^t, x) = E(x, Θ, ζ^t)    (10)\n\nThe inference algorithm is shown in Algorithm 2.\n\nAlgorithm 2: Inference for CNN-CRF\nInput: face image x\nInitialization: y_i^0 = μ_i, i = 1, . . . , N; t = 0;\nwhile not converged do\n    Update ζ by Eq. (9): ζ^t = arg min_ζ ∑_{i=1}^N ∑_{j=i+1}^N ψ_{C_ij}(y^t_i, y^t_j, ζ);\n    Update y by Eq. (10): y^{t+1} = E(x, Θ, ζ^t);\n    t = t + 1;\nend\n\n4 Experiments\n\nDatasets. 
We evaluate our methods on popular benchmark facial landmark detection datasets,\nincluding 300W [35], Menpo [49], COFW [6], and 300VW [1].\n300W has 68-landmark annotations. It contains 3837 faces for training and 300 indoor and 300 outdoor\nfaces for testing.\nMenpo contains images from AFLW and FDDB with landmark re-annotation following the 68-landmark\nannotation scheme. It has two subsets: Menpo-frontal, which has 68-landmark annotations for near-frontal\nfaces (6679 samples), and Menpo-profile, which has 39-landmark annotations for profile faces (2300\nsamples). We use it as a test set for cross-dataset evaluation.\nCOFW has 1345 training samples and 507 testing samples, whose facial images are all partially occluded.\nThe original dataset is annotated with 29 landmarks. We use the COFW-68 test set [19], which has\n68-landmark re-annotations, for cross-dataset evaluation.\n300VW is a facial video dataset with 68-landmark annotations. It contains 3 scenarios: 1) constrained\nlaboratory and naturalistic well-lit conditions; 2) unconstrained real-world conditions with different\nilluminations, dark rooms, overexposed shots, etc.; 3) completely unconstrained arbitrary conditions\nincluding various illumination, occlusions, make-up, expression, head pose, etc. We use the test set\nfor cross-dataset evaluation.\nEvaluation metrics. We evaluate our algorithm using the standard normalized mean error (NME)\nand the Cumulative Errors Distribution (CED) curve. In addition, the area-under-the-curve (AUC)\nand the failure rate (FR) for a maximum error of 0.07 are reported. As in [5], the NME is defined as the\naverage point-to-point Euclidean distance between the ground truth (y_gt) and predicted (y_pred)\nlandmark locations, normalized by the ground truth bounding box size d = sqrt(w_bbox × h_bbox):\n\nNME = (1/N) ∑_{i=1}^N ||y^(i)_pred − y^(i)_gt||_2 / d\n\nBased on the NME on the test dataset, we can draw a CED curve\nwith NME as the horizontal axis and the percentage of test images as the vertical axis. 
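The NME, AUC, and FR metrics just described can be sketched as follows. This is a minimal numpy illustration of the evaluation protocol, not the authors' evaluation code; the function names are ours.

```python
import numpy as np

def nme(pred, gt, bbox_w, bbox_h):
    """Normalized mean error: average point-to-point distance divided by
    d = sqrt(w_bbox * h_bbox), following the protocol above."""
    d = np.sqrt(bbox_w * bbox_h)
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)) / d)

def ced_auc(errors, max_err=0.07, steps=1000):
    """Sample the CED curve on [0, max_err]; return the normalized AUC and
    the failure rate (fraction of images with NME above max_err)."""
    errors = np.asarray(errors, dtype=float)
    xs = np.linspace(0.0, max_err, steps)
    ced = np.array([np.mean(errors <= x) for x in xs])
    # trapezoidal area, normalized so a perfect detector scores 1.0
    auc = float(np.sum((ced[1:] + ced[:-1]) * 0.5 * (xs[1] - xs[0])) / max_err)
    fr = float(np.mean(errors > max_err))
    return auc, fr
```

For example, a test set where every image has NME 0.01 yields a failure rate of 0 and an AUC close to 6/7, since the CED curve jumps to 1 at one-seventh of the 0.07 cutoff.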
Then the AUC is\ncomputed as the area under that curve for each test dataset.\nImplementation details. To make a fair comparison with the SoA purely deep learning based\nmethods [5], we use the same training and testing procedure for 2D landmark detection. The 3D\ndeformable model was trained on the 300W-train dataset or the 300W-LP dataset by structure from\nmotion [4]. For the CNN, we use 4 stacks of Hourglass with the same structure as [5], each stack followed\nby a softmax layer that outputs a probability map for each facial landmark. From the probability map,\nwe compute the mean μ_i and covariance Σ_i. We use an additional softmax cross entropy loss and an L1\nloss on the mean [38] to assist training, which empirically shows better performance.\nTraining procedure: The initial learning rate η_1 is 10^{−4} for 15 epochs with a minibatch of 10; it is\nthen dropped to 10^{−5} and 10^{−6} after every 15 epochs, and training continues until convergence.\nThe learning rate η_2 is set to 10^{−3}. We applied random augmentations such as random cropping,\nrotation, etc. We first train the method on the 300W-LP [54] dataset, which is augmented from the original\n300W dataset for large yaw poses, and then fine-tune on the original 300W train dataset.\nTesting procedure: We follow the same testing procedure as [5]. The face is cropped using the\nground truth bounding box defined in 300W. The cropped face is rescaled to 256 × 256 before being\npassed to the network. 
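The testing-time preprocessing just described can be sketched as follows. This is an illustrative numpy-only snippet under stated assumptions, not the authors' pipeline: a real implementation would use bilinear resizing, so the nearest-neighbor sampling here is a simplification, and the helper names are ours.

```python
import numpy as np

def crop_and_rescale(image, bbox, out_size=256):
    """Crop the face with the ground-truth bounding box (x0, y0, x1, y1)
    and rescale the crop to out_size x out_size by nearest-neighbor sampling."""
    x0, y0, x1, y1 = bbox
    face = image[y0:y1, x0:x1]
    h, w = face.shape[:2]
    rows = np.clip((np.arange(out_size) * h / out_size).astype(int), 0, h - 1)
    cols = np.clip((np.arange(out_size) * w / out_size).astype(int), 0, w - 1)
    return face[rows][:, cols]

def landmarks_to_image(pred, bbox, out_size=256):
    """Map landmarks predicted in the out_size crop back to image coordinates."""
    x0, y0, x1, y1 = bbox
    scale = np.array([(x1 - x0) / out_size, (y1 - y0) / out_size])
    return np.asarray(pred) * scale + np.array([x0, y0])
```

The inverse mapping matters for evaluation: predictions are made in the 256 × 256 crop but the NME is computed in the original image coordinates.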
For the Menpo-profile dataset, the annotation scheme is different; we use the\noverlapping 26 points for evaluation, i.e., removing points other than the 2 endpoints on the face\ncontour and the eyebrow respectively, and removing the 5th point on the nose contour.\n\n4.1 Comparison with existing approaches\n\nIn Table 1, we compare with some of the most recent best results reported in the 300W protocol,\nwhich trains on LFPW-train, HELEN-train, and AFW, tests on LFPW-test, HELEN-test, and ibug, and\nuses NME normalized with the inter-ocular/pupil distance as the metric.\nIn Table 2, we compare with other baseline facial landmark detection algorithms, including purely\ndeep learning based methods such as TCDCN [50] and FAN [5] as well as hybrid methods such as\nCLNF [2] and CE-CLM [48]. The results for these methods are evaluated using the code provided\nby the authors in the same experiment protocol, i.e., the same bounding box and the same evaluation\nmetrics. The CED curves on the 300W testset are shown in Fig. 
3a.\n\nTable 1: Comparison with SoA methods on the 300W dataset using the 300W protocol (NME\nnormalized with inter-ocular/pupil distance, %)\n\nMethod | Com. | Chal. | Full\nInter-ocular distance\nMDM [42] | - | - | 4.05\nRDR [45] | 5.03 | 8.95 | 5.80\nSAN [17] | 3.34 | 6.60 | 3.98\nLAB (4-stack) [44] | 2.98 | 5.19 | 3.49\nOur method (4-stack) | 2.93 | 4.84 | 3.30\nInter-pupil distance\nMDM [42] | 4.83 | 10.14 | 5.88\nLAB (4-stack) [44] | 4.20 | 7.41 | 4.92\nOur method (4-stack) | 4.06 | 6.98 | 4.63\n\nFigure 3: CED curves on (a) 300W testset, (b) Menpo-frontal dataset, (c) Menpo-profile dataset,\n(d) COFW-68 testset (better viewed in color and magnified)\n\nFigure 4: CED curves on the 300VW testset: (a) category 1, (b) category 2, (c) category 3 (better\nviewed in color and magnified)\n\nCross-dataset Evaluation\nBesides the 300W testset, we evaluate the proposed method on the Menpo dataset, the COFW-68 testset,\nand the 300VW testset for cross-dataset evaluation. The results are shown in Table 2 for the Menpo and\nCOFW-68 datasets and in Table 3 for the 300VW dataset, and the CED curves are shown in Fig. 3b, 3c,\nand 3d respectively. The method is trained on 300W-LP and fine-tuned on the 300W Challenge train set\nfor 68 landmarks. We can see that, compared to the results on the 300W testset and the Menpo-frontal\ndataset, where the SoA methods attain saturating performance as mentioned in [5], for cross-dataset\nevaluation in more challenging conditions such as COFW with heavy occlusion and Menpo-profile with\nlarge pose, the proposed method shows better generalization ability with a significant performance\nimprovement. 
On the other hand, the proposed method shows the smallest failure rate (FR) on all evaluated datasets.\n\nTable 2: Within and cross dataset prediction results (%)\n\nDataset | 300W-test | Menpo-frontal | Menpo-profile | COFW-68 test\nMetric | NME / AUC / FR | NME / AUC / FR | NME / AUC / FR | NME / AUC / FR\nTCDCN [50] | 4.15 / 42.1 / 4.83 | 4.04 / 46.2 / 5.84 | 13.96 / 5.9 / 75.61 | 4.71 / 35.8 / 8.68\nCFSS [52] | 3.09 / 56.7 / 1.83 | 3.91 / 57.4 / 9.75 | 15.04 / 15.2 / 58.87 | 3.79 / 49.0 / 4.34\n3DDFA [54] | 6.90 / 20.6 / 30.00 | 6.57 / 28.7 / 24.57 | 8.37 / 20.5 / 41.43 | 8.13 / 18.2 / 43.79\nCLNF [2] | 4.22 / 47.6 / 6.67 | 3.74 / 55.4 / 5.82 | 8.32 / 27.8 / 27.65 | 4.75 / 42.9 / 10.65\nCE-CLM [48] | 3.05 / 56.9 / 2.33 | 2.78 / 63.3 / 1.66 | 4.63 / 45.2 / 7.17 | 3.36 / 52.4 / 2.37\nFAN (reported in [5]) | - / 66.9 / - | - / 67.5 / - | - / - / - | - / - / -\nSAN [17] | 2.86 / 59.7 / 1.00 | 2.95 / 61.9 / 3.11 | 8.80 / 29.0 / 28.65 | 3.50 / 51.9 / 3.94\nOur method | 2.21 / 68.1 / 0.17 | 2.01 / 71.0 / 0.16 | 3.03 / 60.0 / 1.96 | 2.55 / 63.2 / 0.00\n\n4.2 Analysis\n\nIn this section, we report the results of sensitivity analysis and ablation study. 
If not specified, the analysis is performed on the test datasets with models trained on 300W-LP and\nfine-tuned on the 300W train set.\n\nTable 3: 300VW testset prediction results for cross-dataset evaluation (%)\n\nDataset | 300VW-category1 | 300VW-category2 | 300VW-category3\nMetric | NME / AUC / FR | NME / AUC / FR | NME / AUC / FR\nTCDCN [50] | 3.49 / 51.2 / 1.74 | 3.80 / 45.8 / 1.76 | 4.45 / 43.8 / 8.85\nCFSS [52] | 2.44 / 67.0 / 1.66 | 2.49 / 64.3 / 0.77 | 3.26 / 60.5 / 5.18\n3DDFA [54] | 5.80 / 32.4 / 24.50 | 4.44 / 39.2 / 8.82 | 5.48 / 31.6 / 18.26\nCLNF [2] | 3.34 / 60.4 / 4.31 | 2.98 / 60.0 / 3.02 | 4.73 / 47.1 / 7.74\nCE-CLM [48] | 2.54 / 65.7 / 1.58 | 2.39 / 66.0 / 0.61 | 3.61 / 56.4 / 5.69\nFAN (reported in [5]) | - / 72.1 / - | - / 71.2 / - | - / 64.1 / -\nSAN [17] | 2.58 / 64.5 / 1.10 | 2.57 / 63.2 / 0.42 | 4.06 / 52.9 / 7.19\nOur method | 1.91 / 73.3 / 0.36 | 1.97 / 71.6 / 0.04 | 2.50 / 67.4 / 1.68\n\nSensitivity to challenging conditions. We evaluate different methods under challenging conditions\ncaused by high noise, low resolution, or different initializations in Fig. 5. Generally, the\nproposed CNN-CRF model is more robust under challenging conditions compared to a pure CNN\nmodel with the same structure, i.e., 
the CNN-CRF model with C_ij = 0.\n\nFigure 5: Prediction error sensitivity to challenging conditions: (a) noise, (b) lower resolution,\n(c) larger bounding box\n\nAblation Study. The improvement of the proposed method lies in two aspects. On the one hand, the\nproposed softmax + L1 mean loss + Gaussian negative log likelihood (NLL) loss gives better results\nempirically. On the other hand, the joint training of the CNN-CRF model with the assistance of the\ndeformable model captures structured relationships with pose and deformation awareness. To analyze\nthe effect of the proposed method, in Table 4 we evaluate the performance of a plain CNN prediction,\nthe 3D deformable model fitting to the ground truth, and the joint CNN-CRF prediction accuracy.\n\nTable 4: Ablation study on 300W testset (%)\n\nMethod | NME | AUC | FR\nPlain CNN with softmax cross entropy loss | 2.38 | 65.9 | 0.50\nPlain CNN with softmax + L1 mean loss + Gaussian NLL loss (proposed loss) | 2.30 | 67.4 | 0.50\nSeparately trained CNN and CRF with proposed loss | 2.23 | 67.8 | 0.50\nDeformable model fitting | 1.39 | 79.8 | 0.00\nJointly trained CNN-CRF with proposed loss (proposed method) | 2.21 | 68.1 | 0.17\n\n5 Conclusion\n\nIn this paper, we propose a method combining a CNN with a fully-connected CRF model for facial\nlandmark detection. Compared to the state-of-the-art purely deep learning based methods, our method\nexplicitly captures the structured relationships between different facial landmark locations. Compared\nto previous methods that combine CNN with CRF for human body pose estimation, which learn a\nfixed pairwise relationship representation for different test samples implemented by convolution, our\nmethod captures the structural relationship variations caused by pose and deformation. Moreover,\nwe use a fully-connected model instead of a tree-structured model, obtaining better representation\nability. 
Lastly, compared to previous methods that rely on approximate learning (e.g., omitting the partition function) and approximate inference (e.g., mean-field methods), we perform exact learning and inference, and can thus provide better structured uncertainty. Experiments on benchmark datasets demonstrate that the proposed method outperforms existing state-of-the-art methods, particularly under challenging conditions, in both within-dataset and cross-dataset evaluations.

Acknowledgment The work described in this paper is supported in part by NSF award IIS #1539012 and by RPI-IBM Cognitive Immersive Systems Laboratory (CISL), a center in IBM's AI Horizon Network.

References

[1] 300VW dataset. http://ibug.doc.ic.ac.uk/resources/300-VW/, 2015.

[2] Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. Continuous conditional neural fields for structured regression. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 593–608, Cham, 2014. Springer International Publishing.

[3] Yoshua Bengio, Yann LeCun, and Donnie Henderson. Globally trained handwritten word recognizer using spatial representation, convolutional neural networks, and hidden Markov models. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 937–944. Morgan-Kaufmann, 1994.

[4] C. Bregler, L. Torresani, and A. Hertzmann. Nonrigid structure-from-motion: Estimating shape and motion with hierarchical priors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5):878–892, May 2008.

[5] Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2D & 3D face alignment problem? (And a dataset of 230,000 3D facial landmarks). In International Conference on Computer Vision, 2017.

[6] Xavier P. Burgos-Artizzu, Pietro Perona, and Piotr Dollár. Robust face landmark estimation under occlusion.
In Proceedings of the 2013 IEEE International Conference on Computer Vision, ICCV '13, pages 1513–1520, Washington, DC, USA, 2013. IEEE Computer Society.

[7] Xudong Cao, Yichen Wei, Fang Wen, and Jian Sun. Face alignment by explicit shape regression. International Journal of Computer Vision, 107(2):177–190, Apr 2014.

[8] Dong Chen, Shaoqing Ren, Yichen Wei, Xudong Cao, and Jian Sun. Joint cascade face detection and alignment. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 109–122, Cham, 2014. Springer International Publishing.

[9] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, April 2018.

[10] Xianjie Chen and Alan Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS'14, pages 1736–1744, Cambridge, MA, USA, 2014. MIT Press.

[11] Xiao Chu, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. CRF-CNN: Modeling structured information in human pose estimation. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 316–324. Curran Associates, Inc., 2016.

[12] Xiao Chu, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. Structured feature learning for pose estimation. In CVPR, 2016.

[13] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. In Hans Burkhardt and Bernd Neumann, editors, Computer Vision – ECCV '98, pages 484–498, Berlin, Heidelberg, 1998. Springer Berlin Heidelberg.

[14] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham.
Active shape models – their training and application. Computer Vision and Image Understanding, 61(1):38–59, 1995.

[15] Ciprian A. Corneanu, Meysam Madadi, and Sergio Escalera. Deep structure inference network for facial action unit recognition. In ECCV, 2018.

[16] Trinh-Minh-Tri Do and Thierry Artieres. Neural conditional random fields. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 177–184, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR.

[17] Xuanyi Dong, Yan Yan, Wanli Ouyang, and Yi Yang. Style aggregated network for facial landmark detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 379–388, 2018.

[18] David Eigen, Dilip Krishnan, and Rob Fergus. Restoring an image taken through a window covered with dirt or rain. In Proceedings - 2013 IEEE International Conference on Computer Vision, ICCV 2013, pages 633–640. Institute of Electrical and Electronics Engineers Inc., 2013.

[19] Golnaz Ghiasi and Charless C. Fowlkes. Occlusion coherence: Detecting and localizing occluded faces. CoRR, abs/1506.08347, 2015.

[20] R. Girshick, F. Iandola, T. Darrell, and J. Malik. Deformable part models are convolutional neural networks. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 437–446, June 2015.

[21] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep structured output learning for unconstrained text recognition. Dec 2014.

[22] V. Jain, J. F. Murray, F. Roth, S. Turaga, V. Zhigulin, K. L. Briggman, M. N. Helmstaedter, W. Denk, and H. S. Seung. Supervised learning of image restoration with convolutional networks.
In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8, Oct 2007.

[23] Amin Jourabloo and Xiaoming Liu. Pose-invariant face alignment via CNN-based dense 3D model fitting. Int. J. Comput. Vision, 124(2):187–203, September 2017.

[24] F. Kahraman, G. Muhitin, S. Darkner, and R. Larsen. An active illumination and appearance model for face alignment. Turkish Journal of Electrical Engineering and Computer Science, 18(4):677–692, 2010.

[25] Neeraj Kumar, Peter N. Belhumeur, and Shree K. Nayar. FaceTracer: A search engine for large collections of images with faces. In The 10th European Conference on Computer Vision (ECCV), October 2008.

[26] G. Lin, C. Shen, A. Hengel, and I. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3194–3203, Los Alamitos, CA, USA, June 2016. IEEE Computer Society.

[27] Iain Matthews and Simon Baker. Active appearance models revisited. International Journal of Computer Vision, 60(2):135–164, Nov 2004.

[28] Stephen Milborrow and Fred Nicolls. Locating facial features with an extended active shape model. In Proceedings of the 10th European Conference on Computer Vision: Part IV, ECCV '08, pages 504–513, Berlin, Heidelberg, 2008. Springer-Verlag.

[29] Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In Robert G. Cowell and Zoubin Ghahramani, editors, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 246–252. Society for Artificial Intelligence and Statistics, 2005.

[30] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation.
In Computer Vision – ECCV 2016 – 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII, pages 483–499, 2016.

[31] Feng Ning, D. Delhomme, Y. LeCun, F. Piano, L. Bottou, and P. E. Barbano. Toward automatic phenotyping of developing embryos from videos. Trans. Img. Proc., 14(9):1360–1371, September 2005.

[32] Jian Peng, Liefeng Bo, and Jinbo Xu. Conditional neural fields. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1419–1427. Curran Associates, Inc., 2009.

[33] Vladan Radosavljevic, Slobodan Vucetic, and Zoran Obradovic. Continuous conditional random fields for regression in remote sensing. In Proceedings of the 2010 Conference on ECAI 2010: 19th European Conference on Artificial Intelligence, pages 809–814, Amsterdam, The Netherlands, 2010. IOS Press.

[34] Kosta Ristovski, Vladan Radosavljevic, Slobodan Vucetic, and Zoran Obradovic. Continuous conditional random fields for efficient regression in large fully connected graphs. In AAAI, 2013.

[35] Christos Sagonas, Epameinondas Antonakos, Georgios Tzimiropoulos, Stefanos Zafeiriou, and Maja Pantic. 300 faces in-the-wild challenge. Image Vision Comput., 47(C):3–18, March 2016.

[36] J. Saragih and R. Goecke. A nonlinear discriminative approach to AAM fitting. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8, 2007.

[37] Jason M. Saragih, Simon Lucey, and Jeffrey F. Cohn. Deformable model fitting by regularized landmark mean-shift. International Journal of Computer Vision, 91(2):200–215, Jan 2011.

[38] Xiao Sun, Bin Xiao, Shuang Liang, and Yichen Wei. Integral human pose regression.
arXiv preprint arXiv:1711.08229, 2017.

[39] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep convolutional network cascade for facial point detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 3476–3483, 2013.

[40] Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In CVPR, 2015.

[41] Jonathan Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS'14, pages 1799–1807, Cambridge, MA, USA, 2014. MIT Press.

[42] George Trigeorgis, Patrick Snape, Mihalis A. Nicolaou, Epameinondas Antonakos, and Stefanos Zafeiriou. Mnemonic descent method: A recurrent process applied for end-to-end face alignment. In CVPR, pages 4177–4187. IEEE Computer Society, 2016.

[43] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 4724–4732, 2016.

[44] Wayne Wu, Chen Qian, Shuo Yang, Quan Wang, Yici Cai, and Qiang Zhou. Look at boundary: A boundary-aware face alignment algorithm. In CVPR, 2018.

[45] Shengtao Xiao, Jiashi Feng, Luoqi Liu, Xuecheng Nie, Wei Wang, Shuicheng Yan, and Ashraf Kassim. Recurrent 3D-2D dual learning for large-pose facial landmark detection. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.

[46] Xuehan Xiong and Fernando De la Torre. Global supervised descent method. In CVPR, pages 2664–2673. IEEE Computer Society, 2015.

[47] K.
Yao, B. Peng, G. Zweig, D. Yu, X. Li, and F. Gao. Recurrent conditional random field for language understanding. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4077–4081, May 2014.

[48] Amir Zadeh, Yao Chong Lim, Tadas Baltrusaitis, and Louis-Philippe Morency. Convolutional experts constrained local model for 3D facial landmark detection. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2017.

[49] S. Zafeiriou, G. Trigeorgis, G. Chrysos, J. Deng, and J. Shen. The Menpo facial landmark localisation challenge: A step towards the solution. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 2116–2125, July 2017.

[50] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Facial landmark detection by deep multi-task learning. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 94–108, Cham, 2014. Springer International Publishing.

[51] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. Conditional random fields as recurrent neural networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV '15, pages 1529–1537, Washington, DC, USA, 2015. IEEE Computer Society.

[52] Shizhan Zhu, Cheng Li, Chen Change Loy, and Xiaoou Tang. Face alignment by coarse-to-fine shape searching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4998–5006, 2015.

[53] Shizhan Zhu, Cheng Li, Chen Change Loy, and Xiaoou Tang. Unconstrained face alignment via cascaded compositional learning. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3409–3417, 2016.

[54] Xiangyu Zhu, Zhen Lei, Stan Z. Li, et al.
Face alignment in full pose range: A 3D total solution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.