Review for NeurIPS paper: Knowledge Augmented Deep Neural Networks for Joint Facial Expression and Action Unit Recognition

NeurIPS 2020

Review 1

Summary and Contributions: The authors address Action Unit (AU) detection and Facial Expression Recognition (FER) jointly, using knowledge-driven training. Knowledge of the inter-dependencies between AUs and facial expressions is captured by a Bayesian Network (BN) via constrained optimization and integrated, step-wise, into the joint training. The main contribution is a multi-step, optimization- and learning-based approach to integrating knowledge of the relations between AUs and facial expressions. The generic knowledge captured by the BN is then used as a semi-supervisory signal for AU detection, which is in turn exploited in a joint training procedure to estimate AUs and facial expressions. The method yields an AU detector that generalizes well across datasets with comparable performance, and also gives state-of-the-art (SoA) performance for FER compared to other data-based models.

Strengths: The authors have supported their claims with sound experiments and comparisons to other related work. Their method (expression-augmented AU detection) works very well and is able to encode generic knowledge of the AU relationships in a BN. The method achieves good performance, especially on unbalanced datasets like MMI and non-acted facial expression datasets like EmotioNet. The integration of generic knowledge in the form of expression-dependent (joint/single) and expression-independent (joint) AU probabilities gives the proposed method its boost in performance. These claims are validated with experiments on varied datasets: CK+ (acted expressions), BP4D (spontaneous), MMI (posed) and EmotioNet (unconstrained, in-the-wild expressions). Learning the parameters of a BN with probability constraints by posing it as an optimization problem is a good formulation. Emotions can be contextual and are not only related to facial expressions ([a]), so facial expressions are not clearly associated with their corresponding emotions. However, as the authors have shown, it is possible to relate the AUs to their anatomical muscle counterparts via standard studies [5, 6 & 8]. The correlations between the AUs are explicitly encoded into the constrained optimization for the proposed methods. Since these relations are data-agnostic, they are very strong priors for AU detection and consequently FER. [a] https://www.tdx.cat/handle/10803/667808 (Chapter 2, first 2 paragraphs)
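To make concrete what "learning BN parameters from probability constraints" can look like, here is a toy sketch. It is not the authors' formulation: the single AU, the three expressions, the constraint list, the margin, and the squared-hinge penalty solved by gradient descent are all assumptions chosen for illustration. The parameters are p[e] = P(AU = 1 | expression = e), and no labeled data is used, only inequality constraints of the kind that could be read off from expert studies.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical constraints on p[e] = P(AU = 1 | expression = e).
# Each entry (a, b) means p[a] >= p[b] + MARGIN; b may be the constant 0.5.
MARGIN = 0.1
EXPRESSIONS = ["happy", "sad", "surprise"]
CONSTRAINTS = [
    ("happy", 0.5),         # AU more likely present than absent given "happy"
    ("happy", "sad"),       # AU more likely given "happy" than given "sad"
    ("happy", "surprise"),  # AU more likely given "happy" than given "surprise"
]

def value(p, term):
    # A term is either an expression name (look up its probability) or a constant.
    return p[term] if isinstance(term, str) else term

def violation(p, a, b):
    # Amount by which the constraint p[a] >= p[b] + MARGIN is violated (0 if satisfied).
    return max(0.0, value(p, b) + MARGIN - value(p, a))

def fit(steps=2000, lr=0.5):
    logits = {e: 0.0 for e in EXPRESSIONS}  # start at p = 0.5 everywhere
    for _ in range(steps):
        p = {e: sigmoid(z) for e, z in logits.items()}
        grad = {e: 0.0 for e in EXPRESSIONS}
        for a, b in CONSTRAINTS:
            v = violation(p, a, b)
            if v > 0.0:
                # Gradient of v**2: -2v with respect to p[a], +2v with respect to p[b].
                grad[a] -= 2.0 * v
                if isinstance(b, str):
                    grad[b] += 2.0 * v
        for e in EXPRESSIONS:
            # Chain rule through the sigmoid: dp/dz = p * (1 - p).
            logits[e] -= lr * grad[e] * p[e] * (1.0 - p[e])
    return {e: sigmoid(z) for e, z in logits.items()}

probs = fit()
```

The sigmoid parameterization keeps every probability strictly inside (0, 1), so only the inequality constraints need explicit penalties. A real BN would have many such conditional tables and would typically combine constraints of this kind with additional terms.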

Weaknesses: Joint training for AU detection and FER in a single network clearly improves the performance of both tasks (multi-task learning). However, integrating generic knowledge of the relationship between AUs and facial expressions is NOT a novel idea (although it is worth exploring for the affective computing community). Optimization of the parameters of a graphical model via regularization of a BN was already demonstrated by [b]. Using only selected works (especially [5, 6 & 8]) to formulate the probability constraints on AUs for the BN formulation is not complete. There are many more studies [a], including psychological studies [c], exploring the relation of FER to contextual features. This could severely affect the joint AU-FER study. [a] https://www.tdx.cat/handle/10803/667808 (Appendix C) [b] https://ieeexplore.ieee.org/abstract/document/4587368 [c] https://journals.sagepub.com/doi/10.1111/j.1467-9280.2008.02148.x

Correctness: The claims are grounded in sound psychological studies, and the conclusions are consistent with the experiments and the inferences drawn from them.

Clarity: The paper is well written and organization of the sections has a good flow to it.

Relation to Prior Work: Related works are clearly mentioned and their relevance to the proposed method is also stated. The current work is a continuation of a lot of previous work in the research community.

Reproducibility: Yes

Additional Feedback: The work is a good incremental step towards understanding the relationship between AUs and FER, and the influence of detecting one on the other.
- Figure 1: I am assuming that the dotted lines represent back-propagation steps for each module. Please clarify this in the manuscript/figure.
- Sec 3.1: The idea of using generic knowledge as probabilities is not unique ([b]), and restricting the model to only 8 AUs (there are many more) is not justified. When generating Table 1, it is important to note that these numbers are taken from studies which explored more AUs than are listed in the table. AUs that are not mentioned in the paper could influence the ones that are. What do the authors have to comment on this? Also, as the authors mention in #291-292, a couple of unbalanced AUs can have a high impact on the final scores; could the inclusion of other AUs (balanced or unbalanced) make a large difference to the final results?
- Table 4: A similar analysis on MMI (and/or EmotioNet) would shed more light on the claim that the generic BN is indeed effective, specifically since both these datasets are unique, i.e. MMI and EmotioNet are unbalanced and have images of faces in the wild (spontaneous/unconstrained poses). EmotioNet has high-quality manually labeled AU data for all their 24556 images [e].
- Table 6: The comparison with TCAE/LP-SM seems inconsistent. The reported metrics for TCAE/LP-SM are based on the whole BP4D/CK+ datasets, whereas AUD-BN/EA are tested on selected apex frames on BP4D (#240-245) and last frames on CK+ (#245-248). Extending the above comment to Table 8: you have selectively collected the data you train/test on (Section 4; Databases). How is it that you are comparing your results with the reported results? Did I miss anything?
- #32: There are previous works that include facial information in multiple ways. For example, in [a], the authors use a global-local loss function (section 2.2) that is grounded in the relations between facial landmarks. They use this global information about facial landmarks during training, which makes the training faster.
- #149: The word should be 'joint'.
- #258: Could you explain the choice of a 3-layer CNN for the AU detection model? Did you try using a pre-trained CNN like VGGFace [d] for AU detection, or any other deep network pre-trained on the AU detection task?
- #262: With different initializations and/or pre-trained models, it is possible to arrive at different values of $\lambda_{1}$ and $\lambda_{2}$, that is, if you tuned them. Could you explain the influence of their values on the final results? Or did you just use a grid search to get those values?
- Using the optimization/test data of EmotioNet for performance evaluation of your methods is fine. But this data from EmotioNet is expert-labeled, and the bigger challenge is to have a better algorithm to annotate the other ~900K noisy images (called the training data). Would it be fair, in your opinion, to compare your method's performance on this noisy data?
[a] http://ieeexplore.ieee.org/document/8237690/
[d] https://www.robots.ox.ac.uk/~vgg/data/vgg_face2/
[e] http://cbcsl.ece.ohio-state.edu/enc-2020/index.html (search for 'Optimization data')

Post Rebuttal: Most of my main queries have been addressed in the rebuttal. However, since the authors are training/optimising multiple components with multiple datasets, it is not easy to focus on all the details, and one could possibly miss something. As pointed out by one of the reviewers, the paper could benefit from proofreading. I still think that the work can influence other research through its results on BN knowledge encoding. I will keep my score.

Review 2

Summary and Contributions: The paper proposes integrating manually constructed knowledge into a Bayesian network for Action Unit and Facial Expression recognition. It is evaluated on both tasks, including AU recognition with weak supervision.

Strengths: The model seems to demonstrate promising results on recognition of expressions of emotion (Table 8).

Weaknesses: The description of expressions and action units could use more rigor:
- Facial expression relates to a configuration of the human face. Most facial expressions can be described using action units (with the exception of some speech-related expressions). So while facial expressions and action units can in some ways be seen as different levels of abstraction for describing what a face is doing (with action units being the lower level), they do not "represent two levels of facial muscle movements", especially as action units are based on visible muscle activations on the face.
- Some of these expressions can have "semantic" names associated with them, related to perceived emotion (for example anger, happiness, etc.); however, these are highly subjective and culture- and context-dependent. This does not mean that these labels are not useful, but it does mean that they are very susceptible to noise. In several places the paper refers to AU annotations being noisy, but not facial expression ones. I would argue that the opposite is true: action unit annotations are typically (but not always) done by certified experts, while most perceived-emotion datasets are annotated by lay people. Finally, the correlations between AUs and expressions of emotion are very culture- and context-specific.

The paper provides limited novelty. The idea of combining AUs and global facial expression labels has been around for some time (e.g. "Facial expressions recognition system using Bayesian inference", "Real-Time Inference of Complex Mental States from Facial Expressions and Head Gestures", "Facial Emotion and Action Unit Recognition based on Bayesian Network"); it would be great to have more discussion of what is different in the proposed approach and what its advantages are.

The paper does not discuss potential issues of broader impact. While arguably less abusable than face recognition, facial expression recognition still raises significant privacy concerns. One example is tracking at scale how individuals react to certain political messages, or when exposed to certain political figures. Another is monitoring individuals in public places to understand their emotional reactions to events. Again, while it is probably not as privacy-invading as face recognition, the technology is not without risks.

Correctness: The proposed method and evaluation methodology appear to be sound.

Clarity: The paper is difficult to follow and understand. It would benefit from proofreading and copy editing. Many methods and combinations of methods are introduced, which makes it fairly difficult to understand the contribution of each.

Relation to Prior Work: The work could do with clarifying the difference between their proposed method and a number of other approaches that integrate AU and expression of emotion information in a Bayesian framework.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: The paper proposes a knowledge-augmented deep learning framework that jointly performs facial expression recognition (FER) and Action Unit (AU) detection, two separate but closely related research tasks. The domain knowledge of the inter-relationships between expressions and AUs is modelled using a regression Bayesian network (rBN) that is learned with probability constraints derived from expert knowledge. During training, the combined expression distribution of FER, AU and rBN is used to supervise the framework and exploit interactions between the three sub-modules. With the incorporated domain knowledge, the proposed framework achieves state-of-the-art results on FER and weakly-supervised AU tasks, and generalizes well to datasets with unbalanced or noisy labels.

Rebuttal update: After checking other reviewers' comments and the author response, I still think this is a borderline work. Since the authors have addressed part of my concerns in the response, I will raise my vote from 5 to 6.

Strengths:
1. The paper connects FER and AU detection, two closely related tasks, using one joint framework.
2. It provides a weakly supervised learning framework to incorporate domain knowledge of the expression-AU relationship. This allows the framework to perform AU detection even when the training data does not provide AU labels, which are hard to annotate.

Weaknesses:
1. No analysis of the two weights in Eq. 10. It would be interesting to see the cases in which domain knowledge is more important. This could reveal insights into the data that are helpful for constructing future datasets.
2. No experiments on each individual component. The overall framework contains multiple components and it would be interesting to see how each component contributes. For example, how does the pretraining affect each sub-network?
3. In terms of the FER performance on EmotioNet, since EmotioNet does not provide AU annotations, it is not clear how the rBN is learned. How does training the rBN on different datasets affect the generalization ability to noisy expression labels?
4. The overall model comprises multiple components, and training the model requires multiple separate stages. As no code (training or testing) is promised, the provided training details about data and hyperparameters do not guarantee that the results are reproducible.
5. A few typos need to be corrected, for example L149 join -> joint, L118 it’s -> its, Table 1 P is missing.
6. Some references have inconsistent formats. [16], [54] and [55] are missing venues, [24] is missing spaces in the author list, and [27] and [29] use acronyms for the venue while others use full names.
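The weight analysis requested in point 1 could take the form of a simple sweep. The sketch below is only an illustration of the bookkeeping: `sweep_weights`, `placeholder_train_and_eval`, and the grid values are hypothetical names and numbers, and the placeholder score surface is synthetic, not a result from the paper. In a real study the callable would train the joint model with the given pair of weights and return a validation metric.

```python
from itertools import product

def sweep_weights(train_and_eval, lambda1_grid, lambda2_grid):
    """Evaluate every (lambda1, lambda2) pair and return the results sorted best-first."""
    results = []
    for lam1, lam2 in product(lambda1_grid, lambda2_grid):
        score = train_and_eval(lam1, lam2)
        results.append({"lambda1": lam1, "lambda2": lam2, "score": score})
    return sorted(results, key=lambda r: r["score"], reverse=True)

def placeholder_train_and_eval(lam1, lam2):
    # Stand-in only: a real run would train the joint model with these weights
    # and return a validation metric. This synthetic surface is NOT a result.
    return -((lam1 - 0.1) ** 2) - ((lam2 - 1.0) ** 2)

# Hypothetical log-spaced grids for the two weights.
table = sweep_weights(placeholder_train_and_eval, [0.01, 0.1, 1.0], [0.1, 1.0, 10.0])
best = table[0]
```

Reporting the full `table` rather than only `best` would show how sharply performance depends on each weight, which is the insight the point asks for.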

Correctness: Correct

Clarity: Clearly written and easy to follow.

Relation to Prior Work: Yes

Reproducibility: No

Additional Feedback:

Review 4

Summary and Contributions: The authors propose a network to jointly recognize facial expressions and facial action units (AUs). Unlike other work, the proposed method does not use any AU annotations for training the AU detection module; instead, the domain knowledge, i.e. the relationships between expressions and AUs, is encoded in a Bayesian Network which is used for weak supervision. Expression recognition is also enhanced by employing the knowledge of AUs.

Strengths: AU-coded images are rather limited compared to the millions of emotion-coded images. This paper proposes a new way to systematically capture the dependencies between facial expressions and AUs and incorporate them into a deep learning framework for joint facial expression recognition and AU detection. Experimental results on four datasets for both AU detection and FER tasks are quite convincing. Impressively, the proposed AU detection module can achieve decent performance without AU annotations.

Weaknesses: There are a few minor grammatical problems, e.g., in line 35 "expression are".

Correctness: Yes.

Clarity: The paper was well written.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: