{"title": "Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 1629, "page_last": 1637, "abstract": "We present a type of Temporal Restricted Boltzmann Machine that defines a probability distribution over an output sequence conditional on an input sequence. It shares the desirable properties of RBMs: efficient exact inference, an exponentially more expressive latent state than HMMs, and the ability to model nonlinear structure and dynamics. We apply our model to a challenging real-world graphics problem: facial expression transfer. Our results demonstrate improved performance over several baselines modeling high-dimensional 2D and 3D data.", "full_text": "Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines\n\nMatthew D. Zeiler1, Graham W. Taylor1, Leonid Sigal2, Iain Matthews2, and Rob Fergus1\n\n1Department of Computer Science, New York University, New York, NY 10012\n\n2Disney Research, Pittsburgh, PA 15213\n\nAbstract\n\nWe present a type of Temporal Restricted Boltzmann Machine that defines a probability distribution over an output sequence conditional on an input sequence. It shares the desirable properties of RBMs: efficient exact inference, an exponentially more expressive latent state than HMMs, and the ability to model nonlinear structure and dynamics. We apply our model to a challenging real-world graphics problem: facial expression transfer. Our results demonstrate improved performance over several baselines modeling high-dimensional 2D and 3D data.\n\n1 Introduction\n\nModeling temporal dependence is an important consideration in many learning problems. One can capture temporal structure either explicitly in the model architecture, or implicitly through latent variables which can act as a \u201cmemory\u201d. 
Feedforward neural networks which incorporate fixed delays into their architecture are an example of the former. A limitation of these models is that temporal context is fixed by the architecture instead of inferred from the data. To address this shortcoming, recurrent neural networks incorporate connections between the latent variables at different time steps. This enables them to capture arbitrary dynamics, yet they are more difficult to train [2].\n\nAnother family of dynamical models that has received much attention is probabilistic models such as Hidden Markov Models and more general Dynamic Bayes nets. Due to their statistical structure, they are perhaps more interpretable than their neural-network counterparts. Such models can be separated into two classes [19]: tractable models, which permit an exact and efficient procedure for inferring the posterior distribution over latent variables, and intractable models, which require approximate inference. Tractable models such as Linear Dynamical Systems and HMMs are widely applied and well understood. However, they are limited in the types of structure that they can capture; these limitations are exactly what permit simple exact inference. Intractable models, such as Switching LDS, Factorial HMMs, and other more complex variants of DBNs, permit more complex regularities to be learned from data. This comes at the cost of using approximate inference schemes, for example, Gibbs sampling or variational inference, which either introduce a computational burden or poorly approximate the true posterior.\n\nIn this paper we focus on Temporal Restricted Boltzmann Machines [19, 20], a family of models that permits tractable inference but allows much more complicated structure to be extracted from time series data. 
Models of this class have a number of attractive properties: 1) they employ a distributed state space where multiple factors interact to explain the data; 2) they permit nonlinear dynamics and multimodal predictions; and 3) although maximum likelihood is intractable for these models, there exists a simple and efficient approximate learning algorithm that works well in practice.\n\nWe concentrate on modeling the distribution of an output sequence conditional on an input sequence. Recurrent neural networks address this problem, though in a non-probabilistic sense. The Input-Output HMM [3] extends HMMs by conditioning both their dynamics and emission model on an input sequence. However, the IOHMM is representationally limited by its simple discrete state in the same way as an HMM. Therefore we extend TRBMs to cope with input-output sequence pairs. Given the conditional nature of a TRBM (its hidden states and observations are conditioned on short histories of these variables), conditioning on an external input is a natural extension to this model.\n\nSeveral real-world problems involve sequence-to-sequence mappings. These include motion-style transfer [9], economic forecasting with external indicators [13], and various tasks in natural language processing [6]. Sequence classification is a special case of this setting, where a scalar target is conditioned on an input sequence. In this paper, we consider facial expression transfer, a well-known problem in computer graphics. Current methods considered by the graphics community are typically linear (e.g., methods based on blendshape mapping) and do not take into account dynamical aspects of the facial motion itself. This makes it difficult to retarget the facial articulations involved in speech. 
We propose a model that can encode a complex nonlinear mapping from the motion of one individual to another which captures the facial geometry and dynamics of both source and target.\n\n2 Related work\n\nIn this section we discuss several latent variable models which can map an input sequence to an output sequence. We also briefly review our application field: facial expression transfer.\n\n2.1 Temporal models\n\nAmong probabilistic models, the Input-Output HMM [3] is most similar to the architecture we propose. Like the HMM, the IOHMM is a generative model of sequences, but it models the distribution of an output sequence conditional on an input, while the HMM simply models the distribution of an output sequence. The IOHMM is also trained with a more discriminative-style EM-based learning paradigm than HMMs. A similarity between IOHMMs and TRBMs is that in both models, the dynamics and emission distributions are formulated as neural networks. However, the IOHMM state space is a multinomial while TRBMs have binary latent states. A K-state TRBM can thus represent the history of a time series using 2^K state configurations while IOHMMs are restricted to K settings.\n\nThe Continuous Profile Model [12] is a rich and robust extension of dynamic time warping that can be applied to many time series in parallel. The CPM has a discrete state space and requires an input sequence. Therefore it is a type of conditional HMM. However, unlike the IOHMM and our proposed model, the input is unobserved, making learning completely unsupervised.\n\nOur approach is also related to the many proposed techniques for supervised learning with structured outputs. The problem of simultaneously predicting multiple, correlated variables has received a great deal of recent attention [1]. Many of these models, including the one we propose, are formally defined as undirected graphs whose potential functions are functions of some input. 
In Graph Transformer Networks [11] the dependency structure on the outputs is chosen to be sequential, which decouples the graph into pairwise potentials. Conditional Random Fields [10] are a special case of this model with linear potential functions. These models are trained discriminatively, typically with gradient descent, whereas our model is trained generatively using an approximate algorithm.\n\n2.2 Facial expression transfer\n\nFacial expression transfer, also called motion retargeting or cross-mapping, is the act of adapting the motion of an actor to a target character. It, along with the related fields of facial performance capture and performance-driven animation, has been a very active research area over the last several years. According to a review by Pighin [15], the two most important considerations for this task are the facial model parameterization (called \u201cthe rig\u201d in the graphics industry) and the nature of the chosen cross-mapping. A popular parameterization is \u201cblendshapes\u201d, where a rig is a set of linearly combined facial expressions, each controlled by a scalar weight. Retargeting amounts to estimating a set of blending weights at each frame of the source data that accurately reconstructs the target frame. There are many different ways of selecting blendshapes, from simply selecting a set of sufficient frames from the data to creating models based on principal components analysis. Another common parameterization is to simply represent the face by its vertex, polygon, or spline geometry. The downside of this approach is that this representation has many more degrees of freedom than are present in an actual facial expression.\n\nA linear function is the most common choice for cross-mapping. While it is simple to estimate from data, it cannot produce the subtle nonlinear motion required for realistic graphics applications. 
An example of this approach is [5], which uses a parametric model based on eigen-points to reliably synthesize simple facial expressions but ultimately fails to capture more subtle details. Vlasic et al. [23] have proposed a multilinear mapping where variation in appearance across the source and target is explicitly separated from the variation in facial expression. None of these models explicitly incorporates dynamics into the mapping, which is a limitation addressed by our approach.\n\nFinally, we note that Susskind et al. [18] have used RBMs for facial expression generation, but not retargeting. Their work is focused on static rather than temporal data.\n\n3 Modeling dynamics with Temporal Restricted Boltzmann Machines\n\nIn this section we review the Temporal Restricted Boltzmann Machine. We then introduce the Input-Output Temporal Restricted Boltzmann Machine, which extends the architecture to model an output sequence conditional on an input sequence.\n\n3.1 Temporal Restricted Boltzmann Machines\n\nA Restricted Boltzmann Machine [17] is a bipartite Markov Random Field consisting of a layer of stochastic observed variables (\u201cvisible units\u201d) connected to a layer of stochastic latent variables (\u201chidden units\u201d). The absence of connections between hidden units ensures they are conditionally independent given a setting of the visible units, and vice versa. This simplifies inference and learning. The RBM can be extended to model temporal data by conditioning its visible units and/or hidden units on a short history of their activations. This model is called a Temporal Restricted Boltzmann Machine [19]. Conditioning the model on the previous settings of the hidden units complicates inference. Although one can approximate the posterior distribution with the filtering distribution (treating the past setting of the hidden units as fixed), we choose to use a simplified form of the model which conditions only on previous visible states [20]. This model inherits the most important computational properties of the standard RBM: simple, exact inference and efficient approximate learning.\n\nRBMs typically have binary observed variables and binary latent variables, but to model real-valued data (e.g., the parameterization of a face), we can use a modified form of the TRBM with conditionally independent linear-Gaussian observed variables [7]. The model, depicted in Fig. 1 (left), defines a joint probability distribution over a real-valued representation of the current frame of data, vt, and a collection of binary latent variables, ht, hj \u2208 {0, 1}:\n\np(vt, ht|v