{"title": "Joint Tracking of Pose, Expression, and Texture using Conditionally Gaussian Filters", "book": "Advances in Neural Information Processing Systems", "page_first": 889, "page_last": 896, "abstract": null, "full_text": "Joint Tracking of Pose, Expression, and Texture using Conditionally Gaussian Filters\n\nTim K. Marks, John Hershey, J. Cooper Roddey, Javier R. Movellan\n\nDepartment of Cognitive Science, University of California San Diego, La Jolla, CA 92093-0515\nInstitute for Neural Computation, University of California San Diego, La Jolla, CA 92093-0523\n\ntkmarks@cogsci.ucsd.edu, hershey@microsoft.com, cooper@sccn.ucsd.edu, movellan@mplab.ucsd.edu\n\nAbstract\n\nWe present a generative model and stochastic filtering algorithm for simultaneous tracking of 3D position and orientation, non-rigid motion, object texture, and background texture using a single camera. We show that the solution to this problem is formally equivalent to stochastic filtering of conditionally Gaussian processes, a problem for which well-known approaches exist [3, 8]. We propose an approach based on Monte Carlo sampling of the nonlinear component of the process (object motion) and exact filtering of the object and background textures given the sampled motion. The smoothness of image sequences in time and space is exploited by using Laplace's method to generate proposal distributions for importance sampling [7]. The resulting inference algorithm encompasses both optic flow and template-based tracking as special cases, and elucidates the conditions under which these methods are optimal. 
We demonstrate an application of the system to 3D non-rigid face tracking.\n\n1 Background\n\nRecent algorithms track morphable objects by solving optic flow equations, subject to the constraint that the tracked points belong to an object whose non-rigid deformations are linear combinations of a set of basic shapes [10, 2, 11]. These algorithms require precise initialization of the object pose and tend to drift out of alignment on long video sequences. We present G-flow, a generative model and stochastic filtering formulation of tracking that addresses the problems of initialization and error recovery in a principled manner.\n\nWe define a non-rigid object by the 3D locations of n vertices. The object is a linear combination of k fixed morph bases, with coefficients $c = [c_1, c_2, \\ldots, c_k]^T$. The fixed $3 \\times k$ matrix $h_i$ contains the position of the ith vertex in all k morph bases. The transformation from object-centered to image coordinates consists of a rotation, weak perspective projection, and translation. Thus $x_i$, the 2D location of the ith vertex on the image plane, is\n\n$$x_i = g r h_i c + l, \\tag{1}$$\n\nwhere $r$ is the $3 \\times 3$ rotation matrix, $l$ is the $2 \\times 1$ translation vector, and $g = \\begin{bmatrix} 1 & 0 & 0 \\\\ 0 & 1 & 0 \\end{bmatrix}$ is the projection matrix. The object pose, $u_t$, comprises both the rigid motion parameters and the morph parameters at time t:\n\n$$u_t = \\{r(t), l(t), c(t)\\}. \\tag{2}$$\n\n1.1 Optic flow\n\nLet $y_t$ represent the current image, and let $x_i(u_t)$ index the image pixel that is rendered by the ith object vertex when the object assumes pose $u_t$. Suppose that we know $u_{t-1}$, the pose at time t - 1, and we want to find $u_t$, the pose at time t. 
This problem can be solved by minimizing the following form with respect to $u_t$:\n\n$$\\hat{u}_t = \\operatorname*{argmin}_{u_t} \\frac{1}{2} \\sum_{i=1}^{n} \\left[ y_t(x_i(u_t)) - y_{t-1}(x_i(u_{t-1})) \\right]^2. \\tag{3}$$\n\nIn the special case in which the $x_i(u_t)$ are neighboring points that move with the same 2D displacement, this reduces to the standard Lucas-Kanade optic flow algorithm [9, 1]. Recent work [10, 2, 11] has shown that in the general case, this optimization problem can be solved efficiently using the Gauss-Newton method. We will take advantage of this fact to develop an efficient stochastic inference algorithm within the framework of G-flow.\n\nNotational conventions. Unless otherwise stated, capital letters are used for random variables, small letters for specific values taken by random variables, and Greek letters for fixed model parameters. Subscripted colons indicate sequences: e.g., $X_{1:t} = X_1 \\cdots X_t$. The term $I_n$ stands for the $n \\times n$ identity matrix, $E$ for expected value, $\\mathrm{Var}$ for the covariance matrix, and $\\mathrm{Var}^{-1}$ for the inverse of the covariance matrix (precision matrix).\n\n2 The Generative Model for G-Flow\n\nFigure 1: Left: $a(U_t)$ determines which texel (color at a vertex of the object model or a pixel of the background model) is responsible for rendering each image pixel. Right: G-flow video generation model: At time t, the object's 3D pose, $U_t$, is used to project the object texture, $V_t$, into 2D. This projection is combined with the background texture, $B_t$, to generate the observed image, $Y_t$.\n\nWe model the image sequence $Y$ as a stochastic process generated by three hidden causes, $U$, $V$, and $B$, as shown in the graphical model (Figure 1, right). The $m \\times 1$ random vector $Y_t$ represents the m-pixel image at time t. 
The $n \\times 1$ random vector $V_t$ and the $m \\times 1$ random vector $B_t$ represent the n-texel object texture and the m-texel background texture, respectively. As illustrated in Figure 1, left, the object pose, $U_t$, determines onto which image pixels the object and background texels project at time t. This is formulated using the projection function $a(U_t)$. For a given pose, $u_t$, the projection $a(u_t)$ is a block matrix, $a(u_t) \\stackrel{\\mathrm{def}}{=} \\begin{bmatrix} a_v(u_t) & a_b(u_t) \\end{bmatrix}$. Here $a_v(u_t)$, the object projection function, is an $m \\times n$ matrix of 0s and 1s that tells onto which image pixel each object vertex projects; e.g., a 1 at row j, column i means that the ith object point projects onto image pixel j. Matrix $a_b$ plays the same role for background pixels. Assuming the foreground mapping is one-to-one, we let $a_b = I_m - a_v(u_t) a_v(u_t)^T$, expressing the simple occlusion constraint that every image pixel is rendered by object or background, but not both. In the G-flow generative model:\n\n$$\\begin{aligned} Y_t &= a(U_t) \\begin{bmatrix} V_t \\\\ B_t \\end{bmatrix} + W_t, & W_t &\\sim N(0, \\sigma_w I_m), \\quad \\sigma_w > 0 \\\\ U_t &\\sim p(u_t \\mid u_{t-1}) \\\\ V_t &= V_{t-1} + Z^v_{t-1}, & Z^v_{t-1} &\\sim N(0, \\Psi_v), \\quad \\Psi_v \\text{ is diagonal} \\\\ B_t &= B_{t-1} + Z^b_{t-1}, & Z^b_{t-1} &\\sim N(0, \\Psi_b), \\quad \\Psi_b \\text{ is diagonal} \\end{aligned} \\tag{4}$$\n\nwhere $p(u_t \\mid u_{t-1})$ is the pose transition distribution, and $Z^v$, $Z^b$, $W$ are independent of each other, of the initial conditions, and over time. The form of the pose distribution is left unspecified since the algorithm proposed here does not require the pose distribution or the pose dynamics to be Gaussian. 
For the initial conditions, we require that the variance of $V_1$ and the variance of $B_1$ are both diagonal.\n\nNon-rigid 3D tracking is a difficult nonlinear filtering problem because changing the pose has a nonlinear effect on the image pixels. Fortunately, the problem has a rich structure that we can exploit: under the G-flow model, video generation is a conditionally Gaussian process [3, 6, 4, 5]. If the specific values taken by the pose sequence, $u_{1:t}$, were known, then the texture processes, $V$ and $B$, and the image process, $Y$, would be jointly Gaussian. This suggests the following scheme: we could use particle filtering to obtain a distribution of pose experts (each expert corresponds to a highly probable sample of pose, $u_{1:t}$). For each expert we could then use Kalman filtering equations to infer the posterior distribution of texture given the observed images. This method is known in the statistics community as a Monte Carlo filtering solution for conditionally Gaussian processes [3, 4], and in the machine learning community as Rao-Blackwellized particle filtering [6, 5]. We found that in addition to Rao-Blackwellization, it was also critical to use Laplace's method to generate the proposal distributions for importance sampling [7]. In the context of G-flow, we accomplished this by performing an optic flow-like optimization, using an efficient algorithm similar to those in [10, 2].\n\n3 Inference\n\nOur goal is to find an expression for the filtering distribution, $p(u_t, v_t, b_t \\mid y_{1:t})$. 
Using the law of total probability, we have the following equation for the filtering distribution:\n\n$$p(u_t, v_t, b_t \\mid y_{1:t}) = \\int \\underbrace{p(u_t, v_t, b_t \\mid u_{1:t-1}, y_{1:t})}_{\\text{Opinion of expert}} \\; \\underbrace{p(u_{1:t-1} \\mid y_{1:t})}_{\\text{Credibility of expert}} \\, du_{1:t-1} \\tag{5}$$\n\nWe can think of the integral in (5) as a sum over a distribution of experts, where each expert corresponds to a single pose history, $u_{1:t-1}$. Based on its hypothesis about pose history, each expert has an opinion about the current pose of the object, $U_t$, and the texture maps of the object and background, $V_t$ and $B_t$. Each expert also has a credibility, a scalar that measures how well the expert's opinion matches the observed image $y_t$. Thus, (5) can be interpreted as follows: The filtering distribution at time t is obtained by integrating over the entire ensemble of experts the opinion of each expert weighted by that expert's credibility. The opinion distribution of expert $u_{1:t-1}$ can be factorized into the expert's opinion about the pose $U_t$ times the conditional distribution of texture $V_t$, $B_t$ given pose:\n\n$$\\underbrace{p(u_t, v_t, b_t \\mid u_{1:t-1}, y_{1:t})}_{\\text{Opinion of expert}} = \\underbrace{p(u_t \\mid u_{1:t-1}, y_{1:t})}_{\\text{Pose opinion}} \\; \\underbrace{p(v_t, b_t \\mid u_{1:t}, y_{1:t})}_{\\text{Texture opinion given pose}} \\tag{6}$$\n\nThe rest of this section explains how we evaluate each term in (5) and (6). 
We cover the distribution of texture given pose in 3.1, pose opinion in 3.2, and credibility in 3.3.\n\n3.1 Texture opinion given pose\n\nThe distribution of $V_t$ and $B_t$ given the pose history $u_{1:t}$ is Gaussian with mean and covariance that can be obtained using the Kalman filter estimation equations:\n\n$$\\mathrm{Var}^{-1}(V_t, B_t \\mid u_{1:t}, y_{1:t}) = \\mathrm{Var}^{-1}(V_t, B_t \\mid u_{1:t-1}, y_{1:t-1}) + a(u_t)^T \\sigma_w^{-1} a(u_t) \\tag{7}$$\n\n$$E(V_t, B_t \\mid u_{1:t}, y_{1:t}) = \\mathrm{Var}(V_t, B_t \\mid u_{1:t}, y_{1:t}) \\left[ \\mathrm{Var}^{-1}(V_t, B_t \\mid u_{1:t-1}, y_{1:t-1}) \\, E(V_t, B_t \\mid u_{1:t-1}, y_{1:t-1}) + a(u_t)^T \\sigma_w^{-1} y_t \\right] \\tag{8}$$\n\nThis requires $p(V_t, B_t \\mid u_{1:t-1}, y_{1:t-1})$, which we get from the Kalman prediction equations:\n\n$$E(V_t, B_t \\mid u_{1:t-1}, y_{1:t-1}) = E(V_{t-1}, B_{t-1} \\mid u_{1:t-1}, y_{1:t-1}) \\tag{9}$$\n\n$$\\mathrm{Var}(V_t, B_t \\mid u_{1:t-1}, y_{1:t-1}) = \\mathrm{Var}(V_{t-1}, B_{t-1} \\mid u_{1:t-1}, y_{1:t-1}) + \\begin{bmatrix} \\Psi_v & 0 \\\\ 0 & \\Psi_b \\end{bmatrix} \\tag{10}$$\n\nIn (9), the expected value $E(V_t, B_t \\mid u_{1:t-1}, y_{1:t-1})$ consists of texture maps (templates) for the object and background. In (10), $\\mathrm{Var}(V_t, B_t \\mid u_{1:t-1}, y_{1:t-1})$ represents the degree of uncertainty about each texel in these texture maps. Since this is a diagonal matrix, we can refer to the mean and variance of each texel individually. 
For the ith texel in the object texture map, we use the following notation:\n\n$$\\mu_t^v(i) \\stackrel{\\mathrm{def}}{=} i\\text{th element of } E(V_t \\mid u_{1:t-1}, y_{1:t-1})$$\n\n$$\\sigma_t^v(i) \\stackrel{\\mathrm{def}}{=} (i, i)\\text{th element of } \\mathrm{Var}(V_t \\mid u_{1:t-1}, y_{1:t-1})$$\n\nSimilarly, define $\\mu_t^b(j)$ and $\\sigma_t^b(j)$ as the mean and variance of the jth texel in the background texture map. (This notation leaves the dependency on $u_{1:t-1}$ and $y_{1:t-1}$ implicit.)\n\n3.2 Pose opinion\n\nBased on its current texture template (derived from the history of poses and images up to time t - 1) and the new image $y_t$, each expert $u_{1:t-1}$ has a pose opinion, $p(u_t \\mid u_{1:t-1}, y_{1:t})$, a probability distribution representing that expert's beliefs about the pose at time t. Since the effect of $u_t$ on the likelihood function is nonlinear, we will not attempt to find an analytical solution for the pose opinion distribution. However, due to the spatio-temporal smoothness of video signals, it is possible to estimate the peak and variance of an expert's pose opinion.\n\n3.2.1 Estimating the peak of an expert's pose opinion\n\nWe want to estimate $\\hat{u}_t(u_{1:t-1})$, the value of $u_t$ that maximizes the pose opinion. 
Since\n\n$$p(u_t \\mid u_{1:t-1}, y_{1:t}) = \\frac{p(y_{1:t-1} \\mid u_{1:t-1})}{p(y_{1:t} \\mid u_{1:t-1})} \\, p(u_t \\mid u_{t-1}) \\, p(y_t \\mid u_{1:t}, y_{1:t-1}), \\tag{11}$$\n\n$$\\hat{u}_t(u_{1:t-1}) \\stackrel{\\mathrm{def}}{=} \\operatorname*{argmax}_{u_t} p(u_t \\mid u_{1:t-1}, y_{1:t}) = \\operatorname*{argmax}_{u_t} p(u_t \\mid u_{t-1}) \\, p(y_t \\mid u_{1:t}, y_{1:t-1}). \\tag{12}$$\n\nWe now need an expression for the final term in (12), the predictive distribution $p(y_t \\mid u_{1:t}, y_{1:t-1})$. By integrating out the hidden texture variables from $p(y_t, v_t, b_t \\mid u_{1:t}, y_{1:t-1})$, and using the conditional independence relationships defined by the graphical model (Figure 1, right), we can derive:\n\n$$\\log p(y_t \\mid u_{1:t}, y_{1:t-1}) = -\\frac{m}{2} \\log 2\\pi - \\frac{1}{2} \\log \\left| \\mathrm{Var}(Y_t \\mid u_{1:t}, y_{1:t-1}) \\right| - \\frac{1}{2} \\sum_{i=1}^{n} \\frac{\\left( y_t(x_i(u_t)) - \\mu_t^v(i) \\right)^2}{\\sigma_t^v(i) + \\sigma_w} - \\frac{1}{2} \\sum_{j \\notin X(u_t)} \\frac{\\left( y_t(j) - \\mu_t^b(j) \\right)^2}{\\sigma_t^b(j) + \\sigma_w}, \\tag{13}$$\n\nwhere $x_i(u_t)$ is the image pixel rendered by the ith object vertex when the object assumes pose $u_t$, and $X(u_t)$ is the set of all image pixels rendered by the object under pose $u_t$. Combining (12) and (13), we can derive\n\n$$\\hat{u}_t(u_{1:t-1}) = \\operatorname*{argmin}_{u_t} \\Bigg\\{ -\\log p(u_t \\mid u_{t-1}) + \\frac{1}{2} \\sum_{i=1}^{n} \\Bigg( \\underbrace{\\frac{[y_t(x_i(u_t)) - \\mu_t^v(i)]^2}{\\sigma_t^v(i) + \\sigma_w}}_{\\text{Foreground term}} \\underbrace{- \\frac{[y_t(x_i(u_t)) - \\mu_t^b(x_i(u_t))]^2}{\\sigma_t^b(x_i(u_t)) + \\sigma_w} - \\log[\\sigma_t^b(x_i(u_t)) + \\sigma_w]}_{\\text{Background terms}} \\Bigg) \\Bigg\\} \\tag{14}$$\n\nNote the similarity between (14) and constrained optic flow (3). For example, focus on the foreground term in (14) and ignore the weights in the denominator. The previous image $y_{t-1}$ from (3) has been replaced by $\\mu_t^v(\\cdot)$, the estimated object texture based on the images and poses up to time t - 1. 
As in optic flow, we can find the pose estimate $\\hat{u}_t(u_{1:t-1})$ efficiently using the Gauss-Newton method.\n\n3.2.2 Estimating the distribution of an expert's pose opinion\n\nWe estimate the distribution of an expert's pose opinion using a combination of Laplace's method and importance sampling. Suppose at time t - 1 we are given a sample of experts indexed by d, each endowed with a pose sequence $u^{(d)}_{1:t-1}$, a weight $w^{(d)}_{t-1}$, and the means and variances of Gaussian distributions for object and background texture. For each expert $u^{(d)}_{1:t-1}$, we use (14) to compute $\\hat{u}^{(d)}_t$, the peak of the pose distribution at time t according to that expert. Define $\\hat{\\Sigma}^{(d)}_t$ as the inverse Hessian matrix of (14) at this peak, the Laplace estimate of the covariance matrix of the expert's opinion. 
We then generate a set of s independent samples $\\{u^{(d,e)}_t : e = 1, \\ldots, s\\}$ from a Gaussian distribution with mean $\\hat{u}^{(d)}_t$ and variance proportional to $\\hat{\\Sigma}^{(d)}_t$, $g(\\cdot \\mid \\hat{u}^{(d)}_t, \\alpha \\hat{\\Sigma}^{(d)}_t)$, where the parameter $\\alpha > 0$ determines the sharpness of the sampling distribution. (Note that letting $\\alpha \\to 0$ would be equivalent to simply setting the new pose equal to the peak of the pose opinion, $u^{(d,e)}_t = \\hat{u}^{(d)}_t$.) To find the parameters of this Gaussian proposal distribution, we use the Gauss-Newton method, ignoring the second of the two background terms in (14). (This term is not ignored in the importance sampling step.)\n\nTo refine our estimate of the pose opinion we use importance sampling. 
We assign each sample from the proposal distribution an importance weight $w_t(d, e)$ that is proportional to the ratio between the posterior distribution and the proposal distribution:\n\n$$\\hat{p}(u_t \\mid u^{(d)}_{1:t-1}, y_{1:t}) = \\sum_{e=1}^{s} \\delta(u_t - u^{(d,e)}_t) \\, \\frac{w_t(d, e)}{\\sum_{f=1}^{s} w_t(d, f)} \\tag{15}$$\n\n$$w_t(d, e) = \\frac{p(u^{(d,e)}_t \\mid u^{(d)}_{t-1}) \\, p(y_t \\mid u^{(d)}_{1:t-1}, u^{(d,e)}_t, y_{1:t-1})}{g(u^{(d,e)}_t \\mid \\hat{u}^{(d)}_t, \\alpha \\hat{\\Sigma}^{(d)}_t)} \\tag{16}$$\n\nThe numerator of (16) is proportional to $p(u^{(d,e)}_t \\mid u^{(d)}_{1:t-1}, y_{1:t})$ by (12), and the denominator of (16) is the sampling distribution.\n\n3.3 Estimating an expert's credibility\n\nThe credibility of the dth expert, $p(u^{(d)}_{1:t-1} \\mid y_{1:t})$, is proportional to the product of a prior term and a likelihood term:\n\n$$p(u^{(d)}_{1:t-1} \\mid y_{1:t}) = \\frac{p(u^{(d)}_{1:t-1} \\mid y_{1:t-1}) \\, p(y_t \\mid u^{(d)}_{1:t-1}, y_{1:t-1})}{p(y_t \\mid y_{1:t-1})}. \\tag{17}$$\n\nRegarding the likelihood,\n\n$$p(y_t \\mid u_{1:t-1}, y_{1:t-1}) = \\int p(y_t, u_t \\mid u_{1:t-1}, y_{1:t-1}) \\, du_t = \\int p(y_t \\mid u_{1:t}, y_{1:t-1}) \\, p(u_t \\mid u_{t-1}) \\, du_t. \\tag{18}$$\n\nWe already generated a set of samples $\\{u^{(d,e)}_t : e = 1, \\ldots, s\\}$ that estimate the pose opinion of the dth expert, $p(u_t \\mid u^{(d)}_{1:t-1}, y_{1:t})$. 
We can now use these samples to estimate the likelihood for the dth expert:\n\n$$\\begin{aligned} p(y_t \\mid u^{(d)}_{1:t-1}, y_{1:t-1}) &= \\int p(y_t \\mid u^{(d)}_{1:t-1}, u_t, y_{1:t-1}) \\, p(u_t \\mid u^{(d)}_{t-1}) \\, du_t \\\\ &= \\int p(y_t \\mid u^{(d)}_{1:t-1}, u_t, y_{1:t-1}) \\, g(u_t \\mid \\hat{u}^{(d)}_t, \\alpha \\hat{\\Sigma}^{(d)}_t) \\, \\frac{p(u_t \\mid u^{(d)}_{t-1})}{g(u_t \\mid \\hat{u}^{(d)}_t, \\alpha \\hat{\\Sigma}^{(d)}_t)} \\, du_t \\approx \\sum_{e=1}^{s} \\frac{w_t(d, e)}{s} \\end{aligned} \\tag{19}$$\n\n3.4 Updating the filtering distribution\n\nOnce we have calculated the opinion and credibility of each expert $u_{1:t-1}$, 
we evaluate the integral in (5) as a weighted sum over experts. The credibilities of all of the experts are normalized to sum to 1. New experts $u_{1:t}$ (children) are created from the old experts $u_{1:t-1}$ (parents) by appending a pose $u_t$ to the parent's history of poses $u_{1:t-1}$. Every expert in the new generation is created as follows: One parent is chosen to sire the child. The probability of being chosen is proportional to the parent's credibility. The child's value of $u_t$ is chosen at random from its parent's pose opinion (the weighted samples described in Section 3.2.2).\n\n4 Relation to Optic Flow and Template Matching\n\nIn basic template-matching, the same time-invariant texture map is used to track every frame in the video sequence. Optic flow can be thought of as template-matching with a template that is completely reset at each frame for use in the subsequent frame. In most cases, optimal inference under G-flow involves a combination of optic flow-based and template-based tracking, in which the texture template gradually evolves as new images are presented. Pure optic flow and template-matching emerge as special cases.\n\nOptic Flow as a Special Case. Suppose that the pose transition probability $p(u_t \\mid u_{t-1})$ is uninformative, that the background is uninformative, that every texel in the initial object texture map has equal variance, $\\mathrm{Var}(V_1) = I_n$, and that the texture transition uncertainty is very high, $\\Psi_v \\to \\mathrm{diag}(\\infty)$. Using (7), (8), and (10), it follows that:\n\n$$\\mu_t^v(i) = [a_v(u_{t-1})]^T y_{t-1} = y_{t-1}(x_i(u_{t-1})), \\tag{20}$$\n\ni.e., the object texture map at time t is determined by the pixels from image $y_{t-1}$ that according to pose $u_{t-1}$ were rendered by the object. 
As a result, (14) reduces to:\n\n$$\\hat{u}_t(u_{1:t-1}) = \\operatorname*{argmin}_{u_t} \\frac{1}{2} \\sum_{i=1}^{n} \\left[ y_t(x_i(u_t)) - y_{t-1}(x_i(u_{t-1})) \\right]^2 \\tag{21}$$\n\nwhich is identical to (3). Thus constrained optic flow [10, 2, 11] is simply a special case of optimal inference under G-flow, with a single expert and with sampling parameter $\\alpha \\to 0$.\n\nThe key assumption that $\\Psi_v \\to \\mathrm{diag}(\\infty)$ means that the object's texture is very different in adjacent frames. However, optic flow is typically applied in situations in which the object's texture in adjacent frames is similar. The optimal solution in such situations calls not for optic flow, but for a texture map that integrates information across multiple frames.\n\nTemplate Matching as a Special Case. Suppose the initial texture map is known precisely, $\\mathrm{Var}(V_1) = 0$, and the texture transition uncertainty is very low, $\\Psi_v \\to 0$. By (7), (8), and (10), it follows that $\\mu_t^v(i) = \\mu_{t-1}^v(i) = \\mu_1^v(i)$, i.e., the texture map does not change over time, but remains fixed at its initial value (it is a texture template). Then (14) becomes:\n\n$$\\hat{u}_t(u_{1:t-1}) = \\operatorname*{argmin}_{u_t} \\sum_{i=1}^{n} \\left[ y_t(x_i(u_t)) - \\mu_1^v(i) \\right]^2 \\tag{22}$$\n\nwhere $\\mu_1^v(i)$ is the ith texel of the fixed texture template. This is the error function minimized by standard template-matching algorithms. 
The key assumption that $\\Psi_v \\to 0$ means the object's texture is constant from each frame to the next, which is rarely true in real data. G-flow provides a principled way to relax this unrealistic assumption of template methods.\n\nGeneral Case. In general, if the background is uninformative, then minimizing (14) results in a weighted combination of optic flow and template matching, with the weight of each approach depending on the current level of certainty about the object template. In addition, when there is useful information in the background, G-flow infers a model of the background which is used to improve tracking.\n\nFigure 2: G-flow tracking an outdoor video. Results are shown for frames 1, 81, and 620.\n\n5 Simulations\n\nWe collected a video (30 frames/sec) of a subject in an outdoor setting who made a variety of facial expressions while moving her head. A later motion-capture session was used to create a 3D morphable model of her face, consisting of a set of 5 morph bases (k = 5).\n\nTwenty experts were initialized randomly near the correct pose on frame 1 of the video and propagated using G-flow inference (assuming an uninformative background). See http://mplab.ucsd.edu for video. Figure 2 shows the distribution of experts for three frames. In each frame, every expert has a hypothesis about the pose (translation, rotation, scale, and morph coefficients). The 38 points in the model are projected into the image according to each expert's pose, yielding 760 red dots in each frame. In each frame, the mean of the experts gives a single hypothesis about the 3D non-rigid deformation of the face (lower right) as well as the rigid pose of the face (rotated 3D axes, lower left). 
Notice G-flow's ability to\nrecover from error: bad initial hypotheses are weeded out, leaving only good hypotheses.\n\nTo compare G-flow's performance with that of deterministic constrained optic flow algorithms\nsuch as [10, 2, 11], we used both G-flow and the method from [2] to track the same video\nsequence. We ran each tracker several times, introducing small errors in the starting pose.\n\n\nFigure 3: Average error over time for G-flow (green) and for deterministic optic flow [2] (blue).\nResults were averaged over 16 runs (deterministic algorithm) or 4 runs (G-flow) and smoothed.\n\n\nAs ground truth, the 2D locations of 6 points were hand-labeled in every 20th frame. The\nerror at every 20th frame was calculated as the distance from these labeled locations to the\ninferred (tracked) locations, averaged across several runs. Figure 3 compares this tracking\nerror as a function of time for the deterministic constrained optic flow algorithm and for a\n20-expert version of the G-flow tracking algorithm. Notice that the deterministic system\nhas a tendency to drift (increase in error) over time, whereas G-flow can recover from drift.\n\n\nAcknowledgments\n\nTim K. Marks was supported by NSF grant IIS-0223052 and NSF grant DGE-0333451 to GWC.\nJohn Hershey was supported by the UCDIMI grant D00-10084. J. Cooper Roddey was supported by\nthe Swartz Foundation. Javier R. Movellan was supported by NSF grants IIS-0086107, IIS-0220141,\nand IIS-0223052, and by the UCDIMI grant D00-10084.\n\n\nReferences\n\n [1] Simon Baker and Iain Matthews. Lucas-Kanade 20 years on: A unifying framework. International\n     Journal of Computer Vision, 56(3):221-255, 2002.\n\n [2] M. Brand. Flexible flow for 3D nonrigid tracking and shape recovery. In CVPR, volume 1,\n     pages 315-322, 2001.\n\n [3] H. Chen, P. Kumar, and J. van Schuppen. On Kalman filtering for conditionally Gaussian systems\n     with random matrices. Syst. Contr. Lett., 13:397-404, 1989.\n\n [4] R. Chen and J. Liu. 
Mixture Kalman filters. J. R. Statist. Soc. B, 62:493-508, 2000.\n\n [5] A. Doucet and C. Andrieu. Particle filtering for partially observed Gaussian state space models.\n     J. R. Statist. Soc. B, 64:827-838, 2002.\n\n [6] A. Doucet, N. de Freitas, K. Murphy, and S. Russell. Rao-Blackwellised particle filtering for\n     dynamic Bayesian networks. In 16th Conference on Uncertainty in AI, pages 176-183, 2000.\n\n [7] A. Doucet, S. J. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for\n     Bayesian filtering. Statistics and Computing, 10:197-208, 2000.\n\n [8] Zoubin Ghahramani and Geoffrey E. Hinton. Variational learning for switching state-space\n     models. Neural Computation, 12(4):831-864, 2000.\n\n [9] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo\n     vision. In Proceedings of the International Joint Conference on Artificial Intelligence, 1981.\n\n[10] L. Torresani, D. Yang, G. Alexander, and C. Bregler. Tracking and modeling non-rigid objects\n     with rank constraints. In CVPR, pages 493-500, 2001.\n\n[11] Lorenzo Torresani, Aaron Hertzmann, and Christoph Bregler. Learning non-rigid 3D shape from\n     2D motion. In Advances in Neural Information Processing Systems 16. MIT Press, 2004.\n", "award": [], "sourceid": 2732, "authors": [{"given_name": "Tim", "family_name": "Marks", "institution": null}, {"given_name": "J.", "family_name": "Roddey", "institution": null}, {"given_name": "Javier", "family_name": "Movellan", "institution": null}, {"given_name": "John", "family_name": "Hershey", "institution": null}]}