{"title": "People Tracking with the Laplacian Eigenmaps Latent Variable Model", "book": "Advances in Neural Information Processing Systems", "page_first": 1705, "page_last": 1712, "abstract": "Reliably recovering 3D human pose from monocular video requires constraints that bias the estimates towards typical human poses and motions. We define priors for people tracking using a Laplacian Eigenmaps Latent Variable Model (LELVM). LELVM is a probabilistic dimensionality reduction model that naturally combines the advantages of latent variable models---definining a multimodal probability density for latent and observed variables, and globally differentiable nonlinear mappings for reconstruction and dimensionality reduction---with those of spectral manifold learning methods---no local optima, ability to unfold highly nonlinear manifolds, and good practical scaling to latent spaces of high dimension. LELVM is computationally efficient, simple to learn from sparse training data, and compatible with standard probabilistic trackers such as particle filters. We analyze the performance of a LELVM-based probabilistic sigma point mixture tracker in several real and synthetic human motion sequences and demonstrate that LELVM provides sufficient constraints for robust operation in the presence of missing, noisy and ambiguous image measurements.", "full_text": "People Tracking with the Laplacian Eigenmaps\n\nLatent Variable Model\n\nZhengdong Lu\n\nCSEE, OGI, OHSU\nzhengdon@csee.ogi.edu\n\nMiguel \u00b4A. Carreira-Perpi\u02dcn\u00b4an\n\nEECS, UC Merced\n\nCristian Sminchisescu\n\nUniversity of Bonn\n\nhttp://eecs.ucmerced.edu\n\nsminchisescu.ins.uni-bonn.de\n\nAbstract\n\nReliably recovering 3D human pose from monocular video requires models that\nbias the estimates towards typical human poses and motions. We construct pri-\nors for people tracking using the Laplacian Eigenmaps Latent Variable Model\n(LELVM). 
LELVM is a recently introduced probabilistic dimensionality reduction model that combines the advantages of latent variable models—a multimodal probability density for latent and observed variables, and globally differentiable nonlinear mappings for reconstruction and dimensionality reduction—with those of spectral manifold learning methods—no local optima, ability to unfold highly nonlinear manifolds, and good practical scaling to latent spaces of high dimension. LELVM is computationally efficient, simple to learn from sparse training data, and compatible with standard probabilistic trackers such as particle filters. We analyze the performance of a LELVM-based probabilistic sigma point mixture tracker in several real and synthetic human motion sequences and demonstrate that LELVM not only provides sufficient constraints for robust operation in the presence of missing, noisy and ambiguous image measurements, but also compares favorably with alternative trackers based on PCA or GPLVM priors.\n\nRecent research in reconstructing articulated human motion has focused on methods that can exploit available prior knowledge on typical human poses or motions in an attempt to build more reliable algorithms. The high dimensionality of the ambient human pose space (between 30 and 60 joint angles or joint positions, depending on the desired accuracy level) makes exhaustive search prohibitively expensive. This has a negative impact on existing trackers, which are often not sufficiently reliable at reconstructing human-like poses, self-initializing or recovering from failure. Such difficulties have stimulated research in algorithms and models that reduce the effective working space, either using generic search focusing methods (annealing, state space decomposition, covariance scaling) or by exploiting specific problem structure (e.g. kinematic jumps). 
Experience with these procedures has nevertheless shown that any search strategy, no matter how effective, can be made significantly more reliable if restricted to low-dimensional state spaces. A low-dimensional method permits a more thorough exploration of the typical solution space for a comparable computational effort. The argument correlates well with the belief that the human pose space, although high-dimensional in its natural ambient parameterization, has a significantly lower perceptual (latent or intrinsic) dimensionality, at least in a practical sense: many poses that are possible are so improbable in many real-world situations that it pays off to encode them with low accuracy.\n\nA perceptual representation has to be powerful enough to capture the diversity of human poses in a sufficiently broad domain of applicability (the task domain), yet compact and analytically tractable for search and optimization. This justifies the use of models that are nonlinear and low-dimensional (able to unfold highly nonlinear manifolds with low distortion), yet probabilistically motivated and globally continuous for efficient optimization. Reducing dimensionality is not the only goal: perceptual representations have to preserve critical properties of the ambient space. Reliable tracking needs locality: nearby regions in ambient space have to be mapped to nearby regions in latent space. If this does not hold, the tracker is forced to make unrealistically large, and difficult to predict, jumps in latent space in order to follow smooth trajectories in the joint-angle ambient space.\n\nIn this paper we propose to model priors for articulated motion using a recently introduced probabilistic dimensionality reduction method, the Laplacian Eigenmaps Latent Variable Model (LELVM) [1]. 
Section 1 discusses the requirements of priors for articulated motion in the context of proba-\nbilistic and spectral methods for manifold learning, and section 2 describes LELVM and shows how\nit combines both types of methods in a principled way. Section 3 describes our tracking frame-\nwork (using a particle \ufb01lter) and section 4 shows experiments with synthetic and real human motion\nsequences using LELVM priors learned from motion-capture data.\nRelated work: There is signi\ufb01cant work in human tracking, using both generative and discrimina-\ntive methods. Due to space limitations, we will focus on the more restricted class of 3D generative\nalgorithms based on learned state priors, and not aim at a full literature review. Deriving com-\npact prior representations for tracking people or other articulated objects is an active research \ufb01eld,\nsteadily growing with the increased availability of human motion capture data. Howe et al. and\nSidenbladh et al. [2] propose Gaussian mixture representations of short human motion fragments\n(snippets) and integrate them in a Bayesian MAP estimation framework that uses 2D human joint\nmeasurements, independently tracked by scaled prismatic models [3]. Brand [4] models the human\npose manifold using a Gaussian mixture and uses an HMM to infer the mixture component index\nbased on a temporal sequence of human silhouettes. Sidenbladh et al. [5] use similar dynamic priors\nand exploit ideas in texture synthesis\u2014ef\ufb01cient nearest-neighbor search for similar motion frag-\nments at runtime\u2014in order to build a particle-\ufb01lter tracker with observation model based on contour\nand image intensity measurements. 
Sminchisescu and Jepson [6] propose a low-dimensional probabilistic model based on fitting a parametric reconstruction mapping (sparse radial basis functions) and a parametric latent density (Gaussian mixture) to the embedding produced with a spectral method. They track humans walking and involved in conversations using a Bayesian multiple-hypotheses framework that fuses contour and intensity measurements. Urtasun et al. [7] use a dynamic MAP estimation framework based on a GPLVM and 2D human joint correspondences obtained from an independent image-based tracker. Li et al. [8] use a coordinated mixture of factor analyzers within a particle filtering framework in order to reconstruct human motion in multiple views, using chamfer matching to score different configurations. Wang et al. [9] learn a latent space with associated dynamics where both the dynamics and the observation mapping are Gaussian processes, and Urtasun et al. [10] use it for tracking. Taylor et al. [11] also learn a binary latent space with dynamics (using an energy-based model) but apply it to synthesis, not tracking. Our work learns a static, generative low-dimensional model of poses and integrates it into a particle filter for tracking. We show its ability to work with real or partially missing data and to track multiple activities.\n\n1 Priors for articulated human pose\n\nWe consider the problem of learning a probabilistic low-dimensional model of human articulated motion. Call y ∈ R^D the representation in ambient space of the articulated pose of a person. In this paper, y contains the 3D locations of anywhere between 10 and 60 markers located on the person's joints (other representations such as joint angles are also possible). The values of y have been normalised for translation and rotation in order to remove rigid motion and leave only the articulated motion (see section 3 for how we track the rigid motion). 
While y is high-dimensional, the motion pattern lives in a low-dimensional manifold because most values of y yield poses that violate body constraints or are simply atypical for the motion type considered. Thus we want to model y in terms of a small number of latent variables x given a collection of poses {y_n}_{n=1}^N (recorded from a human with motion-capture technology). The model should satisfy the following: (1) It should define a probability density for x and y, to be able to deal with noise (in the image or marker measurements) and uncertainty (from missing data due to occlusion or markers that drop), and to allow integration in a sequential Bayesian estimation framework. The density model should also be flexible enough to represent multimodal densities. (2) It should define mappings for dimensionality reduction F: y → x and reconstruction f: x → y that apply to any value of x and y (not just those in the training set); such mappings should be defined on a global coordinate system, be continuous (to avoid physically impossible discontinuities) and differentiable (to allow efficient optimisation when tracking), yet flexible enough to represent the highly nonlinear manifold of articulated poses. From a statistical machine learning point of view, this is precisely what latent variable models (LVMs) do; for example, factor analysis defines linear mappings and Gaussian densities, while the generative topographic mapping (GTM; [12]) defines nonlinear mappings and a Gaussian-mixture density in ambient space. 
However, factor analysis is too limited to be of practical use, and GTM—while flexible—has two important practical problems: (1) the latent space must be discretised to allow tractable learning and inference, which limits it to very low (2-3) latent dimensions; (2) the parameter estimation is prone to bad local optima that result in highly distorted mappings.\n\nAnother recently introduced dimensionality reduction method, GPLVM [13], which uses a Gaussian process mapping f(x), partly improves this situation by defining a tunable parameter x_n for each data point y_n. While still prone to local optima, this allows the use of a better initialisation for {x_n}_{n=1}^N (obtained from a spectral method, see later). This has prompted the application of GPLVM to tracking human motion [7]. However, GPLVM has some disadvantages: its training is very costly (each step of the gradient iteration is cubic in the number of training points N, though approximations based on using few points exist); unlike true LVMs, it defines neither a posterior distribution p(x|y) in latent space nor a dimensionality reduction mapping E{x|y}; and the latent representation it obtains is not ideal. For example, for periodic motions such as running or walking, repeated periods (identical up to small noise) can be mapped apart from each other in latent space because nothing constrains x_n and x_m to be close even when y_n = y_m (see fig. 3 and [10]).\n\nThere exists a different class of dimensionality reduction methods, spectral methods (such as Isomap, LLE or Laplacian eigenmaps [14]), with advantages and disadvantages complementary to those of LVMs. They define neither mappings nor densities but just a correspondence (x_n, y_n) between points in latent space x_n and ambient space y_n. 
However, the training is efficient (a sparse eigenvalue problem) and has no local optima, and it often yields a correspondence that successfully models highly nonlinear, convoluted manifolds such as the Swiss roll. While these attractive properties have spurred recent research in spectral methods, their lack of mappings and densities has limited their applicability in people tracking. However, a new model that combines the advantages of LVMs and spectral methods in a principled way has recently been proposed [1], which we briefly describe next.\n\n2 The Laplacian Eigenmaps Latent Variable Model (LELVM)\n\nLELVM is based on a natural way of defining an out-of-sample mapping for Laplacian eigenmaps (LE) which, in addition, results in a density model. In LE, we typically first define a k-nearest-neighbour graph on the sample data {y_n}_{n=1}^N and weigh each edge y_n ~ y_m by a Gaussian affinity function K(y_n, y_m) = w_nm = exp(-(1/2)||(y_n - y_m)/σ||^2). Then the latent points X result from:\n\nmin tr(X L X^T)   s.t.   X ∈ R^{L×N}, X D X^T = I, X D 1 = 0    (1)\n\nwhere we define the matrix X_{L×N} = (x_1, ..., x_N), the symmetric affinity matrix W_{N×N}, the degree matrix D = diag(d_1, ..., d_N) with d_n = Σ_{m=1}^N w_nm, the graph Laplacian matrix L = D - W, and 1 = (1, ..., 1)^T. The constraints eliminate the two trivial solutions X = 0 (by fixing an arbitrary scale) and x_1 = ··· = x_N (by removing 1, which is an eigenvector of L associated with a zero eigenvalue). The solution is given by the leading eigenvectors v_2, ..., v_{L+1} of the normalised affinity matrix N = D^{-1/2} W D^{-1/2}, namely X = V^T D^{-1/2} with V_{N×L} = (v_2, ..., v_{L+1}) (an a posteriori translated, rotated or uniformly scaled X is equally valid).\n\nFollowing [1], we now define an out-of-sample mapping F(y) = x for a new point y as a semi-supervised learning problem, by recomputing the embedding as in (1) (i.e., augmenting the graph Laplacian with the new point) but keeping the old embedding fixed:\n\nmin_{x ∈ R^L} tr( ( X  x ) ( L + diag(K(y))   -K(y) ; -K(y)^T   1^T K(y) ) ( X  x )^T )    (2)\n\nwhere K_n(y) = K(y, y_n) = exp(-(1/2)||(y - y_n)/σ||^2) for n = 1, ..., N is the kernel induced by the Gaussian affinity (applied only to the k nearest neighbours of y, i.e., K_n(y) = 0 if y ≁ y_n). This is one natural way of adding a new point to the embedding while keeping the existing embedded points fixed. We need not use the constraints from (1) because they would trivially determine x, and the uninteresting solutions X = 0 and X = constant were already removed in the old embedding anyway. The solution yields an out-of-sample dimensionality reduction mapping x = F(y):\n\nx = F(y) = X K(y) / (1^T K(y)) = Σ_{n=1}^N [ K(y, y_n) / Σ_{n'=1}^N K(y, y_{n'}) ] x_n    (3)\n\napplicable to any point y (new or old). This mapping is formally identical to a Nadaraya-Watson estimator (kernel regression; [15]) using as data {(x_n, y_n)}_{n=1}^N and the kernel K. We can take this a step further by defining a LVM that has as joint distribution a kernel density estimate (KDE):\n\np(x, y) = (1/N) Σ_{n=1}^N K_y(y, y_n) K_x(x, x_n),   p(y) = (1/N) Σ_{n=1}^N K_y(y, y_n),   p(x) = (1/N) Σ_{n=1}^N K_x(x, x_n)\n\nwhere K_y is proportional to K so that it integrates to 1, and K_x is a pdf kernel in x-space. 
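To make the construction concrete, here is a minimal dense NumPy sketch of the LE embedding (1) and the Nadaraya-Watson out-of-sample mappings of (3); the function names and the dense small-N eigensolver are our own choices, and a real implementation would use the sparse eigenvalue formulation described above:

```python
import numpy as np

def lelvm_fit(Y, k, sigma, L):
    """Laplacian eigenmaps embedding, eq. (1): returns latent points (N x L).

    Dense small-N sketch (names ours, not from [1]); the paper solves the
    same problem as a sparse eigenvalue problem.
    """
    N = Y.shape[0]
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    W = np.exp(-0.5 * d2 / sigma**2)           # Gaussian affinities w_nm
    np.fill_diagonal(W, 0.0)
    # Symmetric k-nearest-neighbour graph: keep an edge if either endpoint
    # lists the other among its k nearest neighbours.
    idx = np.argsort(d2, axis=1)[:, 1:k + 1]
    M = np.zeros((N, N), dtype=bool)
    M[np.repeat(np.arange(N), k), idx.ravel()] = True
    W = np.where(M | M.T, W, 0.0)
    Dm = 1.0 / np.sqrt(W.sum(axis=1))          # D^{-1/2}; graph must be connected
    S = Dm[:, None] * W * Dm[None, :]          # normalised affinity D^{-1/2} W D^{-1/2}
    vals, vecs = np.linalg.eigh(S)             # eigenvalues in ascending order
    V = vecs[:, -2:-(L + 2):-1]                # skip the trivial top eigenvector
    return V * Dm[:, None]                     # rows of X^T = D^{-1/2} V

def lelvm_F(y, Y, X, sigma_y):
    """Dimensionality reduction x = F(y), eq. (3): a Nadaraya-Watson estimator."""
    w = np.exp(-0.5 * ((Y - y) ** 2).sum(-1) / sigma_y**2)
    return (w[:, None] * X).sum(0) / w.sum()

def lelvm_f(x, Y, X, sigma_x):
    """Reconstruction y = f(x): kernel regression in the other direction."""
    w = np.exp(-0.5 * ((X - x) ** 2).sum(-1) / sigma_x**2)
    return (w[:, None] * Y).sum(0) / w.sum()
```

With a very small latent bandwidth the estimator concentrates on the nearest training pair, so F maps each training point essentially onto its own latent coordinate; larger bandwidths smooth the mappings and densities, as described above.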
Consequently, the marginals in observed and latent space are also KDEs, and the dimensionality reduction and reconstruction mappings are given by kernel regression (the conditional means E{y|x}, E{x|y}):\n\nF(y) = Σ_{n=1}^N p(n|y) x_n,   f(x) = Σ_{n=1}^N p(n|x) y_n,   where p(n|x) = K_x(x, x_n) / Σ_{n'=1}^N K_x(x, x_{n'}) and p(n|y) is defined analogously from K_y.    (4)\n\nWe allow the bandwidths to be different in the ambient and latent spaces: K_x(x, x_n) ∝ exp(-(1/2)||(x - x_n)/σ_x||^2) and K_y(y, y_n) ∝ exp(-(1/2)||(y - y_n)/σ_y||^2). They may be tuned to control the smoothness of the mappings and densities [1].\n\nThus, LELVM naturally extends a LE embedding (efficiently obtained as a sparse eigenvalue problem with a cost O(N^2)) to global, continuous, differentiable mappings (NW estimators) and potentially multimodal densities having the form of a Gaussian KDE. This allows easy computation of posterior probabilities such as p(x|y) (unlike GPLVM). It can use a continuous latent space of arbitrary dimension L (unlike GTM) by simply choosing L eigenvectors in the LE embedding. It has no local optima since it is based on the LE embedding. LELVM can learn convoluted mappings (e.g. the Swiss roll) and define maps and densities for them [1]. The only parameters to set are the graph parameters (number of neighbours k, affinity width σ) and the smoothing bandwidths σ_x, σ_y.\n\n3 Tracking framework\n\nWe follow the sequential Bayesian estimation framework, where for state variables s and observation variables z we have the recursive prediction and correction equations:\n\np(s_t|z_{0:t-1}) = ∫ p(s_t|s_{t-1}) p(s_{t-1}|z_{0:t-1}) ds_{t-1},   p(s_t|z_{0:t}) ∝ p(z_t|s_t) p(s_t|z_{0:t-1}).    (5)\n\nWe define the state variables as s = (x, d) where x ∈ R^L is the low-dim. 
latent space (for pose) and d ∈ R^3 is the centre-of-mass location of the body (in the experiments our state also includes the orientation of the body, but for simplicity here we describe only the translation). The observed variables z consist of image features or the perspective projection of the markers on the camera plane. The mapping from state to observations is (for the markers' case, assuming M markers):\n\n(x ∈ R^L, d ∈ R^3)  --f-->  y ∈ R^{3M}  --⊕-->  y shifted by d  --P-->  z ∈ R^{2M}    (6)\n\nwhere f is the LELVM reconstruction mapping (learnt from mocap data); ⊕ shifts each 3D marker by d; and P is the perspective projection (pinhole camera), applied to each 3D point separately. Here we use a simple observation model p(z_t|s_t): Gaussian with mean given by the transformation (6) and isotropic covariance (set by the user to control the influence of measurements in the tracking). We assume known correspondences and observations that are obtained either from the 3D markers (for tracking synthetic data) or from 2D tracks obtained with a 2D tracker. Our dynamics model is\n\np(s_t|s_{t-1}) ∝ p_d(d_t|d_{t-1}) p_x(x_t|x_{t-1}) p(x_t)    (7)\n\nwhere both dynamics models for d and x are random walks: Gaussians centred at the previous-step values d_{t-1} and x_{t-1}, respectively, with isotropic covariance (set by the user to control the influence of dynamics in the tracking); and p(x_t) is the LELVM prior. Thus the overall dynamics predicts states that are both near the previous state and yield feasible poses. Of course, more complex dynamics models could be used if e.g. the speed and direction of movement are known.\n\nAs tracker we use the Gaussian mixture Sigma-point particle filter (GMSPPF) [16]. 
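The state-to-observation pipeline (6) and the Gaussian observation model can be sketched as follows. This is a hedged illustration: `f_len` (a focal length), the noise value and the function names are placeholder choices not specified in the text, and the marker array y is assumed to have been produced by the LELVM reconstruction mapping f:

```python
import numpy as np

def project(y, d, f_len=1.0):
    """Mapping (6): shift the M x 3 marker array y by the centre-of-mass d,
    then apply a pinhole perspective projection to each marker.
    f_len is a placeholder focal length (assumption, not from the paper)."""
    P = y + d                            # the ⊕ step: translate each marker by d
    return f_len * P[:, :2] / P[:, 2:3]  # perspective divide -> M x 2 image points

def log_obs_lik(z, y, d, sigma_z=0.05):
    """Log of the Gaussian observation model p(z|s): isotropic covariance,
    mean given by (6). Missing markers can simply be left out of the sum."""
    r = z - project(y, d)
    return -0.5 * (r ** 2).sum() / sigma_z**2
```

Dropping rows of r for occluded markers is how the missing-data cases discussed in section 4 can be handled under this model.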
This is a par-\nticle \ufb01lter that uses a Gaussian mixture representation for the posterior distribution in state space\nand updates it with a Sigma-point Kalman \ufb01lter. This Gaussian mixture will be used as proposal\ndistribution to draw the particles. As in other particle \ufb01lter implementations, the prediction step\nis carried out by approximating the integral (5) with particles and updating the particles\u2019 weights.\nThen, a new Gaussian mixture is \ufb01tted with a weighted EM algorithm to these particles. This re-\nplaces the resampling stage needed by many particle \ufb01lters and mitigates the problem of sample\ndepletion while also preventing the number of components in the Gaussian mixture from growing\nover time. The choice of this particular tracker is not critical; we use it to illustrate the fact that\nLELVM can be introduced in any probabilistic tracker for nonlinear, nongaussian models. Given the\ncorrected distribution p(st|z0:t), we choose its mean as recovered state (pose and location). It is also\npossible to choose instead the mode closest to the state at t \u2212 1, which could be found by mean-shift\nor Newton algorithms [17] since we are using a Gaussian-mixture representation in state space.\n\n4\n\n\f4 Experiments\n\nWe demonstrate our low-dimensional tracker on image sequences of people walking and running,\nboth synthetic (\ufb01g. 1) and real (\ufb01g. 2\u20133). Fig. 1 shows the model copes well with persistent partial\nocclusion and severely subsampled training data (A,B), and quantitatively evaluates temporal recon-\nstruction (C). For all our experiments, the LELVM parameters (number of neighbors k, Gaussian\naf\ufb01nity \u03c3, and bandwidths \u03c3x and \u03c3y) were set manually. We mainly considered 2D latent spaces\n(for pose, plus 6D for rigid motion), which were expressive enough for our experiments. More\ncomplex, higher-dimensional models are straightforward to construct. 
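As an illustration of the recursion (5) with the random-walk dynamics (7), a minimal bootstrap particle filter can stand in for the GMSPPF; this simplified sketch (all names ours, not from [16]) replaces the Gaussian-mixture posterior fit with plain multinomial resampling, and the LELVM prior p(x_t) would simply contribute an extra term to the log-weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_pf(z_seq, log_lik, s0_mean, s0_std, dyn_std, n_particles=500):
    """Minimal bootstrap particle filter over states s = (x, d).

    Simplified stand-in for GMSPPF: random-walk prediction as in (7),
    correction by importance weighting with the observation likelihood as
    in (5), then multinomial resampling. Returns the posterior mean per frame.
    """
    S = s0_mean + s0_std * rng.standard_normal((n_particles, len(s0_mean)))
    means = []
    for z in z_seq:
        S = S + dyn_std * rng.standard_normal(S.shape)       # predict: random walk
        logw = np.array([log_lik(z, s) for s in S])          # correct: weight by p(z|s)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        means.append(w @ S)                                  # posterior mean estimate
        S = S[rng.choice(n_particles, size=n_particles, p=w)]  # resample
    return np.array(means)
```

In the paper's setting the state would be the latent pose x plus the rigid motion d, and the likelihood would be the marker-projection model of (6); here any user-supplied `log_lik(z, s)` works.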
The initial state distribution\np(s0) was chosen a broad Gaussian, the dynamics and observation covariance were set manually to\ncontrol the tracking smoothness, and the GMSPPF tracker used a 5-component Gaussian mixture\nin latent space (and in the state space of rigid motion) and a small set of 500 particles. The 3D\nrepresentation we use is a 102-D vector obtained by concatenating the 3D markers coordinates of all\nthe body joints. These would be highly unconstrained if estimated independently, but we only use\nthem as intermediate representation; tracking actually occurs in the latent space, tightly controlled\nusing the LELVM prior. For the synthetic experiments and some of the real experiments (\ufb01gs. 2\u20133)\nthe camera parameters and the body proportions were known (for the latter, we used the 2D outputs\nof [6]). For the CMU mocap video (\ufb01g. 2B) we roughly guessed. We used mocap data from several\nsources (CMU, OSU). As observations we always use 2D marker positions, which, depending on\nthe analyzed sequence were either known (the synthetic case), or provided by an existing tracker\n[6] or speci\ufb01ed manually (\ufb01g. 2B). Alternatively 2D point trackers similar to the ones of [7] can be\nused. The forward generative model is obtained by combining the latent to ambient space mapping\n(this provides the position of the 3D markers) with a perspective projection transformation. The\nobservation model is a product of Gaussians, each measuring the probability of a particular marker\nposition given its corresponding image point track.\nExperiments with synthetic data: we analyze the performance of our tracker in controlled condi-\ntions (noise perturbed synthetically generated image tracks) both under regular circumstances (rea-\nsonable sampling of training data) and more severe conditions with subsampled training points and\npersistent partial occlusion (the man running behind a fence, with many of the 2D marker tracks\nobstructed). Fig. 
1B,C shows both the posterior (\ufb01ltered) latent space distribution obtained from\nour tracker, and its mean (we do not show the distribution of the global rigid body motion; in all\nexperiments this is tracked with good accuracy). In the latent space plot shown in \ufb01g. 1B, the onset\nof running (two cycles were used) appears as a separate region external to the main loop. It does not\nappear in the subsampled training set in \ufb01g. 1B, where only one running cycle was used for training\nand the onset of running was removed. In each case, one can see that the model is able to track quite\ncompetently, with a modest decrease in its temporal accuracy, shown in \ufb01g. 1C, where the averages\nare computed per 3D joint (normalised wrt body height). Subsampling causes some ambiguity in\nthe estimate, e.g. see the bimodality in the right plot in \ufb01g. 1C. In another set of experiments (not\nshown) we also tracked using different subsets of 3D markers. The estimates were accurate even\nwhen about 30% of the markers were dropped.\nExperiments with real images: this shows our tracker\u2019s ability to work with real motions of differ-\nent people, with different body proportions, not in its latent variable model training set (\ufb01gs. 2\u20133).\nWe study walking, running and turns. In all cases, tracking and 3D reconstruction are reasonably ac-\ncurate. We have also run comparisons against low-dimensional models based on PCA and GPLVM\n(\ufb01g. 3). It is important to note that, for LELVM, errors in the pose estimates are primarily caused\nby mismatches between the mocap data used to learn the LELVM prior and the body proportions of\nthe person in the video. For example, the body proportions of the OSU motion captured walker are\nquite different from those of the image in \ufb01g. 2\u20133 (e.g. note how the legs of the stick man are shorter\nrelative to the trunk). Likewise, the style of the runner from the OSU data (e.g. 
the swinging of the arms) is quite different from that of the video. Finally, the interest points tracked by the 2D tracker do not entirely correspond, either in number or location, to the motion capture markers, and are noisy and sometimes missing. In future work, we plan to include an optimization step to also estimate the body proportions. This would be complicated for a general, unconstrained model because the dimensions of the body couple with the pose, so either one or the other can be changed to improve the tracking error (the observation likelihood can also become singular). But for dedicated prior pose models like ours these difficulties should be significantly reduced. The model simply cannot assume highly unlikely stances—these are either not representable at all, or have reduced probability—and thus avoids compensatory, unrealistic body proportion estimates.\n\n[Figure 1: panels A and B, reconstructed poses and latent-space posteriors at frames n = 15, 40, 65, 90, 115, 140 and n = 1, 13, 25, 37, 49, 60.]
\n\n\u22121.5\n\n\u22121.5\n\n\u22121.5\n\n\u22121.5\n\n\u22121\n\n\u22120.5\n\n0\n\n0.5\n\n1\n\n\u22121.5\n\n\u22121\n\n\u22120.5\n\n0\n\n0.5\n\n1\n\n\u22121.5\n\n\u22121\n\n\u22120.5\n\n0\n\n0.5\n\n1\n\n\u22121.5\n\n\u22121\n\n\u22120.5\n\n0\n\n0.5\n\n1\n\n\u22121.5\n\n\u22121.5\n\n\u22121.5\n\n\u22121.5\n\n\u22121.5\n\n\u22121\n\n\u22120.5\n\n0\n\n0.5\n\n1\n\n\u22121.5\n\n\u22121\n\n\u22120.5\n\n0\n\n0.5\n\n1\n\n\u22121.5\n\n\u22121\n\n\u22120.5\n\n0\n\n0.5\n\n1\n\n\u22121.5\n\n\u22121\n\n\u22120.5\n\n0\n\n0.5\n\n1\n\n\u22121.5\n\n\u22121.5\n\n\u22121.5\n\n\u22121.5\n\n\u22121.5\n\n\u22121\n\n\u22120.5\n\n0\n\n0.5\n\n1\n\n\u22121.5\n\n\u22121\n\n\u22120.5\n\n0\n\n0.5\n\n1\n\n\u22121.5\n\n\u22121\n\n\u22120.5\n\n0\n\n0.5\n\n1\n\n\u22121.5\n\n\u22121\n\n\u22120.5\n\n0\n\n0.5\n\n1\n\n\u22121.5\n\n\u22121.5\n\n\u22121.5\n\n\u22121.5\n\n\u22121.5\n\n\u22121\n\n\u22120.5\n\n0\n\n0.5\n\n1\n\n\u22121.5\n\n\u22121\n\n\u22120.5\n\n0\n\n0.5\n\n1\n\n\u22121.5\n\n\u22121\n\n\u22120.5\n\n0\n\n0.5\n\n1\n\n\u22121.5\n\n\u22121\n\n\u22120.5\n\n0\n\n0.5\n\n1\n\n\u22121.5\n\n\u22121.5\n\n\u22121.5\n\n\u22121.5\n\n\u22121.5\n\n\u22121\n\n\u22120.5\n\n0\n\n0.5\n\n1\n\n\u22121.5\n\n\u22121\n\n\u22120.5\n\n0\n\n0.5\n\n1\n\n\u22121.5\n\n\u22121\n\n\u22120.5\n\n0\n\n0.5\n\n1\n\n\u22121.5\n\n\u22121\n\n\u22120.5\n\n0\n\n0.5\n\n1\n\nC\n\nE\nS\nM\nR\n\n0.1\n\n0.08\n\n0.06\n\n0.04\n\n0.02\n\n0\n\n0\n\n0.13\n\n0.12\n\n0.11\n\n0.1\n\n0.09\n\n0.08\n\n0.07\n\n0.06\n\nE\nS\nM\nR\n\n150\n\n0.05\n\n0\n\n10\n\n50\n\n100\n\ntime step n\n\n20\n\n30\n\n40\n\n50\n\n60\n\ntime step n\n\nOSU running man motion capture data. A: we use 217 datapoints for training LELVM\nFigure 1:\n(with added noise) and for tracking. Row 1: tracking in the 2D latent space. The contours (very tight\nin this sequence) are the posterior probability. Row 2: perspective-projection-based observations\nwith occlusions. 
Row 3: each quadruplet (a, a′, b, b′) shows the true pose of the running man from front and side views (a, b), and the pose reconstructed by tracking with our model (a′, b′). B: we use the first running cycle for training LELVM and the second cycle for tracking. C: RMSE errors for each frame, for the tracking of A (left plot) and B (middle plot), normalised so that 1 equals the height of the stick man: $\mathrm{RMSE}(n) = \bigl(\frac{1}{M}\sum_{j=1}^{M} \|\mathbf{y}_{nj} - \hat{\mathbf{y}}_{nj}\|^2\bigr)^{1/2}$ over all 3D locations of the M markers, i.e., a comparison of the reconstructed stick man $\hat{\mathbf{y}}_n$ with the ground-truth stick man $\mathbf{y}_n$. Right plot: multimodal posterior distribution in pose space for the model of A (frame 42).

Comparison with PCA and GPLVM (fig. 3): for these models, the tracker uses the same GMSPPF settings as for LELVM (number of particles, initialisation, random-walk dynamics, etc.) but with the mapping y = f(x) provided by GPLVM or PCA, and with a uniform prior p(x) in latent space (since neither GPLVM nor the non-probabilistic PCA provides one). The LELVM tracker uses both its f(x) and its latent-space prior p(x), as discussed. All methods use a 2D latent space. We ensured the best possible training of GPLVM by model selection based on multiple runs. For PCA, the latent space looks deceptively good, showing non-intersecting loops. However, (1) the individual loops do not collect together as they should (for LELVM they do); (2) worse still, the mapping from 2D to pose space yields a poor observation model. The reason is that the loop in 102-D pose space is nonlinearly bent, and a plane can at best intersect it at a few points, so the tracker often stays put at one of those (typically an "average" standing position), since leaving it would greatly increase the error. Using more latent dimensions would improve this, but as LELVM shows, this is not necessary.
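The normalised per-frame RMSE used above can be sketched as follows (a minimal NumPy version; the array names and shapes are our own assumptions, not part of the paper's code):

```python
import numpy as np

def rmse_per_frame(Y_true, Y_est, height):
    """Per-frame RMSE over M 3D marker locations, normalised by the
    stick man's height (so a value of 1 equals the height).

    Y_true, Y_est: arrays of shape (N, M, 3) -- N frames, M markers.
    height: scalar height of the stick man, in the same units.
    """
    sq = np.sum((Y_true - Y_est) ** 2, axis=2)    # ||y_nj - yhat_nj||^2, shape (N, M)
    return np.sqrt(np.mean(sq, axis=1)) / height  # RMSE(n) for each frame, shape (N,)
```

Averaging over the markers before the square root matches the formula in the caption; dividing by the height gives the dimensionless error plotted in Figure 1C.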
For GPLVM, we found high sensitivity to filter initialisation: the estimates have high variance across runs and are inaccurate ≈ 80% of the time. When it fails, the GPLVM tracker often freezes in latent space, like PCA. When it does succeed, it produces results comparable with LELVM, although somewhat less accurate visually. However, even then GPLVM's latent space consists of continuous chunks spread apart and offset from each other; GPLVM has no incentive to place near each other two x's that map to the same y. This effect, combined with the lack of a data-sensitive, realistic latent-space density p(x), makes GPLVM jump erratically from chunk to chunk, in contrast with LELVM, which smoothly follows the 1D loop. Some GPLVM problems might be alleviated using higher-order dynamics, but our experiments suggest that such modeling sophistication is less

[Figure 2 panels, part 1: frames n = 1, 15, 29, 43, 55, 69 of latent-space tracking.]
[Figure 2 panels, part 2: frames n = 4, 9, 14, 19, 24, 29 of marker-based tracking and 3D reconstructions.]

Figure 2: A: tracking of a video from [6] (turning & walking). We use 220 datapoints (3 full walking cycles) for training LELVM. Row 1: tracking in the 2D latent space. The contours are the estimated posterior probability. Row 2: tracking based on markers. The red dots are the 2D tracks and the green stick man is the 3D reconstruction obtained using our model. Row 3: our 3D reconstruction from a different viewpoint. B: tracking of a person running straight towards the camera. Notice the scale changes and the possible forward-backward ambiguities in the 3D estimates.
We train the LELVM using 180 datapoints (2.5 running cycles); 2D tracks were obtained by manually marking the video. In both A–B the mocap training data was for a person different from the one in the video (with different body proportions and motions), and no ground-truth estimate was available for favourable initialisation.

[Figure 3 panels: tracking in latent space at frame 38 for LELVM, GPLVM and PCA.]

Figure 3: Method comparison, frame 38. PCA and GPLVM map consecutive walking cycles to spatially distinct latent-space regions. Compounded by a data-independent latent prior, the resulting tracker gets easily confused: it jumps across loops and/or remains put, trapped in local optima. In contrast, LELVM is stable and tightly follows a 1D manifold (see videos).

crucial if locality constraints are correctly modeled (as in LELVM). We conclude that, compared to LELVM, GPLVM is significantly less robust for tracking, has much higher training overhead, and lacks some operations (e.g. computing latent conditionals based on partly missing ambient data).

5 Conclusion and future work

We have proposed the use of priors based on the Laplacian Eigenmaps Latent Variable Model (LELVM) for people tracking. LELVM is a probabilistic dimensionality reduction
method that combines the advantages of latent variable models and spectral manifold learning algorithms: a multimodal probability density over latent and ambient variables, globally differentiable nonlinear mappings for reconstruction and dimensionality reduction, no local optima, ability to unfold highly nonlinear manifolds, and good practical scaling to latent spaces of high dimension. LELVM is computationally efficient, simple to learn from sparse training data, and compatible with standard probabilistic trackers such as particle filters. Our results using a LELVM-based probabilistic sigma-point mixture tracker with several real and synthetic human motion sequences show that LELVM provides sufficient constraints for robust operation in the presence of missing, noisy and ambiguous image measurements. Comparisons with PCA and GPLVM show that LELVM is superior in terms of accuracy, robustness and computation time. The objective of this paper was to demonstrate the ability of the LELVM prior in a simple setting using 2D tracks obtained automatically or manually, and single-type motions (running, walking). Future work will explore more complex observation models such as silhouettes; the combination of different motion types in the same latent space (whose dimension will exceed 2); and the exploration of multimodal posterior distributions in latent space caused by ambiguities.

Acknowledgments

This work was partially supported by NSF CAREER award IIS-0546857 (MACP), NSF IIS-0535140 and EC MCEXT-025481 (CS). CMU data: http://mocap.cs.cmu.edu (created with funding from NSF EIA-0196217). OSU data: http://accad.osu.edu/research/mocap/mocap data.htm.

References

[1] M. Á. Carreira-Perpiñán and Z. Lu. The Laplacian Eigenmaps Latent Variable Model. In AISTATS, 2007.

[2] N. R. Howe, M. E. Leventon, and W. T. Freeman. Bayesian reconstruction of 3D human motion from single-camera video.
In NIPS, volume 12, pages 820–826, 2000.

[3] T.-J. Cham and J. M. Rehg. A multiple hypothesis approach to figure tracking. In CVPR, 1999.

[4] M. Brand. Shadow puppetry. In ICCV, pages 1237–1244, 1999.

[5] H. Sidenbladh, M. J. Black, and L. Sigal. Implicit probabilistic models of human motion for synthesis and tracking. In ECCV, volume 1, pages 784–800, 2002.

[6] C. Sminchisescu and A. Jepson. Generative modeling for continuous non-linearly embedded visual inference. In ICML, pages 759–766, 2004.

[7] R. Urtasun, D. J. Fleet, A. Hertzmann, and P. Fua. Priors for people tracking from small training sets. In ICCV, pages 403–410, 2005.

[8] R. Li, M.-H. Yang, S. Sclaroff, and T.-P. Tian. Monocular tracking of 3D human motion with a coordinated mixture of factor analyzers. In ECCV, volume 2, pages 137–150, 2006.

[9] J. M. Wang, D. Fleet, and A. Hertzmann. Gaussian process dynamical models. In NIPS, volume 18, 2006.

[10] R. Urtasun, D. J. Fleet, and P. Fua. Gaussian process dynamical models for 3D people tracking. In CVPR, pages 238–245, 2006.

[11] G. W. Taylor, G. E. Hinton, and S. Roweis. Modeling human motion using binary latent variables. In NIPS, volume 19, 2007.

[12] C. M. Bishop, M. Svensén, and C. K. I. Williams. GTM: The generative topographic mapping. Neural Computation, 10(1):215–234, January 1998.

[13] N. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783–1816, November 2005.

[14] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, June 2003.

[15] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall, 1986.

[16] R. van der Merwe and E. A. Wan.
Gaussian mixture sigma-point particle filters for sequential probabilistic inference in dynamic state-space models. In ICASSP, volume 6, pages 701–704, 2003.

[17] M. Á. Carreira-Perpiñán. Acceleration strategies for Gaussian mean-shift image segmentation. In CVPR, pages 1160–1167, 2006.