{"title": "Conditional Visual Tracking in Kernel Space", "book": "Advances in Neural Information Processing Systems", "page_first": 1249, "page_last": 1256, "abstract": "", "full_text": "Conditional Visual Tracking in Kernel Space\n\nCristian Sminchisescu1;2;3 Atul Kanujia3 Zhiguo Li3 Dimitris Metaxas3\n\n1TTI-C, 1497 East 50th Street, Chicago, IL, 60637, USA\n\n2University of Toronto, Department of Computer Science, Canada\n\n3Rutgers University, Department of Computer Science, USA\n\ncrismin@cs.toronto.edu, fkanaujia,zhli,dnmg@cs.rutgers.edu\n\nAbstract\n\nWe present a conditional temporal probabilistic framework for recon-\nstructing 3D human motion in monocular video based on descriptors en-\ncoding image silhouette observations. For computational ef\ufb01ciency we\nrestrict visual inference to low-dimensional kernel induced non-linear\nstate spaces. Our methodology (kBME) combines kernel PCA-based\nnon-linear dimensionality reduction (kPCA) and Conditional Bayesian\nMixture of Experts (BME) in order to learn complex multivalued pre-\ndictors between observations and model hidden states. This is necessary\nfor accurate, inverse, visual perception inferences, where several proba-\nble, distant 3D solutions exist due to noise or the uncertainty of monoc-\nular perspective projection. Low-dimensional models are appropriate\nbecause many visual processes exhibit strong non-linear correlations in\nboth the image observations and the target, hidden state variables. The\nlearned predictors are temporally combined within a conditional graphi-\ncal model in order to allow a principled propagation of uncertainty. We\nstudy several predictors and empirically show that the proposed algo-\nrithm positively compares with techniques based on regression, Kernel\nDependency Estimation (KDE) or PCA alone, and gives results competi-\ntive to those of high-dimensional mixture predictors at a fraction of their\ncomputational cost. 
We show that the method successfully reconstructs the complex 3D motion of humans in real monocular video sequences.

1 Introduction and Related Work

We consider the problem of inferring 3D articulated human motion from monocular video. This research topic has applications in scene understanding, including human-computer interfaces, markerless human motion capture, entertainment and surveillance. A monocular approach is relevant because in real-world settings the human body parts are rarely completely observed, even when using multiple cameras, due to occlusions from other people or objects in the scene. A robust system necessarily has to deal with incomplete, ambiguous and uncertain measurements. Methods for 3D human motion reconstruction can be classified as generative or discriminative. Both require a state representation, namely a 3D human model with kinematics (joint angles) or shape (surfaces or joint positions), and both use a set of image features as observations for state inference. The computational goal in both cases is the conditional distribution of the model state given image observations.

Generative model-based approaches [6, 16, 14, 13] have been demonstrated to flexibly reconstruct complex unknown human motions and to naturally handle problem constraints. However, it is difficult to construct reliable observation likelihoods due to the complexity of modeling human appearance, which varies widely with clothing and deformation, body proportions or lighting conditions. Besides being somewhat indirect, the generative approach further imposes strict conditional independence assumptions on the temporal observations given the states in order to ensure computational tractability. Due to these factors inference is expensive and produces highly multimodal state distributions [6, 16, 13].
Generative inference algorithms require complex annealing schedules [6, 13] or systematic non-linear search for local optima [16] in order to ensure continued tracking.

These difficulties motivate a complementary class of discriminative algorithms [10, 12, 18, 2] that approximate the state conditional directly, in order to simplify inference. However, inverse, observation-to-state multivalued mappings are difficult to learn (see e.g. fig. 1a), and a probabilistic temporal setting is necessary. In an earlier paper [15] we introduced a probabilistic discriminative framework for human motion reconstruction. Because that method operates in the originally selected state and observation spaces, which can be task-generic, hence redundant and often high-dimensional, inference is more expensive and can be less robust.

Figure 1: (a, Left) Example of 180° ambiguity in predicting 3D human poses from silhouette image features (center). It is essential that multiple plausible solutions (e.g. F1 and F2) are correctly represented and tracked over time. A single-state predictor will either average the distant solutions or zig-zag between them; see also tables 1 and 2. (b, Right) A conditional chain model. The local distributions p(y_t | y_{t-1}, z_t) or p(y_t | z_t) are learned as in fig. 2. For inference, the predicted local state conditional is recursively combined with the filtered prior, c.f. (1).

To summarize, reconstructing 3D human motion in a conditional temporal framework poses the following difficulties: (i) The mapping between temporal observations and states is multivalued (i.e. the local conditional distributions to be learned are multimodal), therefore it cannot be accurately represented using global function approximations. (ii) Human models have multivariate, high-dimensional continuous states of 50 or more joint angles.
The temporal state conditionals are multimodal, which makes efficient Kalman filtering algorithms inapplicable. General inference methods (particle filters, mixtures) have to be used instead, but these are expensive for high-dimensional models (e.g. when reconstructing the motion of several people that operate in a joint state space). (iii) The components of the human state and of the silhouette observation vector exhibit strong correlations, because many repetitive human activities like walking or running have low intrinsic dimensionality. It appears wasteful to work with high-dimensional states of 50+ joint angles, and even if the space were truly high-dimensional, predicting correlated state dimensions independently may still be suboptimal.

In this paper we present a conditional temporal estimation algorithm that restricts visual inference to low-dimensional, kernel-induced state spaces. To exploit correlations among observations and among state variables, we model the local, temporal conditional distributions using ideas from kernel PCA [11, 19] and conditional mixture modeling [7, 5], here adapted to produce multiple probabilistic predictions. The corresponding predictor is referred to as a Conditional Bayesian Mixture of Low-dimensional Kernel-Induced Experts (kBME). By integrating it within a conditional graphical model framework (fig. 1b), we can exploit temporal constraints probabilistically. We demonstrate that this methodology is effective for reconstructing the 3D motion of multiple people in monocular video. Our contribution w.r.t.
[15] is a probabilistic conditional inference framework that operates in a non-linear, kernel-induced low-dimensional state space, together with a set of experiments (on both real and artificial image sequences) showing that the proposed framework compares favorably with powerful predictors based on KDE or PCA, and with the high-dimensional models of [15], at a fraction of their cost.

2 Probabilistic Inference in a Kernel Induced State Space

We work with conditional graphical models with a chain structure [9], as shown in fig. 1b. These have continuous temporal states y_t, t = 1...T, and observations z_t. For compactness, we denote joint states Y_t = (y_1, y_2, ..., y_t) and joint observations Z_t = (z_1, ..., z_t). Learning and inference are based on the local conditionals p(y_t | z_t) and p(y_t | y_{t-1}, z_t), with y_t and z_t being low-dimensional, kernel-induced representations of some initial model having state x_t and observation r_t. We obtain z_t, y_t from r_t, x_t using kernel PCA [11, 19]. Inference is performed in a low-dimensional, non-linear, kernel-induced latent state space (see fig. 1b, fig. 2 and (1)). For display or error reporting, we compute the original conditional p(x|r), or a temporally filtered version p(x_t | R_t), R_t = (r_1, r_2, ..., r_t), using a learned pre-image state map [3].

2.1 Density Propagation for Continuous Conditional Chains

For online filtering, we compute the optimal distribution p(y_t | Z_t) for the state y_t, conditioned on the observations Z_t up to time t. The filtered density can be derived recursively as:

    p(y_t | Z_t) = ∫ p(y_t | y_{t-1}, z_t) p(y_{t-1} | Z_{t-1}) dy_{t-1}    (1)

We compute (1) using a conditional mixture for p(y_t | y_{t-1}, z_t) (a Bayesian mixture of experts, c.f. §2.2) and the prior p(y_{t-1} | Z_{t-1}), each having, say, M components. We integrate the M² pairwise products of Gaussians analytically.
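One filtering step of (1) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: each expert j is assumed to predict y_t = A_j y_{t-1} + b_j (with b_j absorbing the dependence on z_t) with noise covariance Q_j, so each prior component integrates against each expert in closed form; and where the paper clusters the M² means and refines an M-component Kullback-Leibler approximation by gradient descent, this sketch simply keeps the M heaviest components.

```python
import numpy as np

def propagate(weights, means, covs, A, b, Q, gate):
    """One step of (1): an M-component Gaussian prior over y_{t-1} times an
    M-expert linear-Gaussian conditional p(y_t | y_{t-1}, z_t) yields M^2
    product components, integrated analytically, then reduced back to M."""
    M, d = means.shape
    new_w, new_mu, new_cov = [], [], []
    for i in range(M):       # prior component i
        for j in range(M):   # expert j, with precomputed gate weight g(z_t|delta_j)
            new_w.append(weights[i] * gate[j])
            new_mu.append(A[j] @ means[i] + b[j])            # E[y_t]
            new_cov.append(A[j] @ covs[i] @ A[j].T + Q[j])   # Cov[y_t]
    new_w = np.array(new_w)
    new_w /= new_w.sum()
    # Paper: cluster the M^2 means, then refine an M-component KL
    # approximation by gradient descent [15]. Simplified stand-in here:
    # keep the M heaviest components and renormalize.
    keep = np.argsort(new_w)[-M:]
    w = new_w[keep]
    return w / w.sum(), np.array(new_mu)[keep], np.array(new_cov)[keep]
```

The closed-form product/integration step is exact for linear-Gaussian experts; only the reduction from M² back to M components is approximate.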
The means of the expanded posterior are clustered and the centers are used to initialize a reduced M-component Kullback-Leibler approximation that is refined using gradient descent [15]. The propagation rule (1) is similar to the one used for discrete state labels [9], but here we work with multivariate continuous state spaces and represent the local multimodal state conditionals using kBME (fig. 2), not log-linear models [9] (these would require intractable normalization). This complex continuous model rules out inference based on Kalman filtering or dynamic programming [9].

2.2 Learning Bayesian Mixtures over Kernel Induced State Spaces (kBME)

In order to model conditional mappings between low-dimensional non-linear spaces we rely on kernel dimensionality reduction and conditional mixture predictors. The authors of KDE [19] propose a powerful structured unimodal predictor, which works by decorrelating the output using kernel PCA and learning a ridge regressor between the input and each decorrelated output dimension.

Our procedure is also based on kernel PCA, but takes into account the structure of the studied visual problem, where both inputs and outputs are likely to be low-dimensional and the mapping between them multivalued. The output variables x_i are projected onto the column vectors of the principal space in order to obtain their principal coordinates y_i.

[Figure 2 diagram: r ∈ R ⊂ R^r maps through Φ_r to Φ_r(r) ⊂ F_r, then through kPCA to z ∈ P(F_r); likewise x ∈ X ⊂ R^x maps through Φ_x and kPCA to y ∈ P(F_x); p(y|z) relates the two reduced spaces; x ≈ PreImage(y), so p(x|r) ≈ p(x|y).]

Figure 2: The learned low-dimensional predictor, kBME, for computing p(x|r) ≡ p(x_t|r_t) for all t. (We similarly learn p(x_t | x_{t-1}, r_t), with input (x, r) instead of r; here we illustrate only p(x|r) for clarity.)
The input r and the output x are decorrelated using kernel PCA to obtain z and y respectively. The kernels used for the input and output are Φ_r and Φ_x, with induced feature spaces F_r and F_x, respectively. Their principal subspaces obtained by kernel PCA are denoted P(F_r) and P(F_x). A conditional Bayesian mixture of experts p(y|z) is learned using the low-dimensional representation (z, y). Using learned local conditionals of the form p(y_t | z_t) or p(y_t | y_{t-1}, z_t), temporal inference can be performed efficiently in a low-dimensional kernel-induced state space (see e.g. (1) and fig. 1b). For visualization and error measurement, the filtered density, e.g. p(y_t | Z_t), can be mapped back to p(x_t | R_t) using the pre-image, c.f. (3).

A similar procedure is performed on the inputs r_i to obtain z_i. In order to relate the reduced feature spaces of z and y (P(F_r) and P(F_x)), we estimate a probability distribution over mappings from training pairs (z_i, y_i). We use a conditional Bayesian mixture of experts (BME) [7, 5] in order to account for ambiguity when mapping similar, possibly identical reduced feature inputs to very different feature outputs, as is common in our problem (fig. 1a). This gives a model that is a conditional mixture of low-dimensional kernel-induced experts (kBME):

    p(y|z) = Σ_{j=1}^{M} g(z|δ_j) N(y | W_j z, Σ_j)    (2)

where g(z|δ_j) is a softmax function parameterized by δ_j, and (W_j, Σ_j) are the parameters and the output covariance of expert j, here a linear regressor. As in many Bayesian settings [17, 5], the weights of the experts and of the gates, W_j and δ_j, are controlled by hierarchical priors, typically zero-mean Gaussians whose inverse variance hyperparameters are controlled by a second level of Gamma distributions.
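Evaluating the predictive density (2) for given parameters can be sketched as below. This is a minimal illustration only: learning W_j, δ_j and the hyperparameters (the double-loop EM and hierarchical priors) is not shown, and all array shapes are assumptions for the example.

```python
import numpy as np

def kbme_density(y, z, delta, W, Sigma):
    """Evaluate (2): p(y|z) = sum_j g(z|delta_j) N(y | W_j z, Sigma_j).
    delta: (M, d_z) gate parameters; W: (M, d_y, d_z); Sigma: (M, d_y, d_y)."""
    logits = delta @ z                            # gate scores, shape (M,)
    g = np.exp(logits - logits.max())             # numerically stable softmax
    g /= g.sum()
    d_y = y.shape[0]
    p = 0.0
    for j in range(len(g)):
        mu = W[j] @ z                             # expert j's linear prediction
        diff = y - mu
        cov = Sigma[j]
        norm = np.sqrt(((2 * np.pi) ** d_y) * np.linalg.det(cov))
        p += g[j] * np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm
    return p
```

Because the gates depend on z, different inputs activate different experts, which is what lets the mixture represent the multivalued observation-to-state mapping of fig. 1a.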
We learn this model using a double-loop EM and employ ML-II type approximations [8, 17] with greedy (weight) subset selection [17, 15].

Finally, the kBME algorithm requires the computation of pre-images in order to recover the state distribution x from its image y ∈ P(F_x). This is a closed-form computation for polynomial kernels of odd degree; for more general kernels, optimization or learning (regression-based) methods are necessary [3]. Following [3, 19], we use a sparse Bayesian kernel regressor to learn the pre-image from training data (x_i, y_i):

    p(x|y) = N(x | A Φ_y(y), Ω)    (3)

with parameters and covariances (A, Ω). Since temporal inference is performed in the low-dimensional kernel-induced state space, the pre-image function needs to be calculated only for visualizing results or for error reporting. Propagating the result from the reduced feature space P(F_x) to the output space X produces a Gaussian mixture with M elements, having coefficients g(z|δ_j) and components N(x | A Φ_y(W_j z), A J_{Φ_y} Σ_j J_{Φ_y}^T A^T + Ω), where J_{Φ_y} is the Jacobian of the mapping Φ_y.

3 Experiments

We run experiments on both real image sequences (fig. 5 and fig. 6) and on sequences where silhouettes were artificially rendered. The prediction error is reported in degrees (for mixtures of experts, w.r.t. the most probable expert, but see also fig. 4a) and normalized per joint angle, per frame. The models are learned using standard cross-validation. Pre-images are learned using kernel regressors and have an average error of 1.7°.

Training Set and Model State Representation: For training we gather pairs of 3D human poses together with their image projections, here silhouettes, using the graphics package Maya.
We use realistically rendered computer graphics human surface models, which we animate using human motion capture data [1]. Our original human representation (x) is based on articulated skeletons with spherical joints and has 56 skeletal d.o.f., including global translation. The database consists of 8000 samples of human activities including walking, running, turns, jumps, gestures in conversations, quarreling and pantomime.

Image Descriptors: We work with image silhouettes obtained using statistical background subtraction (with foreground and background models). Silhouettes are informative for pose estimation, although prone to ambiguities (e.g. the left/right limb assignment in side views) or occasional lack of observability of some d.o.f. (e.g. 180° ambiguities in the global azimuthal orientation for frontal views, e.g. fig. 1a). These are multiplied by intrinsic forward/backward monocular ambiguities [16]. As observations r, we use shape contexts extracted on the silhouette [4] (5 radial bins, 12 angular bins, size range 1/8 to 3 on a log scale).

The features are computed at different scales and sizes for points sampled on the silhouette. To work in a common coordinate system, we cluster all features in the training set into K = 50 clusters. To compute the representation of a new shape feature (a point on the silhouette), we 'project' it onto the common basis by (inverse-distance) weighted voting into the cluster centers. To obtain the representation (r) for a new silhouette, we regularly sample 200 points on it and accumulate all their feature vectors into a feature histogram. Because the representation uses overlapping features of the observation, the elements of the descriptor are not independent; however, a conditional temporal framework (fig. 1b) flexibly accommodates this.

For experiments, we use Gaussian kernels for the joint angle feature space and dot-product kernels for the observation feature space.
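As an illustration of this reduction step (not the authors' code), the sketch below uses scikit-learn's KernelPCA as a stand-in for the kPCA mappings, with random arrays in place of the real joint-angle vectors and shape-context histograms; the dimensions (56-d states, 6-d and 25-d reduced spaces) follow the text, while the kernel parameters `gamma` and `alpha` are arbitrary. KernelPCA's learned inverse transform plays a role analogous to the kernel-regressor pre-image map of (3).

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Hypothetical stand-ins for the training data: x are 56-d joint-angle
# vectors, r are silhouette shape-context histograms (random here).
rng = np.random.default_rng(0)
x_train = rng.normal(size=(200, 56))    # states
r_train = rng.normal(size=(200, 300))   # observations

# Gaussian kernel on the state space, 6 latent dimensions; the fitted
# inverse transform acts as a learned pre-image map back to x.
kpca_x = KernelPCA(n_components=6, kernel="rbf", gamma=1e-2,
                   fit_inverse_transform=True, alpha=1e-3)
y_train = kpca_x.fit_transform(x_train)

# Dot-product (linear) kernel on the observation space, 25 dimensions.
kpca_r = KernelPCA(n_components=25, kernel="linear")
z_train = kpca_r.fit_transform(r_train)

# (z_train, y_train) pairs are what the conditional mixture p(y|z) is
# fit on; the pre-image recovers states for display / error reporting.
x_hat = kpca_x.inverse_transform(y_train)
```

Working with the 6-d y and 25-d z instead of the 56-d x and high-dimensional r is what makes the temporal integration in (1) cheap.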
We learn the state conditionals p(y_t | z_t) and p(y_t | y_{t-1}, z_t) using 6 dimensions for the joint angle kernel-induced state space and 25 dimensions for the observation-induced feature space, respectively. In fig. 3b we show an evaluation of the efficacy of our kBME predictor for different dimensions of the joint angle kernel-induced state space (the observation feature space dimension is here 50). On the analyzed dancing sequence, which involves complex motions of the arms and legs, the non-linear model significantly outperforms alternative PCA methods and gives good predictions for compact, low-dimensional models.¹

In tables 1 and 2, as well as fig. 4, we perform quantitative experiments on artificially rendered silhouettes. 3D ground truth joint angles are available, which allows a more systematic evaluation.

¹ Running times: On a Pentium 4 PC (3 GHz, 2 GB RAM), a full-dimensional BME model with 5 experts takes 802s to train p(x_t | x_{t-1}, r_t), whereas a kBME (including the pre-image) takes 95s to train p(y_t | y_{t-1}, z_t). The prediction time is 13.7s for BME and 8.7s (including the pre-image cost of 1.04s) for kBME. The integration in (1) takes 2.67s for BME and 0.31s for kBME. The speed-up for kBME is significant and likely to increase for original models of higher dimensionality.

[Figure 3 plots: (a) number of clusters (log scale) vs. degree of multimodality (1-8); (b) prediction error (log scale) vs. number of dimensions (0-60) for kBME, KDE-RVM, PCA-BME and PCA-RVM.]

Figure 3: (a, Left) Analysis of 'multimodality' for a training set. The input z_t dimension is 25, the output y_t dimension is 6, both reduced using kPCA. We cluster independently in (y_{t-1}, z_t) and y_t using many clusters (2100) to simulate small input perturbations, and we histogram the y_t clusters falling within each cluster in (y_{t-1}, z_t).
This gives intuition about the degree of ambiguity in modeling p(y_t | y_{t-1}, z_t) under small perturbations of the input. (b, Right) Evaluation of dimensionality reduction methods for an artificial dancing sequence (models trained on 300 samples). kBME is our model (§2.2), whereas KDE-RVM is a KDE model learned with a Relevance Vector Machine (RVM) [17] feature space map. PCA-BME and PCA-RVM are models where the mappings between feature spaces (obtained using PCA) are learned using a BME and an RVM, respectively. The non-linearity is significant: kernel-based methods outperform PCA and give low prediction error for 5-6d models.

Notice that the kernelized low-dimensional models generally outperform the PCA ones. At the same time, they give results competitive with those of high-dimensional BME predictors, while being lower-dimensional and therefore significantly less expensive for inference, e.g. in the integral in (1).

In fig. 5 and fig. 6 we show human motion reconstruction results for two real image sequences. Fig. 5 shows the good-quality reconstruction of a person performing an agile jump. (Given the missing observations in a side view, 3D inference for the occluded body parts would not be possible without using prior knowledge!) For this sequence we do inference using conditionals having 5 modes and reduced 6d states. We initialize tracking using p(y_t | z_t), whereas for inference we use p(y_t | y_{t-1}, z_t) within (1). In the second sequence, in fig. 6, we simultaneously reconstruct the motion of two people mimicking domestic activities, namely washing a window and picking up an object. Here we do inference over a product, 12-dimensional state space consisting of the joint 6d states of the two people. We obtain good 3D reconstruction results using only 5 hypotheses.
Notice, however, that the results are not perfect: there are small errors in the elbow and the bending of the knee for the subject on the left, and in the wrist orientations for the subject on the right. This reflects the bias of our training set.

                       KDE-RR    RVM   KDE-RVM    BME   kBME
  Walk and turn          4.69  10.46      7.57   4.27   4.95
  Conversation           4.79   7.95      6.31   4.15   4.96
  Run and turn left      4.92   5.22      6.25   5.01   5.02

Table 1: Comparison of average joint angle prediction error (in degrees) for different models. All kPCA-based models use 6 output dimensions. Testing is done on 100 video frames per sequence; the inputs are artificially generated silhouettes not in the training set, and 3D joint angle ground truth is used for evaluation. KDE-RR is a KDE model with ridge regression (RR) for the feature space mapping; KDE-RVM uses an RVM. BME uses a Bayesian mixture of experts with no dimensionality reduction. kBME is our proposed model. kPCA-based methods use kernel regressors to compute pre-images.

[Figure 4 plots: two histograms of frequency of being closest to ground truth, (a) over expert number 1-5 for the expert prediction, (b) over the current expert 1-5 for the 1st-5th most probable previous outputs.]

Figure 4: (a, Left) Histogram showing the accuracy of the ranked expert predictors: how often the expert ranked k-th most probable by the model (horizontal axis) is closest to the ground truth. The model is consistent (the most probable expert is indeed most frequently the most accurate), but occasionally less probable experts are better.
(b, Right) Histograms showing the dynamics of p(y_t | y_{t-1}, z_t), i.e. how the probability mass is redistributed among experts between two successive time steps, in a conversation sequence.

                       KDE-RR    RVM   KDE-RVM    BME   kBME
  Walk and turn back     6.9    7.15      7.59   3.72   3.6
  Run and turn          16.8   16.08     17.7    8.01   8.2

Table 2: Joint angle prediction error (in degrees) computed for two complex sequences with walks, runs and turns, hence more ambiguity (100 frames). Models have 6 state dimensions. Unimodal predictors average competing solutions. kBME has significantly lower error.

Figure 5: Reconstruction of a jump (selected frames). Top: original image sequence. Middle: extracted silhouettes. Bottom: 3D reconstruction seen from a synthetic viewpoint.

Figure 6: Reconstructing the activities of 2 people operating in a 12-d state space (each person has their own 6d state). Top: original image sequence. Bottom: 3D reconstruction seen from a synthetic viewpoint.

4 Conclusion

We have presented a probabilistic framework for conditional inference in latent kernel-induced low-dimensional state spaces. Our approach has the following properties: (a) it accounts for non-linear correlations among input or output variables by using kernel non-linear dimensionality reduction (kPCA); (b) it learns probability distributions over mappings between low-dimensional state spaces using conditional Bayesian mixtures of experts, as required for accurate prediction, so that in the resulting low-dimensional kBME predictor the ambiguities and multiple solutions common in visual, inverse perception problems are accurately represented; (c) it works in a continuous, conditional temporal probabilistic setting and offers a formal management of uncertainty. We show comparisons demonstrating that the proposed approach outperforms regression, PCA or KDE alone for reconstructing 3D human motion in monocular video.
In future work we will investigate scaling aspects for large training sets and alternative structured prediction methods.

References

[1] CMU Human Motion DataBase. Online at http://mocap.cs.cmu.edu/search.html, 2003.
[2] A. Agarwal and B. Triggs. 3D human pose from silhouettes by Relevance Vector Regression. In CVPR, 2004.
[3] G. Bakir, J. Weston, and B. Scholkopf. Learning to find pre-images. In NIPS, 2004.
[4] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. PAMI, 24, 2002.
[5] C. Bishop and M. Svensen. Bayesian mixtures of experts. In UAI, 2003.
[6] J. Deutscher, A. Blake, and I. Reid. Articulated Body Motion Capture by Annealed Particle Filtering. In CVPR, 2000.
[7] M. Jordan and R. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181-214, 1994.
[8] D. MacKay. Bayesian interpolation. Neural Computation, 4(5):720-736, 1992.
[9] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In ICML, 2000.
[10] R. Rosales and S. Sclaroff. Learning Body Pose Via Specialized Maps. In NIPS, 2002.
[11] B. Schölkopf, A. Smola, and K. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.
[12] G. Shakhnarovich, P. Viola, and T. Darrell. Fast Pose Estimation with Parameter Sensitive Hashing. In ICCV, 2003.
[13] L. Sigal, S. Bhatia, S. Roth, M. Black, and M. Isard. Tracking Loose-limbed People. In CVPR, 2004.
[14] C. Sminchisescu and A. Jepson. Generative Modeling for Continuous Non-Linearly Embedded Visual Inference. In ICML, pages 759-766, Banff, 2004.
[15] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas. Discriminative Density Propagation for 3D Human Motion Estimation. In CVPR, 2005.
[16] C. Sminchisescu and B. Triggs.
Kinematic Jump Processes for Monocular 3D Human Tracking. In CVPR, volume 1, pages 69-76, Madison, 2003.
[17] M. Tipping. Sparse Bayesian learning and the Relevance Vector Machine. JMLR, 2001.
[18] C. Tomasi, S. Petrov, and A. Sastry. 3D tracking = classification + interpolation. In ICCV, 2003.
[19] J. Weston, O. Chapelle, A. Elisseeff, B. Scholkopf, and V. Vapnik. Kernel dependency estimation. In NIPS, 2002.