{"title": "Pose-Sensitive Embedding by Nonlinear NCA Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 2280, "page_last": 2288, "abstract": "This paper tackles the complex problem of visually matching people in similar pose but with different clothes, background, and other appearance changes. We achieve this with a novel method for learning a nonlinear embedding based on several extensions to the Neighborhood Component Analysis (NCA) framework. Our method is convolutional, enabling it to scale to realistically-sized images. By cheaply labeling the head and hands in large video databases through Amazon Mechanical Turk (a crowd-sourcing service), we can use the task of localizing the head and hands as a proxy for determining body pose. We apply our method to challenging real-world data and show that it can generalize beyond hand localization to infer a more general notion of body pose. We evaluate our method quantitatively against other embedding methods. We also demonstrate that real-world performance can be improved through the use of synthetic data.", "full_text": "Pose-Sensitive Embedding\n\nby Nonlinear NCA Regression\n\nGraham W. Taylor, Rob Fergus, George Williams, Ian Spiro and Christoph Bregler\n\nCourant Institute of Mathematics, New York University\n\ngwtaylor,fergus,spiro,bregler@cs.nyu.edu\n\nNew York, USA 10003\n\nAbstract\n\nThis paper tackles the complex problem of visually matching people in similar\npose but with different clothes, background, and other appearance changes. We\nachieve this with a novel method for learning a nonlinear embedding based on\nseveral extensions to the Neighborhood Component Analysis (NCA) framework.\nOur method is convolutional, enabling it to scale to realistically-sized images. 
By\ncheaply labeling the head and hands in large video databases through Amazon\nMechanical Turk (a crowd-sourcing service), we can use the task of localizing\nthe head and hands as a proxy for determining body pose. We apply our method\nto challenging real-world data and show that it can generalize beyond hand\nlocalization to infer a more general notion of body pose. We evaluate our method\nquantitatively against other embedding methods. We also demonstrate that\nreal-world performance can be improved through the use of synthetic data.\n\n1 Introduction\nDetermining the pose of a human body from one or more images is a central problem in Computer\nVision. The complex, multi-jointed nature of the body makes the determination of pose challenging,\nparticularly in natural settings where ambiguous and unusual configurations may be observed. The\nability to localize the hands is particularly important: they provide tight constraints on the layout of\nthe upper body, yielding a strong cue as to the action and intent of a person.\n\nFigure 1: Query image (in left column) and the eight nearest neighbours found by our method.\nDistance in the learned embedded space is shown bottom right. Matches are based on the location\nof the hands, and more generally body pose - not the individual or the background.\n\nA huge range of techniques, both parametric and non-parametric, exist for inferring body pose from\n2D images and 3D datasets [10, 39, 4, 28, 33, 8, 3, 6, 11]. We propose a non-parametric approach to\nestimating body pose by localizing the hands using a parametric, nonlinear multi-layered embedding\nof the raw pixel images. 
Unlike many other metric learning approaches, ours is designed for use with\nreal-world images, having a convolutional architecture that scales gracefully to large images and is\ninvariant to local geometric distortions.\nOur embedding, trained on both real and synthetic data, is a functional mapping that projects images\nwith similar head and hand positions to lie close-by in a low-dimensional output space. Ef\ufb01cient\nnearest-neighbour search can then be performed in this space to \ufb01nd images in a large training\ncorpus that have similar pose. Speci\ufb01cally for this task, we have designed an interface to obtain\nand verify head and hand labels for thousands of frames through Amazon Mechanical Turk with\nminimal user intervention. We \ufb01nd that our method is able to cope with the terse and noisy labels\nprovided by crowd-sourcing. It succeeds in generalizing to body and hand pose when such cues are\nnot explicitly provided in the labels (see Fig. 1).\n\n2 Related work\nOur application domain is related to several approaches in the computer vision literature that propose\nhand or body pose tracking. Many techniques rely on sliding-window part detectors based on color\nand other features applied to controlled recording conditions ([10, 39, 4, 28] to name a few, we\nrefer to [32] for a complete survey). In our domain, hands might only occupy a few pixels, and the\nonly body-part that can reliably be detected is the human face ([26, 13]). Many techniques have\nbeen proposed that extract, learn, or reason over entire body features. Some use a combination of\nlocal detectors and structural reasoning (see [33] for coarse tracking and [8] for person-dependent\ntracking). In a similar spirit, more general techniques using pictorial structures [3, 12, 35], \u201cposelets\u201d\n[6], and other part-models [11] have received increased attention. 
An entire new stream of kinematic\nmodel-based techniques based on the HumanEva dataset has been proposed [37], but this area differs\nfrom our domain in that the images considered are of higher quality and less cluttered.\nMore closely related to our task are nearest-neighbour and locally-weighted regression-based tech-\nniques. Some extract \u201cshape-context\u201d edge based histograms from the human body [25, 1] or just\nsilhouette features [15]. Shakhnarovich et al. [36] use HOG [9] features and boosting for learn-\ning a parameter sensitive hash function. All these approaches rely on good background subtraction\nor recordings with clear backgrounds. Our domain contains clutter, lighting variations and low\nresolution such that it is impossible to separate body features from background successfully. We\ninstead learn relevant features directly from pixels (instead of pre-coded edge or gradient histogram\nfeatures), and discover implicitly background invariance from training data.\nSeveral other works [36, 9, 4, 15] have used synthetically created data as a training set. We show in\nthis paper several experiments with challenging real video (with crowd-sourced Amazon Mechanical\nTurk labels), synthetic training data, and hybrid datasets. Our \ufb01nal system (after training) is always\napplied to the cluttered non-background subtracted real video input without any labels.\nOur technique is also related to distance metric learning, an important area of machine learning\nresearch, especially due to recent interest in analyzing complex high-dimensional data. A subset\nof approaches for dimensionality reduction [17, 16] implicitly learn a distance metric by learning\na function (mapping) from high-dimensional (i.e. pixel) space to low-dimensional \u201cfeature\u201d space\nsuch that perceptually similar observations are mapped to nearby points on a manifold. 
Neighbour-\nhood Components Analysis (NCA) [14] proposes a solution where the transformation from input\nto feature space is linear and the distance metric is Euclidean. NCA learns the transformation that\nis optimal for performing KNN in the feature space. NCA has also been recently extended to the\nnonlinear case [34] using MNIST class labels and to linear 1D regression for reinforcement learning\n[20]. Dimensionality Reduction by Learning an Invariant Mapping (DrLIM) [16] also learns a non-\nlinear mapping. Like NCA, DrLIM uses class neighbourhood structure to drive the optimization:\nobservations with the same class label are driven to be close-by in feature space. Our approach\nis also inspired by recent hashing methods [2, 34, 38], although those techniques are restricted to\nbinary codes for fast lookup.\n3 Learning an invariant mapping by nonlinear embedding\nWe \ufb01rst discuss Neighbourhood Components Analysis [14] and its nonlinear variants. We then pro-\npose an alternative objective function optimized for performing nearest neighbour (NN) regression\nrather than classi\ufb01cation. Next, we describe our convolutional architecture which maps images from\n\n2\n\n\fhigh-dimensional to low-dimensional space. Finally we introduce a related but different objective\nfor our model based on DrLIM.\n3.1 Neighbourhood Components Analysis\n\nNCA (both linear and nonlinear) and DrLIM do not presuppose the existence of a meaningful and\ncomputable distance metric in the input space. They only require that neighbourhood relationships\nbe de\ufb01ned between training samples. This is well-suited for learning a metric for non-parametric\nclassi\ufb01cation (e.g. KNN) on high-dimensional data. If the original data does not contain discrete\nclass labels, but real-valued labels (e.g. 
pose information for images of people) one alternative is to\ndefine neighbourhoods based on the distance in the real-valued label space and proceed as usual.\nHowever, if classification is not our ultimate goal, we may wish to exploit the \u201csoft\u201d nature of the\nlabels and use an alternative objective (i.e. one that does not optimize KNN performance).\nSuppose we are given a set of $N$ labeled training cases $\\{x_i, y_i\\}$, $i = 1, 2, \\ldots, N$, where $x_i \\in \\mathbb{R}^D$\nand $y_i \\in \\mathbb{R}^L$. Each training point, $i$, selects another point, $j$, as its neighbour with some probability\ndefined by normalizing distances in the transformed feature space [14]:\n\n$$p_{ij} = \\frac{\\exp(-d_{ij}^2)}{\\sum_{k \\neq i} \\exp(-d_{ik}^2)}, \\qquad p_{ii} = 0, \\qquad d_{ij} = \\|z_i - z_j\\|_2, \\tag{1}$$\n\nwhere we use a Euclidean distance metric $d_{ij}$ and $z_i = f(x_i|\\theta)$ is the mapping (parametrized\nby $\\theta$) from input space to feature space. For NCA this is typically linear, but it can be extended\nto be nonlinear through back-propagation (for example in [34] it is a multi-layer neural network).\nNCA assumes that the labels, $y_i$, are discrete, $y_i \\in \\{1, 2, \\ldots, C\\}$, rather than real-valued and seeks to\nmaximize the expected number of correctly classified points on the training data, which is equivalent to minimizing:\n\n$$L_{NCA} = -\\sum_{i=1}^{N} \\sum_{j: y_i = y_j} p_{ij}. \\tag{2}$$\n\nThe parameters are found by minimizing $L_{NCA}$ with respect to $\\theta$; back-propagating in the case of\na multi-layer parametrization. Instead of seeking to optimize KNN classification performance, we\ncan use the NCA regression (NCAR) objective [20]:\n\n$$L_{NCAR} = \\sum_{i=1}^{N} \\sum_{j \\neq i} p_{ij} \\|y_i - y_j\\|_2^2. \\tag{3}$$\n\nIntuitively, this states that if, with high probability, $i$ and $j$ are neighbours in feature space, then\nthey should also lie close-by in label space. 
While we use the Euclidean distance in label space, our\napproach generalizes to other metrics which may be more appropriate for a different domain.\nKeller et al. [20] consider the linear case of NCAR, where $\\theta$ is a weight matrix and $y$ is a scalar\nrepresenting Bellman error to map states with similar Bellman errors close together. Similar to\nNCA, we can extend this objective to the nonlinear, multi-layer case. We simply need to compute\nthe derivative of $L_{NCAR}$ with respect to the output of the mapping, $z_i$, and backpropagate through\nthe remaining layers of the network. The gradient can be computed efficiently as:\n\n$$\\frac{\\partial L_{NCAR}}{\\partial z_i} = -2 \\sum_{j \\neq i} (z_i - z_j) \\left[ p_{ij} \\left( y_{ij}^2 - \\delta_i \\right) + p_{ji} \\left( y_{ij}^2 - \\delta_j \\right) \\right], \\tag{4}$$\n\nwhere we use $y_{ij}^2 = \\|y_i - y_j\\|_2^2$ and $\\delta_i = \\sum_j p_{ij} y_{ij}^2$. See the supplementary material for details.\n3.2 Convolutional architectures\nAs [34] points out, nonlinear NCA was originally proposed in [14] but with the exception of a\nmodest success with a two-layer network in extracting 2D codes that explicitly represented the\nsize and orientation of face images, attempts to extract more complex properties using multi-layer\nfeature extraction were less successful. This was due, in part, to the difficulty in training multi-layer\nnetworks and the fact that many data pairs are required to fit the large number of network parameters.\nThough both [34] and [38] were successful in learning a multi-layer nonlinear mapping of the data,\nthere is still a fundamental limitation of using fully-connected networks that must be addressed.\nSuch an architecture can only be applied to relatively small image patches (typically less than 64\n\u00d7 64 pixels), because it does not scale well with the size of the input. 
Salakhutdinov and Hinton\nescaped this issue by training only on the MNIST dataset (28 \u00d7 28 images of digits) and Torralba\net al. used a global image descriptor [29] as an initial feature representation rather than pixels.\nHowever, to avoid such hand-crafted features which may not be suitable for the task, and to scale to\nrealistically sized inputs, models should take advantage of the pictorial nature of the image input. This is\naddressed by convolutional architectures [21], which exploit the fact that salient motifs can appear\nanywhere in the image. By employing successive stages of weight-sharing and feature-pooling,\ndeep convolutional architectures can achieve stable latent representations at each layer, that preserve\nlocality, provide invariance to small variations of the input, and drastically reduce the number of free\nparameters.\nOur proposed method, which we call Convolutional NCA regression (C-NCAR), is based on a\nstandard convolutional architecture [21, 18]: alternating convolution and subsampling layers followed\nby a single fully-connected layer (see Fig. 2). It differs from typical convolutional nets in the\nobjective function with which it is trained (i.e. minimizing Eq. 3). Because the loss is defined on pairs of\nexamples, we use a siamese network [5]. Pairs of frames are processed by separate networks with\nequal weights. The loss is then computed on the output of both networks. Hadsell et al. [16] also\nuse a siamese convolutional network with yet a different objective. They use their method for\nvisualization but not any discriminative task. Mobahi et al. [24] have also recently used a convolutional\nsiamese network in which temporal coherence between pairs of frames drives the regularization of\nthe model rather than the objective. More details of training our network are given in Sec. 4.\n\nFigure 2: Convolutional NCA regression (C-NCAR). 
Each image is processed by two convolutional\nand subsampling layers and one fully-connected layer. A loss (Eq. 3) computed on the distance\nbetween resulting codes drives parameter learning.\n\n3.3 Adding a contrastive loss function\nLike NCA, DrLIM assumes a discrete notion of similarity or dissimilarity between data pairs, $x_i$\nand $x_j$. It defines both a \u201csimilarity\u201d loss, $L_S$, which penalizes similar points which are far apart\nin code space, and a \u201cdissimilarity\u201d loss, $L_D$, which penalizes dissimilar points which lie within a\nuser-defined margin, $m$, of each other:\n\n$$L_S(x_i, x_j) = \\frac{1}{2} d_{ij}^2, \\qquad L_D(x_i, x_j) = \\frac{1}{2} \\{\\max(0, m - d_{ij})\\}^2, \\tag{5}$$\n\nwhere $d_{ij}$ is given by Eq. 1. Let $\\gamma_{ij}$ be an indicator such that $\\gamma_{ij} = 1$ if $x_i$ and $x_j$ are deemed\nsimilar and $\\gamma_{ij} = 0$ if $x_i$ and $x_j$ are deemed dissimilar. For example, if labels $y_i$ are discrete,\n$y_i \\in \\{1, 2, \\ldots, C\\}$, then $\\gamma_{ij} = 1$ for $y_i = y_j$ and $\\gamma_{ij} = 0$ otherwise. The total loss is defined by:\n\n$$L_{DrLIM} = \\sum_{i=1}^{N} \\sum_{j \\neq i} \\left[ \\gamma_{ij} L_S(x_i, x_j) + (1 - \\gamma_{ij}) L_D(x_i, x_j) \\right]. \\tag{6}$$\n\nWhen faced with real-valued labels, $y_i$, we can avoid explicitly defining similarity and dissimilarity\n(e.g. via thresholding) by defining a \u201csoft\u201d notion of similarity:\n\n$$\\hat{\\gamma}_{ij} = \\frac{\\exp(-\\|y_i - y_j\\|_2^2)}{\\sum_{k \\neq i} \\exp(-\\|y_i - y_k\\|_2^2)}. \\tag{7}$$\n\nReplacing the indicator variables $\\gamma_{ij}$ with $\\hat{\\gamma}_{ij}$ in Eq. 
6 yields what we call the soft DrLIM loss.\n\n[Figure 2 diagram. Input: 128\u00d7128; Layer 1: 16\u00d7120\u00d7120 (convolutions, tanh(), abs()); Layer 2: 16\u00d724\u00d724 (average pooling); Layer 3: 32\u00d716\u00d716 (convolutions, tanh(), abs()); Layer 4: 32\u00d74\u00d74 (average pooling); Output: 32\u00d71\u00d71 (fully connected). The distance d(zi, zj) is computed between the codes of the two inputs xi and xj.]\n\n4 Experimental results\nWe evaluate our approach in real and synthetic environments by performing 1-nearest neighbour\n(NN) regression using a variety of standard and learned metrics described below. For every query\nimage in a test set, we compute its distance (under the metric) to each of the training points in a\ndatabase. We then copy the label (e.g. (x,y) position of the head and hands) of the neighbour to the\nquery example. For evaluation, we compare the ground-truth label of the query to the label of the\nnearest neighbour. Errors are reported in terms of mean pixel error over each query and each marker:\nthe head (if it is tracked) and each hand. Errors are absolute with respect to the original image size.\nWe acknowledge that improved results could potentially be obtained by using more than one\nneighbour or with more sophisticated techniques such as locally weighted regression [36]. However, we\nfocus on learning a good metric for performing this task rather than the regression problem. The\napproaches compared are:\nPixel distance can be used to find nearest neighbours though it is not practical in real situations due\nto the intractability of computing distances in such a high-dimensional space.\nGIST descriptors [29] are a global representation of image content. We are motivated to use GIST\nby its previous use in nonlinear NCA for image retrieval [38]. The resulting image representation\nis a length-512 vector. We note that this is still too large for efficient NN search and that the GIST\nfeatures are not domain-adaptive.\nLinear NCA regression (NCAR) is described in Section 3. 
We pre-compute GIST for each image\nand use that as our input representation. We learn a 512 \u00d7 32 matrix of weights by minimizing\nEq. 3 using nonlinear conjugate gradients with randomly sampled mini-batches of size 512. We\nperform three line-searches per mini-batch and stop learning after 500 mini-batches. We found that\nour results slightly improved when we applied a form of local contrast normalization (LCN) prior\nto computing GIST. Each pixel\u2019s response was normalized by the integrated response of a 9 \u00d7 9\nwindow of neighbouring pixels. For more details see [30].\nConvolutional NCA regression (C-NCAR) See Fig. 2 for a summary of our architecture. Images\nare pre-processed using LCN. Convolutions are followed by pixel-wise tanh and absolute value\nrecti\ufb01cation. The abs prevents cancellations in local neighbourhoods during average downsampling\n[18]. Our architectural parameters (size of \ufb01lters, number of \ufb01lter banks, etc.) are chosen to produce\na 32-dimensional output. Derivations of parameter updates are presented as supplementary material.\nSoft DrLIM (S-DrLIM) and Convolutional soft DrLIM (CS-DrLIM) We also experiment with a\nvariant of an alternative, energy-based method that adds an explicit contrastive loss to the objective\nrather than implicitly through normalization. The contrastive loss only operates on dissimilar points\nwhich lie within a speci\ufb01ed margin, m, of each other. We use m = 1.25 as suggested by [16].\nIn both the linear and nonlinear case, the architecture and training procedure remains the same as\nNCAR and C-NCAR, respectively. We use a different objective: minimizing Eq. 6 with respect to\nthe parameters.\n4.1 Estimating 2D head and hand pose from synthetic data\nWe extracted 10,000 frames of training data and 5,000 frames of test data from Poser renderings\nof several hours of real motion capture data. 
Our synthetic data is similar to that considered in [36],\nhowever, we use a variety of backgrounds rather than a constant background. Furthermore, subjects\nare free to move around the frame and are rendered at various scales. The training set contains 6\ndifferent characters superimposed on 9 different backgrounds. The test set contains 6 characters and\n8 backgrounds not present in the training set. The inputs, x, are 320\u00d7 240 images, and the labels,\ny, are 6D vectors - the true (x,y) locations of the head and hands.\nResults are shown in Table 1 (column SY). Simple linear NCAR performs well compared to the\nbaselines, while our nonlinear methods C-NCAR and CS-DrLIM (which are not restricted to the\nGIST descriptor) signi\ufb01cantly outperform all other approaches. Pixel-based matching (though ex-\ntremely slow) does surprisingly well. This is perhaps an artifact of the synthetic data.\n4.2 Estimating 2D hand pose from real video\n\nWe digitally recorded all of the contributing and invited speakers at the Learning Workshop (Snow-\nbird) held in April 2010. The set consisted of 30 speakers, with talks ranging from 10-40 minutes\neach. After each session of talks, blocks of 150 frames were distributed as Human Intelligence Tasks\n\n5\n\n\fTable 1: 1-NN regression performance on the synthetic (SY) dataset and the real (RE) dataset. Re-\nsults are divided into baselines (no learning), linear embeddings and nonlinear embeddings. Errors\nare the mean pixel distance between the nearest neighbour and the ground truth label of the query.\nFor SY we locate the head and both hands. For RE we assume the location and scale of the head is\ngiven by a face detector and only locate the hands. The images at right indicate: (top) a radius of\n25.40 pixels with respect to the 320\u00d7240 SY input; (bottom) a radius of 16.41 pixels with respect to\nthe 128\u00d7128 RE input. 
Images have been scaled for the plot.\n\nInput | Embedding | Dim | Error-SY | Error-RE\nPixels | None | 16384 | 32.86 | 25.12\nGIST | None | 512 | 47.41 | 25.13\nGIST | PCA | 128 | 47.17 | 24.85\nGIST | PCA | 32 | 48.99 | 25.74\nGIST | NCAR | 32 | 34.21 | 24.93\nLCN+GIST | NCAR | 32 | 32.90 | 23.15\nGIST | S-DrLIM | 32 | 37.80 | 25.19\nLCN+GIST | Boost-SSC [36] | 32 | 34.80 | 22.65\nLCN | C-NCAR | 32 | 28.95 | 16.41\nLCN | CS-DrLIM | 32 | 25.40 | 19.61\n\non Amazon Mechanical Turk. We were able to obtain accurate hand and head tracks for each of the\nspeakers within a few hours of their talks. For the following experiments, we divided the 30 speakers\ninto a training set (odd numbered speakers) and test set (even numbered speakers).\nSince current state-of-the-art face detection algorithms work reasonably well, we concentrate on the\nharder problem of tracking the speakers\u2019 hands. We first run a commercial face detection algorithm\n[26] on all frames which provides an estimate of scale for every frame. We use the average scale (per\nvideo) estimated by the face detector to crop and rescale each frame to a 128\u00d7128 image (centered\non the head) that contains the speaker at roughly the same scale as other speakers (there is some\nvariability due to using average scale per video as speakers move throughout their talks). A similar\npreprocessing step was used in [12]. We do not consider cases in which the hands lie outside\nthe frame or are occluded. This yields 39,792 and 37,671 training and test images, respectively,\ncontaining the head and both hands. Since the images are head-centered, the labels, y, used during\ntraining are the 4-dimensional vector containing the relative offset of each hand from the head.\nWe emphasize that finding the hands is an extremely difficult task (sometimes even for human\nsubjects). Frames are low-resolution (typically the hands are 10-15 pixels in diameter) and contain\ncamera movement as well as frequently poor lighting. 
While previous work has assumed static\nbackgrounds, we confront the changing backgrounds and aim to learn invariance to both scene and\nsubject identity.\nResults are shown in Table 1 (column RE). They are organized into three groups: baselines\n(high-dimensional), linear learning-based methods, and nonlinear learning-based methods. The linear methods are able\nto achieve performance comparable to the baseline with the important attribute that distances are\ncomputed in a 32-dimensional space. If the codes were made binary (as in [38]) we could use fast\napproximate hashing techniques to permit real-time tracking using a database of well over 1 million\nexamples. The nonlinear methods show a dramatic improvement over the linear methods, especially\nour convolutional architectures which learn features from pixels. Boost-SSC [36] is based on a\nglobal representation similar to GIST, and so it is restricted in domain adaptivity. We also investigate\nthe effect of code size on the performance of C-NCAR (Fig. 5(a)). Performance is impressive even when the\ndimension in which we compute distances is reduced from 32 to 2. A visualization of the 2D\nembedding is shown in Fig. 3.\nFig. 4 shows some examples of nearest-neighbour matches under several different metrics. Most\napparent is that our methods, and in particular C-NCAR, develop invariance to background and focus\non the subject\u2019s pose. Both pixel-based and GIST-based matching are highly driven by the scene\n(including lighting and background). Though our method is trained only on the relative positions\nof the hands from the head, it appears to capture something more substantial about body pose in\ngeneral. We plan on evaluating this result quantitatively, using synthetic data in which we have\naccess to an articulated skeleton.\n\nFigure 3: Visualization of the 2D C-NCAR embedding of 1024 points from the RE training set. We\nshow the data points and their local geometry within four example clusters: C1-C4. 
Note that even\nwith a 2D embedding, we are able to capture pose similarity invariant to subject and background.\n\nFigure 4: Nearest neighbour pose estimation. The leftmost column shows the query image, and\nthe remaining columns (left to right) show the nearest neighbour found by: nonlinear C-NCAR\nregression, linear NCAR, GIST, pixel distance. Circles mark the pose obtained by crowd-sourcing;\nwe superimpose the pose estimated by C-NCAR onto the query with crosses.\n\n4.3 Improving real-world performance with synthetic data\nThere has been recent interest in using synthetic examples to improve performance on real-world\nvision tasks (e.g. [31]). The subtle differences between real and synthetic data make it difficult to\napply existing techniques to a dataset composed of both types of examples. This problem falls under\nthe domain of transfer learning, but to the best of our knowledge, transfer learning between real and\nsynthetic pairings is relatively unexplored. While previous work has attempted to learn\nrepresentations that are invariant to such effects as geometric distortions of the input [16] and temporal shifts\n[5, 24] we know of no previous work that has explicitly attempted to learn features that are invariant\nto the nature of the input, that is, real or synthetic.\n\nFigure 5: (a) Effect of code size on the performance of Convolutional NCA regression. (b) Adding\nsynthetic data to a fixed dataset of 1024 real examples to improve test performance measured on\nreal data. Error is expressed relative to a training set with no synthetic data. NCAR-1 does not\nre-initialize weights when more synthetic examples are added. NCAR-2 reinitializes weights to\nthe same random seed for each run. 
The curves show that adding synthetic examples improve\nperformance up to a point at which the synthetic examples outnumber the real examples 2:1.\n\nThe pairwise nature of our approach is well-suited to learning such invariance, provided that we have\nestablished correspondences between real and synthetic examples. In our case of pose estimation,\nthis comes from the labels. By forcing examples with similar poses (regardless of whether they are\nreal or synthetic) to lie close-by in code space we can implicitly produce a representation at each\nlayer that is invariant to the nature of the input. We have not made an attempt to restrict pairings to\nbe only between real and synthetic examples, though this may further aid in learning invariance.\nFig. 5(b) demonstrates the effect of gradually adding synthetic examples from SY to the RE training\ndataset. We use a reduced-size set of 1024 real examples for training which is gradually modi\ufb01ed\nto contain synthetic examples and a \ufb01xed set of 1024 real examples for testing. Error is expressed\nrelative to the case of no synthetic examples. We use Linear NCA for this experiment and train as\ndescribed above. We follow two different regimes. In NCAR-1 we do not reset the weights of the\nmodel to random each time we adjust the training set to add more synthetic examples. We simply\nadd more synthetic data and continue learning. In NCAR-2 we reset the weights to the same random\nseed for each run. The overall result is the same for each regime: the addition of synthetic examples\nto the training set improves test performance on real data up to a level at which the number of\nsynthetic examples is double the number of real examples.\n\n5 Conclusions\nWe have presented a nonparametric approach for pose estimation in realistic, challenging video\ndatasets. At the core of our method is a learned parametric mapping from high-dimensional space to\na low-dimensional space in which distance is ef\ufb01ciently computed. 
Our work differs from previous\nattempts at learning invariant mappings in that it is optimized for nearest neighbour regression rather\nthan classification and it scales to realistically sized images through the use of convolution and\nweight-sharing. This permits us to learn domain-adaptive features directly from pixels rather than relying\non hand-crafted features or global descriptors.\nIn our experiments, we have restricted ourselves to 1-NN matching, but we plan to investigate other\nmore sophisticated approaches such as locally weighted regression, or using the match as an\ninitialization for a gradient descent search in a parametric model. Though we work with video, our model\ndoes not rely on any type of temporal coherence. Integrating temporal knowledge in the form of a\nprior would benefit our approach. Alternatively, temporal context could be integrated at the input\nlevel, from simple frame differencing to more sophisticated temporal feature extraction (e.g. [23]).\nOur entire network is trained end-to-end with a single objective, and we do not perform any\nnetwork pre-training as in [34, 38]. Recent work has demonstrated that pre-training can successfully\nbe applied to convolutional architectures, both in the context of RBMs [22, 27] and sparse\ncoding [19]. We intend to investigate the effect of pre-training, as well as the use of mixed generative\nand discriminative objectives.\n\n[Figure 5 plots. (a) x-axis: Dimension of code; y-axis: Pixel error (test). (b) x-axis: Number of Synthetic Examples; y-axis: Relative error (test), relative to no synthetic data; curves: NCAR-1, NCAR-2.]\n\nReferences\n[1] A. Agarwal and B. Triggs. Recovering 3D human pose from monocular images. IEEE Transactions on\nPattern Analysis and Machine Intelligence, 28(1):44\u201358, 2006.\n\n[2] A. Andoni and P. Indyk. 
Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions.\n\n459\u2013468, 2006.\n\nIn FOCS, pages\n\n[3] M. Andriluka, S. Roth, and B. Schiele. Pictorial structures revisited: People detection and articulated pose estimation. In CVPR, 2009.\n[4] V. Athitsos, J. Alon, S. Sclaroff, and G. Kollios. Boostmap: A method for ef\ufb01cient approximate similarity rankings. CVPR, 2004.\n[5] S. Becker and G. Hinton. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355(6356):161\u2013163,\n\n1992.\n\n[6] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In ICCV, sep 2009.\n[7] J. Bouvrie. Notes on convolutional neural networks. Unpublished, 2006.\n[8] P. Buehler, A. Zisserman, and M. Everingham. Learning sign language by watching TV (using weakly aligned subtitles). CVPR, 2009.\n[9] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of \ufb02ow and appearance. ECCV, 2006.\n[10] A. Farhadi, D. Forsyth, and R. White. Transfer Learning in Sign language. In CVPR, 2007.\n[11] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.\n[12] V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Pose search: Retrieving people using their pose. In CVPR, 2009.\n[13] A. Frome, G. Cheung, A. Abdulkader, M. Zennaro, B. Wu, A. Bissacco, H. Adam, H. Neven, and L. Vincent. Large-scale Privacy\n\nProtection in Google Street View. In ICCV, 2009.\n\n[14] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In NIPS, 2004.\n[15] K. Grauman, G. Shakhnarovich, and T. Darrell.\n\nInferring 3d structure with a statistical image-based shape model.\n\n641\u2013648, 2003.\n\nIn ICCV, pages\n\n[16] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, pages 1735\u20131742, 2006.\n[17] G. Hinton and R. Salakhutdinov. 
Reducing the dimensionality of data with neural networks. Science, 313(5786):504\u2013507, 2006.\n[18] K. Jarrett, K. Kavukcuoglu, M-A Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In ICCV, 2009.\n[19] K. Kavukcuoglu, M-A Ranzato, and Y. LeCun. Fast inference in sparse coding algorithms with applications to object recognition. Technical report, NYU, 2008. CBLL-TR-2008-12-01.\n[20] P. Keller, S. Mannor, and D. Precup. Automatic basis function construction for approximate dynamic programming and reinforcement learning. In ICML, pages 449\u2013456, 2006.\n[21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278\u20132324, 1998.\n[22] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, pages 609\u2013616, 2009.\n[23] R. Memisevic and G. Hinton. Unsupervised learning of image transformations. In CVPR, 2007.\n[24] H. Mobahi, R. Collobert, and J. Weston. Deep learning from temporal coherence in video. In ICML, pages 737\u2013744, 2009.\n[25] G. Mori and J. Malik. Estimating human body configurations using shape context matching. In ECCV, 2002.\n[26] M. Nechyba, L. Brandy, and H. Schneiderman. Pittpatt face detection and tracking for the CLEAR 2007 evaluation. Multimodal Technologies for Perception of Humans, 2008.\n[27] M. Norouzi, M. Ranjbar, and G. Mori. Stacks of convolutional restricted Boltzmann machines for shift-invariant feature learning. In CVPR, 2009.\n[28] S.J. Nowlan and J.C. Platt. A convolutional neural network hand tracker. In NIPS, 1995.\n[29] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145\u2013175, 2001.\n[30] N. Pinto, D. Cox, and J. DiCarlo. 
Why is real-world visual object recognition hard? PLoS Comput Biol, 4(1), 2008.\n[31] N. Pinto, D. Doukhan, J. DiCarlo, and D. D. Cox. A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLoS Comput Biol, 5(11), 2009.\n[32] R. Poppe. Vision-based human motion analysis: An overview. Computer Vision and Image Understanding, 108(1-2):4\u201318, 2007.\n[33] D. Ramanan, D. Forsyth, and A. Zisserman. Strike a pose: Tracking people by finding stylized poses. In CVPR, 2005.\n[34] R. Salakhutdinov and G. Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In AISTATS, volume 11, 2007.\n[35] B. Sapp, C. Jordan, and B. Taskar. Adaptive pose priors for pictorial structures. In CVPR, 2010.\n[36] G. Shakhnarovich, P. Viola, and T. Darrell. Fast pose estimation with parameter-sensitive hashing. In ICCV, pages 750\u2013759, 2003.\n[37] L. Sigal, A. Balan, and M. J. Black. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. IJCV, 87(1/2):4\u201327, 2010.\n[38] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In CVPR, 2008.\n[39] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780\u2013785, 1997.", "award": [], "sourceid": 893, "authors": [{"given_name": "Graham", "family_name": "Taylor", "institution": ""}, {"given_name": "Rob", "family_name": "Fergus", "institution": ""}, {"given_name": "George", "family_name": "Williams", "institution": ""}, {"given_name": "Ian", "family_name": "Spiro", "institution": ""}, {"given_name": "Christoph", "family_name": "Bregler", "institution": ""}]}