{"title": "Non-linear CCA and PCA by Alignment of Local Models", "book": "Advances in Neural Information Processing Systems", "page_first": 297, "page_last": 304, "abstract": "", "full_text": "Non-linear CCA and PCA\n\nby Alignment of Local Models\n\nJakob J. Verbeeky, Sam T. Roweisz, and Nikos Vlassisy\n\ny Informatics Institute, University of Amsterdam\n\nz Department of Computer Science,University of Toronto\n\nAbstract\n\nWe propose a non-linear Canonical Correlation Analysis (CCA) method\nwhich works by coordinating or aligning mixtures of linear models. In\nthe same way that CCA extends the idea of PCA, our work extends re-\ncent methods for non-linear dimensionality reduction to the case where\nmultiple embeddings of the same underlying low dimensional coordi-\nnates are observed, each lying on a different high dimensional manifold.\nWe also show that a special case of our method, when applied to only\na single manifold, reduces to the Laplacian Eigenmaps algorithm. As\nwith previous alignment schemes, once the mixture models have been\nestimated, all of the parameters of our model can be estimated in closed\nform without local optima in the learning. Experimental results illustrate\nthe viability of the approach as a non-linear extension of CCA.\n\n1 Introduction\n\nIn this paper, we are interested in data that lies on or close to a low dimensional manifold\nembedded, possibly non-linearly, in a Euclidean space of much higher dimension. Data\nof this kind is often generated when our observations are very high dimensional but the\nnumber of underlying degrees of freedom is small. A typical example are images of an\nobject under different conditions (e.g. pose and lighting). A simpler example is given in\nFig. 1, where we have data in IR3 which lies on a two dimensional manifold. 
We want to recover the structure of the data manifold, so that we can 'unroll' the data manifold and work with the data expressed in the underlying 'latent coordinates', i.e. coordinates on the manifold. Learning low dimensional latent representations may be desirable for different reasons, such as compression for storage and communication, visualization of high dimensional data, or as preprocessing for further data analysis or prediction tasks.

Recent work on unsupervised nonlinear feature extraction has pursued several complementary directions. Various nonparametric spectral methods, such as Isomap[1], LLE[2], Kernel PCA[3] and Laplacian Eigenmaps[4], have been proposed which reduce the dimensionality of a fixed training set in a way that maximally preserves certain inter-point relationships, but these methods do not generally provide a functional mapping between the high and low dimensional spaces that is valid both on and off the training data. In this paper, we consider a method to integrate several local feature extractors into a single global representation, similar to the approaches of [5, 6, 7, 8]. These methods, as well as ours, deliver after training a functional mapping which can be used to convert previously unseen high dimensional observations into their low dimensional global coordinates. Like most of the above algorithms, our method performs non-linear feature extraction by minimizing a convex objective function whose critical points can be characterized as eigenvectors of some matrix. These algorithms are generally simple and efficient; one needs only to construct a matrix based on local feature analysis of the training data and then compute its largest or smallest eigenvectors using standard numerical methods.
In contrast, methods like the generative topographic mapping[9] and self-organizing maps[10] are prone to local optima in the objective function.

Our method is based on the same intuitions as in earlier work: the idea is to learn a mixture of latent variable density models on the original training data so that each mixture component acts as a local feature extractor. For example, we may use a mixture of factor analyzers or a mixture of principal component analyzers (PCA). After this mixture has been learned, the local feature extractors are 'coordinated' by finding, for each model, a suitable linear mapping (and offset) from its latent variable space into a single 'global' low-dimensional coordinate system. The local feature extractors together with the coordinating linear maps provide a global non-linear map from the data space to the latent space and back. Learning the mixture is driven by a density signal (we want to place models near the training points), while the post-coordination is driven by the idea that when two different models place significant weight on the same point, they should agree on its mapping into the global space.

Our algorithm, developed in the following section, builds upon recent work on coordination methods. As in [6], we use a cross-entropy between a unimodal approximation and the true posterior over global coordinates to encourage agreement. However, we do not attempt to learn the mixture model and the coordination simultaneously, since this causes severe problems with local minima. Instead, as in [7, 8], we fix a specific mixture and then study the computations involved in coordinating its local representations.
We extend the latter works as CCA extends PCA: rather than finding a projection of one set of points, we find projections for two sets of corresponding points {x_n} and {y_n} (x_n corresponding to y_n) into a single latent space, such that corresponding points in the two point sets project as nearby as possible.

In this setting we begin by showing, in Section 3, how Laplacian Eigenmaps[4] are a special case of the algorithms presented here when they are applied to only a single manifold. We go on, in Section 4, to extend our algorithm to a setting in which multiple different observation spaces are available, each one related to the same underlying global space but through different nonlinear embeddings. This naturally gives rise to a nonlinear version of weighted Canonical Correlation Analysis (CCA). We present results of several experiments in the same section and we conclude the paper with a general discussion in Section 5.

2 Non-linear PCA by aligning local feature extractors

Consider a given data set X = {x_1, ..., x_N} and a collection of k local feature extractors; f_s(x) is a vector containing the (zero or more) features produced by model s. Each feature extractor also provides an "activity signal" a_s(x), representing its confidence in modeling the point. We convert these activities into posterior responsibilities using a simple soft-max: p(s|x) = exp(a_s(x)) / Σ_r exp(a_r(x)). If the experts are actually components of a mixture, then setting the activities to the logarithm of the posteriors under the mixture recovers exactly those posteriors.

Next, we consider the relationship between the given representation of the data and the representation of the data in a global latent space, which we would like to find. Throughout, we will use g to denote latent 'global' coordinates for data.
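As a concrete illustration of the soft-max conversion above, consider the following short NumPy sketch (our own illustration, not code from the paper; the array names are hypothetical):

```python
import numpy as np

def responsibilities(activities):
    """Soft-max over components: p(s|x) = exp(a_s(x)) / sum_r exp(a_r(x)).
    `activities` is an (N, k) array; rows are data points, columns are components."""
    a = activities - activities.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

# If the activities are the log-posteriors of a mixture, the soft-max recovers
# exactly those posteriors, as remarked in the text:
post = np.array([[0.2, 0.3, 0.5], [0.7, 0.1, 0.2]])
q = responsibilities(np.log(post))  # equal to `post` up to rounding
```

Subtracting the row-wise maximum before exponentiating leaves the soft-max unchanged but avoids overflow for large activities.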
For the unobserved latent coordinate g corresponding to a data point x_n and conditioned on s, we assume the density:

p(g|x_n, s) = N(g; κ_s + A_s f_s(x_n), σ²I) = N(g; g_ns, σ²I),    (1)

where N(g; μ, Σ) is a Gaussian distribution on g with mean μ and covariance Σ. The mean g_ns of p(g|x_n, s) is the sum of the component offset κ_s in the latent space and a linear transformation, implemented by A_s, of f_s(x_n). From now on we will use homogeneous coordinates and write L_s = [A_s κ_s] and z_ns = [f_s(x_n)^T 1]^T, and thus g_ns = L_s z_ns. Consider the posterior distribution on latent coordinates given some data:

p(g|x) = Σ_s p(s, g|x) = Σ_s p(s|x) p(g|x, s).    (2)

Given a fixed set of local feature extractors and corresponding activities, we are interested in finding linear maps L_s that give rise to 'consistent' projections of the data in the latent space. By 'consistent', we mean that the p(g|x, s) are similar for components with large posterior. If the predictions are in perfect agreement for a point x_n, then all the g_ns are equal and the posterior p(g|x) is Gaussian; in general p(g|x) is a mixture of Gaussians. To measure the consistency, we define the following error function:

Φ({L_1, ..., L_k}) = min_{Q_1,...,Q_N} Σ_{n,s} q_ns D(Q_n(g) || p(g|x_n, s)),    (3)

where we use q_ns as a shorthand for p(s|x_n) and Q_n is a Gaussian with mean g_n and covariance matrix Σ_n. The objective sums, for each data point x_n and model s, the Kullback-Leibler divergence D between the single Gaussian Q_n(g) and the component density p(g|x_n, s), weighted by the posterior p(s|x_n). It is easy to derive that minimizing the objective Φ w.r.t. g_n and Σ_n yields:

g_n = Σ_s q_ns g_ns  and  Σ_n = σ²I,    (4)

where I denotes the identity matrix.
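The homogeneous-coordinate bookkeeping of equations (1)-(4) amounts to a few array operations; the following sketch (our own illustration with made-up dimensions, not the authors' code) computes the component means g_ns and the expected latent coordinate g_n for a single point x_n:

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, d_s = 3, 2, 4                   # components, latent dim, features per component
f = rng.normal(size=(k, d_s))         # f_s(x_n) for one point x_n, one row per s
A = rng.normal(size=(k, d, d_s))      # linear maps A_s
kappa = rng.normal(size=(k, d))       # offsets kappa_s
q = np.array([0.5, 0.3, 0.2])         # responsibilities q_ns = p(s|x_n)

# Homogeneous form: L_s = [A_s kappa_s], z_ns = [f_s(x_n)^T 1]^T
L = np.concatenate([A, kappa[:, :, None]], axis=2)   # shape (k, d, d_s + 1)
z = np.concatenate([f, np.ones((k, 1))], axis=1)     # shape (k, d_s + 1)

g_ns = np.einsum('sij,sj->si', L, z)  # component means g_ns = L_s z_ns, eq. (1)
g_n = q @ g_ns                        # expected latent coordinate, eq. (4)
```

Each g_ns equals A_s f_s(x_n) + kappa_s; the homogeneous form merely folds the offset into one matrix so the objective becomes quadratic in the L_s.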
Skipping some additive and multiplicative constants with respect to the linear maps L_s, the objective Φ then simplifies to:

Φ = Σ_{n,s} q_ns ‖g_n − g_ns‖² = (1/2) Σ_{n,s,t} q_ns q_nt ‖g_nt − g_ns‖² ≥ 0.    (5)

The main attraction of this setup is that our objective is a quadratic function of the linear maps L_s, as in [7, 8]. Using some extra notation, we obtain a clearer form of the objective as a function of the linear maps. Let:

u_n = [q_n1 z_n1^T ... q_nk z_nk^T],    U = [u_1^T ... u_N^T]^T,    L = [L_1 ... L_k]^T.    (6)

Note that from (4) and (6) we have g_n = (u_n L)^T. The expected projection coordinates can thus be computed as G = [g_1 ... g_N]^T = UL. We define the block-diagonal matrix D with k blocks given by D_s = Σ_n q_ns z_ns z_ns^T. The objective can now be written as:

Φ = Tr{L^T (D − U^T U) L}.    (7)

The objective function is invariant to translation and rotation of the global latent space, and re-scaling the latent space changes the objective monotonically, cf. (5). To make solutions unique with respect to translation, rotation and scaling, we impose two constraints:

(transl.):  ḡ = Σ_n g_n / N = 0,    (rot. + scale):  Σ_g = Σ_n (g_n − ḡ)(g_n − ḡ)^T / N = I.

The columns of L minimizing Φ are characterized as the generalized eigenvectors:

(D − U^T U)v = λ U^T U v  ⇔  D v = (λ + 1) U^T U v.    (8)

The value of the objective function is given by the sum of the corresponding eigenvalues λ. The smallest eigenvalue is always zero, corresponding to mapping all data into the same

Figure 1: Data in IR^3 with local charts indicated by the axes (left).
Data representation in IR^2 generated by optimizing our objective function; expected latent coordinates g_n are plotted (right).

latent coordinate. This embedding is uninformative since it is constant; we therefore select the eigenvectors corresponding to the second up to the (d + 1)-st smallest eigenvalues to obtain the best embedding in d dimensions. Note that, as mentioned in [7], this framework enables us to use feature extractors that provide different numbers of features.

In Fig. 1 we give an illustration of applying the above procedure to a simple manifold. The plots show the original data presented to the algorithm (left) and the 2-dimensional latent coordinates g_n = Σ_s q_ns g_ns found by the algorithm (right).

3 Laplacian Eigenmaps as a special case

Consider the special case of the algorithm of Section 2 where no features are extracted. The only information the mixture model provides is the posterior probabilities, collected in the matrix Q with [Q]_ns = q_ns = p(s|x_n). In that case:

g_ns = κ_s,    U = Q,    L = [κ_1 ... κ_k]^T,    (9)

Φ = Tr{L^T (D − A) L} = (1/2) Σ_{s,t} ‖κ_s − κ_t‖² Σ_n q_ns q_nt,    (10)

where A = Q^T Q is an adjacency matrix with [A]_st = Σ_n q_ns q_nt and D is the diagonal degree matrix of A with [D]_ss = Σ_t A_st = Σ_n q_ns. Optimization under the constraints of zero mean and identity covariance leads to the generalized eigenproblem:

(D − A)v = λ A v  ⇔  (D − A)v = (λ / (1 + λ)) D v.    (11)

The optimization problem is exactly the Laplacian Eigenmaps algorithm[4], but applied to the mixture components instead of the data points.
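This special case is easy to verify numerically. The following sketch (our own illustration; the posterior matrix Q is synthetic) builds A and D from Q and solves the generalized eigenproblem (D − A)v = λAv:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
N, k, d = 200, 10, 2
Q = rng.random((N, k))
Q /= Q.sum(axis=1, keepdims=True)   # rows are posteriors p(s|x_n)

A = Q.T @ Q                          # adjacency between mixture components
D = np.diag(A.sum(axis=1))           # degree matrix, D_ss = sum_n q_ns

# Generalized eigenproblem (D - A)v = lambda A v; scipy's eigh returns
# eigenvalues in ascending order.
lam, V = eigh(D - A, A)
# lam[0] is ~0 with a constant eigenvector (the uninformative embedding);
# the next d columns give the latent positions kappa_s of the components.
kappa = V[:, 1:d + 1]
```

Since (D − A) annihilates the constant vector, the smallest eigenvalue is zero, and the informative embedding is read off from eigenvectors 2 through d + 1, exactly as in the text.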
Since we do not use any feature extractors in this setting, the method can be applied to mixture models of data for which it is hard to design feature extractors, e.g. data that has (both numerical and) categorical features. Thus, we can use mixture densities without latent variables, e.g. mixtures of multinomials, mixtures of Hidden Markov Models, etc. Notice that in this manner the mixture model not only provides a soft grouping of the data through the posteriors, but also an adjacency matrix between the groups.

4 Non-linear CCA by aligning local feature extractors

Canonical Correlation Analysis (CCA) is a data analysis method that finds correspondences between two or more sets of measurements. The data are provided in tuples of corresponding measurements in the different spaces. The sets of measurements can be obtained by employing different sensors to make measurements of some phenomenon. Our main interest in this paper is to develop a nonlinear extension of CCA which works when the different measurements come from separate nonlinear manifolds that share an underlying global coordinate system. Non-linear CCA can be trained to find a shared low dimensional embedding for both manifolds, exploiting the pairwise correspondence provided by the data set. Such models can then be used for different purposes, like sensor fusion, denoising, filling in missing data, or predicting a measurement in one space given a measurement in the other space. Another important aspect of this learning setup is that the use of multiple sensors might also function as regularization, helping to avoid overfitting, cf. [11].

In CCA two (zero mean) sets of points are given: X = {x_1, ..., x_N} ⊂ IR^p and Y = {y_1, ..., y_N} ⊂ IR^q. The aim is to find linear maps a and b that map members of X and Y, respectively, onto the real line, such that the correlation between the linearly transformed variables is maximized.
This is easily shown to be equivalent to minimizing:

E = (1/2) Σ_n [a x_n − b y_n]²    (12)

under the constraint that a[Σ_n x_n x_n^T]a^T + b[Σ_n y_n y_n^T]b^T = 1. The above is easily generalized such that the sets do not need to be zero mean, by allowing a translation as well. We can also generalize by mapping to IR^d instead of the real line, and then requiring the sum of the covariance matrices of the projections to be the identity. CCA can also be readily extended to take into account more than two point sets, as we now show.

In the generalized CCA setting with multiple point-sets, allowing translations and linear mappings to IR^d, the objective is to minimize the squared distance between all pairs of projections under the same constraint as above. We denote the projection of the n-th point in the s-th point-set as g_ns and let g_n = (1/k) Σ_s g_ns. We then minimize the error function:

Φ_CCA = (1 / 2k²) Σ_{n,s,t} ‖g_ns − g_nt‖² = (1/k) Σ_{n,s} ‖g_ns − g_n‖².    (13)

The objective Φ in equation (5) coincides with Φ_CCA if q_ns = 1/k. The constraints imposed upon the optimization by CCA and by our objective of the previous sections are equivalent. We can thus regard the alignment procedure as a weighted form of CCA. This suggests using the coordination technique for non-linear CCA. This is achieved quite easily, without modifying the objective function (5). We consider different point sets, each having a mixture of locally valid linear projections into the 'global' latent space, which is now shared by all mixture components and point sets. We minimize the weighted sum of the squared distances between all pairs of projections, i.e.
we have pairs of projections due to the same point set and also pairs that combine projections from different point sets.

We use c as an index ranging over the C different observation spaces, and write q^c_ns for the posterior on component s for observation n in observation space c. Similarly, we use g^c_ns to denote the projection due to component s from space c. The average projection due to observation space c is then denoted g^c_n = Σ_s q^c_ns g^c_ns. We use index r to range over all mixture components and observation spaces, so that q_nr = (1/C) p(s|x_n) if r corresponds to (c = 1, s) and q_nr = (1/C) p(s|y_n) if r corresponds to (c = 2, s), i.e. r ↔ (c, s). The overall average projection then becomes g_n = (1/C) Σ_c g^c_n = Σ_r q_nr g_nr. The objective (5) can now be rewritten as:

Φ = Σ_{n,r} q_nr ‖g_nr − g_n‖² = (1/C) Σ_{c,n} ‖g_n − g^c_n‖² + (1/C) Σ_{c,n,s} q^c_ns ‖g^c_n − g^c_ns‖².    (14)

Observe how in (14) the objective sums between-point-set consistency of the projections (first summand) and within-point-set consistency of the projections (second summand).

Figure 2: Data and charts, indicated by bars (left, middle). Latent coordinates (vert.) and coordinate on generating curve (hor.) (right).

The above technique can also be used to obtain more stable results from the chart coordination procedure for a single manifold discussed in Section 2.
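The rewriting in (13) of all-pairs squared distances as deviations from the mean projection can be checked numerically with a small self-contained sketch (synthetic projections, our own illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
N, k, d = 50, 4, 2
g = rng.normal(size=(N, k, d))         # g_ns: projection of point n by map s

g_bar = g.mean(axis=1, keepdims=True)  # g_n = (1/k) sum_s g_ns

# Left-hand side of (13): sum over all pairs of projections of each point
lhs = (np.linalg.norm(g[:, :, None, :] - g[:, None, :, :], axis=-1) ** 2
       ).sum() / (2 * k ** 2)
# Right-hand side of (13): deviations from the mean projection
rhs = (np.linalg.norm(g - g_bar, axis=-1) ** 2).sum() / k
# lhs == rhs up to floating point
```

The identity follows from Σ_{s,t} ‖g_s − g_t‖² = 2k Σ_s ‖g_s − ḡ‖², which holds for any set of k vectors with mean ḡ.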
Robustness to variation in the mixture fitting can be improved by using several sets of charts fitted to the same manifold. We can then align all these sets of charts by optimizing (14). This aligns the charts within each set and at the same time makes sure the different sets of aligned charts are mutually aligned, providing important regularization, since now every point is modeled by several local models.

Note that if the charts and responsibilities are obtained using a mixture of PCAs or factor analyzers, the local linear mappings to the latent space induce a Gaussian mixture in the latent space. This mixture can be used to compute responsibilities of components given latent coordinates. Also, for each linear map from the data to the latent space we can compute a pseudo-inverse projecting back. By averaging the individual back-projections with the responsibilities computed in latent space, we obtain a projection from the latent space to the data space. In total, we can thus map from one observation space into another. This is how we generated the reconstructions in the experiments reported below. When using linear CCA for data that is non-linearly embedded, reconstructions will be poor, since linear CCA can only map into a low dimensional linear subspace.

As an illustrative example of the non-linear CCA we used two point-sets in IR^2. The first point-set was generated on an S-shaped curve and the second point-set was generated along an arc, see Fig. 2. To both point sets we added Gaussian noise, and we learned a 10 component mixture model on both sets. In the rightmost panel of Fig. 2 the discovered latent coordinates are plotted against the coordinate on the generating curve; the recovery is clearly successful. Below, we describe three more challenging experiments.

In the first experiment we use two data sets which we know to share the same underlying degrees of freedom. We use images of a face varying its gaze left-right and up-down.
We cut these images in half to obtain our two sets of images. We trained the system on 1500 image halves of 40 × 20 pixels each. Both image halves were modeled with a mixture of 40 components. In Fig. 3 some generated right half images based on the left half are shown.

The second experiment concerns appearance-based pose estimation of an object. One point set consists of a pixel representation of images of an object and the other point set contains the corresponding pose of the camera w.r.t. the object. For the pose parameters we used the identity to 'extract' features (i.e. we just used one component for this space). The training data was collected(1) by moving a camera over the half-sphere centered at the object. A mixture of 40 PCAs was trained on the image data and aligned with the pose parameters in a 2-dimensional latent space. The right panel of Fig. 3 shows reconstructions of the images conditioned on various pose inputs (the left image of each pair is the reconstruction based on the pose of the right image). Going the other way, when we input an image and estimate the pose, the absolute errors in the longitude (0°-360°) were under 10° in over 80% of the cases, and for latitude (0°-90°) they were under 5° in over 90% of the cases.

(1) Thanks to G. Peters for sharing the images used in [12] and recorded at the Institute for Neural Computation, Ruhr-University Bochum, Germany.

Figure 3: Right half of the images was generated given the left half using the trained model (left).
Image reconstructions given pose parameters (right).

In the third experiment we use the same images as in the second experiment, but replace the direct (low dimensional) supervision signal of the pose parameters with (high dimensional) correspondences in the form of images of another object in corresponding poses. We trained a mixture of 40 PCAs on both image sets (2000 images of 64 × 64 pixels in each set) and aligned these in a 3-dimensional latent space. Comparing the pose of an object to the pose of the nearest (in latent space) image of the other object, the standard deviation of the error in latitude is 2.0°. For longitude we found 4 errors of about 180° in our 500 test cases; the remaining errors had standard deviation 3.9°. Given a view of one object we can reconstruct the corresponding view of the second object; Fig. 4 shows some of the obtained reconstruction results. All presented reconstructions were made for data not included in training.

5 Discussion

In this paper, we have extended alignment methods for single manifold nonlinear dimensionality reduction to perform non-linear CCA using measurements from multiple manifolds. We have also shown the close relationship with Laplacian Eigenmaps[4] in the degenerate case of a single manifold and feature extractors of zero dimensionality.

In [7] a related method to coordinate local charts is proposed, which is based on the LLE cost function as opposed to our cross-entropy term; this means that we need more than just a set of local feature extractors and their posteriors: we also need to be able to compute reconstruction weights, collected in an N × N weight matrix. The weights indicate how we can reconstruct each data point from its nearest neighbors. Computing these weights requires access to the original data directly, not just through the "interface" of the mixture model.
Defining sensible weights and the 'right' number of neighbors might not be straightforward, especially for data in non-Euclidean spaces. Furthermore, computing the weights costs in principle O(N²) time because we need to find nearest neighbors, whereas the presented work has running time linear in the number of data points.

In [11] it is considered how to find low dimensional representations for multiple point sets simultaneously, given a few correspondences between the point sets. The generalization of LLE presented there for this problem is closely related to our non-linear CCA model. The work presented here can also be extended to the case where we know only for a few points in one set to which points they correspond in the other set. The use of multiple sets of charts for one data set is similar in spirit to the self-correspondence technique of [11], where the data is split into several overlapping sets used to stabilize the generalized LLE.

Figure 4: I1: image in first set (a); I2: corresponding image in second set (b); closest image in second set (in latent space) to I1 (c); reconstruction of I2 given I1 (d).

Finally, it would be interesting to compare our approach with treating the data in the joint (x, y) space and employing techniques for a single point set[8, 7, 6]. In this case, points for which we do not have the correspondence can be treated as data with missing values.

Acknowledgments

JJV and NV are supported by the Technology Foundation STW (AIF4997), applied science division of NWO, and the technology program of the Dutch Ministry of Economic Affairs. STR is supported in part by the Learning Project of IRIS Canada and by NSERC.

References

[1] J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, December 2000.

[2] S.T. Roweis and L.K. Saul.
Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, December 2000.

[3] B. Schölkopf, A.J. Smola, and K. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.

[4] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, volume 14, 2002.

[5] C. Bregler and S.M. Omohundro. Surface learning with applications to lipreading. In Advances in Neural Information Processing Systems, volume 6, 1994.

[6] S.T. Roweis, L.K. Saul, and G.E. Hinton. Global coordination of local linear models. In Advances in Neural Information Processing Systems, volume 14, 2002.

[7] Y.W. Teh and S.T. Roweis. Automatic alignment of local representations. In Advances in Neural Information Processing Systems, volume 15, 2003.

[8] M. Brand. Charting a manifold. In Advances in Neural Information Processing Systems, volume 15, 2003.

[9] C.M. Bishop, M. Svensén, and C.K.I. Williams. GTM: the generative topographic mapping. Neural Computation, 10:215-234, 1998.

[10] T. Kohonen. Self-organizing maps. Springer, 2001.

[11] J.H. Ham, D.D. Lee, and L.K. Saul. Learning high dimensional correspondences from low dimensional manifolds. In ICML'03 workshop on the continuum from labeled to unlabeled data in machine learning and data mining, 2003.

[12] G. Peters, B. Zitova, and C. von der Malsburg. How to measure the pose robustness of object views. Image and Vision Computing, 20(4):249-256, 2002.
", "award": [], "sourceid": 2443, "authors": [{"given_name": "Jakob", "family_name": "Verbeek", "institution": null}, {"given_name": "Sam", "family_name": "Roweis", "institution": null}, {"given_name": "Nikos", "family_name": "Vlassis", "institution": "Adobe Research"}]}