{"title": "Diffeomorphic Dimensionality Reduction", "book": "Advances in Neural Information Processing Systems", "page_first": 1713, "page_last": 1720, "abstract": "This paper introduces a new approach to constructing meaningful lower dimensional representations of sets of data points. We argue that constraining the mapping between the high and low dimensional spaces to be a diffeomorphism is a natural way of ensuring that pairwise distances are approximately preserved. Accordingly we develop an algorithm which diffeomorphically maps the data near to a lower dimensional subspace and then projects onto that subspace. The problem of solving for the mapping is transformed into one of solving for an Eulerian flow field which we compute using ideas from kernel methods. We demonstrate the efficacy of our approach on various real world data sets.", "full_text": "Diffeomorphic Dimensionality Reduction\n\nChristian Walder and Bernhard Schölkopf\n\nMax Planck Institute for Biological Cybernetics\n\n72076 Tübingen, Germany\n\nfirst.last@tuebingen.mpg.de\n\nAbstract\n\nThis paper introduces a new approach to constructing meaningful lower dimensional representations of sets of data points. We argue that constraining the mapping between the high and low dimensional spaces to be a diffeomorphism is a natural way of ensuring that pairwise distances are approximately preserved. Accordingly we develop an algorithm which diffeomorphically maps the data near to a lower dimensional subspace and then projects onto that subspace. The problem of solving for the mapping is transformed into one of solving for an Eulerian flow field which we compute using ideas from kernel methods. We demonstrate the efficacy of our approach on various real world data sets.\n\n1 Introduction\n\nThe problem of visualizing high dimensional data often arises in the context of exploratory data analysis. 
For many real world data sets this is a challenging task, as the spaces in which the data lie are often too high dimensional to be visualized directly. If the data themselves lie on a lower dimensional subspace, however, dimensionality reduction techniques may be employed, which aim to meaningfully represent the data as elements of this lower dimensional subspace.\n\nThe earliest approaches to dimensionality reduction are the linear methods known as principal components analysis (PCA) and factor analysis (Duda et al., 2000). More recently, however, the majority of research has focused on non-linear methods, in order to overcome the limitations of linear approaches; for an overview and numerical comparison see e.g. Venna (2007) and van der Maaten et al. (2008), respectively. In an effort to better understand the numerous methods which have been proposed, various categorizations have been suggested. In the present case, it is pertinent to distinguish between methods which focus on properties of the mapping to the lower dimensional space, and methods which focus on properties of the mapped data in that space. A canonical example of the latter is multidimensional scaling (MDS), which in its basic form finds the minimizer with respect to y_1, y_2, . . . , y_m of (Cox & Cox, 1994)\n\nΣ_{i,j=1}^{m} (‖x_i − x_j‖ − ‖y_i − y_j‖)², (1)\n\nwhere here, as throughout the paper, the x_i ∈ R^a are input or high dimensional points, and the y_i ∈ R^b are output or low dimensional points, so that b < a. Note that the above term is a function only of the input points and the corresponding mapped points, and is designed to preserve the pairwise distances of the data set.\n\nThe methods which focus on the mapping itself (from the higher to the lower dimensional space, which we refer to as the downward mapping, or the upward mapping which is the converse) are less common, and form a category into which the present work falls. 
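As an aside, the stress term (1) is straightforward to evaluate; the NumPy sketch below is illustrative only (it is not code from the paper) and simply computes the sum of squared differences between the two sets of pairwise distances:

```python
import numpy as np

def mds_stress(X, Y):
    """MDS-style stress (1): sum over i, j of
    (||x_i - x_j|| - ||y_i - y_j||)^2, with points as rows."""
    DX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    DY = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    return float(((DX - DY) ** 2).sum())

# Collinear points embedded isometrically in 1-D give zero stress:
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
Y = X[:, :1]  # drop the constant second coordinate
print(mds_stress(X, Y))  # 0.0
```

An embedding that collapses distinct points (e.g. mapping everything to the origin) yields a strictly positive stress, which is what the minimization in (1) penalizes.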
Both auto-encoders (DeMers & Cottrell, 1993) and the Gaussian process latent variable model (GP-LVM) (Lawrence, 2004) fall into this category, but we focus on the latter as it provides an appropriate transition into the main part of the paper. The GP-LVM places a Gaussian process (GP) prior over each high dimensional component of the upward mapping, and optimizes with respect to the set of low dimensional points (which can be thought of as hyper-parameters of the model) the likelihood of the high dimensional points. Hence the GP-LVM constructs a regular (in the sense of regularization, i.e. likely under the GP prior) upward mapping. By doing so, the model guarantees that nearby points in the low dimensional space are mapped to nearby points in the high dimensional space, an intuitive idea for dimensionality reduction which is also present in the MDS objective (1), above.\n\nThe converse is not guaranteed in the original GP-LVM, however, and this has led to the more recent development of the so-called back-constrained GP-LVM (Lawrence & Candela, 2006), which essentially places an additional GP prior over the downward mapping. By guaranteeing in this way that (the modes of the posterior distributions over) both the upward and downward mappings are regular, the back constrained GP-LVM induces something reminiscent of a diffeomorphic mapping between the two spaces. This leads us to the present work, in which we derive our new algorithm, Diffeomap, by explicitly casting the dimensionality reduction problem as one of constructing a diffeomorphic mapping between the low dimensional space and the subspace of the high dimensional space on which the data lie.\n\n2 Diffeomorphic Mappings and their Practical Construction\n\nIn this paper we use the following definition:\n\nDefinition 2.1. Let U and V be open subsets of R^a and R^b, respectively. 
The mapping F : U → V is said to be a diffeomorphism if it is bijective (i.e. one to one), smooth (i.e. belonging to C^∞), and has a smooth inverse map F^{−1}.\n\nWe note in passing the connection between this definition, our discussion of the GP-LVM, and dimensionality reduction. The GP-LVM constructs a regular upward mapping (analogous to F^{−1}) which ensures that points nearby in R^b will be mapped to points nearby in R^a, a property referred to as similarity preservation in (Lawrence & Candela, 2006). The back constrained GP-LVM simultaneously ensures that the downward mapping (analogous to F) is regular, thereby additionally implementing what its authors refer to as dissimilarity preservation. Finally, the similarity between smoothness (required of F and F^{−1} in Definition 2.1) and regularity (imposed on the downward and upward mappings by the GP prior in the back constrained GP-LVM) completes the analogy. There is also an alternative, more direct motivation for diffeomorphic mappings in the context of dimensionality reduction, however. In particular, a diffeomorphic mapping has the property that it does not lose any information. That is, given the mapping itself and the lower dimensional representation of the data set, it is always possible to reconstruct the original data.\n\nThere has been significant interest within the image processing community in the construction of diffeomorphic mappings for the purpose of image warping (Dupuis & Grenander, 1998; Joshi & Miller, 2000; Karaçali & Davatzikos, 2003). The reason for this can be understood as follows. Let I : U → R^3 represent the RGB values of an image, where U ⊂ R^2 is the image plane. If we now define the warped version of I to be I ∘ W, then we can guarantee that the warp is topology preserving, i.e. 
that it does not “tear” the image, by ensuring that W is a diffeomorphism U → U. The following two main approaches to constructing such diffeomorphisms have been taken by the image processing community; we mention the first for reference, while the second forms the basis of Diffeomap. It is a notable aside that there seem to be no image warping algorithms analogous to the back constrained GP-LVM, in which regular forward and inverse mappings are simultaneously constructed.\n\n1. Enforcement of the constraint that |J(W)|, the determinant of the Jacobian of the mapping, be positive everywhere. This approach has been successfully applied to the problem of warping 3D magnetic resonance images (Karaçali & Davatzikos, 2003), for example, but a key ingredient of that success was the fact that the authors defined the mapping W numerically on a regular grid. For the high dimensional cases relevant to dimensionality reduction, however, such a numerical grid is highly computationally unattractive.\n\n2. Recasting the problem of constructing W as an Eulerian flow problem (Dupuis & Grenander, 1998; Joshi & Miller, 2000). This approach is the focus of the next section.\n\nFigure 1: The relationship between v(·, ·), φ(·, ·) and ψ(·) for the one dimensional case ψ : R → R. (Figure annotations: trajectory position (s, φ(x, s)), tangent (1, v(φ(x, s), s)), and endpoint φ(x, 1) = ψ(x).)\n\n2.1 Diffeomorphisms via Flow Fields\n\nThe idea here is to indirectly define the mapping of interest, call it ψ : R^a → R^a, by way of a “time” indexed velocity field v : R^a × R → R^a. 
In particular we write ψ(x) = φ(x, 1), where\n\nφ(x, t) = x + ∫_{s=0}^{t} v(φ(x, s), s) ds. (2)\n\nThis choice of φ satisfies the following Eulerian transport equation with boundary condition:\n\n∂φ(x, s)/∂s = v(φ(x, s), s), φ(x, 0) = x. (3)\n\nThe role of v is to transport a given point x from its original location at time 0 to its mapped location φ(x, 1) by way of a trajectory whose position and tangent vector at time s are given by φ(x, s) and v(φ(x, s), s), respectively (see Figure 1). The point of this construction is that if v satisfies certain regularity properties, then the mapping ψ will be a diffeomorphism. This fact has been proven in a number of places; one particularly accessible example is (Dupuis & Grenander, 1998), where the necessary conditions are provided for the three dimensional case along with a proof that the induced mapping is a diffeomorphism. Generalizing the result to higher dimensions is straightforward; this fact is stated in (Dupuis & Grenander, 1998) along with the basic idea of how to do so.\n\nWe now offer an intuitive argument for the result. Consider Figure 1, and imagine adding a new starting point x′, along with its associated trajectory. It is clear that for the mapping ψ to be a diffeomorphism, for any such pair of points x and x′ the associated trajectories must not collide. This is because the two trajectories would be identical after the collision, x and x′ would map to the same point, and hence the mapping would not be invertible. 
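To make the transport construction concrete, the following is a minimal forward-Euler sketch of integrating (3) from time 0 to 1; the constant velocity field used for illustration is a hypothetical stand-in, not the kernel expansion derived later in the paper:

```python
import numpy as np

def integrate_flow(x, v, steps=100):
    """Forward-Euler integration of the transport equation (3):
    phi(x, 0) = x and d/ds phi(x, s) = v(phi(x, s), s).
    Returns an approximation of phi(x, 1) = psi(x), where `v` is
    any callable (point, time) -> velocity."""
    phi = np.asarray(x, dtype=float).copy()
    dt = 1.0 / steps
    for k in range(steps):
        phi = phi + dt * v(phi, k * dt)
    return phi

# Illustration: a spatially constant field v(x, t) = c translates
# every point by c, so distinct trajectories never collide and the
# induced map is trivially invertible.
c = np.array([0.5, -0.25])
mapped = integrate_flow(np.zeros(2), lambda x, t: c)
```

With a spatially varying but sufficiently regular field, the same integrator traces the trajectories φ(x_j, ·) sketched in Figure 1; only the choice of v changes.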
But if v is sufficiently regular then such collisions cannot occur.\n\n3 Diffeomorphic Dimensionality Reduction\n\nThe framework of Eulerian flow fields which we have just introduced provides an elegant means of constructing diffeomorphic mappings R^a → R^a, but for dimensionality reduction we require additional ingredients, which we now introduce. The basic idea is to construct a diffeomorphic mapping in such a way that it maps our data set near to a subspace of R^a, and then to project onto this subspace. The subspace we use, call it S_b, is the b-dimensional one spanned by the first b canonical basis vectors of R^a. Let P^{(a→b)} : R^a → R^b be the projection operator which extracts the first b components of the vector it is applied to, i.e.\n\nP^{(a→b)} x = (I Z) x, (4)\n\nwhere I ∈ R^{b×b} is the identity matrix and Z ∈ R^{b×(a−b)} is a matrix of zeros. We can now write the mapping ϕ : R^a → R^b which we propose for dimensionality reduction as\n\nϕ(x) = P^{(a→b)} φ(x, 1), (5)\n\nwhere φ is given by (2). We choose each component of v at each time to belong to a reproducing kernel Hilbert space (RKHS) H, so that v(·, t) ∈ H^a, t ∈ [0, 1]. If we define the norm¹\n\n‖v(·, t)‖²_{H^a} := Σ_{j=1}^{a} ‖[v(·, t)]_j‖²_H, (6)\n\nthen ‖v(·, t)‖²_{H^a} < ∞ for all t ∈ [0, 1] is a sufficient condition which guarantees that ψ is a diffeomorphism, provided that some technical conditions are satisfied (Dupuis & Grenander, 1998; Joshi & Miller, 2000). In particular v need not be regular in its second argument. For dimensionality reduction we propose to construct v as the minimizer of\n\nO = λ ∫_{t=0}^{1} ‖v(·, t)‖²_{H^a} dt + Σ_{j=1}^{m} L(ψ(x_j)), (7)\n\nwhere λ ∈ R^+ is a regularization parameter. 
Here, L measures the squared distance to our b dimensional linear subspace of interest S_b, i.e.\n\nL(x) = Σ_{d=b+1}^{a} [x]_d². (8)\n\nNote that this places special importance on the first b dimensions of the input space of interest; accordingly we make the natural and important preprocessing step of applying PCA such that as much as possible of the variance of the data is captured in these first b dimensions.\n\n3.1 Implementation\n\nOne can show that the minimizer in v of (7) takes the form\n\n[v(·, t)]_d = Σ_{j=1}^{m} [α_d(t)]_j k(φ(x_j, t), ·), d = 1, . . . , a, (9)\n\nwhere k is the reproducing kernel of H and α_d is a function [0, 1] → R^m. This was proven directly for a similar specific case (Joshi & Miller, 2000), but we note in passing that it follows immediately from the celebrated representer theorem of RKHSs (Schölkopf et al., 2001), by considering a fixed time t. Hence, we have simplified the problem of determining v to one of determining m trajectories φ(x_j, ·). This is because not only does (9) hold, but we can use standard manipulations (in the context of kernel ridge regression, for example) to determine that for a given set of such trajectories,\n\nα_d(t) = K(t)^{−1} u_d(t), d = 1, 2, . . . , a, (10)\n\nwhere t ∈ [0, 1], K(t) ∈ R^{m×m}, u_d(t) ∈ R^m, and we have let [K(t)]_{j,k} = k(φ(x_j, t), φ(x_k, t)) along with [u_d(t)]_j = ∂_t [φ(x_j, t)]_d. Note that the invertibility of K(t) is guaranteed for certain kernel functions (including the Gaussian kernel which we employ in all our experiments, see Section 4), provided that the points φ(x_j, t) are distinct. Hence, one can verify using (9), (10) and the reproducing property of k in H (i.e. 
the fact that ⟨f, k(x, ·)⟩_H = f(x) for all f ∈ H), that for the optimal v,\n\n‖v(·, t)‖²_{H^a} = Σ_{d=1}^{a} u_d(t)^⊤ K(t)^{−1} u_d(t). (11)\n\nThis allows us to write our objective (7) in terms of the m trajectories mentioned above:\n\nO = λ ∫_{t=0}^{1} Σ_{d=1}^{a} u_d(t)^⊤ K(t)^{−1} u_d(t) dt + Σ_{j=1}^{m} Σ_{d=b+1}^{a} [φ(x_j, 1)]_d². (12)\n\nSo far no approximations have been made, and we have constructed an optimal finite dimensional basis for v(·, t).\n\n¹Square brackets with subscripts denote matrix elements, and colons denote entire rows or columns.\n\nFigure 2: Dimensionality reduction of motion capture data. (a) The data mapped from 102 to 2 dimensions using Diffeomap (the line shows the temporal order in which the input data were recorded). (b)-(d) Three rendered input points corresponding to the marked locations in (a).\n\nThe second argument of v is not so easily dealt with, however, so we approximate by discretizing the interval [0, 1]. In particular, we let t_k = kδ, k = 0, 1, . . . , p, where δ = 1/p, and make the approximation ∂_{t=t_k} φ(x_j, t) = (φ(x_j, t_k) − φ(x_j, t_{k−1}))/δ. By making the further approximation ∫_{t_{k−1}}^{t_k} K(t)^{−1} dt = δ K(t_{k−1})^{−1}, and substituting into (12), we obtain the first form of our problem which is finite dimensional and hence readily optimized, i.e. the minimization of\n\n(λ/δ) Σ_{d=1}^{a} Σ_{k=1}^{p} (Φ_{k,d} − Φ_{k−1,d})^⊤ K(t_{k−1})^{−1} (Φ_{k,d} − Φ_{k−1,d}) + Σ_{d=b+1}^{a} ‖Φ_{p,d}‖² (13)\n\nwith respect to Φ_{k,d} ∈ R^m for k = 1, 2, . . . , p and d = 1, 2, . . . 
, a, where [Φ_{k,d}]_j = [φ(x_j, t_k)]_d.\n\n3.2 A Practical Reduced Set Implementation\n\nA practical problem with (13) is the computationally expensive matrix inverse. In practice we reduce this burden by employing a reduced set expansion which replaces the sum over 1, 2, . . . , m in (9) with a sum over a randomly selected subset I, thereby using |I| = n basis functions to represent v(·, t). In this case it is possible to show using the reproducing property of k(·, ·) that the resulting objective function is identical to (13), but with the matrix K(t_k)^{−1} replaced by the expression\n\nK_{m,n} (K_{n,m} K_{m,n})^{−1} K_{n,n} (K_{n,m} K_{m,n})^{−1} K_{n,m}, (14)\n\nwhere K_{m,n} = K_{n,m}^⊤ ∈ R^{m×n} is the sub-matrix of K(t_k) formed by taking all of the rows, but only those columns given by I. Similarly, K_{n,n} ∈ R^{n×n} is the square sub-matrix of K(t_k) formed by taking a subset of both the rows and columns, namely those given by I. For optimization we also use the gradients of the above expression, the derivation of which we have omitted for brevity. Note however that by factorizing appropriately, the computation of the objective function and its gradients can be performed with an asymptotic time complexity of O(n²(m + a)).\n\n4 Experiments\n\nIt is difficult to objectively compare dimensionality reduction algorithms, as there is no universally agreed upon measure of performance. Algorithms which are generalizations or variations of older ones may be compared side by side with their predecessors, but this is not the case with our new algorithm, Diffeomap. Hence, in this section we attempt to convince the reader of the utility of our approach by visually presenting our results on as many and as varied realistic problems as space permits, while providing pointers to comparable results from other authors. For all experiments we fixed the parameters which trade off between computational speed and accuracy, i.e. 
we set the temporal resolution p = 20, and the number of basis functions n = 300. We used a Gaussian kernel function k(x, y) = exp(−‖x − y‖²/(2σ²)), and tuned the σ parameter manually along with the regularization parameter λ. For optimization we used a conjugate gradient type method² fixed to 1000 iterations and with starting point [Φ_{k,d}]_j = [x_j]_d, k = 1, 2, . . . , p.\n\n²Carl Rasmussen's minimize.m, which is freely available from http://www.kyb.mpg.de/~carl.\n\nFigure 3: Vowel data mapped from 24 to 2 dimensions using (a) PCA and (b)-(c) Diffeomap. Plots (b) and (c) differ only in the parameter settings of Diffeomap, with (b) corresponding to minimal one nearest neighbor errors in the low dimensional space; see Section 4.2 for details.\n\n4.1 Motion Capture Data\n\nThe first data set we consider consists of the coordinates in R^3 of a set of markers placed on a person breaking into a run, sampled at a constant frequency, resulting in m = 217 data points in a = 102 dimensions, which we mapped to b = 2 dimensions using Diffeomap (see Figure 2). This data set is freely available from http://accad.osu.edu/research/mocap/mocap_data.htm as Figure 1 Run, and was also considered in (Lawrence & Candela, 2006), where it was shown that while the original GP-LVM fails to correctly discover the periodic component of the sequence, the back constrained version maps poses in the same part of the subject's step cycle nearby to each other, while simultaneously capturing variations in the inclination of the subject. 
Diffeomap also succeeded in this sense, and produced results which are competitive with those of the back constrained GP-LVM.\n\n4.2 Vowel Data\n\nIn this next example we consider a data set of a = 24 features (cepstral coefficients and delta cepstral coefficients) of a single speaker performing nine different vowels 300 times per vowel, acquired as training data for a vocal joystick system (Bilmes et al., 2006), and publicly available in pre-processed form from http://www.dcs.shef.ac.uk/~neil/fgplvm/. Once again we used Diffeomap to map the data to b = 2 dimensions, as depicted in Figure 3. We also depict the poor result of linear PCA, in order to rule out the hypothesis that it is merely the PCA based initialization of Diffeomap (mentioned after equation (8)) which does most of the work.\n\nThe results in Figure 3 are directly comparable to those provided in (Lawrence & Candela, 2006) for the GP-LVM, back constrained GP-LVM, and Isomap (Tenenbaum et al., 2000). Visually, the Diffeomap result appears to be superior to those of the GP-LVM and Isomap, and comparable to that of the back constrained GP-LVM. We also measured the performance of a one nearest neighbor classifier applied to the mapped data in R^2. For the best choice of the parameters σ and λ, Diffeomap made 140 errors, which compares favorably to the figures quoted for Isomap (458), the GP-LVM (226) and the back constrained GP-LVM (155) in (Lawrence & Candela, 2006). We emphasize however that this measure of performance is at best a rough one, since by manually varying our choice of the parameters σ and λ, we were able to obtain a result (Figure 3 (c)) which, although it leads to a significantly higher number of such errors (418), is arguably superior from a qualitative perspective to the result with minimal errors (Figure 3 (b)).\n\n4.3 USPS Handwritten Digits\n\nWe now consider the USPS database of handwritten digits (Hull, 1994). 
Following the methodology of the stochastic neighbor embedding (SNE) and GP-LVM papers (Hinton & Roweis, 2003; Lawrence, 2004), we take 600 images per class from the five classes corresponding to digits 0, 1, 2, 3, 4. Since the images are in gray scale and at a resolution of 16 by 16 pixels, this results in a data set of m = 3000 examples in a = 256 dimensions, which we again mapped to b = 2 dimensions, as depicted in Figure 4. The figure shows the individual points color coded according to class, along with a composite image formed by sequentially drawing each digit in random order at its mapped location, but only if it would not obscure a previously drawn digit.\n\nFigure 4: USPS handwritten digits 0-4 mapped to 2 dimensions using Diffeomap. (a) Mapped points color coded by class label. (b) A composite image of the mapped data; see Section 4.3 for details.\n\nDiffeomap manages to arrange the data in a manner which reveals such image properties as digit angle and stroke thickness. At the same time the classes are reasonably well separated, with the exception of the ones, which are split into two clusters depending on the angle. Although unfortunate, we believe that this splitting can be explained by the fact that (a) the left- and right-pointing ones are rather dissimilar in input space, and (b) the number of fairly vertical ones which could help to connect the left- and right-pointing ones is rather small. Diffeomap seems to produce a result which is superior to that of the GP-LVM (Lawrence, 2004), for example, but may be inferior to that of the SNE (Hinton & Roweis, 2003). We believe this is due to the fact that the nearest neighbor graph used by SNE is highly appropriate to the USPS data set. 
This is indicated by the fact that a nearest neighbor classifier in the 256 dimensional input space is known to perform strongly, with numerous authors having reported error rates of less than 5% on the ten class classification problem.\n\n4.4 NIPS Text Data\n\nFinally, we present results on the text data of papers from the NIPS conference proceedings volumes 0-12, which can be obtained from http://www.cs.toronto.edu/~roweis/data.html. This experiment is intended to address the natural concern that, by working in the input space rather than on a nearest neighbor graph, for example, Diffeomap may have difficulty with very high dimensional data. Following (Hinton & Roweis, 2003; Song et al., 2008) we represent the data as a word frequency vs. document matrix in which the author names are treated as words but weighted up by a factor of 20 (i.e. an author name is worth 20 words). The result is a data set of m = 1740 papers represented in a = 13649 words + 2037 authors = 15686 dimensions. Note however that the input dimensionality is effectively reduced by the PCA preprocessing step to m − 1 = 1739, that being the rank of the centered covariance matrix of the data.\n\nAs this data set is difficult to visualize without taking up large amounts of space, we have included the results in the supplementary material which accompanies our NIPS submission. In particular, we provide a first figure which shows the data mapped to b = 2 dimensions, with certain authors (or groups of authors) color coded; the choice of authors and their corresponding color codes follows precisely that of (Song et al., 2008). A second figure shows a plain marker drawn at the mapped locations corresponding to each of the papers. This second figure also contains the paper titles and authors of the corresponding papers, however, which are revealed when the user moves the mouse over the marked locations. 
Hence, this second figure allows one to browse the NIPS collection contextually. Since the mapping may be hard to judge, we note in passing that the correct classification rate of a one nearest neighbor classifier applied to the result of Diffeomap was 48%, which compares favorably to the rate of 33% achieved by linear PCA (which we use for preprocessing). To compute this score we treated authors as classes, and considered only those authors who were color coded both in our supplementary figure and in (Song et al., 2008).\n\n5 Conclusion\n\nWe have presented an approach to dimensionality reduction which is based on the idea that the mapping between the lower and higher dimensional spaces should be diffeomorphic. We provided a justification for this approach, by showing that the common intuition that dimensionality reduction algorithms should approximately preserve pairwise distances of a given data set is closely related to the idea that the mapping induced by the algorithm should be a diffeomorphism. This realization allowed us to take advantage of established mathematical machinery in order to convert the dimensionality reduction problem into a so called Eulerian flow problem, the solution of which is guaranteed to generate a diffeomorphism. Requiring that the mapping and its inverse both be smooth is reminiscent of the GP-LVM algorithm (Lawrence & Candela, 2006), but has the advantage in terms of statistical strength that we need not separately estimate a mapping in each direction. We showed results of our algorithm, Diffeomap, on a relatively small motion capture data set, a larger vowel data set, the USPS image data set, and finally the rather high dimensional data set derived from the text corpus of NIPS papers, with successes in all cases. 
Since our new approach performs well in practice while being significantly different to all previous approaches to dimensionality reduction, it has the potential to lead to a significant new direction in the field.\n\nReferences\n\nBilmes, J., et al. (2006). The Vocal Joystick. Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing. Toulouse, France.\n\nCox, T., & Cox, M. (1994). Multidimensional scaling. London, UK: Chapman & Hall.\n\nDeMers, D., & Cottrell, G. (1993). Non-linear dimensionality reduction. NIPS 5 (pp. 580–587). Morgan Kaufmann, San Mateo, CA.\n\nDuda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification. New York: Wiley. 2nd edition.\n\nDupuis, P., & Grenander, U. (1998). Variational problems on flows of diffeomorphisms for image matching. Quarterly of Applied Mathematics, LVI, 587–600.\n\nHinton, G., & Roweis, S. (2003). Stochastic neighbor embedding. In S. Becker, S. Thrun and K. Obermayer (Eds.), Advances in Neural Information Processing Systems 15, 833–840. Cambridge, MA: MIT Press.\n\nHull, J. J. (1994). A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell., 16, 550–554.\n\nJoshi, S. C., & Miller, M. I. (2000). Landmark matching via large deformation diffeomorphisms. IEEE Transactions on Image Processing, 9, 1357–1370.\n\nKaraçali, B., & Davatzikos, C. (2003). Topology preservation and regularity in estimated deformation fields. Information Processing in Medical Imaging (pp. 426–437).\n\nLawrence, N. D. (2004). Gaussian process latent variable models for visualisation of high dimensional data. In S. Thrun, L. Saul and B. Schölkopf (Eds.), NIPS 16. Cambridge, MA: MIT Press.\n\nLawrence, N. D., & Candela, J. Q. (2006). Local distance preservation in the GP-LVM through back constraints. In International Conference on Machine Learning, 513–520. 
ACM.\n\nSchölkopf, B., Herbrich, R., & Smola, A. J. (2001). A generalized representer theorem. Proc. of the 14th Annual Conf. on Computational Learning Theory (pp. 416–426). London, UK: Springer-Verlag.\n\nSong, L., Smola, A., Borgwardt, K., & Gretton, A. (2008). Colored maximum variance unfolding. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.), NIPS 20, 1385–1392. Cambridge, MA: MIT Press.\n\nTenenbaum, J. B., de Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2319–2323.\n\nvan der Maaten, L. J. P., Postma, E., & van den Herik, H. (2008). Dimensionality reduction: A comparative review. In T. Ertl (Ed.), Submitted to Neurocognition. Elsevier.\n\nVenna, J. (2007). Dimensionality reduction for visual exploration of similarity structures. Doctoral dissertation, Helsinki University of Technology.", "award": [], "sourceid": 545, "authors": [{"given_name": "Christian", "family_name": "Walder", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}]}