{"title": "The Manifold Tangent Classifier", "book": "Advances in Neural Information Processing Systems", "page_first": 2294, "page_last": 2302, "abstract": "", "full_text": "The Manifold Tangent Classi\ufb01er\n\nSalah Rifai, Yann N. Dauphin, Pascal Vincent, Yoshua Bengio, Xavier Muller\n\nDepartment of Computer Science and Operations Research\n\n{rifaisal, dauphiya, vincentp, bengioy, mullerx}@iro.umontreal.ca\n\nUniversity of Montreal\n\nMontreal, H3C 3J7\n\nAbstract\n\nWe combine three important ideas present in previous work for building classi-\n\ufb01ers: the semi-supervised hypothesis (the input distribution contains information\nabout the classi\ufb01er), the unsupervised manifold hypothesis (data density concen-\ntrates near low-dimensional manifolds), and the manifold hypothesis for classi\ufb01-\ncation (different classes correspond to disjoint manifolds separated by low den-\nsity). We exploit a novel algorithm for capturing manifold structure (high-order\ncontractive auto-encoders) and we show how it builds a topological atlas of charts,\neach chart being characterized by the principal singular vectors of the Jacobian of\na representation mapping. This representation learning algorithm can be stacked\nto yield a deep architecture, and we combine it with a domain knowledge-free\nversion of the TangentProp algorithm to encourage the classi\ufb01er to be insensitive\nto local directions changes along the manifold. Record-breaking classi\ufb01cation\nresults are obtained.\n\n1\n\nIntroduction\n\nMuch of machine learning research can be viewed as an exploration of ways to compensate for\nscarce prior knowledge about how to solve a speci\ufb01c task by extracting (usually implicit) knowledge\nfrom vast amounts of data. This is especially true of the search for generic learning algorithms that\nare to perform well on a wide range of domains for which they were not speci\ufb01cally tailored. 
While\nsuch an outlook precludes using much domain-speci\ufb01c knowledge in designing the algorithms, it\ncan however be bene\ufb01cial to leverage what might be called \u201cgeneric\u201d prior hypotheses, that appear\nlikely to hold for a wide range of problems. The approach studied in the present work exploits three\nsuch prior hypotheses:\n\n1. The semi-supervised learning hypothesis, according to which learning aspects of the in-\nput distribution p(x) can improve models of the conditional distribution of the supervised\ntarget p(y|x), i.e., p(x) and p(y|x) share something (Lasserre et al., 2006). This hypoth-\nesis underlies not only the strict semi-supervised setting where one has many more unla-\nbeled examples at his disposal than labeled ones, but also the successful unsupervised pre-\ntraining approach for learning deep architectures, which has been shown to signi\ufb01cantly\nimprove supervised performance even without using additional unlabeled examples (Hin-\nton et al., 2006; Bengio, 2009; Erhan et al., 2010).\n\n2. The (unsupervised) manifold hypothesis, according to which real world data presented in\nhigh dimensional spaces is likely to concentrate in the vicinity of non-linear sub-manifolds\nof much lower dimensionality (Cayton, 2005; Narayanan and Mitter, 2010).\n\n3. The manifold hypothesis for classi\ufb01cation, according to which points of different classes\nare likely to concentrate along different sub-manifolds, separated by low density regions of\nthe input space.\n\n1\n\n\fThe recently proposed Contractive Auto-Encoder (CAE) algorithm (Rifai et al., 2011a), based on\nthe idea of encouraging the learned representation to be robust to small variations of the input,\nwas shown to be very effective for unsupervised feature learning. Its successful application in the\npre-training of deep neural networks is yet another illustration of what can be gained by adopting\nhypothesis 1. In addition, Rifai et al. 
(2011a) propose, and show empirical evidence for, the hypoth-\nesis that the trade-off between reconstruction error and the pressure to be insensitive to variations\nin input space has an interesting consequence: It yields a mostly contractive mapping that, locally\naround each training point, remains substantially sensitive only to a few input directions (with differ-\nent directions of sensitivity for different training points). This is taken as evidence that the algorithm\nindirectly exploits hypothesis 2 and models a lower-dimensional manifold. Most of the directions\nto which the representation is substantially sensitive are thought to be directions tangent to the data-\nsupporting manifold (those that locally de\ufb01ne its tangent space).\nThe present work follows through on this interpretation, and investigates whether it is possible to\nuse this information, that is presumably captured about manifold structure, to further improve clas-\nsi\ufb01cation performance by leveraging hypothesis 3. To that end, we extract a set of basis vectors\nfor the local tangent space at each training point from the Contractive Auto-Encoder\u2019s learned pa-\nrameters. This is obtained with a Singular Value Decomposition (SVD) of the Jacobian of the\nencoder that maps each input to its learned representation. Based on hypothesis 3, we then adopt\nthe \u201cgeneric prior\u201d that class labels are likely to be insensitive to most directions within these local\ntangent spaces (ex: small translations, rotations or scalings usually do not change an image\u2019s class).\nSupervised classi\ufb01cation algorithms that have been devised to ef\ufb01ciently exploit tangent directions\ngiven as domain-speci\ufb01c prior-knowledge (Simard et al., 1992, 1993), can readily be used instead\nwith our learned tangent spaces. In particular, we will show record-breaking improvements by using\nTangentProp for \ufb01ne tuning CAE-pre-trained deep neural networks. 
To the best of our knowledge\nthis is the \ufb01rst time that the implicit relationship between an unsupervised learned mapping and\nthe tangent space of a manifold is rendered explicit and successfully exploited for the training of a\nclassi\ufb01er. This showcases a uni\ufb01ed approach that simultaneously leverages all three \u201cgeneric\u201d prior\nhypotheses considered. Our experiments (see Section 6) show that this approach sets new records\nfor domain-knowledge-free performance on several real-world classi\ufb01cation problems. Remarkably,\nin some cases it even outperformed methods that use weak or strong domain-speci\ufb01c prior knowl-\nedge (e.g. convolutional networks and tangent distance based on a-priori known transformations).\nNaturally, this approach is even more likely to be bene\ufb01cial for datasets where no prior knowledge\nis readily available.\n\n2 Contractive auto-encoders (CAE)\n\nWe consider the problem of the unsupervised learning of a non-linear feature extractor from a dataset\nD = {x1, . . . , xn}. Examples xi \u2208 IRd are i.i.d. samples from an unknown distribution p(x).\n\n2.1 Traditional auto-encoders\n\nThe auto-encoder framework is one of the oldest and simplest techniques for the unsupervised learn-\ning of non-linear feature extractors. It learns an encoder function h, that maps an input x \u2208 IRd to a\nhidden representation h(x) \u2208 IRdh, jointly with a decoder function g, that maps h back to the input\nspace as r = g(h(x)) the reconstruction of x. The encoder and decoder\u2019s parameters \u03b8 are learned\nby stochastic gradient descent to minimize the average reconstruction error L(x, g(h(x))) for the\nexamples of the training set. 
The objective being minimized is:\n\nJ_AE(\u03b8) = \u2211_{x \u2208 D} L(x, g(h(x))).    (1)\n\nWe will use the most common forms of encoder, decoder, and reconstruction error:\nEncoder: h(x) = s(W x + b_h), where s is the element-wise logistic sigmoid s(z) = 1 / (1 + e^{\u2212z}).\nParameters are a d_h \u00d7 d weight matrix W and bias vector b_h \u2208 IR^{d_h}.\nDecoder: r = g(h(x)) = s_2(W^T h(x) + b_r). Parameters are W^T (tied weights, shared with\nthe encoder) and bias vector b_r \u2208 IR^d. Activation function s_2 is either a logistic sigmoid\n(s_2 = s) or the identity (linear decoder).\nLoss function: Either the squared error: L(x, r) = \u2016x \u2212 r\u2016^2 or Bernoulli cross-entropy: L(x, r) =\n\u2212\u2211_{i=1}^{d} [x_i log(r_i) + (1 \u2212 x_i) log(1 \u2212 r_i)].\n\nThe set of parameters of such an auto-encoder is \u03b8 = {W, b_h, b_r}.\nHistorically, auto-encoders were primarily viewed as a technique for dimensionality reduction,\nwhere a narrow bottleneck (i.e. d_h < d) was in effect acting as a capacity control mechanism.\nBy contrast, recent successes (Bengio et al., 2007; Ranzato et al., 2007a; Kavukcuoglu et al., 2009;\nVincent et al., 2010; Rifai et al., 2011a) tend to rely on rich, oftentimes over-complete representations\n(d_h > d), so that more sophisticated forms of regularization are required to pressure the\nauto-encoder to extract relevant features and avoid trivial solutions. Several successful techniques\naim at sparse representations (Ranzato et al., 2007a; Kavukcuoglu et al., 2009; Goodfellow et al.,\n2009). Alternatively, denoising auto-encoders (Vincent et al., 2010) change the objective from mere\nreconstruction to that of denoising.\n\n2.2 First order and higher order contractive auto-encoders\n\nMore recently, Rifai et al. 
(2011a) introduced the Contractive Auto-Encoder (CAE), which encourages\nrobustness of representation h(x) to small variations of a training input x, by penalizing its sensitivity\nto that input, measured as the Frobenius norm of the encoder\u2019s Jacobian J(x) = \u2202h/\u2202x (x). The\nregularized objective minimized by the CAE is the following:\n\nJ_CAE(\u03b8) = \u2211_{x \u2208 D} [ L(x, g(h(x))) + \u03bb\u2016J(x)\u2016^2 ],    (2)\n\nwhere \u03bb is a non-negative regularization hyper-parameter that controls how strongly the norm of the\nJacobian is penalized. Note that, with the traditional sigmoid encoder form given above, one can\neasily obtain the Jacobian of the encoder. Its j-th row is obtained from the j-th row of W as:\n\nJ(x)_j = \u2202h_j(x)/\u2202x = h_j(x)(1 \u2212 h_j(x)) W_j.    (3)\n\nComputing the extra penalty term (and its contribution to the gradient) is similar to computing the\nreconstruction error term (and its contribution to the gradient), thus relatively cheap.\nIt is also possible to penalize higher order derivatives (Hessian) by using a simple stochastic technique\nthat eschews computing them explicitly, which would be prohibitive. It suffices to penalize\ndifferences between the Jacobian at x and the Jacobian at nearby points x\u0303 = x + \u03b5 (stochastic corruptions\nof x). This yields the CAE+H (Rifai et al., 2011b) variant with the following optimization\nobjective:\n\nJ_CAE+H(\u03b8) = \u2211_{x \u2208 D} [ L(x, g(h(x))) + \u03bb\u2016J(x)\u2016^2 + \u03b3 E_{\u03b5 \u223c N(0, \u03c3^2 I)}[ \u2016J(x) \u2212 J(x + \u03b5)\u2016^2 ] ],    (4)\n\nwhere \u03b3 is an additional regularization hyper-parameter that controls how strongly we penalize\nlocal variations of the Jacobian, i.e. higher order derivatives. The expectation E is over the Gaussian\nnoise variable \u03b5. 
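As a minimal sketch of Eq. 3 and of the two penalty terms, assuming NumPy and a single sigmoid encoder layer, the Jacobian and its penalties could be computed as follows. The function names, the toy dimensions and the single-sample estimate of the expectation are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def encode(W, b_h, x):
    # h(x) = s(W x + b_h) for one input vector x of shape (d,)
    return sigmoid(W @ x + b_h)


def encoder_jacobian(W, b_h, x):
    # Eq. 3: row j of J(x) is h_j(x) * (1 - h_j(x)) * W_j, giving shape (d_h, d)
    h = encode(W, b_h, x)
    return (h * (1.0 - h))[:, None] * W


def contractive_penalty(W, b_h, x):
    # Squared Frobenius norm of the Jacobian (first-order CAE term)
    return float(np.sum(encoder_jacobian(W, b_h, x) ** 2))


def hessian_penalty_sample(W, b_h, x, sigma, rng):
    # One-sample estimate of E ||J(x) - J(x + eps)||^2 with eps ~ N(0, sigma^2 I)
    eps = sigma * rng.standard_normal(x.shape)
    diff = encoder_jacobian(W, b_h, x) - encoder_jacobian(W, b_h, x + eps)
    return float(np.sum(diff ** 2))


# Toy example with hypothetical sizes: d = 5 inputs and d_h = 3 hidden units
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))
b_h = rng.standard_normal(3)
x = rng.standard_normal(5)
penalty = contractive_penalty(W, b_h, x)
curvature = hessian_penalty_sample(W, b_h, x, sigma=0.1, rng=rng)
```

Both penalties reuse the hidden activations already needed for reconstruction, which is consistent with the remark that the extra term is relatively cheap.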
In practice, stochastic samples of \u03b5 are used for each stochastic gradient update.\nThe CAE+H is the variant used for our experiments.\n\n3 Characterizing the tangent bundle captured by a CAE\n\nRifai et al. (2011a) reason that, while the regularization term encourages insensitivity of h(x) in all\ninput space directions, this pressure is counterbalanced by the need for accurate reconstruction, thus\nresulting in h(x) being substantially sensitive only to the few input directions required to distinguish\nnearby training points. The geometric interpretation is that these directions span the local tangent\nspace of the underlying manifold that supports the data. The tangent bundle of a smooth manifold\nis the manifold along with the set of tangent planes taken at all points on it. Each such tangent\nplane can be equipped with a local Euclidean coordinate system or chart. In topology, an atlas is a\ncollection of such charts (like the locally Euclidean map in each page of a geographic atlas). Even\nthough the set of charts may form a non-Euclidean manifold (e.g., a sphere), each chart is Euclidean.\n\n3.1 Conditions for the feature mapping to define an atlas on a manifold\nIn order to obtain a proper atlas of charts, h must be a diffeomorphism. It must be smooth (C^\u221e) and\ninvertible on open Euclidean balls on the manifold M around the training points. Smoothness is\nguaranteed because of our choice of parametrization (affine + sigmoid). Injectivity (different values\nof x correspond to different values of h(x)) on the training examples is encouraged by minimizing\nreconstruction error (otherwise we cannot distinguish training examples x_i and x_j by only looking\nat h(x_i) and h(x_j)). 
Since h(x) = s(W x + b_h) and s is invertible, using the definition of injectivity\nwe get (by composing h(x_i) = h(x_j) with s^{\u22121})\n\n\u2200 i, j:  h(x_i) = h(x_j) \u21d4 W \u2206_ij = 0,\n\nwhere \u2206_ij = x_i \u2212 x_j. In order to preserve the injectivity of h, W has to form a basis spanned by\nits rows W_k, where \u2200 i, j \u2203 \u03b1 \u2208 IR^{d_h}, \u2206_ij = \u2211_{k=1}^{d_h} \u03b1_k W_k. With this condition satisfied, mapping\nh is injective in the subspace spanned by the variations in the training set. If we limit the codomain\nof h to h(X) \u2282 (0, 1)^{d_h} comprising values obtainable by h applied to some set X, then we obtain\nsurjectivity by definition, hence bijectivity of h between the training set D and h(D). Let M_x be an\nopen ball on the manifold M around training example x. By smoothness of the manifold M and\nof mapping h, we obtain bijectivity locally around the training examples (on the manifold) as well,\ni.e., between \u222a_{x \u2208 D} M_x and h(\u222a_{x \u2208 D} M_x).\n\n3.2 Obtaining an atlas from the learned feature mapping\nNow that we have necessary conditions for local invertibility of h(x) for x \u2208 D, let us consider\nhow to define the local chart around x from the nature of h. Because h must be sensitive to changes\nfrom an example x_i to one of its neighbors x_j, but insensitive to other changes (because of the CAE\npenalty), we expect that this will be reflected in the spectrum of the Jacobian matrix J(x) = \u2202h(x)/\u2202x\nat each training point x. In the ideal case where J(x) has rank k, h(x + \u03b5v) differs from h(x) only\nif v is in the span of the singular vectors of J(x) with non-zero singular value. In practice, J(x)\nhas many tiny singular values. Hence, we define a local chart around x using the Singular Value\nDecomposition of J^T(x) = U(x)S(x)V^T(x) (where U(x) and V(x) are orthogonal and S(x) is\ndiagonal). 
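The decomposition just described can be sketched numerically. The snippet below, a NumPy illustration under assumed toy dimensions, builds a rank-2 matrix standing in for J(x), takes the SVD of its transpose and keeps the left singular vectors whose singular values exceed a small threshold. The helper name tangent_basis and the threshold value are hypothetical.

```python
import numpy as np


def tangent_basis(J, threshold):
    # J has shape (d_h, d). SVD of J^T = U S V^T, so the columns of U live in input space.
    U, S, Vt = np.linalg.svd(J.T, full_matrices=False)
    keep = S > threshold
    # Columns of the returned matrix are the retained singular vectors
    return U[:, keep], S[keep]


# Stand-in Jacobian of rank 2 with hypothetical sizes d_h = 4 and d = 6
rng = np.random.default_rng(0)
J = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 6))
basis, singular_values = tangent_basis(J, threshold=1e-8)

# A direction orthogonal to the retained vectors leaves h unchanged to first order
v = rng.standard_normal(6)
v_orth = v - basis @ (basis.T @ v)
first_order_change = J @ v_orth
```

With a learned encoder the small singular values are tiny rather than exactly zero, so the threshold (or a fixed number of leading vectors) becomes a choice to be validated.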
The tangent plane H_x at x is given by the span of the set of principal singular vectors B_x:\n\nB_x = {U_{\u00b7k}(x) | S_kk(x) > \u03b5} and H_x = {x + v | v \u2208 span(B_x)},\n\nwhere U_{\u00b7k}(x) is the k-th column of U(x), and span({z_k}) = {x | x = \u2211_k w_k z_k, w_k \u2208 IR}. We can\nthus define an atlas A captured by h, based on the local linear approximation around each example:\n\nA = {(M_x, \u03c6_x) | x \u2208 D, \u03c6_x(x\u0303) = B_x(x\u0303 \u2212 x)}.    (5)\n\nNote that this way of obtaining an atlas can also be applied to subsequent layers of a deep network.\nIt is thus possible to use a greedy layer-wise strategy to initialize a network with CAEs (Rifai et al.,\n2011a) and obtain an atlas that corresponds to the nonlinear features computed at any layer.\n\n4 Exploiting the learned tangent directions for classification\n\nUsing the previously defined charts for every point of the training set, we propose to use this additional\ninformation provided by unsupervised learning to improve the performance of the supervised\ntask. In this we adopt the manifold hypothesis for classification mentioned in the introduction.\n\n4.1 CAE-based tangent distance\n\nOne way of achieving this is to use a nearest neighbor classifier with a similarity criterion defined\nas the shortest distance between two hyperplanes (Simard et al., 1993). The tangents extracted at\neach point will allow us to shrink the distances between two samples when they can approximate\neach other by a linear combination of their local tangents. Following Simard et al. (1993), we\ndefine the tangent distance between two points x and y as the distance between the two hyperplanes\nH_x, H_y \u2282 IR^d spanned respectively by B_x and B_y. 
Using the usual definition of distance between\ntwo spaces, d(H_x, H_y) = inf{\u2016z \u2212 w\u2016^2 | (z, w) \u2208 H_x \u00d7 H_y}, we obtain the solution for this convex\nproblem by solving a system of linear equations (Simard et al., 1993). This procedure corresponds\nto allowing the considered points x and y to move along the directions spanned by their associated\nlocal charts. Their distance is then evaluated on the new coordinates where the distance is minimal.\nWe can then use a nearest neighbor classifier based on this distance.\n\n4.2 CAE-based tangent propagation\n\nNearest neighbor techniques are often impractical for large scale datasets because their computational\nrequirements scale linearly with n for each test case. By contrast, once trained, neural\nnetworks yield fast responses for test cases. We can also leverage the extracted local charts when\ntraining a neural network. Following the tangent propagation approach of Simard et al. (1992),\nbut exploiting our learned tangents, we encourage the output o of a neural network classifier to be\ninsensitive to variations in the directions of the local chart of x by adding the following penalty to\nits supervised objective function:\n\n\u2126(x) = \u2211_{u \u2208 B_x} \u2016 \u2202o/\u2202x (x) u \u2016^2    (6)\n\nContribution of this term to the gradients of network parameters can be computed in O(N_w), where\nN_w is the number of neural network weights.\n\n4.3 The Manifold Tangent Classifier (MTC)\n\nPutting it all together, here is the high level summary of how we build and train a deep network:\n\n1. Train (unsupervised) a stack of K CAE+H layers (Eq. 4). Each is trained in turn on the\nrepresentation learned by the previous layer.\n\n2. 
For each x_i \u2208 D compute the Jacobian of the last layer representation J^{(K)}(x_i) =\n\u2202h^{(K)}/\u2202x (x_i) and its SVD (footnote 1). Store the leading d_M singular vectors in set B_{x_i}.\n\n3. On top of the K pre-trained layers, stack an output layer whose size is the number of classes. Fine-tune\nthe whole network for supervised classification (footnote 2) with an added tangent propagation\npenalty (Eq. 6), using for each x_i the tangent directions B_{x_i}.\n\nWe call this deep learning algorithm the Manifold Tangent Classifier (MTC). Alternatively, instead\nof step 3, one can use the tangent vectors in B_{x_i} in a tangent distance nearest neighbors classifier.\n\n5 Related prior work\n\nMany Non-Linear Manifold Learning algorithms (Roweis and Saul, 2000; Tenenbaum et al.,\n2000) have been proposed which can automatically discover the main directions of variation around\neach training point, i.e., the tangent bundle. Most of these algorithms are non-parametric and local,\ni.e., explicitly parametrizing the tangent plane around each training point (with a separate set of\nparameters for each, or derived mostly from the set of training examples in every neighborhood),\nas most explicitly seen in Manifold Parzen Windows (Vincent and Bengio, 2003) and manifold\nCharting (Brand, 2003). See Bengio and Monperrus (2005) for a critique of local non-parametric\nmanifold algorithms: they might require a number of training examples which grows exponentially\nwith manifold dimension and curvature (more crooks and valleys in the manifold will require more\nexamples). One attempt to generalize the manifold shape non-locally (Bengio et al., 2006) is based\non explicitly predicting the tangent plane associated to any given point x, as a parametrized function\nof x. 
Note that these algorithms all explicitly exploit training set neighborhoods (see Figure 2), i.e.\nthey use pairs or tuples of points, with the goal to explicitly model the tangent space, while it is\nmodeled implicitly by the CAE\u2019s objective function (that is not based on pairs of points). More recently,\nthe Local Coordinate Coding (LCC) algorithm (Yu et al., 2009) and its Local Tangent LCC\nvariant (Yu and Zhang, 2010) were proposed to build a local chart around each training example\n(with a local low-dimensional coordinate system around it) and use it to define a representation for\neach input x: the responsibility of each local chart/anchor in explaining input x and the coordinate\nof x in each local chart. That representation is then fed to a classifier and yields better generalization\nthan x itself.\n\nFootnote 1: J^{(K)} is the product of the Jacobians of each encoder (see Eq. 3) in the stack. It suffices to compute its\nleading d_M SVD vectors and singular values. This is achieved in O(d_M \u00d7 d \u00d7 d_h) per training example. For\ncomparison, the cost of a forward propagation through a single MLP layer is O(d \u00d7 d_h) per example.\n\nFootnote 2: A sigmoid output layer is preferred because computing its Jacobian is straightforward and efficient (Eq. 3).\nThe supervised cost used is the cross entropy. Training is by stochastic gradient descent.\n\nThe tangent distance (Simard et al., 1993) and TangentProp (Simard et al., 1992) algorithms were\ninitially designed to exploit prior domain-knowledge of directions of invariance (ex: knowledge that\nthe class of an image should be invariant to small translations, rotations or scalings in the image\nplane). However, any algorithm able to output a chart for a training point might potentially be used,\nas we do here, to provide directions to a Tangent distance or TangentProp (Simard et al., 1992)\nbased classifier. 
Our approach is nevertheless unique as the CAE\u2019s unsupervised feature learning\ncapabilities are used simultaneously to provide a good initialization of deep network layers and a\ncoherent non-local predictor of tangent spaces. TangentProp is itself closely related to the Double\nBackpropagation algorithm (Drucker and LeCun, 1992), in which one instead adds a penalty that is\nthe sum of squared derivatives of the prediction error (with respect to the network input). Whereas\nTangentProp attempts to make the output insensitive to selected directions of change, the double\nbackpropagation penalty term attempts to make the error at a training example invariant to changes\nin all directions. Since one is also trying to minimize the error at the training example, this amounts\nto making that minimization more robust, i.e., extend it to the neighborhood of the training examples.\nAlso related is the Semi-Supervised Embedding algorithm (Weston et al., 2008). In addition to\nminimizing a supervised prediction error, it encourages each layer of representation of a deep ar-\nchitecture to be invariant when the training example is changed from x to a near neighbor of x in\nthe training set. This algorithm works implicitly under the hypothesis that the variable y to pre-\ndict from x is invariant to the local directions of change present between nearest neighbors. This\nis consistent with the manifold hypothesis for classi\ufb01cation (hypothesis 3 mentioned in the intro-\nduction). Instead of removing variability along the local directions of variation, the Contractive\nAuto-Encoder (Rifai et al., 2011a) initially \ufb01nds a representation which is most sensitive to them,\nas we explained in section 2.\n\n6 Experiments\n\nWe conducted experiments to evaluate our approach and the quality of the manifold tangents learned\nby the CAE, using a range of datasets from different domains:\nMNIST is a dataset of 28 \u00d7 28 images of handwritten digits. 
The learning task is to predict the\ndigit contained in the images. Reuters Corpus Volume I is a popular benchmark for document\nclassification. It consists of 800,000 real-world news wire stories made available by Reuters. We\nused the 2000 most frequent words calculated on the whole dataset to create a bag-of-words vector\nrepresentation. We used the LYRL2004 split to separate between a train and test set. CIFAR-10 is\na dataset of 60,000 32 \u00d7 32 RGB real-world images. It contains images of real-world objects (e.g.\ncars, animals) with all the variations present in natural images (e.g. backgrounds). Forest Cover\nType is a large-scale database of cartographic variables for the prediction of forest cover types made\navailable by the US Forest Service.\nWe investigate whether leveraging the CAE learned tangents leads to better classification performance\non these problems, using the following methodology: Optimal hyper-parameters for (a stack\nof) CAEs are selected by cross-validation on a disjoint validation set extracted from the training set.\nThe quality of the feature extractor and tangents captured by the CAEs is evaluated by initializing a\nneural network (MLP) with the same parameters and fine-tuning it by backpropagation on the supervised\nclassification task. The optimal strength of the supervised TangentProp penalty and the number of\ntangents d_M are also cross-validated.\n\nResults\n\nFigure 1 shows a visualization of the tangents learned by the CAE. On MNIST, the tangents mostly\ncorrespond to small geometrical transformations like translations and rotations. On CIFAR-10, the\nmodel also learns sensible tangents, which seem to correspond to changes in the parts of objects.\nThe tangents on RCV1-v2 correspond to the addition or removal of similar words and removal of\nirrelevant words. We also note that extracting the tangents of the model is a way to visualize what\nthe model has learned about the structure of the manifold. Interestingly, we see that hypothesis 3\nholds for these datasets because most tangents do not change the class of the example.\n\nFigure 1: Visualisation of the tangents learned by the CAE for MNIST, CIFAR-10 and RCV1 (top\nto bottom). The left-most column is the example and the following columns are its tangents. On\nRCV1, we show the tangents of a document with the topic \u201cTrading & Markets\u201d (MCAT) with the\nnegative terms in red(-) and the positive terms in green(+).\n\nFigure 2: Tangents extracted by local PCA on CIFAR-10. This shows the limitation of approaches\nthat rely on training set neighborhoods.\n\nTable 1: Classification accuracy on several datasets using KNN variants measured on 10,000 test\nexamples with 1,000 training examples. The KNN is trained on the raw input vector using the\nEuclidean distance while the K-layer+KNN is computed on the representation learned by a K-layer\nCAE. The KNN+Tangents uses at every sample the local charts extracted from the 1-layer CAE to\ncompute tangent distance.\n\nDataset      KNN    KNN+Tangents   1-Layer CAE+KNN   2-Layer CAE+KNN\nMNIST        86.9   88.7           90.55             91.15\nCIFAR-10     25.4   26.5           25.1              -\nCOVERTYPE    70.2   70.98          69.54             67.45\n\nWe use KNN with tangent distance to evaluate the quality of the learned tangents more objectively.\nTable 1 shows that using the tangents extracted from a CAE always leads to better performance than\na traditional KNN.\nAs described in section 4.2, the tangents extracted by the CAE can be used for fine-tuning the multi-layer\nperceptron using tangent propagation, yielding our Manifold Tangent Classifier (MTC). As it\nis a semi-supervised approach, we evaluate its effectiveness with a varying amount of labeled examples\non MNIST. Following Weston et al. 
(2008), the unsupervised feature extractor is trained on the\nfull training set and the supervised classifier is trained on a restricted labeled set. Table 2 shows our\nresults for a single hidden layer MLP initialized with CAE+H pretraining (noted CAE for brevity)\nand for the same classifier fine-tuned with tangent propagation (i.e. the manifold tangent classifier of\nsection 4.3, noted MTC). The methods that do not leverage the semi-supervised learning hypothesis\n(Support Vector Machines, traditional Neural Networks and Convolutional Neural Networks) give\nvery poor performance when the amount of labeled data is low. In some cases, the methods that can\nlearn from unlabeled data can reduce the classification error by half. The CAE gives better results\nthan other approaches across almost the whole range considered. It shows that the features extracted\nfrom the rich unlabeled data distribution give a good inductive prior for the classification task. Note\nthat the MTC consistently outperforms the CAE on this benchmark.\n\nTable 2: Semi-supervised classification error on the MNIST test set with 100, 600, 1000 and 3000\nlabeled training examples. We compare our method with results from (Weston et al., 2008; Ranzato\net al., 2007b; Salakhutdinov and Hinton, 2007).\n\nLabeled   NN      SVM     CNN     TSVM    DBN-rNCA   EmbedNN   CAE     MTC\n100       25.81   23.44   22.98   16.81   -          16.86     13.47   12.03\n600       11.44   8.85    7.68    6.16    8.7        5.97      6.3     5.13\n1000      10.7    7.77    6.45    5.38    -          5.73      4.77    3.64\n3000      6.04    4.21    3.35    3.45    3.3        3.59      3.22    2.57\n\nTable 3: Classification error on the MNIST test set with the full training set.\n\nK-NN    NN      SVM     DBN     CAE     DBM     CNN     MTC\n3.09%   1.60%   1.40%   1.17%   1.04%   0.95%   0.95%   0.81%\n\nTable 3 shows our results on the full MNIST dataset with some results taken from (LeCun et al.,\n1999; Hinton et al., 2006). 
The CAE in this table is a two-layer deep network with 2000 units\nper layer pretrained with the CAE+H objective. The MTC uses the same stack of CAEs trained\nwith tangent propagation using 15 tangents. The prior state of the art for the permutation invariant\nversion of the task was set by the Deep Boltzmann Machines (Salakhutdinov and Hinton, 2009)\nat 0.95%. Using our approach, we reach 0.81% error on the test set. Remarkably, the MTC also\noutperforms the basic Convolutional Neural Network (CNN) even though the CNN exploits prior\nknowledge about vision using convolution and pooling to enhance the results.\n\nTable 4: Classification error on the Forest CoverType dataset.\n\nSVM     Distributed SVM   MTC\n4.11%   3.46%             3.13%\n\nWe also trained a 4 layer MTC on the Forest CoverType dataset. Following Trebar and Steele\n(2008), we use the data split DS2-581 which contains over 500,000 training examples. The MTC\nyields the best performance for the classification task beating the previous state of the art held by\nthe distributed SVM (mixture of several non-linear SVMs).\n\n7 Conclusion\n\nIn this work, we have shown a new way to characterize a manifold by extracting a local chart at\neach data point based on the unsupervised feature mapping built with a deep learning approach.\nThe developed Manifold Tangent Classifier successfully leverages three common \u201cgeneric prior\nhypotheses\u201d in a unified manner. It learns a meaningful representation that captures the structure\nof the manifold, and can leverage this knowledge to reach superior classification performance. On\ndatasets from different domains, it successfully achieves state of the art performance.\n\nAcknowledgments The authors would like to acknowledge the support of the following agencies\nfor research funding and computing support: NSERC, FQRNT, Calcul Qu\u00e9bec and CIFAR.\n\nReferences\nBengio, Y. (2009). Learning deep architectures for AI. 
Foundations and Trends in Machine Learning, 2(1),\n\n1\u2013127. Also published as a book. Now Publishers, 2009.\n\nBengio, Y. and Monperrus, M. (2005). Non-local manifold tangent learning. In NIPS\u201904, pages 129\u2013136. MIT\n\nPress.\n\nBengio, Y., Larochelle, H., and Vincent, P. (2006). Non-local manifold parzen windows. In NIPS\u201905, pages\n\n115\u2013122. MIT Press.\n\n8\n\n\fBengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks.\n\nIn Advances in NIPS 19.\n\nBrand, M. (2003). Charting a manifold. In NIPS\u201902, pages 961\u2013968. MIT Press.\nCayton, L. (2005). Algorithms for manifold learning. Technical Report CS2008-0923, UCSD.\nDrucker, H. and LeCun, Y. (1992).\n\nImproving generalisation performance using double back-propagation.\n\nIEEE Transactions on Neural Networks, 3(6), 991\u2013997.\n\nErhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., and Bengio, S. (2010). Why does unsuper-\n\nvised pre-training help deep learning? JMLR, 11, 625\u2013660.\n\nGoodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring invariances in deep networks. In NIPS\u201909,\n\npages 646\u2013654.\n\nHinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural\n\nComputation, 18, 1527\u20131554.\n\nKavukcuoglu, K., Ranzato, M., Fergus, R., and LeCun, Y. (2009). Learning invariant features through topo-\n\ngraphic \ufb01lter maps. pages 1605\u20131612. IEEE.\n\nLasserre, J. A., Bishop, C. M., and Minka, T. P. (2006). Principled hybrids of generative and discriminative\n\nmodels. pages 87\u201394, Washington, DC, USA. IEEE Computer Society.\n\nLeCun, Y., Haffner, P., Bottou, L., and Bengio, Y. (1999). Object recognition with gradient-based learning. In\n\nShape, Contour and Grouping in Computer Vision, pages 319\u2013345. Springer.\n\nNarayanan, H. and Mitter, S. (2010). Sample complexity of testing the manifold hypothesis. In J. Lafferty,\nC. K. I. 
Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1786\u20131794.\n\nRanzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007a). Efficient learning of sparse representations with an energy-based model. In NIPS\u201906.\n\nRanzato, M., Huang, F., Boureau, Y., and LeCun, Y. (2007b). Unsupervised learning of invariant feature hierarchies with applications to object recognition. IEEE Press.\n\nRifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011a). Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the Twenty-eighth International Conference on Machine Learning (ICML\u201911).\n\nRifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X. (2011b). Higher order contractive auto-encoder. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD).\n\nRoweis, S. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323\u20132326.\n\nSalakhutdinov, R. and Hinton, G. E. (2007). Learning a nonlinear embedding by preserving class neighbourhood structure. In AISTATS\u20192007, San Juan, Puerto Rico. Omnipress.\n\nSalakhutdinov, R. and Hinton, G. E. (2009). Deep Boltzmann machines. In AISTATS\u20192009, volume 5, pages 448\u2013455.\n\nSimard, P., Victorri, B., LeCun, Y., and Denker, J. (1992). Tangent prop - A formalism for specifying selected invariances in an adaptive network. In NIPS\u201991, pages 895\u2013903, San Mateo, CA. Morgan Kaufmann.\n\nSimard, P. Y., LeCun, Y., and Denker, J. (1993). Efficient pattern recognition using a new transformation distance. In NIPS\u201992, pages 50\u201358. Morgan Kaufmann, San Mateo.\n\nTenenbaum, J., de Silva, V., and Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. 
Science, 290(5500), 2319\u20132323.\n\nTrebar, M. and Steele, N. (2008). Application of distributed svm architectures in classifying forest data cover types. Computers and Electronics in Agriculture, 63(2), 119\u2013130.\n\nVincent, P. and Bengio, Y. (2003). Manifold parzen windows. In NIPS\u201902. MIT Press.\n\nVincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11, 3371\u20133408.\n\nWeston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised embedding. In ICML 2008, pages 1168\u20131175, New York, NY, USA.\n\nYu, K. and Zhang, T. (2010). Improved local coordinate coding using local tangents.\n\nYu, K., Zhang, T., and Gong, Y. (2009). Nonlinear learning using local coordinate coding. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 2223\u20132231.\n", "award": [], "sourceid": 4409, "authors": [{"given_name": "Salah", "family_name": "Rifai", "institution": null}, {"given_name": "Yann", "family_name": "Dauphin", "institution": null}, {"given_name": "Pascal", "family_name": "Vincent", "institution": null}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": null}, {"given_name": "Xavier", "family_name": "Muller", "institution": null}]}