{"title": "Large Margin Component Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 1385, "page_last": 1392, "abstract": null, "full_text": "Large Margin Component Analysis\n\nLorenzo Torresani\n\nRiya, Inc.\n\nlorenzo@riya.com\n\nKuang-chih Lee\n\nRiya, Inc.\n\nkclee@riya.com\n\nAbstract\n\nMetric learning has been shown to signi\ufb01cantly improve the accuracy of k-nearest\nneighbor (kNN) classi\ufb01cation. In problems involving thousands of features, dis-\ntance learning algorithms cannot be used due to over\ufb01tting and high computa-\ntional complexity. In such cases, previous work has relied on a two-step solution:\n\ufb01rst apply dimensionality reduction methods to the data, and then learn a met-\nric in the resulting low-dimensional subspace. In this paper we show that better\nclassi\ufb01cation performance can be achieved by unifying the objectives of dimen-\nsionality reduction and metric learning. We propose a method that solves for\nthe low-dimensional projection of the inputs, which minimizes a metric objective\naimed at separating points in different classes by a large margin. This projection\nis de\ufb01ned by a signi\ufb01cantly smaller number of parameters than metrics learned\nin input space, and thus our optimization reduces the risks of over\ufb01tting. Theory\nand results are presented for both a linear as well as a kernelized version of the\nalgorithm. Overall, we achieve classi\ufb01cation rates similar, and in several cases\nsuperior, to those of support vector machines.\n\n1 Introduction\n\nThe technique of k-nearest neighbor (kNN) is one of the most popular classi\ufb01cation algorithms.\nSeveral reasons account for the widespread use of this method: it is straightforward to implement,\nit generally leads to good recognition performance thanks to the non-linearity of its decision bound-\naries, and its complexity is independent of the number of classes. 
In addition, unlike most alternatives, kNN can be applied even in scenarios where not all categories are given at the time of training, such as in face verification applications where the subjects to be recognized are not known in advance.\n\nThe distance metric defining the neighbors of a query point plays a fundamental role in the accuracy of kNN classification. In most cases Euclidean distance is used as a similarity measure. This choice is logical when it is not possible to study the statistics of the data prior to classification or when it is fair to assume that all features are equally scaled and equally relevant. However, in most cases the data is distributed such that distance analysis along some specific directions of the feature space can be more informative than along others. In such cases, and when training data is available in advance, distance metric learning [5, 10, 4, 1, 9] has been shown to yield significant improvement in kNN classification. The key idea of these methods is to apply transformations to the data in order to emphasize the most discriminative directions. Euclidean distance computation in the transformed space is then equivalent to a non-uniform metric analysis in the original input space.\n\nIn this paper we are interested in cases where the data to be used for classification is very high-dimensional. An example is classification of imagery data, which often involves input spaces of thousands of dimensions, corresponding to the number of pixels. Metric learning in such high-dimensional spaces cannot be carried out due to overfitting and high computational complexity. In these scenarios, even kNN classification is prohibitively expensive in terms of storage and computational costs. The traditional solution is to apply dimensionality reduction methods to the data and then learn a suitable metric in the resulting low-dimensional subspace. 
For example, Principal Component Analysis (PCA) can be used to compute a linear mapping that reduces the data to tractable dimensions. However, dimensionality reduction methods generally optimize objectives unrelated to classification and, as a consequence, might generate representations that are significantly less discriminative than the original data. Thus, metric learning within the subspace might lead to suboptimal similarity measures. In this paper we show that better performance can be achieved by directly solving for a low-dimensional embedding that optimizes a measure of kNN classification performance.\n\nOur approach is inspired by the solution proposed by Weinberger et al. [9]. Their technique learns a metric that attempts to shrink distances of neighboring similarly-labeled points and to separate points in different classes by a large margin. Our contribution over previous work is twofold:\n\n1. We describe the Large Margin Component Analysis (LMCA) algorithm, a technique that solves directly for a low-dimensional embedding of the data such that Euclidean distance in this space minimizes the large margin metric objective described in [9]. Our approach solves for only D \u00b7 d unknowns, where D is the dimensionality of the inputs and d is the dimensionality of the target space. By contrast, the algorithm of Weinberger et al. [9] learns a Mahalanobis distance of the inputs, which requires solving for a D \u00d7 D matrix using iterative semidefinite programming methods. This optimization is infeasible for large values of D.\n\n2. We propose a technique that learns Mahalanobis distance metrics in nonlinear feature spaces. Our approach combines the goal of dimensionality reduction with a novel \u201ckernelized\u201d version of the metric learning objective of Weinberger et al. [9]. We describe an algorithm that optimizes this combined objective directly. 
We demonstrate that, even when data is low-dimensional and dimensionality reduction is not needed, this technique can be used to learn nonlinear metrics leading to significant improvement in kNN classification accuracy over [9].\n\n2 Linear Dimensionality Reduction for Large Margin kNN Classification\n\nIn this section we briefly review the algorithm presented in [9] for metric learning in the context of kNN classification. We then describe how this approach can be generalized to compute low-dimensional projections of the inputs via a novel direct optimization.\n\nA fundamental characteristic of kNN is that its performance does not depend on linear separability of classes in input space: in order to achieve accurate kNN classification it is sufficient that the majority of the k-nearest points of each test example have the correct label. The work of Weinberger et al. [9] exploits this property by learning a linear transformation of the input space that aims at creating consistently labeled k-nearest neighborhoods, i.e. clusters where each training example and its k-nearest points have the same label and where differently labeled points are distanced by an additional safety margin. Specifically, given n input examples x1, ..., xn in \u211cD and corresponding class labels y1, ..., yn, the technique in [9] learns the D \u00d7 D transformation matrix L that optimizes the following objective function:\n\n\u03b5(L) = \u03a3ij \u03b7ij ||L(xi \u2212 xj)||\u00b2 + c \u03a3ijl \u03b7ij (1 \u2212 yil) h(||L(xi \u2212 xj)||\u00b2 \u2212 ||L(xi \u2212 xl)||\u00b2 + 1),   (1)\n\nwhere \u03b7ij \u2208 {0, 1} is a binary variable indicating whether example xj is one of the k-closest points of xi that share the same label yi, c is a positive constant, yil \u2208 {0, 1} is 1 iff yi = yl, and h(s) = max(s, 0) is the hinge function. The objective \u03b5(L) consists of two contrasting terms. 
The first term pulls closer together points that share the same label and were neighbors in the original space. The second term encourages distancing each example xi from differently labeled points by an amount equal to 1 plus the distance from xi to any of its k similarly-labeled closest points. This term corresponds to a margin condition similar to that of SVMs and is used to improve generalization. The constant c controls the relative importance of these two competing terms and can be chosen via cross validation.\n\nUpon optimization of \u03b5(L), test example xq is classified according to the kNN rule applied to its projection x\u2032q = Lxq, using Euclidean distance as metric. Equivalently, such classification can be interpreted as kNN classification in the original input space under the Mahalanobis distance metric induced by the matrix M = L^T L. Although Equation 1 is non-convex in L, it can be rewritten as a semidefinite program \u03b5(M) in terms of the metric M [9]. Thus, optimizing the objective in M guarantees convergence to the global minimum, regardless of initialization.\n\nWhen data is very high-dimensional, minimization of \u03b5(M) using semidefinite programming methods is impractical because of slow convergence and overfitting problems. In such cases [9] proposes applying dimensionality reduction methods, such as PCA, followed by metric learning within the resulting low-dimensional subspace. As outlined above, this procedure leads to suboptimal metric learning. In this paper we propose an alternative approach that solves jointly for dimensionality reduction and metric learning. The key idea is to choose the transformation L in Equation 1 to be a nonsquare matrix of size d \u00d7 D, with d \u226a D. Thus L defines a mapping from the high-dimensional input space to a low-dimensional embedding. 
Euclidean distance in this low-dimensional embedding is equivalent to Mahalanobis distance in the original input space under the rank-deficient metric M = L^T L (M now has rank at most d).\n\nUnfortunately, optimization of \u03b5(M) subject to rank constraints on M leads to a minimization problem that is no longer convex [8] and that is awkward to solve. Here we propose an approach for minimizing the objective that differs from the one used in [9]. The idea is to optimize Equation 1 directly with respect to the nonsquare matrix L. We argue that minimizing the objective with respect to L, rather than with respect to the rank-deficient D \u00d7 D matrix M, offers several advantages. First, our optimization involves only d \u00b7 D rather than D\u00b2 unknowns, which considerably reduces the risk of overfitting. Second, the optimal rectangular matrix L computed with our method automatically satisfies the rank constraints on M without requiring the solution of difficult constrained minimization problems. Although the objective optimized by our method is also not convex, we experimentally demonstrate that our solution converges consistently to better metrics than those computed via the application of PCA followed by subspace distance learning (see Section 4).\n\nWe minimize \u03b5(L) using gradient-based optimizers, such as conjugate gradient methods. 
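As an illustrative aside, the objective of Equation 1 and a single gradient step can be sketched in a few lines of NumPy. This is a sketch under our own naming assumptions (the names `eta`, `c`, the target-neighbor construction, and the toy data are not from the paper), not the authors' code:

```python
# Sketch of the LMCA objective (Eq. 1) and its gradient; illustrative only.
import numpy as np

def lmca_objective_and_grad(L, X, y, eta, c=1.0):
    """L: d x D projection, X: n x D inputs, y: n labels,
    eta[i, j] = 1 iff x_j is one of the k same-label neighbors of x_i."""
    n = X.shape[0]
    obj, grad = 0.0, np.zeros_like(L)
    for i in range(n):
        for j in np.flatnonzero(eta[i]):
            d_ij = X[i] - X[j]
            pull = L @ d_ij
            obj += pull @ pull                       # first (pull) term
            grad += 2.0 * np.outer(pull, d_ij)
            for l in np.flatnonzero(y != y[i]):      # (1 - y_il) = 1
                d_il = X[i] - X[l]
                push = L @ d_il
                s = pull @ pull - push @ push + 1.0  # hinge argument
                if s > 0.0:                          # h(s) = s, h'(s) = 1
                    obj += c * s
                    grad += 2.0 * c * (np.outer(pull, d_ij) - np.outer(push, d_il))
    return obj, grad

# Toy check: a small gradient step decreases the objective.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
eta = np.zeros((8, 8))
for i in range(8):                                   # target neighbor: nearest same-label point
    same = [j for j in range(8) if j != i and y[j] == y[i]]
    eta[i, min(same, key=lambda j: ((X[i] - X[j]) ** 2).sum())] = 1.0
L = rng.normal(size=(2, 5))                          # d = 2, D = 5
obj0, g = lmca_objective_and_grad(L, X, y, eta)
obj1, _ = lmca_objective_and_grad(L - 1e-4 * g, X, y, eta)
```

The plain `max(s, 0)` hinge is used here for brevity; the paper replaces it with a smooth hinge for differentiability.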
Differentiating \u03b5(L) with respect to the transformation matrix L gives the following gradient for the update rule:\n\n\u2202\u03b5(L)/\u2202L = 2L \u03a3ij \u03b7ij (xi \u2212 xj)(xi \u2212 xj)^T + 2cL \u03a3ijl \u03b7ij (1 \u2212 yil) h\u2032(||L(xi \u2212 xj)||\u00b2 \u2212 ||L(xi \u2212 xl)||\u00b2 + 1) [(xi \u2212 xj)(xi \u2212 xj)^T \u2212 (xi \u2212 xl)(xi \u2212 xl)^T]   (2)\n\nWe handle the non-differentiability of h(s) at s = 0 by adopting a smooth hinge function as in [8].\n\n3 Nonlinear Feature Extraction for Large Margin kNN Classification\n\nIn the previous section we have described an algorithm that jointly solves for linear dimensionality reduction and metric learning. We now describe how to \u201ckernelize\u201d this method in order to compute non-linear features of the inputs that optimize our distance learning objective. Our approach learns a low-rank Mahalanobis distance metric in a high-dimensional feature space F, related to the inputs by a nonlinear map \u03c6 : \u211cD \u2192 F. We restrict our analysis to nonlinear maps \u03c6 for which there exist kernel functions k that can be used to compute the feature inner products without carrying out the map, i.e. such that k(xi, xj) = \u03c6i^T \u03c6j, where for brevity we denote \u03c6i = \u03c6(xi).\n\nWe modify our objective \u03b5(L) by substituting inputs xi with features \u03c6(xi) into Equation 1. L is now a transformation from the space F into a low-dimensional space \u211cd. We seek the transformation L minimizing the modified objective function \u03b5(L). The gradient in feature space can now be written as:\n\n\u2202\u03b5(L)/\u2202L = 2 \u03a3ij \u03b7ij L(\u03c6i \u2212 \u03c6j)(\u03c6i \u2212 \u03c6j)^T + 2c \u03a3ijl \u03b7ij (1 \u2212 yil) h\u2032(sijl) L[(\u03c6i \u2212 \u03c6j)(\u03c6i \u2212 \u03c6j)^T \u2212 (\u03c6i \u2212 \u03c6l)(\u03c6i \u2212 \u03c6l)^T]   (3)\n\nwhere sijl = ||L(\u03c6i \u2212 \u03c6j)||\u00b2 \u2212 ||L(\u03c6i \u2212 \u03c6l)||\u00b2 + 1.\n\nLet \u03a6 = [\u03c61, ..., \u03c6n]^T. We consider parameterizations of L of the form L = \u03a9\u03a6, where \u03a9 is some matrix allowing us to write L as a linear combination of the feature points. This form of nonlinear map is analogous to that used in kernel-PCA and it allows us to parameterize the transformation L in terms of only d \u00b7 n parameters, the entries of the matrix \u03a9. We now introduce the following Lemma, which we will later use to derive an iterative update rule for L.\n\nLemma 3.1 The gradient in feature space can be computed as \u2202\u03b5(L)/\u2202L = \u0393\u03a6, where \u0393 depends on the features \u03c6i solely in terms of dot products (\u03c6i^T \u03c6j).\n\nProof Defining ki = \u03a6\u03c6i = [k(x1, xi), ..., k(xn, xi)]^T, non-linear feature projections can be computed as L\u03c6i = \u03a9\u03a6\u03c6i = \u03a9ki. 
From this we derive:\n\n\u2202\u03b5(L)/\u2202L = 2\u03a9 \u03a3ij \u03b7ij (ki \u2212 kj)(\u03c6i \u2212 \u03c6j)^T + 2c\u03a9 \u03a3ijl \u03b7ij (1 \u2212 yil) h\u2032(sijl) [(ki \u2212 kj)(\u03c6i \u2212 \u03c6j)^T \u2212 (ki \u2212 kl)(\u03c6i \u2212 \u03c6l)^T]\n= 2\u03a9 \u03a3ij \u03b7ij [E_i^(ki\u2212kj) \u2212 E_j^(ki\u2212kj)] \u03a6 + 2c\u03a9 \u03a3ijl \u03b7ij (1 \u2212 yil) h\u2032(sijl) [E_i^(ki\u2212kj) \u2212 E_j^(ki\u2212kj) \u2212 E_i^(ki\u2212kl) + E_l^(ki\u2212kl)] \u03a6\n\nwhere E_i^v = [0, ..., v, ..., 0] is the n \u00d7 n matrix having vector v in the i-th column and all 0 in the other columns. Setting\n\n\u0393 = 2\u03a9 \u03a3ij \u03b7ij [E_i^(ki\u2212kj) \u2212 E_j^(ki\u2212kj)] + 2c\u03a9 \u03a3ijl \u03b7ij (1 \u2212 yil) h\u2032(sijl) [E_i^(ki\u2212kj) \u2212 E_j^(ki\u2212kj) \u2212 E_i^(ki\u2212kl) + E_l^(ki\u2212kl)]   (4)\n\nproves the Lemma.\n\nThis result allows us to implicitly solve for the transformation without ever computing the features in the high-dimensional space F: the key idea is to iteratively update \u03a9 rather than L. For example, using gradient descent as optimization we derive the update rule:\n\nLnew = Lold \u2212 \u03bb \u2202\u03b5(L)/\u2202L |L=Lold = [\u03a9old \u2212 \u03bb\u0393old] \u03a6 = \u03a9new\u03a6   (5)\n\nwhere \u03bb is the learning rate. We carry out this optimization by iterating the update \u03a9 \u2190 (\u03a9 \u2212 \u03bb\u0393) until convergence. For classification, we project points onto the learned low-dimensional space by exploiting the kernel trick: L\u03c6q = \u03a9kq.\n\n4 Experimental results\n\nWe compared our methods to the metric learning algorithm of Weinberger et al. [9], which we will refer to as LMNN (Large Margin Nearest Neighbor). We use KLMCA (kernel-LMCA) to denote the nonlinear version of our algorithm. 
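For reference, the kernelized iteration of Section 3 (the update \u03a9 \u2190 \u03a9 \u2212 \u03bb\u0393 of Equation 5, with test points projected as \u03a9kq) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: using ki = K ei, we collect \u0393 compactly as 2\u03a9K(Q + cR), where Q and R accumulate the (ei \u2212 ej)(ei \u2212 ej)^T patterns of Equation 4; the names `Q`, `R`, `lam`, and the toy data are our own.

```python
# Sketch of one KLMCA update step; illustrative only.
import numpy as np

def rbf_kernel(X, sigma=1.0):
    # Gaussian RBF kernel matrix, as used in the paper's experiments.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def klmca_step(Omega, K, y, eta, c=1.0, lam=1e-3):
    n = K.shape[0]
    P = Omega @ K                        # columns are projections Omega k_i
    Q = np.zeros((n, n))
    R = np.zeros((n, n))
    for i in range(n):
        for j in np.flatnonzero(eta[i]):
            Q[i, i] += 1; Q[j, j] += 1; Q[i, j] -= 1; Q[j, i] -= 1
            for l in np.flatnonzero(y != y[i]):          # (1 - y_il) = 1
                s = ((P[:, i] - P[:, j]) ** 2).sum() \
                    - ((P[:, i] - P[:, l]) ** 2).sum() + 1.0
                if s > 0.0:                              # h'(s) = 1
                    R[j, j] += 1; R[i, j] -= 1; R[j, i] -= 1
                    R[l, l] -= 1; R[i, l] += 1; R[l, i] += 1
    Gamma = 2.0 * Omega @ K @ (Q + c * R)
    return Omega - lam * Gamma

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 4))
y = np.array([0] * 5 + [1] * 5)
K = rbf_kernel(X)
eta = np.zeros((10, 10))
for i in range(10):                      # target neighbor: nearest same-label point
    same = [j for j in range(10) if j != i and y[j] == y[i]]
    eta[i, min(same, key=lambda j: ((X[i] - X[j]) ** 2).sum())] = 1.0
Omega = 0.01 * rng.normal(size=(3, 10))  # d = 3 nonlinear features
Omega2 = klmca_step(Omega, K, y, eta)
z_query = Omega2 @ K[:, 0]               # kernel-trick projection L*phi_q = Omega*k_q
```

Note that the features \u03c6i never appear explicitly: every quantity is expressed through the kernel matrix K, exactly as Lemma 3.1 promises.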
In all of the experiments reported here, LMCA was initialized using PCA, while KLMCA used the transformation computed by kernel-PCA as initial guess. The objectives of LMCA and KLMCA were optimized using the steepest descent algorithm. We experimented with more sophisticated minimization techniques, including the conjugate gradient method and the Broyden-Fletcher-Goldfarb-Shanno quasi-Newton algorithm [6], but no substantial improvement in performance or speed of convergence was achieved. The KLMCA algorithm was implemented using a Gaussian RBF kernel. The number of nearest neighbors, the weight c in Equation 1, and the variance of the RBF kernel were all automatically tuned using cross-validation.\n\nFigure 1: Classification error rates on the high-dimensional datasets Isolet, AT&T Faces and StarPlus fMRI for different projection dimensions. (a) Training error. (b) Testing error.\n\nThe first part of our experimental evaluation focuses on classification results on datasets with high dimensionality, Isolet, AT&T Faces, and StarPlus fMRI:\n\n\u2022 Isolet\u00b9 is a dataset of speech features from the UC Irvine repository, consisting of 6238 training examples and 1559 testing examples with 617 attributes. There are 26 classes corresponding to the spoken letters to be recognized.\n\n\u2022 The AT&T Faces\u00b2 database contains 10 grayscale face images of each of 40 distinct subjects. The images were taken at different times, with varying illumination, facial expressions and poses. As in [9], we downsampled the original 112 \u00d7 92 images to size 38 \u00d7 31, corresponding to 1178 input dimensions.\n\n\u2022 The StarPlus fMRI\u00b3 dataset contains fMRI sequences acquired in the context of a cognitive experiment. In these trials the subject is shown for a few seconds either a picture or a sentence describing a picture. The goal is to recognize the viewing activity of the subject from the fMRI images. We reduce the size of the data by considering only voxels corresponding to relevant areas of the brain cortex and by averaging the activity in each voxel over the period of the stimulus. This yields data of size 1715 for subject \u201c04847,\u201d to which our analysis was restricted. A total of 80 trials are available for this subject.\n\nExcept for Isolet, for which a separate testing set is specified, we computed all of the experimental results by averaging over 100 runs of random splitting of the examples into training and testing sets. For the fMRI experiment we used at each iteration 70% of the data for training and 30% for testing. For AT&T Faces, training sets were selected by sampling 7 images at random for each person. 
The remaining 3 images of each individual were used for testing.\n\nUnlike LMCA and KLMCA, which directly solve for low-dimensional embeddings of the input data, LMNN cannot be run on datasets of dimensionalities such as those considered here and must be trained on lower-dimensional representations of the inputs. As in [9], we applied the LMNN algorithm to linear projections of the data computed using PCA. Figure 1 summarizes the training and testing performance of kNN classification using the metrics learned by the three algorithms for different subspace dimensions. LMCA and KLMCA give considerably better classification accuracy than LMNN on all datasets, with the kernelized version of our algorithm always outperforming the linear version. The difference in accuracy between our algorithms and LMNN is particularly dramatic when a small number of projection dimensions is used. In such cases, LMNN is unable to find good metrics in the low-dimensional subspace computed by PCA. By contrast, LMCA and KLMCA solve for the low-dimensional subspace that optimizes the classification-related objective of Equation 1, and therefore achieve good performance even when projecting to very low dimensions. In our experiments we found that all three classification algorithms (LMNN, LMCA+kNN, and KLMCA+kNN) performed considerably better than kNN using the Euclidean metric in the PCA and KPCA subspaces. For example, using d = 10 on the AT&T dataset, kNN gives a 10.9% testing error rate when used on the PCA features, and a 9.7% testing error rate when applied to the nonlinear features computed by KPCA.\n\n\u00b9Available at http://www.ics.uci.edu/~mlearn/MLRepository.html\n\u00b2Available at http://www.cl.cam.ac.uk/Research/DTG/attarchive/facedatabase.html\n\u00b3Available at http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-81/www/\n\nFigure 2: Image reconstruction from PCA and LMCA features. (a) Input images. (b) Reconstructions using PCA (left) and LMCA (right). (c) Absolute difference between original images and reconstructions from features for PCA (left) and LMCA (right). Red denotes large differences, blue indicates similar grayvalues. LMCA learns invariance to effects that are irrelevant for classification: non-uniform illumination, facial expressions, and glasses (the training data contains images with and without glasses for the same individuals).\n\nWhile LMNN is applied to features in a low-dimensional space, LMCA and KLMCA learn a low-rank metric directly from the high-dimensional inputs. Consequently the computational complexity of our algorithms is higher than that of LMNN. However, we have found that LMCA and KLMCA converge to a minimum quite rapidly, typically within 20 iterations, and thus the complexity of these algorithms has not been a limiting factor even when applied to very high-dimensional datasets. As a reference, using d = 10 and K = 3 on the AT&T dataset, LMNN learns a metric in about 5 seconds, while LMCA and KLMCA converge to a minimum in 21 and 24 seconds, respectively.\n\nIt is instructive to look at the preimages of LMCA data embeddings. Figure 2 shows comparative reconstructions of images obtained from PCA and LMCA features by inverting their linear mappings. The PCA and LMCA subspaces in this experiment were computed from cropped face images of size 50 \u00d7 50 pixels, taken from a set of consumer photographs. The dataset contains 2459 face images corresponding to 152 distinct individuals. A total of d = 125 components were used. The subjects shown in Figure 2 were not included in the training set. 
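Reconstructions like those in Figure 2 invert the learned linear mapping. For a nonsquare L (d \u00d7 D) one convenient choice of inverse, sketched below, is the Moore-Penrose pseudoinverse; this is our assumption for illustration, since the paper does not state which inversion method it uses, and the data here is synthetic:

```python
# Sketch of reconstructing an input from its d-dimensional linear features.
import numpy as np

rng = np.random.default_rng(2)
D, d = 400, 25                       # e.g. 20x20 image pixels -> 25 components
L = rng.normal(size=(d, D)) / np.sqrt(D)
x = rng.normal(size=D)               # a synthetic vectorized image
z = L @ x                            # low-dimensional features
x_hat = np.linalg.pinv(L) @ z        # minimum-norm preimage of the features
```

The preimage `x_hat` retains only the components of `x` that L preserves; for LMCA these are the class-discriminative directions, which is why nuisance effects fade from the reconstructions.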
For a given target dimensionality, PCA has the property of computing the linear transformation minimizing the reconstruction error under the L2 norm. Unsurprisingly, the PCA face reconstructions are extremely faithful reproductions of the original images. However, PCA also accurately reconstructs visual effects, such as lighting variations and changes in facial expression, that are unimportant for the task of face verification and that might potentially hamper recognition. By contrast, LMCA seeks a subspace where neighboring examples belong to the same class and differently labeled points are separated by a large margin. As a result, LMCA does not encode effects that are found to be insignificant for classification or that vary greatly among examples of the same class. For the case of face verification, LMCA de-emphasizes changes in illumination, the presence or absence of glasses, and smiling expressions (Figure 2).\n\nWhen the input data does not require dimensionality reduction, LMNN and LMCA solve the same optimization problem, but LMNN should be preferred over LMCA in light of its guarantees of convergence to the global minimum of the objective. However, even in such cases, KLMCA can be used in lieu of LMNN in order to extract nonlinear features from the inputs. We have evaluated this use of KLMCA on the following low-dimensional datasets from the UCI repository: Bal, Wine, Iris, and Ionosphere. All of these datasets, except Ionosphere, have been previously used in [9] to assess the performance of LMNN. The dimensionality of the data in these sets ranges from 4 to 34. 
Figure 3: kNN classification accuracy on low-dimensional datasets: Bal, Wine, Iris, and Ionosphere. (a) Training error. (b) Testing error. Algorithms are kNN using Euclidean distance, LMNN [9], kNN in the nonlinear feature space computed by our KLMCA algorithm, and multiclass SVM.\n\nIn order to compare LMNN with KLMCA under identical conditions, KLMCA was restricted to compute a number of features equal to the input dimensionality, although in our experience using additional nonlinear features often results in better classification performance. Figure 3 summarizes the results of this comparison. Again, we averaged the errors over 100 runs with different 70/30 splits of the data for training and testing. On all datasets except Wine, for which the mapping to the high-dimensional space seems to hurt performance (note also the high error rate of SVM), KLMCA gives better classification accuracy than LMNN. Note also that the error rates of KLMCA are consistently lower than those reported in [9] for SVM under identical training and testing conditions.\n\n5 Relationship to other methods\n\nOur method is most similar to the work of Weinberger et al. [9]. 
Our approach is different in focus\nas it speci\ufb01cally addresses the problem of kNN classi\ufb01cation of very high-dimensional data. The\nnovelty of our method lies in an optimization that solves for data reduction and metric learning\nsimultaneously. Additionally, while [9] is limited to learning a global linear transformation of the\ninputs, we describe a kernelized version of our method that extracts non-linear features of the inputs.\nWe demonstrate that this representation leads to signi\ufb01cant improvements in kNN classi\ufb01cation both\non high-dimensional as well as on low-dimensional data. Our approach bears similarities with Lin-\near Discriminant Analysis (LDA) [2], as both techniques solve for a low-rank Mahalanobis distance\nmetric. However, LDA relies on the assumption that the class distributions are Gaussian and have\nidentical covariance. These conditions are almost always violated in practice. Like our method,\nthe Neighborhood Component Analysis (NCA) algorithm by Goldberger et al. [4] learns a low-\ndimensional embedding of the data for kNN classi\ufb01cation using a direct gradient-based approach.\nNCA and our method differ in the de\ufb01nition of the objective function. Moreover, unlike our method,\nNCA provides purely linear embeddings of the data. A contrastive loss function analogous to the\none used in this paper is adopted in [1] for training a similarity metric. A siamese architecture con-\nsisting of identical convolutional networks is used to parameterize and train the metric. In our work\nthe metric is parameterized by arbitrary nonlinear maps for which kernel functions exist. Recent\nwork by Globerson and Roweis [3] also proposes a technique for learning low-rank Mahalanobis\nmetrics. Their method includes an extension for computing low-dimensional non-linear features us-\ning the kernel trick. 
However, this approach computes dimensionality reductions through a two-step solution which involves first solving for a possibly full-rank metric and then estimating the low-rank approximation via spectral decomposition. Besides being suboptimal, this approach is impractical for classification problems with high-dimensional data, as it requires solving for a number of unknowns that is quadratic in the number of input dimensions. Furthermore, the metric is trained with the aim of collapsing all examples in the same class to a single point. This task is difficult to achieve and not strictly necessary for good kNN classification performance. The Support Vector Decomposition Machine (SVDM) [7] is also similar in spirit to our approach. SVDM optimizes an objective that is a combination of dimensionality reduction and classification. Specifically, a linear mapping from input to feature space and a linear classifier applied to feature space are trained simultaneously. As in our work, results in their paper demonstrate that this joint optimization yields better accuracy than that achieved by learning a low-dimensional representation and a classifier separately. Unlike our method, which can be applied without any modification to classification problems with more than two classes, SVDM is formulated for binary classification only.\n\n6 Discussion\n\nWe have presented a novel algorithm that simultaneously optimizes the objectives of dimensionality reduction and metric learning. Our algorithm seeks, among all possible low-dimensional projections, the one that best satisfies a large margin metric objective. Our approach contrasts with techniques that are unable to learn metrics in high dimensions and must rely on dimensionality reduction methods being first applied to the data. 
Although our optimization is not convex, we have experimentally demonstrated that the metrics learned by our solution are consistently superior to those computed by globally-optimal methods forced to search in a low-dimensional subspace.\n\nThe nonlinear version of our technique requires us to compute the kernel distance of a query point to all training examples. Future research will focus on rendering this algorithm \u201csparse.\u201d In addition, we will investigate methods to further reduce overfitting when learning dimensionality reduction from very high dimensions.\n\nAcknowledgments\n\nWe are grateful to Drago Anguelov and Burak Gokturk for discussion. We thank Aaron Hertzmann and the anonymous reviewers for their comments.\n\nReferences\n\n[1] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2005.\n\n[2] R. A. Fisher. The use of multiple measurements in taxonomic problems. Ann. Eugenics, 7:179\u2013188, 1936.\n\n[3] A. Globerson and S. Roweis. Metric learning by collapsing classes. In Y. Weiss, B. Sch\u00f6lkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18. MIT Press, Cambridge, MA, 2006.\n\n[4] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, 2005.\n\n[5] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 18:607\u2013616, 1996.\n\n[6] A. Mordecai. Nonlinear Programming: Analysis and Methods. Dover Publishing, 2003.\n\n[7] F. Pereira and G. Gordon. The support vector decomposition machine. In Proceedings of the International Conference on Machine Learning (ICML), 2006.\n\n[8] J. D. M. Rennie and N. Srebro. 
Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd International Conference on Machine Learning (ICML), 2005.\n\n[9] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In Y. Weiss, B. Sch\u00f6lkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, 2006.\n\n[10] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, 2002.\n", "award": [], "sourceid": 3088, "authors": [{"given_name": "Lorenzo", "family_name": "Torresani", "institution": null}, {"given_name": "Kuang-chih", "family_name": "Lee", "institution": null}]}