{"title": "Regressive Virtual Metric Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1810, "page_last": 1818, "abstract": "We are interested in supervised metric learning of Mahalanobis like distances. Existing approaches mainly focus on learning a new distance using similarity and dissimilarity constraints between examples. In this paper, instead of bringing closer examples of the same class and pushing far away examples of different classes we propose to move the examples with respect to virtual points. Hence, each example is brought closer to a a priori defined virtual point reducing the number of constraints to satisfy. We show that our approach admits a closed form solution which can be kernelized. We provide a theoretical analysis showing the consistency of the approach and establishing some links with other classical metric learning methods. Furthermore we propose an efficient solution to the difficult problem of selecting virtual points based in part on recent works in optimal transport. Lastly, we evaluate our approach on several state of the art datasets.", "full_text": "Regressive Virtual Metric Learning\n\nMicha\u00a8el Perrot, and Amaury Habrard\n\nUniversit\u00b4e de Lyon, Universit\u00b4e Jean Monnet de Saint-Etienne,\n\nLaboratoire Hubert Curien, CNRS, UMR5516, F-42000, Saint-Etienne, France.\n\n{michael.perrot,amaury.habrard}@univ-st-etienne.fr\n\nAbstract\n\nWe are interested in supervised metric learning of Mahalanobis like distances.\nExisting approaches mainly focus on learning a new distance using similarity\nand dissimilarity constraints between examples. In this paper, instead of bring-\ning closer examples of the same class and pushing far away examples of different\nclasses we propose to move the examples with respect to virtual points. Hence,\neach example is brought closer to a a priori de\ufb01ned virtual point reducing the\nnumber of constraints to satisfy. 
We show that our approach admits a closed form\nsolution which can be kernelized. We provide a theoretical analysis showing the\nconsistency of the approach and establishing some links with other classical met-\nric learning methods. Furthermore we propose an ef\ufb01cient solution to the dif\ufb01cult\nproblem of selecting virtual points based in part on recent works in optimal trans-\nport. Lastly, we evaluate our approach on several state of the art datasets.\n\n1\n\nIntroduction\n\nThe goal of a metric learning algorithm is to capture the idiosyncrasies in the data mainly by de\ufb01ning\na new space of representation where some semantic constraints between examples are ful\ufb01lled. In\nthe previous years the main focus of metric learning algorithms has been to learn Mahalanobis like\n\ndistances of the form dM(x, x(cid:48)) = (cid:112)(x \u2212 x(cid:48))T M(x \u2212 x(cid:48)) where M is a positive semi-de\ufb01nite\n\nmatrix (PSD) de\ufb01ning a set of parameters1. Using a Cholesky decomposition M = LLT , one can\nsee that this is equivalent to learn a linear transformation from the input space.\nMost of the existing approaches in metric learning use constraints of type must-link and cannot-link\nbetween learning examples [1, 2]. For example, in a supervised classi\ufb01cation task, the goal is to\nbring closer examples of the same class and to push far away examples of different classes. The idea\nis that the learned metric should affect a high value to dissimilar examples and a low value to similar\nexamples. Then, this new distance can be used in a classi\ufb01cation algorithm like a nearest neighbor\nclassi\ufb01er. Note that in this case the set of constraints is quadratic in the number of examples which\ncan be prohibitive when the number of examples increases. One heuristic is then to select only\na subset of the constraints but selecting such a subset is not trivial. 
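Since M = LL^T, the Mahalanobis distance induced by M is simply the Euclidean distance after the linear map L^T. A minimal numpy sketch (toy data, for illustration only) makes this equivalence explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# A random linear map L and the induced PSD matrix M = L L^T.
L = rng.normal(size=(d, d))
M = L @ L.T

x, x2 = rng.normal(size=d), rng.normal(size=d)

# Mahalanobis distance computed with M ...
diff = x - x2
d_M = np.sqrt(diff @ M @ diff)

# ... equals the Euclidean distance between the projected points L^T x and L^T x'.
d_L = np.linalg.norm(L.T @ x - L.T @ x2)

assert np.isclose(d_M, d_L)
```

This is why learning L directly (as done later in the paper) is equivalent to learning a metric M, while guaranteeing positive semi-definiteness by construction.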
In this paper, we propose to\nconsider a new kind of constraints where each example is associated with an a priori de\ufb01ned virtual\npoint. It allows us to consider the metric learning problem as a simple regression where we try\nto minimize the differences between learning examples and virtual points. Fig. 1 illustrates the\ndifferences between our approach and a classical metric learning approach. It can be noticed that\nour algorithm only uses a linear number of constraints. However de\ufb01ning these constraints by hand\ncan be tedious and dif\ufb01cult. To overcome this problem, we present two approaches to automatically\nde\ufb01ne them. The \ufb01rst one is based on some recent advances in the \ufb01eld of Optimal Transport while\nthe second one uses a class-based representation space.\n\n1When M = I, the identity matrix, it corresponds to the Euclidean distance.\n\n1\n\n\f(a) Classical must-link cannot-link approach.\n\n(b) Our virtual point-based regression formulation.\n\nFigure 1: Arrows denote the constraints used by each approach for one particular example in a\nbinary classi\ufb01cation task. The classical metric learning approach in Fig. 1(a) uses O(n2) constraints\nbringing closer examples of the same class and pushing far away examples of different classes. On\nthe contrary, our approach presented in Fig. 1(b) moves the examples to the neighborhood of their\ncorresponding virtual point, in black, using only O(n) constraints. ( Best viewed in color )\n\nMoreover, thanks to its regression-based formulation, our approach can be easily kernelized allowing\nus to deal ef\ufb01ciently with non linear transformations which is a nice advantage in comparison to\nsome metric learning methods. We also provide a theoretical analysis showing the consistency of\nour approach and establishing some relationships with a classical metric learning formulation.\nThis paper is organized as follows. In Section 2 we identify several related works. 
Then in Section 3\nwe present our approach, provide some theoretical results and give two solutions to generate the\nvirtual points. Section 4 is dedicated to an empirical evaluation of our method on several widely\nused datasets. Finally, we conclude in Section 5.\n\n2 Related work\n\nFor up-to-date surveys on metric learning see [3] and [4]. In this section we focus on algorithms\nwhich are more closely related to our approach. First of all, one of the most famous approach\nin metric learning is LMNN [5] where the authors propose to learn a PSD matrix to improve the\nk-nearest-neighbours algorithm. In their work, instead of considering pairs of examples, they use\ntriplets (xi, xj, xk) where xj and xk are in the neighborhood of xi and such that xi and xj are of\nthe same class and xk is of a different class. The idea is then to bring closer xi and xj while pushing\nxk far away. Hence, if the number of constraints seems to be cubic, the authors propose to only\nconsider triplets of examples which are already close to each other. In contrast, the idea presented in\n[6] is to collapse all the examples of the same class in a single point and to push in\ufb01nitely far away\nexamples of different classes. The authors de\ufb01ne a measure to estimate the probability of having an\nexample xj given an example xi with respect to a learned PSD matrix M. Then, they minimize,\nw.r.t. M, the KL divergence between this measure and the best case where the probability is 1 if the\ntwo examples are of the same class and 0 otherwise. It can be seen as collapsing all the examples\nof the same class on an implicit virtual point. In this paper we use several explicit virtual points and\nwe collapse the examples on these points with respect to their classes and their distances to them.\nA recurring issue in Mahalanobis like metric learning is to ful\ufb01ll the PSD constraint on the learned\nmetric. 
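For reference, the standard projection of a symmetric matrix onto the PSD cone clips its negative eigenvalues, which requires a full eigendecomposition. A generic numpy sketch (not part of the paper's method):

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by zeroing out
    its negative eigenvalues (costs a full O(d^3) eigendecomposition)."""
    w, V = np.linalg.eigh(M)
    return (V * np.maximum(w, 0.0)) @ V.T  # V diag(max(w, 0)) V^T

A = np.array([[2.0, 0.0], [0.0, -1.0]])  # indefinite matrix
P = project_psd(A)
assert np.all(np.linalg.eigvalsh(P) >= -1e-12)
```

Repeating this projection at every gradient step is what makes many Mahalanobis metric learning solvers expensive, which motivates the alternatives discussed next.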
Indeed, projecting a matrix on the PSD cone is not trivial and generally requires a costly\neigenvalues decomposition. To address this problem, in ITML [1] the authors propose to use a\nLogDet divergence as the regularization term. The idea is to learn a matrix which is close to an a\npriori de\ufb01ned PSD matrix. The authors then show that if the divergence is \ufb01nite, then the learned\nmatrix is guaranteed to be PSD. Another approach, as proposed in [2], is to learn a matrix L such\nthat M = LLT , i.e. instead of learning the metric the authors propose to learn the projection. The\nmain drawback is the fact that most of the time the resulting optimization problem is not convex\n[3, 4, 7] and is thus harder to optimize. In this paper, we are also interested in learning L directly.\nHowever, because we are using constraints between examples and virtual points, we obtain a convex\nproblem with a closed form solution allowing us to learn the metric in an ef\ufb01cient way.\nThe problem of learning a metric such that the induced space is not linearly dependent of the input\nspace has been addressed in several works before. First, it is possible to directly learn an intrinsically\nnon linear metric as in \u03c72-LMNN [8] where the authors propose to learn a \u03c72 distance rather than a\nMahalanobis distance. This distance is particularly relevant for histograms comparisons. Note that\nthis kind of approaches is close to the kernel learning problem which is beyond the scope of this\nwork. Second, another solution used by local metric learning methods is to split the input space\n\n2\n\n\fin several regions and to learn a metric in each region to introduce some non linearity, as in MM-\nLMNN [7]. Similarly, in GB-LMNN [8] the authors propose to locally re\ufb01ne the metric learned\nby LMNN by successively splitting the input space. A third kind of approach tries to project the\nlearning examples in a new space which is non linearly dependent of the input space. 
It can be done\nin two ways, either by projecting a priori the learning examples in a new space with a KPCA [9]\nor by rewriting the optimization problem in a kernelized form [1]. The \ufb01rst approach allows one to\ninclude non linearity in most of the metric learning algorithms but imposes to select the interesting\nfeatures beforehand. The second method can be dif\ufb01cult to use as rewriting the optimization problem\nis most of the times non trivial [4]. Indeed, if one wants to use the kernel trick it implies that the\naccess to the learning examples should only be done through dot products which is dif\ufb01cult when\nworking with pairs of examples as it is the case in metric learning. In this paper we show that using\nvirtual points chosen in a given target space allows us to kernelize our approach easily and thus to\nwork in a very high dimensional space without using an explicit projection thanks to the kernel trick.\nOur method is based on a regression and can thus be linked, in its kernelized form, to several ap-\nproaches in kernelized regression for structured output [10, 11, 12]. The idea behind these ap-\nproaches is to minimize the difference between input examples and output examples using kernels,\ni.e. working in a high dimensional space. In our case, the learning examples can be seen as input\nexamples and the virtual points as output examples. However, we only project the learning examples\nin a high dimensional space, the virtual points already belong to the output space. Hence, we do not\nhave the pre-image problem [12]. Furthermore, our goal is not to predict a virtual point but to learn\na metric between examples and thus, after the learning step, the virtual points are discarded.\n\n3 Contributions\n\nThe main idea behind our algorithm is to bring closer the learning examples to a set of virtual points.\nWe present this idea in three subsections. 
First we assume that we have access to a set of n learning pairs (x, v), where x is a learning example and v is a virtual point associated to x, and we present both the linear and kernelized formulations of our approach, called RVML. It boils down to solving a regression in closed form, the main originality being the introduction of virtual points. In the second subsection, we show that it is possible to theoretically link our approach to a classical metric learning one based on [13]. In the last subsection, we propose two automatic methods to generate the virtual points and to associate them with the learning examples.

3.1 Regressive Virtual Metric Learning (RVML)

Given a probability distribution D defined over X x Y where X is a subset of R^d and Y is a finite label set, let S = {(x_i, y_i)}_{i=1}^n be a set of examples drawn i.i.d. from D. Let f_v : X x Y -> V, where V is a subset of R^{d'}, be the function which associates each example to a virtual point. We consider the learning set S_v = {(x_i, v_i)}_{i=1}^n where v_i = f_v(x_i, y_i). For the sake of simplicity, denote by X = (x_1, ..., x_n)^T and V = (v_1, ..., v_n)^T the matrices containing respectively one example and the associated virtual point on each line. In this section, we consider that the function f_v is known; we come back to its definition in Section 3.3. Let ||.||_F be the Frobenius norm and ||.||_2 be the l2 vector norm. Our goal is to learn a matrix L such that M = LL^T, and for this purpose we consider the following optimization problem:

    min_L f(L, X, V) = min_L (1/n) ||XL - V||_F^2 + lambda ||L||_F^2 .    (1)

The idea is to learn a new space of representation where each example is close to its associated virtual point. Note that L is a d x d' matrix, so if d' < d we also perform dimensionality reduction.

Theorem 1. 
The optimal solution of Problem 1 can be found in closed form. Furthermore, we can derive two equivalent solutions:

    L = (X^T X + lambda n I)^{-1} X^T V    (2)

    L = X^T (X X^T + lambda n I)^{-1} V.    (3)

Proof. The proof of this theorem can be found in the supplementary material.

From Eq. 2 we deduce the matrix M:

    M = L L^T = (X^T X + lambda n I)^{-1} X^T V V^T X (X^T X + lambda n I)^{-1} .    (4)

Note that M is PSD by construction: x^T M x = x^T L L^T x = ||L^T x||_2^2 >= 0.

So far, we have focused on the linear setting. We now present a kernelized version, showing that it is possible to learn a metric in a very high dimensional space without an explicit projection. Let phi(x) be a projection function and K(x, x') = phi(x)^T phi(x') be its associated kernel. For the sake of readability, let K_X = phi(X) phi(X)^T where phi(X) = (phi(x_1), ..., phi(x_n))^T. Given the solution matrix L presented in Eq. 3, we have M = X^T (X X^T + lambda n I)^{-1} V V^T (X X^T + lambda n I)^{-1} X. Then M_K, the kernelized version of the matrix M, is defined such that:

    M_K = phi(X)^T (K_X + lambda n I)^{-1} V V^T (K_X + lambda n I)^{-1} phi(X).

The squared Mahalanobis distance can be written as d^2_M(x, x') = x^T M x + x'^T M x' - 2 x^T M x'. Thus we can obtain d^2_{M_K}(phi(x), phi(x')) = phi(x)^T M_K phi(x) + phi(x')^T M_K phi(x') - 2 phi(x)^T M_K phi(x'), the kernelized version, by considering that:

    phi(x)^T M_K phi(x) = phi(x)^T phi(X)^T (K_X + lambda n I)^{-1} V V^T (K_X + lambda n I)^{-1} phi(X) phi(x)
                        = K_X(x)^T (K_X + lambda n I)^{-1} V V^T (K_X + lambda n I)^{-1} K_X(x)

where K_X(x) = (K(x, x_1), ..., K(x, x_n))^T is the similarity vector to the examples w.r.t. 
K.

Note that it is also possible to obtain a kernelized version of L: L_K = phi(X)^T (K_X + lambda n I)^{-1} V. This result is close to a previous one already derived in [11] in a structured output setting, the main difference being that we do not use a kernel on the output (the virtual points here). Hence, it is possible to compute the projection of an example x of dimension d in a new space of dimension d':

    phi(x)^T L_K = phi(x)^T phi(X)^T (K_X + lambda n I)^{-1} V = K_X(x)^T (K_X + lambda n I)^{-1} V.

Recall that in this work we are interested in learning a distance between examples and not in the prediction of the virtual points, which only serve as a way to bring closer similar examples and push far away dissimilar examples.

From a complexity standpoint, assuming the kernel function is easy to compute, the main bottleneck when computing the solution in closed form is the inversion of an n x n matrix.

3.2 Theoretical Analysis

In this section, we propose to theoretically show the interest of our approach by proving (i) that it is consistent and (ii) that it is possible to link it to a more classical metric learning formulation.

3.2.1 Consistency

Let l(L, (x, v)) = ||x^T L - v^T||_2^2 be our loss and let D_v be the probability distribution over X x V such that p_{D_v}(x, v) = p_D(x, y | v = f_v(x, y)). Showing the consistency boils down to bounding with high probability the true risk, denoted by R(L), by the empirical risk, denoted by R_hat(L), such that:

    R(L) = E_{(x,v)~D_v} l(L, (x, v))  and  R_hat(L) = (1/n) sum_{(x,v) in S_v} l(L, (x, v)) = (1/n) ||XL - V||_F^2 .

The empirical risk corresponds to the error of the learned matrix L on the learning set S_v. The true risk is the error of L on the unknown distribution D_v. 
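To make these quantities concrete, here is a minimal numpy sketch (toy Gaussian data and one-hot virtual points, both assumptions for illustration; the one-hot choice anticipates the class-based selection of Section 3.3.2) of the two equivalent closed forms of Theorem 1 and the resulting empirical risk:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, dp, lam = 50, 10, 3, 0.1

X = rng.normal(size=(n, d))   # one example per row
y = rng.integers(0, dp, size=n)
V = np.eye(dp)[y]             # one-hot virtual points, one per row

# Theorem 1: the two equivalent closed forms of the minimizer of Problem (1).
L1 = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ V)  # Eq. (2)
L2 = X.T @ np.linalg.solve(X @ X.T + lam * n * np.eye(n), V)  # Eq. (3)
assert np.allclose(L1, L2)

# Empirical risk of Section 3.2.1: (1/n) ||X L - V||_F^2.
emp_risk = np.linalg.norm(X @ L1 - V, "fro") ** 2 / n
print(f"empirical risk: {emp_risk:.4f}")
```

Note that Eq. (3) only involves the n x n matrix X X^T, which is what makes the kernelization above possible; the empirical risk of the minimizer is necessarily at most that of L = 0, i.e. at most ||V||_F^2 / n.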
The consistency property ensures that, with a sufficient number of examples, a low empirical risk implies a low true risk with high probability. To show that our approach is consistent, we use the uniform stability framework [14].

Theorem 2. Let ||v||_2 <= C_v for any v in V and ||x||_2 <= C_x for any x in X. With probability 1 - delta, for any matrix L optimal solution of Problem 1, we have:

    R(L) <= R_hat(L) + (8 C_v^2 C_x^2 / (lambda n)) (1 + C_x / sqrt(lambda))^2
            + ((16 C_x^2 / lambda) + 1) C_v^2 (1 + C_x / sqrt(lambda))^2 sqrt(ln(1/delta) / (2n)) .

Proof. The proof of this theorem can be found in the supplementary material.

We obtain a rate of convergence in O(1/sqrt(n)), which is standard for this kind of bound.

3.2.2 Link with a Classical Metric Learning Formulation

In this section we show that it is possible to bound the true risk of a classical metric learning approach with the empirical risk of our formulation. Most classical metric learning approaches make use of a notion of margin between similar and dissimilar examples: similar examples have to be close to each other, i.e. at a distance smaller than a margin gamma_1, and dissimilar examples have to be far from each other, i.e. at a distance greater than a margin gamma_{-1}. Let (x_i, y_i) and (x_j, y_j) be two examples from X x Y. Using this notion of margin, we consider the following loss [13]:

    l(L, (x_i, y_i), (x_j, y_j)) = [ y_ij (d^2(L^T x_i, L^T x_j) - gamma_{y_ij}) ]_+    (5)

where y_ij = 1 if y_i = y_j and -1 otherwise, [z]_+ = max(0, z) is the hinge loss and gamma_{y_ij} is the desired margin between examples. As introduced before, we consider that gamma_{y_ij} takes a large value when the examples are dissimilar, i.e. 
when y_ij = -1, and a small value when the examples are similar, i.e. when y_ij = 1. In the following we show that, relating the notion of margin to the distances between virtual points, it is possible to bound the true risk associated with this loss by the empirical risk of our approach up to a constant.

Theorem 3. Let D be a distribution over X x Y. Let V, a finite subset of R^{d'}, be a set of virtual points and let f_v be defined as f_v(x_i, y_i) = v_i, v_i in V. Let ||v||_2 <= C_v for any v in V and ||x||_2 <= C_x for any x in X. Let gamma_1 = 2 max_{x_k, x_l, y_kl = 1} d^2(v_k, v_l) and gamma_{-1} = (1/2) min_{x_k, x_l, y_kl = -1} d^2(v_k, v_l). We have:

    E_{(x_i,y_i)~D, (x_j,y_j)~D} [ y_ij (d^2(L^T x_i, L^T x_j) - gamma_{y_ij}) ]
        <= 8 ( R_hat(L) + (8 C_v^2 C_x^2 / (lambda n)) (1 + C_x / sqrt(lambda))^2
               + ((16 C_x^2 / lambda) + 1) C_v^2 (1 + C_x / sqrt(lambda))^2 sqrt(ln(1/delta) / (2n)) ) .

Proof. The proof of this theorem can be found in the supplementary material.

In Theorem 3, we can notice that the margins are related to the distances between virtual points and correspond to the ideal margins, i.e. the margins that we would like to achieve after the learning step. Aside from this remark, we can define gamma_hat_1 and gamma_hat_{-1}, the observed margins obtained after the learning step: all the similar examples are in a sphere centered on their corresponding virtual point and of diameter gamma_hat_1 = 2 max_{(x,v)} ||x^T L - v^T||_2. Similarly, the distance between hyperspheres of dissimilar examples is gamma_hat_{-1} = min_{v, v', v != v'} ||v - v'||_2 - gamma_hat_1. As a consequence, even if we do not use cannot-link constraints, our algorithm is able to push reasonably far away dissimilar examples.

In the next subsection we present two different methods to select the virtual points.

3.3 Virtual Points Selection

Previously, we assumed to have access to the function f_v : X x Y -> V. In this subsection, we present two methods for generating automatically the set of virtual points and the mapping f_v.

3.3.1 Using Optimal Transport on the Learning Set

In this first approach, we propose to generate the virtual points by using a recent variation of the Optimal Transport (OT) problem [15] allowing one to transport some examples to new points corresponding to a linear combination of a set of known instances. These new points will actually correspond to our virtual points. Our approach works as follows. We begin by extracting a set of landmarks S' from the training set S. 
For this purpose, we use an adaptation of the landmark selection method proposed in [16], allowing us to take into account some diversity among the landmarks. To avoid fixing the number of landmarks in advance, we replace it by a simple heuristic: the number of landmarks must be greater than the number of classes, and the maximum distance between an example and a landmark must be lower than the mean of all pairwise distances from the training set, allowing us to have a fully automatic procedure. It is summarized in Algorithm 1.

Algorithm 1: Selecting S' from a set of examples S.
input : S = {(x_i, y_i)}_{i=1}^n a set of examples; Y the label set.
output: S' a subset of S
begin
    mu = mean of distances between all the examples of S
    x_max = argmax_{x in S} ||x - 0||_2
    S' = {x_max}; S = S \ S'
    epsilon = max_{x in S} min_{x' in S'} ||x - x'||_2
    while |S'| < |Y| or epsilon > mu do
        x_max = argmax_{x in S} sum_{x' in S'} ||x - x'||_2
        S' = S' union {x_max}; S = S \ S'
        epsilon = max_{x in S} min_{x' in S'} ||x - x'||_2

Then we compute an optimal transport from the training set S to the landmark set S'. For this purpose, we create a real matrix C of size |S| x |S'| giving the cost to transport one training instance to a landmark, such that C(i, j) = ||x_i - x'_j||_2 with x_i in S and x'_j in S'. The optimal transport is found by learning a matrix gamma in R^{|S| x |S'|} able to minimize the cost of moving training examples to the landmark points. Let S' be the matrix of landmark points (one per line); the transport w.r.t. 
gamma of any training instance (x_i, y_i) gives a new virtual point such that f_v(x_i, y_i) = gamma(i) S', gamma(i) denoting the i-th line of gamma. Note that this new virtual point is a linear combination of the landmark instances to which the example is transported. The set of virtual points is then defined by V = gamma S'. The virtual points are thus not defined a priori but are automatically learned by solving a problem of optimal transport. Note that this transportation mode is potentially non linear since there is no guarantee that there exists a matrix T such that V = XT. Our metric learning approach can, in this case, be seen as an approximation of the result given by the optimal transport.

To learn gamma, we use the following optimization problem proposed in [17]:

    arg min_gamma <gamma, C>_F - (1/lambda) h(gamma) + eta sum_j sum_c ||gamma(y_i = c, j)||_q^p

where h(gamma) = - sum_{i,j} gamma(i, j) log(gamma(i, j)) is the entropy of gamma, which allows one to solve the transportation problem efficiently with the Sinkhorn-Knopp algorithm [18]. The second regularization term, where gamma(y_i = c, j) corresponds to the lines of the j-th column of gamma where the class of the input is c, has been introduced in [17]. The goal of this term is to prevent input examples of different classes from moving toward the same output examples, by promoting group sparsity in the matrix gamma thanks to ||.||_q^p, corresponding to an l_q-norm to the power of p, used here with q = 1 and p = 1/2.

3.3.2 Using a Class-based Representation Space

For this second approach, we propose to define virtual points as the unit vectors of a space of dimension |Y|. Let e_j in R^{|Y|} be such a unit vector (1 <= j <= |Y|), i.e. 
a vector where all the\nattributes are 0 except for one attribute j which is set to 1- to which we associate a class label from\nY. Then, for any learning example (xi, yi), we de\ufb01ne fv(xi, yi) = e#yi where #yi = j if ej is\nmapped with the class yi. Thus, we have exactly |Y| virtual points, each one corresponding to a unit\nvector and a class label. We call this approach the class-based representation space method. If the\nnumber of classes is smaller than the number of dimensions used to represent the learning examples,\nthen our method will perform dimensionality reduction for free. Furthermore, our approach will try\nto project all the examples of one class on the same axis while examples of other classes will tend\nto be projected on different axes. The underlying intuition behind the new space de\ufb01ned by L is to\nmake each attribute discriminant for one class.\n\n6\n\n\fTable 1: Comparison of our approach with several baselines in the linear setting.\n\nBase\n\nAmazon\nBreast\nCaltech\nDSLR\n\nIonosphere\n\nIsolet\nLetters\nPima\nScale\nSplice\n\nSvmguide1\n\nWine\n\nWebcam\n\nmean\n\n1NN\n\n41.51 \u00b1 3.24\n95.49 \u00b1 0.79\n18.04 \u00b1 2.20\n29.61 \u00b1 4.38\n86.23 \u00b1 1.95\n94.74 \u00b1 0.27\n69.91 \u00b1 1.69\n78.68 \u00b1 2.66\n\n88.97\n\nBaselines\nLMNN\n\n65.50 \u00b1 2.28\n95.49 \u00b1 0.89\n49.68 \u00b1 2.76\n76.08 \u00b1 4.79\n88.02 \u00b1 3.02\n96.43 \u00b1 0.28*\n70.04 \u00b1 2.20\n78.20 \u00b1 1.91\n\n95.83\n\n71.17\n95.12\n\n96.18 \u00b1 1.59\n42.90 \u00b1 4.19\n\n69.89\n\n82.02\n95.03\n\n98.36 \u00b1 1.03\n85.81 \u00b1 3.75\n\n82.81\n\nSCML\n\n71.68 \u00b1 1.86\n96.50 \u00b1 0.64*\n52.84 \u00b1 1.61\n65.10 \u00b1 9.00\n90.38 \u00b1 2.55*\n96.13 \u00b1 0.20\n69.22 \u00b1 2.60\n93.39 \u00b1 1.70*\n\n89.61\n\n85.43\n87.38\n\n96.91 \u00b1 1.93\n90.43 \u00b1 2.70\n\n83.46\n\nOur approach\n\nRVML-Lin-OT\n71.62 \u00b1 1.34\n95.24 \u00b1 1.21\n52.51 \u00b1 2.41\n74.71 \u00b1 5.27\n87.36 \u00b1 3.12\n90.25 \u00b1 0.60\n70.48 \u00b1 
3.19\n90.05 \u00b1 2.13\n\n91.40\n\n84.64\n94.83\n\n98.55 \u00b1 1.67\n88.60 \u00b1 3.63\n\n83.86\n\nRVML-Lin-Class\n73.09 \u00b1 2.49\n95.34 \u00b1 0.95\n55.41 \u00b1 2.55*\n75.29 \u00b1 5.08\n82.74 \u00b1 2.81\n95.51 \u00b1 0.26\n69.57 \u00b1 2.85\n87.94 \u00b1 1.99\n\n94.61\n\n78.44\n85.25\n\n98.18 \u00b1 1.48\n88.60 \u00b1 2.69\n\n83.07\n\n4 Experimental results\n\nIn this section, we evaluate our approach on 13 different datasets coming from either the UCI [19]\nrepository or used in recent works in metric learning [8, 20, 21]. For isolet, splice and svmguide1\nwe have access to a standard training/test partition, for the other datasets we use a 70% training/30%\ntest partition, we perform the experiments on 10 different splits and we average the result. We\nnormalize the examples with respect to the training set by subtracting for each attribute its mean\nand dividing by 3 times its standard deviation. We set our regularization parameter \u03bb with a 5-fold\ncross validation. After the metric learning step, we use a 1-nearest neighbor classi\ufb01er to assess the\nperformance of the metric and report the accuracy obtained.\nWe perform two series of experiments. First, we consider our linear formulation used with the\ntwo virtual points selection methods presented in this paper: RVML-Lin-OT based on Optimal\nTransport (Section 3.3.1) and RVML-Lin-Class using the class-based representation space method\n(Section 3.3.2). We compare them to a 1-nearest neighbor classi\ufb01er without metric learning (1NN),\nand with two state of the art linear metric learning methods: LMNN [5] and SCML [20].\nIn a second series, we consider the kernelized versions of RVML, namely RVML-RBF-OT and\nRVML-RBF-Class, based respectively on Optimal Transport and class-based representation space\nmethods, with a RBF kernel with the parameter \u03c3 \ufb01xed as the mean of all pairwise training set\nEuclidean distances [16]. 
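The sigma heuristic used for the RBF kernel can be sketched as follows (a toy numpy illustration; the exp(-d^2 / (2 sigma^2)) form of the RBF kernel and the synthetic data are assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))  # toy training set, one example per row

# Heuristic: sigma = mean of all pairwise Euclidean distances.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
sigma = D[np.triu_indices_from(D, k=1)].mean()

# RBF Gram matrix K_X used by the kernelized solution.
K = np.exp(-D ** 2 / (2 * sigma ** 2))
assert np.allclose(K, K.T) and np.all(np.diag(K) == 1.0)
```

This parameter-free choice of sigma keeps the kernel bandwidth at the scale of the data, so no extra hyper-parameter has to be cross-validated for the kernelized variants.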
We compare them to non linear methods using a KPCA with a RBF kernel2\nas a pre-process: 1NN-KPCA a 1-nearest neighbor classi\ufb01er without metric learning and LMNN-\nKPCA corresponding to LMNN in the KPCA-space. The number of dimensions is \ufb01xed as the one\nof the original space for high dimensional datasets (more than 100 attributes), to 3 times the original\ndimension when the dimension is smaller (between 5 and 100 attributes) and to 4 times the original\ndimension for the lowest dimensional datasets (less than 5 attributes). We also consider some local\nmetric learning methods: GBLMNN [8] a non linear version of LMNN and SCMLLOCAL [20] the\nlocal version of SCML. For all these methods, we use the implementations available online letting\nthem handle hyper-parameters tuning.\nThe results for linear methods are presented in Table 1 while Table 2 gives the results obtained\nwith the non linear approaches. In each table, the best result on each line is highlighted with a\nbold font while the second best result is underlined. A star indicates either that the best baseline\nis signi\ufb01cantly better than our best result or that our best result is signi\ufb01cantly better than the best\nbaseline according to classical signi\ufb01cance tests (the p-value being \ufb01xed at 0.05).\nWe can make the following remarks. In the linear setting, our approaches are very competitive to the\nstate of the art and RVML-Lin-OT tends to be the best on average even though it must be noticed that\nSCML is very competitive on some datasets (the average difference is not signi\ufb01cant). RVML-Lin-\nClass performs slightly less on average. Considering now the non linear methods, our approaches\nimprove their performance and are signi\ufb01cantly better than the others on average, RVML-RBF-Class\nhas the best average behavior in this setting. 
These experiments show that our regressive formulation\n\n2With the \u03c3 parameter \ufb01xed as previously to the mean of all pairwise training set Euclidean distances.\n\n7\n\n\fTable 2: Comparison of our approach with several baselines in the non-linear case.\n\nBaselines\n\nOur approach\n\nBase\n\nAmazon\nBreast\nCaltech\nDSLR\n\nIonosphere\n\nIsolet\nLetter\nPima\nScale\nSplice\n\nSvmguide1\n\nWine\n\nWebcam\n\nmean\n\n1NN-KPCA\n20.27 \u00b1 2.42\n92.43 \u00b1 2.19\n20.82 \u00b1 8.29\n64.90 \u00b1 5.81\n75.57 \u00b1 2.79\n95.39 \u00b1 0.27\n69.57 \u00b1 2.64\n78.36 \u00b1 0.88\n\n68.70\n\n66.99\n95.72\n\n92.18 \u00b1 1.23\n73.55 \u00b1 4.57\n\n70.34\n\nLMNN-KPCA\n53.16 \u00b1 3.73\n95.39 \u00b1 1.32\n29.88 \u00b1 10.89\n73.92 \u00b1 7.57\n85.66 \u00b1 2.55\n97.17* \u00b1 0.18\n69.48 \u00b1 2.04\n88.10 \u00b1 2.26\n\n96.28\n\n88.97\n95.60\n\n95.82 \u00b1 2.98\n84.52 \u00b1 3.83\n\n81.07\n\nGBLMNN\n65.53 \u00b1 2.32\n95.58 \u00b1 0.87\n49.91 \u00b1 2.80\n76.08 \u00b1 4.79\n87.36 \u00b1 3.02\n96.51 \u00b1 0.25\n69.52 \u00b1 2.27\n77.88 \u00b1 2.43\n\n96.02\n\n82.21\n95.00\n\n98.00 \u00b1 1.34\n85.81 \u00b1 3.75\n\n82.72\n\nSCMLLOCAL\n69.14 \u00b1 1.74\n96.31 \u00b1 0.66\n50.56 \u00b1 1.62\n62.55 \u00b1 6.94\n90.94 \u00b1 3.02\n96.63 \u00b1 0.26\n68.40 \u00b1 2.75\n93.86 \u00b1 1.78\n\n91.40\n\n87.13\n87.40\n\n96.55 \u00b1 2.00\n88.71 \u00b1 2.83\n\n83.04\n\nRVML-RBF-OT\n73.51 \u00b1 0.83\n95.73 \u00b1 0.97\n54.39 \u00b1 1.89\n70.39 \u00b1 4.48\n90.66 \u00b1 3.10\n91.26 \u00b1 0.50\n69.35 \u00b1 2.95\n95.19 \u00b1 1.46*\n\n95.96\n\n88.51\n95.67\n\n98.91 \u00b1 1.53\n88.71 \u00b1 4.28\n\n85.25\n\nRVML-RBF-Class\n76.22 \u00b1 2.09*\n95.78 \u00b1 0.92\n57.98 \u00b1 2.22*\n76.67 \u00b1 4.57\n93.11 \u00b1 3.30*\n96.09 \u00b1 0.21\n70.74 \u00b1 2.36\n94.07 \u00b1 2.02\n\n96.73\n\n88.32\n95.05\n\n98.00 \u00b1 1.81\n88.92 \u00b1 2.91\n\n86.74\n\nis very competitive and is even able to improve state of the art performances in a non linear setting\nand consequently that 
our virtual point selection methods automatically select correct instances.

Considering the virtual point selection, we can observe that the OT formulation performs better than the class-based representation space one in the linear case, while the opposite holds in the non-linear case. We think that this can be explained by the fact that the OT approach generates more virtual points, in a potentially non-linear way, which brings more expressiveness in the linear case. On the other hand, in the non-linear case, the relatively small number of virtual points used by the class-based method seems to induce a better regularization. In Section 4 of the supplementary material, we provide additional experiments showing the interest of using explicit virtual points and the need for a careful association between examples and virtual points. We also provide some graphics showing 2D projections of the space learned by RVML-Lin-Class and RVML-RBF-Class on the Isolet dataset, illustrating the capability of these approaches to learn discriminative attributes.

In terms of computational cost, our approach, implemented in closed form [22], is competitive with classical methods but does not yield significant improvements. Indeed, in practice, classical approaches only consider a small number of constraints, e.g. c times the number of examples, where c is a small constant, in the case of SCML. Thus, the practical computational complexity of both our approach and classical methods is linearly dependent on the number of examples.

5 Conclusion

We present a new metric learning approach based on a regression, aiming at bringing the learning examples closer to some a priori defined virtual points. The number of constraints has the advantage of growing linearly with the size of the learning set, in contrast to the quadratic growth of standard must-link/cannot-link approaches.
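To make this concrete, the kind of closed-form virtual-point regression described here can be sketched as follows. This is a hedged illustration, not the authors' exact formulation (their implementation is available online [22]): the ridge regularizer, the value of λ, the toy data and the choice of one-hot class indicators as virtual points (echoing the class-based unit-vector strategy) are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two well-separated classes, n examples of dimension d.
n, d = 60, 5
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d)) + 3.0 * y[:, None]

# One virtual point per class: here the unit vectors of a class-based
# representation space, so each example's regression target is one-hot.
V = np.eye(2)[y]                     # shape (n, 2)

# Closed-form ridge regression of the examples onto their virtual points:
#   L = argmin_L ||X L - V||_F^2 + lam ||L||_F^2
# One constraint per example, hence linear growth in the learning set size.
lam = 1e-2
L = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ V)

# Mahalanobis matrix M = L L^T, positive semi-definite by construction.
M = L @ L.T

def d_M(a, b):
    """Learned Mahalanobis-like distance."""
    diff = a - b
    return np.sqrt(diff @ M @ diff)
```

In the learned space, examples sharing a virtual point are pulled together, so same-class distances shrink relative to cross-class ones without enumerating any pairwise constraints.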
Moreover, our method can be solved in closed form and can be easily kernelized, allowing us to deal with non-linear problems. Additionally, we propose two methods to define the virtual points: one making use of recent advances in the field of optimal transport and one based on unit vectors of a class-based representation space, allowing one to directly perform some dimensionality reduction. Theoretically, we show that our approach is consistent and we are able to link our empirical risk to the true risk of a classical metric learning formulation. Finally, we empirically show that our approach is competitive with the state of the art in the linear case and outperforms some classical approaches in the non-linear one.

We think that this work opens the door to designing new metric learning formulations; in particular, the definition of the virtual points can bring a way to control some particular properties of the metric (rank, locality, discriminative power, ...). As a consequence, this aspect opens new issues which are in part related to landmark selection problems but also to the ability to embed expressive semantic constraints to satisfy by means of the virtual points. Other perspectives include the development of a specific solver, of online versions, the use of low-rank-inducing norms or the conception of new local metric learning methods. Another direction would be to study similarity learning extensions to perform linear classification such as in [21, 23].

References

[1] Jason V. Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. Information-theoretic metric learning. In Proc. of ICML, pages 209–216, 2007.

[2] Jacob Goldberger, Sam T. Roweis, Geoffrey E. Hinton, and Ruslan Salakhutdinov. Neighbourhood components analysis. In Proc. of NIPS, pages 513–520, 2004.

[3] Aurélien Bellet, Amaury Habrard, and Marc Sebban. Metric Learning.
Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2015.

[4] Brian Kulis. Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4):287–364, 2013.

[5] Kilian Q. Weinberger, John Blitzer, and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. In Proc. of NIPS, pages 1473–1480, 2005.

[6] Amir Globerson and Sam T. Roweis. Metric learning by collapsing classes. In Proc. of NIPS, pages 451–458, 2005.

[7] Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 10:207–244, 2009.

[8] Dor Kedem, Stephen Tyree, Kilian Q. Weinberger, Fei Sha, and Gert R. G. Lanckriet. Non-linear metric learning. In Proc. of NIPS, pages 2582–2590, 2012.

[9] Bernhard Schölkopf, Alex J. Smola, and Klaus-Robert Müller. Kernel principal component analysis. In Proc. of ICANN, pages 583–588, 1997.

[10] Jason Weston, Olivier Chapelle, André Elisseeff, Bernhard Schölkopf, and Vladimir Vapnik. Kernel dependency estimation. In Proc. of NIPS, pages 873–880, 2002.

[11] Corinna Cortes, Mehryar Mohri, and Jason Weston. A general regression technique for learning transductions. In Proc. of ICML, pages 153–160, 2005.

[12] Hachem Kadri, Mohammad Ghavamzadeh, and Philippe Preux. A generalized kernel approach to structured output learning. In Proc. of ICML, pages 471–479, 2013.

[13] Rong Jin, Shijun Wang, and Yang Zhou. Regularized distance metric learning: Theory and algorithm. In Proc. of NIPS, pages 862–870, 2009.

[14] Olivier Bousquet and André Elisseeff. Stability and generalization. JMLR, 2:499–526, 2002.

[15] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

[16] Purushottam Kar and Prateek Jain.
Similarity-based learning via data driven embeddings. In Proc. of NIPS, pages 1998–2006, 2011.

[17] Nicolas Courty, Rémi Flamary, and Devis Tuia. Domain adaptation with regularized optimal transport. In Proc. of ECML/PKDD, pages 274–289, 2014.

[18] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Proc. of NIPS, pages 2292–2300, 2013.

[19] M. Lichman. UCI machine learning repository, 2013.

[20] Yuan Shi, Aurélien Bellet, and Fei Sha. Sparse compositional metric learning. In Proc. of AAAI Conference on Artificial Intelligence, pages 2078–2084, 2014.

[21] Aurélien Bellet, Amaury Habrard, and Marc Sebban. Similarity learning for provably accurate sparse linear classification. In Proc. of ICML, 2012.

[22] The closed-form implementation of RVML is freely available on the authors' website.

[23] Maria-Florina Balcan, Avrim Blum, and Nathan Srebro. Improved guarantees for learning via similarity functions. In Proc. of COLT, pages 287–298, 2008.