{"title": "Learning Low-Dimensional Metrics", "book": "Advances in Neural Information Processing Systems", "page_first": 4139, "page_last": 4147, "abstract": "This paper investigates the theoretical foundations of metric learning, focused on three key questions that are not fully addressed in prior work: 1) we consider learning general low-dimensional (low-rank) metrics as well as sparse metrics; 2) we develop upper and lower (minimax) bounds on the generalization error; 3) we quantify the sample complexity of metric learning in terms of the dimension of the feature space and the dimension/rank of the underlying metric; 4) we also bound the accuracy of the learned metric relative to the underlying true generative metric. All the results involve novel mathematical approaches to the metric learning problem, and also shed new light on the special case of ordinal embedding (aka non-metric multidimensional scaling).", "full_text": "Learning Low-Dimensional Metrics\n\nLalit Jain*\nUniversity of Michigan\nAnn Arbor, MI 48109\nlalitj@umich.edu\n\nBlake Mason*\nUniversity of Wisconsin\nMadison, WI 53706\nbmason3@wisc.edu\n\nRobert Nowak\nUniversity of Wisconsin\nMadison, WI 53706\nrdnowak@wisc.edu\n\nAbstract\n\nThis paper investigates the theoretical foundations of metric learning, focused on three key questions that are not fully addressed in prior work: 1) we consider learning general low-dimensional (low-rank) metrics as well as sparse metrics; 2) we develop upper and lower (minimax) bounds on the generalization error; 3) we quantify the sample complexity of metric learning in terms of the dimension of the feature space and the dimension/rank of the underlying metric; 4) we also bound the accuracy of the learned metric relative to the underlying true generative metric.
All the results involve novel mathematical approaches to the metric learning problem, and also shed new light on the special case of ordinal embedding (aka non-metric multidimensional scaling).\n\n1 Low-Dimensional Metric Learning\n\nThis paper studies the problem of learning a low-dimensional Euclidean metric from comparative judgments. Specifically, consider a set of $n$ items with high-dimensional features $x_i \\in \\mathbb{R}^p$ and suppose we are given a set of (possibly noisy) distance comparisons of the form\n\n$$\\mathrm{sign}(\\mathrm{dist}(x_i, x_j) - \\mathrm{dist}(x_i, x_k)),$$\n\nfor a subset of all possible triplets of the items. Here we have in mind comparative judgments made by humans and the distance function implicitly defined according to human perceptions of similarities and differences. For example, the items could be images and the $x_i$ could be visual features automatically extracted by a machine. Accordingly, our goal is to learn a $p \\times p$ symmetric positive semi-definite (psd) matrix $K$ such that the metric $d_K(x_i, x_j) := (x_i - x_j)^T K (x_i - x_j)$, where $d_K(x_i, x_j)$ denotes the squared distance between items $i$ and $j$ with respect to a matrix $K$, predicts the given distance comparisons as well as possible. Furthermore, it is often desired that the metric is low-dimensional relative to the original high-dimensional feature representation (i.e., $\\mathrm{rank}(K) \\le d < p$). There are several motivations for this:\n\n\u2022 Learning a high-dimensional metric may be infeasible from a limited number of comparative judgments, and encouraging a low-dimensional solution is a natural regularization.\n\u2022 Cognitive scientists are often interested in visualizing human perceptual judgments (e.g., in a two-dimensional representation) and determining which features most strongly influence human perceptions. For example, educational psychologists in [1] collected comparisons between visual representations of chemical molecules in order to identify a small set of visual features that most significantly influence the judgments of beginning chemistry students.\n\u2022 It is sometimes reasonable to hypothesize that a small subset of the high-dimensional features dominates the underlying metric (i.e., many features are irrelevant).\n\u2022 Downstream applications of the learned metric (e.g., for classification purposes) may benefit from robust, low-dimensional metrics.\n\n*Authors contributed equally to this paper and are listed alphabetically.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: Examples of $K$ for $p = 20$ and $d = 7$: (a) a general low rank psd matrix; (b) a sparse and low rank psd matrix. The sparse case depicts a situation in which only some of the features are relevant to the metric.\n\nWith this in mind, several authors have proposed nuclear norm and $\\ell_{1,2}$ group lasso norm regularization to encourage low-dimensional and sparse metrics as in Fig. 1b (see [2] for a review). Relative to such prior work, the contributions of this paper are three-fold:\n\n1. We develop novel upper bounds on the generalization error and sample complexity of learning low-dimensional metrics from triplet distance comparisons. Notably, unlike previous generalization bounds, our bounds allow one to easily quantify how the feature space dimension $p$ and rank or sparsity $d < p$ of the underlying metric impact the sample complexity.\n\n2. We establish minimax lower bounds for learning low-rank and sparse metrics that match the upper bounds up to polylogarithmic factors, demonstrating the optimality of learning algorithms for the first time.
Moreover, the upper and lower bounds demonstrate that learning sparse (and low-rank) metrics is essentially as difficult as learning a general low-rank metric. This suggests that nuclear norm regularization may be preferable in practice, since it places less restrictive assumptions on the problem.\n\n3. We use the generalization error bounds to obtain model identification error bounds that quantify the accuracy of the learned $K$ matrix. This problem has received very little, if any, attention in the past and is crucial for interpreting the learned metrics (e.g., in cognitive science applications). This lack of attention is a bit surprising, since the term \u201cmetric learning\u201d strongly suggests accurately determining a metric, not simply learning a predictor that is parameterized by a metric.\n\n1.1 Comparison with Previous Work\n\nThere is a fairly large body of work on metric learning which is nicely reviewed and summarized in the monograph [2], and we refer the reader to it for a comprehensive summary of the field. Here we discuss a few recent works most closely connected to this paper. Several authors have developed generalization error bounds for metric learning, as well as bounds for downstream applications, such as classification, based on learned metrics. To use the terminology of [2], most of the focus has been on must-link/cannot-link constraints and less on relative constraints (i.e., triplet constraints as considered in this paper). Generalization bounds based on algorithmic robustness are studied in [3], but the generality of this framework makes it difficult to quantify the sample complexity of specific cases, such as low-rank or sparse metric learning. Rademacher complexities are used to establish generalization error bounds in the must-link/cannot-link situation in [4, 5, 6], but these works do not consider the case of relative/triplet constraints.
The sparse compositional metric learning framework of [7] does focus on relative/triplet constraints and provides generalization error bounds in terms of covering numbers. However, this work does not provide bounds on the covering numbers, making it difficult to quantify the sample complexity. To sum up, prior work does not quantify the sample complexity of metric learning based on relative/triplet constraints in terms of the intrinsic problem dimensions (i.e., dimension $p$ of the high-dimensional feature space and the dimension $d$ of the underlying metric), there is no prior work on lower bounds, and no prior work quantifying the accuracy of learned metrics themselves (i.e., only bounds on prediction errors, not model identification errors). Finally, we mention that Oymak et al. [8] also consider the recovery of sparse and low rank matrices from linear observations. Our situation is very different: our matrices are low rank because they are sparse, not sparse and simultaneously low rank as in their case.\n\n2 The Metric Learning Problem\n\nConsider $n$ known points $X := [x_1, x_2, \\ldots, x_n] \\in \\mathbb{R}^{p \\times n}$. We are interested in learning a symmetric positive semidefinite matrix $K$ that specifies a metric on $\\mathbb{R}^p$ given ordinal constraints on distances between the known points. Let $\\mathcal{S}$ denote a set of triplets, where each $t = (i, j, k) \\in \\mathcal{S}$ is drawn uniformly at random from the full set of $n\\binom{n-1}{2}$ triplets $\\mathcal{T} := \\{(i, j, k) : 1 \\le i \\ne j \\ne k \\le n,\\ j < k\\}$. For each triplet, we observe a $y_t \\in \\{\\pm 1\\}$ which is a noisy indication of the triplet constraint $d_K(x_i, x_j) < d_K(x_i, x_k)$. Specifically we assume that each $t$ has an associated probability $q_t$ of $y_t = -1$, and all $y_t$ are statistically independent.\n\nObjective 1: Compute an estimate $\\widehat{K}$ from $\\mathcal{S}$ that predicts triplets as well as possible.\n\nIn many instances, our triplet measurements are noisy observations of triplets from a true positive semi-definite matrix $K^*$.
In particular we assume\n\n$$q_t > 1/2 \\iff d_{K^*}(x_i, x_j) < d_{K^*}(x_i, x_k).$$\n\nWe can also assume an explicit known link function, $f : \\mathbb{R} \\to [0, 1]$, so that $q_t = f(d_{K^*}(x_i, x_j) - d_{K^*}(x_i, x_k))$.\n\nObjective 2: Assuming an explicit known link function $f$, estimate $K^*$ from $\\mathcal{S}$.\n\n2.1 Definitions and Notation\n\nOur triplet observations are nonlinear transformations of a linear function of the Gram matrix $G := X^T K X$. Indeed for any triple $t = (i, j, k)$, define\n\n$$M_t(K) := d_K(x_i, x_j) - d_K(x_i, x_k) = x_k^T K x_i + x_i^T K x_k - x_i^T K x_j - x_j^T K x_i + x_j^T K x_j - x_k^T K x_k.$$\n\nSo for every $t \\in \\mathcal{S}$, $y_t$ is a noisy measurement of $\\mathrm{sign}(M_t(K))$. This linear operator may also be expressed as a matrix\n\n$$M_t := x_i x_k^T + x_k x_i^T - x_i x_j^T - x_j x_i^T + x_j x_j^T - x_k x_k^T,$$\n\nso that $M_t(K) = \\langle M_t, K \\rangle = \\mathrm{Trace}(M_t^T K)$. We will use $M_t$ to denote the operator and associated matrix interchangeably. Ordering the elements of $\\mathcal{T}$ lexicographically, we let $\\mathcal{M}$ denote the linear map,\n\n$$\\mathcal{M}(K) = (M_t(K) \\mid t \\in \\mathcal{T}) \\in \\mathbb{R}^{n\\binom{n-1}{2}}.$$\n\nGiven a PSD matrix $K$ and a sample $t \\in \\mathcal{S}$, we let $\\ell(y_t \\langle M_t, K \\rangle)$ denote the loss of $K$ with respect to $t$; e.g., the 0-1 loss $\\mathbb{1}_{\\{\\mathrm{sign}(y_t \\langle M_t, K \\rangle) \\ne 1\\}}$, the hinge-loss $\\max\\{0, 1 - y_t \\langle M_t, K \\rangle\\}$, or the logistic loss $\\log(1 + \\exp(-y_t \\langle M_t, K \\rangle))$. Note that we insist that our losses be functions of our triplet differences $\\langle M_t, K \\rangle$. Further, note that this makes our losses invariant to rigid motions of the points $x_i$.
Other models proposed for metric learning use scale-invariant loss functions [9].\n\nFor a given loss $\\ell$, we then define the empirical risk with respect to our set of observations $\\mathcal{S}$ to be\n\n$$\\widehat{R}_{\\mathcal{S}}(K) := \\frac{1}{|\\mathcal{S}|} \\sum_{t \\in \\mathcal{S}} \\ell(y_t \\langle M_t, K \\rangle).$$\n\nThis is an unbiased estimator of the true risk $R(K) := \\mathbb{E}[\\ell(y_t \\langle M_t, K \\rangle)]$ where the expectation is taken with respect to a triplet $t$ selected uniformly at random and the random value of $y_t$.\n\nFinally, we let $I_n$ denote the identity matrix in $\\mathbb{R}^{n \\times n}$, $1_n$ the $n$-dimensional vector of all ones and $V := I_n - \\frac{1}{n} 1_n 1_n^T$ the centering matrix. In particular if $X \\in \\mathbb{R}^{p \\times n}$ is a set of points, $XV$ subtracts the mean of the columns of $X$ from each column. We say that $X$ is centered if $XV = X$, or equivalently $X 1_n = 0$. If $G$ is the Gram matrix of the set of points $X$, i.e. $G = X^T X$, then we say that $G$ is centered if $X$ is centered or, equivalently, if $G 1_n = 0$. Furthermore we use $\\|\\cdot\\|_*$ to denote the nuclear norm, and $\\|\\cdot\\|_{1,2}$ to denote the mixed $\\ell_{1,2}$ norm of a matrix, the sum of the $\\ell_2$ norms of its rows. Unless otherwise specified, we take $\\|\\cdot\\|$ to be the standard operator norm when applied to matrices and the standard Euclidean norm when applied to vectors. Finally we define the $K$-norm of a vector as $\\|x\\|_K^2 := x^T K x$.\n\n2.2 Sample Complexity of Learning Metrics\n\nIn most applications, we are interested in learning a matrix $K$ that is low-rank and positive-semidefinite. Furthermore as we will show in Theorem 2.1, such matrices can be learned using fewer samples than general psd matrices. As is common in machine learning applications, we relax the rank constraint to a nuclear norm constraint. In particular, let our constraint set be\n\n$$\\mathcal{K}_{\\lambda,\\gamma} = \\{K \\in \\mathbb{R}^{p \\times p} \\mid K \\text{ positive-semidefinite},\\ \\|K\\|_* \\le \\lambda,\\ \\max_{t \\in \\mathcal{T}} \\langle M_t, K \\rangle \\le \\gamma\\}.$$\n\nUp to constants, a bound on $\\langle M_t, K \\rangle$ is a bound on $x_i^T K x_i$. This bound, along with assuming our loss function is Lipschitz, will lead to a tighter bound on the deviation of $\\widehat{R}_{\\mathcal{S}}(K)$ from $R(K)$, which is crucial in our upper bound theorem.\n\nLet $K^* := \\arg\\min_{K \\in \\mathcal{K}_{\\lambda,\\gamma}} R(K)$ be the true risk minimizer in this class, and let $\\widehat{K} := \\arg\\min_{K \\in \\mathcal{K}_{\\lambda,\\gamma}} \\widehat{R}_{\\mathcal{S}}(K)$ be the empirical risk minimizer. We achieve the following prediction error bounds for the empirical risk minimizer.\n\nTheorem 2.1. Fix $\\lambda, \\gamma, \\delta > 0$. In addition assume that $\\max_{1 \\le i \\le n} \\|x_i\\|^2 = 1$. If the loss function $\\ell$ is $L$-Lipschitz, then with probability at least $1 - \\delta$,\n\n$$R(\\widehat{K}) - R(K^*) \\le 4L \\left( \\sqrt{\\frac{140 \\lambda^2 \\frac{\\|XX^T\\|}{n} \\log p}{|\\mathcal{S}|}} + \\frac{2 \\log p}{|\\mathcal{S}|} \\right) + \\sqrt{\\frac{2 L^2 \\gamma^2 \\log \\frac{2}{\\delta}}{|\\mathcal{S}|}}.$$\n\nNote that past generalization error bounds in the metric learning literature have failed to quantify the precise dependence on observation noise, dimension, rank, and our features $X$. Consider the fact that a $p \\times p$ matrix with rank $d$ has $O(dp)$ degrees of freedom. With that in mind, one expects the sample complexity to be also roughly $O(dp)$. We next show that this intuition is correct if the original representation $X$ is isotropic (i.e., has no preferred direction).\n\nThe Isotropic Case. Suppose that $x_1, \\ldots, x_n$, $n > p$, are drawn independently from the isotropic Gaussian $N(0, \\frac{1}{p} I)$. Furthermore, suppose that $K^* = \\frac{p}{\\sqrt{d}} U U^T$, where $U \\in \\mathbb{R}^{p \\times d}$ is a generic (dense) orthogonal matrix with unit norm columns. The factor $\\frac{p}{\\sqrt{d}}$ is simply the scaling needed so that the average magnitude of the entries in $K^*$ is a constant, independent of the dimensions $p$ and $d$. In this case, $\\mathrm{rank}(K^*) = d$ and $\\|K^*\\|_F = \\frac{p}{\\sqrt{d}} \\sqrt{\\mathrm{trace}(U^T U)} = p$. These two facts imply that the tightest bound on the nuclear norm of $K^*$ is $\\|K^*\\|_* \\le p\\sqrt{d}$. Thus, we take $\\lambda = p\\sqrt{d}$ for the nuclear norm constraint. Now let $z_i = \\sqrt{\\frac{p}{\\sqrt{d}}}\\, U^T x_i \\sim N(0, I_d)$ and note that $\\|x_i\\|_K^2 = \\|z_i\\|^2 \\sim \\chi_d^2$. Therefore, $\\mathbb{E}\\|x_i\\|_K^2 = d$ and it follows from standard concentration bounds that with large probability $\\max_i \\|x_i\\|_K^2 \\le 5 d \\log n =: \\gamma$; see [10]. Also, because the $x_i \\sim N(0, \\frac{1}{p} I)$, it follows that if $n > p \\log p$, say, then with large probability $\\|XX^T\\| \\le 5n/p$. We now plug these calculations into Theorem 2.1 to obtain the following corollary.\n\nCorollary 2.1.1 (Sample complexity for isotropic points). Fix $\\delta > 0$, set $\\lambda = p\\sqrt{d}$, and assume that $\\|XX^T\\| = O(n/p)$ and $\\gamma := \\max_i \\|x_i\\|_K^2 = O(d \\log n)$. Then for a generic $K^* \\in \\mathcal{K}_{\\lambda,\\gamma}$, as constructed above, with probability at least $1 - \\delta$,\n\n$$R(\\widehat{K}) - R(K^*) = O\\left( \\sqrt{\\frac{dp(\\log p + \\log^2 n)}{|\\mathcal{S}|}} \\right).$$\n\nThis bound agrees with the intuition that the sample complexity should grow roughly like $dp$, the degrees of freedom of $K^*$. Moreover, our minimax lower bound in Theorem 2.3 below shows that, ignoring logarithmic factors, the upper bound in Theorem 2.1 is unimprovable in general.\n\nBeyond low rank metrics, in many applications it is reasonable to assume that only a few of the features are salient and should be given nonzero weight. Such a metric may be learned by insisting that $K$ be row sparse in addition to being low rank. Whereas learning a low rank $K$ assumes that distance is well represented in a low dimensional subspace, a row sparse (and hence low rank) $K$ defines a metric using only a subset of the features. Figure 1 gives a comparison of a low rank versus a low rank and sparse matrix $K$.\n\nAnalogous to the convex relaxation of rank by the nuclear norm, it is common to relax row sparsity by using the mixed $\\ell_{1,2}$ norm. In fact, the geometries of the $\\ell_{1,2}$ and nuclear norm balls are tightly related, as the following lemma shows.\n\nLemma 2.2. For a symmetric positive semi-definite matrix $K \\in \\mathbb{R}^{p \\times p}$, $\\|K\\|_* \\le \\|K\\|_{1,2}$.\n\nProof.
$$\\|K\\|_{1,2} = \\sum_{i=1}^p \\sqrt{\\sum_{j=1}^p K_{i,j}^2} \\ge \\sum_{i=1}^p K_{i,i} = \\mathrm{Trace}(K) = \\sum_{i=1}^p \\lambda_i(K) = \\|K\\|_*.$$\n\nThis implies that the $\\ell_{1,2}$ ball of a given radius is contained inside the nuclear norm ball of the same radius. In particular, it is reasonable to assume that it is easier to learn a $K$ that is sparse in addition to being low rank. Surprisingly, however, the following minimax bound shows that this is not necessarily the case.\n\nTo make this more precise, we will consider optimization over the set\n\n$$\\mathcal{K}'_{\\lambda,\\gamma} = \\{K \\in \\mathbb{R}^{p \\times p} \\mid K \\text{ positive-semidefinite},\\ \\|K\\|_{1,2} \\le \\lambda,\\ \\max_{t \\in \\mathcal{T}} \\langle M_t, K \\rangle \\le \\gamma\\}.$$\n\nFurthermore, we must specify the way in which our data could be generated from noisy triplet observations of a fixed $K^*$. To this end, assume the existence of a link function $f : \\mathbb{R} \\to [0, 1]$ so that $q_t = \\mathbb{P}(y_t = -1) = f(M_t(K^*))$ governs the observations. There is a natural associated logarithmic loss function $\\ell_f$ corresponding to the log-likelihood, where the loss of an arbitrary $K$ is\n\n$$\\ell_f(y_t \\langle M_t, K \\rangle) = \\mathbb{1}_{\\{y_t = -1\\}} \\log \\frac{1}{f(\\langle M_t, K \\rangle)} + \\mathbb{1}_{\\{y_t = 1\\}} \\log \\frac{1}{1 - f(\\langle M_t, K \\rangle)}.$$\n\nTheorem 2.3. Choose a link function $f$ and let $\\ell_f$ be the associated logarithmic loss. For $p$ sufficiently large, there exists a choice of $\\lambda$, $\\gamma$, $X$, and $|\\mathcal{S}|$ such that\n\n$$\\inf_{\\widehat{K}} \\sup_{K \\in \\mathcal{K}'_{\\lambda,\\gamma}} \\mathbb{E}[R(\\widehat{K})] - R(K) \\ge C \\sqrt{\\frac{\\frac{C_1^3 \\ln 4}{2} \\lambda^2 \\frac{\\|XX^T\\|}{n}}{|\\mathcal{S}|}},$$\n\nwhere $C = \\frac{C_f^2}{32} \\sqrt{\\frac{\\inf_{|x| \\le \\gamma} f(x)(1 - f(x))}{\\sup_{|\\nu| \\le \\gamma} f'(\\nu)^2}}$ with $C_f = \\inf_{|x| \\le \\gamma} f'(x)$, $C_1$ is an absolute constant, and the infimum is taken over all estimators $\\widehat{K}$ of $K$ from $|\\mathcal{S}|$ samples.\n\nImportantly, up to polylogarithmic factors and constants, our minimax lower bound over the $\\ell_{1,2}$ ball matches the upper bound over the nuclear norm ball given in Theorem 2.1. In particular, in the worst case, learning a sparse and low rank matrix $K$ is no easier than learning a $K$ that is simply low rank.
However, in many realistic cases, a slight performance gain is seen from optimizing over the $\\ell_{1,2}$ ball when $K^*$ is row sparse, while optimizing over the nuclear norm ball does better when $K^*$ is dense. We show examples of this in Section 3. The proof is given in the supplementary materials.\n\nNote that if $\\gamma$ is in a bounded range, then the constant $C$ has little effect. For the case that $f$ is the logistic function, $C_f \\ge \\frac{1}{4} e^{-y_t \\langle M_t, K \\rangle} \\ge \\frac{1}{4} e^{-\\gamma}$. Likewise, the term under the root will also be bounded for $\\gamma$ in a constant range. The terms in the constant $C$ arise when translating from risk and a KL-divergence to squared distance and reflect the noise in the problem.\n\n2.3 Sample Complexity Bounds for Identification\n\nUnder a general loss function and arbitrary $K^*$, we cannot hope to convert our prediction error bounds into a recovery statement. However, in this section we will show that as long as $K^*$ is low rank, and if we choose the loss function to be the log loss $\\ell_f$ of a given link function $f$ as defined prior to the statement of Theorem 2.3, recovery is possible. Firstly, note that under these assumptions we have an explicit formula for the risk,\n\n$$R(K) = \\frac{1}{|\\mathcal{T}|} \\sum_{t \\in \\mathcal{T}} f(\\langle M_t, K^* \\rangle) \\log \\frac{1}{f(\\langle M_t, K \\rangle)} + (1 - f(\\langle M_t, K^* \\rangle)) \\log \\frac{1}{1 - f(\\langle M_t, K \\rangle)}$$\n\nand\n\n$$R(K) - R(K^*) = \\frac{1}{|\\mathcal{T}|} \\sum_{t \\in \\mathcal{T}} \\mathrm{KL}(f(\\langle M_t, K^* \\rangle) \\,\\|\\, f(\\langle M_t, K \\rangle)).$$\n\nThe following lemma shows that if the excess risk is small, i.e. $R(\\widehat{K})$ approximates $R(K^*)$ well, then $\\mathcal{M}(\\widehat{K})$ approximates $\\mathcal{M}(K^*)$ well. The proof, given in the supplementary materials, uses standard Taylor series arguments to show the KL-divergence is bounded below by squared-distance.\n\nLemma 2.4. Let $C_f = \\inf_{|x| \\le \\gamma} f'(x)$.
Then for any $K \\in \\mathcal{K}_{\\lambda,\\gamma}$,\n\n$$\\frac{2 C_f^2}{|\\mathcal{T}|} \\|\\mathcal{M}(K) - \\mathcal{M}(K^*)\\|^2 \\le R(K) - R(K^*).$$\n\nThis lemma may give us hope that recovering $K^*$ from $\\mathcal{M}(K^*)$ is trivial, but the linear operator $\\mathcal{M}$ is non-invertible in general, as we discuss next. To see why, we must consider a more general class of operators defined on Gram matrices. Given a symmetric matrix $G$, define the operator $L_t$ by\n\n$$L_t(G) = 2G_{ik} - 2G_{ij} + G_{jj} - G_{kk}.$$\n\nIf $G = X^T K X$ then $L_t(G) = M_t(K)$, and moreover $M_t = X L_t X^T$. Analogous to $\\mathcal{M}$, we will combine the $L_t$ operators into a single operator $\\mathcal{L}$,\n\n$$\\mathcal{L}(G) = (L_t(G) \\mid t \\in \\mathcal{T}) \\in \\mathbb{R}^{n\\binom{n-1}{2}}.$$\n\nLemma 2.5. The null space of $\\mathcal{L}$ is one dimensional, spanned by $V = I_n - \\frac{1}{n} 1_n 1_n^T$.\n\nThe proof is contained in the supplementary materials. In particular we see that $\\mathcal{M}$ is not invertible in general, adding a serious complication to our argument. However, $\\mathcal{L}$ is still invertible on the subset of centered symmetric matrices orthogonal to $V$, a fact that we will now exploit. We can decompose $G$ into $V$ and a component orthogonal to $V$ denoted $H$,\n\n$$G = H + \\sigma_G V,$$\n\nwhere $\\sigma_G := \\frac{\\langle G, V \\rangle}{\\|V\\|_F^2}$, and under the assumption that $G$ is centered, $\\sigma_G = \\frac{\\|G\\|_*}{n-1}$. Remarkably, the following lemma tells us that a non-linear function of $H$ uniquely determines $G$.\n\nLemma 2.6. If $n > d + 1$, and $G$ is rank $d$ and centered, then $-\\sigma_G$ is an eigenvalue of $H$ with multiplicity $n - d - 1$. In addition, given another Gram matrix $G'$ of rank $d'$, $\\sigma_{G'} - \\sigma_G$ is an eigenvalue of $H - H'$ with multiplicity at least $n - d - d' - 1$.\n\nProof. Since $G$ is centered, $1_n \\in \\ker G$, and in particular $\\dim(1_n^\\perp \\cap \\ker G) = n - d - 1$. If $x \\in 1_n^\\perp \\cap \\ker G$, then\n\n$$Gx = Hx + \\sigma_G V x \\implies Hx = -\\sigma_G x.$$\n\nFor the second statement, notice that $\\dim(1_n^\\perp \\cap \\ker(G - G')) \\ge n - d - d' - 1$. A similar argument then applies.\n\nIf $n > 2d$, then the multiplicity of the eigenvalue $-\\sigma_G$ is at least $n/2$. So we can trivially identify it from the spectrum of $H$. This gives us a non-linear way to recover $G$ from $H$.\n\nNow we can return to the task of recovering $K^*$ from $\\mathcal{M}(\\widehat{K})$. Indeed the above lemma implies that $G^*$ (and hence $K^*$ if $X$ is full rank) can be recovered from $H^*$ by computing an eigenvalue of $H^*$. However, $H^*$ is recoverable from $\\mathcal{L}(H^*)$, which is itself well approximated by $\\mathcal{L}(\\widehat{H}) = \\mathcal{M}(\\widehat{K})$. The proof of the following theorem makes this argument precise.\n\nTheorem 2.7. Assume that $K^*$ is rank $d$, $\\widehat{K}$ is rank $d'$, $n > d + d' + 1$, $X$ is rank $p$ and $X^T K^* X$ and $X^T \\widehat{K} X$ are both centered. Let $C_{d,d'} = \\left(1 + \\frac{n-1}{n-d-d'-1}\\right)$. Then with probability at least $1 - \\delta$,\n\n$$\\frac{n \\sigma_{\\min}(XX^T)^2}{|\\mathcal{T}|} \\|\\widehat{K} - K^*\\|_F^2 \\le \\frac{2 L C_{d,d'}}{C_f^2} \\left[ \\left( \\sqrt{\\frac{140 \\lambda^2 \\frac{\\|XX^T\\|}{n} \\log p}{|\\mathcal{S}|}} + \\frac{2 \\log p}{|\\mathcal{S}|} \\right) + \\sqrt{\\frac{2 L^2 \\gamma^2 \\log \\frac{2}{\\delta}}{|\\mathcal{S}|}} \\right],$$\n\nwhere $\\sigma_{\\min}(XX^T)$ is the smallest eigenvalue of $XX^T$.\n\nThe proof, given in the supplementary materials, relies on two key components, Lemma 2.6 and a type of restricted isometry property for $\\mathcal{M}$ on $V^\\perp$. Our proof technique is a streamlined and more general approach similar to that used in the special case of ordinal embedding. In fact, our new bound improves on the recovery bound given in [11] for ordinal embedding.\n\nWe have several remarks about the bound in the theorem. If $X$ is well conditioned, e.g. isotropic, then $\\sigma_{\\min}(XX^T) \\approx \\frac{n}{p}$. In that case $\\frac{n \\sigma_{\\min}(XX^T)^2}{|\\mathcal{T}|} \\approx \\frac{1}{p^2}$, so the left hand side is the average squared error of the recovery. In most applications the rank of the empirical risk minimizer $\\widehat{K}$ is approximately equal to the rank of $K^*$, i.e. $d \\approx d'$. Note that if $d + d' \\le \\frac{1}{2}(n - 1)$ then $C_{d,d'} \\le 3$.
Finally, the assumption that $X^T K^* X$ is centered can be guaranteed by centering $X$, which has no impact on the triplet differences $\\langle M_t, K^* \\rangle$, or by insisting that $K^*$ is centered. As mentioned above, $C_f$ will have little effect assuming that our measurements $\\langle M_t, K \\rangle$ are bounded.\n\n2.4 Applications to Ordinal Embedding\n\nIn the ordinal embedding setting, there is a set of items with unknown locations, $z_1, \\ldots, z_n \\in \\mathbb{R}^d$, and a set of triplet observations $\\mathcal{S}$ where, as in the metric learning case, observing $y_t = -1$ for a triplet $t = (i, j, k)$ is indicative of $\\|z_i - z_j\\|^2 \\le \\|z_i - z_k\\|^2$, i.e. item $i$ is closer to $j$ than $k$. The goal is to recover the $z_i$\u2019s, up to rigid motions, by recovering their Gram matrix $G^*$ from these comparisons. The ordinal embedding case reduces to metric learning through the following observation. Consider the case when $n = p$ and $X = I_p$, i.e. the $x_i$ are standard basis vectors. Letting $K^* = G^*$, we see that $\\|x_i - x_j\\|_K^2 = \\|z_i - z_j\\|^2$. So in particular, $L_t = M_t$ for each triple $t$, and observations are exactly comparative distance judgements. Our results then apply, and extend previous work on sample complexity in the ordinal embedding setting given in [11]. In particular, though Theorem 5 in [11] provides a consistency guarantee that the empirical risk minimizer $\\widehat{G}$ will converge to $G^*$, they do not provide a convergence rate. We resolve this issue now.\n\nIn their work, it is assumed that $\\|z_i\\|^2 \\le \\gamma$ and $\\|G\\|_* \\le \\sqrt{d}\\, n$. In particular, sample complexity results of the form $O(dn \\log n)$ are obtained. However, these results are trivial in the following sense: if $\\|z_i\\|^2 \\le \\gamma$ then $\\|G\\|_* \\le \\gamma n$, and their results (as well as our upper bound) imply that the true sample complexity is significantly smaller, namely $O(n \\log n)$, which is independent of the ambient dimension $d$. As before, assume an explicit link function $f$ with Lipschitz constant $L$, so the samples are noisy observations governed by $G^*$, and take the loss to be the logarithmic loss associated to $f$. We obtain the following improved recovery bound in this case. The proof is immediate from Theorem 2.7.\n\nCorollary 2.7.1. Let $G^*$ be the Gram matrix of $n$ centered points in $d$ dimensions with $\\|G^*\\|_F^2 = \\frac{\\gamma^2 n^2}{d}$. Let $\\widehat{G} = \\arg\\min_{\\|G\\|_* \\le \\gamma n,\\ \\|G\\|_\\infty \\le \\gamma} \\widehat{R}_{\\mathcal{S}}(G)$ and assume that $\\widehat{G}$ is rank $d$, with $n > 2d + 1$. Then,\n\n$$\\frac{\\|\\widehat{G} - G^*\\|_F^2}{n^2} = O\\left( \\frac{L C_{d,d}}{C_f^2} \\sqrt{\\frac{n \\log n}{|\\mathcal{S}|}} \\right).$$\n\n3 Experiments\n\nTo validate our complexity and recovery guarantees, we ran the following simulations. We generate $x_1, \\ldots, x_n \\overset{iid}{\\sim} N(0, \\frac{1}{p} I)$, with $n = 200$, and $K^* = \\frac{p}{\\sqrt{d}} U U^T$ for a random orthogonal matrix $U \\in \\mathbb{R}^{p \\times d}$ with unit norm columns. In Figure 2a, $K^*$ has $d$ nonzero rows/columns. In Figure 2b, $K^*$ is a dense rank-$d$ matrix. We compare the performance of nuclear norm and $\\ell_{1,2}$ regularization in each setting against an unconstrained baseline where we only enforce that $K$ be psd. Given a fixed number of samples, each method is compared in terms of the relative excess risk, $\\frac{R(\\widehat{K}) - R(K^*)}{R(K^*)}$, and the relative squared recovery error, $\\frac{\\|\\widehat{K} - K^*\\|_F^2}{\\|K^*\\|_F^2}$, averaged over 20 trials. The y-axes of both plots have been trimmed for readability.\n\nIn the case that $K^*$ is sparse, $\\ell_{1,2}$ regularization outperforms nuclear norm regularization. However, in the case of dense low rank matrices, nuclear norm regularization is superior. Notably, as expected from our upper and lower bounds, the performances of the two approaches seem to be within constant factors of each other.
Therefore, unless there is strong reason to believe that the underlying $K^*$ is sparse, nuclear norm regularization achieves comparable performance with a less restrictive modeling assumption. Furthermore, in the two settings, both the nuclear norm and $\\ell_{1,2}$ constrained methods outperform the unconstrained baseline, especially in the case where $K^*$ is low rank and sparse.\n\nTo empirically validate our sample complexity results, we compute the number of samples averaged over 20 runs to achieve a relative excess risk of less than 0.1 in Figure 3. First, we fix $p = 100$ and increment $d$ from 1 to 10. Then we fix $d = 10$ and increment $p$ from 10 to 100 to clearly show the linear dependence of the sample complexity on $d$ and $p$ as demonstrated in Corollary 2.1.1. To our knowledge, these are the first results quantifying the sample complexity in terms of the number of features, $p$, and the embedding dimension, $d$.\n\nFigure 2: $\\ell_{1,2}$ and nuclear norm regularization performance: (a) sparse low rank metric; (b) dense low rank metric.\n\nFigure 3: Number of samples to achieve relative excess risk < 0.1: (a) $d$ varying; (b) $p$ varying.\n\nAcknowledgments. This work was partially supported by the NSF grants CCF-1218189 and IIS-1623605.\n\nReferences\n\n[1] Martina A Rau, Blake Mason, and Robert D Nowak. How to model implicit knowledge? Similarity learning methods to assess perceptions of visual representations. In Proceedings of the 9th International Conference on Educational Data Mining, pages 199\u2013206, 2016.\n\n[2] Aur\u00e9lien Bellet, Amaury Habrard, and Marc Sebban. Metric learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 9(1):1\u2013151, 2015.\n\n[3] Aur\u00e9lien Bellet and Amaury Habrard. Robustness and generalization for metric learning. Neurocomputing, 151:259\u2013267, 2015.\n\n[4] Zheng-Chu Guo and Yiming Ying.
Guaranteed classification via regularized similarity learning. Neural Computation, 26(3):497\u2013522, 2014.\n\n[5] Yiming Ying, Kaizhu Huang, and Colin Campbell. Sparse metric learning via smooth optimization. In Advances in Neural Information Processing Systems, pages 2214\u20132222, 2009.\n\n[6] Wei Bian and Dacheng Tao. Constrained empirical risk minimization framework for distance metric learning. IEEE Transactions on Neural Networks and Learning Systems, 23(8):1194\u20131205, 2012.\n\n[7] Yuan Shi, Aur\u00e9lien Bellet, and Fei Sha. Sparse compositional metric learning. arXiv preprint arXiv:1404.4105, 2014.\n\n[8] Samet Oymak, Amin Jalali, Maryam Fazel, Yonina C Eldar, and Babak Hassibi. Simultaneously structured models with application to sparse and low-rank matrices. IEEE Transactions on Information Theory, 61(5):2886\u20132908, 2015.\n\n[9] Eric Heim, Matthew Berger, Lee Seversky, and Milos Hauskrecht. Active perceptual similarity modeling with auxiliary information. arXiv preprint arXiv:1511.02254, 2015.\n\n[10] Kenneth R Davidson and Stanislaw J Szarek. Local operator theory, random matrices and Banach spaces. Handbook of the Geometry of Banach Spaces, 1:317\u2013366, 2001.\n\n[11] Lalit Jain, Kevin G Jamieson, and Rob Nowak. Finite sample prediction and recovery bounds for ordinal embedding. In Advances in Neural Information Processing Systems, pages 2703\u20132711, 2016.\n\n[12] Mark A Davenport, Yaniv Plan, Ewout Van Den Berg, and Mary Wootters. 1-bit matrix completion. Information and Inference: A Journal of the IMA, 3(3):189\u2013223, 2014.\n\n[13] Joel A. Tropp. An introduction to matrix concentration inequalities, 2015.\n\n[14] Felix Abramovich and Vadim Grinshtein. Model selection and minimax estimation in generalized linear models. IEEE Transactions on Information Theory, 62(6):3721\u20133730, 2016.\n\n[15] Florentina Bunea, Alexandre B Tsybakov, Marten H Wegkamp, et al. Aggregation for Gaussian regression.
The Annals of Statistics, 35(4):1674\u20131697, 2007.\n\n[16] Philippe Rigollet and Alexandre Tsybakov. Exponential screening and optimal rates of sparse estimation. The Annals of Statistics, pages 731\u2013771, 2011.\n\n[17] Jon Dattorro. Convex Optimization & Euclidean Distance Geometry. Meboo Publishing USA, 2011.\n", "award": [], "sourceid": 2185, "authors": [{"given_name": "Blake", "family_name": "Mason", "institution": "University of Wisconsin - Madison"}, {"given_name": "Lalit", "family_name": "Jain", "institution": "University of Michigan"}, {"given_name": "Robert", "family_name": "Nowak", "institution": "University of Wisconsin-Madison"}]}