{"title": "A new metric on the manifold of kernel matrices with application to matrix geometric means", "book": "Advances in Neural Information Processing Systems", "page_first": 144, "page_last": 152, "abstract": "Symmetric positive definite (spd) matrices are remarkably pervasive in a multitude of scientific disciplines, including machine learning and optimization. We consider the fundamental task of measuring distances between two spd matrices; a task that is often nontrivial whenever an application demands the distance function to respect the non-Euclidean geometry of spd matrices. Unfortunately, typical non-Euclidean distance measures such as the Riemannian metric $\\riem(X,Y)=\\frob{\\log(X\\inv{Y})}$, are computationally demanding and also complicated to use. To allay some of these difficulties, we introduce a new metric on spd matrices: this metric not only respects non-Euclidean geometry, it also offers faster computation than $\\riem$ while being less complicated to use. We support our claims theoretically via a series of theorems that relate our metric to $\\riem(X,Y)$, and experimentally by studying the nonconvex problem of computing matrix geometric means based on squared distances.", "full_text": "A new metric on the manifold of kernel matrices with application to matrix geometric means\n\nSuvrit Sra\n\nMax Planck Institute for Intelligent Systems\n\n72076 T\u00fcbingen, Germany\n\nsuvrit@tuebingen.mpg.de\n\nAbstract\n\nSymmetric positive definite (spd) matrices pervade numerous scientific disciplines, including machine learning and optimization. We consider the key task of measuring distances between two spd matrices; a task that is often nontrivial whenever the distance function must respect the non-Euclidean geometry of spd matrices. Typical non-Euclidean distance measures such as the Riemannian metric $\\delta_R(X, Y) = \\|\\log(Y^{-1/2} X Y^{-1/2})\\|_F$ are computationally demanding and also complicated to use. 
To allay some of these difficulties, we introduce a new metric on spd matrices, which not only respects non-Euclidean geometry but also offers faster computation than $\\delta_R$ while being less complicated to use. We support our claims theoretically by listing a set of theorems that relate our metric to $\\delta_R(X, Y)$, and experimentally by studying the nonconvex problem of computing matrix geometric means based on squared distances.\n\n1 Introduction\n\nSymmetric positive definite (spd) matrices (see footnote 1) are remarkably pervasive in a multitude of areas, especially machine learning and optimization. Several applications in these areas require an answer to the fundamental question: how to measure a distance between two spd matrices?\n\nThis question arises, for instance, when optimizing over the set of spd matrices. To judge convergence of an optimization procedure or in the design of algorithms we may need to compute distances between spd matrices [1\u20133]. As a more concrete example, suppose we wish to retrieve from a large database of spd matrices the \u201cclosest\u201d spd matrix to an input query. The quality of such a retrieval depends crucially on the distance function used to measure closeness; a choice that also dramatically impacts the actual search algorithm itself [4, 5]. Another familiar setting is that of computing statistical metrics for multivariate Gaussian distributions [6], or more recently, quantum statistics [7]. Several other applications depend on being able to effectively measure distances between spd matrices; see e.g., [8\u201310] and references therein.\n\nIn many of these domains, viewing spd matrices as members of a Euclidean vector space is insufficient, and the non-Euclidean geometry conferred by a suitable metric is of great importance. 
Indeed, the set of (strict) spd matrices forms a differentiable Riemannian manifold [11, 10] that is perhaps the most studied example of a manifold of nonpositive curvature [12; Ch.10]. These matrices also form a convex cone, and the set of spd matrices in fact serves as a canonical higher-rank symmetric space [13]. The conic view is of great importance in convex optimization [14\u201316], symmetric spaces are important in algebra and analysis [13, 17], and in optimization [14, 18], while the manifold and other views are also widely important; see e.g., [11; Ch.6] for an overview.\n\nFootnote 1: We could equally consider Hermitian matrices, but for simplicity we consider only real matrices.\n\nThe starting point for this paper is the manifold view. For space reasons, we limit our discussion to $\\mathbb{P}_n$ as a Riemannian manifold, noting that most of the discussion could also be set in terms of Finsler manifolds. But before we go further, let us fix basic notation.\n\nNotation. Let $\\mathbb{S}_n$ denote the set of $n \\times n$ real symmetric matrices. A matrix $A \\in \\mathbb{S}_n$ is called positive (we drop the word \u201cdefinite\u201d for brevity) if\n\n$$\\langle x, Ax \\rangle > 0 \\quad \\text{for all } x \\neq 0; \\tag{1}$$\n\nthis is also denoted as $A > 0$. We denote the set of $n \\times n$ positive matrices by $\\mathbb{P}_n$. If only the non-strict inequality $\\langle x, Ax \\rangle \\ge 0$ holds (for all $x \\in \\mathbb{R}^n$) we say $A$ is positive semidefinite; this is also denoted as $A \\ge 0$. For two matrices $A, B \\in \\mathbb{S}_n$, the operator inequality $A \\ge B$ means that the difference $A - B \\ge 0$. The Frobenius norm of a matrix $X \\in \\mathbb{R}^{m \\times n}$ is defined as $\\|X\\|_F = \\sqrt{\\mathrm{tr}(X^T X)}$, while $\\|X\\|$ denotes the standard operator norm. For an analytic function $f$ on $\\mathbb{C}$, and a diagonalizable matrix $A = U \\Lambda U^T$, $f(A) := U f(\\Lambda) U^T$. Let $\\lambda(X)$ denote the vector of eigenvalues of $X$ (in any order) and $\\mathrm{Eig}(X)$ a diagonal matrix that has $\\lambda(X)$ as its diagonal. We also use $\\lambda^{\\downarrow}(X)$ to denote a sorted (in descending order) version of $\\lambda(X)$, and $\\lambda^{\\uparrow}(X)$ is defined likewise. Finally, we define $\\mathrm{Eig}^{\\downarrow}(X)$ and $\\mathrm{Eig}^{\\uparrow}(X)$ as the corresponding diagonal matrices.\n\nBackground. The set $\\mathbb{P}_n$ is a canonical higher-rank symmetric space that is actually an open set within $\\mathbb{S}_n$, and thereby a differentiable manifold of dimension $n(n + 1)/2$. The tangent space at a point $A \\in \\mathbb{P}_n$ can be identified with $\\mathbb{S}_n$, so a suitable inner-product on $\\mathbb{S}_n$ leads to the Riemannian distance on $\\mathbb{P}_n$ [11; Ch.6]. At the point $A$ this metric is induced by the differential form\n\n$$ds^2 = \\|A^{-1/2}\\, dA\\, A^{-1/2}\\|_F^2 = \\mathrm{tr}(A^{-1}\\, dA\\, A^{-1}\\, dA). \\tag{2}$$\n\nFor $A, B \\in \\mathbb{P}_n$, it is known that there is a unique geodesic joining them given by [11; Thm.6.1.6]:\n\n$$\\gamma(t) := A \\sharp_t B := A^{1/2}(A^{-1/2} B A^{-1/2})^t A^{1/2}, \\quad 0 \\le t \\le 1, \\tag{3}$$\n\nand its midpoint $\\gamma(1/2)$ is the geometric mean of $A$ and $B$. The associated Riemannian metric is\n\n$$\\delta_R(A, B) := \\|\\log(A^{-1/2} B A^{-1/2})\\|_F, \\quad \\text{for } A, B > 0. \\tag{4}$$\n\nFrom definition (4) it is apparent that computing $\\delta_R$ will be computationally demanding, and requires care. Indeed, to compute (4) we must essentially compute generalized eigenvalues of $A$ and $B$. For an application that must repeatedly compute distances between numerous pairs of matrices this computational burden can be excessive [4]. Driven by such computational concerns, Cherian et al. [4] introduced a symmetrized \u201clog-det\u201d based matrix divergence:\n\n$$J(A, B) = \\log\\det\\left(\\tfrac{A+B}{2}\\right) - \\tfrac{1}{2}\\log\\det(AB) \\quad \\text{for } A, B > 0. \\tag{5}$$\n\nThis divergence was used as a proxy for $\\delta_R$, and it was observed that $J(A, B)$ offers the same level of performance on a difficult nearest neighbor retrieval task as $\\delta_R$, while being many times faster! 
Among other reasons, a large part of their speedup was attributed to the avoidance of eigenvalue computations for obtaining $J(A, B)$ or its derivatives, a luxury that $\\delta_R$ does not permit. Independently, Chebbi and Moakher [2] also introduced a slightly generalized version of (5) and studied some of its properties, especially computation of \u201ccentroids\u201d of positive matrices using their matrix divergence.\n\nInterestingly, Cherian et al. [4] claimed that $\\sqrt{J(A, B)}$ might not be a metric, whereas Chebbi and Moakher [2] conjectured that $\\sqrt{J(A, B)}$ is a metric. We resolve this uncertainty and prove that $\\sqrt{J(A, B)}$ is indeed a metric, albeit not one that embeds isometrically into a Hilbert space.\n\nDue to space constraints, we only summarily mention several of the properties that this metric satisfies, primarily to help develop intuition that motivates $\\sqrt{J}$ as a good proxy for the Riemannian metric $\\delta_R$. We apply these insights to study \u201cmatrix geometric means\u201d of a set of positive matrices: a problem also studied in [4, 2]. Both cited papers have some gaps in their claims, which we fill by proving that even though computing the geometric mean is a nonconvex problem, we can still compute it efficiently and optimally.\n\n2 The $\\delta_{\\ell d}$ metric\n\nThe main result of this paper is Theorem 1.\n\nTheorem 1. Let $J$ be as in (5), and define $\\delta_{\\ell d} := \\sqrt{J}$. Then, $\\delta_{\\ell d}$ is a metric on $\\mathbb{P}_n$.\n\nOur proof of Theorem 1 depends on several key steps. Due to restrictions on space we cannot include full proofs of all the results, and refer the reader to the longer article [19] instead. We do, however, provide sketches for the crucial steps in our proof.\n\nProposition 2. Let $A, B, C \\in \\mathbb{P}_n$. 
Then, (i) $\\delta_{\\ell d}(I, A) = \\delta_{\\ell d}(I, \\mathrm{Eig}(A))$; (ii) for $P, Q \\in GL(n, \\mathbb{C})$, $\\delta_{\\ell d}(PAQ, PBQ) = \\delta_{\\ell d}(A, B)$; (iii) for $X \\in GL(n, \\mathbb{C})$, $\\delta_{\\ell d}(X^*AX, X^*BX) = \\delta_{\\ell d}(A, B)$; (iv) $\\delta_{\\ell d}(A, B) = \\delta_{\\ell d}(A^{-1}, B^{-1})$; (v) $\\delta_{\\ell d}(A \\otimes B, A \\otimes C) = \\sqrt{n}\\,\\delta_{\\ell d}(B, C)$, where $\\otimes$ denotes the Kronecker or tensor product.\n\nThe first crucial result is that for positive scalars, $\\delta_{\\ell d}$ is indeed a metric. To prove this, recall the notion of negative definite functions (Def. 3), and a related classical result of Schoenberg (Thm. 4).\n\nDefinition 3 ([20; Def. 1.1]). Let $\\mathcal{X}$ be a nonempty set. A function $\\psi : \\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}$ is said to be negative definite if for all $x, y \\in \\mathcal{X}$ it is symmetric ($\\psi(x, y) = \\psi(y, x)$), and satisfies the inequality\n\n$$\\sum_{i,j=1}^n c_i c_j \\psi(x_i, x_j) \\le 0, \\tag{6}$$\n\nfor all integers $n \\ge 2$, and subsets $\\{x_i\\}_{i=1}^n \\subseteq \\mathcal{X}$, $\\{c_i\\}_{i=1}^n \\subseteq \\mathbb{R}$ with $\\sum_{i=1}^n c_i = 0$.\n\nTheorem 4 ([20; Prop. 3.2, Chap. 3]). Let $\\psi : \\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}$ be negative definite. Then, there is a Hilbert space $\\mathcal{H} \\subseteq \\mathbb{R}^{\\mathcal{X}}$ and a mapping $x \\mapsto \\varphi(x)$ from $\\mathcal{X} \\to \\mathcal{H}$ such that we have the equality\n\n$$\\|\\varphi(x) - \\varphi(y)\\|_{\\mathcal{H}}^2 = \\psi(x, y) - \\tfrac{1}{2}(\\psi(x, x) + \\psi(y, y)). \\tag{7}$$\n\nMoreover, negative definiteness of $\\psi$ is necessary for such a mapping to exist.\n\nTheorem 5 (Scalar case). Define $\\delta_s^2(x, y) := \\log[(x + y)/(2\\sqrt{xy})]$ for scalars $x, y > 0$. Then,\n\n$$\\delta_s(x, y) \\le \\delta_s(x, z) + \\delta_s(y, z) \\quad \\text{for all } x, y, z > 0. \\tag{8}$$\n\nProof. We show that $\\psi(x, y) = \\log\\left(\\tfrac{x+y}{2}\\right)$ is negative definite. Since $\\delta_s^2(x, y) = \\psi(x, y) - \\tfrac{1}{2}(\\psi(x, x) + \\psi(y, y))$, Thm. 4 then implies the triangle inequality (8). To prove $\\psi$ is negative definite, by [Thm. 2.2, Chap. 3, 20] we may equivalently show that $e^{-\\beta\\psi(x,y)} = ((x + y)/2)^{-\\beta}$ is a positive definite function for $\\beta > 0$, and all $x, y > 0$. To that end, it suffices to show that the matrix\n\n$$H = [h_{ij}] = \\left[(x_i + x_j)^{-\\beta}\\right], \\quad 1 \\le i, j \\le n,$$\n\nis positive definite for every integer $n \\ge 1$, and positive numbers $\\{x_i\\}_{i=1}^n$. Now, observe that\n\n$$h_{ij} = \\frac{1}{(x_i + x_j)^{\\beta}} = \\frac{1}{\\Gamma(\\beta)} \\int_0^{\\infty} e^{-t(x_i + x_j)} t^{\\beta - 1}\\, dt, \\tag{9}$$\n\nwhere $\\Gamma(\\beta) = \\int_0^{\\infty} e^{-t} t^{\\beta - 1}\\, dt$ is the well-known Gamma function. Thus, with $f_i(t) = e^{-t x_i} t^{(\\beta - 1)/2} \\in L_2([0, \\infty))$, we see that $[h_{ij}]$ equals (up to the positive factor $1/\\Gamma(\\beta)$) the Gram matrix $[\\langle f_i, f_j \\rangle]$, whereby $H > 0$.\n\nUsing Thm. 5 we obtain the following simple but important \u201cMinkowski\u201d inequality for $\\delta_s$.\n\nCorollary 6. Let $x, y, z$ be vectors in $\\mathbb{R}^n$ with positive entries, and let $p \\ge 1$. Then,\n\n$$\\Bigl(\\sum_{i=1}^n \\delta_s^p(x_i, y_i)\\Bigr)^{1/p} \\le \\Bigl(\\sum_{i=1}^n \\delta_s^p(x_i, z_i)\\Bigr)^{1/p} + \\Bigl(\\sum_{i=1}^n \\delta_s^p(y_i, z_i)\\Bigr)^{1/p}. \\tag{10}$$\n\nCorollary 7. Let $X, Y, Z > 0$ be diagonal matrices. Then,\n\n$$\\delta_{\\ell d}(X, Y) \\le \\delta_{\\ell d}(X, Z) + \\delta_{\\ell d}(Y, Z). \\tag{11}$$\n\nNext, we recall a fundamental determinantal inequality.\n\nTheorem 8 ([21; Exercise VI.7.2]). Let $A, B \\in \\mathbb{P}_n$. Then,\n\n$$\\prod_{i=1}^n \\bigl(\\lambda_i^{\\downarrow}(A) + \\lambda_i^{\\downarrow}(B)\\bigr) \\le \\det(A + B) \\le \\prod_{i=1}^n \\bigl(\\lambda_i^{\\downarrow}(A) + \\lambda_i^{\\uparrow}(B)\\bigr). \\tag{12}$$\n\nCorollary 9. Let $A, B > 0$. 
Then,\n\n$$\\delta_{\\ell d}(\\mathrm{Eig}^{\\downarrow}(A), \\mathrm{Eig}^{\\downarrow}(B)) \\le \\delta_{\\ell d}(A, B) \\le \\delta_{\\ell d}(\\mathrm{Eig}^{\\downarrow}(A), \\mathrm{Eig}^{\\uparrow}(B)).$$\n\nThe final result that we need is a well-known fact from linear algebra (our own proof is in [19]).\n\nLemma 10 ([e.g., 22; p.58]). Let $A > 0$, and let $B$ be Hermitian. There is a matrix $P$ for which\n\n$$P^*AP = I, \\quad \\text{and} \\quad P^*BP = D, \\quad \\text{where } D \\text{ is diagonal}. \\tag{13}$$\n\nWith all these theorems and lemmas in hand, we are now finally ready to prove Thm. 1.\n\nProof. (Theorem 1). We must prove that $\\delta_{\\ell d}$ is symmetric, nonnegative, definite, and that it satisfies the triangle inequality. Symmetry is immediate from the definition. Nonnegativity and definiteness follow from the strict log-concavity (on $\\mathbb{P}_n$) of the determinant, whereby\n\n$$\\det\\left(\\tfrac{X+Y}{2}\\right) \\ge \\det(X)^{1/2} \\det(Y)^{1/2},$$\n\nwith equality iff $X = Y$, which in turn implies that $\\delta_{\\ell d}(X, Y) \\ge 0$ with equality iff $X = Y$. The only hard part is to prove the triangle inequality, a result that has eluded previous attempts [4, 2].\n\nLet $X, Y, Z > 0$ be arbitrary. From Lemma 10 we know that there is a matrix $P$ such that $P^*XP = I$ and $P^*YP = D$. Since $Z > 0$ is arbitrary, and congruence preserves positive definiteness, we may write just $Z$ instead of $P^*ZP$. Also, since $\\delta_{\\ell d}(P^*XP, P^*YP) = \\delta_{\\ell d}(X, Y)$ (see Prop. 2), proving the triangle inequality reduces to showing that\n\n$$\\delta_{\\ell d}(I, D) \\le \\delta_{\\ell d}(I, Z) + \\delta_{\\ell d}(D, Z). \\tag{14}$$\n\nConsider now the diagonal matrices $D^{\\downarrow}$ and $\\mathrm{Eig}^{\\downarrow}(Z)$. Corollary 7 asserts the inequality\n\n$$\\delta_{\\ell d}(I, D^{\\downarrow}) \\le \\delta_{\\ell d}(I, \\mathrm{Eig}^{\\downarrow}(Z)) + \\delta_{\\ell d}(D^{\\downarrow}, \\mathrm{Eig}^{\\downarrow}(Z)). \\tag{15}$$\n\nProp. 2(i) implies that $\\delta_{\\ell d}(I, D) = \\delta_{\\ell d}(I, D^{\\downarrow})$ and $\\delta_{\\ell d}(I, Z) = \\delta_{\\ell d}(I, \\mathrm{Eig}^{\\downarrow}(Z))$, while Cor. 
9 shows that $\\delta_{\\ell d}(D^{\\downarrow}, \\mathrm{Eig}^{\\downarrow}(Z)) \\le \\delta_{\\ell d}(D, Z)$. Combining these inequalities, we obtain (14), as desired.\n\nAlthough the metric space $(\\mathbb{P}_n, \\delta_{\\ell d})$ has numerous fascinating properties, due to space concerns, we do not discuss it further. Instead we discuss a connection more important to machine learning and related areas: kernel functions arising from $\\delta_{\\ell d}$. Indeed, some of these connections (e.g., Thm. 11) have already been successfully applied very recently in computer vision [23].\n\n2.1 Hilbert space embedding of $\\delta_{\\ell d}$\n\nTheorem 1 shows that $\\delta_{\\ell d}$ is a metric, and Theorem 5 shows that for positive scalars the metric space $(\\mathbb{R}_{++}, \\delta_s)$ actually embeds isometrically into a Hilbert space. It is, therefore, natural to ask whether $(\\mathbb{P}_n, \\delta_{\\ell d})$ also admits such an embedding.\n\nTheorem 4 says that such an embedding exists if and only if $\\delta_{\\ell d}^2$ is negative definite; equivalently, iff\n\n$$e^{-\\beta \\delta_{\\ell d}^2(X, Y)} = \\frac{\\det(XY)^{\\beta/2}}{\\det((X+Y)/2)^{\\beta}} \\tag{16}$$\n\nis a positive definite kernel for all $\\beta > 0$. To verify this, it suffices to check whether the matrix\n\n$$H_{\\beta} = [h_{ij}] := \\left[\\frac{1}{\\det(X_i + X_j)^{\\beta}}\\right], \\quad 1 \\le i, j \\le m, \\tag{17}$$\n\nis positive for every integer $m \\ge 1$ and arbitrary positive matrices $X_1, \\ldots, X_m$.\n\nUnfortunately, a numerical experiment (see [19]) reveals that $H_{\\beta}$ is not always positive. This implies that $(\\mathbb{P}_n, \\delta_{\\ell d})$ cannot embed isometrically into a Hilbert space. Undeterred, we still ask: For what choices of $\\beta$ is $H_{\\beta}$ positive? Surprisingly, this question admits a complete answer. Theorem 11 characterizes the values of $\\beta$ necessary and sufficient for $H_{\\beta}$ to be positive. We note here that the case $\\beta = 1$ was essentially treated in [24], in the context of semigroup kernels on measures.\n\nTheorem 11. Let $X_1, \\ldots, X_m \\in \\mathbb{P}_n$. 
The matrix $H_{\\beta}$ defined by (17) is positive if and only if\n\n$$\\beta \\in \\left\\{ \\tfrac{j}{2} : j \\in \\mathbb{N}, \\text{ and } 1 \\le j \\le (n - 1) \\right\\} \\cup \\left\\{ \\gamma : \\gamma \\in \\mathbb{R}, \\text{ and } \\gamma > \\tfrac{1}{2}(n - 1) \\right\\}. \\tag{18}$$\n\nProof. We first prove the \u201cif\u201d part. Define the function $f_i := \\frac{1}{\\pi^{n/4}} e^{-x^T X_i x}$ (for $1 \\le i \\le m$). Then, $f_i \\in L_2(\\mathbb{R}^n)$, where the inner-product is given by the Gaussian integral\n\n$$\\langle f_i, f_j \\rangle := \\frac{1}{\\pi^{n/2}} \\int_{\\mathbb{R}^n} e^{-x^T (X_i + X_j) x}\\, dx = \\frac{1}{\\det(X_i + X_j)^{1/2}}. \\tag{19}$$\n\nFrom (19) it follows that $H_{1/2}$ is positive. Since the Schur (elementwise) product of two positive matrices is again positive, it follows that $H_{\\beta} > 0$ whenever $\\beta$ is a positive integer multiple of $1/2$. To extend the result to all $\\beta$ covered by (18), we need a more intricate integral representation, namely the multivariate Gamma function, defined as [25; \u00a72.1.2]\n\n$$\\Gamma_n(\\beta) := \\int_{\\mathbb{P}_n} e^{-\\mathrm{tr}(A)} \\det(A)^{\\beta - (n+1)/2}\\, dA, \\tag{20}$$\n\nwhere the integral converges for $\\beta > \\tfrac{1}{2}(n - 1)$. Define for each $i$ the function $f_i := c\\, e^{-\\mathrm{tr}(A X_i)}$ ($c > 0$ is a constant). Then, $f_i \\in L_2(\\mathbb{P}_n)$, which we equip with the inner product\n\n$$\\langle f_i, f_j \\rangle := c^2 \\int_{\\mathbb{P}_n} e^{-\\mathrm{tr}(A(X_i + X_j))} \\det(A)^{\\beta - (n+1)/2}\\, dA = \\det(X_i + X_j)^{-\\beta},$$\n\nand it exists whenever $\\beta > \\tfrac{1}{2}(n - 1)$. Consequently, $H_{\\beta}$ is positive for all $\\beta$ defined by (18).\n\nThe \u201conly if\u201d part follows from deeper results in the rich theory of symmetric spaces (see footnote 2). Specifically, since $\\mathbb{P}_n$ is a symmetric cone, and $1/\\det(X)$ is a decreasing function on this cone (i.e., $1/\\det(X + Y) \\le 1/\\det(X)$ for all $X, Y > 0$), an appeal to [26; VII.3.1] grants our claim.\n\nRemark 12. 
Readers versed in stochastic processes will recognize that the above result provides a different perspective on a classical result concerning infinite divisibility of Wishart processes [27], where the set (18) also arises as a consequence of Gindikin\u2019s theorem [28].\n\nAt this point, it is worth mentioning the following \u201cobvious\u201d result.\n\nTheorem 13. Let $\\mathcal{X}$ be a set of positive matrices that commute with each other. Then, $(\\mathcal{X}, \\delta_{\\ell d})$ can be isometrically embedded into some Hilbert space.\n\nProof. The proof follows because a commuting set of matrices can be simultaneously diagonalized, and for diagonal matrices, $\\delta_{\\ell d}^2(X, Y) = \\sum_i \\delta_s^2(X_{ii}, Y_{ii})$, which is a nonnegative sum of negative definite kernels and is therefore itself negative definite.\n\n3 Connections between $\\delta_{\\ell d}$ and $\\delta_R$\n\nAfter showing that $\\delta_{\\ell d}$ is a metric and studying its relation to kernel functions, let us now return to our original motivation: introducing $\\delta_{\\ell d}$ as a reasonable alternative to the widely used Riemannian metric $\\delta_R$. We note here that Cherian et al. [4; 29] offer strong experimental evidence supporting $\\delta_{\\ell d}$ as an alternative; we offer more theoretical results.\n\nOur theoretical results are based around showing that $\\delta_{\\ell d}$ fulfills several properties akin to those displayed by $\\delta_R$. Due to lack of space, we present only a summary of our results in Table 1, and cite the corresponding theorems in the longer article [19] for proofs. 
While the actual proofs are valuable and instructive, the key message worth noting is: both $\\delta_R$ and $\\delta_{\\ell d}$ express the (negatively curved) non-Euclidean geometry of their respective metric spaces by displaying similar properties.\n\n4 Application: computing geometric means\n\nIn this section we turn our attention to an object that perhaps connects $\\delta_R$ and $\\delta_{\\ell d}$ most intimately: the operator geometric mean (GM), which is given by the midpoint of the geodesic (3), denoted as\n\n$$A \\sharp B := \\gamma(1/2) = A^{1/2}(A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}. \\tag{21}$$\n\nFootnote 2: Specifically, the set (18) is identical to the Wallach set, which is important in the study of Hilbert spaces of holomorphic functions over symmetric domains [26; Ch.XIII].\n\nTable 1: Some of the similarities between $\\delta_R$ and $\\delta_{\\ell d}$. All matrices are assumed to be in $\\mathbb{P}_n$. The scalars $t, s, u$ satisfy $0 < t \\le 1$, $1 \\le s \\le u < \\infty$.\n\nRiemannian metric | Ref. | $\\delta_{\\ell d}$-metric | Ref.\n$\\delta_R(X^*AX, X^*BX) = \\delta_R(A, B)$ | [11; Ch.6] | $\\delta_{\\ell d}(X^*AX, X^*BX) = \\delta_{\\ell d}(A, B)$ | Prop. 2\n$\\delta_R(A^{-1}, B^{-1}) = \\delta_R(A, B)$ | [11; Ch.6] | $\\delta_{\\ell d}(A^{-1}, B^{-1}) = \\delta_{\\ell d}(A, B)$ | Prop. 2\n$\\delta_R(A^t, B^t) \\le t\\,\\delta_R(A, B)$ | [11; Ex.6.5.4] | $\\delta_{\\ell d}(A^t, B^t) \\le \\sqrt{t}\\,\\delta_{\\ell d}(A, B)$ | [19; Th.4.6]\n$\\delta_R(A^s, B^s) \\le (s/u)\\,\\delta_R(A^u, B^u)$ | [19; Th.4.11] | $\\delta_{\\ell d}(A^s, B^s) \\le \\sqrt{s/u}\\,\\delta_{\\ell d}(A^u, B^u)$ | [19; Th.4.11]\n$\\delta_R(A, A \\sharp B) = \\delta_R(B, A \\sharp B)$ | Trivial | $\\delta_{\\ell d}(A, A \\sharp B) = \\delta_{\\ell d}(B, A \\sharp B)$ | Th.14\n$\\delta_R(A, A \\sharp_t B) = t\\,\\delta_R(A, B)$ | [11; Th.6.1.6] | $\\delta_{\\ell d}(A, A \\sharp_t B) \\le \\sqrt{t}\\,\\delta_{\\ell d}(A, B)$ | [19; Th.4.7]\n$\\delta_R(A \\sharp_t B, A \\sharp_t C) \\le t\\,\\delta_R(B, C)$ | [11; Th.6.1.2] | $\\delta_{\\ell d}(A \\sharp_t B, A \\sharp_t C) \\le \\sqrt{t}\\,\\delta_{\\ell d}(B, C)$ | [19; Th.4.8]\n$\\delta_R^2(X, A) + \\delta_R^2(X, B) \\overset{\\min}{\\longmapsto} \\mathrm{GM}$ | [11; Ch.6] | $\\delta_{\\ell d}^2(X, A) + \\delta_{\\ell d}^2(X, B) \\overset{\\min}{\\longmapsto} \\mathrm{GM}$ | Th.14\n$\\delta_R(A + X, A + Y) \\le \\delta_R(X, Y)$ | [3] | $\\delta_{\\ell d}(A + X, A + Y) \\le \\delta_{\\ell d}(X, Y)$ | [19; Th.4.9]\n\nThe GM (21) has numerous attractive properties (see for instance [30]); among these, the following variational characterization is very important [31, 32]:\n\n$$A \\sharp B = \\operatorname*{argmin}_{X > 0}\\ \\delta_R^2(A, X) + \\delta_R^2(B, X), \\tag{22}$$\n\nespecially because it generalizes the matrix geometric mean to more than two matrices. Specifically, this \u201cnatural\u201d generalization is the Karcher mean (Fr\u00e9chet mean) [31, 32, 11]:\n\n$$\\mathrm{GM}(A_1, \\ldots, A_m) := \\operatorname*{argmin}_{X > 0}\\ \\sum_{i=1}^m \\delta_R^2(X, A_i). \\tag{23}$$\n\nThis multivariable generalization is in fact a well-studied difficult problem; see e.g., [33] for information on the state of the art. Indeed, its inordinate computational expense motivated Cherian et al. [4] to study the alternative mean\n\n$$\\mathrm{GM}_{\\ell d}(A_1, \\ldots, A_m) := \\operatorname*{argmin}_{X > 0}\\ \\phi(X) := \\sum_{i=1}^m \\delta_{\\ell d}^2(X, A_i), \\tag{24}$$\n\nwhich has also been more thoroughly studied by Chebbi and Moakher [2].\n\nAlthough the mean (24) was previously studied in [4, 2], some crucial aspects were missing. Specifically, Cherian et al. 
[4] only proved their solution to be a stationary point of $\\phi(X)$; they did not prove either global or local optimality. Although Chebbi and Moakher [2] showed that (24) has a unique solution, like [4] they too only proved stationarity, neither global nor local optimality.\n\nWe fill these gaps, and we make the following main contributions below:\n\n1. We connect (24) to the Karcher mean more closely: in Theorem 14 we show that for the two-matrix case both problems have the same solution;\n\n2. We show that the unique positive solution to (24) is globally optimal; this result is particularly interesting because $\\phi(X)$ is nonconvex.\n\nWe begin by looking at the two-variable case of $\\mathrm{GM}_{\\ell d}$ (24).\n\nTheorem 14. Let $A, B > 0$. Then,\n\n$$A \\sharp B = \\operatorname*{argmin}_{X > 0}\\ \\phi(X) := \\delta_{\\ell d}^2(X, A) + \\delta_{\\ell d}^2(X, B). \\tag{25}$$\n\nMoreover, $A \\sharp B$ is equidistant from $A$ and $B$, i.e., $\\delta_{\\ell d}(A, A \\sharp B) = \\delta_{\\ell d}(B, A \\sharp B)$.\n\nProof. If $A = B$, then clearly $X = A$ minimizes $\\phi(X)$. Assume therefore that $A \\neq B$. Ignoring the constraint $X > 0$ momentarily, we see that any stationary point must satisfy $\\nabla\\phi(X) = 0$. Thus,\n\n$$\\nabla\\phi(X) = \\tfrac{1}{2}\\left(\\tfrac{X+A}{2}\\right)^{-1} + \\tfrac{1}{2}\\left(\\tfrac{X+B}{2}\\right)^{-1} - X^{-1} = 0 \\implies (X + A) X^{-1} (X + B) = 2X + A + B \\implies B = X A^{-1} X. \\tag{26}$$\n\nThe latter equation is a Riccati equation that is known to have a unique, positive definite solution given by the matrix GM (21) (see [11; Prop 1.2.13]). All that remains to show is that this GM is in fact a local minimizer. To that end, we must show that the Hessian $\\nabla^2\\phi(X) > 0$ at $X = A \\sharp B$; but this claim is immediate from Theorem 18. So $A \\sharp B$ is a strict local minimum of (25), which is actually a global minimum because it is the unique positive solution to $\\nabla\\phi(X) = 0$. Finally, the equidistance property follows after some algebraic manipulations; we omit details for brevity [19].\n\nLet us now turn to the general case (24). The first-order optimality condition is\n\n$$\\nabla\\phi(X) = \\frac{1}{2}\\sum_{i=1}^m \\left(\\tfrac{X+A_i}{2}\\right)^{-1} - \\frac{1}{2} m X^{-1} = 0, \\quad X > 0. \\tag{27}$$\n\nFrom (27), using Lemma 15, it can be inferred [see also 2, 4] that any critical point $X$ of (24) lies in a convex, compact set specified by\n\n$$\\Bigl(\\tfrac{1}{m}\\sum_{i=1}^m A_i^{-1}\\Bigr)^{-1} \\preceq X \\preceq \\tfrac{1}{m}\\sum_{i=1}^m A_i.$$\n\nLemma 15 ([21; Ch.5]). The map $X \\mapsto X^{-1}$ on $\\mathbb{P}_n$ is order reversing and operator convex. That is, for $X, Y \\in \\mathbb{P}_n$, if $X \\ge Y$, then $X^{-1} \\le Y^{-1}$; for $t \\in [0, 1]$, $(tX + (1 - t)Y)^{-1} \\le tX^{-1} + (1 - t)Y^{-1}$.\n\nLemma 16 ([19]). Let $A, B, C, D \\in \\mathbb{P}_n$, so that $A \\ge B$ and $C \\ge D$. Then, $A \\otimes C \\ge B \\otimes D$.\n\nLemma 17 (Uniqueness [2]). The nonlinear equation (27) has a unique positive solution.\n\nUsing the above results, we can finally prove the main theorem of this section.\n\nTheorem 18. Let $X$ be a matrix satisfying (27). Then, it is the unique global minimizer of (24).\n\nProof. The objective function $\\phi(X)$ (24) has only one positive stationary point, which follows from Lemma 17. Let $X$ be this stationary point satisfying (27). We show that $X$ is actually a local minimum; global optimality is immediate from uniqueness of $X$.\n\nTo show local optimality, we prove that the Hessian $\\nabla^2\\phi(X) > 0$. 
Ignoring constants, showing positivity of the Hessian reduces to proving that\n\n$$m X^{-1} \\otimes X^{-1} - \\frac{1}{2}\\sum_{i=1}^m \\left(\\tfrac{X+A_i}{2}\\right)^{-1} \\otimes \\left(\\tfrac{X+A_i}{2}\\right)^{-1} > 0. \\tag{28}$$\n\nNow replace $m X^{-1}$ in (28) using the condition (27); therewith inequality (28) turns into\n\n$$\\Bigl(\\sum_{i=1}^m \\left(\\tfrac{X+A_i}{2}\\right)^{-1}\\Bigr) \\otimes X^{-1} > \\sum_{i=1}^m \\left(\\tfrac{X+A_i}{2}\\right)^{-1} \\otimes (X + A_i)^{-1} \\iff \\sum_{i=1}^m \\left(\\tfrac{X+A_i}{2}\\right)^{-1} \\otimes X^{-1} > \\sum_{i=1}^m \\left(\\tfrac{X+A_i}{2}\\right)^{-1} \\otimes (X + A_i)^{-1}. \\tag{29}$$\n\nFrom Lemma 15 we know that $X^{-1} > (X + A_i)^{-1}$, so that an application of Lemma 16 shows that $\\left(\\tfrac{X+A_i}{2}\\right)^{-1} \\otimes X^{-1} > \\left(\\tfrac{X+A_i}{2}\\right)^{-1} \\otimes (X + A_i)^{-1}$ for $1 \\le i \\le m$. Summing up, we obtain (29), which implies the desired local (and by uniqueness, global) optimality of $X$.\n\nRemark 19. It is worth noting that Theorem 18 establishes that solving (27) yields the global minimum of a nonconvex optimization problem. This result is even more remarkable because unlike CAT(0)-metrics such as $\\delta_R$, the metric $\\delta_{\\ell d}$ is not geodesically convex.\n\n4.1 Numerical Results\n\nWe present a key numerical result to illustrate the large savings in running time when computing with $\\delta_{\\ell d}$ as compared with $\\delta_R$. To compute the Karcher mean we downloaded the \u201cMatrix Means Toolbox\u201d of Bini and Iannazzo from http://bezout.dm.unipi.it/software/mmtoolbox/. In particular, we use the file called rich.m, which implements a state-of-the-art method [33].\n\nThe first plot in Fig. 1 indicates that $\\delta_{\\ell d}$ can be around 5 times faster than \u03b4R2 and up to 50 times faster than \u03b4R1. The second plot shows how expensive it can be to compute GM (23) as opposed to $\\mathrm{GM}_{\\ell d}$ (24): up to 1000 times! 
The former was computed using the method of [33], while the latter runs the fixed-point iteration proposed in [2] (the iteration was run until $\\|\\nabla\\phi(X)\\|$ fell below $10^{-10}$). The key point here is not that the fixed-point iteration is faster, but rather that (24) is a much simpler problem thanks to the convenient eigenvalue-free structure of $\\delta_{\\ell d}$.\n\n5 Conclusions and future work\n\nWe presented a new metric on the manifold of positive definite matrices, and related it to the classical Riemannian metric on this manifold. Empirically, our new metric was shown to lead to large computational gains, while theoretically, a series of theorems demonstrated how it expresses the negatively curved non-Euclidean geometry in a manner analogous to the Riemannian metric.\n\nFigure 1: Running time comparisons between $\\delta_R$ and $\\delta_{\\ell d}$. The left panel shows time (in seconds) taken to compute $\\delta_R$ and $\\delta_{\\ell d}$, averaged over 10 runs to reduce variance. In the plot, \u03b4R1 refers to the implementation of $\\delta_R$ in the matrix means toolbox [33], while \u03b4R2 is our own implementation.\n\nAt this point, there are several directions of future work opened by our paper. We mention some of the most relevant ones below. (i) Study further geometric properties of the metric space $(\\mathbb{P}_n, \\delta_{\\ell d})$; (ii) Further enrich the connections to $\\delta_R$, and to other (Finsler) metrics on $\\mathbb{P}_n$; (iii) Study properties of the geometric mean $\\mathrm{GM}_{\\ell d}$ (24), including faster algorithms to compute it; (iv) Akin to [4], apply $\\delta_{\\ell d}$ in settings where $\\delta_R$ has so far been dominant. We plan to tackle some of these problems, and hope that our paper encourages other researchers in machine learning and optimization to also study them.\n\nReferences\n\n[1] H. Lee and Y. Lim. Invariant metrics, contractions and nonlinear matrix equations. Nonlinearity, 21:857\u2013878, 2008.\n\n[2] Z. Chebbi and M. Moakher. 
Means of Hermitian positive-definite matrices based on the log-determinant \u03b1-divergence function. Linear Algebra and its Applications, 436:1872\u20131889, 2012.\n\n[3] P. Bougerol. Kalman Filtering with Random Coefficients and Contractions. SIAM J. Control Optim., 31(4):942\u2013959, 1993.\n\n[4] A. Cherian, S. Sra, A. Banerjee, and N. Papanikolopoulos. Efficient Similarity Search for Covariance Matrices via the Jensen-Bregman LogDet Divergence. In International Conference on Computer Vision (ICCV), Nov. 2011.\n\n[5] F. Porikli, O. Tuzel, and P. Meer. Covariance Tracking using Model Update Based on Lie Algebra. In IEEE CVPR, 2006.\n\n[6] L. T. Skovgaard. A Riemannian Geometry of the Multivariate Normal Model. Scandinavian Journal of Statistics, 11(4):211\u2013223, 1984.\n\n[7] D. Petz. Quantum Information Theory and Quantum Statistics. Springer, 2008.\n\n[8] I. Dryden, A. Koloydenko, and D. Zhou. Non-Euclidean statistics for covariance matrices, with applications to diffusion tensor imaging. Annals of Applied Statistics, 3(3):1102\u20131123, 2009.\n\n[9] H. Zhu, H. Zhang, J. G. Ibrahim, and B. S. Peterson. Statistical Analysis of Diffusion Tensors in Diffusion-Weighted Magnetic Resonance Imaging Data. Journal of the American Statistical Association, 102(480):1085\u20131102, 2007.\n\n[10] F. Hiai and D. Petz. Riemannian metrics on positive definite matrices related to means. Linear Algebra and its Applications, 430:3105\u20133130, 2009.\n\n[11] R. Bhatia. Positive Definite Matrices. Princeton University Press, 2007.\n\n[12] M. R. Bridson and A. Haefliger. Metric Spaces of Non-Positive Curvature. Springer, 1999.\n\n[13] A. Terras. Harmonic Analysis on Symmetric Spaces and Applications, volume II. Springer, 1988.\n\n[14] Yu. Nesterov and A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. SIAM, 1987.\n\n[15] A. Ben-Tal and A. Nemirovskii. 
Lectures on modern convex optimization: Analysis, algorithms, and engineering applications. SIAM, 2001.\n\n[16] Yu. Nesterov and M. J. Todd. On the Riemannian geometry defined by self-concordant barriers and interior-point methods. Found. Comput. Math., 2:333\u2013361, 2002.\n\n[Figure 1 panels. Left: time taken to compute \u03b4R and \u03b4S (curves \u03b4R1, \u03b4R2, \u03b4S). Right: time taken to compute GM and GMld for 10 matrices (curves GM, GMld). Axes in both panels: dimensionality (n) of the matrices used versus running time (seconds).]\n\n[17] S. Helgason. Geometric Analysis on Symmetric Spaces. Number 39 in Mathematical Surveys and Monographs. AMS, second edition, 2008.\n\n[18] H. Wolkowicz, R. Saigal, and L. Vandenberghe, editors. Handbook of Semidefinite Programming: Theory, Algorithms, and Applications. Kluwer Academic, 2000.\n\n[19] S. Sra. Positive definite matrices and the Symmetric Stein Divergence. arXiv:1110.1773, October 2012.\n\n[20] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic analysis on semigroups: theory of positive definite and related functions, volume 100 of GTM. Springer, 1984.\n\n[21] R. Bhatia. Matrix Analysis. Springer, 1997.\n\n[22] R. Bellman. Introduction to Matrix Analysis. SIAM, second edition, 1970.\n\n[23] M. Harandi, C. Sanderson, R. Hartley, and B. Lovell. Sparse Coding and Dictionary Learning for Symmetric Positive Definite Matrices: A Kernel Approach. In European Conference on Computer Vision (ECCV), 2012.\n\n[24] M. Cuturi, K. Fukumizu, and J. P. Vert. Semigroup kernels on measures. JMLR, 6:1169\u20131198, 2005.\n\n[25] R. J. Muirhead. Aspects of multivariate statistical theory. Wiley Interscience, 1982.\n\n[26] J. Faraut and A. Kor\u00e1nyi. Analysis on Symmetric Cones. Clarendon Press, 1994.\n\n[27] M.-F. Bru. Wishart Processes. J. Theoretical Probability, 4(4), 1991.\n\n[28] S. G. Gindikin. 
Invariant generalized functions in homogeneous domains. Functional Analysis and its Applications, 9:50\u201352, 1975.\n\n[29] A. Cherian, S. Sra, A. Banerjee, and N. Papanikolopoulos. Jensen-Bregman LogDet Divergence with Application to Efficient Similarity Search for Covariance Matrices. IEEE TPAMI, 2012. Submitted.\n\n[30] T. Ando. Concavity of certain maps on positive definite matrices and applications to Hadamard products. Linear Algebra and its Applications, 26(0):203\u2013241, 1979.\n\n[31] R. Bhatia and J. A. R. Holbrook. Riemannian geometry and matrix geometric means. Linear Algebra Appl., 413:594\u2013618, 2006.\n\n[32] M. Moakher. A differential geometric approach to the geometric mean of symmetric positive-definite matrices. SIAM J. Matrix Anal. Appl. (SIMAX), 26:735\u2013747, 2005.\n\n[33] D. A. Bini and B. Iannazzo. Computing the Karcher mean of symmetric positive definite matrices. Linear Algebra and its Applications, Oct. 2011. Available online.", "award": [], "sourceid": 93, "authors": [{"given_name": "Suvrit", "family_name": "Sra", "institution": null}]}