{"title": "PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1473, "page_last": 1480, "abstract": "In this work, we consider the problem of learning a positive semidefinite matrix. The critical issue is how to preserve positive semidefiniteness during the course of learning. Our algorithm is mainly inspired by LPBoost [1] and the general greedy convex optimization framework of Zhang [2]. We demonstrate the essence of the algorithm, termed PSDBoost (positive semidefinite Boosting), by focusing on a few different applications in machine learning. The proposed PSDBoost algorithm extends traditional Boosting algorithms in that its parameter is a positive semidefinite matrix with trace being one instead of a classifier. PSDBoost is based on the observation that any trace-one positive semidefinitematrix can be decomposed into linear convex combinations of trace-one rank-one matrices, which serve as base learners of PSDBoost. Numerical experiments are presented.", "full_text": "PSDBoost: Matrix-Generation Linear Programming\n\nfor Positive Semide\ufb01nite Matrices Learning\n\nChunhua Shen\u2020\u2021, Alan Welsh\u2021, Lei Wang\u2021\n\n\u2020NICTA Canberra Research Lab, Canberra, ACT 2601, Australia\u2217\n\u2021Australian National University, Canberra, ACT 0200, Australia\n\nAbstract\n\nIn this work, we consider the problem of learning a positive semide\ufb01nite matrix.\nThe critical issue is how to preserve positive semide\ufb01niteness during the course\nof learning. Our algorithm is mainly inspired by LPBoost [1] and the general\ngreedy convex optimization framework of Zhang [2]. We demonstrate the essence\nof the algorithm, termed PSDBoost (positive semide\ufb01nite Boosting), by focus-\ning on a few different applications in machine learning. 
The proposed PSDBoost\nalgorithm extends traditional Boosting algorithms in that its parameter is a posi-\ntive semide\ufb01nite matrix with trace being one instead of a classi\ufb01er. PSDBoost is\nbased on the observation that any trace-one positive semide\ufb01nite matrix can be de-\ncomposed into linear convex combinations of trace-one rank-one matrices, which\nserve as base learners of PSDBoost. Numerical experiments are presented.\n\n1 Introduction\n\nColumn generation (CG) [3] is a technique widely used in linear programming (LP) for solving\nlarge-sized problems. Thus far it has mainly been applied to solve problems with linear constraints.\nThe proposed work here\u2014which we dub matrix generation (MG)\u2014extends the column generation\ntechnique to non-polyhedral semide\ufb01nite constraints. In particular, as an application we show how\nto use it for solving a semide\ufb01nite metric learning problem. The fundamental idea is to rephrase a\nbounded semide\ufb01nite constraint into a polyhedral one with in\ufb01nitely many variables. This construc-\ntion opens possibilities for use of the highly developed linear programming technology. Given the\nlimitations of current semide\ufb01nite programming (SDP) solvers to deal with large-scale problems,\nthe work presented here is of importance for many real applications.\n\nThe choice of a metric has a direct effect on the performance of many algorithms such as the simplest\nk-NN classi\ufb01er and some clustering algorithms. Much effort has been spent on learning a good\nmetric for pattern recognition and data mining. Clearly a good metric is task-dependent: different\napplications should use different measures for (dis)similarity between objects. We show how a\nMahalanobis metric is learned from examples of proximity comparison among triples of training\ndata. 
For example, assuming that we are given triples of images ai, aj and ak (ai and aj have the same label, ai and ak have different labels, ai ∈ R^D), we want to learn a metric between pairs of images such that the distance from aj to ai (dist_ij) is smaller than that from ak to ai (dist_ik). Triplets like this are the input of our metric learning algorithm. By casting the problem as optimization over the inner product of the linear transformation matrix and its transpose, the formulation is based on solving a semidefinite program. The algorithm finds an optimal linear transformation that maximizes the margin between the distances dist_ij and dist_ik.

∗NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Center of Excellence program.

A major drawback of this formulation is that current SDP solvers utilizing interior-point (IP) methods do not scale well to large problems, with computational complexity roughly O(n^{4.5}) (n is the number of variables). On the other hand, linear programming is much better in terms of scalability. State-of-the-art solvers like CPLEX [4] can solve large problems with up to millions of variables and constraints. This motivates us to develop an LP approach to solve our SDP metric learning problem.

2 Related Work

We overview some relevant work in this section.

Column generation was first proposed by Dantzig and Wolfe [5] for solving some specially structured linear programs with an extremely large number of variables. [3] presents a comprehensive survey of this technique. The general idea of CG is that, instead of solving the original large-scale problem (the master problem), one works on a restricted master problem with a reasonably small subset of the variables at each step. 
The dual of the restricted master problem is solved by the simplex method, and the optimal dual solution is used to find the new column to be included into the restricted master problem. LPBoost [1] is a direct application of CG in Boosting. For the first time, LPBoost showed that, in an LP framework, unknown weak hypotheses can be learned from the dual even though the space of all weak hypotheses is infinitely large. This is the highlight of LPBoost, and it has directly inspired our work.

Metric learning using convex optimization has attracted much attention recently [6–8]. These works have made it possible to learn distance functions that are more appropriate for a specific task, based on partially labeled data or proximity constraints. The techniques improve classification or clustering accuracy by taking advantage of prior information. Of the considerable work reported, we list a few results that are most relevant to ours. [6] learns a Mahalanobis metric for clustering using convex optimization to minimize the distance between examples belonging to the same class, while at the same time restricting examples in different classes not to be too close. The work in [7] also learns a Mahalanobis metric using SDP, by optimizing a modified k-NN classifier. They use first-order alternating projection algorithms, which are faster than generic SDP solvers. The authors of [8] learn a Mahalanobis metric by considering proximity relationships of training examples. The final formulation is also an SDP. They replace the positive semidefinite (PSD) conic constraint with a sequence of linear constraints, using the fact that a diagonally dominant matrix with nonnegative diagonal entries must be PSD (but not vice versa). In other words, the conic constraint is replaced by a stricter one. 
The feasibility set shrinks and the solution obtained is not necessarily a solution of the original SDP.

3 Preliminaries

We begin with some notational conventions and basic definitions that will be useful.

A bold lower case letter x represents a column vector and an upper case letter X is a matrix. We denote the space of D × D symmetric matrices by S^D, and the positive semidefinite matrices by S^D_+. Tr(·) is the trace of a square matrix and ⟨X, Z⟩ = Tr(XZ⊤) = Σ_ij X_ij Z_ij calculates the inner product of two matrices. An element-wise inequality between two vectors is written u ≤ v, which means u_i ≤ v_i for all i.

We use X ≽ 0 to indicate that the matrix X is positive semidefinite. For a matrix X ∈ S^D, the following statements are equivalent: (1) X ≽ 0 (X ∈ S^D_+); (2) all eigenvalues of X are nonnegative (λ_i(X) ≥ 0, i = 1, · · · , D); and (3) ∀u ∈ R^D, u⊤Xu ≥ 0.

3.1 Extreme Points of Trace-one Semidefinite Matrices

Before we present our main results, we prove an important theorem that serves as the basis of the proposed algorithm.

Definition 3.1 For any positive integer M, given a set of points {x_1, ..., x_M} in a real vector or matrix space Sp, the convex hull of Sp spanned by M elements of Sp is defined as:

    conv_M(Sp) = { Σ_{i=1}^M θ_i x_i | θ_i ≥ 0, Σ_{i=1}^M θ_i = 1, x_i ∈ Sp }.

Define the convex hull1 of Sp as:

    conv(Sp) = ∪_{M ∈ Z+} conv_M(Sp)
             = { Σ_{i=1}^M θ_i x_i | θ_i ≥ 0, Σ_{i=1}^M θ_i = 1, x_i ∈ Sp, M ∈ Z+ }.

Here Z+ denotes the set of all positive integers.

Definition 3.2 Let us define Γ1 to be the space of all positive semidefinite matrices X ∈ S^D_+ with trace equaling one:

    Γ1 = { X | X ≽ 0, Tr(X) = 1 };2

and Ω1 to be the space of all positive semidefinite matrices with both trace and rank equaling one:

    Ω1 = { Z | Z ≽ 0, Tr(Z) = 1, rank(Z) = 1 }.

We also define Γ2 as the convex hull of Ω1, i.e.,

    Γ2 = conv(Ω1).

Lemma 3.3 Let Ω2 be the convex polytope defined as Ω2 = { λ ∈ R^D | λ_k ≥ 0, ∀k = 1, · · · , D, Σ_{k=1}^D λ_k = 1 }; then the points with only one element equaling one and all the others being zeros are the extreme points (vertices) of Ω2. No other point can be an extreme point.

Proof: Without loss of generality, consider the point λ′ = (1, 0, · · · , 0). If λ′ is not an extreme point of Ω2, then it can be expressed as a convex combination of other points in Ω2: λ′ = Σ_{i=1}^M θ_i λ^i with θ_i > 0, Σ_{i=1}^M θ_i = 1 and λ^i ≠ λ′. For each k = 2, · · · , D we then have Σ_{i=1}^M θ_i λ^i_k = 0, which, since θ_i > 0 and λ^i_k ≥ 0, forces λ^i_k = 0 for all i and all k = 2, · · · , D. That means λ^i_1 = 1 for all i. This is inconsistent with λ^i ≠ λ′. Therefore such a convex combination does not exist and λ′ must be an extreme point. It is trivial to see that any λ with more than one active element is a convex combination of the above-defined extreme points, so it cannot be an extreme point. □

Theorem 3.4 Γ1 equals Γ2; i.e., Γ1 is also the convex hull of Ω1. In other words, the matrices Z ∈ Ω1 form the set of extreme points of Γ1.

Proof: Any convex combination Σ_i θ_i Z_i with Z_i ∈ Ω1 resides in Γ1, by the following two facts: (1) a convex combination of PSD matrices is still a PSD matrix; (2) Tr(Σ_i θ_i Z_i) = Σ_i θ_i Tr(Z_i) = 1. Hence Γ2 ⊆ Γ1.

Denote by λ_1 ≥ · · · ≥ λ_D ≥ 0 the eigenvalues of a Z ∈ Γ1. We know that λ_1 ≤ 1 because Σ_{i=1}^D λ_i = Tr(Z) = 1. Therefore all eigenvalues of Z satisfy λ_i ∈ [0, 1], ∀i = 1, · · · , D, and Σ_i λ_i = 1. By looking at the eigenvalues of Z and using Lemma 3.3, it is immediate to see that a matrix Z such that Z ≽ 0, Tr(Z) = 1 and rank(Z) > 1 cannot be an extreme point of Γ1. The only candidates for extreme points are the rank-one matrices (λ_1 = 1 and λ_{2,···,D} = 0). Moreover, it is not possible that some rank-one matrices are extreme points and others are not, because the other two constraints Z ≽ 0 and Tr(Z) = 1 do not distinguish between different rank-one matrices. Hence all Z ∈ Ω1 form the set of extreme points of Γ1. Furthermore, Γ1 is a convex and compact set, which must have extreme points, and the Krein-Milman Theorem [9] tells us that a convex and compact set equals the convex hull of its extreme points. □

This theorem is a special case of the results from [10] in the context of eigenvalue optimization. A different proof for the above theorem's general version can also be found in [11]. In the context of SDP optimization, what is of interest about Theorem 3.4 is the following: it tells us that a bounded PSD matrix constraint X ∈ Γ1 can be equivalently replaced with a set of constraints which belong to Γ2. 
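As a quick numerical illustration of Theorem 3.4 (our addition, not part of the original exposition), the eigen-decomposition of a trace-one PSD matrix explicitly produces such a convex combination: the eigenvalues serve as the weights θ_i and the rank-one projectors u_i u_i⊤ as the extreme points. A minimal sketch in Python/NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a random trace-one PSD matrix X (a "density matrix").
B = rng.standard_normal((5, 5))
X = B @ B.T                 # PSD by construction
X /= np.trace(X)            # normalize so Tr(X) = 1

# Eigen-decomposition: X = sum_i lam_i * u_i u_i^T.
lam, U = np.linalg.eigh(X)

# The eigenvalues are valid convex-combination weights ...
assert np.all(lam >= -1e-12) and np.isclose(lam.sum(), 1.0)

# ... and each u_i u_i^T is a trace-one rank-one PSD matrix.
recon = sum(l * np.outer(U[:, i], U[:, i]) for i, l in enumerate(lam))
assert np.allclose(recon, X)
```

This is exactly the constructive step in the proof: the weights are nonnegative and sum to Tr(X) = 1.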
At first glance, this is a highly counterintuitive proposition because Γ2 involves many more complicated constraints. Both θ_i and Z_i (∀i = 1, · · · , M) are unknown variables. Even worse, M could be extremely (or even indefinitely) large.

1Strictly speaking, the union of convex hulls may not be a convex hull in general. It is a linear convex span.
2Such a matrix X is called a density matrix, which is one of the main concepts in quantum physics. A density matrix of rank one is called a pure state, and a density matrix of rank higher than one is called a mixed state.

3.2 Boosting

Boosting is an example of ensemble learning, where multiple learners are trained to solve the same problem. Typically a boosting algorithm [12] creates a single strong learner by incrementally adding base (weak) learners to the final strong learner. The base learner has an important impact on the strong learner. In general, a boosting algorithm builds on a user-specified base learning procedure and runs it repeatedly on modified data that are outputs from the previous iterations.

The inputs to a boosting algorithm are a set of training examples x and their corresponding class labels y. The final output strong classifier takes the form

    F_θ(x) = Σ_{i=1}^M θ_i f_i(x).                                    (1)

Here f_i(·) is a base learner. From Theorem 3.4, we know that a matrix X ∈ Γ1 can be decomposed as

    X = Σ_{i=1}^M θ_i Z_i,  Z_i ∈ Ω1.                                 (2)

By observing the similarity between Equations (1) and (2), we may view Z_i as a weak classifier and the matrix X as the strong classifier we want to learn. This is exactly the problem that boosting methods have been designed to solve. 
This observation inspires us to solve a special type of SDPs using boosting techniques.

A sparse greedy approximation algorithm proposed by Zhang [2] is an efficient way of solving a class of convex problems, and it provides fast convergence rates. It is also shown in [2] that boosting algorithms can be interpreted within this general framework. The main idea of sequential greedy approximation is as follows. Given an initialization u_0 ∈ V, where V can be a subset of a linear vector space, a matrix space or a functional space, the algorithm finds at each step i = 1, · · · an element v_i ∈ V and a weight 0 ≤ λ ≤ 1 such that the cost function F((1 − λ)u_{i−1} + λv_i) is approximately minimized; the solution is then updated as u_i = (1 − λ)u_{i−1} + λv_i and the iteration goes on.

4 Large-margin Semidefinite Metric Learning

We consider the Mahalanobis metric learning problem as an example, although the proposed technique can be applied to many other problems in machine learning such as nonparametric kernel matrix learning [13].

We are given a set of training examples a_i ∈ R^D, i = 1, 2, · · · . The task is to learn a distance metric such that, with the learned metric, classification or clustering achieves better performance on testing data. The available information is a set of relative distance comparisons. Mathematically, we are given a set S which contains the training triplets: S = {(a_i, a_j, a_k) | dist_ij < dist_ik}, where dist_ij measures the distance between a_i and a_j under a certain metric. In this work we focus on the case where dist is the Mahalanobis distance. Equivalently, we are learning a linear transformation P ∈ R^{D×d} such that dist is the Euclidean distance in the projected space: dist_ij = ‖P⊤a_i − P⊤a_j‖² = (a_i − a_j)⊤PP⊤(a_i − a_j). It is not difficult to see that the inequalities in the set S are non-convex in P because a difference of quadratic terms in P is involved. In order to convexify the inequalities in S, a new variable X = PP⊤ is used instead. This is a typical technique for modeling an SDP problem [14]. We wish to maximize the margin, defined as the difference between dist_ij and dist_ik; that is, ρ = dist_ik − dist_ij = (a_i − a_k)⊤X(a_i − a_k) − (a_i − a_j)⊤X(a_i − a_j). One may also use a soft margin to tolerate noisy data. Putting these thoughts together, the final convex program we want to optimize is:

    max_{ρ,X,ξ}  ρ − C Σ_{r=1}^{|S|} ξ_r
    s.t.  X ≽ 0,  Tr(X) = 1,  ξ ≥ 0,                                  (3)
          (a_i − a_k)⊤X(a_i − a_k) − (a_i − a_j)⊤X(a_i − a_j) ≥ ρ − ξ_r,
          ∀(a_i, a_j, a_k) ∈ S.

Here r indexes the training set S, and |S| denotes the size of S. C is a trade-off parameter that balances the training error and the margin. As in the support vector machine, the slack variable ξ ≥ 0 corresponds to the soft-margin hinge loss. Note that the constraint Tr(X) = 1 removes the scale ambiguity because the distance inequalities are scale invariant.

To simplify our exposition, we write

    A_r = (a_i − a_k)(a_i − a_k)⊤ − (a_i − a_j)(a_i − a_j)⊤.          (4)

The last constraint in (3) is then written

    ⟨A_r, X⟩ ≥ ρ − ξ_r,  ∀A_r built from S;  r = 1, · · · , |S|.      (5)

Problem (3) is a typical SDP since it has a linear cost function and linear constraints plus a PSD conic constraint. Therefore it can be solved using off-the-shelf SDP solvers like CSDP [15]. As mentioned, general interior-point SDP solvers do not scale well to large-sized problems. 
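The constraint matrices A_r in (4) are cheap to form explicitly, and ⟨A_r, X⟩ then recovers the margin dist_ik − dist_ij of (5) for the corresponding triplet. A small numerical check on synthetic data (our illustration; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4
ai, aj, ak = rng.standard_normal((3, D))   # one synthetic triplet (ai, aj, ak)

# Equation (4): A_r = (ai - ak)(ai - ak)^T - (ai - aj)(ai - aj)^T.
A_r = np.outer(ai - ak, ai - ak) - np.outer(ai - aj, ai - aj)

# Any trace-one PSD X, e.g. built from a random transformation P.
P = rng.standard_normal((D, D))
X = P @ P.T
X /= np.trace(X)

# <A_r, X> = Tr(A_r X^T) equals dist_ik - dist_ij under the metric X.
inner = np.trace(A_r @ X.T)
dist_ij = (ai - aj) @ X @ (ai - aj)
dist_ik = (ai - ak) @ X @ (ai - ak)
assert np.isclose(inner, dist_ik - dist_ij)
```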
Current solvers can only handle problems with up to a few thousand variables, which makes many applications intractable. For example, in face recognition, if the inputs are 30 × 30 images, then D = 900 and there would be 0.41 million variables. Next we show how we reformulate the above SDP into an LP.

5 Boosting via Matrix-Generation Linear Programming

Using Theorem 3.4, we can replace the PSD conic constraint in (3) with a linear convex combination of rank-one, trace-one PSD matrices: X = Σ_{i=1}^M θ_i Z_i. Substituting X in Problem (3), we obtain

    max_{ρ,θ,ξ,Z}  ρ − C Σ_{r=1}^{|S|} ξ_r
    s.t.  ξ ≥ 0,
          ⟨A_r, Σ_{i=1}^M θ_i Z_i⟩ = Σ_{i=1}^M ⟨A_r, Z_i⟩ θ_i ≥ ρ − ξ_r,
          ∀A_r built from S;  r = 1, · · · , |S|,                     (P1)
          Σ_{i=1}^M θ_i = 1,  θ ≥ 0,
          Z_i ∈ Ω1,  i = 1, · · · , M.

The above problem is still very hard to solve since it has non-convex rank constraints and an indeterminate number of variables (M is indeterminate because there are infinitely many rank-one matrices). However, if we somehow knew the matrices Z_i (i = 1, · · · ) a priori, we could drop all the constraints imposed on Z_i (i = 1, · · · ) and the problem would become a linear program; or, more precisely, a semi-infinite linear program (SILP) because it has an infinitely large set of variables θ.

Column generation is a state-of-the-art method for optimally solving difficult large-scale optimization problems. It is a method to avoid considering all variables of a problem explicitly. 
If an LP has extremely many variables (columns) but far fewer constraints, CG can be very beneficial. The crucial insight behind CG is that, for an LP, the number of non-zero variables in a basic optimal solution is at most the number of constraints; hence, although the number of possible variables may be large, we only need a small subset of them in the optimal solution. CG works by considering only a small subset of the entire variable set. Once the restricted problem is solved, we ask the question: "Are there any other variables that can be included to improve the solution?". So we must be able to solve the subproblem: given a set of dual values, either identify a variable that has a favorable reduced cost, or show that no such variable exists. In essence, CG finds the variables with negative reduced costs without explicitly enumerating all variables. For a general LP this may not be possible, but for some types of problems it is.

We now consider Problem (P1) as if all Z_i (i = 1, · · · ) were known. The dual of (P1) is easily derived:

    min_{π,w}  π
    s.t.  Σ_{r=1}^{|S|} ⟨A_r, Z_i⟩ w_r ≤ π,  i = 1, · · · , M,
          Σ_{r=1}^{|S|} w_r = 1,                                      (D1)
          0 ≤ w_r ≤ C,  r = 1, · · · , |S|.

For convex programs with strong duality, the duality gap is zero, which means the optimal values of the primal and dual problems coincide. For LPs and SDPs, strong duality holds under very mild conditions (almost always satisfied by the LPs and SDPs considered here).

We now consider only a small subset of the variables in the primal; i.e., only a subset of the Z (denoted by Z̃)3 is used. The LP solved using Z̃ is usually termed the restricted master problem (RMP). Because the primal variables correspond to the dual constraints, solving the RMP is equivalent to solving a relaxed version of the dual problem. 
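Concretely, the restricted dual (D1) with M generated bases is a small LP over (π, w). A sketch using scipy.optimize.linprog as a stand-in for the commercial LP solvers the paper mentions (the helper name and the layout of the matrix G, with G[i, r] = ⟨A_r, Z_i⟩, are our own conventions):

```python
import numpy as np
from scipy.optimize import linprog

def solve_restricted_dual(G, C):
    """Solve the restricted dual (D1).

    G is an (M, R) matrix with G[i, r] = <A_r, Z_i> for the M bases
    generated so far; C is the trade-off parameter.  Variables are
    x = [pi, w_1, ..., w_R]; we minimize pi subject to
    G w <= pi, sum(w) = 1, 0 <= w <= C.
    """
    M, R = G.shape
    c = np.zeros(R + 1)
    c[0] = 1.0                                # objective: min pi
    A_ub = np.hstack([-np.ones((M, 1)), G])   # G w - pi <= 0
    b_ub = np.zeros(M)
    A_eq = np.hstack([[0.0], np.ones(R)]).reshape(1, -1)
    b_eq = np.array([1.0])                    # sum(w) = 1
    bounds = [(None, None)] + [(0.0, C)] * R  # pi free, 0 <= w <= C
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[0], res.x[1:]                # (pi, w)
```

For instance, with a single base and inner products G = [[1.0, 2.0]], minimizing π places all dual weight on the first triplet and returns π = 1.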
With a finite Z̃, the first set of constraints in (D1) is finite, and we can solve the LP satisfying all the existing constraints.

If we can prove that, among all the constraints not yet added to the dual problem, no single constraint is violated, then we can conclude that solving the restricted problem is equivalent to solving the original problem. Otherwise, there exists at least one violated constraint. The violated constraints correspond to primal variables that are not in the RMP. Adding these variables to the RMP leads to a new RMP that needs to be re-optimized. In our case, by finding a violated constraint, we generate a rank-one matrix Z′. Hence, as in LPBoost [1], we have a base learning algorithm as an oracle that either finds a new Z′ such that

    Σ_{r=1}^{|S|} ⟨A_r, Z′⟩ w̃_r > π̃,

where π̃ is the solution of the current restricted problem, or guarantees that no such Z′ exists. To make convergence fast, we find the Z′ with the largest deviation; that is,

    Z′ = argmax_Z { Σ_{r=1}^{|S|} ⟨A_r, Z⟩ w̃_r,  s.t. Z ∈ Ω1 }.      (B1)

Again, here the w̃_r (r = 1, · · · , |S|) are obtained by solving the current restricted dual problem (D1). Let us denote by Opt(B1) the optimal value of the optimization problem in (B1). We now have a criterion that guarantees that the optimal convex combination over all Z's satisfying the constraints in Γ2 has been found: if Opt(B1) ≤ π̃, then we are done, as we have solved the original problem. The presented algorithm is a variant of the CG technique. 
At each iteration, a new matrix is generated, hence the name matrix generation.

5.1 Base Learning Algorithm

In this section, we show that the optimization problem (B1) can be solved exactly and efficiently using eigen-decomposition.

From Z ≽ 0 and rank(Z) = 1, we know that Z has the form Z = uu⊤, u ∈ R^D; and Tr(Z) = 1 means ‖u‖₂ = 1. We have

    Σ_{r=1}^{|S|} ⟨A_r, Z⟩ w̃_r = ⟨Σ_{r=1}^{|S|} w̃_r A_r, Z⟩ = u⊤(Σ_{r=1}^{|S|} w̃_r A_r)u.

By denoting

    H̃ = Σ_{r=1}^{|S|} w̃_r A_r,                                       (6)

the optimization in (B1) equals:

    max_u  u⊤H̃u,  subject to ‖u‖₂ = 1.                               (7)

It is clear that the largest eigenvalue of H̃, λ_max(H̃), and its corresponding eigenvector u₁ give the solution to the above problem. Note that H̃ is symmetric. Therefore we have the solution of the original problem (B1): Opt(B1) = λ_max(H̃) and Z′ = u₁u₁⊤.

There are approximate eigenvalue solvers which guarantee that, for a symmetric matrix U and any ε > 0, a unit vector v is found such that v⊤Uv ≥ λ_max − ε. Approximately finding the largest eigenvalue and eigenvector can be done very efficiently using the Lanczos or the power method. We use the MATLAB function eigs to calculate the largest eigenvector, which calls mex files of ARPACK. ARPACK is a collection of Fortran subroutines designed to solve large-scale eigenvalue problems. When the input matrix is symmetric, this software uses a variant of the Lanczos process called the implicitly restarted Lanczos method [16].

3We also use θ̃, π̃, w̃, etc. to denote the solution of the current RMP and its dual.

Algorithm 1: PSDBoost for semidefinite metric learning.
Input: Training set triplets (ai, aj, ak) ∈ S; calculate A_r, r = 1, · · · , from S using Equation (4).
Initialization:

1. M = 1 (no bases selected);
2. 
θ = 0 (all primal coefficients are zeros);
3. π = 0;
4. w_r = 1/|S|, r = 1, · · · , |S| (uniform dual weights).

while true do

1. Find a new base Z′ by solving Problem (B1), i.e., the eigen-decomposition of H̃ in (6);
2. if Opt(B1) ≤ π then break (problem solved);
3. Add Z′ to the restricted master problem, which corresponds to a new constraint in Problem (D1);
4. Solve the dual (D1) to obtain the updated π and w_r (r = 1, · · · , |S|);
5. M = M + 1 (base count).

end
Output:

1. Calculate the primal variable θ from the optimality conditions and the last solved dual LP;
2. The learned PSD matrix X ∈ R^{D×D}, X = Σ_{i=1}^M θ_i Z_i.

Putting all of the above analysis together, we summarize our PSDBoost algorithm for metric learning in Algorithm 1. Note that, in practice, we can relax the convergence criterion by setting a small positive threshold ε′ > 0 in order to obtain a good approximation quickly; namely, the convergence criterion becomes Opt(B1) ≤ π + ε′.

The algorithm has some appealing properties. At each iteration the solution is provably better than the preceding one, and its rank is at most one larger. Hence after M iterations the algorithm attains a solution with rank at most M. The algorithm preserves CG's property that each iteration improves the quality of the solution. The bounded rank follows from the fact that rank(A + B) ≤ rank(A) + rank(B) for all matrices A and B.

An advantage of the proposed PSDBoost algorithm over standard boosting schemes is the totally-corrective weight update in each iteration, which leads to faster convergence. The coordinate-descent optimization employed by standard boosting algorithms is known to have a slow convergence rate in general. However, the price of this totally-corrective update is obvious: PSDBoost spans the space of the parameter X incrementally. 
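The base learning step of Algorithm 1 (line 1 of the loop) reduces to an extreme eigenpair computation on H̃ from (6). A minimal dense sketch, using numpy.linalg.eigh in place of the ARPACK-based eigs routine the paper uses for large problems (the function name is ours):

```python
import numpy as np

def base_learner(A_list, w):
    """Solve (B1): maximize sum_r w_r <A_r, Z> over trace-one rank-one Z.

    Returns Opt(B1) = lambda_max(H~) and Z' = u1 u1^T, where
    H~ = sum_r w_r A_r as in Equation (6).
    """
    H = sum(wr * Ar for wr, Ar in zip(w, A_list))  # Eq. (6)
    evals, evecs = np.linalg.eigh(H)               # H is symmetric
    u1 = evecs[:, -1]                              # eigenvector of lambda_max
    return evals[-1], np.outer(u1, u1)             # (Opt(B1), Z')
```

For a symmetric input, eigh returns the eigenvalues in ascending order, so the last eigenpair solves (7).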
The computational cost for solving the subproblem grows with the number of linear constraints, which increases by one at each iteration. It also needs more and more memory to store the generated base learners Z_i, represented as a series of unit vectors. To alleviate this problem, one can use a selection and compression mechanism such as the aggregation step of bundle methods [17]. When the size of the bundle becomes too large, bundle methods select columns to be discarded, and the selected information is aggregated into a single column. It can be shown that as long as the aggregated column is introduced into the bundle, the bundle algorithm remains convergent, although different selections of discarded columns may lead to different convergence speeds. See [17] for details.

6 Experiments

In the first experiment, we artificially generated 600 points in 24 dimensions; the learned metric is therefore of size 24 × 24. The triplets are obtained in this way: for a point ai, we find its nearest neighbor in the same class, aj, and its nearest neighbor in a different class, ak. We subsample to obtain 550 triplets for training. To show the convergence, we have plotted the optimal values of the dual problem (D1) at each iteration in Figure 1. We see that PSDBoost quickly converges to the near-optimal solution. We have observed the so-called tailing-off effect of CG on large datasets: while a near-optimal solution is approached quite quickly, little progress per iteration is made close to the optimum. Stabilization techniques have been introduced to partially alleviate this problem [3]. 
However, approximate solutions are sufficient for most machine learning tasks. Moreover, for many problems such as metric and kernel learning, we are usually interested not in the numerical accuracy of the solution but in the test error.

The second experiment uses the Pendigits data from the UCI repository, which contains handwritten samples of the digits 1, 5, 7 and 9. The data for each digit are 16-dimensional. 80 samples per digit are used for training and 500 per digit for testing. The results show that PSDBoost converges quickly and the learned metric is very similar to the result obtained by a standard SDP solver. The classification errors on the testing data with a 1-nearest-neighbor classifier are identical using the metrics learned by PSDBoost and by a standard SDP solver: both are 1.3%.

7 Conclusion

We have presented a new boosting algorithm, PSDBoost, for learning a positive semidefinite matrix. In particular, as an example, we use PSDBoost to learn a distance metric for classification. PSDBoost can also be used to learn a kernel matrix, which is of interest in machine learning. We are currently exploring new applications of PSDBoost. We also want to know what kinds of SDP optimization problems can be approximately solved by PSDBoost.

References
[1] A. Demiriz, K. P. Bennett, and J. Shawe-Taylor. Linear programming boosting via column generation. Mach. Learn., 46(1-3):225–254, 2002.
[2] T. Zhang. Sequential greedy approximation for certain convex optimization problems. IEEE Trans. Inf. Theory, 49(3):682–691, 2003.
[3] M. E. Lübbecke and J. Desrosiers. Selected topics in column generation. Operations Res., 53(6):1007–1023, 2005.
[4] ILOG, Inc. CPLEX 11.1, 2008. http://www.ilog.com/products/cplex/.
[5] G. B. Dantzig and P. Wolfe. Decomposition principle for linear programs. Operations Res., 8(1):101–111, 1960.
[6] E. Xing, A. Ng, M. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Proc. Adv. Neural Inf. Process. Syst. MIT Press, 2002.
[7] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In Proc. Adv. Neural Inf. Process. Syst., pages 1473–1480, 2005.
[8] R. Rosales and G. Fung. Learning sparse metrics via linear programming. In Proc. ACM Int. Conf. Knowledge Discovery & Data Mining, pages 367–373, Philadelphia, PA, USA, 2006.
[9] M. Krein and D. Milman. On extreme points of regular convex sets. Studia Mathematica, 9:133–138, 1940.
[10] M. L. Overton and R. S. Womersley. On the sum of the largest eigenvalues of a symmetric matrix. SIAM J. Matrix Anal. Appl., 13(1):41–45, 1992.
[11] P. A. Fillmore and J. P. Williams. Some convexity theorems for matrices. Glasgow Math. Journal, 12:110–117, 1971.
[12] R. E. Schapire. Theoretical views of boosting and applications. In Proc. Int. Conf. Algorithmic Learn. Theory, pages 13–25, London, UK, 1999. Springer-Verlag.
[13] B. Kulis, M. Sustik, and I. Dhillon. Learning low-rank kernel matrices. In Proc. Int. Conf. Mach. Learn., pages 505–512, Pittsburgh, Pennsylvania, 2006.
[14] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[15] B. Borchers. CSDP, a C library for semidefinite programming. Optim. Methods and Softw., 11(1):613–623, 1999.
[16] D. Calvetti, L. Reichel, and D. C. Sorensen. An implicitly restarted Lanczos method for large symmetric eigenvalue problems. Elec. Trans. Numer. Anal., 2:1–21, Mar 1994. http://etna.mcs.kent.edu.
[17] J. F. Bonnans, J. C. Gilbert, C. Lemaréchal, and C. A. Sagastizábal. Numerical Optimization: Theoretical and Practical Aspects (1st edition). 
Springer-Verlag, Berlin, 2003.

Figure 1: The objective value of the dual problem (D1) on the first (left) and second (right) experiment. The dashed line shows the ground truth obtained by directly solving the original primal SDP (3) using interior-point methods.
", "award": [], "sourceid": 586, "authors": [{"given_name": "Chunhua", "family_name": "Shen", "institution": null}, {"given_name": "Alan", "family_name": "Welsh", "institution": null}, {"given_name": "Lei", "family_name": "Wang", "institution": null}]}