{"title": "Random Conic Pursuit for Semidefinite Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 1135, "page_last": 1143, "abstract": "We present a novel algorithm, Random Conic Pursuit, that solves semidefinite programs (SDPs) via repeated optimization over randomly selected two-dimensional subcones of the PSD cone. This scheme is simple, easily implemented, applicable to very general SDPs, scalable, and theoretically interesting. Its advantages are realized at the expense of an ability to readily compute highly exact solutions, though useful approximate solutions are easily obtained. This property renders Random Conic Pursuit of particular interest for machine learning applications, in which the relevant SDPs are generally based upon random data and so exact minima are often not a priority. Indeed, we present empirical results to this effect for various SDPs encountered in machine learning; these experiments demonstrate the potential practical usefulness of Random Conic Pursuit. We also provide a preliminary analysis that yields insight into the theoretical properties and convergence of the algorithm.", "full_text": "Random Conic Pursuit for Semide\ufb01nite Programming\n\nAriel Kleiner\n\nComputer Science Division\nUniverisity of California\n\nBerkeley, CA 94720\n\nAli Rahimi\n\nIntel Research Berkeley\n\nBerkeley, CA 94720\n\nali.rahimi@intel.com\n\nMichael I. Jordan\n\nComputer Science Division\n\nUniversity of California\n\nBerkeley, CA 94720\n\nakleiner@cs.berkeley.edu\n\njordan@cs.berkeley.edu\n\nAbstract\n\nWe present a novel algorithm, Random Conic Pursuit, that solves semide\ufb01nite pro-\ngrams (SDPs) via repeated optimization over randomly selected two-dimensional\nsubcones of the PSD cone. This scheme is simple, easily implemented, applica-\nble to very general SDPs, scalable, and theoretically interesting. 
Its advantages\nare realized at the expense of an ability to readily compute highly exact solutions,\nthough useful approximate solutions are easily obtained. This property renders\nRandom Conic Pursuit of particular interest for machine learning applications, in\nwhich the relevant SDPs are generally based upon random data and so exact min-\nima are often not a priority. Indeed, we present empirical results to this effect for\nvarious SDPs encountered in machine learning; these experiments demonstrate\nthe potential practical usefulness of Random Conic Pursuit. We also provide a\npreliminary analysis that yields insight into the theoretical properties and conver-\ngence of the algorithm.\n\n1\n\nIntroduction\n\nMany dif\ufb01cult problems have been shown to admit elegant and tractably computable representations\nvia optimization over the set of positive semide\ufb01nite (PSD) matrices. As a result, semide\ufb01nite\nprograms (SDPs) have appeared as the basis for many procedures in machine learning, such as\nsparse PCA [8], distance metric learning [24], nonlinear dimensionality reduction [23], multiple\nkernel learning [14], multitask learning [19], and matrix completion [2].\nWhile SDPs can be solved in polynomial time, they remain computationally challenging. General-\npurpose solvers, often based on interior point methods, do exist and readily provide high-accuracy\nsolutions. However, their memory requirements do not scale well with problem size, and they typi-\ncally do not allow a \ufb01ne-grained tradeoff between optimization accuracy and speed, which is often a\ndesirable tradeoff in machine learning problems that are based on random data. 
Furthermore, SDPs in machine learning frequently arise as convex relaxations of problems that are originally computationally intractable, in which case even an exact solution to the SDP yields only an approximate solution to the original problem, and an approximate SDP solution can once again be quite useful.\n\nAlthough some SDPs do admit tailored solvers which are fast and scalable (e.g., [17, 3, 7]), deriving and implementing these methods is often challenging, and an easily usable solver that alleviates these issues has been elusive. This is partly the case because generic first-order methods do not apply readily to general SDPs.\n\nIn this work, we present Random Conic Pursuit, a randomized solver for general SDPs that is simple, easily implemented, scalable, and of inherent interest due to its novel construction. We consider general SDPs over R^{d×d} of the form\n\nmin_{X ⪰ 0} f(X)  s.t.  gj(X) ≤ 0,  j = 1 . . . k,  (1)\n\nwhere f and the gj are convex real-valued functions, and ⪰ denotes the ordering induced by the PSD cone. Random Conic Pursuit minimizes the objective function iteratively, repeatedly randomly sampling a PSD matrix and optimizing over the random two-dimensional subcone given by this matrix and the current iterate. This construction maintains feasibility while avoiding the computational expense of deterministically finding feasible directions or of projecting into the feasible set. Furthermore, each iteration is computationally inexpensive, though in exchange we generally require a relatively large number of iterations. 
In this regard, Random Conic Pursuit is similar in spirit to algorithms such as online gradient descent and sequential minimal optimization [20], which have illustrated that in the machine learning setting, algorithms that take a large number of simple, inexpensive steps can be surprisingly successful.\n\nThe resulting algorithm, despite its simplicity and randomized nature, converges fairly quickly to useful approximate solutions. Unlike interior point methods, Random Conic Pursuit does not excel at producing highly exact solutions. However, it is more scalable and allows computation to be traded off against solution accuracy. In what follows, we present our algorithm in full detail and demonstrate its empirical behavior and efficacy on various SDPs that arise in machine learning; we also provide early analytical results that yield insight into its behavior and convergence properties.\n\n2 Random Conic Pursuit\n\nRandom Conic Pursuit (Algorithm 1) solves SDPs of the general form (1) via a sequence of simple two-variable optimizations (2). At each iteration, the algorithm considers the two-dimensional cone spanned by the current iterate, Xt, and a random rank one PSD matrix, Yt. It selects as its next iterate, Xt+1, the point in this cone that minimizes the objective f subject to the constraints gj(Xt+1) ≤ 0 in (1). The distribution of the random matrices is periodically updated based on the current iterate (e.g., to match the current iterate in expectation); these updates yield random matrices that are better matched to the optimum of the SDP at hand.\n\nThe two-variable optimization (2) can be solved quickly in general via a two-dimensional bisection search. As a further speedup, for many of the problems that we considered, the two-variable optimization can be altogether short-circuited with a simple check that determines whether the solution Xt+1 = Xt, with β̂ = 1 and α̂ = 0, is optimal. 
Additionally, SDPs with a trace constraint tr X = 1 force α + β = 1 and therefore require only a one-dimensional optimization.\n\nTwo simple guarantees for Random Conic Pursuit are immediate. First, its iterates are feasible for (1) because each iterate is a positive sum of two PSD matrices, and because the constraints gj of (2) are also those of (1). Second, the objective values decrease monotonically because β = 1, α = 0 is a feasible solution to (2). We must also note two limitations of Random Conic Pursuit: it does not admit general equality constraints, and it requires a feasible starting point. Nonetheless, for many of the SDPs that appear in machine learning, feasible points are easy to identify, and equality constraints are either absent or fortuitously pose no difficulty.\n\nWe can gain further intuition by observing that Random Conic Pursuit's iterates, Xt, are positive weighted sums of random rank one matrices and so lie in the random polyhedral cones\n\nF^x_t := { Σ_{i=1}^t γi xi xi' : γi ≥ 0 } ⊂ {X : X ⪰ 0}.  (3)\n\nThus, Random Conic Pursuit optimizes the SDP (1) by greedily optimizing f w.r.t. the gj constraints within an expanding sequence of random cones {F^x_t}. These cones yield successively better inner approximations of the PSD cone (a basis for which is the set of all rank one matrices) while allowing us to easily ensure that the iterates remain PSD.\n\nIn light of this discussion, one might consider approximating the original SDP by sampling a random cone F^x_n in one shot and replacing the constraint X ⪰ 0 in (1) with the simpler linear constraints X ∈ F^x_n. 
For sufficiently large n, F^x_n would approximate the PSD cone well (see Theorem 2 below), yielding an inner approximation that upper bounds the original SDP; the resulting problem would be easier than the original (e.g., it would become a linear program if the gj were linear). However, we have found empirically that a very large n is required to obtain good approximations, thus negating any potential performance improvements (e.g., over interior point methods). Random Conic Pursuit successfully resolves this issue by iteratively expanding the random cone F^x_t. As a result, we are able to much more efficiently access large values of n, though we compute a greedy solution within F^x_n rather than a global optimum over the entire cone. This tradeoff is ultimately quite advantageous.\n\nAlgorithm 1: Random Conic Pursuit\n[brackets contain a particular, generally effective, sampling scheme]\nInput: A problem of the form (1)\n  X0: a feasible initial iterate\n  n ∈ N: number of iterations\n  [κ ∈ (0, 1): numerical stability parameter]\nOutput: An approximate solution Xn to (1)\np ← a distribution over R^d  [p ← N(0, Σ) with Σ = (1 − κ)X0 + κId]\nfor t ← 1 to n do\n  Sample xt from p and set Yt ← xt xt'\n  Set α̂, β̂ to the optimizer of\n    min_{α,β∈R} f(αYt + βXt−1)  s.t.  gj(αYt + βXt−1) ≤ 0,  j = 1 . . . k;  α, β ≥ 0  (2)\n  Set Xt ← α̂Yt + β̂Xt−1\n  if α̂ > 0 then update p based on Xt  [p ← N(0, Σ) with Σ = (1 − κ)Xt + κId]\nend\nreturn Xn\n\n3 Applications and Experiments\n\nWe assess the practical convergence and scaling properties of Random Conic Pursuit by applying it to three different machine learning tasks that rely on SDPs: distance metric learning, sparse PCA, and maximum variance unfolding. 
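Algorithm 1 can be sketched compactly in code. The following Python sketch is our own minimal illustration (not the authors' MATLAB implementation): it handles the unconstrained case k = 0 with the bracketed sampling scheme, and for clarity it replaces the two-dimensional bisection search for subproblem (2) with a coarse grid search; all function and parameter names are hypothetical.

```python
import numpy as np

def random_conic_pursuit(f, X0, n_iters=100, kappa=1e-4, seed=0):
    """Sketch of Algorithm 1 for min_{X >= 0} f(X) with no g_j constraints."""
    rng = np.random.default_rng(seed)
    d = X0.shape[0]
    X = X0.copy()
    Sigma = (1 - kappa) * X + kappa * np.eye(d)  # bracketed scheme: p = N(0, Sigma)
    for t in range(n_iters):
        x = rng.multivariate_normal(np.zeros(d), Sigma)
        Y = np.outer(x, x)  # random rank-one PSD matrix Y_t
        # Two-variable subproblem (2), here via a coarse grid search; the
        # candidate (alpha, beta) = (0, 1) keeps the objective monotone.
        best_v, best_a, best_b = f(X), 0.0, 1.0
        for a in np.linspace(0.0, 1.0, 21):
            for b in np.linspace(0.0, 1.5, 31):
                v = f(a * Y + b * X)
                if v < best_v:
                    best_v, best_a, best_b = v, a, b
        X = best_a * Y + best_b * X
        if best_a > 0:  # the if-step of Algorithm 1: adapt the sampler
            Sigma = (1 - kappa) * X + kappa * np.eye(d)
    return X
```

With, say, f(X) = ||X − M||²_F for a PSD target M, the iterates decrease f monotonically while remaining PSD, since every iterate is a nonnegative combination of PSD matrices.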
For each, we compare the performance of Random Conic Pursuit (implemented in MATLAB) to that of a standard and widely used interior point solver, SeDuMi [21] (via cvx [9]), and to the best available solver which has been customized for each problem.\n\nTo evaluate convergence, we first compute a ground-truth solution X* for each problem instance by running the interior point solver with extremely low tolerance. Then, for each algorithm, we plot the normalized objective value errors [f(Xt) − f(X*)]/|f(X*)| of its iterates Xt as a function of the amount of time required to generate each iterate. Additionally, for each problem, we plot the value of an application-specific metric for each iterate. These metrics provide a measure of the practical implications of obtaining SDP solutions which are suboptimal to varying degrees. We evaluate scaling with problem dimensionality by running the various solvers on problems of different dimensionalities and computing various metrics on the solver runs as described below for each experiment. Unless otherwise noted, we use the bracketed sampling scheme given in Algorithm 1 with κ = 10^−4 for all runs of Random Conic Pursuit.\n\n3.1 Metric Learning\n\nGiven a set of datapoints in R^d and a pairwise similarity relation over them, metric learning extracts a Mahalanobis distance dA(x, y) = √((x − y)'A(x − y)) under which similar points are nearby and dissimilar points are far apart [24]. Let S be the set of similar pairs of datapoints, and let S̄ be its complement. The metric learning SDP, for A ∈ R^{d×d} and C = Σ_{(i,j)∈S} (xi − xj)(xi − xj)', is\n\nmin_{A ⪰ 0} tr(CA)  s.t.  Σ_{(i,j)∈S̄} dA(xi, xj) ≥ 1.  (4)\n\nTo apply Random Conic Pursuit, X0 is set to a feasible scaled identity matrix. 
We solve the two-variable optimization (2) via a double bisection search: at each iteration, α is optimized out with a one-variable bisection search over α given fixed β, yielding a function of β only. This resulting function is itself then optimized using a bisection search over β.\n\nd | alg | f after 2 hrs* | time to Q > 0.99 (sec)\n100 | IP | 3.7e-9 | 636.3\n100 | RCP | 2.8e-7, 3.0e-7 | 142.7, 148.4\n100 | PG | 1.1e-5 | 42.3\n200 | RCP | 5.1e-8, 6.1e-8 | 529.1, 714.8\n200 | PG | 1.6e-5 | 207.7\n300 | RCP | 5.4e-8, 6.5e-8 | 729.1, 1774.7\n300 | PG | 2.0e-5 | 1095.8\n400 | RCP | 7.2e-8, 1.0e-8 | 2128.4, 2227.2\n400 | PG | 2.4e-5 | 1143.3\n\nFigure 1: Results for metric learning. (plots) Trajectories of objective value error (left) and Q (right) on UCI ionosphere data. (table) Scaling experiments on synthetic data (IP = interior point, RCP = Random Conic Pursuit, PG = projected gradient), with two trials per d for RCP and times in seconds. *For d = 100, third column shows f after 20 minutes.\n\nAs the application-specific metric for this problem, we measure the extent to which the metric learning goal has been achieved: similar datapoints should be near each other, and dissimilar datapoints should be farther away. We adopt the following metric of quality of a solution matrix X, where ζ = Σ_i |{j : (i, j) ∈ S}| · |{l : (i, l) ∈ S̄}| and 1[·] is the indicator function:\n\nQ(X) = (1/ζ) Σ_i Σ_{j:(i,j)∈S} Σ_{l:(i,l)∈S̄} 1[dij(X) < dil(X)].\n\nTo examine convergence behavior, we first apply the metric learning SDP to the UCI ionosphere dataset, which has d = 34 and 351 datapoints with two distinct labels (S contains pairs with identical labels). 
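The quality metric Q can be read directly off its definition. The sketch below is our own illustration (hypothetical names): per anchor point i, it counts the fraction of (similar, dissimilar) pairs ordered correctly by the learned Mahalanobis distance.

```python
import numpy as np

def pairwise_quality(X, points, S, S_bar):
    """Q(X): fraction of pairs (i,j) in S, (i,l) in S_bar with
    d_ij(X) < d_il(X), where d_ab(X)^2 = (x_a - x_b)' X (x_a - x_b)."""
    def d2(a, b):
        v = points[a] - points[b]
        return float(v @ X @ v)
    correct = total = 0
    for i in range(len(points)):
        sims = [j for (a, j) in S if a == i]
        diss = [l for (a, l) in S_bar if a == i]
        for j in sims:
            for l in diss:
                total += 1
                correct += d2(i, j) < d2(i, l)
    return correct / total if total else 1.0
```

Q(X) = 1 means every similar neighbor of every point is closer (under X) than every dissimilar one.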
We selected this dataset from among those used in [24] because it is among the datasets\nwhich have the largest dimensionality and experience the greatest impact from metric learning in\nthat work\u2019s clustering application. Because the interior point solver scales prohibitively badly in the\nnumber of datapoints, we subsampled the dataset to yield 4 \u00d7 34 = 136 datapoints.\nTo evaluate scaling, we use synthetic data in order to allow variation of d. To generate a d-\ndimensional data set, we \ufb01rst generate mixture centers by applying a random rotation to the elements\nof C1 = {(\u22121, 1), (\u22121,\u22121)} and C2 = {(1, 1), (1,\u22121)}. We then sample each datapoint xi \u2208 Rd\nfrom N (0, Id) and assign it uniformly at random to one of two clusters. Finally, we set the \ufb01rst two\ncomponents of xi to a random element of Ck if xi was assigned to cluster k \u2208 {1, 2}; these two\ncomponents are perturbed by adding a sample from N (0, 0.25I2).\nThe best known customized solver for the metric learning SDP is a projected gradient algorithm [24],\nfor which we used code available from the author\u2019s website.\nFigure 1 shows the results of our experiments. The two trajectory plots, for an ionosphere data\nproblem instance, show that Random Conic Pursuit converges to a very high-quality solution (with\nhigh Q and negligible objective value error) signi\ufb01cantly faster than interior point. Additionally,\nour performance is comparable to that of the projected gradient method which has been customized\nfor this task. The table in Figure 1 illustrates scaling for increasing d. Interior point scales badly\nin part because parsing the SDP becomes impracticably slow for d signi\ufb01cantly larger than 100.\nNonetheless, Random Conic Pursuit scales well beyond that point, continuing to return solutions\nwith high Q in reasonable time. 
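The synthetic data generator described above can be reconstructed roughly as follows; this is our own sketch of the text's recipe, and the function name and minor details (e.g., the form of the random rotation) are assumptions.

```python
import numpy as np

def synthetic_metric_data(d, m, seed=0):
    """Two clusters whose structure lives in the first two coordinates;
    the remaining d - 2 coordinates are pure N(0, 1) noise."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi)  # random rotation of the centers
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta), np.cos(theta)]])
    centers = {1: [R @ np.array(c) for c in [(-1.0, 1.0), (-1.0, -1.0)]],
               2: [R @ np.array(c) for c in [(1.0, 1.0), (1.0, -1.0)]]}
    X = rng.standard_normal((m, d))
    labels = rng.integers(1, 3, size=m)  # cluster assignment in {1, 2}
    for i in range(m):
        c = centers[int(labels[i])][rng.integers(0, 2)]
        X[i, :2] = c + rng.multivariate_normal(np.zeros(2), 0.25 * np.eye(2))
    return X, labels
```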
On this synthetic data, projected gradient appears to reach high Q somewhat more quickly, though Random Conic Pursuit consistently yields significantly better objective values, indicating better-quality solutions.\n\n3.2 Sparse PCA\n\nSparse PCA seeks to find a sparse unit length vector that maximizes x'Ax for a given data covariance matrix A. This problem can be relaxed to the following SDP [8], for X, A ∈ R^{d×d}:\n\nmin_{X ⪰ 0} ρ1'|X|1 − tr(AX)  s.t.  tr(X) = 1,  (5)\n\nwhere the scalar ρ > 0 controls the solution's sparsity. A subsequent rounding step returns the dominant eigenvector of the SDP's solution, yielding a sparse principal component.\n\nWe use the colon cancer dataset [1] that has been used frequently in past studies of sparse PCA and contains 2,000 microarray readings for 62 subjects. The goal is to identify a small number of\n\nd | alg | f after 4 hrs | sparsity after 4 hrs\n120 | IP | -10.25 | 0.55\n120 | RCP | -9.98, -10.02 | 0.47, 0.45\n120 | DSPCA | -10.24 | 0.55\n200 | IP | failed | failed\n200 | RCP | -10.30, -10.27 | 0.51, 0.50\n200 | DSPCA | -11.07 | 0.64\n300 | IP | failed | failed\n300 | RCP | -9.39, -9.29 | 0.51, 0.51\n300 | DSPCA | -11.52 | 0.69\n500 | IP | failed | failed\n500 | RCP | -6.95, -6.54 | 0.53, 0.50\n500 | DSPCA | -11.61 | 0.78\n\nFigure 2: Results for sparse PCA. All solvers quickly yield similar captured variance (not shown here). (plots) Trajectories of objective value error (left) and sparsity (right), for a problem with d = 100. 
(table) Scaling experiments (IP = interior point, RCP = Random Conic Pursuit), with two trials per d for RCP.\n\nmicroarray cells that capture the greatest variance in the dataset. We vary d by subsampling the readings and use ρ = 0.2 (large enough to yield sparse solutions) for all experiments.\n\nTo apply Random Conic Pursuit, we set X0 = A/tr(A). The trace constraint in (5) implies that tr(Xt−1) = 1 and so tr(αYt + βXt−1) = α tr(Yt) + β = 1 in (2). Thus, we can simplify the two-variable optimization (2) to a one-variable optimization, which we solve by bisection search.\n\nThe fastest available customized solver for the sparse PCA SDP is an adaptation of Nesterov's smooth optimization procedure [8] (denoted by DSPCA), for which we used a MATLAB implementation with heavy MEX optimizations that is downloadable from the author's web site.\n\nWe compute two application-specific metrics which capture the two goals of sparse PCA: high captured variance and high sparsity. Given the top eigenvector u of a solution matrix X, its captured variance is u'Au, and its sparsity is given by (1/d) Σ_j 1[|uj| < τ]; we take τ = 10^−3 based on qualitative inspection of the raw microarray data covariance matrix A.\n\nThe results of our experiments are shown in Figure 2. As seen in the two plots, on a problem instance with d = 100, Random Conic Pursuit quickly achieves an objective value within 4% of optimal and thereafter continues to converge, albeit more slowly; we also quickly achieve fairly high sparsity (compared to that of the exact SDP optimum). In contrast, interior point is able to achieve lower objective value and even higher sparsity within the timeframe shown, but, unlike Random Conic Pursuit, it does not provide the option of spending less time to achieve a solution which is still relatively sparse. All of the solvers quickly achieve very similar captured variances, which are not shown. 
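The rounding step and both sparse-PCA metrics are simple to state in code. The sketch below uses our own naming, with τ = 10^−3 as in the text: it extracts the dominant eigenvector of a solution matrix X and scores it.

```python
import numpy as np

def round_and_score(X, A, tau=1e-3):
    """Return the dominant eigenvector u of X, its captured variance u'Au,
    and its sparsity: the fraction of entries with |u_j| < tau."""
    _, V = np.linalg.eigh(X)      # eigh returns eigenvalues in ascending order
    u = V[:, -1]                  # eigenvector of the largest eigenvalue
    return u, float(u @ A @ u), float(np.mean(np.abs(u) < tau))
```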
DSPCA is extremely efficient, requiring much less time than its counterparts to find nearly exact solutions. However, that procedure is highly customized (via several pages of derivation and an optimized implementation), whereas Random Conic Pursuit and interior point are general-purpose.\n\nThe table in Figure 2 illustrates scaling by reporting achieved objective values and sparsities after the solvers have each run for 4 hours. Interior point fails due to memory requirements for d > 130, whereas Random Conic Pursuit continues to function and provide useful solutions, as seen from the achieved sparsity values, which are much larger than those of the raw data covariance matrix. Again, DSPCA continues to be extremely efficient.\n\n3.3 Maximum Variance Unfolding (MVU)\n\nMVU searches for a kernel matrix that embeds high-dimensional input data into a lower-dimensional manifold [23]. Given m data points and a neighborhood relation i ∼ j between them, it forms their centered and normalized Gram matrix G ∈ R^{m×m} and the squared Euclidean distances d²ij = Gii + Gjj − 2Gij. 
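The quantities G and d²ij can be computed directly. The minimal sketch below is our own: it uses plain double-centering of the inner-product Gram matrix and omits whatever additional normalization the original pipeline may apply.

```python
import numpy as np

def centered_gram_and_distances(points):
    """Centered Gram matrix G = H P P' H with H = I - 11'/m, and the
    squared distances d2_ij = G_ii + G_jj - 2 G_ij recovered from it."""
    m = points.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m   # centering projector
    G = H @ (points @ points.T) @ H
    g = np.diag(G)
    D2 = g[:, None] + g[None, :] - 2.0 * G
    return G, D2
```

Because H annihilates the all-ones vector, G automatically satisfies 1'G1 = 0, and the recovered d²ij equal the original squared Euclidean distances (centering leaves pairwise distances unchanged).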
The desired kernel matrix is the solution of the following SDP, where X ∈ R^{m×m} and the scalar ν > 0 controls the dimensionality of the resulting embedding:\n\nmax_{X ⪰ 0} tr(X) − ν Σ_{i∼j} (Xii + Xjj − 2Xij − d²ij)²  s.t.  1'X1 = 0.  (6)\n\nTo apply Random Conic Pursuit, we set X0 = G and use the general sampling formulation in Algorithm 1 by setting p = N(0, Π(∇f(Xt))) in the initialization (i.e., t = 0) and update steps, where Π truncates negative eigenvalues of its argument to zero. This scheme empirically yields improved performance for the MVU problem as compared to the bracketed sampling scheme in Algorithm 1.\n\nm | alg | f after convergence | seconds to f > 0.99 f̂\n40 | IP | 23.4 | 0.4\n40 | RCP | 22.83 (0.03) | 0.5 (0.03)\n40 | GD | 23.2 | 5.4\n200 | IP | 2972.6 | 12.4\n200 | RCP | 2921.3 (1.4) | 6.6 (0.8)\n200 | GD | 2943.3 | 965.4\n400 | IP | 12255.6 | 97.1\n400 | RCP | 12207.96 (36.58) | 26.3 (9.8)\n800 | IP | failed | failed\n800 | RCP | 71231.1 (2185.7) | 115.4 (29.2)\n\nFigure 3: Results for MVU. (plots) Trajectories of objective value for m = 200 (left) and m = 800 (right). (table) Scaling experiments showing convergence as a function of m (IP = interior point, RCP = Random Conic Pursuit, GD = gradient descent).\n\nTo handle the equality constraint, each Yt is first transformed to Y̆t = (I − 11'/m)Yt(I − 11'/m), which preserves PSDness and ensures feasibility. 
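The centering transform is one line of linear algebra. The sketch below (our own) shows that Y̆ = HYH with H = I − 11'/m remains PSD (it is a congruence transform) and satisfies 1'Y̆1 = 0, so conic combinations of such matrices stay feasible for the equality constraint in (6).

```python
import numpy as np

def center_direction(Y):
    """Map a PSD matrix Y to H Y H with H = I - 11'/m; the result is
    still PSD and satisfies 1' (H Y H) 1 = 0 because H 1 = 0."""
    m = Y.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return H @ Y @ H
```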
The two-variable optimization (2) proceeds as before on Y̆t and becomes a two-variable quadratic program, which can be solved analytically.\n\nMVU also admits a gradient descent algorithm, which serves as a straw-man large-scale solver for the MVU SDP. At each iteration, the step size is picked by a line search, and the spectrum of the iterate is truncated to maintain PSDness. We use G as the initial iterate.\n\nTo generate data, we randomly sample m points from the surface of a synthetic swiss roll [23]; we set ν = 1. To quantify the amount of time it takes a solver to converge, we run it until its objective curve appears qualitatively flat and declare the convergence point to be the earliest iterate whose objective is within 1% of the best objective value seen so far (which we denote by f̂).\n\nFigure 3 illustrates that Random Conic Pursuit's objective values converge quickly, and on problems where the interior point solver achieves the optimum, Random Conic Pursuit nearly achieves that optimum. The interior point solver runs out of memory when m > 400 and also fails on smaller problems if its tolerance parameter is not tuned. Random Conic Pursuit easily runs on larger problems for which interior point fails, and for smaller problems its running time is within a small factor of that of the interior point solver; Random Conic Pursuit typically converges within 1000 iterations. The gradient descent solver is orders of magnitude slower than the other solvers and failed to converge to a meaningful solution for m ≥ 400 even after 2000 iterations (which took 8 hours).\n\n4 Analysis\n\nAnalysis of Random Conic Pursuit is complicated by the procedure's use of randomness and its handling of the constraints gj ≤ 0 explicitly in the sub-problem (2), rather than via penalty functions or projections. Nonetheless, we are able to obtain useful insights by first analyzing a simpler setting having only a PSD constraint. 
We thus obtain a bound on the rate at which the objective values of Random Conic Pursuit's iterates converge to the SDP's optimal value when the problem has no constraints of the form gj ≤ 0:\n\nTheorem 1 (Convergence rate of Random Conic Pursuit when f is weakly convex and k = 0). Let f : R^{d×d} → R be a convex differentiable function with L-Lipschitz gradients such that the minimum of the following optimization problem is attained at some X*:\n\nmin_{X ⪰ 0} f(X).  (7)\n\nLet X1 . . . Xt be the iterates of Algorithm 1 when applied to this problem starting at iterate X0 (using the bracketed sampling scheme given in the algorithm specification), and suppose ||Xt − X*|| is bounded. Then\n\nE f(Xt) − f(X*) ≤ (1/t) · max(ΓL, f(X0) − f(X*)),  (8)\n\nfor some constant Γ that does not depend on t.\n\nProof. We prove that equation (8) holds in general for any X*, and thus for the optimizer of f in particular. The convexity of f implies the following linear lower bound on f(X) for any X and Y:\n\nf(X) ≥ f(Y) + ⟨∂f(Y), X − Y⟩.  (9)\n\nThe Lipschitz assumption on the gradient of f implies the following quadratic upper bound on f(X) for any X and Y [18]:\n\nf(X) ≤ f(Y) + ⟨∂f(Y), X − Y⟩ + (L/2)||X − Y||².  (10)\n\nDefine the random variable Ỹt := γt(Yt)Yt with γt a positive function that ensures E Ỹt = X*. It suffices to set γt = q(Y)/p̆(Y), where p̆ is the distribution of Yt and q is any distribution with mean X*. 
In particular, the choice Ỹt := γt(xt) xt xt' with γt(x) = N(x|0, X*)/N(x|0, Σt) satisfies this.\n\nAt iteration t, Algorithm 1 produces αt and βt so that Xt+1 := αtYt + βtXt minimizes f(Xt+1). We will bound the defect f(Xt+1) − f(X*) at each iteration by sub-optimally picking α̂t = 1/t, β̂t = 1 − 1/t, and X̂t+1 = β̂tXt + α̂tγt(Yt)Yt = β̂tXt + α̂tỸt. Conditioned on Xt, we have\n\nE f(Xt+1) − f(X*) ≤ E f(β̂tXt + α̂tỸt) − f(X*) = E f(Xt − (1/t)(Xt − Ỹt)) − f(X*)  (11)\n≤ f(Xt) − f(X*) + E⟨∂f(Xt), (1/t)(Ỹt − Xt)⟩ + (L/(2t²)) E||Xt − Ỹt||²  (12)\n= f(Xt) − f(X*) + (1/t)⟨∂f(Xt), X* − Xt⟩ + (L/(2t²)) E||Xt − Ỹt||²  (13)\n≤ f(Xt) − f(X*) + (1/t)(f(X*) − f(Xt)) + (L/(2t²)) E||Xt − Ỹt||²  (14)\n= (1 − 1/t)(f(Xt) − f(X*)) + (L/(2t²)) E||Xt − Ỹt||².  (15)\n\nThe first inequality follows by the suboptimality of α̂t and β̂t, the second by Equation (10), and the third by (9).\n\nDefine et := E f(Xt) − f(X*). The term E||Ỹt − Xt||² is bounded above by some absolute constant Γ because E||Ỹt − Xt||² = E||Ỹt − X*||² + ||Xt − X*||². The first term is bounded because it is the variance of Ỹt, and the second term is bounded by assumption. 
Taking expectation over Xt gives the bound e_{t+1} ≤ (1 − 1/t) et + LΓ/(2t²), which is solved by et = (1/t) · max(ΓL, f(X0) − f(X*)) [16].\n\nDespite the extremely simple and randomized nature of Random Conic Pursuit, the theorem guarantees that its objective values converge at the rate O(1/t) on an important subclass of SDPs. We omit here some readily available extensions: for example, the probability that a trajectory of iterates violates the above rate can be bounded by noting that the iterates' objective values behave as a finite difference sub-martingale. Additionally, the theorem and proof could be generalized to hold for a broader class of sampling schemes.\n\nDirectly characterizing the convergence of Random Conic Pursuit on problems with constraints appears to be significantly more difficult and seems to require introduction of new quantities depending on the constraint set (e.g., condition number of the constraint set and its overlap with the PSD cone) whose implications for the algorithm are difficult to explicitly characterize with respect to d and the properties of the gj, X*, and the Yt sampling distribution. Indeed, it would be useful to better understand the limitations of Random Conic Pursuit. As noted above, the procedure cannot readily accommodate general equality constraints; furthermore, for some constraint sets, sampling only a rank one Yt at each iteration could conceivably cause the iterates to become trapped at a sub-optimal boundary point (this could be alleviated by sampling higher rank Yt). 
A more general analysis is the subject of continuing work, though our experiments confirm empirically that we realize usefully fast convergence of Random Conic Pursuit even when it is applied to a variety of constrained SDPs.\n\nWe obtain a different analytical perspective by recalling that Random Conic Pursuit computes a solution within the random polyhedral cone F^x_n, defined in (3) above. The distance between this cone and the optimal matrix X* is closely related to the quality of solutions produced by Random Conic Pursuit. The following theorem characterizes the distance between a sampled cone F^x_n and any fixed X* in the PSD cone:\n\nTheorem 2. Let X* ≻ 0 be a fixed positive definite matrix, and let x1, . . . , xn ∈ R^d be drawn i.i.d. from N(0, Σ) with Σ ≻ X*. Then, for any δ > 0, with probability at least 1 − δ,\n\nmin_{X∈F^x_n} ||X − X*|| ≤ ((1 + √(2 log(1/δ)))/√n) · (2/e) · √|ΣX*^{−1}| · ||(X*^{−1} − Σ^{−1})^{−1}||₂\n\nSee supplementary materials for proof. As expected, F^x_n provides a progressively better approximation to the PSD cone (with high probability) as n grows. 
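Theorem 2 can be probed numerically: the distance from X* to F^x_n is a nonnegative least-squares problem over vectorized rank-one matrices, and with nested sample sets the distance is non-increasing in n. The sketch below is our own construction (it relies on scipy.optimize.nnls) and is an illustration, not a proof.

```python
import numpy as np
from scipy.optimize import nnls

def cone_distance(xs, X_star):
    """Frobenius distance from X_star to F^x_n = {sum_i gamma_i x_i x_i' :
    gamma_i >= 0}, computed as a nonnegative least-squares residual."""
    A = np.stack([np.outer(x, x).ravel() for x in xs], axis=1)
    _, resid = nnls(A, X_star.ravel())
    return resid

rng = np.random.default_rng(0)
d = 4
X_star = np.diag([1.5, 1.0, 0.5, 0.25])                # X* > 0
xs = rng.multivariate_normal(np.zeros(d), 2.0 * np.eye(d), size=200)  # Sigma > X*
d_10 = cone_distance(xs[:10], X_star)
d_200 = cone_distance(xs, X_star)   # superset of the first 10 samples
```

Because the 200 samples contain the first 10, d_200 ≤ d_10 deterministically, consistent in spirit with the theorem's 1/√n improvement.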
Furthermore, the rate at which this occurs depends on X* and its relationship to Σ; as the latter becomes better matched to the former, smaller values of n are required to achieve an approximation of given quality.\n\nThe constant Γ in Theorem 1 can hide a dependence on the dimensionality of the problem d, though the proof of Theorem 2 helps to elucidate the dependence of Γ on d and X* for the particular case when Σ does not vary over time (the constants in Theorem 2 arise from bounding ||γt(xt) xt xt'||).\n\nA potential concern regarding both of the above theorems is the possibility of extremely adverse dependence of their constants on the dimensionality d and the properties (e.g., condition number) of X*. However, our empirical results in Section 3 show that Random Conic Pursuit does indeed decrease the objective function usefully quickly on real problems with relatively large d and solution matrices X* which are rank one, a case predicted by the analysis to be among the most difficult.\n\n5 Related Work\n\nRandom Conic Pursuit and the analyses above are related to a number of existing optimization and sampling algorithms.\n\nOur procedure is closely related to feasible direction methods [22], which move along descent directions in the feasible set defined by the constraints at the current iterate. Cutting plane methods [11], when applied to some SDPs, solve a linear program obtained by replacing the PSD constraint with a polyhedral constraint. Random Conic Pursuit overcomes the difficulty of finding feasible descent directions or cutting planes, respectively, by sampling directions randomly and also allowing the current iterate to be rescaled.\n\nPursuit-based optimization methods [6, 13] return a solution within the convex hull of an a priori-specified convenient set of points M. 
At each iteration, they refine their solution to a point between the current iterate and a point in M. The main burden in these methods is to select a near-optimal point in M at each iteration. For SDPs having only a trace equality constraint and with M the set of rank-one PSD matrices, Hazan [10] shows that such points in M can be found via an eigenvalue computation, thereby obtaining a convergence rate of O(1/t). In contrast, our method selects steps randomly and still obtains a rate of O(1/t) in the unconstrained case.
The Hit-and-Run algorithm for sampling from convex bodies can be combined with simulated annealing to solve SDPs [15]. In this configuration, similarly to Random Conic Pursuit, it conducts a search along random directions whose distribution is adapted over time.
Finally, whereas Random Conic Pursuit utilizes a randomized polyhedral inner approximation of the PSD cone, the work of Calafiore and Campi [5] yields a randomized outer approximation to the PSD cone obtained by replacing the PSD constraint X ⪰ 0 with a set of sampled linear inequality constraints. It can be shown that for linear SDPs, the dual of the interior LP relaxation is identical to the exterior LP relaxation of the dual of the SDP. Empirically, however, this outer relaxation requires impractically many sampled constraints to ensure that the problem remains bounded and yields a good-quality solution.

6 Conclusion

We have presented Random Conic Pursuit, a simple, easily implemented randomized solver for general SDPs. Unlike interior point methods, our procedure does not excel at producing highly exact solutions. However, it is more scalable and provides useful approximate solutions fairly quickly, characteristics that are often desirable in machine learning applications.
This fact is illustrated by our experiments on three different machine learning tasks based on SDPs; we have also provided a preliminary analysis yielding further insight into Random Conic Pursuit.

Acknowledgments

We are grateful to Guillaume Obozinski for early discussions that motivated this line of work.

References
[1] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, 96:6745–6750, June 1999.
[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[3] S. Burer and R.D.C. Monteiro. Local minima and convergence in low-rank semidefinite programming. Mathematical Programming, 103(3):427–444, 2005.
[4] S. Burer, R.D.C. Monteiro, and Y. Zhang. A computational study of a gradient-based log-barrier algorithm for a class of large-scale SDPs. Mathematical Programming, 95(2):359–379, 2003.
[5] G. Calafiore and M.C. Campi. Uncertain convex programs: randomized solutions and confidence levels. Mathematical Programming, 102(1):25–46, 2005.
[6] K. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. In Symposium on Discrete Algorithms (SODA), 2008.
[7] A. d'Aspremont. Subsampling algorithms for semidefinite programming. Technical Report 0803.1990, ArXiv, 2009.
[8] A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007.
[9] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 1.21. http://cvxr.com/cvx, May 2010.
[10] E. Hazan.
Sparse approximate solutions to semidefinite programs. In Latin American Conference on Theoretical Informatics, pages 306–316, 2008.
[11] C. Helmberg. A cutting plane algorithm for large scale semidefinite relaxations. In Martin Grötschel, editor, The Sharpest Cut, chapter 15. MPS/SIAM Series on Optimization, 2001.
[12] C. Helmberg and F. Rendl. A spectral bundle method for semidefinite programming. SIAM Journal on Optimization, 10(3):673–696, 1999.
[13] L. K. Jones. A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. The Annals of Statistics, 20(1):608–613, March 1992.
[14] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research (JMLR), 5:27–72, December 2004.
[15] L. Lovász and S. Vempala. Fast algorithms for logconcave functions: Sampling, rounding, integration and optimization. In Foundations of Computer Science (FOCS), 2006.
[16] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[17] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, May 2005.
[18] Y. Nesterov. Smoothing technique and its applications in semidefinite optimization. Mathematical Programming, 110(2):245–259, July 2007.
[19] G. Obozinski, B. Taskar, and M. I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 2009.
[20] J. Platt. Using sparseness and analytic QP to speed training of Support Vector Machines.
In Advances in Neural Information Processing Systems (NIPS), 1999.
[21] J.F. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, Special Issue on Interior Point Methods, 11-12:625–653, 1999.
[22] W. Sun and Y. Yuan. Optimization Theory and Methods: Nonlinear Programming. Springer Optimization and Its Applications, 2006.
[23] K. Q. Weinberger, F. Sha, Q. Zhu, and L. K. Saul. Graph Laplacian regularization for large-scale semidefinite programming. In Advances in Neural Information Processing Systems (NIPS), 2006.
[24] E. Xing, A. Ng, M. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems (NIPS), 2003.