{"title": "Parametric Simplex Method for Sparse Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 188, "page_last": 197, "abstract": "High dimensional sparse learning has imposed a great computational challenge to large scale data analysis. In this paper, we investiage a broad class of sparse learning approaches formulated as linear programs parametrized by a {\\em regularization factor}, and solve them by the parametric simplex method (PSM). PSM offers significant advantages over other competing methods: (1) PSM naturally obtains the complete solution path for all values of the regularization parameter; (2) PSM provides a high precision dual certificate stopping criterion; (3) PSM yields sparse solutions through very few iterations, and the solution sparsity significantly reduces the computational cost per iteration. Particularly, we demonstrate the superiority of PSM over various sparse learning approaches, including Dantzig selector for sparse linear regression, sparse support vector machine for sparse linear classification, and sparse differential network estimation. We then provide sufficient conditions under which PSM always outputs sparse solutions such that its computational performance can be significantly boosted. Thorough numerical experiments are provided to demonstrate the outstanding performance of the PSM method.", "full_text": "Parametric Simplex Method for Sparse Learning\n\nHaotian Pang\u2021 Robert Vanderbei\u2021 Han Liu?\u2021\n\nTuo Zhao\u21e7\n\n\u2021Princeton University ?Tencent AI Lab \u2021Northwestern University \u21e7Georgia Tech\u21e4\n\nAbstract\n\nHigh dimensional sparse learning has imposed a great computational challenge to\nlarge scale data analysis. In this paper, we are interested in a broad class of sparse\nlearning approaches formulated as linear programs parametrized by a regularization\nfactor, and solve them by the parametric simplex method (PSM). 
Our parametric simplex method offers significant advantages over other competing methods: (1) PSM naturally obtains the complete solution path for all values of the regularization parameter; (2) PSM provides a high precision dual certificate stopping criterion; (3) PSM yields sparse solutions through very few iterations, and the solution sparsity significantly reduces the computational cost per iteration. In particular, we demonstrate the superiority of PSM over various sparse learning approaches, including the Dantzig selector for sparse linear regression, LAD-Lasso for sparse robust linear regression, CLIME for sparse precision matrix estimation, sparse differential network estimation, and sparse Linear Programming Discriminant (LPD) analysis. We then provide sufficient conditions under which PSM always outputs sparse solutions such that its computational performance can be significantly boosted. Thorough numerical experiments are provided to demonstrate the outstanding performance of the PSM method.

1 Introduction

A broad class of sparse learning approaches can be formulated as high dimensional optimization problems. A well known example is the Dantzig selector, which minimizes a sparsity-inducing $\ell_1$ norm subject to an $\ell_\infty$ norm constraint. Specifically, let $X \in \mathbb{R}^{n \times d}$ be a design matrix, $y \in \mathbb{R}^n$ be a response vector, and $\theta \in \mathbb{R}^d$ be the model parameter. The Dantzig selector can be formulated as the solution to the following convex program,

$\hat{\theta} = \mathrm{argmin}_{\theta} \|\theta\|_1$  s.t.  $\|X^\top(y - X\theta)\|_\infty \le \lambda$,   (1.1)

where $\|\cdot\|_1$ and $\|\cdot\|_\infty$ denote the $\ell_1$ and $\ell_\infty$ norms respectively, and $\lambda > 0$ is a regularization factor. Candes and Tao (2007) suggest rewriting (1.1) as a linear program and solving it with linear program solvers. The Dantzig selector motivates many other sparse learning approaches, which also apply a regularization factor to tune the desired solution.
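As a concrete illustration, (1.1) can be handed to an off-the-shelf LP solver for a single value of $\lambda$ by splitting $\theta$ into positive and negative parts. The sketch below uses scipy's `linprog` (this is a generic one-shot solve, not the PSM algorithm of this paper; the function name and data are ours):

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """Solve (1.1) for a single lambda via the LP reformulation with
    theta = theta_plus - theta_minus, theta_plus, theta_minus >= 0."""
    n, d = X.shape
    G, g = X.T @ X, X.T @ y
    # ||X^T(y - X theta)||_inf <= lam  <=>  G theta <= lam + g  and  -G theta <= lam - g
    A_ub = np.block([[G, -G], [-G, G]])
    b_ub = np.concatenate([lam + g, lam - g])
    c = np.ones(2 * d)  # minimizes 1^T(theta_plus + theta_minus) = ||theta||_1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * d), method="highs")
    return res.x[:d] - res.x[d:]

rng = np.random.default_rng(0)
n, d = 50, 20
X = rng.standard_normal((n, d))
theta_true = np.zeros(d)
theta_true[:3] = [2.0, -1.5, 1.0]
y = X @ theta_true + 0.01 * rng.standard_normal(n)
theta_hat = dantzig_selector(X, y, lam=1.0)
```

Note that this approach must re-solve the LP from scratch for every new $\lambda$, which is exactly the inefficiency the parametric simplex method avoids.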
Many of them can be written as a linear program in the following generic form, with either equality constraints:

$\max_x \ (c + \lambda\bar{c})^\top x$  s.t.  $Ax = b + \lambda\bar{b}$, $x \ge 0$,   (1.2)

or inequality constraints:

$\max_x \ (c + \lambda\bar{c})^\top x$  s.t.  $Ax \le b + \lambda\bar{b}$, $x \ge 0$.   (1.3)

Existing literature usually suggests the popular interior point method (IPM) to solve (1.2) and (1.3). The interior point method is famous for solving linear programs in polynomial time. Specifically, the interior point method uses the log barrier to handle the constraints, and rewrites (1.2) or (1.3) as an unconstrained program, which is further solved by Newton's method. Since the log barrier requires Newton's method to iterate only within the interior of the feasible region, IPM cannot yield exact sparse iterates, and cannot take advantage of sparsity to boost the computation.

*Correspondence to Tuo Zhao: tuo.zhao@isye.gatech.edu.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

An alternative approach is the simplex method. From a geometric perspective, the classical simplex method iterates over the vertices of a polytope. Algebraically, the algorithm involves moving from one partition of the basic and nonbasic variables to another. Each partition deviates from the previous one in that one basic variable gets swapped with one nonbasic variable, in a process called pivoting. Different variants of the simplex method are defined by different pivoting rules. The simplex method has been shown to work well in practice, even though its worst-case iteration complexity has been shown to scale exponentially with the problem size in the existing literature.

More recently, some researchers have proposed using alternating direction methods of multipliers (ADMM) to directly solve (1.1) without reparametrization as a linear program.
These methods enjoy $O(1/T)$ convergence rates based on variational inequality criteria, where $T$ is the number of iterations. However, ADMM can be viewed as an exterior point method, and always gives infeasible solutions within a finite number of iterations. We often observe that even after ADMM takes a large number of iterations, the solutions still suffer from significant feasibility violation. These methods work well only for moderate scale problems (e.g., $d < 1000$); for larger $d$, ADMM becomes less competitive.

These methods, though popular, are usually designed for solving (1.2) and (1.3) for one single regularization factor. This is not satisfactory, since an appropriate choice of $\lambda$ is usually unknown. Thus, one usually expects an algorithm to obtain multiple solutions tuned over a reasonable range of values for $\lambda$. For each value of $\lambda$, we need to solve a linear program from scratch, which is often very inefficient for high dimensional problems.

To overcome the above drawbacks, we propose to solve both (1.2) and (1.3) by a variant of the parametric simplex method (PSM) in a principled manner (Murty, 1983; Vanderbei, 1995). Specifically, the parametric simplex method parametrizes (1.2) and (1.3) using the unknown regularization factor $\lambda$ as a "parameter". This eventually yields a piecewise linear solution path for a sequence of regularization factors. Such a varying parameter scheme is also called homotopy optimization in the existing literature. PSM relies on special rules to iteratively choose the pair of variables to swap, and algebraically calculates the solution path during each pivoting step. PSM terminates at a value of the parameter at which the full solution path of the original problem has been solved. In the worst-case scenario, PSM can take an exponential number of pivots to find an optimal solution path.
Our empirical results suggest, however, that the number of iterations is roughly linear in the number of nonzero variables for large regularization factors with sparse optima. This means that the desired sparse solutions can often be found using very few pivots.

Several optimization methods for solving (1.1) are closely related to PSM, but these methods lack a generic design. For example, the simplex method proposed in Yao and Lee (2014) can be viewed as a special instance of our proposed PSM, where the perturbation is considered only on the right-hand side of the inequality constraints. The DASSO algorithm computes the entire coefficient path of the Dantzig selector by a simplex-like algorithm. Zhu et al. (2004) propose a similar algorithm which takes advantage of the piecewise linearity of the problem and computes the whole solution path of the $\ell_1$-SVM. These methods can be considered as similar algorithms derived from PSM but applied only to special cases, where the entire solution path is computed but an accurate dual certificate stopping criterion is not provided.

Notations: We denote the all-zero and all-one vectors by $0$ and $1$ respectively. Given a vector $a = (a_1, ..., a_d)^\top \in \mathbb{R}^d$, we define the number of nonzero entries $\|a\|_0 = \sum_j \mathbb{1}(a_j \ne 0)$, $\|a\|_1 = \sum_j |a_j|$, $\|a\|_2^2 = \sum_j a_j^2$, and $\|a\|_\infty = \max_j |a_j|$. When comparing vectors, "$\ge$" and "$\le$" mean component-wise comparison. Given a matrix $A \in \mathbb{R}^{d \times d}$ with entries $a_{jk}$, we use $|||A|||$ to denote entry-wise norms and $\|A\|$ to denote matrix norms. Accordingly, $|||A|||_0 = \sum_{j,k} \mathbb{1}(a_{jk} \ne 0)$, $|||A|||_1 = \sum_{j,k} |a_{jk}|$, $|||A|||_\infty = \max_{j,k} |a_{jk}|$, $\|A\|_1 = \max_k \sum_j |a_{jk}|$, $\|A\|_\infty = \max_j \sum_k |a_{jk}|$, $\|A\|_2 = \max_{\|a\|_2 \le 1} \|Aa\|_2$, and $\|A\|_F^2 = \sum_{j,k} a_{jk}^2$. We denote $A_{\backslash i\backslash j}$ as the submatrix of $A$ with the $i$-th row and $j$-th column removed. We denote $A_{i\backslash j}$ as the $i$-th row of $A$ with its $j$-th entry removed and $A_{\backslash ij}$ as the $j$-th column of $A$ with its $i$-th entry removed. For any subset $G$ of $\{1, 2, \ldots, d\}$, we let $A_G$ denote the submatrix of $A \in \mathbb{R}^{p \times d}$ consisting of the corresponding columns of $A$. The notation $A \ge 0$ means all of $A$'s entries are nonnegative. Similarly, for a vector $a \in \mathbb{R}^d$, we let $a_G$ denote the subvector of $a$ associated with the indices in $G$. Finally, $I_d$ denotes the $d$-dimensional identity matrix and $e_i$ denotes the vector with a one in its $i$-th entry and zeros elsewhere. In a large matrix, we leave a submatrix blank when all of its entries are zero.

2 Background

Many sparse learning approaches are formulated as convex programs in the generic form:

$\min_\theta \ \mathcal{L}(\theta) + \lambda\|\theta\|_1$,   (2.1)

where $\mathcal{L}(\theta)$ is a convex loss function, and $\lambda > 0$ is a regularization factor controlling bias and variance. Moreover, if $\mathcal{L}(\theta)$ is smooth, we can also consider an alternative formulation:

$\min_\theta \ \|\theta\|_1$  s.t.  $\|\nabla\mathcal{L}(\theta)\|_\infty \le \lambda$,   (2.2)

where $\nabla\mathcal{L}(\theta)$ is the gradient of $\mathcal{L}(\theta)$, and $\lambda > 0$ is a regularization factor. As will be shown later, both (2.1) and (2.2) are naturally suited to our algorithm, when $\mathcal{L}(\theta)$ is piecewise linear or quadratic, respectively. Our algorithm yields a piecewise-linear solution path as a function of $\lambda$ by varying $\lambda$ from large to small.

Before we proceed with our proposed algorithm, we first introduce the sparse learning problems of interest, including sparse linear regression, sparse linear classification, and undirected graph estimation. Due to the space limit, we present only three examples, and defer the others to the appendix.

Dantzig Selector: The first problem is sparse linear regression. Let $y \in \mathbb{R}^n$ be a response vector and $X \in \mathbb{R}^{n \times d}$ be the design matrix. We consider a linear model $y = X\theta^* + \epsilon$, where $\theta^* \in \mathbb{R}^d$ is the unknown regression coefficient vector, and $\epsilon$ is the observational noise vector.
Here we are interested in a high dimensional regime: $d$ is much larger than $n$, i.e., $d \gg n$, and many entries in $\theta^*$ are zero, i.e., $\|\theta^*\|_0 = s^* \ll n$. To get a sparse estimator of $\theta^*$, machine learning researchers and statisticians have proposed numerous approaches, including Lasso (Tibshirani, 1996), the Dantzig selector (Candes and Tao, 2007) and LAD-Lasso (Wang et al., 2007).

The Dantzig selector is formulated as the solution to the following convex program:

$\min_\theta \ \|\theta\|_1$  subject to  $\|X^\top(y - X\theta)\|_\infty \le \lambda$.   (2.3)

By setting $\theta = \theta^+ - \theta^-$ with $\theta_j^+ = \theta_j \cdot \mathbb{1}(\theta_j > 0)$ and $\theta_j^- = -\theta_j \cdot \mathbb{1}(\theta_j < 0)$, we rewrite (2.3) as a linear program:

$\min_{\theta^+, \theta^-} \ \mathbb{1}^\top(\theta^+ + \theta^-)$  s.t.  $\begin{pmatrix} X^\top X & -X^\top X \\ -X^\top X & X^\top X \end{pmatrix}\begin{pmatrix} \theta^+ \\ \theta^- \end{pmatrix} \le \begin{pmatrix} \lambda\mathbb{1} + X^\top y \\ \lambda\mathbb{1} - X^\top y \end{pmatrix}$, $\theta^+, \theta^- \ge 0$.   (2.4)

By complementary slackness, we can guarantee that the optimal $\theta_j^+$'s and $\theta_j^-$'s are nonnegative and complementary to each other. Note that (2.4) fits into our parametric linear program (1.3) with

$A = \begin{pmatrix} X^\top X & -X^\top X \\ -X^\top X & X^\top X \end{pmatrix}$, $b = \begin{pmatrix} X^\top y \\ -X^\top y \end{pmatrix}$, $\bar{b} = \mathbb{1}$, $c = -\mathbb{1}$, $\bar{c} = 0$, $x = \begin{pmatrix} \theta^+ \\ \theta^- \end{pmatrix}$.

Sparse Support Vector Machine: The second problem is the sparse SVM (support vector machine), which is a model-free discriminative modeling approach (Zhu et al., 2004). We are given $n$ independent and identically distributed samples $(x_1, y_1), ..., (x_n, y_n)$, where $x_i \in \mathbb{R}^d$ is the feature vector and $y_i \in \{1, -1\}$ is the binary label. Similar to sparse linear regression, we are interested in the high dimensional regime. To obtain a sparse SVM classifier, we solve the following convex program:

$\min_{\theta_0, \theta} \ \sum_{i=1}^n [1 - y_i(\theta_0 + \theta^\top x_i)]_+$  s.t.  $\|\theta\|_1 \le \lambda$,   (2.5)

where $\theta_0 \in \mathbb{R}$ and $\theta \in \mathbb{R}^d$.
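Before casting (2.5) into the parametric equality form, it may help to see that (2.5) is itself directly solvable as a plain LP after the usual variable splitting. The sketch below uses scipy's `linprog` on a small synthetic problem (this is the plain inequality formulation, not the paper's equality-form encoding; names and data are ours):

```python
import numpy as np
from scipy.optimize import linprog

def sparse_svm_lp(X, y, lam):
    """Solve (2.5) as an LP with variables [t (n), theta+ (d), theta- (d), theta0]."""
    n, d = X.shape
    Z = y[:, None] * X                      # rows y_i * x_i
    c = np.concatenate([np.ones(n), np.zeros(2 * d + 1)])   # minimize sum of hinges t_i
    # hinge constraints: 1 - y_i(theta0 + theta^T x_i) <= t_i
    A1 = np.hstack([-np.eye(n), -Z, Z, -y[:, None]])
    b1 = -np.ones(n)
    # l1 budget: 1^T theta+ + 1^T theta- <= lam
    A2 = np.concatenate([np.zeros(n), np.ones(2 * d), [0.0]])[None, :]
    res = linprog(c, A_ub=np.vstack([A1, A2]), b_ub=np.concatenate([b1, [lam]]),
                  bounds=[(0, None)] * (n + 2 * d) + [(None, None)], method="highs")
    theta = res.x[n:n + d] - res.x[n + d:n + 2 * d]
    return theta, res.x[-1]

rng = np.random.default_rng(1)
n, d = 40, 6
X = rng.standard_normal((n, d))
w_true = np.array([1.5, -2.0, 0.0, 0.0, 0.0, 0.0])
y = np.sign(X @ w_true + 0.1 * rng.standard_normal(n))
theta, theta0 = sparse_svm_lp(X, y, lam=3.0)
```

As with the Dantzig selector, each choice of $\lambda$ here costs a full LP solve; the parametric formulation in the text trades this for a single pivoting pass.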
Given a new sample $z \in \mathbb{R}^d$, the sparse SVM classifier predicts its label by $\mathrm{sign}(\theta_0 + \theta^\top z)$. Let $t_i = 1 - y_i(\theta_0 + \theta^\top x_i)$ for $i = 1, ..., n$. Then $t_i$ can be expressed as $t_i = t_i^+ - t_i^-$, and $[1 - y_i(\theta_0 + \theta^\top x_i)]_+$ can be represented by $t_i^+$. We split $\theta$ and $\theta_0$ into positive and negative parts as well: $\theta = \theta^+ - \theta^-$ and $\theta_0 = \theta_0^+ - \theta_0^-$, and add a slack variable $w \ge 0$ so that the $\ell_1$ constraint becomes an equality: $\mathbb{1}^\top\theta^+ + \mathbb{1}^\top\theta^- + \mathbb{1}^\top w = \lambda$. Now we cast the problem into the equality parametric simplex form (1.2). We identify each component of (1.2) as follows: $x = [t^+ \ t^- \ \theta^+ \ \theta^- \ \theta_0^+ \ \theta_0^- \ w]^\top \in \mathbb{R}^{2n+3d+2}$,

$A = \begin{pmatrix} I_n & -I_n & Z & -Z & y & -y & \\ & & \mathbb{1}^\top & \mathbb{1}^\top & & & \mathbb{1}^\top \end{pmatrix} \in \mathbb{R}^{(n+1)\times(2n+3d+2)}$,

$b = [\mathbb{1}^\top \ 0]^\top \in \mathbb{R}^{n+1}$, $\bar{b} = [0^\top \ 1]^\top \in \mathbb{R}^{n+1}$, $c = [-\mathbb{1}^\top \ 0^\top \ \cdots \ 0^\top]^\top \in \mathbb{R}^{2n+3d+2}$, and $\bar{c} = 0 \in \mathbb{R}^{2n+3d+2}$, where $Z = (y_1 x_1, \ldots, y_n x_n)^\top \in \mathbb{R}^{n \times d}$.

Differential Graph Estimation: The third problem is differential graph estimation, which aims to identify the difference between two undirected graphs (Zhao et al., 2013; Danaher et al., 2013). Related applications in biological and medical research can be found in the existing literature (Hudson et al., 2009; Bandyopadhyaya et al., 2010; Ideker and Krogan, 2012). Specifically, given $n_1$ i.i.d. samples $x_1, ..., x_{n_1}$ from $N_d(\mu_X^0, \Sigma_X^0)$ and $n_2$ i.i.d. samples $y_1, ..., y_{n_2}$ from $N_d(\mu_Y^0, \Sigma_Y^0)$, we are interested in estimating the difference of the precision matrices of the two distributions:

$\Delta^0 = (\Sigma_X^0)^{-1} - (\Sigma_Y^0)^{-1}$.

We define the empirical covariance matrices as $S_X = \frac{1}{n_1}\sum_{j=1}^{n_1}(x_j - \bar{x})(x_j - \bar{x})^\top$ and $S_Y = \frac{1}{n_2}\sum_{j=1}^{n_2}(y_j - \bar{y})(y_j - \bar{y})^\top$, where $\bar{x} = \frac{1}{n_1}\sum_{j=1}^{n_1} x_j$ and $\bar{y} = \frac{1}{n_2}\sum_{j=1}^{n_2} y_j$. Then Zhao et al. (2013) propose to estimate $\Delta^0$ by solving the following problem:

$\min_\Delta \ |||\Delta|||_1$  s.t.  $|||S_X \Delta S_Y - S_X + S_Y|||_\infty \le \lambda$,   (2.6)

where $S_X$ and $S_Y$ are the empirical covariance matrices.
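The vectorization used for problems of this matrix form rests on the column-major identity $\mathrm{vec}(CZ) = (Z^\top \otimes I_{m_1})\,\mathrm{vec}(C)$, where the Kronecker product $Z^\top \otimes I_{m_1}$ is exactly the block matrix whose $(j, i)$ block is $z_{ij} I_{m_1}$. A quick numeric sanity check (a sketch; the dimensions are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
m1, d2, m2 = 3, 4, 5
C = rng.standard_normal((m1, d2))
Z = rng.standard_normal((d2, m2))

def vec(M):
    # column-major (Fortran-order) stacking, matching vec(.) in the text
    return M.flatten(order="F")

# vec(C Z) = (Z^T kron I_{m1}) vec(C); the (j, i) block of Zprime is z_{ij} * I_{m1}
Zprime = np.kron(Z.T, np.eye(m1))
lhs = vec(C @ Z)
rhs = Zprime @ vec(C)
```

This is the standard special case of $\mathrm{vec}(ABC) = (C^\top \otimes A)\,\mathrm{vec}(B)$ with $A = I$.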
As can be seen, (2.6) is essentially a special example of a more general parametric linear program,

$\min_D \ |||D|||_1$  s.t.  $|||XDZ - Y|||_\infty \le \lambda$,   (2.7)

where $D \in \mathbb{R}^{d_1 \times d_2}$, $X \in \mathbb{R}^{m_1 \times d_1}$, $Z \in \mathbb{R}^{d_2 \times m_2}$ and $Y \in \mathbb{R}^{m_1 \times m_2}$ are given data matrices. Instead of directly solving (2.7), we consider a reparametrization by introducing an auxiliary variable $C = XD$. Similar to CLIME, we decompose $D = D^+ - D^-$, and eventually rewrite (2.7) as

$\min_{D^+, D^-} \ \mathbb{1}^\top(D^+ + D^-)\mathbb{1}$  s.t.  $|||CZ - Y|||_\infty \le \lambda$, $X(D^+ - D^-) = C$, $D^+, D^- \ge 0$.   (2.8)

Let $\mathrm{vec}(D^+)$, $\mathrm{vec}(D^-)$, $\mathrm{vec}(C)$ and $\mathrm{vec}(Y)$ be the vectors obtained by stacking the columns of the matrices $D^+$, $D^-$, $C$ and $Y$, respectively. We then write (2.8) as a parametric linear program with

$x = [\mathrm{vec}(D^+)^\top \ \ \mathrm{vec}(D^-)^\top \ \ \mathrm{vec}(C)^\top \ \ w^\top]^\top \in \mathbb{R}^{2d_1d_2 + m_1d_2 + 2m_1m_2}$,

where $w \in \mathbb{R}^{2m_1m_2}$ is a nonnegative slack variable vector used to turn the inequalities into equalities;

$A = \begin{pmatrix} X' & -X' & -I_{m_1d_2} & \\ & & Z' & I_{m_1m_2} \\ & & -Z' & \ \ I_{m_1m_2} \end{pmatrix}$, with $X' = \mathrm{diag}(X, \ldots, X) \in \mathbb{R}^{m_1d_2 \times d_1d_2}$ ($d_2$ diagonal copies of $X$),

and $Z' \in \mathbb{R}^{m_1m_2 \times m_1d_2}$ the block matrix whose $(j, i)$ block is $z_{ij} I_{m_1}$, where $z_{ij}$ denotes the $(i, j)$ entry of the matrix $Z$. Moreover, $b = [0^\top \ \ \mathrm{vec}(Y)^\top \ \ -\mathrm{vec}(Y)^\top]^\top \in \mathbb{R}^{m_1d_2 + 2m_1m_2}$, where the first $m_1d_2$ components are 0 and the rest come from the matrix $Y$; $\bar{b} = [0^\top \ \mathbb{1}^\top \ \mathbb{1}^\top]^\top \in \mathbb{R}^{m_1d_2 + 2m_1m_2}$, where the first $m_1d_2$ components are 0 and the remaining $2m_1m_2$ components are 1; $c = [-\mathbb{1}^\top \ {-\mathbb{1}^\top} \ 0^\top \ 0^\top]^\top \in \mathbb{R}^{2d_1d_2 + m_1d_2 + 2m_1m_2}$, where the first $2d_1d_2$ components are $-1$ and the remaining $m_1d_2 + 2m_1m_2$ components are 0; and $\bar{c} = 0$.

3 Homotopy Parametric Simplex Method

We first briefly review the primal simplex method for linear programming, and then derive the proposed algorithm.

Preliminaries: We consider a standard linear program as follows,

$\max_x \ c^\top x$  s.t.  $Ax = b$, $x \ge 0$, $x \in \mathbb{R}^n$,   (3.1)

where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$ and $c \in \mathbb{R}^n$ are given.
Without loss of generality, we assume that $m \le n$ and the matrix $A$ has full row rank $m$. Throughout our analysis, we assume that an optimal solution exists (it need not be unique). The primal simplex method starts from a basic feasible solution (to be defined shortly, but geometrically it can be thought of as any vertex of the feasible polytope) and proceeds step-by-step (vertex-by-vertex) to the optimal solution. Various techniques exist to find the first feasible solution, which is often referred to as the Phase I method. See Vanderbei (1995); Murty (1983); Dantzig (1951).

Algebraically, a basic solution corresponds to a partition of the indices $\{1, \ldots, n\}$ into $m$ basic indices denoted $\mathcal{B}$ and $n - m$ nonbasic indices denoted $\mathcal{N}$. Note that not all partitions are allowed: the submatrix of $A$ consisting of the columns of $A$ associated with the basic indices, denoted $A_\mathcal{B}$, must be invertible. The submatrix of $A$ corresponding to the nonbasic indices is denoted $A_\mathcal{N}$. Suppressing the fact that the columns have been rearranged, we can write $A = [A_\mathcal{N}, A_\mathcal{B}]$. If we rearrange the rows of $x$ and $c$ in the same way, we can introduce a corresponding partition of these vectors: $x = [x_\mathcal{N}^\top \ x_\mathcal{B}^\top]^\top$, $c = [c_\mathcal{N}^\top \ c_\mathcal{B}^\top]^\top$. We rewrite the constraint as $A_\mathcal{N} x_\mathcal{N} + A_\mathcal{B} x_\mathcal{B} = b$. Since the matrix $A_\mathcal{B}$ is assumed to be invertible, we can express $x_\mathcal{B}$ in terms of $x_\mathcal{N}$ as follows:

$x_\mathcal{B} = x_\mathcal{B}^* - A_\mathcal{B}^{-1} A_\mathcal{N} x_\mathcal{N}$,   (3.2)

where we have written $x_\mathcal{B}^*$ as an abbreviation for $A_\mathcal{B}^{-1} b$. This rearrangement of the equality constraints is called a dictionary, because the basic variables are defined as functions of the nonbasic variables. Denoting the objective $c^\top x$ by $\zeta$, we can also write:

$\zeta = c^\top x = c_\mathcal{B}^\top x_\mathcal{B} + c_\mathcal{N}^\top x_\mathcal{N} = \zeta^* - (z_\mathcal{N}^*)^\top x_\mathcal{N}$,   (3.3)

where $\zeta^* = c_\mathcal{B}^\top A_\mathcal{B}^{-1} b$, $x_\mathcal{B}^* = A_\mathcal{B}^{-1} b$ and $z_\mathcal{N}^* = (A_\mathcal{B}^{-1} A_\mathcal{N})^\top c_\mathcal{B} - c_\mathcal{N}$.

We call equations (3.2) and (3.3) the primal dictionary associated with the current basis $\mathcal{B}$. Corresponding to each dictionary, there is a basic solution (also called a dictionary solution) obtained by setting the nonbasic variables to zero and reading off the values of the basic variables: $x_\mathcal{N} = 0$, $x_\mathcal{B} = x_\mathcal{B}^*$. This particular "solution" satisfies the equality constraints of the problem by construction. To be a feasible solution, one only needs to check that the values of the basic variables are nonnegative. Therefore, we say that a basic solution is a basic feasible solution if $x_\mathcal{B}^* \ge 0$.

The dual of (3.1) is given by

$\max_y \ -b^\top y$  s.t.  $A^\top y - z = c$, $z \ge 0$, $z \in \mathbb{R}^n$, $y \in \mathbb{R}^m$.   (3.4)

In this case, we separate the variable $z$ into basic and nonbasic parts as before: $z = [z_\mathcal{N}^\top \ z_\mathcal{B}^\top]^\top$. The corresponding dual dictionary is given by:

$z_\mathcal{N} = z_\mathcal{N}^* + (A_\mathcal{B}^{-1} A_\mathcal{N})^\top z_\mathcal{B}$, $\quad -\xi = -\zeta^* - (x_\mathcal{B}^*)^\top z_\mathcal{B}$,   (3.5)

where $-\xi$ denotes the objective function in (3.4), $\zeta^* = c_\mathcal{B}^\top A_\mathcal{B}^{-1} b$, $x_\mathcal{B}^* = A_\mathcal{B}^{-1} b$ and $z_\mathcal{N}^* = (A_\mathcal{B}^{-1} A_\mathcal{N})^\top c_\mathcal{B} - c_\mathcal{N}$.

For each dictionary, we set $x_\mathcal{N}$ and $z_\mathcal{B}$ to 0 (complementarity) and read off the solutions for $x_\mathcal{B}$ and $z_\mathcal{N}$ according to (3.2) and (3.5). Next, we remove one basic index, replace it with a nonbasic index, and obtain an updated dictionary. The simplex method produces a sequence of steps to adjacent bases such that the value of the objective function is always increasing at each step.
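The dictionary quantities $x_\mathcal{B}^*$, $z_\mathcal{N}^*$ and $\zeta^*$ are easy to compute for a toy problem. A minimal sketch (the helper name and the toy LP are ours):

```python
import numpy as np

def dictionary(A, b, c, B):
    """Compute x*_B, z*_N and zeta* for basis index set B, for the
    standard-form LP  max c^T x  s.t.  Ax = b, x >= 0."""
    n = A.shape[1]
    N = [j for j in range(n) if j not in B]
    AB, AN = A[:, B], A[:, N]
    xB = np.linalg.solve(AB, b)                    # x*_B = A_B^{-1} b
    zN = np.linalg.solve(AB, AN).T @ c[B] - c[N]   # z*_N = (A_B^{-1} A_N)^T c_B - c_N
    return xB, zN, c[B] @ xB                       # zeta* = c_B^T x*_B

# toy LP: max x1 + x2  s.t.  x1 + x3 = 1, x2 + x4 = 1  (x3, x4 are slacks)
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
b = np.array([1.0, 1.0])
c = np.array([1.0, 1.0, 0.0, 0.0])
xB, zN, zeta = dictionary(A, b, c, B=[0, 1])
# here x*_B >= 0 (primal feasible) and z*_N >= 0 (dual feasible),
# so the basis {x1, x2} is optimal with objective zeta* = 2
```

Reading off the basic solution means setting $x_\mathcal{N} = 0$ and $x_\mathcal{B} = x_\mathcal{B}^*$, exactly as described in the text.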
Primal feasibility requires that $x_\mathcal{B} \ge 0$, so while we update the dictionary, primal feasibility must always be maintained. This process stops when $z_\mathcal{N} \ge 0$ (dual feasibility), since then primal feasibility, dual feasibility and complementarity (i.e., the optimality conditions) are all satisfied.

Parametric Simplex Method: We now derive the parametric simplex method, which finds the full solution path while solving the parametric linear programming problem only once. A few variants of the simplex method have been proposed, with different rules for choosing the pair of variables to swap at each iteration. Here we describe the rule used by the parametric simplex method: we add nonnegative perturbations ($\bar{b}$ and $\bar{c}$) times a positive parameter $\lambda$ to both the objective function and the right hand side of the primal problem. The purpose of doing this is to guarantee primal and dual feasibility when $\lambda$ is large. Since the problem is then both primal and dual feasible, no Phase I stage is required for the parametric simplex method. Furthermore, if the $i$-th entry of $b$ or the $j$-th entry of $c$ already satisfies the feasibility condition ($b_i \ge 0$ or $c_j \le 0$), then the corresponding perturbation $\bar{b}_i$ or $\bar{c}_j$ is allowed to be 0. With these perturbations, (3.1) becomes:

$\max_x \ (c + \lambda\bar{c})^\top x$  s.t.  $Ax = b + \lambda\bar{b}$, $x \ge 0$, $x \in \mathbb{R}^n$.   (3.6)

We separate the perturbation vectors into basic and nonbasic parts as well, and write down the dictionaries with perturbations corresponding to (3.2), (3.3), and (3.5) as:

$x_\mathcal{B} = (x_\mathcal{B}^* + \lambda\bar{x}_\mathcal{B}) - A_\mathcal{B}^{-1}A_\mathcal{N}x_\mathcal{N}$, $\quad \zeta = \zeta^* - (z_\mathcal{N}^* + \lambda\bar{z}_\mathcal{N})^\top x_\mathcal{N}$,   (3.7)

$z_\mathcal{N} = (z_\mathcal{N}^* + \lambda\bar{z}_\mathcal{N}) + (A_\mathcal{B}^{-1}A_\mathcal{N})^\top z_\mathcal{B}$, $\quad -\xi = -\zeta^* - (x_\mathcal{B}^* + \lambda\bar{x}_\mathcal{B})^\top z_\mathcal{B}$,   (3.8)

where $x_\mathcal{B}^* = A_\mathcal{B}^{-1}b$, $\bar{x}_\mathcal{B} = A_\mathcal{B}^{-1}\bar{b}$, $z_\mathcal{N}^* = (A_\mathcal{B}^{-1}A_\mathcal{N})^\top c_\mathcal{B} - c_\mathcal{N}$, and $\bar{z}_\mathcal{N} = (A_\mathcal{B}^{-1}A_\mathcal{N})^\top \bar{c}_\mathcal{B} - \bar{c}_\mathcal{N}$. When $\lambda$ is large, the dictionary is both primal and dual feasible ($x_\mathcal{B}^* + \lambda\bar{x}_\mathcal{B} \ge 0$ and $z_\mathcal{N}^* + \lambda\bar{z}_\mathcal{N} \ge 0$). The corresponding primal solution is simple: $x_\mathcal{B} = x_\mathcal{B}^* + \lambda\bar{x}_\mathcal{B}$ and $x_\mathcal{N} = 0$. This solution is valid until $\lambda$ hits a lower bound which breaks the feasibility. The smallest value of $\lambda$ that does not break any feasibility is given by

$\lambda^* = \min\{\lambda : z_\mathcal{N}^* + \lambda\bar{z}_\mathcal{N} \ge 0 \text{ and } x_\mathcal{B}^* + \lambda\bar{x}_\mathcal{B} \ge 0\}$.   (3.9)

In other words, the dictionary and its corresponding solution $x_\mathcal{B} = x_\mathcal{B}^* + \lambda\bar{x}_\mathcal{B}$, $x_\mathcal{N} = 0$ is optimal for $\lambda \in [\lambda^*, \lambda^{\max}]$, where

$\lambda^* = \max\big(\max_{j \in \mathcal{N}, \bar{z}_{\mathcal{N}_j} > 0} -z_{\mathcal{N}_j}^*/\bar{z}_{\mathcal{N}_j}, \ \max_{i \in \mathcal{B}, \bar{x}_{\mathcal{B}_i} > 0} -x_{\mathcal{B}_i}^*/\bar{x}_{\mathcal{B}_i}\big)$,   (3.10)

$\lambda^{\max} = \min\big(\min_{j \in \mathcal{N}, \bar{z}_{\mathcal{N}_j} < 0} -z_{\mathcal{N}_j}^*/\bar{z}_{\mathcal{N}_j}, \ \min_{i \in \mathcal{B}, \bar{x}_{\mathcal{B}_i} < 0} -x_{\mathcal{B}_i}^*/\bar{x}_{\mathcal{B}_i}\big)$.   (3.11)

Note that although the perturbations are initially nonnegative, as the dictionary gets updated, the perturbations do not necessarily remain nonnegative. For each dictionary, there is a corresponding interval of $\lambda$ given by (3.10) and (3.11).
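The interval $[\lambda^*, \lambda^{\max}]$ in (3.10)-(3.11) is a direct ratio computation over the perturbed dictionary entries. A minimal sketch on hand-picked values (the entries are illustrative, not from a real problem):

```python
import numpy as np

def lambda_interval(xB, xB_bar, zN, zN_bar):
    """Return (lambda_star, lambda_max) per (3.10)-(3.11): the interval of
    lambda on which x*_B + lambda*xbar_B >= 0 and z*_N + lambda*zbar_N >= 0."""
    vals = np.concatenate([xB, zN])
    bars = np.concatenate([xB_bar, zN_bar])
    lo = [-v / vb for v, vb in zip(vals, bars) if vb > 0]   # binding from below
    hi = [-v / vb for v, vb in zip(vals, bars) if vb < 0]   # binding from above
    lam_star = max(lo) if lo else -np.inf
    lam_max = min(hi) if hi else np.inf
    return lam_star, lam_max

xB = np.array([-0.2, 1.0])
xB_bar = np.array([0.1, 0.0])
zN = np.array([0.5, -1.0, 3.0])
zN_bar = np.array([1.0, 2.0, -1.0])
lam_star, lam_max = lambda_interval(xB, xB_bar, zN, zN_bar)
# for these values the dictionary stays feasible exactly on [2.0, 3.0]
```

When `lam_star` drops to 0 or below, the current basic solution is optimal for the original (unperturbed) problem, which is precisely the termination rule of Algorithm 1.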
We have characterized the optimal solution for this interval, and these intervals together give us the solution path of the original parametric linear programming problem. Next, we show how the dictionary gets updated as the leaving variable and entering variable swap.

We expect that after swapping the entering variable $j$ and leaving variable $i$, the new solution in the dictionaries (3.7) and (3.8) changes to:

$x_j^* = t$, $\bar{x}_j = \bar{t}$, $x_\mathcal{B}^* \leftarrow x_\mathcal{B}^* - t\Delta x_\mathcal{B}$, $\bar{x}_\mathcal{B} \leftarrow \bar{x}_\mathcal{B} - \bar{t}\Delta x_\mathcal{B}$,
$z_i^* = s$, $\bar{z}_i = \bar{s}$, $z_\mathcal{N}^* \leftarrow z_\mathcal{N}^* - s\Delta z_\mathcal{N}$, $\bar{z}_\mathcal{N} \leftarrow \bar{z}_\mathcal{N} - \bar{s}\Delta z_\mathcal{N}$,

where $t$ and $\bar{t}$ are the primal step lengths for the primal basic variables and perturbations, $s$ and $\bar{s}$ are the dual step lengths for the dual nonbasic variables and perturbations, and $\Delta x_\mathcal{B}$ and $\Delta z_\mathcal{N}$ are the primal and dual step directions, respectively. We now explain how to find these values in detail.

There is either a $j \in \mathcal{N}$ for which $z_j^* + \lambda^*\bar{z}_j = 0$ or an $i \in \mathcal{B}$ for which $x_i^* + \lambda^*\bar{x}_i = 0$ in (3.9). If it corresponds to a nonbasic index $j$, then we do one step of the primal simplex. In this case, we declare $j$ the entering variable, and we need to find the primal step direction $\Delta x_\mathcal{B}$. After the entering variable $j$ has been selected, $x_\mathcal{N}$ changes from $0$ to $te_j$, where $t$ is the primal step length. Then, according to (3.7), we have $x_\mathcal{B} = (x_\mathcal{B}^* + \lambda\bar{x}_\mathcal{B}) - A_\mathcal{B}^{-1}A_\mathcal{N}te_j$. The step direction is thus $\Delta x_\mathcal{B} = A_\mathcal{B}^{-1}A_\mathcal{N}e_j$. We next select the leaving variable. In order to maintain primal feasibility, we need to keep $x_\mathcal{B} \ge 0$; therefore, the leaving variable $i$ is selected such that $i \in \mathcal{B}$ achieves the maximal value of $\Delta x_i/(x_i^* + \lambda^*\bar{x}_i)$. It only remains to show how $z_\mathcal{N}$ changes. Since $i$ is the leaving variable, according to (3.8), we have $\Delta z_\mathcal{N} = -(A_\mathcal{B}^{-1}A_\mathcal{N})^\top e_i$. After we know the entering and leaving variables and the primal and dual step directions, the primal and dual step lengths can be found as

$t = \dfrac{x_i^*}{\Delta x_i}$, $\quad \bar{t} = \dfrac{\bar{x}_i}{\Delta x_i}$, $\quad s = \dfrac{z_j^*}{\Delta z_j}$, $\quad \bar{s} = \dfrac{\bar{z}_j}{\Delta z_j}$.

If, on the other hand, the constraint in (3.9) corresponds to a basic index $i$, we declare $i$ the leaving variable, and a similar calculation can be made based on the dual simplex method (i.e., applying the primal simplex method to the dual problem). Since it is very similar to the primal simplex method, we omit the detailed description.

The algorithm terminates whenever $\lambda^* \le 0$. The corresponding solution is optimal, since our dictionary always satisfies primal feasibility, dual feasibility and the complementary slackness condition. The only concern during the entire process of the parametric simplex method is whether $\lambda$ can reach zero; as long as $\lambda$ can be set to zero, we have the optimal solution to the original problem. We summarize the parametric simplex method in Algorithm 1.

The following theorem shows that the updated basic and nonbasic partition gives the optimal solution.

Theorem 3.1. For a given dictionary with parameter $\lambda$ in the form of (3.7) and (3.8), let $\mathcal{B}$ be a basic index set and $\mathcal{N}$ a nonbasic index set. Assume this dictionary is optimal for $\lambda \in [\lambda^*, \lambda^{\max}]$, where $\lambda^*$ and $\lambda^{\max}$ are given by (3.10) and (3.11), respectively. Then the updated dictionary with basic index set $\mathcal{B}^*$ and nonbasic index set $\mathcal{N}^*$ is still optimal at $\lambda = \lambda^*$.

Write down the dictionary as in (3.7) and (3.8);
Find $\lambda^*$ given by (3.10);
while $\lambda^* > 0$ do
  if the constraint in (3.10) corresponds to an index $j \in \mathcal{N}$ then
    Declare $x_j$ the entering variable;
    Compute the primal step direction $\Delta x_\mathcal{B} = A_\mathcal{B}^{-1}A_\mathcal{N}e_j$;
    Select the leaving variable: find $i \in \mathcal{B}$ that achieves the maximal value of $\Delta x_i/(x_i^* + \lambda^*\bar{x}_i)$;
    Compute the dual step direction $\Delta z_\mathcal{N} = -(A_\mathcal{B}^{-1}A_\mathcal{N})^\top e_i$;
  else if the constraint in (3.10) corresponds to an index $i \in \mathcal{B}$ then
    Declare $z_i$ the leaving variable;
    Compute the dual step direction $\Delta z_\mathcal{N} = -(A_\mathcal{B}^{-1}A_\mathcal{N})^\top e_i$;
    Select the entering variable: find $j \in \mathcal{N}$ that achieves the maximal value of $\Delta z_j/(z_j^* + \lambda^*\bar{z}_j)$;
    Compute the primal step direction $\Delta x_\mathcal{B} = A_\mathcal{B}^{-1}A_\mathcal{N}e_j$;
  end
  Compute the dual and primal step lengths for both variables and perturbations:
    $t = x_i^*/\Delta x_i$, $\bar{t} = \bar{x}_i/\Delta x_i$, $s = z_j^*/\Delta z_j$, $\bar{s} = \bar{z}_j/\Delta z_j$.
  Update the primal and dual solutions:
    $x_j^* = t$, $\bar{x}_j = \bar{t}$, $x_\mathcal{B}^* \leftarrow x_\mathcal{B}^* - t\Delta x_\mathcal{B}$, $\bar{x}_\mathcal{B} \leftarrow \bar{x}_\mathcal{B} - \bar{t}\Delta x_\mathcal{B}$,
    $z_i^* = s$, $\bar{z}_i = \bar{s}$, $z_\mathcal{N}^* \leftarrow z_\mathcal{N}^* - s\Delta z_\mathcal{N}$, $\bar{z}_\mathcal{N} \leftarrow \bar{z}_\mathcal{N} - \bar{s}\Delta z_\mathcal{N}$.
  Update the basic and nonbasic index sets $\mathcal{B} := (\mathcal{B} \setminus \{i\}) \cup \{j\}$ and $\mathcal{N} := (\mathcal{N} \setminus \{j\}) \cup \{i\}$. Write down the new dictionary and compute $\lambda^*$ given by (3.10);
end
Set the nonbasic variables to 0 and read off the values of the basic variables.
Algorithm 1: The parametric simplex method

During each iteration, there is an optimal solution corresponding to $\lambda \in [\lambda^*, \lambda^{\max}]$. Notice that each such range of $\lambda$ is determined by a partition between basic and nonbasic variables, and the number of partitions into basic and nonbasic variables is finite. Thus, after finitely many steps, we must find the optimal solutions corresponding to all values of $\lambda$.

Theory: We present our theoretical analysis of solving the Dantzig selector using PSM. Specifically, given $X \in \mathbb{R}^{n \times d}$ and $y \in \mathbb{R}^n$, we consider a linear model $y = X\theta^* + \epsilon$, where $\theta^*$ is the unknown sparse regression coefficient vector with $\|\theta^*\|_0 = s^*$, and $\epsilon \sim N(0, \sigma^2 I_n)$. We show that PSM always maintains a pair of sparse primal and dual solutions.
Therefore, the computational cost within each iteration of PSM can be significantly reduced. Before we proceed with our main result, we introduce two assumptions. The first assumption requires the regularization factor to be sufficiently large.

Assumption 3.2. Suppose that PSM solves (2.3) for a regularization sequence $\{\lambda_K\}_{K=0}^N$. The smallest regularization factor $\lambda_N$ satisfies

$\lambda_N = C\sigma\sqrt{\log d / n} \ge 4\|X^\top\epsilon\|_\infty$ for some generic constant $C$.

Existing literature has extensively studied Assumption 3.2 for high dimensional statistical theories. Such an assumption enforces all regularization parameters to be sufficiently large in order to eliminate irrelevant coordinates along the regularization path. Note that Assumption 3.2 is deterministic for any given $\lambda_N$. Existing literature has verified that for sparse linear regression models with $\epsilon \sim N(0, \sigma^2 I_n)$, Assumption 3.2 holds with overwhelming probability.

Before we present the second assumption, we define the largest and smallest $s$-sparse eigenvalues of $n^{-1}X^\top X$ respectively as follows.

Definition 3.3. Given an integer $s \ge 1$, we define

$\rho_+(s) = \sup_{\|\delta\|_0 \le s} \dfrac{\delta^\top X^\top X \delta}{n\|\delta\|_2^2}$  and  $\rho_-(s) = \inf_{\|\delta\|_0 \le s} \dfrac{\delta^\top X^\top X \delta}{n\|\delta\|_2^2}$.

Assumption 3.4. Given $\|\theta^*\|_0 \le s^*$, there exists an integer $\tilde{s}$ such that

$\tilde{s} \ge 100\kappa s^*$, $\quad \rho_+(s^* + \tilde{s}) < +\infty$, and $\rho_-(s^* + \tilde{s}) > 0$,

where $\kappa$ is defined as $\kappa = \rho_+(s^* + \tilde{s})/\rho_-(s^* + \tilde{s})$.

Assumption 3.4 guarantees that $n^{-1}X^\top X$ satisfies the sparse eigenvalue conditions as long as the number of active irrelevant coordinates never exceeds $\tilde{s}$ along the solution path. This is closely related to the restricted isometry property (RIP) and restricted eigenvalue (RE) conditions, which have been extensively studied in the existing literature.

We then characterize the sparsity of the primal and dual solutions within each iteration.

Theorem 3.5 (Primal and Dual Sparsity). Suppose that Assumptions 3.2 and 3.4 hold. We consider an alternative formulation of the Dantzig selector,

$\hat{\theta}^\lambda = \mathrm{argmin}_\theta \|\theta\|_1$  subject to  $\nabla_j\mathcal{L}(\theta) \le \lambda$, $-\nabla_j\mathcal{L}(\theta) \le \lambda$ for all $j$.   (3.12)

Let $\hat{\mu}^\lambda = [\hat{\mu}_1^\lambda, \ldots, \hat{\mu}_d^\lambda, \hat{\mu}_{d+1}^\lambda, \ldots, \hat{\mu}_{2d}^\lambda]^\top$ denote the optimal dual variables of (3.12). For any $\lambda \ge \lambda_N$, we have $\|\hat{\mu}^\lambda\|_0 \le s^* + \tilde{s}$. Moreover, given a design matrix satisfying

$\|X_{\bar{S}}^\top X_S(X_S^\top X_S)^{-1}\|_\infty \le 1 - \zeta$,

where $\zeta > 0$ is a generic constant, $S = \{j \mid \theta_j^* \ne 0\}$ and $\bar{S} = \{j \mid \theta_j^* = 0\}$, we have $\|\hat{\theta}^\lambda\|_0 \le s^*$.

The proof of Theorem 3.5 is provided in Appendix B. Theorem 3.5 shows that within each iteration, both primal and dual variables are sparse, i.e., their numbers of nonzero entries are far smaller than $d$. Therefore, the computational cost within each iteration of PSM can be significantly reduced, by a factor of $O(d/s^*)$. This partially justifies the superior performance of PSM in sparse learning.

4 Numerical Experiments

In this section, we present numerical experiments and give some insight into how the parametric simplex method solves different linear programming problems. We verify the following assertions: (1) The parametric simplex method requires very few iterations to identify the nonzero components if the original problem is sparse. (2) The parametric simplex method is able to find the full solution path with high precision by solving the problem only once, in an efficient and scalable manner.
(3) The parametric simplex method maintains the feasibility of the problem up to machine precision along the solution path.

Figure 1: Dantzig selector method: (a) The solution path of the parametric simplex method; (b) The parameter path of the parametric simplex method (rescaled by $n$); (c) Feasibility violation along the solution path.

Solution path of Dantzig selector: We start with a simple example that illustrates how the recovered solution path of the Dantzig selector model changes as the parametric simplex method iterates. We adopt the example used in Candes and Tao (2007). The design matrix $X$ has $n = 100$ rows and $d = 250$ columns. The entries of $X$ are generated from an array of independent Gaussian random variables that are then normalized so that each column has a given norm. We randomly select $s = 8$ entries of the coefficient vector $\theta^0$ and set them as $\theta^0_i = s_i(1 + |a_i|)$, where $s_i = 1$ or $-1$ with probability $1/2$ each, and $a_i \sim N(0, 1)$. The other entries of $\theta^0$ are set to zero. We form $y = X\theta^0 + \epsilon$, where $\epsilon_i \sim N(0, \sigma^2)$ with $\sigma = 1$.
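This data-generating process can be sketched as follows. The helper name is ours, and we take unit column norms as the (unspecified) common column norm; the signal magnitudes $s_i(1 + |a_i|)$ follow the Candes and Tao (2007) example.

```python
import numpy as np


def generate_dantzig_example(n=100, d=250, s=8, sigma=1.0, seed=0):
    """Synthetic Dantzig selector data in the style of Candes and Tao (2007).

    Columns of X are Gaussian, then rescaled to a common (here: unit)
    Euclidean norm. The coefficient vector theta0 has s nonzero entries
    theta0_i = s_i * (1 + |a_i|), with random signs s_i and a_i ~ N(0,1).
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=0)            # give every column the same norm
    theta0 = np.zeros(d)
    support = rng.choice(d, size=s, replace=False)
    signs = rng.choice([-1.0, 1.0], size=s)   # s_i = +/-1 with probability 1/2
    a = rng.standard_normal(s)                # a_i ~ N(0, 1)
    theta0[support] = signs * (1.0 + np.abs(a))
    y = X @ theta0 + sigma * rng.standard_normal(n)
    return X, y, theta0
```

Every nonzero coefficient has magnitude at least 1, which keeps the signal well separated from the noise level $\sigma = 1$.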
We stop the parametric simplex method when $\lambda \leq n\sqrt{\log d/n}$. The solution path of the result is shown in Figure 1(a). We see that our method correctly identifies all nonzero entries of $\theta$ in less than 10 iterations. Some small overestimations occur in a few iterations after all nonzero entries have been identified. We also show how the parameter $\lambda$ evolves as the parametric simplex method iterates in Figure 1(b). As we see, $\lambda$ decreases sharply to less than 5 after all nonzero components have been identified. This reconciles with the theorem we developed. The algorithm itself only requires a very small number of iterations to correctly identify the nonzero entries of $\theta$. In our example, each iteration of the parametric simplex method identifies one or two nonzero entries of $\theta$.

Feasibility of Dantzig Selector: Another advantage of the parametric simplex method is that the solution is always feasible along the path, while other estimation methods usually generate infeasible solutions along the path. We compare our algorithm with "flare" (Li et al., 2015), which solves the problem by the Alternating Direction Method of Multipliers (ADMM), using the same example described above. We compute the values of $\|X^\top X \widehat{\theta}^i - X^\top y\|_\infty - \lambda_i$ along the solution path, where $\widehat{\theta}^i$ is the $i$-th basic solution (with corresponding $\lambda_i$) obtained while the parametric simplex method is iterating. As expected, we always obtain 0 at each iteration. We plug the same list of $\lambda_i$ into "flare" and compute the solution path for this list as well. As shown in Table 1, the parametric simplex method is always feasible along the path, since it solves each iteration up to machine precision, while the solution path of the ADMM almost always violates feasibility by a large amount, especially in the first few iterations, which correspond to large values of $\lambda$.
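The violation metric above can be computed for any estimate; a minimal sketch (the helper name is ours):

```python
import numpy as np


def dantzig_feasibility_violation(X, y, theta_hat, lam):
    """Return max(0, ||X^T (X theta_hat - y)||_inf - lam).

    A basic solution produced by PSM satisfies the Dantzig selector
    constraint ||X^T (X theta - y)||_inf <= lam exactly (up to machine
    precision), so this is ~0 for PSM iterates, while approximate
    solvers such as ADMM may return infeasible iterates.
    """
    residual_corr = X.T @ (X @ theta_hat - y)
    return max(0.0, np.linalg.norm(residual_corr, ord=np.inf) - lam)
```

For instance, the zero vector is feasible exactly when $\lambda \geq \|X^\top y\|_\infty$, and the metric is positive for any smaller $\lambda$.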
Each experiment is repeated 100 times.

Table 1: Average feasibility violation with standard errors along the solution path

          Maximum Violation    Minimum Violation
ADMM      498(122)             143(73.2)
PSM       0(0)                 0(0)

Performance Benchmark of Dantzig Selector: In this part, we compare the timing performance of our algorithm with the R package "flare". We fix the sample size $n$ to be 200 and vary the data dimension $d$ from 100 to 5000. Again, each entry of $X$ is independent Gaussian, and $X$ is normalized such that its columns have uniform norm. We randomly select 2% of the entries of the vector $\theta$ to be nonzero, each drawn from $N(0, 1)$. We compute $y = X\theta + \epsilon$, with $\epsilon_i \sim N(0, 1)$, and try to recover the vector $\theta$ given $X$ and $y$. Our method stops when $\lambda$ is less than $2\sqrt{\log d/n}$, so that the full solution path for all values of $\lambda$ down to this value is computed by the parametric simplex method. In "flare", we estimate $\theta$ only at this single value of $\lambda$ in the Dantzig selector model. This means "flare" has a much lighter computational task than the parametric simplex method. As we can see in Table 2, our method nevertheless has much better performance than "flare" in terms of speed. We report the timing performance of the two algorithms in seconds, and each experiment is repeated 100 times. In practice, only very few iterations are required when the coefficient vector $\theta$ is sparse.

Table 2: Average timing performance (in seconds) with standard errors in the parentheses on Dantzig selector

          d = 500        d = 1000      d = 2000      d = 5000
Flare     19.5(2.72)     44.4(2.54)    142(11.5)     1500(231)
PSM       2.40(0.220)    29.7(1.39)    47.5(2.27)    649(89.8)

Performance Benchmark of Differential Network: We now apply this optimization method to the differential network model. We need the difference between two inverse covariance matrices to be sparse.
We generate $\Sigma^0_x = U^\top \Lambda U$, where $\Lambda \in \mathbb{R}^{d\times d}$ is a diagonal matrix whose entries are i.i.d. and uniform on $[1, 2]$, and $U \in \mathbb{R}^{d\times d}$ is a random matrix with i.i.d. entries from $N(0, 1)$. Let $\Delta_1 \in \mathbb{R}^{d\times d}$ be a random sparse symmetric matrix with a certain sparsity level, whose nonzero entries are i.i.d. from $N(0, 1)$. We set $\Delta = \Delta_1 + 2|\lambda_{\min}(\Delta_1)| I_d$ in order to guarantee the positive definiteness of $\Delta$, where $\lambda_{\min}(\Delta_1)$ is the smallest eigenvalue of $\Delta_1$. Finally, we let $\Omega^0_x = (\Sigma^0_x)^{-1}$ and $\Omega^0_y = \Omega^0_x + \Delta$.

We then generate data of sample size $n = 100$. The corresponding sample covariance matrices $S_X$ and $S_Y$ are computed from the data. We are not able to find other software which can efficiently solve this problem, so we only list the timing performance of our algorithm as the dimension $d$ varies from 25 to 200 in Table 3. We stop our algorithm whenever the solution achieves the desired sparsity level. When $d = 25$, 50, and 100, the sparsity level of $\Delta_1$ is set to 0.02, and when $d = 150$ and 200, the sparsity level of $\Delta_1$ is set to 0.002. Each experiment is repeated 100 times.

Table 3: Average timing performance (in seconds) and iteration numbers with standard errors in the parentheses on differential network

                    d = 25             d = 50         d = 100       d = 150        d = 200
Timing              0.0185(0.00689)    0.376(0.124)   6.81(2.38)    13.41(1.26)    46.88(7.24)
Iteration Number    15.5(7.00)         55.3(18.8)     164(58.2)     85.8(16.7)     140(26.2)

References

BANDYOPADHYAYA, S., MEHTA, M., KUO, D., SUNG, M.-K., CHUANG, R., JAEHNIG, E. J., BODENMILLER, B., LICON, K., COPELAND, W., SHALES, M., FIEDLER, D., DUTKOWSKI, J., GUÉNOLÉ, A., ATTIKUM, H. V., SHOKAT, K. M., KOLODNER, R. D., HUH, W.-K., AEBERSOLD, R., KEOGH, M.-C. and KROGAN, N. J. (2010). Rewiring of genetic networks in response to DNA damage. Science Signaling 330 1385–1389.

BÜHLMANN, P. and VAN DE GEER, S. (2011).
Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media.

CAI, T. and LIU, W. (2011). A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association 106 1566–1578.

CAI, T., LIU, W. and LUO, X. (2011). A constrained l1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association 106 594–607.

CANDES, E. and TAO, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics 35 2313–2351.

DANAHER, P., WANG, P. and WITTEN, D. M. (2013). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society, Series B 76 373–397.

DANTZIG, G. (1963). Linear Programming and Extensions. Princeton University Press.

DEMPSTER, A. (1972). Covariance selection. Biometrics 28 157–175.

GAI, Y., ZHU, L. and LIN, L. (2013). Model selection consistency of Dantzig selector. Statistica Sinica 615–634.

HUDSON, N. J., REVERTER, A. and DALRYMPLE, B. P. (2009). A differential wiring analysis of expression data correctly identifies the gene containing the causal mutation. PLoS Computational Biology 5.

IDEKER, T. and KROGAN, N. (2012). Differential network biology. Molecular Systems Biology 8 565.

LI, X., ZHAO, T., YUAN, X. and LIU, H. (2015). The flare package for high dimensional linear regression and precision matrix estimation in R. Journal of Machine Learning Research 16 553–557.

MURTY, K. (1983). Linear Programming. Wiley, New York, NY.

TIBSHIRANI, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58 267–288.

VANDERBEI, R. (1995). Linear Programming: Foundations and Extensions. Kluwer.

WANG, H., LI, G. and JIANG, G. (2007).
Robust regression shrinkage and consistent variable selection through the LAD-lasso. Journal of Business & Economic Statistics 25 347–355.

YAO, Y. and LEE, Y. (2014). Another look at linear programming for feature selection via methods of regularization. Statistics and Computing 24 885–905.

ZHAO, S. D., CAI, T. and LI, H. (2013). Direct estimation of differential networks. Biometrika 58 253–268.

ZHU, J., ROSSET, S., HASTIE, T. and TIBSHIRANI, R. (2004). 1-norm support vector machines. Advances in Neural Information Processing Systems 16.

ZOU, H. and HASTIE, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 67 301–320.