{"title": "Exact Recovery of Hard Thresholding Pursuit", "book": "Advances in Neural Information Processing Systems", "page_first": 3558, "page_last": 3566, "abstract": "The Hard Thresholding Pursuit (HTP) is a class of truncated gradient descent methods for finding sparse solutions of $\\ell_0$-constrained loss minimization problems. The HTP-style methods have been shown to have strong approximation guarantee and impressive numerical performance in high dimensional statistical learning applications. However, the current theoretical treatment of these methods has traditionally been restricted to the analysis of parameter estimation consistency. It remains an open problem to analyze the support recovery performance (a.k.a., sparsistency) of this type of methods for recovering the global minimizer of the original NP-hard problem. In this paper, we bridge this gap by showing, for the first time, that exact recovery of the global sparse minimizer is possible for HTP-style methods under restricted strong condition number bounding conditions. We further show that HTP-style methods are able to recover the support of certain relaxed sparse solutions without assuming bounded restricted strong condition number. Numerical results on simulated data confirms our theoretical predictions.", "full_text": "Exact Recovery of Hard Thresholding Pursuit\n\nXiao-Tong Yuan\n\nB-DAT Lab\n\nNanjing University of Info. Sci.&Tech.\n\nNanjing, Jiangsu, 210044, China\n\nxtyuan@nuist.edu.cn\n\nPing Li\u2020\u2021 Tong Zhang\u2020\n\n\u2020Depart. of Statistics and \u2021Depart. of CS\n\nRutgers University\n\nPiscataway, NJ, 08854, USA\n\n{pingli,tzhang}@stat.rutgers.edu\n\nAbstract\n\nThe Hard Thresholding Pursuit (HTP) is a class of truncated gradient descent\nmethods for \ufb01nding sparse solutions of \u21130-constrained loss minimization prob-\nlems. 
The HTP-style methods have been shown to have strong approximation guarantees and impressive numerical performance in high dimensional statistical learning applications. However, the current theoretical treatment of these methods has traditionally been restricted to the analysis of parameter estimation consistency. It remains an open problem to analyze the support recovery performance (a.k.a., sparsistency) of this type of method for recovering the global minimizer of the original NP-hard problem. In this paper, we bridge this gap by showing, for the first time, that exact recovery of the global sparse minimizer is possible for HTP-style methods under restricted strong condition number bounding conditions. We further show that HTP-style methods are able to recover the support of certain relaxed sparse solutions without assuming a bounded restricted strong condition number. Numerical results on simulated data confirm our theoretical predictions.

1 Introduction

In modern high dimensional data analysis tasks, a routinely faced challenge is that the number of collected samples is substantially smaller than the dimensionality of the features. In order to achieve consistent estimation in such small-sample-large-feature settings, additional assumptions need to be imposed on the model. Among others, the low-dimensional structure prior is the most popular assumption made in high dimensional analysis. This structure can often be captured by imposing a sparsity constraint on the model space, leading to the following ℓ0-constrained minimization problem:

    min_{x ∈ R^p} f(x),  s.t. ∥x∥0 ≤ k,    (1)

where f : R^p → R is a smooth convex loss function and ∥x∥0 denotes the number of nonzero entries in x. Due to the cardinality constraint, Problem (1) is not only non-convex, but also NP-hard in general (Natarajan, 1995).
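Throughout the paper, ∥x∥0 counts nonzero entries and the algorithms repeatedly truncate a vector to its k largest-magnitude entries (the operator supp(·, k) introduced in §1.3). As a minimal illustrative sketch (the function name `hard_threshold` is ours, not the paper's):

```python
import numpy as np

def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    if k > 0:
        idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of the top-k |entries|
        out[idx] = x[idx]
    return out
```

For example, `hard_threshold([3.0, -1.0, 0.5, -4.0], 2)` keeps only the entries 3.0 and -4.0, producing a 2-sparse vector.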
Thus, it is desirable to develop efficient computational procedures to approximately solve this problem.
When the loss function is the squared regression error, Problem (1) reduces to the compressive sensing problem (Donoho, 2006), for which a vast body of greedy selection algorithms have been proposed, including orthogonal matching pursuit (OMP) (Pati et al., 1993), compressed sampling matching pursuit (CoSaMP) (Needell & Tropp, 2009), hard thresholding pursuit (HTP) (Foucart, 2011) and iterative hard thresholding (IHT) (Blumensath & Davies, 2009), to name a few. The greedy algorithms designed for compressive sensing can usually be generalized to minimize non-quadratic loss functions (Shalev-Shwartz et al., 2010; Yuan & Yan, 2013; Bahmani et al., 2013). Compared with convex-relaxation-based methods (Beck & Teboulle, 2009; Agarwal et al., 2010), these greedy selection algorithms often exhibit similar accuracy guarantees but more attractive computational efficiency and scalability.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Recently, the HTP/IHT-style methods have gained significant interest and have been shown to offer the fastest and most scalable solutions in many cases (Yuan et al., 2014; Jain et al., 2014). The main theme of this class of methods is to iteratively perform a gradient descent step, followed by a truncation operation to preserve the most significant entries and an (optional) debiasing operation to minimize the loss over the selected entries. In (Blumensath, 2013; Yuan et al., 2014), the rate of convergence and parameter estimation error of HTP/IHT-style methods were established under proper Restricted Isometry Property (RIP) (or restricted strong condition number) bound conditions. Jain et al.
(2014) presented and analyzed several relaxed variants of HTP/IHT-style algorithms for which estimation consistency can be established without requiring the RIP conditions. Very recently, extensions of HTP/IHT-style methods to structured and stochastic sparse learning problems have been investigated in (Jain et al., 2016; Li et al., 2016; Shen & Li, 2016).

1.1 An open problem: exact recovery of HTP

In this paper, we are particularly interested in the exact recovery and support recovery performance of the HTP-style methods. A pseudo-code of HTP is outlined in Algorithm 1, which is also known as GraHTP in (Yuan et al., 2014). Although this type of method has been extensively analyzed in the original paper (Foucart, 2011) for compressive sensing and in several recent follow-up works (Yuan et al., 2014; Jain et al., 2014, 2016) for generic sparse minimization, the state of the art is only able to derive convergence rates and parameter estimation error bounds for HTP. It remains an open and challenging problem to analyze its ability to exactly recover the global sparse minimizer of Problem (1) in general settings. Actually, the support/structure recovery analysis is the main challenge in many important sparsity models including compressive sensing and graphical model learning (Jalali et al., 2011; Ravikumar et al., 2011): once the support is recovered, computing the actual nonzero coefficients just boils down to solving a convex minimization problem restricted to the supporting set.
Since the output of HTP is always k-sparse, the existing estimation error results in (Foucart, 2011; Yuan et al., 2014; Jain et al., 2014) naturally imply some support recovery conditions. For example, for perfect measurements, the results in (Foucart, 2011; Yuan et al., 2014) guarantee that HTP can exactly recover the underlying true sparse model parameters.
For noisy models, roughly speaking, as long as the smallest (in magnitude) nonzero entry of the k-sparse minimizer of (1) is larger than the estimation error bound of HTP, exact recovery of the minimizer can be guaranteed. However, these support recovery results implied by the estimation error bounds turn out to be loose when compared to the main results we will derive in the current paper.

Algorithm 1: Hard Thresholding Pursuit.
Input: Loss function f(x), sparsity level k, step-size η.
Output: x(t).
Initialization: x(0) = 0, t = 1.
repeat
    (S1) Compute x̃(t) = x(t−1) − η∇f(x(t−1));
    (S2) Select F(t) = supp(x̃(t), k), the indices of x̃(t) with the largest k absolute values;
    (S3) Compute x(t) = arg min{f(x) : supp(x) ⊆ F(t)};
    (S4) Update t ← t + 1;
until F(t) = F(t−1);

1.2 Overview of our results

The core contribution of this work is a deterministic support recovery analysis of HTP-style methods which, to our knowledge, has not been systematically conducted elsewhere in the literature. Our first result (see Theorem 1) shows that HTP as described in Algorithm 1 is able to exactly recover the k-sparse minimizer x⋆ = arg min_{∥x∥0≤k} f(x) if x⋆min, i.e., the smallest nonzero entry of x⋆, is significantly larger than ∥∇f(x⋆)∥∞ and a certain RIP-type condition is fulfilled as well. Moreover, the exact recovery is guaranteed within a finite number of iterations of Algorithm 1, with a geometric rate of convergence.
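To make Algorithm 1 concrete, here is a minimal sketch for the least squares loss f(x) = ∥v − Ux∥²/(2n), where the debiasing step (S3) reduces to an ordinary least-squares solve restricted to the selected support. The function name and the default step-size are our assumptions for illustration, not prescriptions from the paper:

```python
import numpy as np

def htp_least_squares(U, v, k, eta=0.5, max_iter=100):
    """Algorithm 1 (HTP) for the least-squares loss f(x) = ||v - U x||^2 / (2n)."""
    n, p = U.shape
    x = np.zeros(p)
    F_prev = None
    for _ in range(max_iter):
        grad = -U.T @ (v - U @ x) / n                           # (S1) gradient of f
        x_tilde = x - eta * grad
        F = np.sort(np.argpartition(np.abs(x_tilde), -k)[-k:])  # (S2) top-k support
        x = np.zeros(p)
        x[F] = np.linalg.lstsq(U[:, F], v, rcond=None)[0]       # (S3) debias on F
        if F_prev is not None and np.array_equal(F, F_prev):
            break                                               # stop when F(t) = F(t-1)
        F_prev = F
    return x, F
```

In a benign noiseless, well-conditioned regime this typically identifies the true support within a couple of iterations, consistent with the geometric convergence discussed above.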
Our second result (see Theorem 2) shows that the support recovery of an arbitrary k-sparse vector x̄ can be guaranteed if x̄min is well discriminated from √k ∥∇f(x̄)∥∞ or ∥∇f(x̄)∥∞, depending on the optimality of x̄ over its own supporting set. Our third result (see Theorem 3) shows that HTP is able to recover the support of a certain relaxed sparse minimizer x̄ with ∥x̄∥0 ≪ k under an arbitrary restricted strong condition number. More formally, given the restricted strong smoothness/convexity (see Definition 1) constants M2k and m2k, the recovery of supp(x̄) is possible if k ≥ (1 + 16M2k²/m2k²)k̄ and the smallest nonzero element of x̄ is significantly larger than the rooted objective value gap √(f(x̄) − f(x⋆)). The support recovery is also guaranteed within a finite number of iterations in this case. By specializing our deterministic analysis to least squares regression and logistic regression, we obtain sparsistency guarantees of HTP for these statistical learning examples. Monte-Carlo simulation results confirm our theoretical predictions.

Table 1: Comparison between our results and several prior results on HTP-style algorithms.

    Related Work        | Target Solution                                | Support Recovery | RIP Condition Free
    (Foucart, 2011)     | True k-sparse signal x                         | ×                | ×
    (Yuan et al., 2014) | Arbitrary x̄ with ∥x̄∥0 ≤ k                    | ×                | ×
    (Jain et al., 2014) | x̄ = arg min_{∥x∥0≤k̄} f(x) for proper k̄ ≪ k  | ×                | √
    Ours                | Arbitrary x̄ with ∥x̄∥0 ≤ k                    | √                | × (for ∥x̄∥0 = k), √ (for ∥x̄∥0 ≪ k)
Table 1 summarizes a high-level comparison between our work and the state-of-the-art analysis for HTP-style methods.

1.3 Notation and organization

Notation Let x ∈ R^p be a vector and F be an index set. We denote by [x]i the ith entry of the vector x, by xF the restriction of x to the index set F, and by xk the restriction of x to its top k (in absolute value) entries. The notation supp(x) represents the index set of nonzero entries of x and supp(x, k) represents the index set of the top k (in absolute value) entries of x. We conventionally define ∥x∥∞ = maxi |[x]i| and xmin = mini∈supp(x) |[x]i|.

Organization This paper proceeds as follows: In §2, we analyze the exact recovery performance of HTP. The applications of our analysis to least squares regression and logistic regression models are presented in §3. Monte-Carlo simulation results are reported in §4. We conclude this paper in §5. Due to the space limit, all technical proofs of our results are deferred to an appendix section included in the supplementary material.

2 A Deterministic Exact Recovery Analysis

In this section, we analyze the exact support recovery performance of HTP as outlined in Algorithm 1. At a high level, the theory developed in this section can be decomposed into the following three ingredients:

• First, we will investigate the support recovery behavior of the global k-sparse minimizer x⋆ = arg min_{∥x∥0≤k} f(x). The related result is summarized in Proposition 1.
• Second, we will present in Theorem 1 the guarantee of HTP for exactly recovering x⋆.
• Finally, by combining the above two results we will be able to establish the support recovery result of HTP in Theorem 2.
Furthermore, we derive an RIP-condition-free support recovery result in Theorem 3.

Our analysis relies on the conditions of Restricted Strong Convexity/Smoothness (RSC/RSS), which are conventionally used in previous analyses of HTP (Yuan et al., 2014; Jain et al., 2014).

Definition 1 (Restricted Strong Convexity/Smoothness). For any integer s > 0, we say f(x) is restricted ms-strongly convex and Ms-smooth if there exist ms, Ms > 0 such that

    (ms/2)∥x − y∥² ≤ f(x) − f(y) − ⟨∇f(y), x − y⟩ ≤ (Ms/2)∥x − y∥²,  ∀ ∥x − y∥0 ≤ s.    (2)

The ratio Ms/ms, which measures the curvature of the loss function over sparse subspaces, will be referred to as the restricted strong condition number in this paper.

2.1 Preliminary: Support recovery of x⋆

Given a target solution x̄, the following result establishes some sufficient conditions under which x⋆ exactly recovers the supporting set of x̄. A proof of this result is provided in Appendix B (see the supplementary file).

Proposition 1. Assume that f is M2k-smooth and m2k-strongly convex. Let x̄ be an arbitrary k-sparse vector.
Let x̄⋆ = arg min_{supp(x)⊆supp(x̄)} f(x) and let l̄ > 0 be a scalar such that

    f(x̄⋆) = f(x̄) + ⟨∇f(x̄), x̄⋆ − x̄⟩ + (l̄/2)∥x̄⋆ − x̄∥₁².

Then we have supp(x̄) = supp(x⋆) if either of the following two conditions is satisfied:

    (1) x̄min ≥ (2√(2k)/m2k) ∥∇f(x̄)∥∞;
    (2) x̄min ≥ (ϑ̄/M2k + (2ϑ̄ + 2)/l̄) ∥∇f(x̄)∥∞ and m2k/M2k ≥ max{√3/2, (3ϑ̄ + 1)/(4ϑ̄)}, for some ϑ̄ > 1.

Remark 1. The quantity l̄ actually measures the strong convexity of f at the point (x̄⋆ − x̄) in ℓ1-norm. From its definition we can verify that l̄ takes values in the interval [m2k/k, M2k] if x̄ ≠ x̄⋆. The closer l̄ is to M2k, the weaker the lower bound condition imposed on x̄min in condition (2). In (Nutini et al., 2015), a similar strong-convexity measurement has been defined over the entire vector space for a refined convergence analysis of coordinate descent methods. Different from (Nutini et al., 2015), we only require such an ℓ1-norm strong-convexity condition to hold at certain target points of interest. In particular, if x̄ = x̄⋆, i.e., x̄ is optimal over its own supporting set, then we may simply set l̄ = ∞ in Proposition 1.

2.2 Main results: Support recovery of HTP

Equipped with Proposition 1, it is straightforward to guarantee the support recovery of HTP once we derive sufficient conditions under which HTP exactly recovers x⋆.
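As an aside on Definition 1: for a quadratic loss f(x) = ½xᵀAx − bᵀx, the middle quantity in (2) equals ½(x − y)ᵀA(x − y), so ms and Ms are exactly the extreme eigenvalues over all s × s principal submatrices of A. For tiny p this can be checked by brute-force enumeration (an illustrative sketch of ours, not part of the paper's analysis):

```python
import itertools
import numpy as np

def restricted_constants(A, s):
    """m_s and M_s of Definition 1 for the quadratic f(x) = 0.5 x^T A x - b^T x.

    For this f, the gap in (2) is 0.5 (x-y)^T A (x-y) with (x-y) s-sparse, so
    m_s / M_s are the min/max eigenvalues over all s x s principal submatrices.
    """
    p = A.shape[0]
    m_s, M_s = np.inf, -np.inf
    for S in itertools.combinations(range(p), s):
        eig = np.linalg.eigvalsh(A[np.ix_(S, S)])  # eigenvalues, ascending
        m_s = min(m_s, eig[0])
        M_s = max(M_s, eig[-1])
    return m_s, M_s
```

This also illustrates why the restricted strong condition number Ms/ms can be far smaller than the full condition number of A: only small principal submatrices enter the definition.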
Denote F⋆ = supp(x⋆). Intuitively, x⋆min should be significantly larger than ∥∇f(x⋆)∥∞ for HTP to be attracted to, and remain at, x⋆ (see Lemma 5 in Appendix B for a formal elaboration). The exact recovery analysis also relies on the following quantity △−⋆, which measures the gap between the minimal k-sparse objective value f(x⋆) and the remaining ones over supporting sets other than supp(x⋆):

    △−⋆ := f(x−⋆) − f(x⋆),  where x−⋆ = arg min_{∥x∥0≤k, supp(x)≠supp(x⋆), f(x)>f(x⋆)} f(x).

Intuitively, the larger △−⋆ is, the easier and faster x⋆ can be recovered by HTP. It is also reasonable to expect that the step-size η should be well bounded away from zero to avoid undesirable early stopping. Inspired by these intuitive points, we present the following theorem which guarantees the exact recovery of HTP when the restricted strong condition number is well bounded. A proof of this theorem is provided in Appendix C (see the supplementary file).

Theorem 1. Assume that f is M2k-smooth and m2k-strongly convex. Assume that ϑ⋆ := M2k x⋆min/∥∇f(x⋆)∥∞ > 1 and m2k/M2k ≥ (7ϑ⋆ + 1)/(8ϑ⋆). If we set the step-size to be η = m2k/M2k², then the optimal k-sparse solution x⋆ is unique and HTP terminates with output x(t) = x⋆ after at most

    t = ⌈ M2k³/(m2k²(M2k − m2k)) · ln(△(0)/△−⋆) ⌉

steps of iteration, where △(0) = f(x(0)) − f(x⋆) and △−⋆ = min_{∥x∥0≤k, supp(x)≠supp(x⋆), f(x)>f(x⋆)} {f(x) − f(x⋆)}.

Remark 2. Theorem 1 suggests that HTP is able to exactly recover x⋆ provided that x⋆min is strictly larger than ∥∇f(x⋆)∥∞/M2k and the restricted strong condition number is well bounded, i.e., M2k/m2k ≤ 8ϑ⋆/(7ϑ⋆ + 1) < 1.14.

As a consequence of Proposition 1 and Theorem 1, the following theorem establishes the performance of HTP for recovering the support of an arbitrary k-sparse vector. A proof of this result is provided in Appendix D (see the supplementary file).

Theorem 2. Let x̄ be an arbitrary k-sparse vector and let l̄ be defined as in Proposition 1. Assume that the conditions in Theorem 1 hold. Then HTP will output x(t) satisfying supp(x(t)) = supp(x̄) within a finite number of iterations, provided that either of the following two conditions is satisfied in addition:

    (1) x̄min ≥ (2√(2k)/m2k) ∥∇f(x̄)∥∞;
    (2) x̄min ≥ (ϑ⋆/M2k + (2ϑ⋆ + 2)/l̄) ∥∇f(x̄)∥∞.

In the following theorem, we further show that for proper k̄ < k, the HTP method is able to recover the support of certain desired k̄-sparse vectors without assuming a bounded restricted strong condition number. A proof of this theorem can be found in Appendix E (see the supplementary file).

Theorem 3. Assume that f is M2k-smooth and m2k-strongly convex. Let x̄ be an arbitrary k̄-sparse vector satisfying k ≥ (1 + 16M2k²/m2k²)k̄. Set the step-size to be η = 1/(2M2k).

    (a) If x̄min > √(2(f(x̄) − f(x⋆))/m2k), then HTP will terminate within a finite number of iterations with output x(t) satisfying supp(x̄) ⊆ supp(x(t)).

    (b) Furthermore, if x̄min > 1.62√(2(f(x̄) − f(x⋆))/m2k), then HTP will terminate within a finite number of iterations with output x(t) satisfying supp(x(t), k̄) = supp(x̄).

Remark 3. The main message conveyed by part (a) of Theorem 3 is: if the nonzero elements of x̄ are significantly larger than the rooted objective value gap √(f(x̄) − f(x⋆)), then supp(x̄) ⊆ supp(x(t)) can be guaranteed by HTP with a sufficiently large sparsity level k. Intuitively, the closer f(x̄) is to f(x⋆), the more easily the conditions can be satisfied. Given that f(x̄) is close enough to the unconstrained global minimum of f (i.e., the global minimizer of f is nearly sparse), we will have f(x̄) close enough to f(x⋆), since f(x̄) − f(x⋆) ≤ f(x̄) − min_x f(x). In the ideal case where the sparse vector x̄ is an unconstrained minimum of f, we have f(x̄) = f(x⋆), and thus supp(x̄) ⊆ supp(x(t)) holds under an arbitrarily large restricted strong condition number.
Part (b) of Theorem 3 shows that under almost identical conditions (up to a slightly increased numerical constant) to those in part (a), HTP will output x(t) whose top k̄ entries are exactly the supporting set of x̄.
The implication of this result is: in order to recover a certain k̄-sparse signal, one may run HTP with a properly relaxed sparsity level k until convergence and then preserve the top k̄ entries of the k-sparse output as the final k̄-sparse solution.

2.3 Comparison against prior results

It is interesting to compare our support recovery results with those implied by the parameter estimation error bounds obtained in prior work (Yuan et al., 2014; Jain et al., 2014). Actually, a parameter estimation error bound naturally leads to the so-called x-min condition which is key to the support recovery analysis. For example, it can be derived from the bounds in (Yuan et al., 2014) that under a proper RIP condition, ∥x(t) − x̄∥ = O(√k ∥∇f(x̄)∥∞) when t is sufficiently large. This implies that as long as x̄min is significantly larger than such an estimation error bound, exact recovery of x̄ can be guaranteed. In the meantime, the results in (Jain et al., 2014) show that for some k̄-sparse minimizer of (1) with k̄ = O((m2k/M2k)² k), it holds for an arbitrary restricted strong condition number that ∥x(t) − x̄∥ = O(√k ∥∇f(x̄)∥∞) when t is sufficiently large. Provided that x̄min is significantly larger than such an error bound, it will hold true that supp(x̄) ⊆ supp(x(t)). Table 2 summarizes our support recovery results and those implied by the state-of-the-art results regarding target solution, dependency on RIP-type conditions, and x-min condition. From this table, we can see that the x-min condition in Theorem 1 for recovering the global minimizer x⋆ is weaker than those implied in (Yuan et al., 2014) in the sense that the former does not depend on a factor √k. Also, our x-min condition in Theorem 3 is weaker than those implied in (Jain et al., 2014) because: 1) our bound O(√(f(x̄) − f(x⋆))) does not explicitly depend on a multiplier √k; and 2) it can be verified from the restricted strong convexity of f that √(f(x̄) − f(x⋆)) ≤ √k ∥∇f(x̄)∥∞/√(2m2k).

Table 2: Comparison between our support recovery conditions and those implied by the existing estimation error bounds for HTP-style methods.

    Results              | Target Solution            | RIP Cond. | X-min Condition
    (Yuan et al., 2014)  | Arbitrary k-sparse x̄      | Required  | x̄min > O(√k ∥∇f(x̄)∥∞)
    (Jain et al., 2014)  | ∥x̄∥0 = O((m2k/M2k)² k)   | Free      | x̄min > O(√k ∥∇f(x̄)∥∞)
    Theorem 1            | x⋆ = arg min_{∥x∥0≤k} f(x) | Required | x⋆min > O(∥∇f(x⋆)∥∞)
    Theorem 2            | Arbitrary k-sparse x̄      | Required  | x̄min > O(√k ∥∇f(x̄)∥∞) or x̄min > O(∥∇f(x̄)∥∞)
    Theorem 3            | ∥x̄∥0 = O((m2k/M2k)² k)   | Free      | x̄min > O(√(f(x̄) − f(x⋆)))

It is also interesting to compare the support recovery result in Proposition 1 with those known for the following ℓ1-regularized estimator:

    min_{x ∈ R^p} f(x) + λ∥x∥1,

where λ is the regularization strength parameter. Recently, a unified sparsistency analysis for this type of convex-relaxed estimator was provided in (Li et al., 2015).
We summarize below a comparison between our Proposition 1 and the state-of-the-art results in (Li et al., 2015) with respect to several key conditions:

• Local structured smoothness/convexity condition: Our analysis only requires first-order local structured smoothness/convexity conditions (i.e., RSC/RSS), while the analysis in (Li et al., 2015, Theorem 5.1, Condition 1) relies on certain second-order and third-order local structured smoothness conditions.
• Irrepresentability condition: Our analysis is free of the irrepresentability condition which is usually required to guarantee the sparsistency of ℓ1-regularized estimators (Li et al., 2015, Theorem 5.1, Condition 3).
• RIP-type condition: The analysis in (Li et al., 2015) is free of RIP-type conditions, while ours partially relies on such a condition (see Condition (2) of Proposition 1).
• X-min condition: Compared with the x-min condition required in (Li et al., 2015, Theorem 5.1, Condition 4), which is of order O(√k ∥∇f(x̄)∥∞), the x-min condition (1) in Proposition 1 is of the same order, while the x-min condition (2) is weaker as it does not explicitly depend on √k.

3 Applications to Statistical Learning Models

In this section, we apply our support recovery analysis to several sparse statistical learning models, deriving concrete sparsistency conditions in each case. Given a set of n independently drawn data samples {(u(i), v(i))}_{i=1}^n, we are interested in the following sparsity-constrained empirical loss minimization problem:

    min_w f(w) := (1/n) Σ_{i=1}^n ℓ(w⊤u(i), v(i)),  subject to ∥w∥0 ≤ k,

where ℓ(·,·) is a loss function measuring the discrepancy between prediction and response and w is the set of parameters to be estimated.
In the subsequent subsections, we will investigate sparse linear regression and sparse logistic regression as two popular examples of the above formulation.

3.1 Sparsity-constrained linear regression

Given a k̄-sparse parameter vector w̄, let us consider samples generated according to the linear model v(i) = w̄⊤u(i) + ε(i), where the ε(i) are n i.i.d. sub-Gaussian random variables with parameter σ. The sparsity-constrained least squares linear regression model is then given by:

    min_w f(w) = (1/2n) Σ_{i=1}^n (v(i) − w⊤u(i))²,  subject to ∥w∥0 ≤ k.    (3)

Suppose the u(i) are drawn from a Gaussian distribution with covariance Σ. Then it holds with high probability that f(w) has RSC constant m2k ≥ λmin(Σ) − O(k log p/n) and RSS constant M2k ≤ λmax(Σ) + O(k log p/n), and ∥∇f(w̄)∥∞ = O(σ√(log p/n)). From Theorem 2 we know that for sufficiently large n, if the condition number λmax(Σ)/λmin(Σ) is well bounded and w̄min > O(σ√(k̄ log p/n)), then supp(w̄) can be recovered by HTP after sufficiently many iterations. Since the ε(i) are sub-Gaussian, f(w̄) = (1/2n) Σ_{i=1}^n (ε(i))² ≤ σ² holds with high probability. From Theorem 3 we can see that if w̄min > 1.62σ√(2/m2k), then supp(w̄) can be recovered, with high probability, by HTP with a sufficiently large sparsity level and a k̄-sparse truncation postprocessing.

3.2 Sparsity-constrained logistic regression

Logistic regression is one of the most popular models in statistical learning.
In this model the relation between the random feature vector u ∈ R^p and its associated random binary label v ∈ {−1, +1} is determined by the conditional probability P(v|u; w̄) = exp(2v w̄⊤u)/(1 + exp(2v w̄⊤u)). Given a set of n independently drawn data samples {(u(i), v(i))}_{i=1}^n, the sparse logistic regression model learns the parameters w so as to minimize the negative logistic log-likelihood under the sparsity constraint:

    min_w f(w) = (1/n) Σ_{i=1}^n log(1 + exp(−2v(i) w⊤u(i))),  subject to ∥w∥0 ≤ k.    (4)

It has been shown in (Bahmani et al., 2013, Corollary 1) that under mild conditions, f(w) has RSC and RSS with overwhelming probability. Suppose the u(i) are sub-Gaussian with parameter σ; then it is known from (Yuan et al., 2014) that ∥∇f(w̄)∥∞ = O(σ√(log p/n)). From Theorem 2 we then know that if the restricted strong condition number is well bounded and w̄min > O(σ√(k̄ log p/n)), then supp(w̄) can be recovered by HTP after sufficiently many iterations.
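The loss (4) and its gradient, which drive steps (S1) and (S3) of HTP in this model, can be written compactly. This is an illustrative sketch of ours (the function name and the finite-difference check below are not from the paper):

```python
import numpy as np

def logistic_loss_grad(w, U, v):
    """Loss (4): f(w) = mean(log(1 + exp(-2 v_i <w, u_i>))) and its gradient.

    U holds one sample u_i per row; each label v_i is in {-1, +1}.
    """
    z = -2.0 * v * (U @ w)                    # n-vector of margins
    loss = np.mean(np.logaddexp(0.0, z))      # numerically stable log(1 + e^z)
    sig = 1.0 / (1.0 + np.exp(-z))            # sigmoid(z)
    grad = -(2.0 / len(v)) * (U.T @ (v * sig))
    return loss, grad
```

A quick way to validate such a gradient is a central finite-difference check against the loss, which is what the accompanying test does.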
By using Theorem 3 and the fact that √(f(w̄) − f(w⋆)) = O(√k ∥∇f(w̄)∥∞), we can show that if w̄min > O(σ√(k̄ log p/n)), then with high probability, supp(w̄) can be recovered by HTP using a sufficiently large sparsity level k and proper postprocessing, without assuming a bounded sparse condition number.

4 Numerical Results

In this section, we conduct a group of Monte-Carlo simulation experiments on sparse linear regression and sparse logistic regression models to verify our theoretical predictions.

Data generation: We consider a synthetic data model in which the sparse parameter w̄ is a p = 500 dimensional vector that has k̄ = 50 nonzero entries drawn independently from the standard Gaussian distribution. Each data sample u is a normally distributed dense vector. For the linear regression model, the responses are generated by v = u⊤w̄ + ε, where ε is standard Gaussian noise. For the logistic regression model, the data labels, v ∈ {−1, 1}, are generated randomly according to the Bernoulli distribution P(v = 1|u; w̄) = exp(2w̄⊤u)/(1 + exp(2w̄⊤u)). We allow the sample size n to vary, and for each n we generate 100 random copies of the data independently.

Evaluation metric: In our experiments, we test HTP with varying sparsity level k ≥ k̄. We use two metrics to measure the support recovery performance.
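The two success metrics defined next (relaxed success supp(w̄) ⊆ supp(w(t)), exact success supp(w̄) = supp(w(t), k̄)) translate directly into code. A small sketch of ours (function names are not from the paper):

```python
import numpy as np

def supp(x, k=None):
    """Index set of the nonzeros of x, or of its top-k largest-magnitude entries."""
    x = np.asarray(x, dtype=float)
    if k is None:
        return set(int(i) for i in np.nonzero(x)[0])
    return set(int(i) for i in np.argpartition(np.abs(x), -k)[-k:])

def relaxed_success(w_bar, w_hat):
    return supp(w_bar) <= supp(w_hat)           # supp(w_bar) is a subset of supp(w_hat)

def exact_success(w_bar, w_hat, k_bar):
    return supp(w_bar) == supp(w_hat, k_bar)    # top-k_bar entries match exactly
```

Averaging these booleans over independent trials gives the success percentages reported below.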
We say a relaxed support recovery is successful if supp(w̄) ⊆ supp(w(t)), and an exact support recovery is successful if supp(w̄) = supp(w(t), k̄). We replicate the experiment over the 100 trials and record the percentage of successful relaxed support recoveries and the percentage of successful exact support recoveries for each configuration of (n, k).

Results: Figure 1 shows the percentage of relaxed/exact success curves as functions of the sample size n under varying sparsity level k. From the curves in Figure 1(a) for the linear regression model we can make two observations: 1) for each curve, the chance of success increases as n increases, which matches the results in Theorem 1 and Theorem 2; 2) HTP has the best performance when using sparsity level k = 70 > k̄. It can also be seen that the percentage of relaxed success is less sensitive to k than the percentage of exact success. These observations match the prediction of Theorem 3. Similar observations can be made from the curves in Figure 1(b) for the logistic regression model.

Figure 1: Chance of relaxed success (left panel) and exact success (right panel) curves for the (a) linear regression and (b) logistic regression models.

We have also compared HTP with IHT (Blumensath & Davies, 2009) in support recovery performance. Note that IHT is a simplified variant of HTP without the debiasing operation (S3) in Algorithm 1. Our exact support recovery analysis for HTP builds heavily upon this debiasing operation. Figure 2 shows the chance of success curves for these two methods with sparsity level k = 70.

Figure 2: HTP versus IHT: Chance of relaxed and exact success of support recovery.
Figure 2(a) shows that in the linear regression model, HTP is superior to IHT when the sample size n is relatively small, and the two become comparable as n increases. Figure 2(b) indicates that HTP slightly outperforms IHT when applied to the considered logistic regression task. From this group of results we can conclude that the debiasing step of HTP does have a significant impact on improving the support recovery performance, especially in small-sample settings.

5 Conclusions

In this paper, we provided a deterministic support recovery analysis for HTP-style methods widely used in sparse learning. Theorem 1 establishes sufficient conditions for exactly recovering the global k-sparse minimizer x⋆ of the NP-hard problem (1). Theorem 2 provides sufficient conditions to guarantee the support recovery of an arbitrary k-sparse target solution. Theorem 3 further shows that even when the restricted strong condition number can be arbitrarily large, HTP is still able to recover a target sparse solution by using a properly relaxed sparsity level in the algorithm. We have applied our deterministic analysis to sparse linear regression and sparse logistic regression to establish the sparsistency of HTP in these statistical learning models. Based on our theoretical justification and numerical observations, we conclude that HTP-style methods are not only accurate in parameter estimation, but also powerful for exactly recovering sparse signals, even in noisy settings.

Acknowledgments

Xiao-Tong Yuan and Ping Li were partially supported by NSF-Bigdata-1419210, NSF-III-1360971, ONR-N00014-13-1-0764, and AFOSR-FA9550-13-1-0137. Xiao-Tong Yuan is also partially supported by NSFC-61402232, NSFC-61522308, and NSFJP-BK20141003. Tong Zhang is supported by NSF-IIS-1407939 and NSF-IIS-1250985.

References

Agarwal, A., Negahban, S., and Wainwright, M. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In Proceedings of the 24th Annual Conference on Neural Information Processing Systems (NIPS'10), 2010.

Bahmani, S., Raj, B., and Boufounos, P. Greedy sparsity-constrained optimization. Journal of Machine Learning Research, 14:807–841, 2013.

Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

Blumensath, T. Compressed sensing with nonlinear observations and related nonlinear optimization problems. IEEE Transactions on Information Theory, 59(6):3466–3474, 2013.

Blumensath, T. and Davies, M. E. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.

Donoho, D. L. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

Foucart, S. Hard thresholding pursuit: An algorithm for compressive sensing. SIAM Journal on Numerical Analysis, 49(6):2543–2563, 2011.

Jain, P., Rao, N., and Dhillon, I. Structured sparse regression via greedy hard-thresholding. 2016. URL http://arxiv.org/pdf/1602.06042.pdf.

Jain, P., Tewari, A., and Kar, P. On iterative hard thresholding methods for high-dimensional M-estimation. In Proceedings of the 28th Annual Conference on Neural Information Processing Systems (NIPS'14), pp. 685–693, 2014.

Jalali, A., Johnson, C. C., and Ravikumar, P. K. On learning discrete graphical models using greedy methods. In Proceedings of the 25th Annual Conference on Neural Information Processing Systems (NIPS'11), 2011.

Li, X., Zhao, T., Arora, R., Liu, H., and Haupt, J. Stochastic variance reduced optimization for nonconvex sparse learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML'16), 2016.

Li, Y.-H., Scarlett, J., Ravikumar, P., and Cevher, V. Sparsistency of ℓ1-regularized M-estimators. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS'15), 2015.

Natarajan, B. K. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.

Needell, D. and Tropp, J. A. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009.

Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004. ISBN 978-1402075537.

Nutini, J., Schmidt, M. W., Laradji, I. H., Friedlander, M. P., and Koepke, H. A. Coordinate descent converges faster with the Gauss-Southwell rule than random selection. In Proceedings of the 32nd International Conference on Machine Learning (ICML'15), pp. 1632–1641, 2015.

Pati, Y. C., Rezaiifar, R., and Krishnaprasad, P. S. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Annual Asilomar Conference on Signals, Systems, and Computers, pp. 40–44, 1993.

Ravikumar, P., Wainwright, M. J., Raskutti, G., and Yu, B. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.

Shalev-Shwartz, S., Srebro, N., and Zhang, T. Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 20:2807–2832, 2010.

Shen, J. and Li, P. A tight bound of hard thresholding. 2016. URL http://arxiv.org/pdf/1605.01656.pdf.

Yuan, X.-T. and Yan, S. Forward basis selection for pursuing sparse representations over a dictionary. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):3025–3036, 2013.

Yuan, X.-T., Li, P., and Zhang, T. Gradient hard thresholding pursuit for sparsity-constrained optimization. In Proceedings of the 31st International Conference on Machine Learning (ICML'14), 2014.