{"title": "Partial Hard Thresholding: Towards A Principled Analysis of Support Recovery", "book": "Advances in Neural Information Processing Systems", "page_first": 3124, "page_last": 3134, "abstract": "In machine learning and compressed sensing, it is of central importance to understand when a tractable algorithm recovers the support of a sparse signal from its compressed measurements. In this paper, we present a principled analysis on the support recovery performance for a family of hard thresholding algorithms. To this end, we appeal to the partial hard thresholding (PHT) operator proposed recently by Jain et al. [IEEE Trans. Information Theory, 2017]. We show that under proper conditions, PHT recovers an arbitrary \"s\"-sparse signal within O(s kappa log kappa) iterations where \"kappa\" is an appropriate condition number. Specifying the PHT operator, we obtain the best known result for hard thresholding pursuit and orthogonal matching pursuit with replacement. Experiments on the simulated data complement our theoretical findings and also illustrate the effectiveness of PHT compared to other popular recovery methods.", "full_text": "Partial Hard Thresholding: Towards A Principled\n\nAnalysis of Support Recovery\n\nJie Shen\n\nPing Li\n\nDepartment of Computer Science\n\nDepartment of Statistics and Biostatistics\n\nSchool of Arts and Sciences\n\nDepartment of Computer Science\n\nRutgers University\nNew Jersey, USA\n\njs2007@rutgers.edu\n\nRutgers University\nNew Jersey, USA\n\npingli@stat.rutgers.edu\n\nAbstract\n\nIn machine learning and compressed sensing, it is of central importance to under-\nstand when a tractable algorithm recovers the support of a sparse signal from its\ncompressed measurements. In this paper, we present a principled analysis on the\nsupport recovery performance for a family of hard thresholding algorithms. To\nthis end, we appeal to the partial hard thresholding (PHT) operator proposed re-\ncently by Jain et al. [IEEE Trans. 
Information Theory, 2017]. We show that under\nproper conditions, PHT recovers an arbitrary s-sparse signal within O(s\u03ba log \u03ba)\niterations where \u03ba is an appropriate condition number. Specifying the PHT opera-\ntor, we obtain the best known results for hard thresholding pursuit and orthogonal\nmatching pursuit with replacement. Experiments on the simulated data comple-\nment our theoretical \ufb01ndings and also illustrate the effectiveness of PHT.\n\n1 Introduction\n\nThis paper is concerned with the problem of recovering an arbitrary sparse signal from a set of its\n\n(compressed) measurements. We say that a signal \u00afx \u2208 Rd is s-sparse if there are no more than s\nnon-zeros in \u00afx. This problem, together with its many variants, have found a variety of successful\napplications in compressed sensing, machine learning and statistics. Of particular interest is the\nsetting where \u00afx is the true signal and only a small number of linear measurements are given, referred\nto as compressed sensing. Such instance has been exhaustively studied in the last decade, along with\na large body of elegant work devoted to ef\ufb01cient algorithms including \u21131-based convex optimization\nand hard thresholding based greedy pursuits [7, 6, 15, 8, 3, 5, 11]. Another quintessential example is\nthe sparsity-constrained minimization program recently considered in machine learning [30, 2, 14,\n22], for which the goal is to ef\ufb01ciently learn the global sparse minimizer \u00afx from a set of training\ndata. Though in most cases, the underlying signal can be categorized into either of the two classes,\nwe note that it could also be other object such as the parameter of logistic regression [19]. 
Hence, for\na uni\ufb01ed analysis, this paper copes with an arbitrary sparse signal and the results to be established\nquickly apply to the special instances above.\n\nIt is also worth mentioning that while one can characterize the performance of an algorithm and\ncan evaluate the obtained estimate from various aspects, we are speci\ufb01cally interested in the qual-\nity of support recovery. Recall that for sparse recovery problems, there are two prominent metrics:\nthe \u21132 distance and the support recovery. Theoretical results phrased in terms of the \u21132 metric is\nalso referred to as parameter estimation, on which most of the previous papers emphasized. Under\nthis metric, many popular algorithms, e.g., the Lasso [24, 27] and hard thresholding based algo-\nrithms [9, 3, 15, 8, 10, 22], are guaranteed with accurate approximation up to the energy of noise.\nSupport recovery is another important factor to evaluate an algorithm, which is also known as feature\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fselection or variable selection. As one of the earliest work, [25] offered suf\ufb01cient and necessary con-\nditions under which orthogonal matching pursuit and basis pursuit identify the support. The theory\nwas then developed by [35, 32, 27] for the Lasso estimator and by [29] for the garrotte estimator.\n\nTypically, recovering the support of a target signal is more challenging than parameter estimation.\nFor instance, [18] showed that the restricted eigenvalue condition suf\ufb01ces for the Lasso to produce\nan accurate estimate whereas in order to recover the sign pattern, a more stringent mutual incoher-\nence condition has to be imposed [27]. However, as has been recognized, if the support is detected\nprecisely by a method, then the solution admits the optimal statistical rate [27]. 
In this regard, re-\nsearch on support recovery continues to be a central theme in recent years [33, 34, 31, 4, 17]. Our\nwork follows this line and studies the support recovery performance of hard thresholding based algo-\nrithms, which enjoy superior computational ef\ufb01ciency to the convex programs when manipulating a\nhuge volume of data [26].\n\nWe note that though [31, 4] have carried out theoretical understanding for hard thresholding pur-\nsuit (HTP) [10], showing that HTP identi\ufb01es the support of a signal within a few iterations, neither\nof them obtained the general results in this paper. To be more detailed, under the restricted isome-\ntry property (RIP) condition [6], our iteration bound holds for an arbitrary sparse signal of interest,\nwhile the results from [31, 4] hold either for the global sparse minimizer or for the true sparse signal.\nUsing a relaxed sparsity condition, we obtain a clear iteration complexity O(s\u03ba log \u03ba) where \u03ba is\na proper condition number. In contrast, it is hard to quantify the bound of [31] (see Theorem 3\ntherein). From the algorithmic perspective, we consider a more general algorithm than HTP. In fact,\nwe appeal to the recently proposed partial hard thresholding (PHT) operator [13] and demonstrate\nnovel results, which in turn indicates the best known iteration complexity for HTP and orthogonal\nmatching pursuit with replacement (OMPR) [12]. Thereby, the results in this paper considerably\nextend our earlier work on HTP [23]. It is also worth mentioning that, though our analysis hinges on\nthe PHT operator, the support recovery results to be established are stronger than the results in [13]\nsince they only showed parameter estimation of PHT. Finally, we remark that while a couple of\nprevious work considered signals that are not exactly sparse (e.g., [4]), we in this paper focus on the\nsparse case. Extensions to the generic signals are left as interesting future directions.\n\nContribution. 
The contribution of this paper is summarized as follows. We study the iteration complexity of the PHT algorithm, and show that under the RIP condition or the relaxed sparsity condition (to be clarified), PHT recovers the support of an arbitrary s-sparse signal within O(sκ log κ) iterations. This strengthens the theoretical results of [13], where only parameter estimation of PHT was established. Thanks to the generality of the PHT operator, our results shed light on the support recovery performance of a family of prevalent iterative algorithms. As two extreme cases of PHT, the new results immediately apply to HTP and OMPR, and imply the best known bound.

Roadmap. The remainder of the paper is organized as follows. We describe the problem setting, as well as the partial hard thresholding operator, in Section 2, followed by the main results regarding the iteration complexity. In Section 3, we sketch the proof of the main results and list some useful lemmas which might be of independent interest. Numerical results are illustrated in Section 4, and Section 5 concludes the paper and poses several interesting directions for future work. The detailed proofs of our theoretical results are deferred to the appendix (see the supplementary file).

Notation. We collect the notation used in this paper. The capital letter C and its subscript variants (e.g., C1) denote absolute constants whose values may change from appearance to appearance. For a vector x ∈ R^d, its ℓ2 norm is denoted by ‖x‖. The support set of x is denoted by supp(x), which indexes the non-zero elements of x. With a slight abuse of notation, supp(x, k) is the set of indices of the k largest (in magnitude) elements of x; ties are broken lexicographically. We interchangeably write ‖x‖_0 or |supp(x)| to signify the cardinality of supp(x). We will also consider a vector restricted to a support set. That is, for a d-dimensional vector x and a support set T ⊂ {1, 2, ..., d}, depending on the context, x_T can either be the |T|-dimensional vector obtained by extracting the elements belonging to T, or the d-dimensional vector obtained by setting the elements outside T to zero. The complement of a set T is denoted by T̄.

We reserve x̄ ∈ R^d for the target s-sparse signal, whose support is denoted by S. The quantity x̄min > 0 is the minimum absolute element of x̄_S, where we recall that x̄_S ∈ R^s consists of the non-zeros of x̄. The PHT algorithm will depend on a carefully chosen function F(x). We write its gradient as ∇F(x), and we use ∇_k F(x) as a shorthand for (∇F(x))_{supp(∇F(x), k)}, i.e., the top k absolute components of ∇F(x).

2 Partial Hard Thresholding

To pursue a sparse solution, hard thresholding has been broadly invoked by many popular greedy algorithms. In the present work, we are interested in the partial hard thresholding operator, which sheds light on a unified design and analysis for iterative algorithms employing either this operator or the standard hard thresholding operator [13]. Formally, given a support set T and a freedom parameter r > 0, the PHT operator, which produces a k-sparse approximation to z, is defined as follows:

PHT_k(z; T, r) := argmin_{x ∈ R^d} ‖x − z‖,  s.t.  ‖x‖_0 ≤ k,  |T \ supp(x)| ≤ r.    (1)

The first constraint simply enforces a k-sparse solution. To gain intuition on the second one, consider that T is the support set of the last iterate of an iterative algorithm, for which |T| ≤ k. Then the second constraint ensures that the new support set differs from the previous one in at most r positions. As a special case, one may notice that the PHT operator reduces to standard hard thresholding when picking the freedom parameter r ≥ k. 
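To make the operator concrete, here is a minimal NumPy sketch of the standard hard thresholding operator HT_k that PHT reduces to when r ≥ k (the function name and the stable tie-breaking choice are ours, not from the paper):

```python
import numpy as np

def hard_threshold(z, k):
    """HT_k: keep the k largest-magnitude entries of z and set the rest to zero.
    A stable sort keeps the lowest index on ties, mimicking lexicographic tie-breaking."""
    x = np.zeros_like(z)
    if k <= 0:
        return x
    top = np.argsort(-np.abs(z), kind="stable")[:k]  # indices of the k largest |z_i|
    x[top] = z[top]
    return x
```

For instance, `hard_threshold(np.array([3., -1., 0.5, -4., 2.]), 2)` keeps only the entries at indices 0 and 3.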
At the other end of the spectrum, if we look at the case r = 1, the PHT operator yields the interesting algorithm termed orthogonal matching pursuit with replacement [12], which in general replaces one element per iteration.

It has been shown in [13] that the PHT operator can be computed in an efficient manner for a general support set T and freedom parameter r. In this paper, our major focus will be on the case |T| = k.¹ Lemma 1 of [13] then indicates that PHT_k(z; T, r) is given as follows:

top = supp(z_{T̄}, r),  PHT_k(z; T, r) = HT_k(z_{T∪top}),    (2)

where HT_k(·) is the standard hard thresholding operator that sets all but the k largest absolute components of a vector to zero.

Equipped with the PHT operator, we are now in the position to describe a general iterative greedy algorithm, termed PHT(r), where r is the freedom parameter in (1). At the t-th iteration, the algorithm reveals the last iterate x^{t−1} as well as its support set S^{t−1}, and returns a new solution as follows:

z^t = x^{t−1} − η∇F(x^{t−1}),
y^t = PHT_k(z^t; S^{t−1}, r),  S^t = supp(y^t),
x^t = argmin_{x ∈ R^d} F(x),  s.t.  supp(x) ⊂ S^t.

Above, η > 0 is a step size and F(x) is a proxy function which should be carefully chosen (to be clarified later). Typically, the sparsity parameter k equals s, the sparsity of the target signal x̄. In this paper, we consider a more general choice of k, which leads to novel results. For further clarity, several comments on F(x) are in order.

First, one may have observed that in the context of sparsity-constrained minimization, the proxy function F(x) used above is chosen as the objective function [30, 14]. In that scenario, the target signal is a global optimum and PHT(r) proceeds as projected gradient descent. 
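To illustrate the pieces just described, the following is a hedged NumPy sketch of the PHT operator (via the closed form (2)) and of a single PHT(r) iteration, instantiated with a least-squares proxy F(x) = ½‖y − Ax‖²; the function names and this concrete proxy are our illustrative choices, not code from the paper:

```python
import numpy as np

def hard_threshold(z, k):
    # HT_k: keep the k largest-magnitude entries of z.
    x = np.zeros_like(z)
    top = np.argsort(-np.abs(z), kind="stable")[:k]
    x[top] = z[top]
    return x

def pht(z, T, k, r):
    """PHT_k(z; T, r) via the closed form (2): restrict z to T plus the r
    largest-magnitude entries outside T, then apply standard hard thresholding."""
    T = np.asarray(sorted(T), dtype=int)
    mask = np.zeros(z.shape[0], dtype=bool)
    mask[T] = True
    outside = np.flatnonzero(~mask)
    top = outside[np.argsort(-np.abs(z[outside]), kind="stable")[:r]]
    mask[top] = True
    return hard_threshold(np.where(mask, z, 0.0), k)

def pht_step(x, S, A, y, eta, k, r):
    """One PHT(r) iteration for the proxy F(x) = 1/2 ||y - Ax||^2 (our choice):
    gradient step, PHT projection, then the fully corrective least-squares fit."""
    z = x - eta * (A.T @ (A @ x - y))        # z^t = x^{t-1} - eta * grad F(x^{t-1})
    S_new = np.flatnonzero(pht(z, S, k, r))  # S^t = supp(y^t)
    x_new = np.zeros_like(x)
    x_new[S_new] = np.linalg.lstsq(A[:, S_new], y, rcond=None)[0]
    return x_new, set(S_new.tolist())
```

By construction, the new iterate is k-sparse and drops at most r indices of the previous support, matching the two constraints in (1).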
Nevertheless, recall that our goal is to estimate an arbitrary signal x̄. It is not realistic to look for a function F(x) such that our target happens to be its global minimizer. The remedy we will offer is characterizing a deterministic condition between x̄ and ∇F(x̄) which is analogous to the signal-to-noise ratio condition, so that any function F(x) fulfilling that condition suffices. In this light, we find that F(x) behaves more like a proxy that guides the algorithm to a given target. Remarkably, our analysis also encompasses the situation considered in [30, 14].

Second, though it is not made explicit, one should think of F(x) as containing the measurements or the training data. Consider, for example, recovering x̄ from y = Ax̄, where A is a design matrix and y is the response (both are known). A natural way would be running the PHT(r) algorithm with F(x) = ‖y − Ax‖². One may also think of the logistic regression model, where y is a binary vector (labels), A is a collection of training data (features), and F(x) is the logistic loss evaluated on the training samples.

With the above clarification, we are ready to make assumptions on F(x). It turns out that two properties of F(x) are vital for our analysis: restricted strong convexity and restricted smoothness. These two conditions were proposed by [16] and have become standard in the literature [34, 1, 14, 22].

¹ Our results actually hold for |T| ≤ k. But we observe that the size of T we will consider is usually equal to k. Hence, for ease of exposition, we take |T| = k. This is also the case considered in [12].

Definition 1. 
We say that a differentiable function F(x) satisfies the property of restricted strong convexity (RSC) at sparsity level s with parameter ρ⁻_s > 0 if for all x, x′ ∈ R^d with ‖x − x′‖_0 ≤ s,

F(x) − F(x′) − ⟨∇F(x′), x − x′⟩ ≥ (ρ⁻_s / 2) ‖x − x′‖².

Likewise, we say that F(x) satisfies the property of restricted smoothness (RSS) at sparsity level s with parameter ρ⁺_s > 0 if for all x, x′ ∈ R^d with ‖x − x′‖_0 ≤ s, it holds that

F(x) − F(x′) − ⟨∇F(x′), x − x′⟩ ≤ (ρ⁺_s / 2) ‖x − x′‖².

We call κ_s = ρ⁺_s / ρ⁻_s the condition number of the problem, since it is essentially identical to the condition number of the Hessian matrix of F(x) restricted to s-sparse directions.

2.1 Deterministic Analysis

The following proposition shows that under very mild conditions, PHT(r) either terminates or recovers the support of an arbitrary s-sparse signal x̄ using at most O(sκ_{2s} log κ_{2s}) iterations.

Proposition 2. Consider the PHT(r) algorithm with k = s. Suppose that F(x) is ρ⁻_{2s}-RSC and ρ⁺_{2s}-RSS, and the step size satisfies η ∈ (0, 1/ρ⁺_{2s}). Let κ := ρ⁺_{2s}/ρ⁻_{2s}. Then PHT(r) either terminates or recovers the support of x̄ within O(sκ log κ) iterations, provided that

x̄min ≥ ((4√2 + 2√κ) / ρ⁻_{2s}) ‖∇_{2s}F(x̄)‖.

A few remarks are in order. First, we remind the reader that under the conditions stated above, it is not guaranteed that PHT(r) succeeds. We say that PHT(r) fails if it terminates at some time stamp t but S^t ≠ S. This indeed happens if, for example, we feed it with a bad initial point and pick a very small step size. 
In particular, if x⁰min > η‖∇F(x⁰)‖_∞ (with x⁰min defined analogously to x̄min), then the algorithm makes no progress. The crux to remedying this issue is imposing a lower bound on η, or looking at more coordinates in each iteration, which is the theme below. However, the proposition is still useful because it asserts that, as far as we make sure that PHT(r) runs long enough (i.e., O(sκ log κ) iterations), it recovers the support of an arbitrary sparse signal. We also note that neither the RIP condition nor a relaxed sparsity is assumed in this proposition.

The x̄min-condition above is natural, and can be viewed as a generalization of the well-known signal-to-noise ratio (SNR) condition. This follows by considering the noisy compressed sensing problem, where y = Ax̄ + e and F(x) = ‖y − Ax‖². Here, the vector e is some noise. Now RSC and RSS imply, for any 2s-sparse x,

ρ⁻_{2s} ‖x‖² ≤ ‖Ax‖² ≤ ρ⁺_{2s} ‖x‖².

Hence,

‖∇_{2s}F(x̄)‖ = ‖(A⊤e)_{2s}‖ = Θ(‖e‖).

In fact, the x̄min-condition has been used many times to establish support recovery. See, for example, [31, 4, 23].

In the following, we strengthen Prop. 2 by considering the RIP condition, which requires a well-bounded condition number, i.e., κ ≤ O(1).

Theorem 3. Consider the PHT(r) algorithm with k = s. Suppose that F(x) is ρ⁻_{2s+r}-RSC and ρ⁺_{2s+r}-RSS. Let κ := ρ⁺_{2s+r}/ρ⁻_{2s+r} be the condition number, which is assumed to be smaller than 1 + 1/(√2 + ν), where ν = √(s − r) + 2. Pick the step size η = η′/ρ⁺_{2s+r} such that κ − 1/(√2 + ν) < η′ ≤ 1. 
Then PHT(r) recovers the support of x̄ within

t_max = ( log κ / log(1/β) + log(√2/(1 − λ)) / log(1/β) + 2 ) ‖x̄‖_0

iterations, provided that for some constant λ ∈ (0, 1),

x̄min ≥ ((2ν + 6) / (λρ⁻_{2s+r})) ‖∇_{s+r}F(x̄)‖.

Above, β = (√2 + ν)(κ − η′) ∈ (0, 1).

We remark on several aspects of the theorem. The most important part is that Theorem 3 offers a theoretical justification that PHT(r) always recovers the support. This is achieved by imposing an RIP condition (i.e., bounding the condition number from above) and using a proper step size.

We also make the iteration bound explicit, in order to examine the parameter dependency. First, we note that t_max scales approximately linearly with λ. This conforms to intuition, because a small λ actually indicates a large signal-to-noise ratio, and hence it is easy to distinguish the support of interest from the noise. The freedom parameter r is mainly encoded in the coefficient β through the quantity ν. Observe that increasing the scalar r yields a smaller β, and hence fewer iterations. This is not surprising, since a large value of r grants the algorithm more freedom in updating the current iterate. Indeed, in the best case, PHT(s) is able to recover the support in O(1) iterations, while PHT(1) has to take O(s) steps. However, if we investigate the conditions, we find that we need a stronger RSC/RSS condition to afford a large freedom parameter.

It is also interesting to contrast Theorem 3 with [31, 4], which independently established state-of-the-art support recovery results for HTP. As has been mentioned, [31] made use of the optimality of the target signal, which is a restricted setting compared to our result. 
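To get a feel for the iteration bound of Theorem 3, one can plug in numbers; the values of κ, β, λ and the sparsity s below are arbitrary illustrative choices (β would in fact be determined by κ, η′ and ν), not values taken from the paper:

```python
import math

# Theorem 3 bound: t_max = (log(kappa)/log(1/beta)
#                           + log(sqrt(2)/(1 - lam))/log(1/beta) + 2) * s
kappa = 1.05  # condition number; must be close to 1 under the RIP-type assumption
beta = 0.5    # contraction factor beta = (sqrt(2) + nu)(kappa - eta'); assumed given here
lam = 0.5     # the constant lambda in the x_min-condition
s = 10        # sparsity of the target signal

t_max = math.ceil(
    (math.log(kappa) / math.log(1.0 / beta)
     + math.log(math.sqrt(2.0) / (1.0 - lam)) / math.log(1.0 / beta)
     + 2.0) * s
)
print(t_max)
```

With these particular numbers the bound is a few dozen iterations, i.e., a small constant multiple of the sparsity, which is the qualitative message of the theorem.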
Their iteration bound (see Theorem 1 therein), though it provides an appealing insight, does not have a clear dependence on the natural parameters of the problem (e.g., sparsity and condition number). [4] developed an O(‖x̄‖_0) iteration complexity for compressed sensing. Again, they confined themselves to a special signal, whereas we carry out a generalization that allows us to analyze a family of algorithms.

Though the RIP condition has been ubiquitous in the literature, many researchers have pointed out that it is not realistic in practical applications [18, 20, 21]. This is true for large-scale machine learning problems, where the condition number may grow with the sample size (hence one cannot upper bound it with a constant). A clever solution was first (to our knowledge) suggested by [14], who showed that using the sparsity parameter k = O(κ²s) guarantees convergence of projected gradient descent. The idea was recently employed by [22, 31] to show an RIP-free condition for sparse recovery, though in a technically different way. The following theorem borrows this elegant idea to prove RIP-free results for PHT(r).

Theorem 4. Consider the PHT(r) algorithm. Suppose that F(x) is ρ⁻_{2k}-RSC and ρ⁺_{2k}-RSS. Let κ := ρ⁺_{2k}/ρ⁻_{2k} be the condition number. Further pick

k ≥ s + ( 1 + 4/(η²(ρ⁻_{2k})²) ) min{s, r},

where η ∈ (0, 1/ρ⁺_{2k}). Then the support of x̄ is included in the iterate of PHT(r) within

t_max = ( 3 log κ / log(1/μ) + 2 log(2/(1 − λ)) / log(1/μ) + 2 ) ‖x̄‖_0

iterations, provided that for some constant λ ∈ (0, 1),

x̄min ≥ ((√κ + 3) / (λρ⁻_{2k})) ‖∇_{k+s}F(x̄)‖.

Above, we have μ = 1 − ηρ⁻_{2k}(1 − ηρ⁺_{2k})/2.

We discuss the salient features of Theorem 4 compared to Prop. 
2 and Theorem 3. First, note that we can pick η = 1/(2ρ⁺_{2k}) in the above theorem, which results in μ = O(1 − 1/κ). So the iteration complexity is essentially given by O(sκ log κ), similar to the one in Prop. 2. However, in Theorem 4, the sparsity parameter k is set to be O(s + κ² min{s, r}), which guarantees support inclusion. We pose an open question of whether the x̄min-condition might be refined, in that it currently scales with √κ, which is stringent for ill-conditioned problems. Another important consequence implied by the theorem is that the sparsity parameter k actually depends on the minimum of s and r. Consider r = 1, which corresponds to the OMPR algorithm. Then k = O(s + κ²) suffices. In contrast, previous work [14, 31, 22, 23] only obtained theoretical results for k = O(κ²s), owing to a restricted problem setting. We also note that even in the original OMPR paper [12] and its latest version [13], such an RIP-free condition was not established.

2.2 Statistical Results

Until now, all of our theoretical results have been phrased in terms of deterministic conditions (i.e., RSC, RSS and x̄min). It is known that these conditions can be satisfied by prevalent statistical models such as linear regression and logistic regression. Here, we give detailed statistical results for sparse linear regression, and we refer the reader to [1, 14, 22, 23] for other applications.

Consider the sparse linear regression model

y_i = ⟨a_i, x̄⟩ + e_i,  1 ≤ i ≤ N,

where the a_i are drawn i.i.d. from a sub-gaussian distribution with zero mean and covariance Σ ∈ R^{d×d}, and the e_i are drawn i.i.d. from N(0, ω²). We presume that the diagonal elements of Σ are properly scaled, i.e., Σ_jj ≤ 1 for 1 ≤ j ≤ d. Let A = (a⊤_1; ...; a⊤_N) and y = (y_1; ...; y_N). Our goal is to recover x̄ from the knowledge of A and y. To this end, we may choose F(x) = ½‖y − Ax‖². Let σmin(Σ) and σmax(Σ) be the smallest and largest singular values of Σ, respectively. Then it is known that, with high probability, F(x) satisfies the RSC and RSS properties at sparsity level K with parameters

ρ⁻_K = σmin(Σ) − C1 · (K log d)/N,  ρ⁺_K = σmax(Σ) + C2 · (K log d)/N,    (3)

respectively. It is also known that, with high probability, the following holds:

‖∇_K F(x̄)‖ ≤ 2ω √((K log d)/N).    (4)

See [1] for a detailed discussion. With these probabilistic arguments at hand, we investigate the sufficient conditions under which the preceding deterministic results hold.

For Prop. 2, recall that the sparsity level of RSC and RSS is 2s. Hence, if we pick the sample size N = q · 2C1 s log d / σmin(Σ) for some q > 1, then

((4√2 + 2√κ_{2s}) / ρ⁻_{2s}) ‖∇_{2s}F(x̄)‖ ≤ 4ω ( 2√2 + √(σmax(Σ)/σmin(Σ)) · √((1 + C2/(qC1)) / (1 − 1/q)) ) / ( (1 − 1/q) √(qC1 σmin(Σ)) ).

The right-hand side is monotonically decreasing in q, which indicates that as soon as we pick q large enough, it becomes smaller than x̄min. To be more concrete, consider that the covariance matrix Σ is the identity, for which σmin(Σ) = σmax(Σ) = 1. Now suppose that q ≥ 2, which gives the upper bound

((4√2 + 2√κ_{2s}) / ρ⁻_{2s}) ‖∇_{2s}F(x̄)‖ ≤ 8ω (2√2 + √(2 + C2/C1)) / √(qC1).

Thus, in order to fulfill the x̄min-condition in Prop. 
2, it suffices to pick

q = max{ 2, ( 8ω(2√2 + √(2 + C2/C1)) / (√C1 · x̄min) )² }.

For Theorem 3, it essentially asks for a well-conditioned design matrix at sparsity level 2s + r. Note that (3) implies κ_{2s+r} ≥ σmax(Σ)/σmin(Σ), which in turn requires a well-conditioned covariance matrix. Thus, to guarantee that κ_{2s+r} ≤ 1 + ε for some ε > 0, it suffices to choose Σ such that σmax(Σ)/σmin(Σ) < 1 + ε and pick N = q · C1(2s + r) log d / σmin(Σ) with

q = ( 1 + ε + C1⁻¹C2 σmax(Σ)/σmin(Σ) ) / ( 1 + ε − σmax(Σ)/σmin(Σ) ).

Finally, Theorem 4 asserts support inclusion by expanding the support size of the iterates. Suppose that η = 1/(2ρ⁺_{2k}), which results in k ≥ s + (16κ²_{2k} + 1) min{r, s}. Given that the condition number κ_{2k} is always greater than 1, we can pick k ≥ s + 20κ²_{2k} min{r, s}. At first sight, this seems odd, in that k depends on the condition number κ_{2k}, which itself relies on the choice of k. In the following, we present a concrete sample complexity showing that this condition can be met. We will focus on two extreme cases: r = 1 and r = s.

For r = 1, we require k ≥ s + 20κ²_{2k}. Let us pick N = 4C1 k log d / σmin(Σ). In this way, we obtain ρ⁻_{2k} = ½σmin(Σ) and ρ⁺_{2k} ≤ (1 + C2/(2C1))σmax(Σ). It then follows that the condition number of the design matrix satisfies κ_{2k} ≤ (2 + C2/C1)σmax(Σ)/σmin(Σ). 
Consequently, we can set the parameter

k = s + 20 ( (2 + C2/C1) σmax(Σ)/σmin(Σ) )².

Note that the above quantities depend only on the covariance matrix. Again, if Σ is the identity matrix, the sample complexity is O(s log d).

For r = s, likewise k ≥ 20κ²_{2k} s suffices. Following the deduction above, we get

k = 20 ( (2 + C2/C1) σmax(Σ)/σmin(Σ) )² s.

3 Proof Sketch

We sketch the proof and list some useful lemmas which might be of independent interest. The high-level proof technique follows the recent work of [4], which performs an RIP analysis for compressed sensing. But for our purposes, we have to deal with the freedom parameter r as well as the RIP-free condition. We also need to generalize the arguments of [4] to show support recovery results for arbitrary sparse signals. Indeed, we prove the following lemma, which is crucial for our analysis. Below we assume, without loss of generality, that the elements of x̄ are in descending order of magnitude.

Lemma 5. Consider the PHT(r) algorithm. Assume that F(x) is ρ⁻_{2k}-RSC and ρ⁺_{2k}-RSS. Further assume that the sequence {x^t}_{t≥0} satisfies

‖x^t − x̄‖ ≤ α · β^t ‖x⁰ − x̄‖ + ψ1,    (5)

‖x^t − x̄‖ ≤ γ ‖x̄_{S̄^t}‖ + ψ2,    (6)

for positive α, ψ1, γ, ψ2 and 0 < β < 1. Suppose that at the n-th iteration (n ≥ 0), S^n contains the indices of the top p (in magnitude) elements of x̄. Then, for any integer 1 ≤ q ≤ s − p, there exists an integer Δ ≥ 1 determined by

√2 |x̄_{p+q}| > αγ · β^{Δ−1} ‖x̄_{{p+1,...,s}}‖ + Ψ,  where  Ψ = αψ2 + ψ1 + (1/ρ⁻_{2k}) ‖∇_{2k}F(x̄)‖,

such that S^{n+Δ} contains the indices of the top p + q elements of x̄, provided that Ψ ≤ √2 λ x̄min for some constant λ ∈ (0, 1).

We isolate this lemma here since we find it inspiring and general. The lemma states that, under proper conditions, as soon as one can show that the sequence satisfies (5) and (6), then after a few iterations PHT(r) captures more correct indices in the iterate. Note that the condition (5) states that the sequence should contract at a geometric rate, and the condition (6) follows immediately from the fully corrective step (i.e., minimizing F(x) over the new support set).

The next theorem concludes that, under the conditions of Lemma 5, the total iteration complexity for support recovery is proportional to the sparsity of the underlying signal.

Theorem 6. Assume the same conditions as in Lemma 5. Then PHT(r) successfully identifies the support of x̄ using

( log 2 / (2 log(1/β)) + log(αγ/(1 − λ)) / log(1/β) + 2 ) ‖x̄‖_0

iterations.

The detailed proofs of these two results are given in the appendix. Armed with them, it remains to show that PHT(r) satisfies the condition (5) under different settings.

Proof Sketch for Prop. 2. We start by comparing F(z^t_{S^t}) and F(x^{t−1}). To this end, we record several important properties. First, due to the fully corrective step, the support set of ∇F(x^{t−1}) is orthogonal to S^{t−1}. This means that for any subset Ω ⊂ S^{t−1}, z^t_Ω = x^{t−1}_Ω, and for any subset Ω ⊂ S̄^{t−1}, z^t_Ω = −η∇_Ω F(x^{t−1}). 
We also note that, due to the PHT operator, any element of z^t_{S^t \ S^{t−1}} is not smaller (in magnitude) than any element of z^t_{S^{t−1} \ S^t}. These critical facts, together with the RSS condition, result in

F(x^t) − F(x^{t−1}) ≤ F(z^t_{S^t}) − F(x^{t−1}) ≤ −η(1 − ηρ⁺_{2s}) ‖(∇F(x^{t−1}))_{S^t \ S^{t−1}}‖².

Since S^t \ S^{t−1} consists of the top elements of ∇F(x^{t−1}), we can show that

‖(∇F(x^{t−1}))_{S^t \ S^{t−1}}‖² ≥ ( 2ρ⁻_{2s} |S^t \ S^{t−1}| / (|S^t \ S^{t−1}| + |S \ S^{t−1}|) ) (F(x^{t−1}) − F(x̄)).

Using the argument of Prop. 21, we establish the linear convergence of the iterates, i.e., the condition (5). The result then follows.

Proof Sketch for Theorem 3. To prove this theorem, we present a more careful analysis of the problem structure. In particular, let T = supp(∇F(x^{t−1}), r), let J^t = S^{t−1} ∪ T, and consider the elements of ∇F(x^{t−1}). Since T contains the largest elements, any element outside T is smaller than those of T. Then we may compare the elements of ∇F(x^{t−1}) on T \ S and S \ T. Though the two sets have different numbers of components, we can show the following relationship between the averaged energies:

(1/|T \ S|) ‖(∇F(x^{t−1}))_{T \ S}‖² ≥ (1/|S \ T|) ‖(∇F(x^{t−1}))_{S \ T}‖².

Using this inequality, followed by some standard relaxations, we can bound ‖x̄_{J̄^t}‖ in terms of ‖x^{t−1} − x̄‖ as follows.

Lemma 7. 
Assume that F(x) satisfies the properties of RSC and RSS at sparsity level k + s + r. Let ρ⁻ := ρ⁻_{k+s+r} and ρ⁺ := ρ⁺_{k+s+r}. Consider the support set J^t = S^{t−1} ∪ supp(∇F(x^{t−1}), r). For any 0 < θ ≤ 1/ρ⁺, we have
\[
\big\| \bar{x}_{\overline{J^t}} \big\| \le \nu (1 - \theta\rho^-) \big\| x^{t-1} - \bar{x} \big\| + \frac{\nu}{\rho^-} \big\| \nabla_{s+r} F(\bar{x}) \big\|,
\]
where ν = √(s − r) + 2. In particular, picking θ = 1/ρ⁺ gives
\[
\big\| \bar{x}_{\overline{J^t}} \big\| \le \nu \Big(1 - \frac{1}{\kappa}\Big) \big\| x^{t-1} - \bar{x} \big\| + \frac{\nu}{\rho^-} \big\| \nabla_{s+r} F(\bar{x}) \big\|.
\]
Note that the lemma also applies to the two-stage thresholding algorithms (e.g., CoSaMP [15]) whose first step expands the support set.

On the other hand, we also know that
\[
\big\| z^t_{J^t \setminus S^t} \big\| \le \big\| z^t_{J^t \setminus S} \big\| .
\]
This is because J^t \ S^t contains the r smallest elements of z^t_{J^t}. It then follows that ‖x̄_{J^t \ S^t}‖ can be upper bounded by ‖x^{t−1} − x̄‖. Finally, we note that the complement of S^t equals (J^t \ S^t) ∪ (complement of J^t). Hence, (5) follows.

Proof Sketch for Theorem 4. The proof idea of Theorem 4 is inspired by [31], though we give a tighter and more general analysis. We first observe that, due to PHT, S^t \ S^{t−1} contains larger elements than S^{t−1} \ S^t. This enables us to show that
\[
\big\| z^t_{S^t} - x^{t-1} \big\| \ge \eta \,\big\| \big(\nabla F(x^{t-1})\big)_{S^t\setminus S^{t-1}} \big\| .
\]
Then we prove the claim
\[
F(x^t) - F(x^{t-1}) \le -\frac{1-\eta\rho^+_{2k}}{2\eta} \,\big\| z^t_{S^t} - x^{t-1} \big\|^2 \le -\frac{\eta\big(1-\eta\rho^+_{2k}\big)}{2} \,\big\| \big(\nabla F(x^{t-1})\big)_{S^t\setminus S^{t-1}} \big\|^2 .
\]
To prove this claim, we consider whether r is larger than s. If r ≥ s, then it is possible that |S^t \ S^{t−1}| ≥ s. In this case, using the RSC condition and the PHT property, we can show that
\[
\big\| \big(\nabla F(x^{t-1})\big)_{S^t \setminus S^{t-1}} \big\|^2 \ge \big\| \big(\nabla F(x^{t-1})\big)_{S \setminus S^{t-1}} \big\|^2 \ge \rho^-_{2k} \,\big( F(x^{t-1}) - F(\bar{x}) \big).
\]
If |S^t \ S^{t−1}| < s ≤ r, then the above does not hold. But we may partition the set S \ S^{t−1} as the union of T₁ = S \ (S^t ∪ S^{t−1}) and T₂ = (S^t \ S^{t−1}) ∩ S, and show that the ℓ₂-norm of ∇F(x^{t−1}) on T₁ is smaller than that on T₂ if k = s + κ²s. In addition, the RSC condition gives
\[
\big\| \big(\nabla F(x^{t-1})\big)_{T_1} \big\|^2 + \big\| \big(\nabla F(x^{t-1})\big)_{S^t \setminus S^{t-1}} \big\|^2 \ge \rho^-_{2k} \,\big( F(x^{t-1}) - F(\bar{x}) \big).
\]
Since T₂ ⊂ S^t \ S^{t−1}, the desired bound follows by rearranging the terms. The case r < s follows in a similar fashion. The proof is complete.

4 Simulation

We complement our theoretical results by performing numerical experiments in this section. In particular, we are interested in two aspects: first, the number of iterations required to identify the support of an s-sparse signal; second, the trade-off between the iteration number and the percentage of success resulting from different choices of the freedom parameter r.

We consider the compressed sensing model y = A x̄ + 0.01e, where the dimension d = 200 and the entries of A and e are i.i.d. normal variables. Given a sparsity level s, we first uniformly choose the support of x̄, and assign values to the non-zeros with i.i.d. normals. There are two configuration parameters: the sparsity s and the sample size N. Given s and N, we independently generate 100 signals and test PHT(r) on them. We say PHT(r) succeeds in a trial if it returns an iterate with the correct support within 10,000 iterations; otherwise we mark the trial as a failure. The iteration numbers reported below are counted only over the successful trials.
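Concretely, the PHT(r) iteration on the least-squares objective F(x) = ½‖y − Ax‖² used in this simulation can be sketched as below. This is a minimal illustration, not the code used for the experiments: the function name `pht_r`, the hard-thresholding initialization, and the support-stabilization stopping rule are our own choices.

```python
import numpy as np

def pht_r(A, y, s, r, eta=1.0, max_iter=1000):
    """Sketch of PHT(r) for F(x) = 0.5 * ||y - A x||^2 (least squares).

    Hypothetical reference implementation; initialization and stopping
    rule are assumptions, not the authors' experiment code.
    """
    d = A.shape[1]
    # Initialize with one plain hard-thresholding step.
    S = np.sort(np.argsort(np.abs(A.T @ y))[-s:])
    x = np.zeros(d)
    x[S] = np.linalg.lstsq(A[:, S], y, rcond=None)[0]
    for _ in range(max_iter):
        g = A.T @ (A @ x - y)            # gradient of F; zero on S after the corrective step
        T = np.argsort(np.abs(g))[-r:]   # top-r gradient entries (freedom parameter r)
        J = np.union1d(S, T)             # candidate support J^t = S^{t-1} U T
        z = np.zeros(d)
        z[J] = x[J] - eta * g[J]         # gradient step restricted to J^t
        S_new = np.sort(np.argsort(np.abs(z))[-s:])  # keep s largest: partial hard thresholding
        x = np.zeros(d)
        x[S_new] = np.linalg.lstsq(A[:, S_new], y, rcond=None)[0]  # fully corrective step
        if np.array_equal(S_new, S):     # support has stabilized
            break
        S = S_new
    return x, S
```

With r = 1 this performs an OMPR-style single swap per iteration, while r = s yields an HTP-style update, matching the two extremes of the PHT family discussed in the paper.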
The step size η is fixed to 1, though one can tune it via cross-validation for better performance.

[Figure 1: four panels. The left two (N = 200) plot the iteration number and the percentage of success against the number of non-zeros; the right two (s = 10) plot the iteration number and the percentage of success against the number of measurements. Curves correspond to r = 1, 2, 5, 10, 100.]

Figure 1: Iteration number and success percentage against sparsity and sample size. The first panel shows that the iteration number grows linearly with the sparsity. The choice r = 5 suffices to guarantee a minimum iteration complexity. The second panel shows comparable statistical performance for different choices of r. The third one illustrates how the iteration number changes with the sample size, and the last panel depicts the phase transition.

To study how the iteration number scales with the sparsity in practice, we fix N = 200 and tune s from 1 to 100. We test different freedom parameters r on these signals. The results are shown in the leftmost plot of Figure 1. As our theory predicts, we observe that within O(s) iterations, PHT(r) precisely identifies the true support. In the second subfigure, we plot the percentage of success against the sparsity. The curves for different choices of r lie on top of each other, possibly because we used a sufficiently large sample size.

Next, we fix s = 10 and vary N from 1 to 200.
Surprisingly, from the rightmost figure, we do not observe performance degradation when using a large freedom parameter. We therefore conjecture that the x̄_min-condition we established can be refined.

Figure 1 also illustrates an interesting phenomenon: beyond a particular threshold, say r = 5, further increasing r does not significantly reduce the iteration number. This cannot be explained by the theorems in this paper, and we leave it as a promising research direction.

5 Conclusion and Future Work

In this paper, we have presented a principled analysis of a family of hard thresholding algorithms. To facilitate our analysis, we appealed to the recently proposed partial hard thresholding operator. We have shown that under the RIP condition or the relaxed sparsity condition, the PHT(r) algorithm recovers the support of an arbitrary sparse signal x̄ within O(‖x̄‖₀ κ log κ) iterations, provided that a generalized signal-to-noise ratio condition is satisfied. On account of our unified analysis, we have established the best known bounds for HTP and OMPR. We have also illustrated that the simulation results agree with our finding that the iteration number is proportional to the sparsity.

There are several interesting future directions. First, it would be interesting to examine whether we can remove the logarithmic factor log κ in the iteration bound. Second, it is also useful to study RIP-free conditions for two-stage PHT algorithms such as CoSaMP. Finally, we pose the open question of whether one can improve the √κ factor in the x̄_min-condition.

Acknowledgements. The work is supported in part by NSF-Bigdata-1419210 and NSF-III-1360971. We thank the anonymous reviewers for valuable comments.

References

[1] A. Agarwal, S. Negahban, and M. J. Wainwright. Fast global convergence of gradient methods for high-dimensional statistical recovery.
The Annals of Statistics, 40(5):2452–2482, 2012.

[2] S. Bahmani, B. Raj, and P. T. Boufounos. Greedy sparsity-constrained optimization. Journal of Machine Learning Research, 14(1):807–841, 2013.

[3] T. Blumensath and M. E. Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.

[4] J.-L. Bouchot, S. Foucart, and P. Hitczenko. Hard thresholding pursuit algorithms: number of iterations. Applied and Computational Harmonic Analysis, 41(2):412–435, 2016.

[5] T. T. Cai and L. Wang. Orthogonal matching pursuit for sparse signal recovery with noise. IEEE Trans. Information Theory, 57(7):4680–4688, 2011.

[6] E. J. Candès and T. Tao. Decoding by linear programming. IEEE Trans. Information Theory, 51(12):4203–4215, 2005.

[7] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.

[8] W. Dai and O. Milenkovic. Subspace pursuit for compressive sensing signal reconstruction. IEEE Trans. Information Theory, 55(5):2230–2249, 2009.

[9] I. Daubechies, M. Defrise, and C. D. Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413–1457, 2004.

[10] S. Foucart. Hard thresholding pursuit: An algorithm for compressive sensing. SIAM Journal on Numerical Analysis, 49(6):2543–2563, 2011.

[11] S. Foucart and H. Rauhut. A Mathematical Introduction to Compressive Sensing. Applied and Numerical Harmonic Analysis. Birkhäuser, 2013.

[12] P. Jain, A. Tewari, and I. S. Dhillon. Orthogonal matching pursuit with replacement. In Proceedings of the 25th Annual Conference on Neural Information Processing Systems, pages 1215–1223, 2011.

[13] P. Jain, A. Tewari, and I. S. Dhillon. Partial hard thresholding.
IEEE Trans. Information Theory, 63(5):3029–3038, 2017.

[14] P. Jain, A. Tewari, and P. Kar. On iterative hard thresholding methods for high-dimensional M-estimation. In Proceedings of the 28th Annual Conference on Neural Information Processing Systems, pages 685–693, 2014.

[15] D. Needell and J. A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009.

[16] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems, pages 1348–1356, 2009.

[17] S. Osher, F. Ruan, J. Xiong, Y. Yao, and W. Yin. Sparse recovery via differential inclusions. Applied and Computational Harmonic Analysis, 41(2):436–469, 2016.

[18] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

[19] Y. Plan and R. Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Trans. Information Theory, 59(1):482–494, 2013.

[20] G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11:2241–2259, 2010.

[21] M. Rudelson and S. Zhou. Reconstruction from anisotropic random measurements. IEEE Trans. Information Theory, 59(6):3434–3447, 2013.

[22] J. Shen and P. Li. A tight bound of hard thresholding. arXiv preprint arXiv:1605.01656, 2016.

[23] J. Shen and P. Li. On the iteration complexity of support recovery via hard thresholding pursuit. In Proceedings of the 34th International Conference on Machine Learning, pages 3115–3124, 2017.

[24] R. Tibshirani.
Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

[25] J. A. Tropp. Greed is good: algorithmic results for sparse approximation. IEEE Trans. Information Theory, 50(10):2231–2242, 2004.

[26] J. A. Tropp and S. J. Wright. Computational methods for sparse solution of linear inverse problems. Proceedings of the IEEE, 98(6):948–958, 2010.

[27] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Trans. Information Theory, 55(5):2183–2202, 2009.

[28] J. Wang, S. Kwon, P. Li, and B. Shim. Recovery of sparse signals via generalized orthogonal matching pursuit: A new analysis. IEEE Trans. Signal Processing, 64(4):1076–1089, 2016.

[29] M. Yuan and Y. Lin. On the non-negative garrotte estimator. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):143–161, 2007.

[30] X.-T. Yuan, P. Li, and T. Zhang. Gradient hard thresholding pursuit for sparsity-constrained optimization. In Proceedings of the 31st International Conference on Machine Learning, pages 127–135, 2014.

[31] X.-T. Yuan, P. Li, and T. Zhang. Exact recovery of hard thresholding pursuit. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, pages 3558–3566, 2016.

[32] T. Zhang. On the consistency of feature selection using greedy least squares regression. Journal of Machine Learning Research, 10:555–568, 2009.

[33] T. Zhang. Some sharp performance bounds for least squares regression with L1 regularization. The Annals of Statistics, 37(5A):2109–2144, 2009.

[34] T. Zhang. Sparse recovery with orthogonal matching pursuit under RIP. IEEE Trans. Information Theory, 57(9):6215–6221, 2011.

[35] P. Zhao and B. Yu. On model selection consistency of Lasso.
Journal of Machine Learning Research, 7:2541–2563, 2006.