{"title": "Beyond Sub-Gaussian Measurements: High-Dimensional Structured Estimation with Sub-Exponential Designs", "book": "Advances in Neural Information Processing Systems", "page_first": 2206, "page_last": 2214, "abstract": "We consider the problem of high-dimensional structured estimation with norm-regularized estimators, such as Lasso, when the design matrix and noise are drawn from sub-exponential distributions. Existing results only consider sub-Gaussian designs and noise, and both the sample complexity and non-asymptotic estimation error have been shown to depend on the Gaussian width of suitable sets. In contrast, for the sub-exponential setting, we show that the sample complexity and the estimation error depend on the exponential width of the corresponding sets, and the analysis holds for any norm. Further, using generic chaining, we show that the exponential width of any set is at most $\\sqrt{\\log p}$ times the Gaussian width of the set, yielding Gaussian width based results even for the sub-exponential case. Moreover, for certain popular estimators, viz. Lasso and Group Lasso, using a VC-dimension based analysis, we show that the sample complexity is in fact of the same order as for Gaussian designs. 
Our general analysis and results are the first in the sub-exponential setting, and are readily applicable to special sub-exponential families such as log-concave and extreme-value distributions.", "full_text": "Beyond Sub-Gaussian Measurements:\n\nHigh-Dimensional Structured Estimation with\n\nSub-Exponential Designs\n\nVidyashankar Sivakumar\n\nArindam Banerjee\n\nDepartment of Computer Science & Engineering\n\nUniversity of Minnesota, Twin Cities\n\n{sivakuma,banerjee}@cs.umn.edu\n\nPradeep Ravikumar\n\nDepartment of Computer Science\n\nUniversity of Texas, Austin\n\npradeepr@cs.utexas.edu\n\nAbstract\n\nWe consider the problem of high-dimensional structured estimation with norm-regularized estimators, such as Lasso, when the design matrix and noise are drawn from sub-exponential distributions. Existing results only consider sub-Gaussian designs and noise, and both the sample complexity and non-asymptotic estimation error have been shown to depend on the Gaussian width of suitable sets. In contrast, for the sub-exponential setting, we show that the sample complexity and the estimation error depend on the exponential width of the corresponding sets, and the analysis holds for any norm. Further, using generic chaining, we show that the exponential width of any set is at most $\\sqrt{\\log p}$ times the Gaussian width of the set, yielding Gaussian width based results even for the sub-exponential case. Moreover, for certain popular estimators, viz. Lasso and Group Lasso, using a VC-dimension based analysis, we show that the sample complexity is in fact of the same order as for Gaussian designs. Our general analysis and results are the first in the sub-exponential setting, and are readily applicable to special sub-exponential families such as log-concave and extreme-value distributions.\n\n1 Introduction\n\nWe consider the following problem of high-dimensional linear regression:\n\n$y = X\\theta^* + \\omega$,   (1)\n\nwhere $y \\in R^n$ is the response vector, $X \\in R^{n \\times p}$ has independent isotropic sub-exponential random rows, $\\omega \\in R^n$ has i.i.d. sub-exponential entries, and the number of covariates $p$ is much larger than the number of samples $n$. Given $y$, $X$, and assuming that $\\theta^*$ is 'structured', usually characterized as having a small value according to some norm $R(\\cdot)$, the problem is to recover $\\hat{\\theta}$ close to $\\theta^*$. Considerable progress has been made over the past decade on high-dimensional structured estimation using suitable M-estimators or norm-regularized regression [16, 2] of the form:\n\n$\\hat{\\theta}_{\\lambda_n} = \\mathrm{argmin}_{\\theta \\in R^p} \\frac{1}{2n} \\|y - X\\theta\\|_2^2 + \\lambda_n R(\\theta)$,   (2)\n\nwhere $R(\\theta)$ is a suitable norm and $\\lambda_n > 0$ is the regularization parameter. Early work focused on high-dimensional estimation of sparse vectors using the Lasso and related estimators, where $R(\\theta) = \\|\\theta\\|_1$ [13, 22, 23]. The sample complexity of such estimators has been rigorously established based on the RIP (restricted isometry property) [4, 5] and the more general RE (restricted eigenvalue) conditions [3, 16, 2]. Several subsequent advances have considered structures beyond $\\ell_1$, using more general norms such as (overlapping) group sparse norms, the k-support norm, the nuclear norm, and so on [16, 8, 7]. 
In recent years, much of the literature has been unified, and non-asymptotic estimation error bound analysis techniques have been developed for regularized estimation with any norm [2].\n\nIn spite of such advances, most of the existing literature relies on the assumption that entries in the design matrix $X \\in R^{n \\times p}$ are sub-Gaussian. In particular, recent unified treatments based on decomposable norms, atomic norms, or general norms all rely on concentration properties of sub-Gaussian distributions [16, 7, 2]. Certain estimators, such as the Dantzig selector and variants, consider a constrained problem rather than a regularized problem as in (2), but the analysis again relies on the entries of $X$ being sub-Gaussian [6, 8]. For the setting of constrained estimation, building on prior work by [10], [20] outlines a possible strategy for such analysis which can work for any distribution, but works out details only for the sub-Gaussian case. In recent work, [9] considered sub-Gaussian design matrices with heavy-tailed noise, and suggested modifying the estimator in (2) via a median-of-means type estimator based on multiple estimates of $\\hat{\\theta}$ from sub-samples.\n\nIn this paper, we establish results for the norm-regularized estimation problem (2) for any norm $R(\\theta)$ under the assumption that the elements $X_{ij}$ of the design matrix $X \\in R^{n \\times p}$ follow a sub-exponential distribution, whose tails are dominated by scaled versions of the (symmetric) exponential distribution, i.e., $P(|X_{ij}| > t) \\le c_1 \\exp(-t/c_2)$ for all $t \\ge 0$ and suitable constants $c_1, c_2$ [12, 21]. To understand the motivation for our work, note that in most of machine learning and statistics, unlike in compressed sensing, the design matrix cannot be chosen but is determined by the problem. 
In many application domains, such as finance, climate science, ecology, and social network analysis, variables with heavier tails than sub-Gaussians are frequently encountered. For example, in climate science, variables from extreme-value distributions are used to understand relationships between extreme-value phenomena such as heavy precipitation. While high-dimensional statistical techniques have been used in practice for such applications, theoretical guarantees on their performance are currently lacking. Note that the class of sub-exponential distributions has heavier tails than sub-Gaussians but still has all moments. To the best of our knowledge, this is the first paper to analyze regularized high-dimensional estimation problems of the form (2) with sub-exponential design matrices and noise.\n\nIn our main result, we obtain bounds on the estimation error $\\|\\hat{\\Delta}_n\\|_2 = \\|\\hat{\\theta}_{\\lambda_n} - \\theta^*\\|_2$, where $\\theta^*$ is the optimal structured parameter. The sample complexity bounds are a factor of $\\log p$ worse than in the sub-Gaussian case. For example, for the $\\ell_1$ norm we obtain a sample complexity bound of $n = O(s \\log^2 p)$ instead of $O(s \\log p)$ for the sub-Gaussian case. The analysis depends on two key ingredients which have been discussed in previous work [16, 2]: 1. the satisfaction of the RE condition on a set $A$, the error set associated with the norm, and 2. the design matrix-noise interaction, manifested in the form of lower bounds on the regularization parameter. Specifically, the RE condition depends on the properties of the design matrix. We outline two different approaches for obtaining the sample complexity needed to satisfy the RE condition: one based on the 'exponential width' of $A$ and another based on the VC-dimension of linear predictors drawn from $A$ [10, 20, 11]. 
For two widely used cases, Lasso and group Lasso, we show that the VC-dimension based analysis leads to a sharp bound on the sample complexity, which is exactly of the same order as that for sub-Gaussian design matrices. In particular, for Lasso with $s$-sparsity, $O(s \\log p)$ samples are sufficient to satisfy the RE condition for sub-exponential designs. Further, we show that the bound on the regularization parameter depends on the 'exponential width' $w_e(\\Omega_R)$ of the unit norm ball $\\Omega_R = \\{u \\in R^p : R(u) \\le 1\\}$. Through a careful argument based on generic chaining [19], we show that for any set $T \\subset R^p$, the exponential width satisfies $w_e(T) \\le c \\, w_g(T) \\sqrt{\\log p}$, where $w_g(T)$ is the Gaussian width of the set $T$ and $c$ is an absolute constant. Recent advances on computing or bounding $w_g(T)$ for various structured sets can then be used to bound $w_e(T)$. Again, for the case of Lasso, $w_e(\\Omega_R) \\le c \\log p$.\n\nThe rest of the paper is organized as follows. In Section 2 we describe various aspects of the problem and highlight our contributions. In Section 3 we establish a key result on the relationship between the Gaussian and exponential widths of sets, which will be used in our subsequent analysis. In Section 4 we establish results on the regularization parameter $\\lambda_n$, the RE constant $\\kappa$, and the non-asymptotic estimation error $\\|\\hat{\\Delta}_n\\|_2$. We show some experimental results before concluding in Section 6.\n\n2 Background and Preliminaries\n\nIn this section, we describe various aspects of the problem, introducing notation along the way, and highlight our contributions. Throughout the paper, the values of constants may change from line to line.\n\n2.1 Problem setup\n\nWe consider the problem defined in (2). 
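For concreteness, the regularized estimator (2) with $R(\\theta) = \\|\\theta\\|_1$ can be solved by proximal gradient descent (ISTA). The sketch below is ours, not the authors' implementation; the step size, the value of $\\lambda$, and the Laplace design (a convenient symmetric sub-exponential family) are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    # proximal operator of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, iters=500):
    # minimize (1/(2n)) ||y - X theta||_2^2 + lam * ||theta||_1 via ISTA
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2   # 1/L, L = Lipschitz const. of the gradient
    theta = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / n
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

# synthetic instance with sub-exponential (Laplace) design and noise
rng = np.random.default_rng(0)
n, p, s = 200, 50, 5
X = rng.laplace(scale=1 / np.sqrt(2), size=(n, p))   # isotropic rows, unit variance
theta_star = np.zeros(p)
theta_star[:s] = 1.0
y = X @ theta_star + 0.1 * rng.laplace(scale=1 / np.sqrt(2), size=n)
theta_hat = lasso_ista(X, y, lam=0.1)
```

With $\\lambda$ chosen above the noise level, the off-support coordinates of $\\hat{\\theta}$ are driven to zero, which is the role the lower bound on $\\lambda_n$ plays in the analysis below.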
The goal of this paper is to establish conditions for consistent estimation and derive bounds on $\\|\\hat{\\Delta}_n\\|_2 = \\|\\hat{\\theta} - \\theta^*\\|_2$.\n\nError set: Under the assumption $\\lambda_n \\ge \\beta R^*(\\frac{1}{n} X^T (y - X\\theta^*))$, $\\beta > 1$, the error vector $\\hat{\\Delta}_n = \\hat{\\theta} - \\theta^*$ lies in a cone $A \\subseteq S^{p-1}$ [3, 16, 2].\n\nRegularization parameter: For $\\beta > 1$, $\\lambda_n \\ge \\beta R^*(\\frac{1}{n} X^T (y - X\\theta^*))$, following the analysis in [16, 2].\n\nRestricted Eigenvalue (RE) condition: For consistent estimation, the design matrix $X$ should satisfy the RE condition $\\inf_{u \\in A} \\frac{1}{\\sqrt{n}} \\|Xu\\|_2 \\ge \\kappa$ on the error set $A$ for some constant $\\kappa > 0$ [3, 16, 2, 20, 18]. The RE sample complexity is the number of samples $n$ required to satisfy the RE condition, and has been shown to be related to the Gaussian width of the error set [7, 2, 20].\n\nDeterministic recovery bounds: If $X$ satisfies the RE condition on the error set $A$ and $\\lambda_n$ satisfies the assumptions stated earlier, [2] show the error bound $\\|\\hat{\\Delta}_n\\|_2 \\le c \\Psi(A) \\frac{\\lambda_n}{\\kappa}$ with high probability (w.h.p.), for some constant $c$, where $\\Psi(A) = \\sup_{u \\in A} \\frac{R(u)}{\\|u\\|_2}$ is the norm compatibility constant.\n\n$\\ell_1$ norm regularization: One example of $R(\\cdot)$ we will consider throughout the paper is $\\ell_1$ norm regularization. In particular, we will always consider $\\|\\theta^*\\|_0 = s$.\n\nGroup-sparse norms: Another popular example we consider is the group-sparse norm. Let $G = \\{G_1, G_2, \\ldots, G_{N_G}\\}$ denote a collection of groups, which are blocks of any vector $\\theta \\in R^p$. For any vector $\\theta \\in R^p$ and group $G_i$, let $\\theta_{G_i}$ denote the vector with coordinates $(\\theta_{G_i})_j = \\theta_j$ if $j \\in G_i$, else $(\\theta_{G_i})_j = 0$. Let $m = \\max_{i \\in [1, \\ldots, N_G]} |G_i|$ be the maximum size of any group. In the group-sparse setting, for any subset $S_G \\subseteq \\{1, 2, \\ldots, N_G\\}$ with cardinality $|S_G| = s_G$, we assume that the parameter vector $\\theta^* \\in R^p$ satisfies $\\theta^*_{G_i} = \\vec{0}$ for all $i \\notin S_G$. Such a vector is called $s_G$-group sparse. We will focus on the case when $R(\\theta) = \\sum_{i=1}^{N_G} \\|\\theta_{G_i}\\|_2$.\n\n2.2 Contributions\n\nOne of our major results is the relationship between the Gaussian and exponential widths of sets, established using arguments from generic chaining [19]. Existing analysis frameworks for our problem with sub-Gaussian $X$ and $\\omega$ obtain results in terms of the Gaussian widths of suitable sets associated with the norm [2, 20]. For sub-exponential $X$ and $\\omega$ this dependency is, in some cases, replaced by the exponential width of the set. By establishing a precise relationship between the two quantities, we leverage existing results on the computation of Gaussian widths for our scenario. Another contribution is obtaining the same order of RE sample complexity bound as in the sub-Gaussian case for $\\ell_1$ and group-sparse norms. While this strong result has already been explored in [11] for $\\ell_1$, we adapt it to our analysis framework and also extend it to the group-sparse setting. As for the applicability of our work, the results apply to all log-concave distributions, which by definition are distributions admitting a log-concave density $f$, i.e., a density of the form $f = e^{\\Psi}$ with $\\Psi$ any concave function. This covers many practically used distributions, including extreme-value distributions.\n\n3 Relationship between Gaussian and Exponential Widths\n\nIn this section we introduce a complexity parameter of a set, $w_e(\\cdot)$, which we call the exponential width of the set, and establish a sharp upper bound for it in terms of the Gaussian width of the set $w_g(\\cdot)$. 
In particular, we prove the inequality $w_e(A) \\le c \\cdot w_g(A) \\sqrt{\\log p}$ for some fixed constant $c$. To see the connection with the rest of the paper, recall that our subsequent results for $\\lambda_n$ and $\\kappa$ are expressed in terms of the Gaussian width and exponential width of specific sets associated with the norm. With this result, we establish precise sample complexity bounds by leveraging a body of literature on the computation of Gaussian widths for various structured sets [7, 20]. We note that while the exponential width has been defined and used earlier, see e.g. [19, 15], to the best of our knowledge this is the first result establishing the relation between the Gaussian and exponential widths of sets. Our result relies on generic chaining [19].\n\n3.1 Generic Chaining, Gaussian Width and Exponential Width\n\nConsider a process $\\{X_t\\}_{t \\in T}$ with $X_t = \\langle h, t \\rangle$, indexed by a set $T \\subseteq R^p$, where each element $h_i$ has mean 0. It follows from the definition that the process is centered, i.e., $E(X_t) = 0$ for all $t \\in T$. We will also assume for convenience, w.l.o.g., that the set $T$ is finite. Also, for any $s, t \\in T$, consider a canonical distance metric $d(s, t)$. We are interested in computing the quantity $E \\sup_{t \\in T} X_t$. Now, for reasons detailed in the supplement, consider splitting $T$ into a sequence of subsets $T_0 \\subseteq T_1 \\subseteq \\ldots \\subseteq T$, with $T_0 = \\{t_0\\}$, $|T_n| \\le 2^{2^n}$ for $n \\ge 1$, and $T_m = T$ for some large $m$. The function $\\pi_n : T \\to T_n$, defined as $\\pi_n(t) = \\mathrm{argmin}_{s \\in T_n} d(s, t)$, maps each point $t \\in T$ to a closest point $s \\in T_n$ according to $d$. The set $T_n$ and the associated function $\\pi_n$ define a partition $A_n$ of the set $T$: each element of the partition $A_n$ consists of some element $s \\in T_n$ together with all $t \\in T$ mapped to it by $\\pi_n$. Also, the size of the partition satisfies $|A_n| \\le 2^{2^n}$. 
$A_n$ is called an admissible sequence in generic chaining. Note that there are multiple admissible sequences, corresponding to the multiple ways of defining the sets $T_0, T_1, \\ldots, T_m$. We denote by $\\Delta(A_n(t))$ the diameter of the element $A_n(t)$ w.r.t. the distance metric $d$, defined as $\\Delta(A_n(t)) = \\sup_{s, s' \\in A_n(t)} d(s, s')$.\n\nDefinition 1 ($\\gamma$-functionals [19]): Given $\\alpha > 0$ and a metric space $(T, d)$, we define\n\n$\\gamma_\\alpha(T, d) = \\inf \\sup_t \\sum_{n \\ge 0} 2^{n/\\alpha} \\Delta(A_n(t))$,   (3)\n\nwhere the inf is taken over all possible admissible sequences of the set $T$.\n\nGaussian width: Let $\\{X_t\\}_{t \\in T}$ with $X_t = \\langle g, t \\rangle$, where each element $g_i$ is i.i.d. $N(0, 1)$. The quantity $w_g(T) = E \\sup_{t \\in T} X_t$ is called the Gaussian width of the set $T$. Define the distance metric $d_2(s, t) = \\|s - t\\|_2$. The relation between the Gaussian width and the $\\gamma$-functionals is seen from the following result, [Theorem 2.1.1] of [19]:\n\n$\\frac{1}{L} \\gamma_2(T, d_2) \\le w_g(T) \\le L \\gamma_2(T, d_2)$.   (4)\n\nNote that, following [Theorem 2.1.5] in [19], any process which satisfies the concentration bound $P(|X_s - X_t| \\ge u) \\le 2 \\exp(-u^2 / d_2(s,t)^2)$ satisfies the upper bound in (4).\n\nExponential width: Let $\\{X_t\\}_{t \\in T}$ with $X_t = \\langle e, t \\rangle$, where each element $e_i$ is a centered i.i.d. exponential random variable satisfying $P(|e_i| \\ge u) = \\exp(-u)$. Define the distance metrics $d_2(s, t) = \\|s - t\\|_2$ and $d_\\infty(s, t) = \\|s - t\\|_\\infty$. The quantity $w_e(T) = E \\sup_{t \\in T} X_t$ is called the exponential width of the set $T$. By [Theorem 1.2.7] and [Theorem 5.2.7] in [19], for some universal constant $L$, $w_e(T)$ satisfies:\n\n$\\frac{1}{L} (\\gamma_2(T, d_2) + \\gamma_1(T, d_\\infty)) \\le w_e(T) \\le L (\\gamma_2(T, d_2) + \\gamma_1(T, d_\\infty))$.   (5)\n\nNote that any process which satisfies the sub-exponential concentration bound $P(|X_s - X_t| \\ge u) \\le 2 \\exp(-K \\min(\\frac{u^2}{d_2(s,t)^2}, \\frac{u}{d_\\infty(s,t)}))$ satisfies the upper bound in the above inequality [15, 19].\n\n3.2 An Upper Bound for the Exponential Width\n\nIn this section we prove the following relationship between the exponential and Gaussian widths:\n\nTheorem 1 For any set $T \\subset R^p$ and some constant $c$, the following holds:\n\n$w_e(T) \\le c \\cdot w_g(T) \\sqrt{\\log p}$.   (6)\n\nProof: The result depends on the geometric results [Lemma 2.6.1] and [Theorem 2.6.2] in [19].\n\nTheorem 2 [19] Consider a countable set $T \\subset R^p$ and a number $u > 0$. Assume that the Gaussian width is bounded, i.e., $S = \\gamma_2(T, d_2) < \\infty$. Then there is a decomposition $T \\subset T_1 + T_2$, where $T_1 + T_2 = \\{t_1 + t_2 : t_1 \\in T_1, t_2 \\in T_2\\}$, such that\n\n$\\gamma_2(T_1, d_2) \\le LS$, $\\gamma_1(T_1, d_\\infty) \\le LSu$,   (7)\n\n$\\gamma_2(T_2, d_2) \\le LS$, $T_2 \\subset \\frac{LS}{u} B_1$,   (8)\n\nwhere $L$ is some universal constant and $B_1$ is the unit $\\ell_1$ norm ball in $R^p$.\n\nWe first examine the exponential widths of the sets $T_1$ and $T_2$. For the set $T_1$:\n\n$w_e(T_1) \\le L[\\gamma_2(T_1, d_2) + \\gamma_1(T_1, d_\\infty)] \\le L[S + Su] \\le L(w_g(T) + w_g(T) u)$,   (9)\n\nwhere the first inequality follows from (5) and the second inequality follows from (7). To compute the exponential width of $T_2$, we will need the following result bounding the exponential width of the unit $\\ell_1$-norm ball in $p$ dimensions. 
The proof, given in the supplement, is based on the fact that $\\sup_{t \\in B_1} \\langle e, t \\rangle = \\|e\\|_\\infty$, together with a simple union bound argument to bound $\\|e\\|_\\infty$.\n\nLemma 1 Consider the set $B_1 = \\{t \\in R^p : \\|t\\|_1 \\le 1\\}$. Then for some universal constant $L$:\n\n$w_e(B_1) = E[\\sup_{t \\in B_1} \\langle e, t \\rangle] \\le L \\log p$.   (10)\n\nThe exponential width of $T_2$ satisfies:\n\n$w_e(T_2) \\le w_e((LS/u) B_1) = (LS/u) w_e(B_1) \\le (L/u) w_g(T) w_e(B_1) \\le (L/u) w_g(T) \\log p$.   (11)\n\nThe first inequality follows from (8), as $T_2$ is a subset of a $(LS/u)$-scaled $\\ell_1$ norm ball; the equality follows from elementary properties of widths of sets, and the last inequality follows from Lemma 1. Now, as stated in Theorem 2, $u$ in (9) and (11) may be any number greater than 0. Choosing $u = \\sqrt{\\log p}$ and noting that $(1 + \\sqrt{\\log p}) \\le L \\sqrt{\\log p}$ for some constant $L$ yields:\n\n$w_e(T_1) \\le L w_g(T) \\sqrt{\\log p}$, $w_e(T_2) \\le L w_g(T) \\sqrt{\\log p}$.   (12)\n\nThe final step, following arguments as in [Theorem 2.1.6] of [19], is to bound the exponential width of the set $T$:\n\n$w_e(T) = E[\\sup_{t \\in T} \\langle e, t \\rangle] \\le E[\\sup_{t_1 \\in T_1} \\langle e, t_1 \\rangle] + E[\\sup_{t_2 \\in T_2} \\langle e, t_2 \\rangle] = w_e(T_1) + w_e(T_2) \\le L w_g(T) \\sqrt{\\log p}$.\n\nThis proves Theorem 1.\n\n4 Recovery Bounds\n\nWe obtain bounds on the error vector $\\hat{\\Delta}_n = \\hat{\\theta} - \\theta^*$. If the regularization parameter satisfies $\\lambda_n \\ge \\beta R^*(\\frac{1}{n} X^T (y - X\\theta^*))$, $\\beta > 1$, and the RE condition is satisfied on the error set $A$ with RE constant $\\kappa$, then [2, 16] obtain the following error bound w.h.p. for some constant $c$:\n\n$\\|\\hat{\\Delta}_n\\|_2 \\le c \\cdot \\frac{\\lambda_n}{\\kappa} \\Psi(A)$,   (13)\n\nwhere $\\Psi(A)$ is the norm compatibility constant, given by $\\sup_{u \\in A} (R(u)/\\|u\\|_2)$.\n\n4.1 
Regularization Parameter\n\nAs discussed earlier, for our analysis the regularization parameter should satisfy $\\lambda_n \\ge \\beta R^*(\\frac{1}{n} X^T (y - X\\theta^*))$, $\\beta > 1$. Observe that for the linear model (1), $\\omega = y - X\\theta^*$ is the noise, implying that $\\lambda_n \\ge \\beta R^*(\\frac{1}{n} X^T \\omega)$. With $e$ denoting a sub-exponential random vector with i.i.d. entries,\n\n$E[R^*(\\frac{1}{n} X^T \\omega)] = E[\\sup_{u \\in \\Omega_R} \\langle \\frac{1}{n} X^T \\omega, u \\rangle] = \\frac{1}{n} E[\\|\\omega\\|_2] \\, E[\\sup_{u \\in \\Omega_R} \\langle e, u \\rangle]$.   (14)\n\nThe first equality follows from the definition of the dual norm. The second equality follows from the fact that $X$ and $\\omega$ are independent of each other. Also, by elementary arguments [21], $e = X^T (\\omega / \\|\\omega\\|_2)$ has i.i.d. sub-exponential entries with sub-exponential norm bounded by $\\sup_{\\omega \\in R^n} \\| \\langle X_i, \\omega / \\|\\omega\\|_2 \\rangle \\|_{\\psi_1}$. The above argument was first proposed for the sub-Gaussian case in [2]. For sub-exponential design and noise, the difference compared to the sub-Gaussian case is the dependence on the exponential width instead of the Gaussian width of the unit norm ball. Using known results on the Gaussian widths of the unit $\\ell_1$ and group-sparse norm balls, the corollaries below are derived using the relationship between Gaussian and exponential widths established in Section 3:\n\nCorollary 1 If $R(\\cdot)$ is the $\\ell_1$ norm, for sub-exponential design matrix $X$ and noise $\\omega$,\n\n$E[R^*(\\frac{1}{n} X^T (y - X\\theta^*))] \\le \\frac{\\eta_0}{\\sqrt{n}} \\log p$.   (15)\n\nCorollary 2 If $R(\\cdot)$ is the group-sparse norm, for sub-exponential design matrix $X$ and noise $\\omega$,\n\n$E[R^*(\\frac{1}{n} X^T (y - X\\theta^*))] \\le \\frac{\\eta_0}{\\sqrt{n}} \\sqrt{(m + \\log N_G) \\log p}$.   (16)\n\n4.2 The RE condition\n\nFor Gaussian and sub-Gaussian $X$, previous work has established RIP bounds of the form $\\kappa_1 \\le \\inf_{u \\in A} \\frac{1}{\\sqrt{n}} \\|Xu\\|_2 \\le \\sup_{u \\in A} \\frac{1}{\\sqrt{n}} \\|Xu\\|_2 \\le \\kappa_2$. In particular, RIP is satisfied w.h.p. if the number of samples is of the order of the square of the Gaussian width of the error set, i.e., $O(w_g^2(A))$, which we will call the sub-Gaussian RE sample complexity bound. As we move to heavier tails, establishing such two-sided bounds requires assumptions on the boundedness of the Euclidean norm of the rows of $X$ [15, 17, 10]. On the other hand, analysis of only the lower bound requires very few assumptions on $X$. In particular, $\\|Xu\\|_2$ being the sum of random non-negative quantities, the lower bound should be satisfied even under very weak moment assumptions on $X$. Making these observations, [10, 17] develop arguments obtaining sub-Gaussian RE sample complexity bounds when the set $A$ is the unit sphere $S^{p-1}$, even for design matrices having only bounded fourth moments. Note that with such weak moment assumptions, a non-trivial non-asymptotic upper bound cannot be established. 
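Before turning to the RE condition, the width gap that drives Corollaries 1 and 2 can be checked numerically. For the unit $\\ell_1$ ball, $\\sup_{t \\in B_1} \\langle h, t \\rangle = \\|h\\|_\\infty$, so its Gaussian width is $E\\|g\\|_\\infty \\approx \\sqrt{2 \\log p}$ while its exponential width is $E\\|e\\|_\\infty \\approx \\log p$ (Lemma 1). The Monte Carlo sketch below is ours, not from the paper; the sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def width_l1_ball(sampler, p, reps=2000):
    # width of B_1 under h ~ sampler: E sup_{t in B_1} <h, t> = E ||h||_inf
    h = sampler((reps, p))
    return np.abs(h).max(axis=1).mean()

p = 1000
w_g = width_l1_ball(rng.standard_normal, p)                   # roughly sqrt(2 log p)
w_e = width_l1_ball(lambda size: rng.laplace(size=size), p)   # roughly log p
```

For $p = 1000$ this gives $w_g$ around 3.4 and $w_e$ around 7.5, consistent with the bound $w_e(T) \\le c \\, w_g(T) \\sqrt{\\log p}$ of Theorem 1.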
Our analysis for the RE condition essentially follows this premise and arguments from [10].\n\n4.2.1 A Bound Based on the Exponential Width\n\nWe obtain a sample complexity bound which depends on the exponential width of the error set $A$. The result stated below follows along similar lines to arguments made in [20], which in turn are based on arguments from [10, 14].\n\nTheorem 3 Let $X \\in R^{n \\times p}$ have independent isotropic sub-exponential rows. Let $A \\subseteq S^{p-1}$, $0 < \\xi < 1$, and let $c$ be a constant that depends on the sub-exponential norm $K = \\sup_{u \\in A} \\| |\\langle X, u \\rangle| \\|_{\\psi_1}$. Let $w_e(A)$ denote the exponential width of the set. Then for some $\\tau > 0$, with probability at least $(1 - \\exp(-\\tau^2/2))$,\n\n$\\inf_{u \\in A} \\|Xu\\|_2 \\ge c \\xi (1 - \\xi^2)^2 \\sqrt{n} - 4 w_e(A) - \\xi \\tau$.   (17)\n\nContrasting the result (17) with previous results for the sub-Gaussian case [2, 20], the dependence on $w_g(A)$ on the r.h.s. is replaced by $w_e(A)$, leading to a sample complexity bound that is a factor of $\\log p$ worse. The corollary below applies the result to the $\\ell_1$ norm. Note that results from [1] for the $\\ell_1$ norm show RIP bounds w.h.p. for the same number of samples.\n\nCorollary 3 For an $s$-sparse $\\theta^*$ and $\\ell_1$ norm regularization, if $n \\ge c \\cdot s \\log^2 p$, then with probability at least $(1 - \\exp(-\\tau^2/2))$ and constants $c, \\kappa$ depending on $\\xi$ and $\\tau$,\n\n$\\inf_{u \\in A} \\frac{1}{\\sqrt{n}} \\|Xu\\|_2 \\ge \\kappa$.   (18)\n\n4.2.2 A Bound Based on VC-Dimensions\n\nIn this section, we show a stronger, sub-Gaussian-order RE sample complexity result for sub-exponential $X$ with $\\ell_1$ and group-sparse regularization. The arguments follow along similar lines to [11, 10].\n\nTheorem 4 Let $X \\in R^{n \\times p}$ be a random matrix with isotropic random sub-exponential rows $X_i \\in R^p$. Let $A \\subseteq S^{p-1}$, $0 < \\xi < 1$, let $c$ be a constant that depends on the sub-exponential norm $K = \\sup_{u \\in A} \\| |\\langle X, u \\rangle| \\|_{\\psi_1}$, and define $\\beta = c(1 - \\xi^2)^2$. Let $w_e(A)$ denote the exponential width of the set $A$. Let $C_\\xi = \\{ I[|\\langle X_i, u \\rangle| > \\xi], u \\in A \\}$ be a VC-class with VC-dimension $VC(C_\\xi) \\le d$. For some suitable constant $c_1$, if $n \\ge c_1 (d/\\beta^2)$, then with probability at least $1 - \\exp(-\\eta_0 \\beta^2 n)$:\n\n$\\inf_{u \\in A} \\frac{1}{\\sqrt{n}} \\|Xu\\|_2 \\ge \\frac{c \\xi (1 - \\xi^2)^2}{2}$.   (19)\n\nConsider the case of the $\\ell_1$ norm. A consequence of the above result is that the RE condition is satisfied on the set $B = \\{u : \\|u\\|_0 = s_1\\} \\cap S^{p-1}$ for some $s_1 \\ge c \\cdot s$, where $c$ is a constant that will depend on the RE constant $\\kappa$, when $n$ is $O(s_1 \\log p)$. The argument follows from the fact that $B$ is a union of $\\binom{p}{s_1}$ spheres; the result is obtained by applying Theorem 4 to each sphere and using a union bound argument. 
The final step involves showing that the RE condition is satisfied on the error set $A$ if it is satisfied on $B$, using Maurey's empirical approximation argument [17, 18, 11].\n\nCorollary 4 For the set $A \\subseteq S^{p-1}$ which is the error set for the $\\ell_1$ norm, if $n \\ge c_2 s \\log(ep/s)/\\beta^2$ for some suitable constant $c_2$, then with probability at least $1 - \\exp(-\\eta_0 n \\beta^2) - \\frac{1}{w^{\\eta_1} p^{\\eta_1 - 1}}$, where $\\eta_0, \\eta_1, w > 1$ are constants, the following result holds for $\\kappa$ depending on the constant $\\xi$:\n\n$\\inf_{u \\in A} \\frac{1}{\\sqrt{n}} \\|Xu\\|_2 \\ge \\kappa$.   (20)\n\nEssentially the same arguments for the group-sparse norm lead to the following result:\n\nCorollary 5 For the set $A \\subseteq S^{p-1}$ which is the error set for the group-sparse norm, if $n \\ge c(m s_G + s_G \\log(e N_G / s_G))/\\beta^2$, then with probability at least $1 - \\exp(-\\eta_0 n \\beta^2) - \\frac{1}{w^{\\eta_1} N_G^{\\eta_1 - 1} m^{\\eta_1 - 1}}$, where $\\eta_0, \\eta_1, w > 1$ are constants, and for $\\kappa$ depending on the constant $\\xi$,\n\n$\\inf_{u \\in A} \\frac{1}{\\sqrt{n}} \\|Xu\\|_2 \\ge \\kappa$.   (21)\n\n4.3 Recovery Bounds for $\\ell_1$ and Group-Sparse Norms\n\nWe combine result (13) with the results obtained above for $\\lambda_n$ and $\\kappa$ for the $\\ell_1$ and group-sparse norms.\n\nCorollary 6 For the $\\ell_1$ norm, when $n \\ge c s \\log p$ for some constant $c$, with high probability:\n\n$\\|\\hat{\\Delta}_n\\|_2 \\le O(\\sqrt{s} \\log p / \\sqrt{n})$.   (22)\n\nCorollary 7 For the group-sparse norm, when $n \\ge c(m s_G + s_G \\log N_G)$ for some constant $c$, with high probability:\n\n$\\|\\hat{\\Delta}_n\\|_2 \\le O(\\sqrt{\\frac{s_G \\log p (m + \\log N_G)}{n}})$.   (23)\n\nBoth bounds are a factor of $\\sqrt{\\log p}$ worse than the corresponding bounds for the sub-Gaussian case. In terms of sample complexity, to obtain a constant-order error bound, $n$ should scale as $O(s \\log^2 p)$ for the $\\ell_1$ norm, instead of $O(s \\log p)$ for the sub-Gaussian case, and as $O(s_G \\log p (m + \\log N_G))$ for the group-sparse Lasso, instead of $O(s_G (m + \\log N_G))$ for the sub-Gaussian case.\n\n5 Experiments\n\nWe perform experiments on synthetic data to compare estimation errors for Gaussian and sub-exponential design matrices and noise, for both $\\ell_1$ and group-sparse norms. For $\\ell_1$ we run experiments with dimensionality $p = 300$ and sparsity level $s = 10$. For group-sparse norms we run experiments with dimensionality $p = 300$, maximum group size $m = 6$, $N_G = 50$ groups each of size 6, and 4 non-zero groups. For the design matrix $X$, in the Gaussian case we sample rows randomly from an isotropic Gaussian distribution, while for sub-exponential design matrices we sample each row of $X$ randomly from an isotropic extreme-value distribution.\n\nFigure 1: Probability of recovery in the noiseless case with increasing sample size. There is a sharp phase transition, and the curves overlap for Gaussian and sub-exponential designs.\n\nFigure 2: Estimation error $\\|\\hat{\\Delta}_n\\|_2$ vs. sample size for the $\\ell_1$ (left) and group-sparse norms (right). The curve for sub-exponential designs and noise decays more slowly than for Gaussians.\n\nThe number of samples $n$ in $X$ is incremented in steps of 10, with an initial starting value of 5. The noise $\\omega$ is sampled i.i.d. from the Gaussian and extreme-value distributions, with variance 1, for the Gaussian and sub-exponential cases respectively. For each sample size $n$, we repeat the procedure above 100 times, and all results reported in the plots are average values over the 100 runs. We report two sets of results. Figure 1 shows the percentage of successes vs. sample size for the noiseless case, when $y = X\\theta^*$. 
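The noiseless protocol of Figure 1 (basis pursuit under Gaussian vs. sub-exponential designs) can be sketched as follows. This is our reconstruction, not the authors' code: the dimensions are reduced for speed, and a Laplace design stands in for the extreme-value distribution.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    # min ||theta||_1 s.t. X theta = y, as an LP in (theta+, theta-) >= 0
    n, p = X.shape
    res = linprog(c=np.ones(2 * p), A_eq=np.hstack([X, -X]), b_eq=y,
                  bounds=(0, None), method='highs')
    return res.x[:p] - res.x[p:]

def recovered(design, n, p=80, s=5, seed=0):
    # one noiseless trial: success means exact recovery of an s-sparse theta*
    rng = np.random.default_rng(seed)
    X = design(rng, (n, p))
    theta_star = np.zeros(p)
    theta_star[:s] = 1.0
    theta_hat = basis_pursuit(X, X @ theta_star)
    return bool(np.max(np.abs(theta_hat - theta_star)) < 1e-4)

gauss = lambda rng, shape: rng.standard_normal(shape)
subexp = lambda rng, shape: rng.laplace(scale=1 / np.sqrt(2), size=shape)
```

Sweeping $n$ and averaging successes over repeated trials reproduces the phase-transition curves; with enough samples (e.g. $n = 60$ here), both designs recover exactly, consistent with the overlapping curves in Figure 1.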
A success in the noiseless case denotes exact recovery, which is possible when the RE condition is satisfied. Hence we expect the sample complexity for recovery to be of the order of the square of the Gaussian width for both Gaussian and extreme-value distributions, as validated by the plots in Figure 1. Figure 2 shows the average estimation error vs. the number of samples for the noisy case, when $y = X\\theta^* + \\omega$. The noise is added only for runs in which exact recovery was possible in the noiseless case; for example, when $n = 5$ we do not have any results in Figure 2, as even noiseless recovery is not possible. For each $n$, the estimation errors are average values over 100 runs. As seen in Figure 2, the error decay is slower for extreme-value distributions compared to the Gaussian case.\n\n6 Conclusions\n\nThis paper presents a unified framework for the analysis of non-asymptotic error and structured recovery in norm-regularized regression problems when the design matrix and noise are sub-exponential, essentially generalizing the corresponding analysis and results for the sub-Gaussian case. The main observation is that the dependence on the Gaussian width is replaced by the exponential width of suitable sets associated with the norm. Together with the result on the relationship between exponential and Gaussian widths, previous analysis techniques essentially carry over to the sub-exponential case. 
We also show that a stronger result holds for the RE condition for the Lasso and group-sparse Lasso problems. As future work, we will consider extending this stronger result on the RE condition to all norms.

Acknowledgements: This work was supported by NSF grants IIS-1447566, IIS-1447574, IIS-1422557, CCF-1451986, CNS-1314560, IIS-0953274, IIS-1029711, and by NASA grant NNX12AQ39A.