{"title": "The Impact of Regularization on High-dimensional Logistic Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 12005, "page_last": 12015, "abstract": "Logistic regression is commonly used for modeling dichotomous outcomes. In the classical setting, where the number of observations is much larger than the number of parameters, properties of the maximum likelihood estimator in logistic regression are well understood. Recently, Sur and Candes~\\cite{sur2018modern} have studied logistic regression in the high-dimensional regime, where the number of observations and parameters are comparable, and show, among other things, that the maximum likelihood estimator is biased. In the high-dimensional regime the underlying parameter vector is often structured (sparse, block-sparse, finite-alphabet, etc.) and so in this paper we study regularized logistic regression (RLR), where a convex regularizer that encourages the desired structure is added to the negative of the log-likelihood function. An advantage of RLR is that it allows parameter recovery even for instances where the (unconstrained) maximum likelihood estimate does not exist. We provide a precise analysis of the performance of RLR via the solution of a system of six nonlinear equations, through which any performance metric of interest (mean, mean-squared error, probability of support recovery, etc.) can be explicitly computed. Our results generalize those of Sur and Candes and we provide a detailed study for the cases of $\\ell_2^2$-RLR and sparse ($\\ell_1$-regularized) logistic regression. In both cases, we obtain explicit expressions for various performance metrics and can find the values of the regularizer parameter that optimizes the desired performance. 
The theory is validated by extensive numerical simulations across a range of parameter values and problem instances.", "full_text": "The Impact of Regularization on High-dimensional Logistic Regression

Fariborz Salehi, Ehsan Abbasi, and Babak Hassibi*

Department of Electrical Engineering, California Institute of Technology, Pasadena, CA, USA.

Abstract

Logistic regression is commonly used for modeling dichotomous outcomes. In the classical setting, where the number of observations is much larger than the number of parameters, properties of the maximum likelihood estimator in logistic regression are well understood. Recently, Sur and Candes [26] have studied logistic regression in the high-dimensional regime, where the number of observations and parameters are comparable, and show, among other things, that the maximum likelihood estimator is biased. In the high-dimensional regime the underlying parameter vector is often structured (sparse, block-sparse, finite-alphabet, etc.) and so in this paper we study regularized logistic regression (RLR), where a convex regularizer that encourages the desired structure is added to the negative of the log-likelihood function. An advantage of RLR is that it allows parameter recovery even for instances where the (unconstrained) maximum likelihood estimate does not exist. We provide a precise analysis of the performance of RLR via the solution of a system of six nonlinear equations, through which any performance metric of interest (mean, mean-squared error, probability of support recovery, etc.) can be explicitly computed. Our results generalize those of Sur and Candes and we provide a detailed study for the cases of $\ell_2^2$-RLR and sparse ($\ell_1$-regularized) logistic regression. In both cases, we obtain explicit expressions for various performance metrics and can find the values of the regularizer parameter that optimize the desired performance. 
The theory is validated by extensive numerical simulations across a range of parameter values and problem instances.

1 Introduction

Logistic regression is the most commonly used statistical model for predicting dichotomous outcomes [11]. It has been extensively employed in many areas of engineering and applied sciences, such as in the medical [3, 32] and social sciences [14]. As an example, in medical studies logistic regression can be used to predict the risk of developing a certain disease (e.g., diabetes) based on a set of observed characteristics from the patient (age, gender, weight, etc.).

Linear regression is a very useful tool for predicting a quantitative response. However, in many situations the response variable is qualitative (or categorical) and linear regression is no longer appropriate [12]. This is mainly due to the fact that least-squares relies on the assumption that the error components are independent and normally distributed. In categorical predictions, however, the error components are neither independent nor normally distributed [19].

In logistic regression we model the probability that the label, Y, belongs to a certain category. When no prior knowledge is available regarding the structure of the parameters, maximum likelihood is often used for fitting the model. 
Maximum likelihood estimation (MLE) is a special case of maximum a posteriori (MAP) estimation that assumes a uniform prior distribution on the parameters.

*This work was supported in part by the National Science Foundation under grants CNS-0932428, CCF-1018927, CCF-1423663 and CCF-1409204, by a grant from Qualcomm Inc., by a grant from Futurewei Inc., by NASA's Jet Propulsion Laboratory through the President and Director's Fund, and by King Abdullah University of Science and Technology.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In many applications in statistics, machine learning, signal processing, etc., the underlying parameter obeys some sort of structure (sparse, group-sparse, low-rank, finite-alphabet, etc.). For instance, in modern applications where the number of features far exceeds the number of observations, one typically enforces the solution to contain only a few non-zero entries. To exploit such structural information, inspired by the Lasso [31] algorithm for linear models, researchers have studied regularization methods for generalized linear models [24, 9]. From a statistical viewpoint, adding a regularization term provides a MAP estimate with a non-uniform prior distribution that has higher density on the set of structured solutions.

1.1 Prior work

Classical results in logistic regression mainly concern the regime where the sample size, n, is overwhelmingly larger than the feature dimension p. It can be shown that in the limit of large samples, when p is fixed and n → ∞, the maximum likelihood estimator provides an efficient estimate of the underlying parameter, i.e., an unbiased estimate with covariance matrix approaching the inverse of the Fisher information [34, 17]. However, in most modern applications in data science, the datasets often have a huge number of features, and therefore the assumption n/p ≫ 1 is not valid. Sur and Candes [5, 26, 27] have recently studied the performance of the maximum likelihood estimator for logistic regression in the regime where n is proportional to p. Their findings challenge the conventional wisdom, as they have shown that in the linear asymptotic regime the maximum likelihood estimate is not even unbiased. Their analysis provides the precise performance of the maximum likelihood estimator.

There have been many studies in the literature on the performance of regularized (penalized) logistic regression, where a regularizer is added to the negative log-likelihood function (a partial list includes [4, 13, 33]). These studies often require the underlying parameter to be heavily structured. For example, if the parameters are sparse, the sparsity is taken to be o(p). Furthermore, they provide orderwise bounds on the performance but do not give a precise characterization of the quality of the resulting estimate. A major advantage of adding a regularization term is that it allows for recovery of the parameter vector even in regimes where the maximum likelihood estimate does not exist (due to an insufficient number of observations).

1.2 Summary of contributions

In this paper, we study regularized logistic regression (RLR) for parameter estimation in high-dimensional logistic models. Inspired by recent advances in the performance analysis of M-estimators for linear models [7, 8, 28], we precisely characterize the asymptotic performance of the RLR estimate. Our characterization is through a system of six nonlinear equations in six unknowns, through whose solution all locally-Lipschitz performance measures, such as the mean, mean-squared error, probability of support recovery, etc., can be determined. In the special case when the regularization term is absent, our six nonlinear equations reduce to the three nonlinear equations reported in [26]. 
When the regularizer is quadratic in the parameters, the six equations also simplify to three. When the regularizer is the $\ell_1$ norm, which corresponds to the popular sparse logistic regression [15, 16], our equations can be expressed in terms of Q-functions, and quantities such as the probability of correct support recovery can be explicitly computed. Numerous numerical simulations validate the theoretical findings across a range of problem settings. To the extent of our knowledge, this is the first work that precisely characterizes the performance of regularized logistic regression in high dimensions.

For our analysis, we utilize the recently developed Convex Gaussian Min-max Theorem (CGMT) [29], which is a strengthened version of a classical Gaussian comparison inequality due to Gordon [10], and whose origins are in [25]. Previously, the CGMT has been successfully applied to derive the precise performance in a number of applications such as regularized M-estimators [28], analysis of the generalized lasso [18, 29], data detection in massive MIMO [1, 2, 30], and PhaseMax in phase retrieval [6, 23, 22].

2 Preliminaries

2.1 Notations

We gather here the basic notations that are used throughout this paper. $N(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$. $X \sim p_X$ implies that the random variable $X$ has density $p_X$. $\xrightarrow{P}$ and $\xrightarrow{d}$ represent convergence in probability and in distribution, respectively. Lowercase letters are reserved for vectors and uppercase letters for matrices. $1_d$ and $I_d$ respectively denote the all-one vector and the identity matrix in dimension $d$. For a vector $v$, $v_i$ denotes its $i$-th entry, and $\|v\|_p$ (for $p \ge 1$) is its $\ell_p$ norm, where we remove the subscript when $p = 2$. 
A function $f: \mathbb{R}^p \rightarrow \mathbb{R}$ is called (invariantly) separable if $f(w) = \sum_{i=1}^{p} \tilde{f}(w_i)$ for all $w \in \mathbb{R}^p$, where $\tilde{f}(\cdot)$ is a real-valued function. For a function $\Phi: \mathbb{R}^d \rightarrow \mathbb{R}$, the Moreau envelope associated with $\Phi(\cdot)$ is defined as,

  $M_\Phi(v, t) = \min_{x \in \mathbb{R}^d} \frac{1}{2t}\|v - x\|^2 + \Phi(x)$ ,   (1)

and the proximal operator is the solution to this optimization, i.e.,

  $\mathrm{Prox}_{t\Phi(\cdot)}(v) = \arg\min_{x \in \mathbb{R}^d} \frac{1}{2t}\|v - x\|^2 + \Phi(x)$ .   (2)

2.2 Mathematical Setup

Assume we have n samples from a logistic model with parameter $\beta^* \in \mathbb{R}^p$. Let $D = \{(x_i, y_i)\}_{i=1}^{n}$ denote the set of samples (a.k.a. the training data), where for $i = 1, 2, \ldots, n$, $x_i \in \mathbb{R}^p$ is the feature vector and the label $y_i \in \{0, 1\}$ is a Bernoulli random variable with,

  $\mathbb{P}[y_i = 1 \,|\, x_i] = \rho'(x_i^T \beta^*)$ , for $i = 1, 2, \ldots, n$ ,   (3)

where $\rho'(t) := \frac{e^t}{1 + e^t}$ is the standard logistic function. The goal is to compute an estimate for $\beta^*$ from the training data $D$. The maximum likelihood estimator, $\hat\beta_{ML}$, is defined as,

  $\hat\beta_{ML} = \arg\max_{\beta \in \mathbb{R}^p} \prod_{i=1}^{n} \mathbb{P}_\beta(y_i \,|\, x_i) = \arg\max_{\beta \in \mathbb{R}^p} \prod_{i=1}^{n} \frac{e^{y_i (x_i^T \beta)}}{1 + e^{x_i^T \beta}} = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} \rho(x_i^T \beta) - y_i (x_i^T \beta)$ ,   (4)

where $\rho(t) := \log(1 + e^t)$ is the link function, which has the standard logistic function as its derivative. The last optimization is simply minimization of the negative log-likelihood. This is a convex optimization program as the log-likelihood is concave with respect to $\beta$. As explained earlier in Section 1, in many interesting settings the underlying parameter possesses certain structure(s) (sparse, low-rank, finite-alphabet, etc.). 
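The model (3) and the maximum-likelihood fit (4) are easy to simulate. Below is a minimal numpy sketch, fitting the negative log-likelihood by plain gradient descent; the dimensions, step size, and iteration count are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 20                     # illustrative sizes; here n >> p

# Ground truth and data from the logistic model (3):
# x_i ~ N(0, I_p / p), y_i ~ Bernoulli(rho'(x_i^T beta*)).
beta_star = rng.normal(size=p)
X = rng.normal(size=(n, p)) / np.sqrt(p)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))   # rho'(t)
y = rng.binomial(1, sigmoid(X @ beta_star))

def neg_log_lik(beta):
    # Objective in (4): sum_i rho(x_i^T beta) - y_i x_i^T beta,
    # with rho(t) = log(1 + e^t) evaluated stably via logaddexp.
    z = X @ beta
    return np.sum(np.logaddexp(0.0, z) - y * z)

# Gradient descent on the convex negative log-likelihood.
beta_ml = np.zeros(p)
step = 0.05
for _ in range(500):
    beta_ml -= step * (X.T @ (sigmoid(X @ beta_ml) - y))

assert neg_log_lik(beta_ml) < neg_log_lik(np.zeros(p))
```

In this classical regime (n much larger than p) the fitted vector tracks $\beta^*$ closely, in contrast with the proportional regime studied in this paper.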
In order to exploit this structure we assume $f: \mathbb{R}^p \rightarrow \mathbb{R}$ is a convex function that measures the (so-called) "complexity" of the structured solution. We fit this model by the regularized maximum (binomial) likelihood defined as follows,

  $\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{n} \left[ \sum_{i=1}^{n} \rho(x_i^T \beta) - y_i (x_i^T \beta) \right] + \frac{\lambda}{p} f(\beta)$ .   (5)

Here, $\lambda \in \mathbb{R}^+$ is the regularization parameter that must be tuned properly. In this paper, we study the linear asymptotic regime in which the problem dimensions p, n grow to infinity at a proportional rate, $\delta := \frac{n}{p} > 0$. Our main result characterizes the performance of $\hat\beta$ in terms of the ratio, $\delta$, and the signal strength, $\kappa = \frac{\|\beta^*\|}{\sqrt{p}}$. For our analysis we assume that the regularizer $f(\cdot)$ is separable, $f(w) = \sum_i \tilde{f}(w_i)$, and that the data points are drawn independently from the Gaussian distribution, $x_i \overset{i.i.d.}{\sim} N(0, \frac{1}{p} I_p)$. We further assume that the entries of $\beta^*$ are drawn from a distribution $\Pi$. Our main result characterizes the performance of the resulting estimator through the solution of a system of six nonlinear equations with six unknowns. In particular, we use the solution to compute some common descriptive statistics of the estimate, such as the mean and the variance.

3 Main Results

In this section, we present the main result of the paper, that is, the characterization of the asymptotic performance of regularized logistic regression (RLR). When the estimation performance is measured via a locally-Lipschitz function (e.g., the mean-squared error), Theorem 1 precisely predicts the asymptotic behavior of the error. 
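For a separable regularizer, the program (5) can be solved in practice with proximal gradient descent; the sketch below assumes the $\ell_1$ regularizer $f(\beta) = \|\beta\|_1$ and alternates a gradient step on the smooth logistic loss with the proximal map of the penalty (sizes, step size, and $\lambda$ are illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 600, 120                      # illustrative: delta = n/p = 5
s = 0.1                              # fraction of non-zero entries
beta_star = rng.normal(size=p) * (rng.random(p) < s)

X = rng.normal(size=(n, p)) / np.sqrt(p)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
y = rng.binomial(1, sigmoid(X @ beta_star))

def soft_threshold(x, t):
    # Proximal operator of t * |.| (the shrinkage function).
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def objective(b, lam):
    # Objective of (5) with f = l1 norm.
    z = X @ b
    return np.mean(np.logaddexp(0.0, z) - y * z) + lam / p * np.abs(b).sum()

lam, step = 0.05, 50.0               # illustrative tuning
beta_hat = np.zeros(p)
for _ in range(300):
    grad = X.T @ (sigmoid(X @ beta_hat) - y) / n
    beta_hat = soft_threshold(beta_hat - step * grad, step * lam / p)

assert objective(beta_hat, lam) < objective(np.zeros(p), lam)
```

The same loop covers any separable $f$ whose proximal map is available in closed form; only `soft_threshold` changes.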
The derived expression captures the role of the regularizer, $f(\cdot)$, and the particular distribution of $\beta^*$, through a set of scalars derived by solving a system of nonlinear equations. In Section 3.1 we present this system of nonlinear equations along with some insights on how to numerically compute its solution. After formally stating our result in Section 3.2, we use it to predict the general behavior of $\hat\beta$. In particular, in Section 3.3 we compute its correlation with the true signal as well as its mean-squared error.

3.1 A nonlinear system of equations

As we will see in Theorem 1, given the signal strength $\kappa$ and the ratio $\delta$, the asymptotic performance of RLR is characterized by the solution to the following system of nonlinear equations with six unknowns $(\alpha, \sigma, \gamma, \theta, \tau, r)$:

  $\kappa^2 \alpha = \mathbb{E}\left[ \beta\, \mathrm{Prox}_{\lambda\sigma\tau \tilde{f}(\cdot)}\big( \sigma\tau (\theta\beta + \tfrac{r}{\sqrt\delta} Z) \big) \right]$ ,
  $\gamma = \frac{1}{r\sqrt\delta}\, \mathbb{E}\left[ Z\, \mathrm{Prox}_{\lambda\sigma\tau \tilde{f}(\cdot)}\big( \sigma\tau (\theta\beta + \tfrac{r}{\sqrt\delta} Z) \big) \right]$ ,
  $\kappa^2 \alpha^2 + \sigma^2 = \mathbb{E}\left[ \mathrm{Prox}_{\lambda\sigma\tau \tilde{f}(\cdot)}\big( \sigma\tau (\theta\beta + \tfrac{r}{\sqrt\delta} Z) \big)^2 \right]$ ,
  $\gamma^2 = \frac{2}{r^2}\, \mathbb{E}\left[ \rho'(-\kappa Z_1) \big( \kappa\alpha Z_1 + \sigma Z_2 - \mathrm{Prox}_{\gamma\rho(\cdot)}(\kappa\alpha Z_1 + \sigma Z_2) \big)^2 \right]$ ,
  $\theta\gamma = -2\, \mathbb{E}\left[ \rho''(-\kappa Z_1)\, \mathrm{Prox}_{\gamma\rho(\cdot)}(\kappa\alpha Z_1 + \sigma Z_2) \right]$ ,
  $1 - \frac{\gamma}{\sigma\tau} = \mathbb{E}\left[ \frac{2\rho'(-\kappa Z_1)}{1 + \gamma\rho''\big( \mathrm{Prox}_{\gamma\rho(\cdot)}(\kappa\alpha Z_1 + \sigma Z_2) \big)} \right]$ .   (6)

Here $Z, Z_1, Z_2$ are standard normal variables, and $\beta \sim \Pi$, where $\Pi$ denotes the distribution on the entries of $\beta^*$. The following remarks provide some insights on solving the nonlinear system.

Remark 1 (Proximal Operators). It is worth noting that the equations in (6) include the expectation of functionals of two proximal operators. The first three equations are in terms of $\mathrm{Prox}_{\tilde{f}(\cdot)}$, which can be computed explicitly for most widely used regularizers. For instance, in $\ell_1$-regularization, the proximal operator is the well-known shrinkage function defined as $\eta(x, t) := \frac{x}{|x|}(|x| - t)_+$. The remaining equations depend on computing the proximal operator of the link function $\rho(\cdot)$. For $x \in \mathbb{R}$, $\mathrm{Prox}_{t\rho(\cdot)}(x)$ is the unique solution of $z + t\rho'(z) = x$.

Remark 2 (Numerical Evaluation). Define $v := [\alpha, \sigma, \gamma, \theta, \tau, r]^T$ as the vector of unknowns. The nonlinear system (6) can be reformulated as $v = S(v)$ for a properly defined $S: \mathbb{R}^6 \rightarrow \mathbb{R}^6$. We have empirically observed in our numerical simulations that a fixed-point iterative method, $v_{t+1} = S(v_t)$, converges to $v^*$ such that $v^* = S(v^*)$.

3.2 Asymptotic performance of regularized logistic regression

We are now able to present our main result. Theorem 1 below describes the average behavior of the entries of $\hat\beta$, the solution of the RLR. The derived expression is in terms of the solution of the nonlinear system (6), denoted by $(\bar\alpha, \bar\sigma, \bar\gamma, \bar\theta, \bar\tau, \bar{r})$. An informal statement of our result is that as $n \rightarrow \infty$, the entries of $\hat\beta$ converge as follows,

  $\hat\beta_j \xrightarrow{d} \Gamma(\beta^*_j, Z)$ , for $j = 1, 2, \ldots, p$ ,   (7)

where $Z$ is a standard normal random variable, and $\Gamma: \mathbb{R}^2 \rightarrow \mathbb{R}$ is defined as,

  $\Gamma(c, d) := \mathrm{Prox}_{\lambda\bar\sigma\bar\tau \tilde{f}(\cdot)}\big( \bar\sigma\bar\tau (\bar\theta c + \tfrac{\bar{r}}{\sqrt\delta} d) \big)$ .   (8)

In other words, the RLR solution has the same behavior as applying the proximal operator to the "perturbed signal", i.e., the true signal with added Gaussian noise.

Theorem 1. Consider the optimization program (5), where for $i = 1, 2, \ldots, n$, $x_i$ has the multivariate Gaussian distribution $N(0, \frac{1}{p} I_p)$, $y_i \sim \mathrm{Ber}\big(\rho'(x_i^T \beta^*)\big)$, and the entries of $\beta^*$ are drawn independently from a distribution $\Pi$. Assume the parameters $\delta$, $\kappa$, and $\lambda$ are such that the nonlinear system (6) has a unique solution $(\bar\alpha, \bar\sigma, \bar\gamma, \bar\theta, \bar\tau, \bar{r})$. Then, as $p \rightarrow \infty$, for any locally-Lipschitz² function $\Psi: \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}$, we have,

  $\frac{1}{p} \sum_{j=1}^{p} \Psi(\hat\beta_j, \beta^*_j) \xrightarrow{P} \mathbb{E}\left[ \Psi\big( \Gamma(\beta, Z), \beta \big) \right]$ ,   (9)

where $Z \sim N(0, 1)$, $\beta \sim \Pi$ is independent of $Z$, and the function $\Gamma(\cdot, \cdot)$ is defined in (8).

²A function $\Phi: \mathbb{R}^d \rightarrow \mathbb{R}$ is said to be locally-Lipschitz if, $\forall M > 0$, $\exists L_M \ge 0$ such that $\forall x, y \in [-M, +M]^d$: $|\Phi(x) - \Phi(y)| \le L_M \|x - y\|$ .

We defer the detailed proof to the Appendix. In short, to show this result we first represent the optimization as a bilinear form $u^T X v$, where $X$ is the measurement matrix. Applying the CGMT to derive an equivalent optimization, we then simplify this optimization to obtain an unconstrained optimization with six scalar variables. 
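Remark 1's characterization of $\mathrm{Prox}_{t\rho(\cdot)}(x)$ as the unique root of $z + t\rho'(z) = x$ can be evaluated with a few Newton steps; a small sketch (the iteration count is an arbitrary safe choice):

```python
import numpy as np

def rho_prime(z):
    # rho'(z) = e^z / (1 + e^z), the standard logistic function.
    return 1.0 / (1.0 + np.exp(-z))

def rho_second(z):
    s = rho_prime(z)
    return s * (1.0 - s)

def prox_rho(x, t, iters=50):
    # Newton's method on h(z) = z + t*rho'(z) - x.  h is strictly
    # increasing with h'(z) = 1 + t*rho''(z) >= 1, so the root is unique.
    z = x
    for _ in range(iters):
        z -= (z + t * rho_prime(z) - x) / (1.0 + t * rho_second(z))
    return z

z = prox_rho(2.0, 3.0)
assert abs(z + 3.0 * rho_prime(z) - 2.0) < 1e-8
```

This scalar solver is what makes the last three equations of (6) cheap to evaluate by Monte Carlo over $(Z_1, Z_2)$.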
The nonlinear system (6) represents the first-order optimality conditions of the resulting scalar optimization. Before stating the consequences of this result, a few remarks are in order.

Remark 3 (Assumptions). The assumptions in Theorem 1 are chosen in a conservative manner. In particular, we could relax the separability condition on $f(\cdot)$ to some milder condition in terms of asymptotic convergence of its proximal operator. Furthermore, one can relax the assumption of the entries of $\beta^*$ being i.i.d. to a weaker assumption on the empirical distribution of its entries. However, for the applications of this paper, the theorem in its current form is adequate.

Remark 4 (Choosing $\Psi$). The performance measure in Theorem 1 is computed in terms of the evaluation of a locally-Lipschitz function, $\Psi(\cdot, \cdot)$. As an example, $\Psi(u, v) = (u - v)^2$ can be used to compute the mean-squared error. Later on, we will appeal to this theorem with various choices of $\Psi$ to evaluate different performance measures on $\hat\beta$.

3.3 Correlation and variance of the RLR estimate

As the first application of Theorem 1, we compute common descriptive statistics of the estimate $\hat\beta$. In the following corollaries, we establish that the parameters $\bar\alpha$ and $\bar\sigma$ in (6) correspond to the correlation and the mean-squared error of the resulting estimate.

Corollary 1. As $p \rightarrow \infty$, $\frac{1}{\|\beta^*\|^2} \hat\beta^T \beta^* \xrightarrow{P} \bar\alpha$ .

Proof. Recall that $\|\beta^*\|^2 = p\kappa^2$. 
Applying Theorem 1 with $\Psi(u, v) = uv$ gives,

  $\frac{1}{\|\beta^*\|^2} \hat\beta^T \beta^* = \frac{1}{\kappa^2 p} \sum_{j=1}^{p} \hat\beta_j \beta^*_j \xrightarrow{P} \frac{1}{\kappa^2}\, \mathbb{E}\left[ \beta\, \mathrm{Prox}_{\lambda\bar\sigma\bar\tau \tilde{f}(\cdot)}\big( \bar\sigma\bar\tau (\bar\theta\beta + \tfrac{\bar{r}}{\sqrt\delta} Z) \big) \right] = \bar\alpha$ ,   (10)

where the last equality is derived from the first equation in the nonlinear system (6), along with the fact that $(\bar\alpha, \bar\sigma, \bar\gamma, \bar\theta, \bar\tau, \bar{r})$ is a solution to this system.

Corollary 1 states that upon centering $\hat\beta$ around $\bar\alpha\beta^*$, it becomes decorrelated from $\beta^*$. Therefore, we define a new estimate $\tilde\beta := \frac{\hat\beta}{\bar\alpha}$ and compute its mean-squared error in the following corollary.

Corollary 2. As $p \rightarrow \infty$, $\frac{1}{p} \|\tilde\beta - \beta^*\|^2 \xrightarrow{P} \frac{\bar\sigma^2}{\bar\alpha^2}$ .

Proof. We appeal to Theorem 1 with $\Psi(u, v) = (u - \bar\alpha v)^2$,

  $\frac{1}{p} \|\tilde\beta - \beta^*\|^2 = \frac{1}{\bar\alpha^2} \big( \frac{1}{p} \|\hat\beta - \bar\alpha\beta^*\|^2 \big) \xrightarrow{P} \frac{1}{\bar\alpha^2}\, \mathbb{E}\left[ \big( \mathrm{Prox}_{\lambda\bar\sigma\bar\tau \tilde{f}(\cdot)}\big( \bar\sigma\bar\tau (\bar\theta\beta + \tfrac{\bar{r}}{\sqrt\delta} Z) \big) - \bar\alpha\beta \big)^2 \right] = \frac{\bar\sigma^2}{\bar\alpha^2}$ ,   (11)

where the last equality is derived from the third equation in the nonlinear system (6) together with the result of Corollary 1.

In the next two sections, we investigate other properties of the estimate $\hat\beta$ under $\ell_1$ and $\ell_2$ regularization.

4 RLR with $\ell_2^2$-regularization

The $\ell_2^2$ regularization is commonly used in machine learning applications to stabilize the model. Adding this regularization simply shrinks all the parameters toward the origin and hence decreases the variance of the resulting model. Here, we provide a precise performance analysis of RLR with $\ell_2^2$-regularization, i.e.,

  $\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{n} \left[ \sum_{i=1}^{n} \rho(x_i^T \beta) - y_i (x_i^T \beta) \right] + \frac{\lambda}{2p} \sum_{i=1}^{p} \beta_i^2$ .   (12)

To analyze (12), we use the result of Theorem 1. It can be shown that in the nonlinear system (6), $\bar\theta$, $\bar\tau$, $\bar{r}$ can be derived explicitly from solving the first three equations. 
This is due to the fact that the proximal operator of $\tilde{f}(\cdot) = \frac{1}{2}(\cdot)^2$ can be expressed in the following closed form,

  $\mathrm{Prox}_{t\tilde{f}(\cdot)}(x) = \arg\min_{y \in \mathbb{R}} \frac{1}{2t}(y - x)^2 + \frac{1}{2} y^2 = \frac{x}{1 + t}$ .   (13)

[Figure 1: The performance of the regularized logistic regression under the $\ell_2^2$ penalty: (a) the correlation factor $\bar\alpha$, (b) the variance $\bar\sigma^2$, and (c) the mean-squared error $\frac{1}{p}\|\hat\beta - \beta^*\|^2$. The dashed lines depict the theoretical result derived from Theorem 2, and the dots are the result of empirical simulations. The empirical results are the average over 100 independent trials with p = 250 and κ = 1.]

This indicates that the proximal operator in this case is just a simple rescaling. Substituting (13) in the nonlinear system (6), we can rewrite the first three equations as follows,

  $\theta = \frac{\alpha}{\gamma\delta}$ ,  $\tau = \frac{\delta\gamma}{\sigma(1 - \lambda\delta\gamma)}$ ,  $r = \frac{\sigma}{\gamma\sqrt\delta}$ .   (14)

Therefore we can state the following theorem for $\ell_2^2$-regularization:

Theorem 2. Consider the optimization (12) with parameters $\kappa$, $\delta$, and $\lambda$, and the same assumptions as in Theorem 1. As $p \rightarrow \infty$, for any locally-Lipschitz function $\Psi(\cdot, \cdot)$, the following convergence holds,

  $\frac{1}{p} \sum_{j=1}^{p} \Psi(\hat\beta_j - \bar\alpha\beta^*_j, \beta^*_j) \xrightarrow{P} \mathbb{E}\left[ \Psi\big( \bar\sigma Z, \beta \big) \right]$ ,   (15)

where $Z$ is standard normal, $\beta \sim \Pi$, and $(\bar\alpha, \bar\sigma, \bar\gamma)$ is the unique solution to the following nonlinear system of equations,

  $\frac{\sigma^2}{2\delta} = \mathbb{E}\left[ \rho'(-\kappa Z_1) \big( \kappa\alpha Z_1 + \sigma Z_2 - \mathrm{Prox}_{\gamma\rho(\cdot)}(\kappa\alpha Z_1 + \sigma Z_2) \big)^2 \right]$ ,
  $-\frac{\alpha}{2\delta} = \mathbb{E}\left[ \rho''(-\kappa Z_1)\, \mathrm{Prox}_{\gamma\rho(\cdot)}(\kappa\alpha Z_1 + \sigma Z_2) \right]$ ,
  $1 - \frac{1}{\delta} + \lambda\gamma = \mathbb{E}\left[ \frac{2\rho'(-\kappa Z_1)}{1 + \gamma\rho''\big( \mathrm{Prox}_{\gamma\rho(\cdot)}(\kappa\alpha Z_1 + \sigma Z_2) \big)} \right]$ .   (16)

The proof is deferred to the Appendix. Theorem 2 states that upon centering the estimate $\hat\beta$, it becomes decorrelated from $\beta^*$ and the distribution of its entries approaches a zero-mean Gaussian distribution with variance $\bar\sigma^2$.

Figure 1 depicts the performance of the regularized estimate for different values of $\lambda$. As observed in the figure, increasing the value of $\lambda$ reduces the correlation factor $\bar\alpha$ (Figure 1a) and the variance $\bar\sigma^2$ (Figure 1b). Figure 1c shows the mean-squared error of the estimate as a function of $\lambda$. It indicates that for different values of $\delta$ there exists an optimal value $\lambda_{\mathrm{opt}}$ that achieves the minimum mean-squared error.

4.1 Unstructured case

When $\lambda = 0$ in (12), we obtain the optimization with no regularization, i.e., the maximum likelihood estimate. 
When we set $\lambda$ to zero in (16), Theorem 2 gives the same result as Sur and Candes reported in [26]. In their analysis, they have also provided an interesting interpretation of $\bar\gamma$ in terms of the likelihood ratio statistics. Studying the likelihood ratio test is beyond the scope of this paper.

5 Sparse Logistic Regression

In this section we study the performance of our estimate when the regularizer is the $\ell_1$ norm. In modern machine learning applications the number of features, p, is often overwhelmingly large. Therefore, to avoid overfitting one typically needs to perform feature selection, that is, to exclude irrelevant variables from the regression model [12]. Adding an $\ell_1$ penalty to the loss function is the most popular approach for feature selection. As a natural consequence of the result of Theorem 1, we study the performance of RLR with an $\ell_1$ regularizer (referred to as "sparse LR") and evaluate its success in the recovery of sparse signals. In Section 5.1, we extend our general analysis to the case of sparse LR. In other words, we will precisely analyze the performance of the solution of the following optimization,

  $\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{n} \left[ \sum_{i=1}^{n} \rho(x_i^T \beta) - y_i (x_i^T \beta) \right] + \frac{\lambda}{p} \|\beta\|_1$ .   (17)

In Section 5.1, we explicitly describe the expectations in the nonlinear system (6) using two Q-functions³. In Section 5.2, we analyze the support recovery of the resulting estimate and show that the two Q-functions represent the probabilities of on- and off-support recovery.

³The Q-function is the tail distribution of the standard normal random variable, defined as $Q(t) := \int_t^\infty \frac{e^{-x^2/2}}{\sqrt{2\pi}}\, dx$ .

5.1 Convergence behavior of sparse LR

For our analysis in this section, we assume each entry $\beta^*_i$, for $i = 1, \ldots, p$, is sampled i.i.d. from the distribution,

  $\Pi(\beta) = (1 - s) \cdot \delta_0(\beta) + s \cdot \frac{\phi\big(\frac{\beta}{\kappa/\sqrt{s}}\big)}{\kappa/\sqrt{s}}$ ,   (18)

where $s \in (0, 1)$ is the sparsity factor, $\phi(t) := \frac{e^{-t^2/2}}{\sqrt{2\pi}}$ is the density of the standard normal distribution, and $\delta_0(\cdot)$ is the Dirac delta function. In other words, the entries of $\beta^*$ are zero with probability $1 - s$, and the non-zero entries have a Gaussian distribution with appropriately defined variance. Although our analysis can be extended further, here we only present the result for a Gaussian distribution on the non-zero entries. The proximal operator of $\tilde{f}(\cdot) = |\cdot|$ is the soft-thresholding operator defined as $\eta(x, t) = \frac{x}{|x|}(|x| - t)_+$. Therefore, we are able to explicitly compute the expectations with respect to $\tilde{f}(\cdot)$ in the nonlinear system (6). To streamline the presentation, we define the following two proxies,

  $t_1 = \frac{\lambda}{\sqrt{\frac{r^2}{\delta} + \frac{\theta^2\kappa^2}{s}}}$ ,  $t_2 = \frac{\lambda}{r/\sqrt\delta}$ .   (19)

In the next section, we provide an interpretation for $t_1$ and $t_2$. In particular, we will show that $Q(\bar{t}_1)$ and $Q(\bar{t}_2)$ are related to the probabilities of on- and off-support recovery. We can rewrite the first three equations in (6) as follows,

  $\frac{\alpha}{2\sigma\tau} = \theta \cdot Q(t_1)$ ,
  $\frac{\delta\gamma}{2\sigma\tau} = s \cdot Q(t_1) + (1 - s) \cdot Q(t_2)$ ,
  $\frac{\kappa^2\alpha^2 + \sigma^2}{2\sigma^2\tau^2} = \frac{\delta\gamma\lambda^2}{2\sigma\tau} + \frac{\gamma r^2}{2\sigma\tau} + \kappa^2\theta^2 \cdot Q(t_1) - \lambda^2 \Big( s \cdot \frac{\phi(t_1)}{t_1} + (1 - s) \cdot \frac{\phi(t_2)}{t_2} \Big)$ .   (20)

[Figure 2: The performance of the regularized logistic regression under the $\ell_1$ penalty: (a) the correlation factor $\bar\alpha$, (b) the variance $\bar\sigma^2$, and (c) the mean-squared error $\frac{1}{p}\|\hat\beta - \beta^*\|^2$. The dashed lines are the theoretical result derived from Theorem 1, and the dots are the result of empirical simulations. For the numerical simulations, the result is the average over 100 independent trials with p = 250 and κ = 1.]

Appending the three equations in (20) to the last three equations in (6) gives the nonlinear system for sparse LR. Upon solving these equations, we can use the result of Theorem 1 to compute various performance measures on the estimate $\hat\beta$. Figure 2 shows the performance of our estimate as a function of $\lambda$. It can be seen that the bound derived from our theoretical result matches the empirical simulations. Also, it can be inferred from Figure 2c that the optimal value of $\lambda$ ($\lambda_{\mathrm{opt}}$, which achieves the minimum mean-squared error) is a decreasing function of $\delta$.

5.2 Support recovery

In this section, we study support recovery in sparse LR. As mentioned earlier, sparse LR is often used when the underlying parameter has few non-zero entries. We define the support of $\beta^*$ as $\Omega := \{j \,|\, 1 \le j \le p,\ \beta^*_j \neq 0\}$. Here, we would like to compute the probability of success in recovering the support of $\beta^*$. Let $\hat\beta$ denote the solution of the optimization (17). We fix a value $\epsilon > 0$ as a hard threshold based on which we decide whether an entry is on the support or not. 
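The hard-threshold rule just described is straightforward to compute; a sketch with a synthetic stand-in for the RLR estimate (the vector `beta_hat` below is a placeholder mimicking "scaled signal plus Gaussian noise", not the solution of (17); the sizes, noise level, and threshold are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
p, s, eps = 1000, 0.1, 0.001

# Placeholder ground truth with sparsity s, and a dense noisy
# stand-in for the regularized estimate.
beta_star = rng.normal(size=p) * (rng.random(p) < s)
beta_hat = 0.8 * beta_star + 0.05 * rng.normal(size=p)

on_support = beta_star != 0
declared_on = np.abs(beta_hat) > eps   # the hard-threshold support estimate

# Empirical false-alarm and miss rates of the support estimate.
E1 = np.mean(declared_on[~on_support])   # declared on, truly off
E2 = np.mean(~declared_on[on_support])   # declared off, truly on

assert 0.0 <= E1 <= 1.0 and 0.0 <= E2 <= 1.0
```

Because the placeholder estimate is dense, E1 comes out close to one here; an actual $\ell_1$-regularized solution sets many coordinates exactly to zero, which is what Lemma 1 quantifies in the limit.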
In other words, we form the following set as our estimate of the support given $\hat\beta$,

$$\hat\Omega = \{j \mid 1 \le j \le p,\ |\hat\beta_j| > \epsilon\}\,. \qquad (21)$$

In order to evaluate the success in support recovery, we define the following two error measures,

$$E_1(\epsilon) = \mathrm{Prob}\{j \in \hat\Omega \mid j \notin \Omega\}\,, \qquad E_2(\epsilon) = \mathrm{Prob}\{j \notin \hat\Omega \mid j \in \Omega\}\,. \qquad (22)$$

Figure 3: Support recovery in regularized logistic regression with the $\ell_1$ penalty: (a) $E_1$, the probability of false detection, and (b) $E_2$, the probability of missing an entry of the support. The dashed lines are the theoretical results derived from Lemma 1, and the dots are the results of empirical simulations. For the numerical simulations, the result is the average over 100 independent trials with $p = 250$, $\kappa = 1$, and $\epsilon = 0.001$.

In our estimation, $E_1$ represents the probability of false alarm, and $E_2$ the probability of misdetection of an entry of the support. The following lemma characterizes the asymptotic behavior of both errors as $\epsilon$ approaches zero.

Lemma 1 (Support Recovery). Let $\hat\beta$ be the solution to the optimization (17), and let the entries of $\beta^*$ have the distribution $\Pi$ defined in (18). Assume $\lambda$ is chosen such that the nonlinear system (6) has a unique solution $(\bar\alpha, \bar\sigma, \bar\gamma, \bar\theta, \bar\tau, \bar{r})$.
As $p \to \infty$ we have,

$$\lim_{\epsilon \downarrow 0} E_1(\epsilon) \xrightarrow{p} 2\,Q(\bar{t}_2)\,, \qquad \lim_{\epsilon \downarrow 0} E_2(\epsilon) \xrightarrow{p} 1 - 2\,Q(\bar{t}_1)\,, \qquad (23)$$

where $\bar{t}_1 = \frac{\lambda}{\sqrt{\frac{\bar{r}^2}{\delta} + \frac{\bar\theta^2\kappa^2}{s}}}$ and $\bar{t}_2 = \frac{\lambda}{\bar{r}/\sqrt{\delta}}$ are the proxies of (19) evaluated at the solution of the nonlinear system.

6 Conclusion and Future Directions

In this paper, we analyzed the performance of regularized logistic regression (RLR), which is often used for parameter estimation in binary classification. We considered the setting where the underlying parameter has certain structure (e.g., sparse, group-sparse, low-rank) that can be enforced via a convex penalty function $f(\cdot)$. We precisely characterized the performance of the regularized maximum likelihood estimator via the solution to a nonlinear system of equations. Our main results can be used to measure the performance of RLR for a general convex penalty function $f(\cdot)$. In particular, we applied our findings to two important special cases, namely $\ell_2^2$-RLR and $\ell_1$-RLR. When the regularizer is quadratic in the parameters, we showed that the nonlinear system can be simplified to three equations; when the regularization parameter $\lambda$ is set to zero, which corresponds to the maximum likelihood estimator, we recover the results reported by Sur and Candes [26]. For sparse logistic regression, we established that the nonlinear system can be represented using two Q-functions, and we further showed that these two Q-functions represent the probabilities of support recovery.

For our analysis, we assumed the datapoints are drawn independently from a Gaussian distribution and utilized the CGMT framework. An interesting future direction is to extend our analysis to non-Gaussian distributions.
To this end, one can exploit the techniques that have been used to establish universality laws (see [20, 21] and the references therein). As mentioned earlier in Section 1, an advantage of RLR is that it allows parameter recovery even for instances where the (unconstrained) maximum likelihood estimate does not exist. Therefore, another interesting future direction is to analyze the conditions on $\lambda$ (as a function of $\delta$ and $\kappa$) that guarantee the existence of the solution to the RLR optimization (5). In the unstructured setting, this has been studied in a recent work by Candes and Sur [5].

References

[1] Ehsan Abbasi, Fariborz Salehi, and Babak Hassibi. Performance analysis of convex data detection in MIMO. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4554-4558. IEEE, 2019.

[2] Ismail Ben Atitallah, Christos Thrampoulidis, Abla Kammoun, Tareq Y. Al-Naffouri, Babak Hassibi, and Mohamed-Slim Alouini. BER analysis of regularized least squares for BPSK recovery. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4262-4266. IEEE, 2017.

[3] Carl R. Boyd, Mary Ann Tolson, and Wayne S. Copes. Evaluating trauma care: The TRISS method. Trauma Score and the Injury Severity Score. The Journal of Trauma, 27(4):370-378, 1987.

[4] Florentina Bunea et al. Honest variable selection in linear and logistic regression models via $\ell_1$ and $\ell_1 + \ell_2$ penalization. Electronic Journal of Statistics, 2:1153-1194, 2008.

[5] Emmanuel J. Candès and Pragya Sur. The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression. arXiv preprint arXiv:1804.09753, 2018.

[6] Oussama Dhifallah, Christos Thrampoulidis, and Yue M. Lu. Phase retrieval via polytope optimization: Geometry, phase transitions, and new algorithms.
arXiv preprint arXiv:1805.09555, 2018.

[7] David Donoho and Andrea Montanari. High dimensional robust M-estimation: Asymptotic variance via approximate message passing. Probability Theory and Related Fields, 166(3-4):935-969, 2016.

[8] Noureddine El Karoui, Derek Bean, Peter J. Bickel, Chinghway Lim, and Bin Yu. On robust regression with high-dimensional predictors. Proceedings of the National Academy of Sciences, 110(36):14557-14562, 2013.

[9] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.

[10] Yehoram Gordon. Some inequalities for Gaussian processes and applications. Israel Journal of Mathematics, 50(4):265-289, 1985.

[11] David W. Hosmer Jr., Stanley Lemeshow, and Rodney X. Sturdivant. Applied Logistic Regression, volume 398. John Wiley & Sons, 2013.

[12] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning, volume 112. Springer, 2013.

[13] Sham Kakade, Ohad Shamir, Karthik Sridharan, and Ambuj Tewari. Learning exponential families in high-dimensions: Strong convexity and sparsity. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 381-388, 2010.

[14] Gary King and Langche Zeng. Logistic regression in rare events data. Political Analysis, 9(2):137-163, 2001.

[15] Kwangmoo Koh, Seung-Jean Kim, and Stephen Boyd. An interior-point method for large-scale $\ell_1$-regularized logistic regression.
Journal of Machine Learning Research, 8(Jul):1519-1555, 2007.

[16] Balaji Krishnapuram, Lawrence Carin, Mario A. T. Figueiredo, and Alexander J. Hartemink. Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):957-968, 2005.

[17] Erich L. Lehmann and Joseph P. Romano. Testing Statistical Hypotheses. Springer Science & Business Media, 2006.

[18] Léo Miolane and Andrea Montanari. The distribution of the lasso: Uniform control over sparse balls and adaptive parameter tuning. arXiv preprint arXiv:1811.01212, 2018.

[19] John Ashworth Nelder and Robert W. M. Wedderburn. Generalized linear models. Journal of the Royal Statistical Society: Series A (General), 135(3):370-384, 1972.

[20] Samet Oymak and Joel A. Tropp. Universality laws for randomized dimension reduction, with applications. Information and Inference: A Journal of the IMA, 7(3):337-446, 2017.

[21] Ashkan Panahi and Babak Hassibi. A universal analysis of large-scale regularized least squares solutions. In Advances in Neural Information Processing Systems, pages 3381-3390, 2017.

[22] Fariborz Salehi, Ehsan Abbasi, and Babak Hassibi. Learning without the phase: Regularized PhaseMax achieves optimal sample complexity. In Advances in Neural Information Processing Systems, pages 8641-8652, 2018.

[23] Fariborz Salehi, Ehsan Abbasi, and Babak Hassibi. A precise analysis of PhaseMax in phase retrieval. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 976-980. IEEE, 2018.

[24] Shirish Krishnaj Shevade and S. Sathiya Keerthi. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246-2253, 2003.

[25] Mihailo Stojnic. A framework to characterize performance of lasso algorithms.
arXiv preprint arXiv:1303.7291, 2013.

[26] Pragya Sur and Emmanuel J. Candès. A modern maximum-likelihood theory for high-dimensional logistic regression. arXiv preprint arXiv:1803.06964, 2018.

[27] Pragya Sur, Yuxin Chen, and Emmanuel J. Candès. The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square. Probability Theory and Related Fields, pages 1-72, 2017.

[28] Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassibi. Precise error analysis of regularized M-estimators in high dimensions. IEEE Transactions on Information Theory, 64(8):5592-5628, 2018.

[29] Christos Thrampoulidis, Samet Oymak, and Babak Hassibi. Regularized linear regression: A precise analysis of the estimation error. In Conference on Learning Theory, pages 1683-1709, 2015.

[30] Christos Thrampoulidis, Ilias Zadik, and Yury Polyanskiy. A simple bound on the BER of the MAP decoder for massive MIMO systems. arXiv preprint arXiv:1903.03949, 2019.

[31] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267-288, 1996.

[32] Jack V. Tu. Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes. Journal of Clinical Epidemiology, 49(11):1225-1231, 1996.

[33] Sara A. van de Geer et al. High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2):614-645, 2008.

[34] Aad W. van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.