{"title": "Alternating Estimation for Structured High-Dimensional Multi-Response Models", "book": "Advances in Neural Information Processing Systems", "page_first": 2838, "page_last": 2848, "abstract": "We consider the problem of learning high-dimensional multi-response linear models with structured parameters. By exploiting the noise correlations among different responses, we propose an alternating estimation (AltEst) procedure to estimate the model parameters based on the generalized Dantzig selector (GDS). Under suitable sample size and resampling assumptions, we show that the error of the estimates generated by AltEst, with high probability, converges linearly to certain minimum achievable level, which can be tersely expressed by a few geometric measures, such as Gaussian width of sets related to the parameter structure. To the best of our knowledge, this is the first non-asymptotic statistical guarantee for such AltEst-type algorithm applied to estimation with general structures.", "full_text": "Alternating Estimation for Structured\n\nHigh-Dimensional Multi-Response Models\n\nSheng Chen\nArindam Banerjee\nDept. of Computer Science & Engineering\n\nUniversity of Minnesota, Twin Cities\n\n{shengc,banerjee}@cs.umn.edu\n\nAbstract\n\nWe consider the problem of learning high-dimensional multi-response linear mod-\nels with structured parameters. By exploiting the noise correlations among different\nresponses, we propose an alternating estimation (AltEst) procedure to estimate\nthe model parameters based on the generalized Dantzig selector (GDS). Under\nsuitable sample size and resampling assumptions, we show that the error of the\nestimates generated by AltEst, with high probability, converges linearly to certain\nminimum achievable level, which can be tersely expressed by a few geometric\nmeasures, such as Gaussian width of sets related to the parameter structure. 
To the best of our knowledge, this is the first non-asymptotic statistical guarantee for such an AltEst-type algorithm applied to estimation with general structures.

1 Introduction

Multi-response (a.k.a. multivariate) linear models [2, 8, 20, 21] have found numerous applications in real-world problems, e.g., expression quantitative trait loci (eQTL) mapping in computational biology [28], land surface temperature prediction in climate informatics [17], neural semantic basis discovery in cognitive science [30], etc. Unlike the simple linear model, where each response is a scalar, in a multi-response model one obtains a response vector at each observation, given as a (noisy) linear combination of predictors, and the parameter (i.e., coefficient vector) to learn can be either response-specific (i.e., allowed to differ across responses) or shared by all responses. The multi-response model has been well studied in the context of multi-task learning [10], where each response is treated as a task. In recent years, the multi-task learning literature has largely focused on exploring the parameter structure across tasks via convex formulations [15, 3, 26]. Another line of work in multi-response modeling centers on exploiting the noise correlation among different responses [35, 36, 29, 40, 42], instead of assuming that the noise is independent for each response. To be specific, we consider the following multi-response linear model with m real-valued outputs,

\[ y_i = X_i\theta^* + \eta_i, \qquad \eta_i \sim \mathcal{N}(0, \Sigma_*) \tag{1} \]

where yi ∈ R^m is the response vector, Xi ∈ R^{m×p} consists of m p-dimensional feature vectors, and ηi ∈ R^m is a noise vector sampled from a multivariate zero-mean Gaussian distribution with covariance Σ∗. For simplicity, we assume Diag(Σ∗) = I_{m×m} throughout the paper. 
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

The m responses share the same underlying parameter θ∗ ∈ R^p, which corresponds to the so-called pooled model [19]. In fact, this seemingly restrictive setting is general enough to encompass the model with response-specific parameters, which can be realized by block-diagonalizing the rows of Xi and stacking all coefficient vectors into a "long" vector. Under the assumption of correlated noise, the true noise covariance structure Σ∗ is usually unknown; it is therefore typically necessary to estimate the parameter θ∗ along with the covariance Σ∗. In practice, we observe n data points, denoted by D = {(Xi, yi)}_{i=1}^{n}, and the maximum likelihood estimator (MLE) is simply

\[ \big(\hat{\theta}_{\mathrm{MLE}}, \hat{\Sigma}_{\mathrm{MLE}}\big) \;=\; \operatorname*{argmin}_{\theta \in \mathbb{R}^p,\ \Sigma \succ 0} \ \frac{1}{2}\log|\Sigma| \;+\; \frac{1}{2n}\sum_{i=1}^{n}\Big\|\Sigma^{-\frac{1}{2}}(y_i - X_i\theta)\Big\|_2^2 \tag{2} \]

Although convex w.r.t. either θ or Σ when the other is fixed, the optimization problem associated with the MLE is jointly non-convex in θ and Σ. A popular approach to such problems is alternating minimization (AltMin), i.e., alternately solving for θ (resp. Σ) while keeping Σ (resp. θ) fixed. The AltMin algorithm for (2) iteratively performs two simple steps: solving least squares for θ and computing the empirical noise covariance for Σ. Recent work [24] established the non-asymptotic error bound of this approach for (2), with a brief extension to the sparse parameter setting using the iterative hard thresholding method [25], but did not allow more general structure of the parameter. 
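As a concrete illustration of the two AltMin steps for (2), the following is a minimal numpy sketch for the unstructured case. The θ-step is a generalized least squares solve (ordinary least squares on Σ^{-1/2}-whitened data) and the Σ-step is the empirical covariance of the residuals; the synthetic data and all variable names here are our own, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, n = 5, 8, 200

# Ground truth: theta_star and a unit-diagonal noise covariance Sigma_star
theta_star = rng.standard_normal(p)
S = rng.standard_normal((m, m))
S = S @ S.T + 0.1 * np.eye(m)
d = np.sqrt(np.diag(S))
Sigma_star = S / np.outer(d, d)              # Diag(Sigma_star) = I

L = np.linalg.cholesky(Sigma_star)
X = rng.standard_normal((n, m, p))           # each X_i is an m x p block
y = X @ theta_star + rng.standard_normal((n, m)) @ L.T

Sigma = np.eye(m)                            # initialize with the identity
for t in range(5):
    # theta-step: generalized least squares with the current Sigma,
    # i.e. ordinary least squares on Sigma^{-1/2}-whitened rows
    W = np.linalg.cholesky(np.linalg.inv(Sigma))   # W @ W.T = Sigma^{-1}
    Xw = (W.T @ X).reshape(n * m, p)
    yw = (y @ W).reshape(n * m)
    theta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    # Sigma-step: empirical covariance of the residuals, as in the paper
    resid = y - X @ theta
    Sigma = resid.T @ resid / n

err = np.linalg.norm(theta - theta_star)
```

With n·m = 1000 whitened rows and p = 8, a few alternations suffice for the estimate to stabilize near θ∗ in this toy setup.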
Previous works [35, 29, 33] also considered regularized MLE approaches for multi-response models with sparse parameters, which are likewise solved by AltMin-type algorithms. Unfortunately, none of those works provide finite-sample statistical guarantees for their algorithms. The AltMin technique has also been applied to many other problems, such as matrix completion [23], sparse coding [1], and mixed linear regression [41], with provable performance guarantees. Despite the success of AltMin, most existing works are dedicated to recovering unstructured sparse or low-rank parameters, with little attention paid to general structures, e.g., overlapping sparsity [22], hierarchical sparsity [27], k-support sparsity [4], etc.

In this paper, we study the multi-response linear model in the high-dimensional setting, i.e., the sample size n is smaller than the problem dimension p, and the coefficient vector θ∗ is assumed to possess a general low-complexity structure, which can be essentially captured by a suitable norm ‖·‖ [5]. Structured estimation using norm regularization/minimization has been extensively studied for simple linear models over the past decade, and recent advances characterize the estimation error of convex approaches, including Lasso-type (regularized) [38, 31, 6] and Dantzig-type (constrained) estimators [7, 12, 14], via a few simple geometric measures, e.g., Gaussian width [18, 11] and restricted norm compatibility [31, 12]. Here we propose an alternating estimation (AltEst) procedure for finding the true parameters, which essentially alternates between estimating θ through the generalized Dantzig selector (GDS) [12] using the norm ‖·‖ and computing the approximate empirical noise covariance for Σ. Our analysis puts no restriction on the choice of norm, so the AltEst framework is applicable to general structures. 
In contrast to AltMin, our AltEst procedure cannot be cast as the minimization of a joint objective function over θ and Σ, and is thus conceptually more general than AltMin. For the proposed AltEst, we provide statistical guarantees for the iterate θ̂_t under the resampling assumption (see Section 2), which may justify the applicability of the AltEst technique to other problems lacking a joint objective for two sets of parameters. Specifically, we show that with overwhelming probability, the estimation error ‖θ̂_t − θ∗‖₂ for generally structured θ∗ converges linearly to a minimum achievable error, given sub-Gaussian design and moderate sample size. With a straightforward intuition, this minimum achievable error can be tersely expressed by the aforementioned geometric measures, which depend only on the structure of θ∗. Moreover, our analysis implies the error bound for single-response high-dimensional models as a by-product [12]. Note that the analysis in [24] focuses on the expected prediction error E[Σ∗^{-1/2} X(θ̂_t − θ∗)] for unstructured θ∗, which is related to but different from our ‖θ̂_t − θ∗‖₂ for generally structured θ∗. Compared with the error bound derived for unstructured θ∗ in [24], our result also yields a better dependency on sample size by removing a log n factor whose appearance seems unnatural.

The rest of the paper is organized as follows. We elaborate on our AltEst algorithm in Section 2, along with the resampling assumption. In Section 3, we present the statistical guarantees for AltEst. We provide experimental results in Section 4 to support our theoretical development. Finally, we conclude in Section 5. 
Due to space limitations, all proofs are deferred to the supplementary material.

2 Alternating Estimation for High-Dimensional Multi-Response Models

Given the high-dimensional setting for (1), it is natural to consider the regularized MLE for (1), obtained by adding the norm ‖·‖ to (2) to capture the structural information of θ∗ in (1),

\[ \big(\hat{\theta}, \hat{\Sigma}\big) \;=\; \operatorname*{argmin}_{\theta \in \mathbb{R}^p,\ \Sigma \succ 0} \ \frac{1}{2}\log|\Sigma| \;+\; \frac{1}{2n}\sum_{i=1}^{n}\Big\|\Sigma^{-\frac{1}{2}}(y_i - X_i\theta)\Big\|_2^2 \;+\; \gamma_n \|\theta\| \tag{3} \]

where γ_n is a tuning parameter. Using AltMin, the updates for (3) are given by

\[ \hat{\theta}_t \;=\; \operatorname*{argmin}_{\theta \in \mathbb{R}^p} \ \frac{1}{2n}\sum_{i=1}^{n}\Big\|\hat{\Sigma}_{t-1}^{-\frac{1}{2}}(y_i - X_i\theta)\Big\|_2^2 \;+\; \gamma_n \|\theta\| \tag{4} \]

\[ \hat{\Sigma}_t \;=\; \frac{1}{n}\sum_{i=1}^{n}\big(y_i - X_i\hat{\theta}_t\big)\big(y_i - X_i\hat{\theta}_t\big)^{T} \tag{5} \]

The update of θ̂_t amounts to solving a regularized least squares problem, and the new Σ̂_t is the approximate empirical covariance of the residuals evaluated at θ̂_t. In this work, we consider an alternative to (4), the generalized Dantzig selector (GDS) [12], which is given by

\[ \hat{\theta}_t \;=\; \operatorname*{argmin}_{\theta \in \mathbb{R}^p} \ \|\theta\| \quad \text{s.t.} \quad \Bigg\|\frac{1}{n}\sum_{i=1}^{n} X_i^T \hat{\Sigma}_{t-1}^{-1}(X_i\theta - y_i)\Bigg\|_* \;\le\; \gamma_n \tag{6} \]

where ‖·‖∗ is the dual norm of ‖·‖. Compared with (4), the GDS has nicer geometric properties, which are favorable for the statistical analysis. 
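For the L1 norm (dual norm L∞), the GDS step (6) is a linear program and can be solved with an off-the-shelf LP solver. The sketch below is our own illustration (assuming scipy is available; `gds_l1` is a hypothetical helper name, not from the paper): with A = (1/n)Σᵢ XᵢᵀΣ⁻¹Xᵢ and b = (1/n)Σᵢ XᵢᵀΣ⁻¹yᵢ, we minimize ‖θ‖₁ subject to ‖Aθ − b‖_∞ ≤ γ by splitting θ = u − v with u, v ≥ 0.

```python
import numpy as np
from scipy.optimize import linprog

def gds_l1(X_list, y_list, Sigma_inv, gamma):
    """L1-norm instance of the GDS: min ||theta||_1 s.t. ||A theta - b||_inf <= gamma,
    where A = (1/n) sum_i X_i^T Sigma^{-1} X_i, b = (1/n) sum_i X_i^T Sigma^{-1} y_i."""
    n, p = len(X_list), X_list[0].shape[1]
    A = sum(X.T @ Sigma_inv @ X for X in X_list) / n
    b = sum(X.T @ Sigma_inv @ y for X, y in zip(X_list, y_list)) / n
    # Split theta = u - v with u, v >= 0, and minimize 1^T u + 1^T v.
    c = np.ones(2 * p)
    # A(u - v) - b <= gamma  and  -(A(u - v) - b) <= gamma
    A_ub = np.block([[A, -A], [-A, A]])
    b_ub = np.concatenate([gamma + b, gamma - b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:]
```

When γ is admissible (θ∗ feasible), the solution's L1 norm can never exceed ‖θ∗‖₁, which is exactly the property the analysis of the GDS exploits.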
More importantly, since iteratively solving (6) followed by the covariance estimation (5) no longer jointly minimizes a specific objective function, the updates go beyond the scope of AltMin, leading to our broader alternating estimation (AltEst) framework, i.e., alternately estimating one parameter by a suitable approach while keeping the other fixed. For ease of exposition, we focus on the m ≤ n scenario, so that Σ̂_t can be easily computed in closed form as shown in (5). When m > n and Σ∗⁻¹ is sparse, it is beneficial to directly estimate Σ∗⁻¹ using more advanced estimators [16, 9]. In particular, the CLIME estimator [9] enjoys certain desirable properties and fits into our AltEst framework but not AltMin, and our AltEst analysis does not rely on the particular estimator used for the noise covariance or its inverse. The algorithmic details are given in Algorithm 1, for which it is worth noting that every iteration t uses independent fresh samples, D_{2t−1} and D_{2t}, in Steps 3 and 4, respectively. This assumption is known as resampling; it facilitates the theoretical analysis by removing the statistical dependency between iterates. Several existing works benefit from such an assumption when analyzing their AltMin-type algorithms [23, 32, 41]. Conceptually, resampling can be implemented by partitioning the whole dataset into 2T subsets, though it is unusual to do so in practice. Loosely speaking, AltEst (AltMin) with resampling is an approximation of the practical AltEst (AltMin) with a single dataset D used by all iterations. For AltMin, attempts have been made to directly analyze the practical version without resampling by studying the properties of the joint objective [37], which comes at the price of invoking highly sophisticated mathematical tools. This technique, however, might fail to work for AltEst since the procedure is not even associated with a joint objective. 
In the next section, we will leverage the resampling assumption to show that the error of θ̂_t generated by Algorithm 1 converges to a small value with high probability. We again emphasize that the AltEst framework may work with other suitable estimators for (θ∗, Σ∗), although (5) and (6) are the ones considered in our analysis.

Algorithm 1 Alternating Estimation with Resampling
Input: number of iterations T; datasets D₁ = {(Xi, yi)}_{i=1}^{n}, . . . , D_{2T} = {(Xi, yi)}_{i=(2T−1)n+1}^{2Tn}
1: Initialize Σ̂₀ = I_{m×m}
2: for t := 1 to T do
3:   Solve the GDS (6) for θ̂_t using dataset D_{2t−1}
4:   Compute Σ̂_t according to (5) using dataset D_{2t}
5: end for
6: return θ̂_T

3 Statistical Guarantees for Alternating Estimation

In this section, we establish the statistical guarantees for our AltEst algorithm. The road map for the analysis is to first derive error bounds separately for (5) and (6), and then combine them through the AltEst procedure to obtain the error bound for θ̂_t. Throughout the analysis, the design X is assumed to be centered, i.e., E[X] = 0_{m×p}. λmax(·) and λmin(·) denote the largest and smallest eigenvalues of a real symmetric matrix. Before presenting the results, we provide some basic but important concepts. First of all, we give the definition of a sub-Gaussian matrix X.

Definition 1 (Sub-Gaussian Matrix) X ∈ R^{m×p} is sub-Gaussian if the ψ₂-norm below is finite,

\[ |||X|||_{\psi_2} \;=\; \sup_{v \in S^{p-1},\ u \in S^{m-1}} \Big|\Big|\Big| v^T \Gamma_u^{-\frac{1}{2}} X^T u \Big|\Big|\Big|_{\psi_2} \;\le\; \kappa \;<\; +\infty \tag{7} \]

where Γ_u = E[X^T u u^T X]. Further, we assume there exist constants μmin and μmax such that

\[ 0 \;<\; \mu_{\min} \;\le\; \lambda_{\min}(\Gamma_u) \;\le\; \lambda_{\max}(\Gamma_u) \;\le\; \mu_{\max} \;<\; +\infty, \qquad \forall\, u \in S^{m-1} \tag{8} \]

The definition (7) is also used in earlier work [24], which assumes the left end of (8) implicitly. Lemma 1 gives an example of a sub-Gaussian X, showing that conditions (7) and (8) are reasonable.

Lemma 1 Assume that X ∈ R^{m×p} has dependent anisotropic rows such that X = Ξ^{1/2} X̃ Λ^{1/2}, where Ξ ∈ R^{m×m} encodes the dependency between rows, X̃ ∈ R^{m×p} has independent isotropic rows, and Λ ∈ R^{p×p} introduces the anisotropy. In this setting, if each row of X̃ satisfies |||x̃_i|||_{ψ₂} ≤ κ̃, then conditions (7) and (8) hold with κ = Cκ̃, μmin = λmin(Ξ)λmin(Λ), and μmax = λmax(Ξ)λmax(Λ).

The recovery guarantee of the GDS relies on an important notion called the restricted eigenvalue (RE). In the multi-response setting, it is defined jointly for the designs Xi and a noise covariance Σ as follows.

Definition 2 (Restricted Eigenvalue Condition) The designs X₁, X₂, . . .
, X_n and the covariance Σ together satisfy the restricted eigenvalue condition for a set A ⊆ S^{p−1} with parameter α > 0, if

\[ \inf_{v \in A} \ v^T \left( \frac{1}{n} \sum_{i=1}^{n} X_i^T \Sigma^{-1} X_i \right) v \;\ge\; \alpha \tag{9} \]

Apart from the RE condition, the analysis of the GDS is carried out on the premise that the tuning parameter γ_n is suitably selected, which we define as "admissible".

Definition 3 (Admissible Tuning Parameter) The γ_n for the GDS (6) is said to be admissible if γ_n is chosen such that θ∗ belongs to the constraint set, i.e.,

\[ \left\| \frac{1}{n} \sum_{i=1}^{n} X_i^T \Sigma^{-1} (X_i \theta^* - y_i) \right\|_* \;=\; \left\| \frac{1}{n} \sum_{i=1}^{n} X_i^T \Sigma^{-1} \eta_i \right\|_* \;\le\; \gamma_n \tag{10} \]

For structured estimation, one also needs to characterize the structural complexity of θ∗, and an appropriate choice is the Gaussian width [18]. For any set A ⊆ R^p, its Gaussian width is given by w(A) = E[sup_{u∈A} ⟨u, g⟩], where g ∼ N(0, I_{p×p}) is a standard Gaussian random vector. In the analysis, the sets A of interest typically rely on the structure of θ∗. Gaussian width has previously been applied to the statistical analysis of various problems [11, 6, 39], and recent works [34, 13] show that Gaussian width is computable for many structures. 
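The Gaussian width definition above lends itself to a direct Monte Carlo check. As our own illustration (not from the paper), for the L1 unit ball B₁ we have sup_{u∈B₁}⟨u, g⟩ = ‖g‖_∞, so w(B₁) = E‖g‖_∞, which is of order √(2 log p):

```python
import numpy as np

rng = np.random.default_rng(0)
p, nmc = 500, 4000

# w(B1) = E sup_{u in B1} <u, g> = E ||g||_inf for standard Gaussian g
g = rng.standard_normal((nmc, p))
w_hat = np.abs(g).max(axis=1).mean()        # Monte Carlo estimate of w(B1)
w_theory = np.sqrt(2 * np.log(p))           # classic O(sqrt(log p)) scale
```

The estimate tracks the √(2 log p) scale closely at p = 500, which is why the L1-ball width enters the bounds below only through a √(log p) factor.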
For the rest of the paper, we use C, C₀, C₁, and so on to denote universal constants, which may differ from context to context.

3.1 Estimation of the Coefficient Vector

In this subsection, we focus on estimating θ∗, i.e., Step 3 of Algorithm 1, using the GDS of the form

\[ \hat{\theta} \;=\; \operatorname*{argmin}_{\theta \in \mathbb{R}^p} \ \|\theta\| \quad \text{s.t.} \quad \left\| \frac{1}{n} \sum_{i=1}^{n} X_i^T \Sigma^{-1} (X_i\theta - y_i) \right\|_* \;\le\; \gamma_n \tag{11} \]

where Σ is an arbitrary but fixed input noise covariance matrix. The following lemma gives a deterministic error bound for θ̂ under the RE condition and an admissible γ_n, as defined in (9) and (10).

Lemma 2 Suppose the RE condition (9) is satisfied by X₁, . . . , X_n and Σ with α > 0 for the set A(θ∗) = cone{v | ‖θ∗ + v‖ ≤ ‖θ∗‖} ∩ S^{p−1}. If γ_n is admissible, θ̂ in (11) satisfies

\[ \big\|\hat{\theta} - \theta^*\big\|_2 \;\le\; \frac{2\Psi(\theta^*)\cdot\gamma_n}{\alpha} \tag{12} \]

in which Ψ(θ∗) is the restricted norm compatibility, defined as Ψ(θ∗) = sup_{v∈A(θ∗)} ‖v‖/‖v‖₂.

From Lemma 2, we see that the L2-norm error is mainly determined by three quantities: Ψ(θ∗), γ_n, and α. The restricted norm compatibility Ψ(θ∗) hinges purely on the geometric structure of θ∗ and ‖·‖, thus involving no randomness. In contrast, γ_n and α need to satisfy their own conditions, which must deal with the random Xi and ηi. 
The set A(θ∗) involved in the RE condition and the restricted norm compatibility has a relatively simple structure, which favors the derivation of error bounds for a variety of norms [13]. If the RE condition fails to hold, i.e., α = 0, the error bound is meaningless. Although the error is proportional to the user-specified γ_n, assigning an arbitrarily small value to γ_n may not be admissible. Hence, in order to further derive the recovery guarantees for the GDS, we need to verify the RE condition and find the smallest admissible value of γ_n.

Restricted Eigenvalue Condition: First, the following lemma characterizes the relation between the expectation and the empirical mean of X^T Σ^{-1} X.

Lemma 3 Given sub-Gaussian X ∈ R^{m×p} with its i.i.d. copies X₁, . . . , X_n, and covariance Σ ∈ R^{m×m} with eigenvectors u₁, . . . , u_m, let Γ = E[X^T Σ^{-1} X] and Γ̂ = (1/n) Σ_{i=1}^{n} X_i^T Σ^{-1} X_i. Define the set A_{Γ_j} for A ⊆ S^{p−1} and each Γ_j = E[X^T u_j u_j^T X] as A_{Γ_j} = {v ∈ S^{p−1} | Γ_j^{-1/2} v ∈ cone(A)}. If n ≥ C₁κ⁴ · max_j {w²(A_{Γ_j})}, then with probability at least 1 − m exp(−C₂n/κ⁴), we have

\[ v^T \hat{\Gamma} v \;\ge\; \frac{1}{2}\, v^T \Gamma v, \qquad \forall\, v \in A \tag{13} \]

Instead of w(A_{Γ_j}), ideally we want the condition on n above to be characterized by w(A), which is easier to compute in general. The next lemma accomplishes this goal.

Lemma 4 Let κ₀ be the ψ₂-norm of a standard Gaussian random vector and Γ_u = E[X^T u u^T X], where u ∈ S^{m−1} is fixed. For A_{Γ_u} defined in Lemma 3, we have

\[ w(A_{\Gamma_u}) \;\le\; C\kappa_0 \sqrt{\mu_{\max}/\mu_{\min}} \cdot \big(w(A) + 3\big) \tag{14} \]

Lemma 4 implies that the Gaussian width w(A_{Γ_j}) appearing in Lemma 3 is of the same order as w(A). 
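The quantities in Lemmas 1 and 3 can be sanity-checked numerically. The sketch below (our own illustration, with hypothetical variable names) draws X = Ξ^{1/2} X̃ Λ^{1/2} as in Lemma 1 with Gaussian isotropic rows, and verifies by Monte Carlo that Γ_u = E[XᵀuuᵀX] matches its closed form (uᵀΞu)·Λ in this model, so that its eigenvalues indeed sit between μmin = λmin(Ξ)λmin(Λ) and μmax = λmax(Ξ)λmax(Λ):

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, nmc = 4, 6, 20000

B = rng.standard_normal((m, m))
Xi = B @ B.T / m + 0.5 * np.eye(m)          # row-dependency matrix (PSD)
Lam = np.diag(np.linspace(0.5, 2.0, p))     # feature anisotropy

def msqrt(M):
    # symmetric PSD matrix square root via eigendecomposition
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(w)) @ V.T

Xi_h, Lam_h = msqrt(Xi), msqrt(Lam)
u = rng.standard_normal(m)
u /= np.linalg.norm(u)                      # fixed direction on S^{m-1}

# Lemma 1 model: X = Xi^{1/2} Xtil Lam^{1/2}, Xtil with isotropic Gaussian rows
Xtil = rng.standard_normal((nmc, m, p))
X = Xi_h @ Xtil @ Lam_h

# Monte Carlo estimate of Gamma_u = E[X^T u u^T X]; for this model it
# equals (u^T Xi u) * Lam, by independence and isotropy of the rows of Xtil
z = np.einsum('kmp,m->kp', X, u)
Gamma_hat = z.T @ z / nmc
Gamma_theory = (u @ Xi @ u) * Lam
```

Here λmin(Ξ) ≥ 0.5 and λmin(Λ) = 0.5 by construction, so the empirical λmin(Γ̂_u) stays well above μmin/2, matching the RE-style lower bounds.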
Putting Lemmas 3 and 4 together, we can obtain the RE condition for the analysis of the GDS.

Corollary 1 Under the notation of Lemmas 3 and 4, if n ≥ C₁κ₀²κ⁴ · (μmax/μmin) · (w(A) + 3)², then the following inequality holds for all v ∈ A ⊆ S^{p−1} with probability at least 1 − m exp(−C₂n/κ⁴),

\[ v^T \hat{\Gamma} v \;\ge\; \frac{\mu_{\min}}{2} \cdot \operatorname{Tr}\big(\Sigma^{-1}\big) \tag{15} \]

Admissible Tuning Parameter: Finding an admissible γ_n amounts to estimating the value of ‖(1/n) Σ_{i=1}^{n} X_i^T Σ^{-1} η_i‖∗ in (10), which involves the random Xi and ηi. The next lemma establishes a high-probability bound for this quantity, which can be viewed as the smallest "safe" choice of γ_n.

Lemma 5 Assume that Xi is sub-Gaussian and ηi ∼ N(0, Σ∗). The following inequality holds with probability at least 1 − exp(−nτ²/2) − C₂ exp(−C₁²w²(B)/(4ρ²)),

\[ \left\| \frac{1}{n} \sum_{i=1}^{n} X_i^T \Sigma^{-1} \eta_i \right\|_* \;\le\; \frac{C\kappa\sqrt{\mu_{\max}}}{\sqrt{n}} \cdot \sqrt{\operatorname{Tr}\big(\Sigma^{-1}\Sigma_*\Sigma^{-1}\big)} \cdot w(B) \tag{16} \]

where B denotes the unit ball of the norm ‖·‖, ρ = sup_{v∈B} ‖v‖₂, and τ = ‖Σ⁻¹Σ∗^{1/2}‖_F / ‖Σ⁻¹Σ∗^{1/2}‖₂.

Estimation Error of the GDS: Building on Corollary 1, Lemma 2, and Lemma 5, the theorem below characterizes the estimation error of the GDS for the multi-response linear model.

Theorem 1 Under the setting of Lemma 5, if n ≥ C₁κ₀²κ⁴ · (μmax/μmin) · (w(A(θ∗)) + 3)², and γ_n is set to C₂κ √( μmax Tr(Σ⁻¹Σ∗Σ⁻¹) / n ) · w(B), the estimation error of θ̂ given by (11) satisfies

\[ \big\|\hat{\theta} - \theta^*\big\|_2 \;\le\; C\kappa \sqrt{\frac{\mu_{\max}}{\mu_{\min}^2}} \cdot \frac{\sqrt{\operatorname{Tr}\big(\Sigma^{-1}\Sigma_*\Sigma^{-1}\big)}}{\operatorname{Tr}\big(\Sigma^{-1}\big)} \cdot \frac{\Psi(\theta^*)\cdot w(B)}{\sqrt{n}} \tag{17} \]

with probability at least 1 − m exp(−C₃n/κ⁴) − exp(−nτ²/2) − C₄ exp(−C₅²w²(B)/(4ρ²)).

Remark: We can see from the theorem above that the noise covariance Σ input to the GDS plays a role in the error bound through the multiplicative factor ξ(Σ) = √(Tr(Σ⁻¹Σ∗Σ⁻¹)) / Tr(Σ⁻¹). By taking the derivative of ξ²(Σ) w.r.t. Σ⁻¹ and setting it to zero, we have

\[ \frac{\partial \xi^2(\Sigma)}{\partial \Sigma^{-1}} \;=\; \frac{2\operatorname{Tr}^2\big(\Sigma^{-1}\big)\,\Sigma_*\Sigma^{-1} \;-\; 2\operatorname{Tr}\big(\Sigma^{-1}\big)\operatorname{Tr}\big(\Sigma^{-1}\Sigma_*\Sigma^{-1}\big)\cdot I_{m\times m}}{\operatorname{Tr}^4\big(\Sigma^{-1}\big)} \;=\; 0 \tag{18} \]

One can verify that Σ = Σ∗ is the solution to the equation above, and thus is the minimizer of ξ(Σ), with ξ(Σ∗) = 1/√(Tr(Σ∗⁻¹)). This calculation confirms that multi-response regression can benefit from taking the noise covariance into account, and the best performance is achieved when Σ∗ is known. If we perform the ordinary GDS by setting Σ = I_{m×m}, then ξ(Σ) = 1/√m (recall Diag(Σ∗) = I, so Tr(Σ∗) = m). Therefore using Σ∗ reduces the error by a factor of √(m / Tr(Σ∗⁻¹)), compared with the ordinary GDS.

One simple structure of θ∗ to consider for Theorem 1 is the sparsity encoded by the L1 norm. Given s-
Given s-\nsparse \u03b8\u2217, it follows from previous results [31, 11] that \u03a8(\u03b8\u2217) = O(\n\u221a\ns log p)\nand w(B) = O(\n\n\u221a\ns), w(A(\u03b8\u2217)) = O(\nlog p). Therefore if n \u2265 O(s log p), then with high probability we have\n\nm/ Tr(\u03a3\u22121\u2217 ), compared with ordinary GDS.\n\n(cid:113)\n\n\u221a\n\n\u2202\u03be2(\u03a3)\n\u2202\u03a3\u22121 =\n\n(cid:32)\n\n(cid:114)\n\n(cid:33)\n\n(cid:107) \u02c6\u03b8 \u2212 \u03b8\u2217(cid:107)2 \u2264 O\n\n\u03be(\u03a3) \u00b7\n\ns log p\n\nn\n\n(18)\n\n(19)\n\nImplications for Simple Linear Models: Our general result in multi-response scenario implies\nsome existing results for simple linear models. If we set n = 1 and \u03a3 = \u03a3\u2217 = Im\u00d7m, i.e., only one\ndata point (X, y) is observed and the noise is independent for each response, the GDS is reduced to\n\n\u02c6\u03b8sg = argmin\n\u03b8\u2208Rp\n\n(cid:107)\u03b8(cid:107)\n\ns.t.\n\n(cid:13)(cid:13)XT (X\u03b8 \u2212 y)(cid:13)(cid:13)\u2217 \u2264 \u03b3 ,\n\nwhich exactly matches that in [12]. To bound its estimation error, we need X to be more structured\nbeyond the sub-Gaussianity. Essentially we consider the model of X in Lemma 1, where rows of \u02dcX\nare additionally assumed to be identical. For such X, a specialized RE condition is as follows.\n\nLemma 6 Assume X is de\ufb01ned as in Lemma 1 such that X = \u039e 1\ni.i.d. with |||\u02dcxj||| \u2264 \u02dc\u03ba. 
If mn ≥ C₁κ₀²κ̃⁴ · (λmax(Ξ)λmax(Λ)) / (λmin(Ξ)λmin(Λ)) · (w(A) + 3)², then with probability at least 1 − exp(−C₂mn/κ̃⁴), the following inequality is satisfied by all v ∈ A ⊆ S^{p−1},

\[ v^T \hat{\Gamma} v \;\ge\; \frac{m}{2} \cdot \lambda_{\min}\!\Big(\Xi^{\frac{1}{2}}\Sigma^{-1}\Xi^{\frac{1}{2}}\Big) \cdot \lambda_{\min}(\Lambda) \tag{20} \]

Remark: Lemma 6 characterizes the RE condition for a class of specifically structured designs X. If we specialize the general RE condition in Corollary 1 to this setting, X = Ξ^{1/2} X̃ Λ^{1/2}, it becomes

\[ n \;\ge\; C_1\kappa_0^2\tilde{\kappa}^4\, \frac{\lambda_{\max}(\Xi)\lambda_{\max}(\Lambda)}{\lambda_{\min}(\Xi)\lambda_{\min}(\Lambda)}\, \big(w(A)+3\big)^2 \ \xRightarrow[\text{w.p. } 1 - m\exp(-C_2 n/\tilde{\kappa}^4)]{} \ v^T \hat{\Gamma} v \;\ge\; \frac{\lambda_{\min}(\Xi)\lambda_{\min}(\Lambda)}{2}\, \operatorname{Tr}\big(\Sigma^{-1}\big) \]

Comparing the general result above with Lemma 6, there are two striking differences. First, Lemma 6 requires the same lower bound on the total sample size mn rather than on n, which improves upon the general result. Second, (20) holds with a much higher probability, 1 − exp(−C₂mn/κ̃⁴) instead of 1 − m exp(−C₂n/κ̃⁴).

Given this specialized RE condition, we have the recovery guarantees of the GDS for simple linear models, which encompass the settings discussed in [6, 12] as special cases.

Corollary 2 Suppose y = Xθ∗ + η ∈ R^m, where X is as described in Lemma 6, and η ∼ N(0, I). With probability at least 1 − exp(−m/2) − C₂ exp(−C₁²w²(B)/(4ρ²)) − exp(−C₃m/κ̃⁴), θ̂_sg satisfies

\[ \big\|\hat{\theta}_{\mathrm{sg}} - \theta^*\big\|_2 \;\le\; C\tilde{\kappa} \cdot \sqrt{\frac{\lambda_{\max}(\Xi)\lambda_{\max}(\Lambda)}{\lambda_{\min}^2(\Xi)\lambda_{\min}^2(\Lambda)}} \cdot \frac{\Psi(\theta^*)\cdot w(B)}{\sqrt{m}} \tag{21} \]

3.2 Estimation of Noise Covariance

In this subsection, we consider the estimation of the noise covariance Σ∗ given an arbitrary parameter vector θ. 
When m is small, we estimate Σ∗ simply by using the sample covariance

\[ \hat{\Sigma} \;=\; \frac{1}{n} \sum_{i=1}^{n} (y_i - X_i\theta)(y_i - X_i\theta)^T \tag{22} \]

Theorem 2 reveals the relation between Σ̂ and Σ∗, which is sufficient for our AltEst analysis.

Theorem 2 If n ≥ C⁴m · max{ κ₀⁴, κ⁴ ( λmax(Σ∗)μmax / (λmin(Σ∗)μmin) )² } and Xi is sub-Gaussian, then with probability at least 1 − 2 exp(−C₁m), Σ̂ given by (22) satisfies

\[ \lambda_{\max}\Big(\Sigma_*^{-\frac{1}{2}}\hat{\Sigma}\,\Sigma_*^{-\frac{1}{2}}\Big) \;\le\; 1 + C^2\kappa^2\sqrt{m/n} + (\kappa_0 + \kappa)^2 \cdot \frac{2\mu_{\max}}{\lambda_{\min}(\Sigma_*)} \cdot \big\|\theta^* - \theta\big\|_2^2 \tag{23} \]

and

\[ \lambda_{\min}\Big(\Sigma_*^{-\frac{1}{2}}\hat{\Sigma}\,\Sigma_*^{-\frac{1}{2}}\Big) \;\ge\; 1 - C^2\kappa^2\sqrt{m/n} \tag{24} \]

Remark: If Σ̂ = Σ∗, then λmax(Σ∗^{-1/2}Σ̂Σ∗^{-1/2}) = λmin(Σ∗^{-1/2}Σ̂Σ∗^{-1/2}) = 1. Hence Σ̂ is nearly equal to Σ∗ when the upper and lower bounds (23) and (24) are close to 1. We would like to point out that there is nothing specific to the particular form of the estimator (22) which makes AltEst work. 
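The spectral closeness of Σ̂ to Σ∗ asserted by Theorem 2 is easy to observe numerically. The sketch below (our own illustration, not from the paper) computes the eigenvalues of the whitened sample covariance at θ = θ∗; note that L⁻¹Σ̂L⁻ᵀ, with L the Cholesky factor of Σ∗, has the same spectrum as Σ∗^{-1/2}Σ̂Σ∗^{-1/2}, since both are similar to Σ∗⁻¹Σ̂:

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, n = 5, 10, 2000
theta_star = rng.standard_normal(p)

S = rng.standard_normal((m, m))
S = S @ S.T + 0.1 * np.eye(m)
d = np.sqrt(np.diag(S))
Sigma_star = S / np.outer(d, d)             # unit-diagonal noise covariance

L = np.linalg.cholesky(Sigma_star)
X = rng.standard_normal((n, m, p))
y = X @ theta_star + rng.standard_normal((n, m)) @ L.T

def whitened_spectrum(theta):
    # sample covariance (22) of the residuals at a given theta,
    # whitened by Sigma_star; spectrum matches Sigma_star^{-1/2} Shat Sigma_star^{-1/2}
    resid = y - X @ theta
    Sigma_hat = resid.T @ resid / n
    Li = np.linalg.inv(L)
    return np.linalg.eigvalsh(Li @ Sigma_hat @ Li.T)

ev = whitened_spectrum(theta_star)          # residuals are exactly the noise
```

With n = 2000 and m = 5, the eigenvalues cluster tightly around one, consistent with the √(m/n) deviation terms; plugging in a θ far from θ∗ inflates the largest eigenvalue, which is the role of the ‖θ∗ − θ‖₂ term.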
Similar results\ncan be obtained for other methods that estimate the inverse covariance matrix \u03a3\u22121\u2217\ninstead of \u03a3\u2217.\nFor instance, when m < n and \u03a3\u22121\u2217\nis sparse, we can replace (22) with GLasso [16] or CLIME [9],\nand AltEst only requires the counterparts of (23) and (24) in order to work.\n\n) = \u03bbmin(\u03a3\n\n\u02c6\u03a3\u03a3\n\n\u2212 1\n2\u2217\n\n\u2212 1\n2\u2217\n\n3.3 Error Bound for Alternating Estimation\n\nSection 3.1 shows that the noise covariance in GDS affects the error bound by the factor \u03be(\u03a3). In\norder to bound the error of \u02c6\u03b8T given by AltEst, we need to further quantify how \u03b8 affects \u03be( \u02c6\u03a3).\n\nLemma 7 If \u02c6\u03a3 is given as (22) and the condition in Theorem 2 holds, then the inequality below\nholds with probability at least 1 \u2212 2 exp(\u2212C1m),\n\n(cid:16) m\n\n(cid:17) 1\n\n4\n\nn\n\n(cid:114) \u00b5max\n\n\u03bbmin (\u03a3\u2217)\n\n1 + 2C\u03ba0\n\n+ 2\n\n(cid:19)\n\n(cid:107)\u03b8\u2217 \u2212 \u03b8(cid:107)2\n\n(25)\n\nBased on Lemma 7, the following theorem provides the error bound for \u02c6\u03b8T given by Algorithm 1.\n\n(cid:18)\n\n\u03be\n\n(cid:17) \u2264 \u03be (\u03a3\u2217) \u00b7\n(cid:16) \u02c6\u03a3\n(cid:113) \u00b5max\n(cid:113) \u03bbmin(\u03a3\u2217)\n(cid:13)(cid:13)(cid:13) \u02c6\u03b8T \u2212 \u03b8\u2217(cid:13)(cid:13)(cid:13)2\n\nmax(\u03a3\u2217)\n\u03bb2\n\n\u00b52\n\nmin\n\n(cid:18)\n\n\u2264 emin +\n\n2eorc\n\n(cid:40)\n\n(cid:16)\n\nTheorem 3 Let eorc = C1\u03ba\n\n\u03be(\u03a3\u2217)\u00b7\u03a8(\u03b8\u2217)w(B)\n\nmax\n\n4\n\n\u03ba0 + C1\nC2\n\n\u03a8(\u03b8\u2217)w(B)\n\nm\n\nand also satis\ufb01es the condition in Theorem 1, with high probability, the iterate \u02c6\u03b8T returned by\nAlgorithm 1 satis\ufb01es\n\n. 
If n \u2265 C 4m\u00b7\n\n(cid:19)2(cid:41)\n\n\u00b7 \u03be(\u03a3\u2217)\u03a8(\u03b8\u2217)w(B)\nm\u00b7\u03bbmin(\u03a3\u2217)\n\n1+2C\u03ba0( m\nn )\n\n1\n4\n\n(cid:113) \u00b5max\n\u03bbmin (\u03a3\u2217 )\n\u221a\n\n\u221a\n\n(cid:17)4\n\nn\n\n2C1\u03ba\u00b5max\n\n(cid:18)\n\n1\u22122eorc\n\n\u03bbmin(\u03a3\u2217)\u00b5min\n\nand emin = eorc\u00b7\n\n, \u03ba4(cid:16) \u03bbmax(\u03a3\u2217)\u00b5max\n(cid:114) \u00b5max\n\n(cid:17)2\n(cid:19)T\u22121 \u00b7(cid:16)(cid:13)(cid:13)(cid:13) \u02c6\u03b81 \u2212 \u03b8\u2217(cid:13)(cid:13)(cid:13)2\n\n\u03bbmin (\u03a3\u2217)\n\nC2\u00b5min\n\n,\n\n(cid:17)\n\n\u2212 emin\n\n(26)\n\nRemark: The three lower bounds for n inside curly braces correspond to three intuitive requirements.\nThe \ufb01rst one guarantees that the covariance estimation is accurate enough, and the other two respec-\ntively ensure that the initial error of \u02c6\u03b81 and eorc are reasonably small , such that the subsequent errors\ncan contract linearly. eorc is the estimation error incurred by the following oracle estimator,\n\n\u02c6\u03b8orc = argmin\n\u03b8\u2208Rp\n\n(cid:107)\u03b8(cid:107)\n\ns.t.\n\ni \u03a3\u22121\u2217 (Xi\u03b8 \u2212 yi)\nXT\n\n\u2264 \u03b3n ,\n\n(27)\n\n(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) 1\n\nn\n\nn(cid:88)\n\ni=1\n\n(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)\u2217\n\nwhich is impossible to implement in practice. On the other hand, emin is the minimum achievable error,\nwhich has an extra multiplicative factor compared with eorc. The numerator of the factor compensates\n\n7\n\n\ffor the error of estimated noise covariance provided that \u03b8 = \u03b8\u2217 is plugged in (22), which merely\ndepends on sample size. Since having \u03b8 = \u03b8\u2217 is also unrealistic for (22), the denominator further\naccounts for the ballpark difference between \u03b8 and \u03b8\u2217. 
As we remark after Theorem 1, if we perform ordinary GDS with Σ set to Im×m in (11), its error bound eodn satisfies eodn = eorc·√(Tr(Σ∗⁻¹)/m). Note that this factor √(Tr(Σ∗⁻¹)/m) is independent of n, whereas emin will approach eorc with increasing n, as the factor between them converges to one.

4 Experiments

In this section, we present some experimental results to support our theoretical analysis. Specifically, we focus on the sparse structure of θ∗ captured by the L1 norm. Throughout the experiments, we fix the problem dimension p = 500, the sparsity level of θ∗ s = 20, and the number of AltEst iterations T = 5. The entries of each design matrix Xi are generated as i.i.d. standard Gaussians, and

$$
\theta^* = [\underbrace{1, \ldots, 1}_{10}, \underbrace{-1, \ldots, -1}_{10}, \underbrace{0, \ldots, 0}_{480}]^T.
$$

Σ∗ is given as a block-diagonal matrix with the 2×2 block

$$
\Sigma' = \begin{bmatrix} 1 & a \\ a & 1 \end{bmatrix}
$$

replicated along the diagonal, and the number of responses m is assumed to be even. All plots are obtained by averaging 100 trials. In the first set of experiments, we set a = 0.8, m = 10, and investigate the error of θ̂t as n varies from 40 to 90. We run AltEst (with and without resampling), the oracle GDS, and the ordinary GDS with Σ = I. The results are given in Figure 1. For the second experiment, we fix the product mn ≈ 500 and let m = 2, 4, . . . , 10. For our choice of Σ∗, the error eorc incurred by the oracle GDS is the same for every m. We compare AltEst with both the oracle and ordinary GDS, and the results are shown in Figures 2(a) and 2(b). In the third experiment, we test AltEst under different covariance matrices Σ∗ by varying a from 0.5 to 0.9; m is set to 10 and the sample size n is 90.
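For reproducibility, the synthetic setup above can be sketched as follows (a minimal NumPy version; the generator name, seeding, and array layout are our choices). Note that for this block-diagonal Σ∗ we have Tr(Σ∗⁻¹)/m = 1/(1 − a²), so the factor √(Tr(Σ∗⁻¹)/m) separating eodn from eorc equals 1/√(1 − a²) ≈ 1.67 at a = 0.8:

```python
import numpy as np

def make_synthetic(n, m=10, p=500, a=0.8, seed=0):
    """Data following the experimental setup: theta* has 10 entries equal to +1,
    10 equal to -1, and p - 20 zeros; Sigma* is block-diagonal with 2x2 blocks
    [[1, a], [a, 1]] (m assumed even), so Diag(Sigma*) = I."""
    rng = np.random.default_rng(seed)
    theta_star = np.concatenate([np.ones(10), -np.ones(10), np.zeros(p - 20)])
    Sigma_star = np.kron(np.eye(m // 2), np.array([[1.0, a], [a, 1.0]]))
    X = rng.standard_normal((n, m, p))                  # i.i.d. Gaussian designs X_i
    eta = rng.multivariate_normal(np.zeros(m), Sigma_star, size=n)
    y = np.einsum('nmp,p->nm', X, theta_star) + eta     # y_i = X_i theta* + eta_i
    return X, y, theta_star, Sigma_star
```

Any of the estimators compared below can then be run on the returned (X, y) triples.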
We also compare AltEst against both the oracle and ordinary GDS, and the errors are reported in Figures 2(c) and 2(d).

(a) Error for AltEst  (b) Error for Resampled AltEst  (c) Comparison of Estimators

Figure 1: (a) When n = 40, AltEst is not quite stable due to the large initial error and the poor quality of the estimated covariance. The errors start to decrease for n ≥ 50. (b) Resampled AltEst does benefit from fresh samples, and its error is slightly smaller than AltEst's as well as more stable when n is small. (c) The oracle GDS outperforms the others, but the performance of AltEst is also competitive. The ordinary GDS is unable to utilize the noise correlation, thus resulting in relatively large error. Comparing the two implementations of AltEst, we can see that resampled AltEst yields smaller error especially when data is inadequate, but their errors are very close if n is suitably large.

(a) AltEst (for m)  (b) Comparison (for m)  (c) AltEst (for a)  (d) Comparison (for a)

Figure 2: (a) Larger error comes with bigger m, which confirms that emin increases with m when mn is fixed. (b) The plots for the oracle and ordinary GDS imply that eorc and eodn remain unchanged, which matches the error bounds in Theorem 1. Though emin increases, AltEst still outperforms the ordinary GDS by a margin. (c) The error goes down when the true noise covariance becomes closer to singular, which is expected in view of Theorem 3. (d) eorc also decreases as a gets larger, and the gap between emin and eodn widens. The definition of emin in Theorem 3 indicates that the ratio between emin and eorc is almost a constant because both n and m are fixed. Here we observe that all the ratios at different a are between 1.05 and 1.1, which supports the theoretical results.
Also, Theorem 1 suggests that eodn does not change as Σ∗ varies, which is verified here.

5 Conclusions

In this paper, we propose an alternating estimation (AltEst) procedure for solving multi-response linear models in high dimensions. Our framework is based on the generalized Dantzig selector (GDS) and allows for general structures of the parameter vector, with recovery guarantees determined by a few geometric measures. Also, by leveraging the noise correlation among responses, AltEst can achieve significantly smaller estimation error than estimators that ignore the noise structure.
With moderate sample size and the resampling assumption, we show that the estimation error converges linearly to a minimum achievable error, which is comparable to the one incurred by the oracle estimator. In the experiments, we demonstrate the numerical superiority of AltEst over the vanilla GDS, and the results also suggest that the resampled version of AltEst gives little benefit in practice, so it is better to use all the data in every iteration.

Acknowledgements
The research was supported by NSF grants IIS-1563950, IIS-1447566, IIS-1447574, IIS-1422557, CCF-1451986, CNS-1314560, IIS-0953274, IIS-1029711, NASA grant NNX12AQ39A, and gifts from Adobe, IBM, and Yahoo.

References
[1] A. Agarwal, A. Anandkumar, P. Jain, P. Netrapalli, and R. Tandon. Learning sparsely used overcomplete dictionaries via alternating minimization. CoRR, abs/1310.7991, 2013.

[2] T. W. Anderson. An introduction to multivariate statistical analysis. 2003.

[3] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

[4] A. Argyriou, R. Foygel, and N. Srebro. Sparse prediction with the k-support norm. In NIPS, 2012.

[5] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. Optimization for Machine Learning, 5, 2011.

[6] A. Banerjee, S. Chen, F. Fazayeli, and V. Sivakumar. Estimation with norm regularization. In Advances in Neural Information Processing Systems (NIPS), 2014.

[7] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

[8] L. Breiman and J. H. Friedman. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(1):3–54, 1997.

[9] T. T. Cai, W. Liu, and X. Luo.
A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607, 2011.

[10] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.

[11] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805–849, 2012.

[12] S. Chatterjee, S. Chen, and A. Banerjee. Generalized Dantzig selector: Application to the k-support norm. In Advances in Neural Information Processing Systems (NIPS), 2014.

[13] S. Chen and A. Banerjee. Structured estimation with atomic norms: General bounds and applications. In NIPS, pages 2908–2916, 2015.

[14] S. Chen and A. Banerjee. Structured matrix recovery via the generalized Dantzig selector. In Advances in Neural Information Processing Systems (NIPS), 2016.

[15] T. Evgeniou and M. Pontil. Regularized multi-task learning. In KDD, pages 109–117, 2004.

[16] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

[17] A. R. Goncalves, P. Das, S. Chatterjee, V. Sivakumar, F. J. Von Zuben, and A. Banerjee. Multi-task sparse structure learning. In CIKM, pages 451–460, 2014.

[18] Y. Gordon. Some inequalities for Gaussian processes and applications. Israel Journal of Mathematics, 50(4):265–289, 1985.

[19] W. H. Greene. Econometric Analysis. Prentice Hall, 7th edition, 2011.

[20] A. J. Izenman. Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, 5(2):248–264, 1975.

[21] A. J. Izenman. Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer, 2008.

[22] L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In ICML, 2009.

[23] P.
Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. In STOC, pages 665–674, 2013.

[24] P. Jain and A. Tewari. Alternating minimization for regression problems with vector-valued outputs. In Advances in Neural Information Processing Systems (NIPS), pages 1126–1134, 2015.

[25] P. Jain, A. Tewari, and P. Kar. On iterative hard thresholding methods for high-dimensional M-estimation. In NIPS, pages 685–693, 2014.

[26] A. Jalali, S. Sanghavi, C. Ruan, and P. K. Ravikumar. A dirty model for multi-task learning. In Advances in Neural Information Processing Systems (NIPS), pages 964–972, 2010.

[27] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. J. Mach. Learn. Res., 12:2297–2334, 2011.

[28] S. Kim and E. P. Xing. Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping. Ann. Appl. Stat., 6(3):1095–1117, 2012.

[29] W. Lee and Y. Liu. Simultaneous multiple response regression and inverse covariance matrix estimation via penalized Gaussian maximum likelihood. J. Multivar. Anal., 111:241–255, 2012.

[30] H. Liu, M. Palatucci, and J. Zhang. Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery. In ICML, pages 649–656, 2009.

[31] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for the analysis of regularized M-estimators. Statistical Science, 27(4):538–557, 2012.

[32] P. Netrapalli, P. Jain, and S. Sanghavi. Phase retrieval using alternating minimization. In NIPS, 2013.

[33] P. Rai, A. Kumar, and H. Daume. Simultaneously leveraging output and task structures for multiple-output regression. In NIPS, pages 3185–3193, 2012.

[34] N. Rao, B. Recht, and R. Nowak.
Universal measurement bounds for structured sparse signal recovery. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

[35] A. J. Rothman, E. Levina, and J. Zhu. Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics, 19(4):947–962, 2010.

[36] K.-A. Sohn and S. Kim. Joint estimation of structured sparsity and output structure in multiple-output regression via inverse-covariance regularization. In AISTATS, pages 1081–1089, 2012.

[37] R. Sun and Z.-Q. Luo. Guaranteed matrix completion via nonconvex factorization. In FOCS, 2015.

[38] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

[39] J. A. Tropp. Convex recovery of a structured signal from independent random linear measurements, pages 67–101. Springer International Publishing, 2015.

[40] M. Wytock and Z. Kolter. Sparse Gaussian conditional random fields: Algorithms, theory, and application to energy forecasting. In International Conference on Machine Learning, pages 1265–1273, 2013.

[41] X. Yi, C. Caramanis, and S. Sanghavi. Alternating minimization for mixed linear regression. In ICML, pages 613–621, 2014.

[42] X.-T. Yuan and T. Zhang. Partial Gaussian graphical model estimation. IEEE Transactions on Information Theory, 60:1673–1687, 2014.