{"title": "Unified View of Matrix Completion under General Structural Constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 1180, "page_last": 1188, "abstract": "Matrix completion problems have been widely studied under special low dimensional structures such as low rank or structure induced by decomposable norms. In this paper, we present a unified analysis of matrix completion under general low-dimensional structural constraints induced by {\\em any} norm regularization.We consider two estimators for the general problem of structured matrix completion, and provide unified upper bounds on the sample complexity and the estimation error. Our analysis relies on generic chaining, and we establish two intermediate results of independent interest: (a) in characterizing the size or complexity of low dimensional subsets in high dimensional ambient space, a certain \\textit{\\modified}~complexity measure encountered in the analysis of matrix completion problems is characterized in terms of a well understood complexity measure of Gaussian widths, and (b) it is shown that a form of restricted strong convexity holds for matrix completion problems under general norm regularization. Further, we provide several non-trivial examples of structures included in our framework, notably including the recently proposed spectral $k$-support norm.", "full_text": "Uni\ufb01ed View of Matrix Completion under General\n\nStructural Constraints\n\nSuriya Gunasekar\nUT at Austin, USA\n\nsuriya@utexas.edu\n\nArindam Banerjee\n\nUMN Twin Cities, USA\n\nbanerjee@cs.umn.edu\n\nJoydeep Ghosh\nUT at Austin, USA\n\nghosh@ece.utexas.edu\n\nAbstract\n\nIn this paper, we present a uni\ufb01ed analysis of matrix completion under general\nlow-dimensional structural constraints induced by any norm regularization. 
We consider two estimators for the general problem of structured matrix completion, and provide unified upper bounds on the sample complexity and the estimation error. Our analysis relies on results from generic chaining, and we establish two intermediate results of independent interest: (a) in characterizing the size or complexity of low dimensional subsets in a high dimensional ambient space, a certain partial complexity measure encountered in the analysis of matrix completion problems is characterized in terms of a well understood complexity measure of Gaussian widths, and (b) it is shown that a form of restricted strong convexity holds for matrix completion problems under general norm regularization. Further, we provide several non-trivial examples of structures included in our framework, notably the recently proposed spectral k-support norm.

1 Introduction

The task of completing the missing entries of a matrix from an incomplete subset of (potentially noisy) entries is encountered in many applications, including recommendation systems, data imputation, covariance matrix estimation, and sensor localization, among others. Traditionally ill-posed high dimensional estimation problems, where the number of parameters to be estimated is much higher than the number of observations, have been extensively studied in the recent literature. However, matrix completion problems are particularly ill-posed: not only are the observations limited (the high dimensional regime), but the measurements are also extremely localized, i.e., the observations consist of individual matrix entries. The localized measurement model, in contrast to random Gaussian or sub-Gaussian measurements, poses additional complications in high dimensional estimation.
For well-posed estimation in high dimensional problems, including matrix completion, it is imperative that low dimensional structural constraints are imposed on the target.
For matrix completion, the special case of the low-rank constraint has been widely studied. Several existing works propose tractable estimators with near-optimal recovery guarantees for (approximately) low-rank matrix completion [8, 7, 28, 26, 18, 19, 22, 11, 20, 21]. A recent work [16] addresses the extension to structures with decomposable norm regularization. However, the scope of matrix completion extends to low dimensional structures far beyond simple low-rankness or decomposable norm structures.
In this paper, we present a unified statistical analysis of matrix completion under a general set of low dimensional structures that are induced by any suitable norm regularization. We provide statistical analysis of two generalized matrix completion estimators: the constrained norm minimizer, and the generalized matrix Dantzig selector (Section 2.2). The main results in the paper (Theorems 1a-1b) provide unified upper bounds on the sample complexity and estimation error of these estimators for matrix completion under any norm regularization. Existing results on matrix completion with low rank or other decomposable structures can be obtained as special cases of our general results.

Our unified analysis of sample complexity is motivated by recent work on high dimensional estimation using global (sub-)Gaussian measurements [10, 1, 35, 3, 37, 5]. A key ingredient in the recovery analysis of high dimensional estimation involves establishing a certain variation of the Restricted Isometry Property (RIP) [9] for the measurement operator. It has been shown that such properties are satisfied by Gaussian and sub-Gaussian measurement operators with high probability. Unfortunately, as has been noted before by Candes et al.
[8], owing to highly localized measurements, such conditions are not satisfied in the matrix completion problem, and the existing results based on global (sub-)Gaussian measurements are not directly applicable. In fact, a key question we consider is: given the radically limited measurement model in matrix completion, by how much does the sample complexity of estimation increase beyond the known sample complexity bounds for global (sub-)Gaussian measurements? Our results upper bound the sample complexity for matrix completion to within a log d factor over that for estimation under global (sub-)Gaussian measurements [10, 3, 5]. While the result was previously known for low rank matrix completion using nuclear norm minimization [26, 20], with a careful use of generic chaining, we show that the log d factor suffices for structures induced by any norm! As a key intermediate result, we show that a useful form of restricted strong convexity (RSC) [27] holds for the localized measurements encountered in matrix completion under general norm regularized structures. The result substantially generalizes existing RSC results for matrix completion under the special cases of nuclear norm and decomposable norm regularization [26, 16].
For our analysis, we use tools from generic chaining [33] to characterize the main results (Theorems 1a-1b) in terms of the Gaussian width (Definition 1) of certain error sets. Gaussian widths provide a powerful geometric characterization for quantifying the complexity of a structured low dimensional subset in a high dimensional ambient space. Numerous tools have been developed in the literature for bounding the Gaussian width of structured sets.
A unified characterization of results in terms of Gaussian width has the advantage that this literature can be readily leveraged to derive new recovery guarantees for matrix completion under suitable structural constraints (Appendix D.2).
In addition to the theoretical elegance of such a unified framework, identifying useful but potentially non-decomposable low dimensional structures is of significant practical interest. The broad class of structures enforced through symmetric convex bodies and symmetric atomic sets [10] can be analyzed under this paradigm (Section 2.1). Such specialized structures can capture the constraints in certain applications better than simple low-rankness. In particular, we discuss in detail a non-trivial example: the spectral k-support norm introduced by McDonald et al. [25].
To summarize the key contributions of the paper:
• Theorems 1a-1b provide unified upper bounds on the sample complexity and estimation error for matrix completion estimators using general norm regularization: a substantial generalization of the existing results on matrix completion under structural constraints.
• Theorem 1a is applied to derive statistical results for the special case of matrix completion under spectral k-support norm regularization.
• An intermediate result, Theorem 5, shows that under any norm regularization, a variant of Restricted Strong Convexity (RSC) holds in the matrix completion setting with extremely localized measurements. Further, a certain partial measure of the complexity of a set is encountered in matrix completion analysis (12).
Another intermediate result, Theorem 2, provides bounds on the partial complexity measures in terms of the better understood complexity measure of Gaussian width. These intermediate results are of independent interest beyond the scope of the paper.

Notations and Preliminaries

Indexes $i, j$ are typically used to index rows and columns, respectively, of matrices, and index $k$ is used to index the observations. $e_i, e_j, e_k$, etc. denote the standard basis in the appropriate dimensions (for brevity we omit the explicit dependence on dimension unless necessary). The notation $G$ and $g$ is used to denote a matrix and a vector, respectively, of independent standard Gaussian random variables. $\mathbb{P}(\cdot)$ and $\mathbb{E}(\cdot)$ denote the probability of an event and the expectation of a random variable, respectively. Given an integer $N$, let $[N] = \{1, 2, \ldots, N\}$. The Euclidean norm in a vector space is denoted $\|x\|_2 = \sqrt{\langle x, x \rangle}$. For a matrix $X$ with singular values $\sigma_1 \ge \sigma_2 \ge \ldots$, common norms include the Frobenius norm $\|X\|_F = \sqrt{\sum_i \sigma_i^2}$, the nuclear norm $\|X\|_* = \sum_i \sigma_i$, the spectral norm $\|X\|_{op} = \sigma_1$, and the maximum norm $\|X\|_\infty = \max_{ij} |X_{ij}|$. Also let $S^{d_1 d_2 - 1} = \{X \in \mathbb{R}^{d_1 \times d_2} : \|X\|_F = 1\}$ and $B^{d_1 d_2} = \{X \in \mathbb{R}^{d_1 \times d_2} : \|X\|_F \le 1\}$. Finally, given a norm $\|\cdot\|$ defined on a vector space $V$, its dual norm is given by $\|X\|^* = \sup_{\|Y\| \le 1} \langle X, Y \rangle$.

Definition 1 (Gaussian Width). The Gaussian width of a set $S \subset \mathbb{R}^{d_1 \times d_2}$ is a widely studied measure of the complexity of a subset in a high dimensional ambient space and is given by:

$w_G(S) = \mathbb{E}_G \sup_{X \in S} \langle X, G \rangle$,   (1)

where recall that $G$ is a matrix of independent standard Gaussian random variables.
Some key results on Gaussian width are discussed in Appendix D.2.

Definition 2 (Sub-Gaussian Random Variable [36]). The sub-Gaussian norm of a random variable $X$ is given by $\|X\|_{\Psi_2} = \sup_{p \ge 1} p^{-1/2} (\mathbb{E}|X|^p)^{1/p}$. $X$ is $b$-sub-Gaussian if $\|X\|_{\Psi_2} \le b < \infty$. Equivalently, $X$ is sub-Gaussian if one of the following conditions is satisfied for some constants $k_1$, $k_2$, and $k_3$ [Lemma 5.5 of [36]]:
(1) $\forall p \ge 1$, $(\mathbb{E}|X|^p)^{1/p} \le b\sqrt{p}$;
(2) $\forall t > 0$, $\mathbb{P}(|X| > t) \le e^{1 - t^2/(k_1 b^2)}$;
(3) $\mathbb{E}[e^{k_2 X^2/b^2}] \le e$; or
(4) if $\mathbb{E}X = 0$, then $\forall s > 0$, $\mathbb{E}[e^{sX}] \le e^{k_3 s^2 b^2/2}$.

Definition 3 (Restricted Strong Convexity (RSC)). A function $\mathcal{L}$ is said to satisfy Restricted Strong Convexity (RSC) at $\Theta$ with respect to a subset $S$ if, for some RSC parameter $\kappa_{\mathcal{L}} > 0$,

$\forall \Delta \in S, \quad \mathcal{L}(\Theta + \Delta) - \mathcal{L}(\Theta) - \langle \nabla \mathcal{L}(\Theta), \Delta \rangle \ge \kappa_{\mathcal{L}} \|\Delta\|_F^2$.   (2)

Definition 4 (Spikiness Ratio [26]). For $X \in \mathbb{R}^{d_1 \times d_2}$, a measure of its "spikiness" is given by:

$\alpha_{sp}(X) = \frac{\sqrt{d_1 d_2}\, \|X\|_\infty}{\|X\|_F}$.   (3)

Definition 5 (Norm Compatibility Constant [27]). The compatibility constant of a norm $R : V \to \mathbb{R}$ under a closed convex cone $C \subset V$ is defined as follows:

$\Psi_R(C) = \sup_{X \in C \setminus \{0\}} \frac{R(X)}{\|X\|_F}$.   (4)

2 Structured Matrix Completion

Denote the ground truth target matrix as $\Theta^* \in \mathbb{R}^{d_1 \times d_2}$; let $d = d_1 + d_2$.
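As a concrete illustration of Definition 4 (not part of the original analysis), the spikiness ratio is straightforward to compute. The following pure-Python sketch evaluates $\alpha_{sp}$ and checks its extreme values: $\alpha_{sp} = 1$ for a perfectly flat matrix, and $\alpha_{sp} = \sqrt{d_1 d_2}$ when all the mass sits on a single entry.

```python
import math

def spikiness(X):
    """Spikiness ratio of Definition 4: sqrt(d1*d2) * ||X||_inf / ||X||_F."""
    d1, d2 = len(X), len(X[0])
    max_abs = max(abs(v) for row in X for v in row)
    fro = math.sqrt(sum(v * v for row in X for v in row))
    return math.sqrt(d1 * d2) * max_abs / fro

d1, d2 = 4, 5
flat = [[1.0] * d2 for _ in range(d1)]   # every entry equal: least spiky
spike = [[0.0] * d2 for _ in range(d1)]
spike[0][0] = 1.0                        # all mass on a single entry

print(spikiness(flat))    # 1.0 (the minimum possible value)
print(spikiness(spike))   # 4.472... = sqrt(20) (the maximum possible value)
```

The two extremes bracket every matrix: Assumption 1 below simply asks that $\Theta^*$ stay close to the flat end of this range.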
In the noisy matrix completion setting, the observations consist of individual entries of $\Theta^*$ observed through an additive noise channel.

Sub-Gaussian Noise: Given a list of independently sampled standard basis matrices $\Omega = \{E_k = e_{i_k} e_{j_k}^\top : i_k \in [d_1], j_k \in [d_2]\}$, with potential duplicates, the observations $(y_k)_k \in \mathbb{R}^{|\Omega|}$ are given by:

$y_k = \langle \Theta^*, E_k \rangle + \xi \eta_k$, for $k = 1, 2, \ldots, |\Omega|$,   (5)

where $\eta \in \mathbb{R}^{|\Omega|}$ is a noise vector of independent sub-Gaussian random variables with $\mathbb{E}[\eta_k] = 0$ and $\mathrm{Var}(\eta_k) = 1$, and $\xi^2$ is the scaled variance of the noise per observation. Further, let $\|\eta_k\|_{\Psi_2} \le b$ for a constant $b$ (recall $\|\cdot\|_{\Psi_2}$ from Definition 2). Also, without loss of generality, assume the normalization $\|\Theta^*\|_F = 1$.

Uniform Sampling: Assume that the entries in $\Omega$ are drawn independently and uniformly:

$E_k \sim \mathrm{uniform}\{e_i e_j^\top : i \in [d_1], j \in [d_2]\}$, for $E_k \in \Omega$.   (6)

Let $\{e_k\}$ be the standard basis of $\mathbb{R}^{|\Omega|}$. Given $\Omega$, define $P_\Omega : \mathbb{R}^{d_1 \times d_2} \to \mathbb{R}^{|\Omega|}$ as:

$P_\Omega(X) = \sum_{k=1}^{|\Omega|} \langle X, E_k \rangle e_k$.   (7)

Structural Constraints: For matrix completion with $|\Omega| < d_1 d_2$, low dimensional structural constraints on $\Theta^*$ are necessary for well-posedness. We consider a generalized constraint setting wherein, for some low-dimensional model space $\mathcal{M}$, $\Theta^* \in \mathcal{M}$ is enforced through a surrogate norm regularizer $R(\cdot)$. We make no further assumptions on $R$ other than that it is a norm on $\mathbb{R}^{d_1 \times d_2}$.

Low Spikiness: In matrix completion under the uniform sampling model, further restrictions on $\Theta^*$ (beyond the low dimensional structure) are required to ensure that the most informative entries of the matrix are observed with high probability [8].
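The sampling and observation model in (5)-(7) above can be simulated in a few lines; a minimal illustrative sketch (the function names are ours, not from the paper):

```python
import random

def sample_omega(d1, d2, n, rng):
    """Draw n index pairs i.i.d. uniformly with replacement, as in (6)."""
    return [(rng.randrange(d1), rng.randrange(d2)) for _ in range(n)]

def P_omega(X, omega):
    """P_Omega(X) from (7): stack the sampled entries of X into a vector."""
    return [X[i][j] for (i, j) in omega]

def observe(theta, omega, xi, rng):
    """Noisy observations y_k = <Theta*, E_k> + xi * eta_k, as in (5)."""
    return [t + xi * rng.gauss(0.0, 1.0) for t in P_omega(theta, omega)]

rng = random.Random(0)
theta = [[(i + 1) * (j + 1) / 10.0 for j in range(3)] for i in range(4)]
omega = sample_omega(4, 3, 6, rng)
y = observe(theta, omega, xi=0.0, rng=rng)   # noiseless channel
print(y == P_omega(theta, omega))            # True
```

With $\xi = 0$ the observations coincide with the sampled entries; duplicates in $\Omega$ are allowed, exactly as in the with-replacement model (6).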
Early work assumed stringent matrix incoherence conditions for low-rank completion to preclude such matrices [7, 18, 19], while more recent work [11, 26] relaxes these assumptions to a more intuitive restriction on the spikiness ratio, defined in (3). However, under this relaxation, only approximate recovery is typically guaranteed in the low-noise regime, as opposed to the near exact recovery possible under incoherence assumptions [26, 11].

Assumption 1 (Spikiness Ratio). There exists $\alpha^* > 0$ such that

$\|\Theta^*\|_\infty = \frac{\alpha_{sp}(\Theta^*)\, \|\Theta^*\|_F}{\sqrt{d_1 d_2}} \le \frac{\alpha^*}{\sqrt{d_1 d_2}}$.

2.1 Special Cases and Applications

We briefly introduce some interesting examples of structural constraints with practical applications.

Example 1 (Low Rank and Decomposable Norms). Low-rankness is the most common structure used in many matrix estimation problems, including collaborative filtering, PCA, spectral clustering, etc. Convex estimators using nuclear norm $\|\Theta\|_*$ regularization have been widely studied statistically [8, 7, 28, 26, 18, 19, 22, 11, 20, 21]. A recent work [16] extends the analysis of low rank matrix completion to general decomposable norms, i.e. $R$ such that $\forall X, Y \in (\mathcal{M}, \mathcal{M}^\perp)$, $R(X + Y) = R(X) + R(Y)$.

Example 2 (Spectral k-Support Norm). A non-trivial and significant example of norm regularization that is not decomposable is the spectral k-support norm, recently introduced by McDonald et al. [25]. The spectral k-support norm is essentially the vector k-support norm [2] applied to the singular values $\sigma(\Theta)$ of a matrix $\Theta \in \mathbb{R}^{d_1 \times d_2}$.
Without loss of generality, let $\bar{d} = d_1 = d_2$. Let $\mathcal{G}_k = \{g \subseteq [\bar{d}] : |g| \le k\}$ be the set of all subsets of $[\bar{d}]$ of cardinality at most $k$, and let $\mathcal{V}(\mathcal{G}_k) = \{(v_g)_{g \in \mathcal{G}_k} : v_g \in \mathbb{R}^{\bar{d}}, \mathrm{supp}(v_g) \subseteq g\}$. The spectral k-support norm is given by:

$\|\Theta\|_{k\text{-sp}} = \inf_{v \in \mathcal{V}(\mathcal{G}_k)} \Big\{ \sum_{g \in \mathcal{G}_k} \|v_g\|_2 : \sum_{g \in \mathcal{G}_k} v_g = \sigma(\Theta) \Big\}$.   (8)

McDonald et al. [25] showed that the spectral k-support norm is a special case of the cluster norm [17]. It was further shown that in multi-task learning, wherein the tasks (columns of $\Theta^*$) are assumed to be clustered into dense groups, the cluster norm provides a trade-off between the intra-cluster variance, the (inverse) inter-cluster variance, and the norm of the task vectors. Both [17] and [25] demonstrate superior empirical performance of cluster norms (and the k-support norm) over traditional trace norm and spectral elastic net minimization on benchmark matrix completion and multi-task learning datasets. However, statistical analysis of consistent matrix completion using spectral k-support norm regularization has not been previously studied. In Section 3.2, we discuss the consequence of our main theorem for this non-trivial special case.

Example 3 (Additive Decomposition). Elementwise sparsity is a common structure often assumed in high-dimensional estimation problems. However, in matrix completion, elementwise sparsity conflicts with Assumption 1 (and the more traditional incoherence assumptions). Indeed, it is easy to see that with high probability most of the $|\Omega| \ll d_1 d_2$ uniformly sampled observations will be zero, and an informed prediction is infeasible.
However, elementwise sparse structures can often be modelled within an additive decomposition framework, wherein $\Theta^* = \sum_k \Theta^{(k)}$, such that each component matrix $\Theta^{(k)}$ is in turn structured (e.g. low rank + sparse, as used for robust PCA [6]). In such structures, there is no scope for recovering sparse components outside the observed indices, and it is assumed that: $\Theta^{(k)}$ is sparse $\Rightarrow \mathrm{supp}(\Theta^{(k)}) \subseteq \Omega$. In such cases, our results are applicable under additional regularity assumptions that enforce non-spikiness on the superposed matrix. A candidate norm regularizer for such structures is the weighted infimal convolution of the individual structure inducing norms [6, 39]:

$R_w(\Theta) = \inf\Big\{ \sum_k w_k R_k(\Theta^{(k)}) : \sum_k \Theta^{(k)} = \Theta \Big\}$.

Example 4 (Other Applications). Other potential applications, including cut matrices [30, 10], structures induced by compact convex sets, norms inducing structured sparsity assumptions on the spectrum of $\Theta^*$, etc., can also be handled under the paradigm of this paper.

2.2 Structured Matrix Estimator

Let $R$ be the norm surrogate for the structural constraints on $\Theta^*$, and let $R^*$ denote its dual norm. We propose and analyze two convex estimators for the task of structured matrix completion:

Constrained Norm Minimizer:

$\widehat{\Theta}_{cn} = \mathop{\mathrm{argmin}}_{\|\Theta\|_\infty \le \alpha^*/\sqrt{d_1 d_2}} R(\Theta) \quad \text{s.t.} \quad \|P_\Omega(\Theta) - y\|_2 \le \lambda_{cn}$.   (9)

Generalized Matrix Dantzig Selector:

$\widehat{\Theta}_{ds} = \mathop{\mathrm{argmin}}_{\|\Theta\|_\infty \le \alpha^*/\sqrt{d_1 d_2}} R(\Theta) \quad \text{s.t.} \quad \frac{\sqrt{d_1 d_2}}{|\Omega|} R^*\big(P_\Omega^*(P_\Omega(\Theta) - y)\big) \le \lambda_{ds}$,   (10)

where recall that $P_\Omega^* : \mathbb{R}^{|\Omega|} \to \mathbb{R}^{d_1 \times d_2}$ is the linear adjoint of $P_\Omega$, i.e.
$\langle P_\Omega(X), y \rangle = \langle X, P_\Omega^*(y) \rangle$.
Note: Theorems 1a-1b give consistency results for (9) and (10), respectively, under certain conditions on the parameters $\lambda_{cn} > 0$, $\lambda_{ds} > 0$, and $\alpha^* > 1$. In particular, these conditions assume knowledge of the noise variance $\xi^2$ and the spikiness ratio $\alpha_{sp}(\Theta^*)$. In practice, $\xi$ and $\alpha_{sp}(\Theta^*)$ are typically unknown, and the parameters are tuned by validating on held out data.

3 Main Results

We define the following "restricted" error cone and its subset:

$T_R = T_R(\Theta^*) = \mathrm{cone}\{\Delta : R(\Theta^* + \Delta) \le R(\Theta^*)\}$, and $E_R = T_R \cap S^{d_1 d_2 - 1}$.   (11)

Let $\widehat{\Theta}_{cn}$ and $\widehat{\Theta}_{ds}$ be the estimates from (9) and (10), respectively. If $\lambda_{cn}$ and $\lambda_{ds}$ are chosen such that $\Theta^*$ belongs to the feasible sets in (9) and (10), respectively, then the error matrices $\widehat{\Delta}_{cn} = \widehat{\Theta}_{cn} - \Theta^*$ and $\widehat{\Delta}_{ds} = \widehat{\Theta}_{ds} - \Theta^*$ are contained in $T_R$.

Theorem 1a (Constrained Norm Minimizer). Under the problem setup in Section 2, let $\widehat{\Theta}_{cn} = \Theta^* + \widehat{\Delta}_{cn}$ be the estimate from (9) with $\lambda_{cn} = 2\xi\sqrt{|\Omega|}$. For large enough $c_0$, if $|\Omega| > c_0^2 w_G^2(E_R) \log d$, then there exists an RSC parameter $\kappa_{c_0} > 0$ with $\kappa_{c_0} \approx 1 - o\big(\frac{1}{\sqrt{\log d}}\big)$, and constants $c_1$ and $c_2$, such that, with probability greater than $1 - \exp(-c_1 w_G^2(E_R)) - 2\exp(-c_2 w_G^2(E_R) \log d)$,

$\|\widehat{\Delta}_{cn}\|_F^2 \le 4 \max\Bigg\{ \frac{\xi^2 d_1 d_2}{\kappa_{c_0}},\; \alpha^{*2} \sqrt{\frac{c_0^2 w_G^2(E_R) \log d}{|\Omega|}} \Bigg\}$.

Theorem 1b (Matrix Dantzig Selector).
Under the problem setup in Section 2, let $\widehat{\Theta}_{ds} = \Theta^* + \widehat{\Delta}_{ds}$ be the estimate from (10) with $\lambda_{ds} \ge 2\xi \frac{\sqrt{d_1 d_2}}{|\Omega|} R^*\big(P_\Omega^*(\eta)\big)$. For large enough $c_0$, if $|\Omega| > c_0^2 w_G^2(E_R) \log d$, then there exists an RSC parameter $\kappa_{c_0} > 0$ with $\kappa_{c_0} \approx 1 - o\big(\frac{1}{\sqrt{\log d}}\big)$, and a constant $c_1$, such that, with probability greater than $1 - \exp(-c_1 w_G^2(E_R))$,

$\|\widehat{\Delta}_{ds}\|_F^2 \le 16 \max\Bigg\{ \frac{\lambda_{ds}^2 \Psi_R^2(T_R)\, d_1 d_2}{\kappa_{c_0}^2},\; \alpha^{*2} \sqrt{\frac{c_0^2 w_G^2(E_R) \log d}{|\Omega|}} \Bigg\}$.

Recall the Gaussian width $w_G$ and the subspace compatibility constant $\Psi_R$ from (1) and (4), respectively.
Remarks:
1. If $R(\Theta) = \|\Theta\|_*$ and $\mathrm{rank}(\Theta^*) = r$, then $w_G^2(E_R) \le 3dr$, $\Psi_R(T_R) \le 2\sqrt{r}$, and $\frac{\sqrt{d_1 d_2}}{|\Omega|} \|P_\Omega^*(\eta)\|_{op} \le 2\sqrt{\frac{d \log d}{|\Omega|}}$ w.h.p. [10, 14, 26]. Using these bounds in Theorem 1b recovers near-optimal results for low rank matrix completion under the spikiness assumption [26].
2. For both estimators, the upper bound on the sample complexity is dominated by the square of the Gaussian width, which is often considered the effective dimension of a subset in a high dimensional space and plays a key role in high dimensional estimation under Gaussian measurement ensembles. The results show that, independent of $R(\cdot)$, the upper bound on the sample complexity for consistent matrix completion with highly localized measurements is within a $\log d$ factor of the known sample complexity of $\sim w_G^2(E_R)$ for estimation from Gaussian measurements [3, 10, 37, 5].
3. The first term in the estimation error bounds in Theorems 1a-1b scales with $\xi^2$, which is the per observation noise variance (up to a constant).
The second term is an upper bound on the error that arises due to the unidentifiability of $\Theta^*$ within a certain radius under the spikiness constraints [26]; in contrast, [7] show exact recovery when $\xi = 0$ using more stringent matrix incoherence conditions.
4. The bound on $\widehat{\Delta}_{cn}$ from Theorem 1a is comparable to the result by Candès et al. [7] for low rank matrix completion in the non-low-noise regime, where the first term dominates, and to those of [10, 35] for high dimensional estimation under Gaussian measurements. With a bound on $w_G^2(E_R)$, it is easy to specialize this result for new structural constraints. However, this bound is potentially loose and asymptotically converges to a constant error proportional to the noise variance $\xi^2$.
5. The estimation error bound in Theorem 1b is typically sharper than that in Theorem 1a. However, for specific structures, application of Theorem 1b requires additional bounds on $R^*(P_\Omega^*(\eta))$ and $\Psi_R(T_R)$ besides $w_G^2(E_R)$.

3.1 Partial Complexity Measures

Recall that $w_G(S) = \mathbb{E} \sup_{X \in S} \langle X, G \rangle$, and that $g \in \mathbb{R}^{|\Omega|}$ with $g \sim N(0, I_{|\Omega|})$ is a standard normal vector.

Definition 6 (Partial Complexity Measures). Given a randomly sampled $\Omega = \{E_k \in \mathbb{R}^{d_1 \times d_2}\}$ and a centered random vector $\eta \in \mathbb{R}^{|\Omega|}$, the partial $\eta$-complexity measure of $S$ is given by:

$w_{\Omega,\eta}(S) = \mathbb{E}_{\Omega,\eta} \sup_{X \in S - S} \langle X, P_\Omega^*(\eta) \rangle$.   (12)

Special cases of $\eta$ being a vector of standard Gaussian $g$, or standard Rademacher $\epsilon$ (i.e. $\epsilon_k \in \{-1, 1\}$ w.p.
1/2) variables, are of particular interest.
Note: In the case of symmetric $\eta$, like $g$ and $\epsilon$, $w_{\Omega,\eta}(S) = 2\,\mathbb{E}_{\Omega,\eta} \sup_{X \in S} \langle X, P_\Omega^*(\eta) \rangle$, and the latter expression will be used interchangeably, ignoring the constant factor.

Theorem 2 (Partial Gaussian Complexity). Let $S \subseteq B^{d_1 d_2}$ with non-empty interior, and let $\Omega$ be sampled according to (6). There exist universal constants $k_1, k_2, K_1$ and $K_2$ such that:

$w_{\Omega,g}(S) \le k_1 \sqrt{\frac{|\Omega|}{d_1 d_2}}\, w_G(S) + k_2 \sqrt{\mathbb{E}_\Omega \sup_{X,Y \in S} \|P_\Omega(X - Y)\|_2^2}$,   (13)

$w_{\Omega,g}(S) \le K_1 \sqrt{\frac{|\Omega|}{d_1 d_2}}\, w_G(S) + K_2 \sqrt{|\Omega|} \sup_{X,Y \in S} \|X - Y\|_\infty$.

Also, for a centered i.i.d. sub-Gaussian vector $\eta \in \mathbb{R}^{|\Omega|}$, there exists a constant $K_3$ such that $w_{\Omega,\eta}(S) \le K_3 w_{\Omega,g}(S)$.
Note: For $\Omega \subsetneq [d_1] \times [d_2]$, the second term in (13) is a consequence of the localized measurements.

3.2 Spectral k-Support Norm

We introduced the spectral k-support norm in Section 2.1. The estimators from (9) and (10) with spectral k-support norm regularization can be efficiently solved via proximal methods, using the proximal operators derived in [25]. We are interested in the statistical guarantees for matrix completion using spectral k-support norm regularization. We extend the analysis upper bounding the Gaussian width of the descent cone of the vector k-support norm by [29] to the case of the spectral k-support norm. WLOG let $d_1 = d_2 = \bar{d}$. Let $\sigma^* \in \mathbb{R}^{\bar{d}}$ be the vector of singular values of $\Theta^*$ sorted in non-ascending order. Let $r \in \{0, 1, 2, \ldots, k-1\}$ be the unique integer satisfying $\sigma^*_{k-r-1} > \frac{1}{r+1} \sum_{i=k-r}^{s} \sigma^*_i \ge \sigma^*_{k-r}$. Denote $I_2 = \{1, 2, \ldots, k-r-1\}$ and $I_1 = \{k-r, k-r+1, \ldots, s\}$. Finally, for $I \subseteq [\bar{d}]$,
Finally, for I \u2286 [ \u00afd],\nI )i = 0 \u2200i \u2208 I c, and (\u03c3\u2217\n(\u03c3\u2217\nLemma 3. If rank of \u0398\u2217 is s and ER is the error set for R(\u0398) = (cid:107)\u0398(cid:107)k\u2013sp, then\n(2 \u00afd \u2212 s).\n\n(cid:16) (r + 1)2(cid:107)\u03c3\u2217\n\n(cid:80)p\ni=k\u2212r \u03c3\u2217\n\nG(ER) \u2264 s(2 \u00afd \u2212 s) +\nw2\n\n+ |I1|(cid:17)\n\nk\u2212r\u22121 > 1\nr+1\n\nI )i = \u03c3\u2217\n\ni \u2200i \u2208 I.\n\ni \u2265 \u03c3\u2217\n\n(cid:107)2\n\n2\n\nI2\n\n(cid:107)\u03c3\u2217\n\nI1\n\n(cid:107)2\n\n1\n\nProof of the above lemma is provided in the appendix. Lemma 3 can be combined with Theorem 1a\nto obtain recovery guarantees for matrix completion under spectral k\u2013support norm.\n\n4 Discussions and Related Work\n\nSample Complexity: For consistent recovery in high dimensional convex estimation, it is desirable\nthat the descent cone at the target parameter \u0398\u2217 is \u201csmall\u201d relative to the feasible set (enforced by the\nobservations) of the estimator. Thus, it is not surprising that the sample complexity and estimation\nerror bounds of an estimator depends on some measure of complexity/size of the error cone at\n\u0398\u2217. Results in this paper are largely characterized in terms of a widely used complexity measure\nof Gaussian width wG(.), and can be compared with the literature on estimation from Gaussian\nmeasurements.\nError Bounds: Theorem 1a provides estimation error bounds that depends only on the Gaussian\nwidth of the descent cone. In non\u2013low\u2013noise regime, this result is comparable to analogous results\nof constrained norm minimization [6, 10, 35]. However, this bound is potentially loose owing to\nmismatched data\u2013\ufb01t term using squared loss, and asymptotically converges to a constant error pro-\nportional to the noise variance \u03be2.\n\n6\n\n\fA tighter analysis on the estimation error can be obtained for the matrix Dantzig selector (10) from\nTheorem 1b. 
However, application of Theorem 1b requires computing a high probability upper bound on $R^*(P_\Omega^*(\eta))$. The literature on norms of random matrices [13, 24, 36, 34] can be exploited in computing such bounds. Besides, in special cases, if $R(\cdot) \ge K\|\cdot\|_*$, then $K R^*(\cdot) \le \|\cdot\|_{op}$ can be used to obtain asymptotically consistent results.
Finally, in the near zero-noise regime, the second term in the results of Theorem 1 dominates, and the bounds are weaker than those of [6, 19], owing to the relaxation of the stronger incoherence assumption.

Related Work and Future Directions: The closest related work is the result on consistency of matrix completion under decomposable norm regularization by [16]. The results in this paper are a strict generalization to general norm regularized (not necessarily decomposable) matrix completion. We provide non-trivial examples of applications where structures enforced by such non-decomposable norms are of interest. Further, in contrast to our results, which are based on the Gaussian width, the RSC parameter in [16] depends on a modified complexity measure $\kappa_R(d, |\Omega|)$ (see the definition in [16]). An advantage of results based on the Gaussian width is that application of Theorem 1 to special cases can greatly benefit from the numerous tools in the literature for the computation of $w_G(\cdot)$.
Another closely related line of work is the non-asymptotic analysis of high dimensional estimation under random Gaussian or sub-Gaussian measurements [10, 1, 35, 3, 37, 5]. However, the analysis in this literature relies on variants of the RIP for the measurement ensemble [9], which is not satisfied by the extremely localized measurements encountered in matrix completion [8].
In an intermediate result, we establish a form of RSC for matrix completion under general norm regularization: a result that was previously known only for nuclear norm and decomposable norm regularization.
In future work, it is of interest to derive matching lower bounds on the estimation error for matrix completion under general low dimensional structures, along the lines of [22, 5], and to explore special case applications of the results in the paper. We also plan to derive an explicit characterization of $\lambda_{ds}$ in terms of the Gaussian width of unit balls by exploiting generic chaining results for general Banach spaces [33].

5 Proof Sketch

Proofs of the lemmas are provided in the Appendix.

5.1 Proof of Theorem 1

Define the following set of $\beta$-non-spiky matrices in $\mathbb{R}^{d_1 \times d_2}$, for the constant $c_0$ from Theorem 1:

$A(\beta) = \Big\{ X : \alpha_{sp}(X) = \frac{\sqrt{d_1 d_2}\, \|X\|_\infty}{\|X\|_F} < \beta \Big\}$.   (14)

Define

$\beta_{c_0}^2 = \sqrt{\frac{|\Omega|}{c_0^2 w_G^2(E_R) \log d}}$.   (15)

Case 1: Spiky Error Matrix. When the error matrix from (9) or (10) has a large spikiness ratio, the following bound on the error is immediate, using $\|\widehat{\Delta}\|_\infty \le \|\widehat{\Theta}\|_\infty + \|\Theta^*\|_\infty \le 2\alpha^*/\sqrt{d_1 d_2}$ in (3).

Proposition 4 (Spiky Error Matrix). For the constant $c_0$ in Theorem 1a, if $\widehat{\Delta}_{cn} \notin A(\beta_{c_0})$, then

$\|\widehat{\Delta}_{cn}\|_F^2 \le \frac{4\alpha^{*2}}{\beta_{c_0}^2} = 4\alpha^{*2} \sqrt{\frac{c_0^2 w_G^2(E_R) \log d}{|\Omega|}}$.

An analogous result also holds for $\widehat{\Delta}_{ds}$.

Case 2: Non-Spiky Error Matrix. Let $\widehat{\Delta}_{ds}, \widehat{\Delta}_{cn} \in A(\beta_{c_0})$.
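Theorem 5 asserts that $(d_1 d_2/|\Omega|)\|P_\Omega(X)\|_2^2$ concentrates around $\|X\|_F^2$ over the non-spiky part of the cone; the underlying population identity, $\mathbb{E}_\Omega\big[(d_1 d_2/|\Omega|)\|P_\Omega(X)\|_2^2\big] = \|X\|_F^2$ under the uniform sampling model (6), can be checked by direct enumeration. An illustrative sketch, not part of the proof:

```python
def fro_sq(X):
    """Squared Frobenius norm."""
    return sum(v * v for row in X for v in row)

def expected_scaled_energy(X):
    """E_{E_k ~ uniform}[ d1*d2 * <X, E_k>^2 ] by direct enumeration.

    <X, E_k> = X[i][j] with probability 1/(d1*d2) for each entry (i, j); by
    linearity the same value equals E[(d1*d2/|Omega|) * ||P_Omega(X)||_2^2]
    for any sample size |Omega|.
    """
    d1, d2 = len(X), len(X[0])
    return sum(d1 * d2 * X[i][j] ** 2 for i in range(d1)
               for j in range(d2)) / (d1 * d2)

X = [[1.0, -2.0, 0.5], [0.0, 3.0, -1.0]]
print(expected_scaled_energy(X))   # 15.25
print(fro_sq(X))                   # 15.25: isometry in expectation
```

The content of Theorem 5 is that, restricted to $T_R \cap A(\beta_{c_0})$, this expectation-level isometry also holds with high probability for a single draw of $\Omega$, up to the factor $\kappa_{c_0}$.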
Recall from (5) that $y - P_\Omega(\Theta^*) = \xi\eta$, where $\eta\in\mathbb{R}^{|\Omega|}$ consists of independent sub-Gaussian random variables with $\mathbb{E}[\eta_k]=0$, $\mathrm{Var}(\eta_k)=1$, and $\|\eta_k\|_{\Psi_2}\le b$ for a constant $b$.

5.1.1 Restricted Strong Convexity (RSC)

Recall $T_R$ and $E_R$ from (11). The most significant step in the proof of Theorem 1 involves showing that, over a useful subset of $T_R$, a form of RSC (2) is satisfied by the squared loss penalty.

Theorem 5 (Restricted Strong Convexity). Let $|\Omega| > c_0^2\,w_G^2(E_R)\log d$ for a large enough constant $c_0$. There exist an RSC parameter $\kappa_{c_0}>0$ with $\kappa_{c_0}\approx 1-o\big(\tfrac{1}{\sqrt{\log d}}\big)$ and a constant $c_1$ such that the following holds w.p. greater than $1-\exp(-c_1 w_G^2(E_R))$:
$$\forall X\in T_R\cap\mathcal{A}(\beta_{c_0}),\qquad \frac{d_1d_2}{|\Omega|}\,\|P_\Omega(X)\|_2^2\ \ge\ \kappa_{c_0}\|X\|_F^2.$$
The proof, in Appendix A, combines empirical process tools with Theorem 2. $\square$

5.1.2 Constrained Norm Minimizer

Lemma 6. Under the conditions of Theorem 1, let $b$ be a constant such that $\forall k$, $\|\eta_k\|_{\Psi_2}\le b$. There exists a universal constant $c_2$ such that, if $\lambda_{cn}\ge 2\xi\sqrt{|\Omega|}$, then w.p. greater than $1-2\exp(-c_2|\Omega|)$: (a) $\hat\Delta_{cn}\in T_R$, and (b) $\|P_\Omega(\hat\Delta_{cn})\|_2\le 2\lambda_{cn}$. $\square$

Using $\lambda_{cn}=2\xi\sqrt{|\Omega|}$ in (9), if $\hat\Delta_{cn}\in\mathcal{A}(\beta_{c_0})$, then using Theorem 5 and Lemma 6, w.h.p.
$$\frac{\|\hat\Delta_{cn}\|_F^2}{d_1d_2}\ \le\ \frac{1}{\kappa_{c_0}}\,\frac{\|P_\Omega(\hat\Delta_{cn})\|_2^2}{|\Omega|}\ \le\ \frac{4\lambda_{cn}^2}{\kappa_{c_0}|\Omega|}\ =\ \frac{16\xi^2}{\kappa_{c_0}}. \qquad(16)$$

5.1.3 Matrix Dantzig Selector

Proposition 7. $\lambda_{ds}\ \ge\ \tfrac{\xi}{\sqrt{|\Omega|}}\,R^*P^*_\Omega(\eta)\ \Rightarrow\ $ w.h.p. (a) $\hat\Delta_{ds}\in T_R$; (b) $\tfrac{1}{\sqrt{|\Omega|}}\,R^*P^*_\Omega\big(P_\Omega(\hat\Delta_{ds})\big)\le 2\lambda_{ds}$. $\square$

The above result follows from the optimality of $\hat\Theta_{ds}$ and the triangle inequality. Also, since $\hat\Delta_{ds}\in T_R$, the norm compatibility constant $\Psi_R(T_R)$ from (4) gives $R(\hat\Delta_{ds})\le\Psi_R(T_R)\|\hat\Delta_{ds}\|_F$, so that by Hölder's inequality,
$$\frac{\|P_\Omega(\hat\Delta_{ds})\|_2^2}{\sqrt{|\Omega|}}\ \le\ \frac{1}{\sqrt{|\Omega|}}\,R^*P^*_\Omega\big(P_\Omega(\hat\Delta_{ds})\big)\,R(\hat\Delta_{ds})\ \le\ 2\lambda_{ds}\Psi_R(T_R)\|\hat\Delta_{ds}\|_F.$$
Finally, using Theorem 5, w.h.p.
$$\frac{\|\hat\Delta_{ds}\|_F^2}{d_1d_2}\ \le\ \frac{1}{\kappa_{c_0}}\,\frac{\|P_\Omega(\hat\Delta_{ds})\|_2^2}{|\Omega|}\ \le\ \frac{2\lambda_{ds}\Psi_R(T_R)}{\kappa_{c_0}}\,\frac{\|\hat\Delta_{ds}\|_F}{\sqrt{|\Omega|}}. \qquad(17)$$

5.2 Proof of Theorem 2

Let the entries of $\Omega=\{E_k=e_{i_k}e_{j_k}^\top : k=1,2,\ldots,|\Omega|\}$ be sampled as in (6). Recall that $g\in\mathbb{R}^{|\Omega|}$ is a standard normal vector. For a compact $S\subseteq\mathbb{R}^{d_1\times d_2}$, it suffices to prove Theorem 2 for a dense countable subset of $S$. Overloading $S$ to denote such a countable subset, define the following random process:
$$\big(X_{\Omega,g}(X)\big)_{X\in S},\quad\text{where } X_{\Omega,g}(X)=\langle X, P^*_\Omega(g)\rangle=\sum\nolimits_k \langle X, E_k\rangle\, g_k. \qquad(18)$$
We start with a key lemma in the proof of Theorem 2. The proof of this lemma, provided in Appendix B, uses tools from the broad topic of generic chaining developed in recent works [31, 33].

Lemma 8. For a compact subset $S\subseteq\mathbb{R}^{d_1\times d_2}$ with non-empty interior, $\exists$ constants $k_1, k_2$ such that:
$$w_{\Omega,g}(S)=\mathbb{E}\sup_{X\in S} X_{\Omega,g}(X)\ \le\ k_1\sqrt{\frac{|\Omega|}{d_1d_2}}\,w_G(S)+k_2\sqrt{\mathbb{E}\sup_{X,Y\in S}\|P_\Omega(X-Y)\|_2^2}.\ \square$$

Lemma 9. There exist constants $k_3, k_4$ such that for compact $S\subseteq B_{d_1d_2}$ with non-empty interior,
$$\mathbb{E}\sup_{X,Y\in S}\|P_\Omega(X-Y)\|_2^2\ \le\ k_3\,\frac{|\Omega|}{d_1d_2}\,w_G^2(S)+k_4\Big(\sup_{X,Y\in S}\|X-Y\|_\infty\Big)\,w_{\Omega,g}(S).\ \square$$

Theorem 2 follows by combining Lemma 8 and Lemma 9, and simple algebraic manipulations using $\sqrt{ab}\le a/2+b/2$ and the triangle inequality (see Appendix B.4). The statement in Theorem 2 about the partial sub-Gaussian complexity follows from a standard result in empirical processes given in Lemma 11 in the appendix. $\square$

Acknowledgments: We thank the anonymous reviewers for helpful comments and suggestions. S. Gunasekar and J. Ghosh acknowledge funding from NSF grants IIS-1421729, IIS-1417697, and IIS-1116656. A. Banerjee acknowledges NSF grants IIS-1447566, IIS-1422557, CCF-1451986, CNS-1314560, IIS-0953274, IIS-1029711, and NASA grant NNX12AQ39A.

References
[1] D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp. Living on the edge: A geometric theory of phase transitions in convex optimization. Inform. Inference, 2014.
[2] A. Argyriou, R. Foygel, and N. Srebro. Sparse prediction with the k-support norm. In NIPS, 2012.
[3] A. Banerjee, S. Chen, F. Fazayeli, and V. Sivakumar. Estimation with norm regularization. In NIPS, 2014.
[4] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. JMLR, 2005.
[5] T. Cai, T. Liang, and A. Rakhlin. Geometrizing local rates of convergence for linear inverse problems. arXiv preprint, 2014.
[6] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 2011.
[7] E. J. Candès and Y. Plan. Matrix completion with noise.
Proceedings of the IEEE, 2010.
[8] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. FoCM, 2009.
[9] E. J. Candès and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 2005.
[10] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 2012.
[11] M. A. Davenport, Y. Plan, E. Berg, and M. Wootters. 1-bit matrix completion. Inform. Inference, 2014.
[12] R. M. Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1967.
[13] A. Edelman. Eigenvalues and condition numbers of random matrices. SIAM Journal on Matrix Analysis and Applications, 1988.
[14] M. Fazel, H. Hindi, and S. P. Boyd. A rank minimization heuristic with application to minimum order system approximation. In American Control Conference, 2001.
[15] J. Forster and M. Warmuth. Relative expected instantaneous loss bounds. Journal of Computer and System Sciences, 2002.
[16] S. Gunasekar, P. Ravikumar, and J. Ghosh. Exponential family matrix completion under structural constraints. In ICML, 2014.
[17] L. Jacob, J. P. Vert, and F. R. Bach. Clustered multi-task learning: A convex formulation. In NIPS, 2009.
[18] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. IEEE Trans. IT, 2010.
[19] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from noisy entries. JMLR, 2010.
[20] O. Klopp. Noisy low-rank matrix completion with general sampling distribution. Bernoulli, 2014.
[21] O. Klopp. Matrix completion by singular value thresholding: sharp bounds. arXiv preprint, 2015.
[22] V. Koltchinskii, K. Lounici, and A. B. Tsybakov. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 2011.
[23] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.
[24] A. E. Litvak, A. Pajor, M. Rudelson, and N. Tomczak-Jaegermann. Smallest singular value of random matrices and geometry of random polytopes. Advances in Mathematics, 2005.
[25] A. M. McDonald, M. Pontil, and D. Stamos. New perspectives on k-support and cluster norms. arXiv preprint, 2014.
[26] S. Negahban and M. J. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. JMLR, 2012.
[27] S. Negahban, B. Yu, M. J. Wainwright, and P. Ravikumar. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In NIPS, 2009.
[28] B. Recht. A simpler approach to matrix completion. JMLR, 2011.
[29] E. Richard, G. Obozinski, and J.-P. Vert. Tight convex relaxations for sparse matrix factorization. ArXiv e-prints, 2014.
[30] N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. In Learning Theory. Springer, 2005.
[31] M. Talagrand. Majorizing measures: the generic chaining. The Annals of Probability, 1996.
[32] M. Talagrand. Majorizing measures without measures. The Annals of Probability, 2001.
[33] M. Talagrand. Upper and Lower Bounds for Stochastic Processes. Springer, 2014.
[34] J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 2012.
[35] J. A. Tropp. Convex recovery of a structured signal from independent random linear measurements. arXiv preprint, 2014.
[36] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. Compressed Sensing, pages 210-268, 2012.
[37] R. Vershynin. Estimation in high dimensions: a geometric perspective. ArXiv e-prints, 2014.
[38] G. A. Watson. Characterization of the subdifferential of some matrix norms. Linear Algebra and its Applications, 1992.
[39] E. Yang and P. Ravikumar. Dirty statistical models. In NIPS, 2013.