{"title": "On the Sample Complexity of Robust PCA", "book": "Advances in Neural Information Processing Systems", "page_first": 3221, "page_last": 3229, "abstract": "We estimate the sample complexity of a recent robust estimator for a generalized version of the inverse covariance matrix. This estimator is used in a convex algorithm for robust subspace recovery (i.e., robust PCA). Our model assumes a sub-Gaussian underlying distribution and an i.i.d.~sample from it. Our main result shows with high probability that the norm of the difference between the generalized inverse covariance of the underlying distribution and its estimator from an i.i.d.~sample of size $N$ is of order $O(N^{-0.5+\\epsilon})$ for arbitrarily small $\\epsilon>0$ (affecting the probabilistic estimate); this rate of convergence is close to that of direct covariance and inverse covariance estimation, i.e., $O(N^{-0.5})$. Our precise probabilistic estimate implies for some natural settings that the sample complexity of the generalized inverse covariance estimation when using the Frobenius norm is $O(D^{2+\\delta})$ for arbitrarily small $\\delta>0$ (whereas the sample complexity of direct covariance estimation with the Frobenius norm is $O(D^{2})$). These results provide similar rates of convergence and sample complexity for the corresponding robust subspace recovery algorithm, which are close to those of PCA. To the best of our knowledge, this is the only work analyzing the sample complexity of any robust PCA algorithm.", "full_text": "On the Sample Complexity of Robust PCA\n\nMatthew Coudron\n\nDepartment of Electrical Engineering and Computer Science\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\n\nmcoudron@mit.edu\n\nGilad Lerman\n\nSchool of Mathematics\n\nUniversity of Minnesota\n\nMinneapolis, MN 55455\n\nlerman@umn.edu\n\nAbstract\n\nWe estimate the rate of convergence and sample complexity of a recent robust estimator for a generalized version of the inverse covariance matrix. 
This estimator is used in a convex algorithm for robust subspace recovery (i.e., robust PCA). Our model assumes a sub-Gaussian underlying distribution and an i.i.d. sample from it. Our main result shows with high probability that the norm of the difference between the generalized inverse covariance of the underlying distribution and its estimator from an i.i.d. sample of size N is of order O(N^(−0.5+ε)) for arbitrarily small ε > 0 (affecting the probabilistic estimate); this rate of convergence is close to that of direct covariance estimation, i.e., O(N^(−0.5)). Our precise probabilistic estimate implies for some natural settings that the sample complexity of the generalized inverse covariance estimation when using the Frobenius norm is O(D^(2+δ)) for arbitrarily small δ > 0 (whereas the sample complexity of direct covariance estimation with the Frobenius norm is O(D^2)). These results provide similar rates of convergence and sample complexity for the corresponding robust subspace recovery algorithm. To the best of our knowledge, this is the only work analyzing the sample complexity of any robust PCA algorithm.\n\n1 Introduction\n\nA fundamental problem in probability and statistics is to determine with overwhelming probability the rate of convergence of the empirical covariance (or inverse covariance) of an i.i.d. sample of increasing size N to the covariance (or inverse covariance) of the underlying random variable (see e.g., [17, 3] and references therein). Clearly, this problem is also closely related to estimating with high probability the sample complexity, that is, the number of samples required to obtain a given error of approximation ε. 
In the case of a compactly supported (or even more generally sub-Gaussian) underlying distribution, it is a classical exercise to show that this rate of convergence is O(N^(−0.5)) (with a comparability constant depending on properties of µ, in particular D, as well as on the threshold probability; see e.g., [17, Proposition 2.1]). The precise estimate for this rate of convergence implies that the sample complexity of covariance estimation is O(D) when using the spectral norm and O(D^2) when using the Frobenius norm. The rate of convergence and sample complexity of PCA immediately follow from these estimates (see e.g., [15]).\n\nWhile such estimates are theoretically fundamental, they can be completely useless in the presence of outliers. That is, direct covariance or inverse covariance estimation and its resulting PCA are very sensitive to outliers. Many robust versions of covariance estimation, PCA and dimension reduction have been developed in the last three decades (see e.g., the standard textbooks [8, 10, 14]). In the last few years new convex algorithms with provable guarantees have been suggested for robust subspace recovery and its corresponding dimension reduction [5, 4, 19, 20, 11, 7, 2, 1, 21, 9].\n\nMost of these works minimize a mixture of an ℓ1-type norm (depending on the application) and the nuclear norm. Their algorithmic complexity is not as competitive as PCA and their sample complexity is hard to estimate due to the problem of extending the nuclear norm out-of-sample. On the other hand, Zhang and Lerman [21] have proposed a novel M-estimator for robust PCA, which is based on a convex relaxation of the sum of Euclidean distances to subspaces (which is originally minimized over the non-convex Grassmannian). This procedure suggests an estimator for a generalized version of the inverse covariance matrix and uses it to robustly recover an underlying low-dimensional subspace. 
This idea was extended in [9] to obtain an even more accurate method for subspace recovery, though it does not estimate the generalized inverse covariance matrix (in particular, it has no analogous notion of singular values or their inverses). The algorithmic complexity of the algorithms solving the convex formulations of [21] and [9] is comparable to that of full PCA. Here we show that for the setting of sub-Gaussian distributions the sample complexity of the robust PCA algorithm in [21] (or its generalized inverse covariance estimation) is close to that of PCA (or to sample covariance estimation). Our analysis immediately extends to the robust PCA algorithm of [9].\n\n1.1 The Generalized Inverse Covariance and its Corresponding Robust PCA\n\nZhang and Lerman [21] formed the set\n\nH := {Q ∈ R^(D×D) : Q = Q^T, tr(Q) = 1}, (1.1)\n\nas a convex relaxation for the orthoprojectors (from R^D to R^D), and defined the following energy function on H (with respect to a data set X in R^D):\n\nF_X(Q) := ∑_{x∈X} ‖Qx‖, (1.2)\n\nwhere ‖·‖ denotes the Euclidean norm of a vector in R^D. Their generalized empirical inverse covariance is\n\nˆQ_X = arg min_{Q∈H} F_X(Q). (1.3)\n\nThey showed that when replacing the term ‖Qx‖ by ‖Qx‖^2 in (1.2) and when Sp{X} = R^D, the minimization (1.3) results in a scaled version of the empirical inverse covariance matrix. It is thus clear why we can refer to ˆQ_X as a generalized empirical inverse covariance (or an ℓ1-type version of it). We describe the absolute, i.e., non-empirical, notion of the generalized inverse covariance matrix in §1.2. Zhang and Lerman [21] did not emphasize the empirical generalized inverse covariance, but rather the robust estimate of the underlying low-dimensional subspace by the span of the bottom eigenvectors of this matrix. 
They rigorously proved that such a procedure robustly recovers the underlying subspace under some conditions.\n\n1.2 Main Result of this Paper\n\nWe focus on computing the sample complexity of the estimator ˆQ_X. This problem is practically equivalent to estimating the rate of convergence of ˆQ_X for an i.i.d. sample X to the “generalized inverse covariance” of the underlying distribution µ. We may assume that µ is a sub-Gaussian probability measure on R^D (see §2.1 and the extended version of this paper). However, in order to easily express the dependence of our probabilistic estimates on properties of the measure µ, we assume for simplicity that µ is compactly supported and denote by R_µ the minimal radius among all balls containing the support of µ, that is,\n\nR_µ = min{r > 0 : supp(µ) ⊆ B(0, r)},\n\nwhere B(0, r) is the ball around the origin 0 with radius r. We further assume that for some 0 < γ < 1, µ satisfies the following condition, which we refer to as the “two-subspaces criterion” (for γ): for any pair of (D − 1)-dimensional subspaces L1 and L2 of R^D,\n\nµ((L1 ∪ L2)^c) ≥ γ. (1.4)\n\nWe note that if µ satisfies the two-subspaces criterion for any particular 0 < γ < 1, then its support cannot be a union of two hyperplanes of R^D. The use of this assumption is clarified below in §3.2, though it is possible that one may weaken it.\n\nWe first formulate the generalized inverse covariance of the underlying measure as follows:\n\nˆQ = arg min_{Q∈H} F(Q), (1.5)\n\nwhere\n\nF(Q) = ∫ ‖Qx‖ dµ(x). (1.6)\n\nLet {x_i}_{i=1}^∞ be a sequence of i.i.d. random variables sampled from µ (i.e., each variable has distribution µ). Let X_N := {x_i}_{i=1}^N and denote\n\nˆQ_N := ˆQ_{X_N} and F_N := F_{X_N}. (1.7)\n\nOur main result shows with high probability that ˆQ and ˆQ_N are uniquely defined (which we denote by u.d. from now on) and that {ˆQ_N}_{N∈N} converges to ˆQ at the following specified rate. It uses the common notation a ∨ b := max(a, b). We explain its implications in §2.\n\nTheorem 1.1. If µ is a compactly supported distribution satisfying the two-subspaces criterion for γ > 0, then for any ε > 0 there exists a constant α_0 ≡ α_0(µ, D, ε) > 0 such that for all N > 2(D − 1) the following estimate holds:\n\nP( ˆQ and ˆQ_N are u.d. and ‖ˆQ − ˆQ_N‖_F ≤ (2/α_0) N^(−1/2+ε) ) ≥ 1 − C_0 N^(D^2) exp(−N^(2ε)/(D·R_µ^2)) − 2 (N choose D−1)^2 (1 − γ)^(N−2(D−1)), (1.8)\n\nwhere\n\nC_0 ≡ C_0(α_0, D) := 4·((4α_0) ∨ 2)·( 10D·(2α_0 + 4((4α_0) ∨ 2)R_µ) / (1 − 2α_0/((4α_0) ∨ 2)) )^(D(D+1)/2). (1.9)\n\nIntuitively, α_0 represents a lower bound on the directional second derivatives of F. Therefore, α_0 should affect the sample complexity, because the number of random samples needed to approximate a minimum of F should be affected by how sharply F increases about its minimum. 
It is an interesting and important open problem to find lower bounds on α_0 for general µ.\n\n2 Implication and Extensions of the Main Result\n\n2.1 Generalization to Sub-Gaussian Measures\n\nWe can remove the assumption that the support of µ is bounded (with radius R_µ) and assume instead that µ is sub-Gaussian. In this case, instead of Hoeffding’s inequality, we apply [18, Proposition 5.10] with a_i = 1 for all 1 ≤ i ≤ n. When formulating the corresponding inequality, one may note that sup_{p≥1} p^(−1/2) (E_µ|x|^p)^(1/p) (where x represents a random variable sampled from µ) can be regarded as a substitute for R_µ (see [21] for more details of a similar analysis).\n\n2.2 Sample Complexity\n\nThe notion of sample complexity arises in the framework of Probably-Approximately-Correct Learning of Valiant [16]. Generally speaking, the sample complexity in our setting is the minimum number of samples N required, as a function of the dimension D, to achieve a good estimation of ˆQ with high probability. We recall that in this paper we use the Frobenius norm for the estimation error. The following calculation will show that under some assumptions on µ it suffices to use N = Ω(D^η) samples for any η > 2 (we recall that f(x) = Ω(g(x)) as x → ∞ if and only if g(x) = O(f(x))). In our analysis we will have to assume that γ is a fixed constant and that α_0 goes as 1/√D. These assumptions place additional restrictions on the measure µ, which we expect to be reasonable in practice, as we later clarify. 
We further assume that R_µ = O(D^(−0.5)) and also explain later why it makes sense for the setting of robust subspace recovery.\n\nTo bound the sample complexity we set C_1 := 4·((4α_0) ∨ 2) and C_2 := 10·(2α_0 + 4((4α_0) ∨ 2)R_µ)/(1 − 2α_0/((4α_0) ∨ 2)), so that C_0 ≤ C_1·(C_2·D)^(D^2) (see (1.9)). Applying this bound and (1.8), we obtain that if η > 2 is fixed, 1/η < ε < 1/2 and N ≥ D^η, then\n\nP( ˆQ and ˆQ_N are u.d. and ‖ˆQ − ˆQ_N‖_F ≤ (2/α_0) N^(−1/2+ε) ) ≥ 1 − C_1 (C_2·D·N)^(D^2) exp(−N^(2ε)/(D·R_µ^2)) − 2 N^(2(D−1)) (1 − γ)^(N−2(D−1)) ≥ 1 − C_1 exp( log(C_2·D^(1+η)) D^2 − D^(2ηε) ) − 2 exp( 2η(D − 1) log(D) + log(1 − γ)(D^η − 2(D − 1)) ). (2.1)\n\nSince ε > 1/η, the first term in the RHS of (2.1) decays exponentially as a function of D (or, equivalently, as a function of N ≥ D^η). Similarly, since 0 < γ < 1 and η > 1, the second term in the RHS of (2.1) decays exponentially as a function of D. Furthermore, since ε < 1/2, it follows that the error term for the minimizer, i.e., N^(−1/2+ε) ≤ D^(η(ε−1/2)), decays polynomially in D. Thus, in order to achieve low error estimation with high probability it is sufficient to take N = Ω(D^η) samples for any η > 2. The exact guarantees on the estimation error and the probability of error can be manipulated by changing the constant hidden in the Ω term.\n\nWe would like to point out the expected tradeoff between the sample complexity and the rate of convergence. If ε approaches 0, then the rate of convergence becomes optimal but the sample complexity deteriorates. 
On the other hand, if ε approaches 0.5, then the sample complexity becomes optimal, but the rate of convergence deteriorates.\n\nTo motivate our assumptions on R_µ, γ and α_0, we recall the needle-haystack and syringe-haystack models of [9] as a prototype for robust subspace recovery. These models assume a mixture of outlier and inlier components. The distribution of the outlier component is N(0, (σ_out^2/D)I_D) and the distribution of the inlier component is a mixture of N(0, (σ_in^2/d)P_L) (where L is a d-subspace) and N(0, (σ_in^2/(CD))I_D), where C ≫ 1 (the latter component has coefficient zero in the needle-haystack model).\n\nThe underlying distribution of the syringe-haystack (or needle-haystack) model is not compactly supported, but it is clearly sub-Gaussian (as discussed in §2.1) and its standard deviation is of order O(D^(−0.5)). We also note that γ here is the coefficient of the outlier component in the needle-haystack model, which we denote by ν_0. Indeed, the only non-zero measure that can be contained in a (D − 1)-dimensional subspace is the measure associated with N(0, (σ_in^2/d)P_L), and that has total weight at most (1 − ν_0). It is also possible to verify explicitly that α_0 is lower bounded by 1/√D in this case (though our argument is currently rather lengthy and will appear in the extended version of this paper).\n\n2.3 From Generalized Covariances to Subspace Recovery\n\nWe recall that the underlying d-dimensional subspace can be recovered from the bottom d eigenvectors of ˆQ_N. Therefore, the rate of convergence of the subspace recovery (or its corresponding sample complexity) follows directly from Theorem 1.1 and the Davis-Kahan Theorem [6]. To formulate this, we assume here for simplicity that ˆQ and ˆQ_N are u.d. (recall Theorems 3.1 and 3.2).\n\nTheorem 2.1. 
If d < D, ε > 0, α_0 ≡ α_0(µ, D, ε) is the positive constant guaranteed by Theorem 1.1, ˆQ and ˆQ_N are u.d., ˆL_d and ˆL_{d,N} are the subspaces spanned by the bottom d eigenvectors (i.e., those with the lowest d eigenvalues) of ˆQ and ˆQ_N respectively, P_{ˆL_d} and P_{ˆL_{d,N}} are the orthoprojectors on these subspaces and ν_{D−d} is the (D − d)th eigengap of ˆQ, then\n\nP( ‖P_{ˆL_d} − P_{ˆL_{d,N}}‖_F ≤ (4/(α_0·ν_{D−d})) N^(−1/2+ε) ) ≥ 1 − C_0 N^(D^2) exp(−N^(2ε)/(D·R_µ^2)). (2.2)\n\n2.4 Nontrivial Robustness to Noise\n\nWe remark that (2.2) implies nontrivial robustness to noise for robust PCA. Indeed, assume for example an underlying d-subspace L*_d and a mixture distribution (representing noisy inlier/outlier components) whose inlier component is symmetric around L*_d with a relatively high level of variance in the orthogonal complement of ˆL_d and whose outlier component is spherically symmetric with a sufficiently small mixture coefficient. One can show that in this case ˆL_d = L*_d. Combining this observation and (2.2), we can verify robustness to nontrivial noise when recovering L*_d from i.i.d. samples of such distributions.\n\n2.5 Convergence Rate of the REAPER Estimator\n\nThe REAPER and S-REAPER algorithms [9] are variants of the robust PCA algorithm of [21]. The objective of the REAPER algorithm can be formulated as aiming to minimize the energy F_X(Q) over the set\n\nG := {Q ∈ R^(D×D) : Q = Q^T, tr(Q) = D − d and Q ⪯ I}, (2.3)\n\nwhere ⪯ denotes the semi-definite order. 
The d-dimensional subspace can then be recovered from the bottom d eigenvectors of Q (in [9] this minimization is formulated with P = I − Q, whose top d eigenvectors are found). The rate of convergence of the minimizer of F_X(Q) over G to the minimizer of F(Q) over G is similar to that in Theorem 1.1. The proof of Theorem 1.1 must be modified to deal with the boundary of the set G. If the minimizer ˆQ lies in the interior of G, then the proof is the same. If ˆQ is on the boundary of G, we must only consider the directional derivatives which point towards the interior of G, or tangent to the boundary. Other than that, the proof is the same.\n\n2.6 Convergence Rate with Additional Sparsity Term\n\nRothman et al. [13] and Ravikumar et al. [12] have analyzed an estimator for sparse inverse covariance. This estimator minimizes over all Q ≻ 0 the energy\n\n⟨Q, ˆΣ_N⟩_F − log det(Q) + λ_N ‖Q‖_ℓ1, (2.4)\n\nwhere ˆΣ_N is the empirical covariance matrix based on a sample of size N, ⟨·,·⟩_F is the Frobenius inner product (i.e., the sum of elementwise products) and ‖Q‖_ℓ1 = ∑_{i,j=1}^D |Q_{i,j}|.\n\nZhang and Zou [22] have suggested a similar minimization, which replaces the first two terms in (2.4) (corresponding to λ_N = 0) with\n\n⟨Q^2, ˆΣ_N⟩_F/2 − tr(Q). (2.5)\n\nIndeed, the minimizers of (2.4) when λ_N = 0 and of (2.5) are both equal to ˆΣ_N^(−1) (assuming that Sp({x_i}_{i=1}^N) = R^D, so that the inverse empirical covariance exists). Using the definition of ˆΣ_N, i.e., ˆΣ_N = ∑_{i=1}^N x_i x_i^T / N, we note that\n\n⟨Q^2, ˆΣ_N⟩_F = (1/N) ∑_{i=1}^N ‖Q x_i‖^2. (2.6)\n\nTherefore, the minimizer of (2.5) over all Q ≻ 0 is the same up to a multiplicative constant as the minimizer of the 
RHS of (2.6) over all Q ≻ 0 with tr(Q) = 1. Teng Zhang suggested to us replacing the RHS of (2.6) with F_X and modifying the original problem of (2.4) (or, more precisely, its variant in [22]) with the minimization over all Q ∈ H of the energy\n\nF_X(Q) + λ_N ‖Q‖_ℓ1. (2.7)\n\nThe second term enforces sparseness and we expect the first term to enforce robustness.\n\nBy choosing λ_N = O(N^(−0.5)) we can obtain similar rates of convergence for the minimizer of (2.7) as in the case λ_N = 0 (see the extended version of this paper), namely, a rate of convergence of order O(N^(−0.5+ε)) for any ε > 0. The dependence on D is also the same. That is, the minimum sample size when using the Frobenius norm is O(D^η) for any η > 2. Nevertheless, Ravikumar et al. [12] show that under some assumptions (see e.g., Assumption 1 in [12]) the minimal sample size is O(log(D)r^2), where r is the maximum node degree of the graph whose edges are the nonzero entries of the inverse covariance. It will be interesting to generalize such estimates to the minimization of (2.7).\n\n3 Overview of the Proof of Theorem 1.1\n\n3.1 Structure of the Proof\n\nWe first discuss in §3.2 conditions for uniqueness of ˆQ and ˆQ_N (with high probability). In §3.3 and §3.4 we briefly explain the two basic components of the proof of Theorem 1.1. The first of them is that ‖ˆQ − ˆQ_N‖_F can be controlled from above by differences of directional derivatives of F. The second component is that the rate of convergence of the derivatives of {F_N}_{N=1}^∞ to the derivative of F is easily obtained by Hoeffding’s inequality. In §3.5 we gain some intuition for the validity of Theorem 1.1 in view of these two components and also explain why they are not sufficient to conclude the proof. 
In §3.6 we describe the construction of “nets” of increasing precision; using these nets we conclude the proof of Theorem 1.1 in §3.7. Throughout this section we only provide the global ideas of the proof, whereas in the extended version of this paper we present the details.\n\n3.2 Uniqueness of the Minimizers\n\nThe two-subspaces criterion for µ guarantees that ˆQ is u.d. and that ˆQ_N is u.d. with overwhelming probability for sufficiently large N, as follows.\n\nTheorem 3.1. If µ satisfies the two-subspaces criterion for some γ > 0, then F is strictly convex.\n\nTheorem 3.2. If µ satisfies the two-subspaces criterion for some γ > 0 and N > 2(D − 1), then\n\nP(F_N is not strictly convex) ≤ 2 (N choose D−1)^2 (1 − γ)^(N−2(D−1)). (3.1)\n\n3.3 From Energy Minimizers to Directional Derivatives of Energies\n\nWe control the difference ‖Q − ˆQ‖_F from above by differences of derivatives of energies at Q and ˆQ. Here Q is an arbitrary matrix in B_r(ˆQ) for some r > 0 (where B_r(ˆQ) is the ball in H with center ˆQ and radius r w.r.t. the Frobenius norm), but we will later apply it with Q = ˆQ_N for some N ∈ N.\n\n3.3.1 Preliminary Notation and Definitions\n\nThe “directions” of the derivatives, which we define below, are elements of the unit sphere of the tangent space of H, i.e.,\n\nD := {D ∈ R^(D×D) | D = D^T, tr(D) = 0, ‖D‖_F = 1}.\n\nThroughout the paper, directions in D are often determined by particular points Q1, Q2 ∈ H, where Q1 ≠ Q2. We denote the direction from Q1 to Q2 by D_{Q1,Q2}, that is,\n\nD_{Q1,Q2} := (Q2 − Q1)/‖Q2 − Q1‖_F. (3.2)\n\nDirectional derivatives with respect to an element of D may not exist and therefore we use directional derivatives from the right. 
That is, for Q ∈ H and D ∈ D, the directional derivative (from the right) of F at Q in the direction D is\n\n∇+_D F(Q) := (d/dt) F(Q + tD)|_{t=0+}. (3.3)\n\n3.3.2 Mathematical Statement\n\nWe use the above notation to formulate the desired bound on ‖Q − ˆQ‖_F. It involves the constant α_0, which is also used in Theorem 1.1. The proof of this lemma clarifies the existence of α_0, though it does not suggest an explicit approximation for it.\n\nLemma 3.3. For r > 0 there exists a constant α_0 ≡ α_0(r, µ, D) > 0 such that for all Q ∈ B_r(ˆQ) \\ {ˆQ}:\n\n∇+_{D_{ˆQ,Q}} F(Q) − ∇+_{D_{ˆQ,Q}} F(ˆQ) ≥ α_0 ‖Q − ˆQ‖_F (3.4)\n\nand consequently\n\n∇+_{D_{ˆQ,Q}} F(Q) ≥ α_0 ‖Q − ˆQ‖_F. (3.5)\n\n3.4 N^(−1/2) Convergence of Directional Derivatives\n\nWe formulate the following convergence rate of the directional derivatives of F_N from the right:\n\nTheorem 3.4. For Q ∈ H and D ∈ D,\n\nP( |∇+_D F(Q) − ∇+_D F_N(Q)| ≥ N^(ε−1/2) ) ≤ 2 exp(−N^(2ε)/(D·R_µ^2)). (3.6)\n\nIt will be desirable to replace ∇+_D F(Q) − ∇+_D F_N(Q) in (3.6) with ∇+_D F(Q), though it is impossible in general. We will later use the following observation to implicitly obtain a result in this direction.\n\nLemma 3.5. If Q ∈ H \\ {ˆQ}, then\n\n∇+_{D_{ˆQ,Q}} F(Q) ≥ 0. (3.7)\n\n3.5 An Incomplete Idea for Proving Theorem 1.1\n\nAt this point we can outline the basic intuition behind the proof of Theorem 1.1. We assume for simplicity that ˆQ_N is u.d. Suppose, for the moment, that we could use (3.6) of Theorem 3.4 with Q := ˆQ_N. 
This is actually not mathematically sound, as we will discuss shortly, but if we could do it then we would have from (3.6) that\n\nP( |∇+_{D_{ˆQ,ˆQ_N}} F(ˆQ_N) − ∇+_{D_{ˆQ,ˆQ_N}} F_N(ˆQ_N)| ≥ N^(ε−1/2) ) ≤ 2 exp(−N^(2ε)/(D·R_µ^2)). (3.8)\n\nWe note that (3.7), as well as both the convexity of F_N and the definition of ˆQ_N, imply that\n\n∇+_{D_{ˆQ,ˆQ_N}} F(ˆQ_N) ≥ 0 and ∇+_{D_{ˆQ,ˆQ_N}} F_N(ˆQ_N) ≤ 0. (3.9)\n\nCombining (3.8) and (3.9), we obtain that\n\nP( ∇+_{D_{ˆQ,ˆQ_N}} F(ˆQ_N) ≥ N^(ε−1/2) ) ≤ 2 exp(−N^(2ε)/(D·R_µ^2)). (3.10)\n\nAt last, combining (3.5), (3.10) and Theorem 3.2 we can formally prove Theorem 1.1.\n\nHowever, as mentioned above, we cannot legally use Theorem 3.4 with Q = ˆQ_N. This is because ˆQ_N is a function of the samples (random variables) {x_i}_{i=1}^N, but for our proof to be valid, Q needs to be fixed before the sampling begins.\n\nTherefore, our new goal is to utilize the intuition described above, but modify the proof to make it mathematically sound. This is accomplished by creating a series of “nets” (subsets of H) of increasing precision. Each matrix in each of the nets is determined before the sampling begins, so it can be used in Theorem 3.4. However, the construction of the nets guarantees that the Nth net contains a matrix Q which is sufficiently close to ˆQ_N to be used as a substitute for ˆQ_N in the above process.\n\n3.6 The Missing Component: Adaptive Nets\n\nWe describe here a result on the existence of a sequence of nets as suggested in §3.5. 
They are constructed in several stages, which cannot fit in here (see the careful explanation in the extended version of this paper). We recall that B_2(ˆQ) denotes a ball in H with center ˆQ and radius 2 w.r.t. the Frobenius norm.\n\nLemma 3.6. Given κ ≥ 2 and τ > 0, there exists a sequence of sets {S_n}_{n=1}^∞ such that for all n ∈ N, S_n ⊂ B_2(ˆQ), and for any Q ∈ B_2(ˆQ) with ‖Q − ˆQ‖_F > n^(−1/2), there exists Q′ ∈ S_n with\n\n‖Q′ − ˆQ‖_F ≤ ‖Q − ˆQ‖_F, (3.11)\n\n2n^(−1/2)(τ + κ^(−1)) ≥ ‖Q′ − Q‖_F ≥ n^(−1/2)κ^(−1) (3.12)\n\nand\n\n‖D_{ˆQ,Q′} − D_{ˆQ,Q}‖_F ≤ τn^(−1). (3.13)\n\nFurthermore,\n\n|S_n| ≤ 2κn^(1/2) (10Dn/τ)^(D(D+1)/2). (3.14)\n\nThe following lemma shows that we can use S_N to guarantee good approximation of ˆQ by ˆQ_N as long as the differences of partial derivatives are well-controlled for elements of S_N (it uses the fixed constants κ and τ for S_N; see Lemma 3.6).\n\nLemma 3.7. If for some ε > 0, F_N is strictly convex and\n\n|∇+_{D_{Q,ˆQ}} F(Q) − ∇+_{D_{Q,ˆQ}} F_N(Q)| ≤ N^(−1/2+ε) ∀Q ∈ S_N, (3.15)\n\nthen ˆQ_N is u.d. and\n\n‖ˆQ − ˆQ_N‖_F ≤ ((1 + 2α_0(τ + 1/κ) + 4R_µκτ)/α_0) N^(−1/2+ε). (3.16)\n\n3.7 Completing the Proof of Theorem 1.1\n\nLet us fix κ_0 = (4α_0) ∨ 2, τ_0 := (1 − 2α_0/κ_0)/(2α_0 + 4R_µκ_0) and N > 2(D − 1). We note that\n\n1 + 2α_0(τ_0 + 1/κ_0) + 4R_µκ_0τ_0 = 2. (3.17)\n\nWe rewrite (3.14) using κ := κ_0 and τ := τ_0 and then bound its RHS from above as follows:\n\n|S_N| ≤ 2((4α_0) ∨ 2) N^((D^2+D+1)/2) ( 10D·(2α_0 + 4R_µ((4α_0) ∨ 2)) / (1 − 2α_0/((4α_0) ∨ 2)) )^(D(D+1)/2) ≤ (C_0/2) N^(D^2). (3.18)\n\nCombining (3.6) (applied to any Q ∈ S_N) and (3.18), we obtain that\n\nP( ∃Q ∈ S_N with |∇+_{D_{Q,ˆQ}} F(Q) − ∇+_{D_{Q,ˆQ}} F_N(Q)| ≥ N^(−1/2+ε) ) ≤ C_0 N^(D^2) exp(−N^(2ε)/(D·R_µ^2)). (3.19)\n\nFurthermore, (3.1) and (3.19) imply that\n\nP( |∇+_{D_{Q,ˆQ}} F(Q) − ∇+_{D_{Q,ˆQ}} F_N(Q)| ≤ N^(−1/2+ε) ∀Q ∈ S_N and ˆQ_N is u.d. ) ≥ 1 − C_0 N^(D^2) exp(−N^(2ε)/(D·R_µ^2)) − 2 (N choose D−1)^2 (1 − γ)^(N−2(D−1)). (3.20)\n\nTheorem 1.1 clearly follows from Lemma 3.7 (applied with κ := κ_0 and τ := τ_0), (3.20) and (3.17).\n\nAcknowledgment\n\nThis work was supported by NSF grants DMS-09-15064 and DMS-09-56072. Part of this work was performed when M. Coudron attended the University of Minnesota (as an undergraduate student). We thank T. Zhang for valuable conversations and for forwarding us [22].\n\nReferences\n\n[1] A. Agarwal, S. Negahban, and M. Wainwright. Fast global convergence of gradient methods for high-dimensional statistical recovery. Technical Report arXiv:1104.4824, Apr 2011.\n\n[2] A. Agarwal, S. Negahban, and M. Wainwright. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. In ICML, pages 1129–1136, 2011.\n\n[3] T. T. 
Cai, C.-H. Zhang, and H. H. Zhou. Optimal rates of convergence for covariance matrix estimation. Ann. Statist., 38(4):2118–2144, 2010.\n\n[4] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 58(3):11, 2011.\n\n[5] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky. Rank-sparsity incoherence for matrix decomposition. Arxiv, 02139:1–24, 2009.\n\n[6] C. Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM J. on Numerical Analysis, 7:1–46, 1970.\n\n[7] D. Hsu, S. Kakade, and T. Zhang. Robust matrix decomposition with sparse corruptions. Information Theory, IEEE Transactions on, 57(11):7221–7234, Nov. 2011.\n\n[8] P. J. Huber and E. Ronchetti. Robust statistics. Wiley Series in Probability and Mathematical Statistics. Wiley, 2009.\n\n[9] G. Lerman, M. McCoy, J. A. Tropp, and T. Zhang. Robust computation of linear models, or How to find a needle in a haystack. ArXiv e-prints, Feb. 2012.\n\n[10] R. A. Maronna, R. D. Martin, and V. J. Yohai. Robust statistics: Theory and methods. Wiley Series in Probability and Statistics. John Wiley & Sons Ltd., Chichester, 2006.\n\n[11] M. McCoy and J. Tropp. Two proposals for robust PCA using semidefinite programming. Elec. J. Stat., 5:1123–1160, 2011.\n\n[12] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron. J. Stat., 5:935–980, 2011.\n\n[13] A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Electron. J. Stat., 2:494–515, 2008.\n\n[14] P. J. Rousseeuw and A. M. Leroy. Robust regression and outlier detection. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. John Wiley & Sons Inc., New York, 1987.\n\n[15] J. Shawe-Taylor, C. 
Williams, N. Cristianini, and J. Kandola. On the eigenspectrum of the Gram matrix and the generalisation error of kernel PCA. IEEE Transactions on Information Theory, 51(1):2510–2522, 2005.\n\n[16] L. G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134–1142, Nov. 1984.\n\n[17] R. Vershynin. How close is the sample covariance matrix to the actual covariance matrix? To appear.\n\n[18] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. C. Eldar and G. Kutyniok, editors, Compressed Sensing: Theory and Applications. Cambridge Univ Press, to appear.\n\n[19] H. Xu, C. Caramanis, and S. Sanghavi. Robust PCA via outlier pursuit. In NIPS, pages 2496–2504, 2010.\n\n[20] H. Xu, C. Caramanis, and S. Sanghavi. Robust PCA via outlier pursuit. Information Theory, IEEE Transactions on, PP(99):1, 2012.\n\n[21] T. Zhang and G. Lerman. A novel M-estimator for robust PCA. Submitted, available at arXiv:1112.4863.\n\n[22] T. Zhang and H. Zou. Sparse precision matrix estimation via positive definite constrained minimization of ℓ1 penalized d-trace loss. Personal communication, 2012.", "award": [], "sourceid": 1477, "authors": [{"given_name": "Matthew", "family_name": "Coudron", "institution": null}, {"given_name": "Gilad", "family_name": "Lerman", "institution": null}]}