{"title": "Exponentially convergent stochastic k-PCA without variance reduction", "book": "Advances in Neural Information Processing Systems", "page_first": 12393, "page_last": 12404, "abstract": "We present Matrix Krasulina, an algorithm for online k-PCA, by gen- eralizing the classic Krasulina\u2019s method (Krasulina, 1969) from vector to matrix case. We show, both theoretically and empirically, that the algorithm naturally adapts to data low-rankness and converges exponentially fast to the ground-truth principal subspace. Notably, our result suggests that despite various recent efforts to accelerate the convergence of stochastic-gradient based methods by adding a O(n)-time variance reduction step, for the k- PCA problem, a truly online SGD variant suffices to achieve exponential convergence on intrinsically low-rank data.", "full_text": "Exponentially convergent stochastic k-PCA without\n\nvariance reduction\n\nCheng Tang\nAmazon AI \u2217\n\nNew York, NY, 10001\n\ntcheng@amazon.com\n\nAbstract\n\nWe present Matrix Krasulina, an algorithm for online k-PCA, by generalizing\nthe classic Krasulina\u2019s method [1] from vector to matrix case. 
We show, both theoretically and empirically, that the algorithm naturally adapts to data low-rankness and converges exponentially fast to the ground-truth principal subspace. Notably, our result suggests that despite various recent efforts to accelerate the convergence of stochastic-gradient based methods by adding an O(n)-time variance reduction step, for the k-PCA problem, a truly online SGD variant suffices to achieve exponential convergence on intrinsically low-rank data.

1 Introduction

Principal Component Analysis (PCA) is ubiquitous in statistics, machine learning, and engineering alike: For a centered d-dimensional random vector X ∈ R^d, the k-PCA problem is defined as finding the "optimal" projection of the random vector into a subspace of dimension k so as to capture as much of its variance as possible; formally, we want to find a rank-k matrix W such that

max_{W ∈ R^{k×d}, WW^⊤ = I_k} Var(W^⊤ W X)

In the objective above, W^⊤W = W^⊤(WW^⊤)^{−1}W is an orthogonal projection matrix into the subspace spanned by the rows of W. Thus, the k-PCA problem seeks a matrix W whose row space captures as much variance of X as possible. 
This is equivalent to finding a projection into a subspace that minimizes the variance of the data outside of it:

min_{W ∈ R^{k×d}, WW^⊤ = I_k} E‖X − W^⊤W X‖²    (1.1)

Likewise, given a sample of n centered data points {X_i}_{i=1}^n, the empirical version of problem (1.1) is

min_{W ∈ R^{k×d}, WW^⊤ = I_k} (1/n) Σ_{i=1}^n ‖X_i − W^⊤W X_i‖²    (1.2)

The optimal k-PCA solution, the row space of the optimal W, can be used to represent high-dimensional data in a low-dimensional subspace (k ≪ d), since it preserves most variation from the original data. As such, it usually serves as the first step in exploratory data analysis or as a way to compress data before further operation.

The solutions to the nonconvex problems (1.1) and (1.2) are the subspaces spanned by the top k eigenvectors (also known as the principal subspace) of the population and empirical data covariance matrix, respectively. Although we do not have access to the population covariance matrix to directly solve (1.1), given a batch of samples {x_i}_{i=1}^n from the same distribution, we can find the solution to (1.2), which asymptotically converges to the population k-PCA solution [2].

∗ A major part of this work was done prior to the author joining Amazon, when she was a student at George Washington University.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
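To make the objectives concrete, here is a minimal NumPy sketch (ours, not part of the paper; the names `batch_k_pca` and `reconstruction_error` are hypothetical) that solves the empirical problem (1.2) exactly via SVD and evaluates its objective:

```python
import numpy as np

def batch_k_pca(X, k):
    """Exact solution of the empirical k-PCA problem (1.2) via SVD.

    X: (n, d) array of centered samples. Returns W of shape (k, d)
    with orthonormal rows spanning the top-k principal subspace."""
    # The top right-singular vectors of X are the top eigenvectors
    # of the empirical covariance (1/n) X^T X.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k]

def reconstruction_error(X, W):
    """Empirical objective (1.2): average squared residual after
    projecting each sample onto the row space of W."""
    P = W.T @ W  # orthogonal projector, since W has orthonormal rows
    R = X - X @ P
    return np.mean(np.sum(R ** 2, axis=1))
```

On data of exact rank k the optimal objective value is zero; on full-rank data it equals the sum of the bottom d − k eigenvalues of the empirical covariance.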
Different approaches exist to solve (1.2), depending on the nature of the data and the computational resources available:

SVD-based solvers When the data size is manageable, one can find the exact solution to (1.2) via a singular value decomposition (SVD) of the empirical data matrix in min{O(nd²), O(n²d)} time and O(nd) space, or, in the case of truncated SVD, in O(ndk) time (O(nd log k) for a randomized solver [3]).

Power method For large-scale datasets, that is, when both n and d are large, the full data may not fit in memory. The power method [4, p. 450] and its variants are popular alternatives in this scenario; they have less computational and memory burden than SVD-based solvers. The power method approximates the principal subspace iteratively: At every iteration, it computes the inner product between the algorithm's current solution and the n data vectors {x_i}_{i=1}^n, an O(nd_s)-time operation, where d_s is the average data sparsity. The power method converges exponentially fast [5]: To achieve ε accuracy, it has a total runtime of O(nd_s log(1/ε)). That is, the power method requires multiple passes over the full dataset.

Online (incremental) PCA In real-world applications, datasets might become so large that even executing a full data pass is impossible. Online learning algorithms are developed under an abstraction of this setup: They assume that data come from an "endless stream" and only process one data point (or a constant-sized batch) at a time. Online PCA algorithms mostly fall under two frameworks: 1. The online worst-case scenario, where the stream of data can have a non-stationary distribution [6–8]. 2. The stochastic scenario, where one has access to i.i.d. samples from an unknown but fixed distribution [5, 9–11].

In this paper, we focus on the stochastic setup: We show that a simple variant of stochastic gradient descent (SGD), which generalizes the classic Krasulina's algorithm from k = 1 to general k ≥ 1, can provably solve the k-PCA problem in Eq. (1.1) with an exponential convergence rate. It is worth noting that stochastic PCA algorithms, unlike batch-based solvers, can be used to optimize both the population PCA objective (1.1) and its empirical counterpart (1.2).

Oja's method and VR-PCA While SGD-type algorithms have iteration-wise runtime independent of the data size, their convergence rate, typically linear in the number of iterations, is significantly slower than that of batch gradient descent (GD). To speed up the convergence of SGD, the seminal work of Johnson and Zhang [12] initiated a line of effort in deriving Variance-Reduced (VR) SGD by cleverly mixing the stochastic gradient updates with occasional batch gradient updates. For convex problems, VR-SGD algorithms have a provable exponential convergence rate. Despite the non-convexity of the k-PCA problem, Shamir [5, 13] augmented Oja's method [14], a popular stochastic version of the power method, with the VR step, and showed both theoretically and empirically that the resulting VR-PCA algorithm achieves exponential convergence. However, since a single VR iteration requires a full pass over the dataset, VR-PCA is no longer an online algorithm.

Minimax lower bound In general, the tradeoff between convergence rate and iteration-wise computational cost is unavoidable in light of the minimax information lower bound [15, 16]: Let Δ_n (see Definition 1) denote the distance between the ground-truth rank-k principal subspace and the algorithm's estimated subspace after seeing n samples. 
Vu and Lei [15, Theorem 3.1] established that there exist data distributions (with full-rank covariance matrices) such that the following lower bound holds:

E[Δ_n] ≥ Ω(σ²/n) for σ² ≥ λ_1 λ_{k+1} / (λ_k − λ_{k+1})² ,    (1.3)

Here λ_k denotes the k-th largest eigenvalue of the data covariance matrix. This immediately implies an Ω(σ²/t) lower bound on the convergence rate of online k-PCA algorithms, since for online algorithms the number of iterations t equals the number of data samples n. Thus, a sub-linear convergence rate is impossible for online k-PCA algorithms on general data distributions.

1.1 Our result: escaping the minimax lower bound on intrinsically low-rank data

Despite the discouraging lower bound for online k-PCA, note that in Eq. (1.3), σ equals zero when the data covariance has rank less than or equal to k, and consequently the lower bound becomes un-informative. Does this imply that data low-rankness can be exploited to overcome the lower bound on the convergence rate of online k-PCA algorithms?

Our result answers the question affirmatively: Theorem 1 suggests that on low-rank data, an online k-PCA algorithm, namely, Matrix Krasulina (Algorithm 1), produces estimates of the principal subspace that converge to the ground-truth in order O(exp(−Ct)), where t is the number of iterations (the number of samples seen) and C is a constant. 
Our key insight is that Krasulina's method [1], in contrast to its better-studied cousin Oja's method [14], is stochastic gradient descent with a self-regulated gradient for the PCA problem, and that when the data is of low rank, the gradient variance vanishes as the algorithm's performance improves.

2 Preliminaries

We consider the following online stochastic learning setting: At time t ∈ N \ {0}, we receive a random vector X^t ∈ R^d drawn i.i.d. from an unknown centered probability distribution with a finite second moment. We denote by X a generic random sample from this distribution. Our goal is to learn W ∈ R^{k′×d} so as to optimize the objective in Eq. (1.1).

Notations We let Σ* denote the covariance matrix of X, Σ* := E[XX^⊤]. We let {u_i}_{i=1}^k denote the top k eigenvectors of the covariance matrix Σ*, corresponding to its largest k eigenvalues, λ_1 ≥ · · · ≥ λ_k. Given that Σ* has rank r, we can represent it as Σ* := Σ_{i=1}^r λ_i u_i u_i^⊤. We let U* := Σ_{i=1}^k u_i u_i^⊤; that is, U* is the orthogonal projection matrix into the subspace spanned by {u_i}_{i=1}^k. For any integer p > 0, we let I_p denote the p-by-p identity matrix. We denote by ‖·‖_F the Frobenius norm and by tr(·) the trace operator. For two square matrices A and B of the same dimension, we write A ⪰ B if A − B is positive semidefinite. We use curly capitalized letters such as G to denote events. For an event G, we denote by 1_G its indicator random variable; that is, 1_G = 1 if event G occurs and 0 otherwise.

Optimizing the empirical objective We remark that our setup and theoretical results apply not only to the optimization of the population k-PCA problem (1.1) in the infinite data stream scenario, but also to the empirical version (1.2): Given a finite dataset, we can simulate the stochastic optimization setup by sampling uniformly at random from it. This is, for example, the setup adopted by Shamir [13, 5].

Assumptions In the analysis of our main result, we assume that Σ* has low rank and that the data norm is bounded almost surely; that is, there exist b and k such that

P( sup_X ‖X‖² > b ) = 0 and rank(Σ*) = k    (2.4)

2.1 Oja and Krasulina

In this section, we introduce two classic online algorithms for 1-PCA, Oja's method and Krasulina's method.

Oja's method Let w^t ∈ R^d denote the algorithm's estimate of the top eigenvector of Σ* at time t. Then, letting η_t denote the learning rate and X be a random sample, Oja's algorithm has the following update rule:

w^t ← w^{t−1} + η_t (XX^⊤ w^{t−1}) and w^t ← w^t / ‖w^t‖

We see that Oja's method is a stochastic approximation algorithm to the power method. 
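In code, one Oja step looks like the following (a hedged sketch of ours; `oja_step` is not a name from the paper):

```python
import numpy as np

def oja_step(w, x, eta):
    """One iteration of Oja's method for 1-PCA: take a stochastic
    power-method-like step w + eta * (x x^T) w, then renormalize."""
    w = w + eta * x * (x @ w)
    return w / np.linalg.norm(w)
```

Iterating this on i.i.d. samples drives w toward the top eigenvector u_1 of Σ*.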
For k > 1, Oja's method can be generalized straightforwardly, by replacing w^t with a matrix W^t ∈ R^{k′×d} and by replacing the normalization step with row orthonormalization, for example, via QR factorization.

Krasulina's method Krasulina's update rule is similar to Oja's update but has an additional term:

w^t ← w^{t−1} + η_t ( XX^⊤ w^{t−1} − w^{t−1} (X^⊤ w^{t−1} / ‖w^{t−1}‖)² )

In fact, this is stochastic gradient descent on the objective function below, which is equivalent to Eq. (1.1):

E‖ X − (w^t (w^t)^⊤ / ‖w^t‖²) X ‖²

2.2 Gradient variance in Krasulina's method

Our key observation of Krasulina's method is as follows: Let w̃^t := w^t / ‖w^t‖; Krasulina's update can be re-written as

w^t ← w^{t−1} + ‖w^{t−1}‖ η_t ( XX^⊤ w̃^{t−1} − w̃^{t−1} (X^⊤ w̃^{t−1})² )

Let

s^t := (w̃^t)^⊤ X    (projection coefficient)

and

r^t := X^⊤ − s^t (w̃^t)^⊤ = X^⊤ − (w̃^t)^⊤ X (w̃^t)^⊤    (projection residual)

Then Krasulina's algorithm can be further written as:

w^t ← w^{t−1} + ‖w^{t−1}‖ η_t s^{t−1} (r^{t−1})^⊤

The variance of the stochastic gradient term can be upper bounded as:

‖w^{t−1}‖² Var( s^{t−1} (r^{t−1})^⊤ ) ≤ ‖w^{t−1}‖² sup_X ‖X‖² E‖r^{t−1}‖²

Note that

E‖r^t‖² = E‖ X − (w^t (w^t)^⊤ / ‖w^t‖²) X ‖²

This reveals that the variance of the gradient naturally decays as Krasulina's method decreases the k-PCA optimization objective. 
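This self-regulating gradient can be sketched in a few lines (our illustration, with a hypothetical name; the s/r decomposition mirrors the projection coefficient and residual above):

```python
import numpy as np

def krasulina_step(w, x, eta):
    """One Krasulina iteration: SGD on E||X - (w w^T / ||w||^2) X||^2.

    The stochastic gradient is proportional to s * r, with s the
    projection coefficient and r the projection residual; on low-rank
    data ||r|| shrinks as w approaches the principal subspace, so the
    gradient variance vanishes without any explicit variance reduction."""
    w_unit = w / np.linalg.norm(w)
    s = x @ w_unit            # projection coefficient s^t
    r = x - s * w_unit        # projection residual r^t
    return w + eta * np.linalg.norm(w) * s * r
```

On strictly rank-1 data the residual r vanishes at the optimum, so the iterates converge without the averaging or batch corrections that variance-reduced methods use.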
Intuitively, as the algorithm's estimated (one-dimensional) subspace w^t gets closer to the ground-truth direction u_1, (w^t)^⊤X will capture more and more of X's variance, and E‖r^t‖² eventually vanishes. In our analysis, we take advantage of this observation to prove the exponential convergence rate of Krasulina's method on low-rank data.

3 Related Works

Stochastic optimization for PCA Theoretical guarantees for stochastic optimization traditionally require convexity [17]. However, many modern machine learning problems, especially those arising from deep learning and unsupervised learning, are non-convex; PCA is one of them: The objective in (1.1) is non-convex in W. Despite this, a series of recent theoretical works have proven stochastic optimization to be effective for PCA, mostly via variants of Oja's method [18, 19, 13, 5, 20, 21, 9]. Krasulina's method [1] was much less studied than Oja's method; a notable exception is the work of Balsubramani et al. [9], which proved an O(1/t) rate in expectation for both Oja's and Krasulina's algorithms for 1-PCA. We also noticed a recent pre-print [22] that analyzes Krasulina's algorithm and establishes O(1/√t) convergence with high probability.

Stochastic optimization for k-PCA There have been very few theoretical analyses of stochastic k-PCA algorithms with k > 1, with the exception of Allen-Zhu and Li [18], Shamir [13], Balcan et al. [23], and Li et al. [24]. All had focused on variants of Oja's algorithm, among which Shamir [13] was the only previous work, to the best of our knowledge, that provided a local exponential convergence rate guarantee of Oja's algorithm for k ≥ 1. 
Their result holds for general data distributions, but their variant of Oja's algorithm, VR-PCA, requires several full passes over the dataset, and is thus not fully online.

Algorithm 1 Matrix Krasulina
Input: initial matrix W^o ∈ R^{k′×d}; learning rate schedule (η_t); number of iterations T;
while t ≤ T do
  1. Sample X^t i.i.d. from the data distribution
  2. Orthonormalize the rows of W^{t−1} (e.g., via QR factorization)
  3. W^t ← W^{t−1} + η_t W^{t−1}X^t (X^t − (W^{t−1})^⊤ W^{t−1} X^t)^⊤
end while
Output: W^T

The machine learning and computer science community has studied the PCA problem without imposing strong assumptions on the data. A typical assumption would be a gap in the eigenvalues [9, 21, 13, 24, 23]; the recent work of Allen-Zhu and Li [18] has even removed this assumption. The subspace tracking literature has approached the problem from a different angle (see Balzano et al. [25] for an overview).

Subspace tracking Under a generative model assumption, X = Ū s, where Ū ∈ R^{d×k} is a basis of a k-dimensional subspace and Cov(s) = I_k [26, Condition 1], Zhang and Balzano [26] established the global convergence (in a different subspace distance metric than ours) of a subspace tracking algorithm called GROUSE [27]. However, no theoretical guarantee on GROUSE was provided without the generative model assumption, which also implies data low-rankness.

Our Theorem 1 shows that when the data is of low rank, Matrix Krasulina can achieve local exponential convergence, and Theorem 2 shows that without the low-rank assumption Matrix Krasulina has an O(1/t) local convergence rate. 
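For concreteness, the loop of Algorithm 1 might be implemented as follows (a NumPy sketch under our own naming, not the authors' code; `sample` is an assumed callback that draws one data vector):

```python
import numpy as np

def matrix_krasulina(sample, W0, eta, T, rng):
    """Sketch of Algorithm 1 (Matrix Krasulina) with a constant
    learning rate eta. W0 is the (k', d) initial matrix; the rows of
    the returned W are not re-orthonormalized."""
    W = W0.copy()
    for _ in range(T):
        # Step 2: orthonormalize the rows of W^{t-1} via QR of W^T.
        Q, _ = np.linalg.qr(W.T)
        W = Q.T
        x = sample(rng)               # Step 1: draw X^t i.i.d.
        s = W @ x                     # projection coefficients s^t
        r = x - W.T @ s               # projection residual r^t
        W = W + eta * np.outer(s, r)  # Step 3: Krasulina update
    return W
```

On strictly rank-k data with k′ = k, the row space of W converges to the principal subspace; the subspace distance Δ^t of Definition 1 can be used to monitor progress.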
Our Theorem 3 provides preliminary results on how to make the convergence global.

4 Main results

Generalizing the vector w^t ∈ R^d to a matrix W^t ∈ R^{k′×d} as the algorithm's estimate at time t, we derive Matrix Krasulina (Algorithm 1), so that the row space of W^t converges to the k-dimensional subspace spanned by {u_1, . . . , u_k}.

4.1 Matrix Krasulina

Inspired by the original Krasulina's method, we design the following update rule for the Matrix Krasulina (Algorithm 1): Let

s^t := W^{t−1}X^t and r^t := X^t − (W^{t−1})^⊤(W^{t−1}(W^{t−1})^⊤)^{−1}W^{t−1}X^t ,

Since we impose an orthonormalization step in Algorithm 1, r^t simplifies to

r^t := X^t − (W^{t−1})^⊤W^{t−1}X^t ,

Then the update rule of Matrix Krasulina can be re-written as

W^t ← W^{t−1} + η_t s^t (r^t)^⊤ ,

For k′ = 1, this reduces to Krasulina's update with ‖w^t‖ = 1. The self-regulating variance argument for the original Krasulina's method still holds; that is, we have

E‖s^t (r^t)^⊤‖² ≤ b E‖r^t‖² = b E‖X − (W^t)^⊤W^t X‖² ,

where b is as defined in Eq. (2.4). We see that the last term coincides with the objective in Eq. (1.1).

4.1.1 Loss measure

Given the algorithm's estimate W^t at time t, we let P^t denote the orthogonal projection matrix into the subspace spanned by its rows, {W^t_{i,⋆}}_{i=1}^{k′}, that is,

P^t := (W^t)^⊤(W^t(W^t)^⊤)^{−1}W^t = (W^t)^⊤W^t ,

In our analysis, we use the following loss measure to track the evolution of W^t:

Definition 1 (Subspace distance). Let S and Ŝ^t be the ground-truth principal subspace and its estimate by Algorithm 1 at time t, with orthogonal projectors U* and P^t, respectively. 
We define the subspace distance between S and Ŝ^t as Δ^t := tr(U*(I − P^t)) = k − tr(U*P^t).

Note that Δ^t in fact equals the sum of squared canonical angles between S and Ŝ^t, and coincides with the subspace distance measure used in related theoretical analyses of k-PCA algorithms [18, 13, 15].

4.2 Convergence rates of Matrix Krasulina

This section presents our convergence results; proofs are deferred to the Appendix. Our first theorem proves the exponential convergence rate of Matrix Krasulina measured by Δ^t.

Theorem 1 (Exponential convergence with constant learning rate). Suppose assumption Eq. (2.4) holds. Suppose the initial estimate W^o ∈ R^{k′×d} (k′ ≥ k) in Algorithm 1 satisfies, for some τ ∈ (0, 1),

Δ^o ≤ 1 − τ ,

Suppose for any δ > 0, we choose a constant learning rate η_t = η such that

η ≤ min{ (√2 − 1)/b , λ_k τ / (λ_1 b(k + 3)) , 2λ_k τ / ( (8/(1 − τ)) ln(1/δ) (b + ‖Σ*‖_F)² + b(k + 1)λ_1 ) } ,

Then there exists an event G^t such that P(G^t) ≥ 1 − δ, and

E[Δ^t | G^t] ≤ (1/(1 − δ)) exp(−t η τ λ_k) .

On a high level, Theorem 1 is proved in the following steps (all proofs are deferred to the Appendix):

In Section A.2 We show that if the algorithm's iterates W^t stay inside the basin of attraction, which we formally define as the event G^t := {Δ^i ≤ 1 − τ, ∀i ≤ t}, then a suitable transformation of the stochastic process (Δ^t) forms a supermartingale.

In Section A.3 Using a martingale concentration inequality, we show that, provided a good initialization, it is likely that the algorithm's outputs W^1, . . . , W^t stay inside the basin of attraction.

In Section A.4 We show that at each iteration t, conditioning on G^t, Δ^{t+1} ≤ βΔ^t for some β < 1 if we set the learning rate η_t to be a properly chosen constant.

In Section D We iteratively apply this recurrence relation to prove Theorem 1.

From Theorem 1, we observe that (a) the convergence rate of Algorithm 1 on strictly low-rank data does not depend on the data dimension d, but only on the intrinsic dimension k; this is verified by our experiments (see Sec. 5). (b) We see that the learning rate should be of order O(λ_k/(kλ_1)): Empirically, we found that setting η to be roughly 1/(10λ_1) gives us the best convergence result. Note, however, that this learning rate setup is not practical since it requires knowledge of the eigenvalues.

Comparison between Theorem 1 and Shamir [13, Theorem 1] (1) The result in Shamir [13] does not rely on the low-rank assumption on Σ*. Since the variance of the update in Oja's method is not naturally decaying, they use the VR technique inspired by Johnson and Zhang [12] to reduce the variance of the algorithm's iterate, which is computationally heavy: the block version of VR-PCA converges at rate O(exp(−CT)), where T denotes the number of data passes. (2) Our result has a similar learning-rate dependence on the data norm bound b as that of Shamir [13, Theorem 1]. (3) The initialization requirement in Theorem 1 is comparable to Shamir [13, Theorem 1]. 
(4) Conditioning on the event of successful convergence, their exponential convergence rate holds deterministically, whereas our convergence rate guarantee holds in expectation.

While our main focus is on taking advantage of low-rank data, the next theorem shows that on full-rank datasets, if we tune the learning rate to decay at order O(1/t), then the algorithm achieves O(1/t) convergence.

Algorithm 2 Warm-start with Matrix Krasulina
Input: Epoch budget N; inner loop budget T; learning rate η; number of rows in initial matrix k′; shrinkage factor ρ;
Set k_o ← k′ and initialize W^o_o ∈ R^{k_o×d} with entries (W^o_o)_{ij} ∼ N(0, 1).
while i < N do
  while t < T do
    Update W^{t+1}_i ← W^t_i by running a Matrix Krasulina iteration with learning rate schedule η.
  end while
  Set k_{i+1} ← k_i(1 − ρ), and construct W^o_{i+1} by randomly sampling k_{i+1} rows from W^{T−1}_i.
end while

Theorem 2 (Linear convergence on full-rank data). Suppose P(sup ‖X‖² > b) = 0. Suppose the initial estimate W^o ∈ R^{k×d} in Algorithm 1 satisfies Δ^o ≤ (1 − τ)/2, for some τ ∈ (0, 1). Let the learning rate schedule be η_t = c/(t_o + t), for some constants c, t_o, and let B := max( 8(b + ‖Σ*‖_F)²k, (kb + 2cb² + c²b³)λ_1(d − k) ). If we choose c, t_o such that

c ≥ 1/((λ_k − λ_{k+1})τ) and t_o ≥ max{ 64Bc² ln(1/δ) / (Δ^o)² , 1 } ,

Then for any δ ∈ (0, 1/e), there exists an event G^t such that P(G^t) ≥ 1 − δ, and E[Δ^t | G^t] ≤ O(1/t).

Theorem 2 generalizes the result of [9], where a linear convergence rate of Krasulina's algorithm is established for the 1-PCA problem on full-rank data. The linear convergence rate on full-rank data matches that of the minimax lower bound in Eq. (1.3) up to constants (note that here the initialization condition is stricter than in Theorem 1; whether this is an artifact of our analysis is left to future work).

4.3 Random initialization guarantee of W^o

Theorem 1 focuses on the convergence rate of Matrix Krasulina from a good initialization point. Next, we show that if we are willing to use k′ > k rows in W^o, then randomly initializing the weights in W^o is sufficient to guarantee the initialization requirement, Δ^o ≤ 1 − τ.

Theorem 3 (Success guarantee of an over-complete initialization). Let ε, t > 0 be any constants. If we choose k′ ≥ ((1 + t)/(1 − ε)) (1 − (1 − τ)/k) d, then with probability at least 1 − 2k exp(−(ε² − ε³)k′/4) − (k′ + 1)/(dt²),

Δ^o ≤ 1 − τ .

The proof is a simple application of Lemma 4 in Section F.

How large should k′ be given d, k? In the special case of k = 1, we can choose τ = (1 − ε)/(d(1 + t)), and then we get the lower bound k′ ≥ 1. We need to choose a larger k′ as the intrinsic rank k gets larger. In general, k′ is of order Ω(d − d/k).

A phase-wise warm start with over-complete random initialization As seen from the previous discussion, using vanilla random initialization is only reasonable if the ratio k/d is small. Otherwise, the number of rows k′ in W^o can almost be as large as d. To deal with this drawback, inspired by Oja++ of Allen-Zhu and Li [18], we propose a warm-start strategy, Algorithm 2. The main insight is captured by the following lemma:

Lemma 1. 
For any i > 0, at the end of the i-th epoch of Algorithm 2,

E[tr(U* P(W^o_i))] ≥ (k_i / k_{i−1}) E[tr(U* P(W^{T−1}_{i−1}))]

Note that the error of the first iterate at the i-th epoch is Δ(W^o_i) = k − tr(U* P(W^o_i)). So Lemma 1 quantifies how much the error is increased between the last iterate of epoch i − 1 and the first iterate of epoch i, due to the row-sampling step at the end of the (i − 1)-th epoch. Based on Lemma 1, the intuition of why Algorithm 2 works is as follows: At the initial epoch, we choose k_o to be large enough to satisfy the condition of Theorem 3. Then Theorem 1 implies that after T inner-loop iterations, the expected error will decrease in order O(exp(−ητλ_k T)); now if we choose a suitable shrinkage factor ρ, then we can guarantee that the error of W^o_1, although larger than that of W^{T−1}_o, will still satisfy the condition in Theorem 3. Thus, applying Theorem 1, we can decrease the error of W^o_1 rapidly again, and so on. Eventually, after O(log(k_o/k)) epochs, we will obtain a matrix with O(k) rows, while satisfying the condition of Theorem 3. We leave the formal analysis and empirical evaluation of Algorithm 2 to future work.

4.4 Open question: extending our result to effectively low-rank data

Many real-world datasets are not strictly low-rank, but effectively low-rank (see, for example, Figure 2): Informally, we say a dataset is effectively low-rank if there exists k ≪ d such that Σ_{i>k} λ_i / Σ_{j≤k} λ_j is small. We conjecture that our analysis can be adapted to show theoretical guarantees for Algorithm 1 on effectively low-rank datasets as well. In Section 5, our empirical results support this conjecture. 
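The ratio in this informal definition is straightforward to compute from a covariance spectrum; a small sketch (our helper name, not the paper's):

```python
import numpy as np

def noise_over_signal(eigvals, k):
    """sum_{i>k} lambda_i / sum_{j<=k} lambda_j for a covariance
    spectrum; the eigenvalues are sorted in decreasing order first.
    The ratio is zero exactly when the matrix has rank at most k."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    return lam[k:].sum() / lam[:k].sum()
```

A small value of this ratio at some k ≪ d is what we mean by "effectively low-rank".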
Formally characterizing the dependence of the convergence rate on the "effective low-rankness" of a dataset could provide a smooth transition between the linear convergence lower bound [15] and our result in Theorem 1.

5 Experiments

In this section, we present our empirical evaluation of Algorithm 1 to understand its convergence properties on low-rank or effectively low-rank datasets². We first verified its performance on simulated low-rank and effectively low-rank data, and then we evaluated it on two real-world effectively low-rank datasets.

[Figure 1: log-convergence graph of Algorithm 1, ln(Δ^t) vs. t, at different levels of the noise-over-signal ratio Σ_{i>k} λ_i / Σ_{j≤k} λ_j; panels show k ∈ {1, 10, 50} for d = 100 and d = 500.]

5.1 Simulations

The low-rank data is generated as follows: we sample i.i.d. standard normal values on the first k coordinates of the d-dimensional data (the rest d − k coordinates are zero), then we rotate all data using a random orthogonal matrix (unknown to the algorithm).

² Code will be available at https://github.com/chengtang48/neurips19.

[Figure 2: the top 6 eigenvalues explain 80% of the data variance. Panels: MNIST (d = 784; k′ = 44) and VGG (d = 2304; k′ = 6); the red vertical line marks a full pass over the dataset.]

Simulating effectively low-rank data In practice, hardly any dataset is strictly low-rank, but many datasets have sharply decaying spectra (recall Figure 2). Although our Theorem 1 is developed under a strict low-rankness assumption, here we empirically test the robustness of our convergence result when the data is not strictly low-rank but only effectively low-rank. Let λ_1 ≥ · · · ≥ λ_d ≥ 0 be the spectrum of a covariance matrix. For a fixed k ∈ [d], we let noise-over-signal := Σ_{i>k} λ_i / Σ_{j≤k} λ_j. The noise-over-signal ratio intuitively measures how "close" the matrix is to a rank-k matrix: The smaller the number is, the sharper the spectral decay; when the ratio equals zero, the matrix is of rank at most k. In our simulated data, we perturb the spectrum of a strictly rank-k covariance matrix and generate data with full-rank covariance matrices at the following noise-over-signal ratios, {0, 0.01, 0.1, 0.5}.

Results Figure 1 shows the log-convergence graph of Algorithm 1 on our simulated data: We initialized Algorithm 1 with a random matrix W^o and ran it for one or a few epochs, each consisting of 5000 iterations. (1) We verified that, on strictly low-rank data (noise-over-signal = 0), the algorithm indeed has an exponential convergence rate (linear in log-error); (2) as we increase the noise-over-signal ratio, the convergence rate gradually becomes slower; (3) the convergence rate is not affected by the actual data dimension d, but only by the intrinsic dimension k, as predicted by Theorem 1.

5.2 Real effectively low-rank datasets

We take a step further to test the performance of Algorithm 1 on two real-world datasets: VGG [28] is a dataset of 10806 image files from 2622 distinct celebrities crawled from the web, with d = 2304. For MNIST [29], we use the 60000 training examples of digit pixel images, with d = 784. Both datasets are full-rank, but we choose k′ such that the noise-over-signal ratio at k′ is 0.25; that is, the top k′ eigenvalues explain 80% of the data variance. 
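The choice of k′ just described (the smallest k′ whose noise-over-signal ratio is at most 0.25, i.e., at least 80% of variance explained) can be sketched as follows (our helper, hedged; not the authors' code):

```python
import numpy as np

def smallest_k_for_ratio(eigvals, target=0.25):
    """Smallest k such that the noise-over-signal ratio of the sorted
    spectrum, tail(k) / head(k), is at most `target`; target = 0.25
    means the top-k eigenvalues explain at least 80% of the variance."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    total = lam.sum()
    head = 0.0
    for k in range(1, len(lam) + 1):
        head += lam[k - 1]
        if total - head <= target * head:  # tail <= target * head
            return k
    return len(lam)
```

Applied to the empirical spectra of MNIST and VGG, this rule would reproduce the k′ values reported above (44 and 6, respectively), assuming the same preprocessing.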
We compare Algorithm 1 against the exponentially convergent VR-PCA: we initialize the algorithms with the same random matrix and we train (repeated 5 times) using the best constant learning rate we found empirically for each algorithm. We see that Algorithm 1 retains fast convergence even if the datasets are not strictly low-rank, and that it has a clear advantage over VR-PCA before the iteration count reaches a full pass; indeed, VR-PCA requires a full pass over the dataset before its first iterate.

Acknowledgments

Cheng would like to thank all anonymous reviewers and the meta-reviewer for providing insightful feedback on improving the quality of this paper. Cheng is very grateful to her PhD advisor, Claire Monteleoni, for her kind encouragement in pursuing this project, and to Amazon Web Services for various support.

References

[1] T.P. Krasulina. The method of stochastic approximation for the determination of the least eigenvalue of a symmetrical matrix. USSR Computational Mathematics and Mathematical Physics, 9(6):189–195, 1969. ISSN 0041-5553. doi: https://doi.org/10.1016/0041-5553(69)90135-9.

[2] Andreas Loukas. How close are the eigenvectors of the sample and actual covariance matrices? In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2228–2237, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/loukas17a.html.

[3] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev., 53(2):217–288, May 2011. ISSN 0036-1445. doi: 10.1137/090771806. URL http://dx.doi.org/10.1137/090771806.

[4] Gene H. Golub and Charles F. Van Loan. Matrix Computations (3rd Ed.). 
Johns Hopkins University Press, Baltimore, MD, USA, 1996. ISBN 0-8018-5414-8.

[5] Ohad Shamir. A stochastic PCA and SVD algorithm with an exponential convergence rate. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pages 144–152. JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045135.

[6] Jiazhong Nie, Wojciech Kotlowski, and Manfred K. Warmuth. Online PCA with optimal regret. Journal of Machine Learning Research, 17(173):1–49, 2016.

[7] Christos Boutsidis, Dan Garber, Zohar Karnin, and Edo Liberty. Online principal components analysis. In Proceedings of the Twenty-sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '15, pages 887–901, Philadelphia, PA, USA, 2015. Society for Industrial and Applied Mathematics. URL http://dl.acm.org/citation.cfm?id=2722129.2722190.

[8] Manfred K. Warmuth and Dima Kuzmin. Randomized PCA algorithms with regret bounds that are logarithmic in the dimension. In Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS'06, pages 1481–1488, Cambridge, MA, USA, 2006. MIT Press. URL http://dl.acm.org/citation.cfm?id=2976456.2976642.

[9] Akshay Balsubramani, Sanjoy Dasgupta, and Yoav Freund. The fast convergence of incremental PCA. In Advances in Neural Information Processing Systems 26, pages 3174–3182, Lake Tahoe, Nevada, United States, 2013.

[10] Ioannis Mitliagkas, Constantine Caramanis, and Prateek Jain. Memory limited, streaming PCA. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pages 2886–2894, USA, 2013. Curran Associates Inc.
URL http://dl.acm.org/citation.cfm?id=2999792.2999934.

[11] Raman Arora, Andrew Cotter, and Nathan Srebro. Stochastic optimization of PCA with capped MSG. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pages 1815–1823, USA, 2013. Curran Associates Inc. URL http://dl.acm.org/citation.cfm?id=2999792.2999815.

[12] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1, NIPS'13, pages 315–323, USA, 2013. Curran Associates Inc. URL http://dl.acm.org/citation.cfm?id=2999611.2999647.

[13] Ohad Shamir. Fast stochastic algorithms for SVD and PCA: Convergence properties and convexity. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pages 248–256. JMLR.org, 2016. URL http://dl.acm.org/citation.cfm?id=3045390.3045418.

[14] Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267–273, Nov 1982. ISSN 1432-1416.

[15] Vincent Q. Vu and Jing Lei. Minimax sparse principal subspace estimation in high dimensions. The Annals of Statistics, 41(6):2905–2947, 2013. ISSN 0090-5364. URL http://www.jstor.org/stable/23566753.

[16] Vincent Vu and Jing Lei. Minimax rates of estimation for sparse PCA in high dimensions. In Neil D. Lawrence and Mark Girolami, editors, Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pages 1278–1286, La Palma, Canary Islands, 21–23 Apr 2012. PMLR. URL http://proceedings.mlr.press/v22/vu12.html.

[17] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Stochastic convex optimization.
In COLT 2009 - The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, June 18-21, 2009. URL http://www.cs.mcgill.ca/~colt2009/papers/018.pdf#page=1.

[18] Z. Allen-Zhu and Y. Li. First efficient convergence for streaming k-PCA: A global, gap-free, and near-optimal rate. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 487–492, Oct 2017. doi: 10.1109/FOCS.2017.51.

[19] Ohad Shamir. Convergence of stochastic gradient descent for PCA. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pages 257–265. JMLR.org, 2016. URL http://dl.acm.org/citation.cfm?id=3045390.3045419.

[20] C. De Sa, K. Olukotun, and C. Ré. Global convergence of stochastic gradient descent for some non-convex matrix problems. ArXiv e-prints, November 2014.

[21] Moritz Hardt and Eric Price. The noisy power method: A meta algorithm with applications. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2861–2869. Curran Associates, Inc., 2014.

[22] J. Chen. Convergence of Krasulina scheme. ArXiv e-prints, August 2018.

[23] Maria-Florina Balcan, Simon Shaolei Du, Yining Wang, and Adams Wei Yu. An improved gap-dependency analysis of the noisy power method. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 284–309, Columbia University, New York, New York, USA, 23–26 Jun 2016. PMLR. URL http://proceedings.mlr.press/v49/balcan16a.html.

[24] Chun-Liang Li, Hsuan-Tien Lin, and Chi-Jen Lu. Rivalry of two families of algorithms for memory-restricted streaming PCA. In Arthur Gretton and Christian C.
Robert, editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 473–481, Cadiz, Spain, 09–11 May 2016. PMLR. URL http://proceedings.mlr.press/v51/li16b.html.

[25] L. Balzano, Y. Chi, and Y. M. Lu. Streaming PCA and subspace tracking: The missing data case. Proceedings of the IEEE, 106(8):1293–1310, 2018. doi: 10.1109/JPROC.2018.2847041.

[26] Dejiao Zhang and Laura Balzano. Global convergence of a Grassmannian gradient descent algorithm for subspace estimation. In AISTATS, 2015.

[27] Laura Balzano, Robert D. Nowak, and Benjamin Recht. Online identification and tracking of subspaces from highly incomplete information. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 704–711, 2010.

[28] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference, 2015.

[29] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.