{"title": "Gen-Oja: Simple & Efficient Algorithm for Streaming Generalized Eigenvector Computation", "book": "Advances in Neural Information Processing Systems", "page_first": 7016, "page_last": 7025, "abstract": "In this paper, we study the problems of principal Generalized Eigenvector computation and Canonical Correlation Analysis in the stochastic setting. We propose a simple and efficient algorithm for these problems. We prove the global convergence of our algorithm, borrowing ideas from the theory of fast-mixing Markov chains and two-time-scale stochastic approximation, showing that it achieves the optimal rate of convergence. In the process, we develop tools for understanding stochastic processes with Markovian noise which might be of independent interest.", "full_text": "Gen-Oja: A Simple and Efficient Algorithm for Streaming Generalized Eigenvector Computation\n\nKush Bhatia\u2217\nUniversity of California, Berkeley\nkushbhatia@berkeley.edu\n\nNicolas Flammarion\nUniversity of California, Berkeley\nflammarion@berkeley.edu\n\nAldo Pacchiano\u2217\nUniversity of California, Berkeley\npacchiano@berkeley.edu\n\nPeter L. Bartlett\nUniversity of California, Berkeley\npeter@berkeley.edu\n\nMichael I. Jordan\nUniversity of California, Berkeley\njordan@cs.berkeley.edu\n\nAbstract\n\nIn this paper, we study the problems of principal Generalized Eigenvector computation and Canonical Correlation Analysis in the stochastic setting. We propose a simple and efficient algorithm, Gen-Oja, for these problems. We prove the global convergence of our algorithm, borrowing ideas from the theory of fast-mixing Markov chains and two-time-scale stochastic approximation, showing that it achieves the optimal rate of convergence. 
In the process, we develop tools for understanding stochastic processes with Markovian noise which might be of independent interest.\n\n1 Introduction\n\nCanonical Correlation Analysis (CCA) and the Generalized Eigenvalue Problem are two fundamental problems in machine learning and statistics, widely used for feature extraction in applications including regression [18], clustering [9] and classification [19].\nOriginally introduced by Hotelling in [16], CCA is a statistical tool for the analysis of multi-view data that can be viewed as a \u201ccorrelation-aware\" version of Principal Component Analysis (PCA). Given two multidimensional random variables, the objective in CCA is to obtain a pair of linear transformations that maximize the correlation between the transformed variables.\nGiven access to samples {(xi, yi)}_{i=1}^n of zero-mean random variables X, Y ∈ Rd with an unknown joint distribution PXY, CCA can be used to discover features expressing similarity or dissimilarity between X and Y. Formally, CCA aims to find a pair of vectors u, v ∈ Rd such that the projections of X onto v and Y onto u are maximally correlated. 
In the population setting, the corresponding objective is given by:\n\nmax v⊤E[XY⊤]u s.t. v⊤E[XX⊤]v = 1 and u⊤E[YY⊤]u = 1.  (1)\n\n\u2217Equal contribution.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nIn the context of covariance matrices, the objective of the generalized eigenvalue problem is to obtain the direction u or v ∈ Rd maximizing the discrepancy between X and Y, and can be formulated as\n\narg max_{v≠0} (v⊤E[XX⊤]v)/(v⊤E[YY⊤]v) and arg max_{u≠0} (u⊤E[YY⊤]u)/(u⊤E[XX⊤]u).  (2)\n\nMore generally, given symmetric matrices A, B, with B positive definite, the objective of the principal generalized eigenvector problem is to obtain a unit norm vector w such that Aw = λBw for λ maximal.\nCCA and the generalized eigenvalue problem are intimately related. In fact, the CCA problem can be cast as a special case of the generalized eigenvalue problem by solving for u and v in the following objective:\n\n[0, E[XY⊤]; E[YX⊤], 0] (v; u) = λ [E[XX⊤], 0; 0, E[YY⊤]] (v; u),  (3)\n\nwhere the left block matrix is denoted A and the right one B.\nThe optimization problems underlying both CCA and the generalized eigenvector problem are non-convex in general. While they admit closed-form solutions, even in the offline setting a direct computation requires O(d³) flops, which is infeasible for large-scale datasets. 
Recently, there has been work on solving these problems by leveraging fast linear system solvers [14, 2], while requiring complete knowledge of the matrices A and B.\nIn the stochastic setting, the difficulty increases because the objective is to maximize a ratio of expectations, in contrast to the standard setting of stochastic optimization [26], where the objective is the maximization of an expectation. There has been recent interest in understanding and developing efficient algorithms with provable convergence guarantees for such non-convex problems. [17] and [27] recently analyzed the convergence rate of Oja's algorithm [25], one of the most commonly used algorithms for streaming PCA.\nIn contrast, for the stochastic generalized eigenvalue problem and CCA problem, the focus has been to translate algorithms from the offline setting to the online one. For example, [12] proposes a streaming algorithm for the stochastic CCA problem which utilizes a streaming SVRG method to solve an online least-squares problem. Despite being streaming in nature, this algorithm requires a non-trivial initialization and, in contrast to the spirit of streaming algorithms, updates its eigenvector estimate only after every few samples. This raises the following challenging question:\n\nIs it possible to obtain an efficient and provably convergent counterpart to Oja's Algorithm for computing the principal generalized eigenvector in the stochastic setting?\n\nIn this paper, we propose a simple, globally convergent, two-line algorithm, Gen-Oja, for the stochastic principal generalized eigenvector problem and, as a consequence, we obtain a natural extension of Oja's algorithm for the streaming CCA problem. Gen-Oja is an iterative algorithm which works by updating two coupled sequences at every time step. 
In contrast with existing methods [17], at each time step the algorithm can be seen as performing a step of Oja's method, with a noise term which is neither zero mean nor conditionally independent, but instead is Markovian in nature. The analysis of the algorithm borrows tools from the theory of fast mixing of Markov chains [11] as well as two-time-scale stochastic approximation [6, 7, 8] to obtain an optimal (up to dimension dependence) fast convergence rate of Õ(1/n).\nNotation: We denote by λi(M) and σi(M) the ith largest eigenvalue and singular value of a square matrix M. For any positive semi-definite matrix N, we denote the inner product in the N-norm by ⟨·,·⟩N and the corresponding norm by ‖·‖N. We let κN = λmax(N)/λmin(N) denote the condition number of N. We denote the eigenvalues of the matrix B⁻¹A by λ1 > λ2 ≥ . . . ≥ λd, with (ui)_{i=1}^d and (ũi)_{i=1}^d denoting the corresponding right and left eigenvectors of B⁻¹A, whose existence is guaranteed by Lemma G.3 in Appendix G.3. We use Δλ to denote the eigengap λ1 − λ2.\n\n2 Problem Statement\n\nIn this section, we focus on the problem of estimating principal generalized eigenvectors in a stochastic setting. The generalized eigenvector, vi, corresponding to a system of matrices (A, B), where A ∈ Rd×d is a symmetric matrix and B ∈ Rd×d is a symmetric positive definite matrix, satisfies the relation (4) below.\nThe principal generalized eigenvector v1 corresponds to the vector with the largest value² of λi or, equivalently, v1 is the principal eigenvector of the non-symmetric matrix B⁻¹A. 
The vector v1 also corresponds to the maximizer of the generalized Rayleigh quotient given by (5):\n\nAvi = λiBvi,  (4)\n\nv1 = arg max_{v ∈ Rd} (v⊤Av)/(v⊤Bv).  (5)\n\nIn the stochastic setting, we only have access to a sequence of matrices A1, . . . , An ∈ Rd×d and B1, . . . , Bn ∈ Rd×d, assumed to be drawn i.i.d. from an unknown underlying distribution such that E[Ai] = A and E[Bi] = B, and the objective is to estimate v1 given access to O(d) memory.\nIn order to quantify the error between a vector and its estimate, we define the following generalization of the sine with respect to the B-norm:\n\nsin²_B(v, w) = 1 − (v⊤Bw / (‖v‖B ‖w‖B))².  (6)\n\n3 Related Work\n\nPCA. There is a vast literature dedicated to the development of computationally efficient algorithms for the PCA problem in the offline setting (see [23, 13] and references therein). In the stochastic setting, sharp convergence results were obtained recently by [17] and [27] for the principal eigenvector computation problem using Oja's algorithm, and later extended to the streaming k-PCA setting by [1]. They are able to obtain a O(1/n) convergence rate when the eigengap of the matrix is positive, and a O(1/√n) rate is attained in the gap-free setting.\n\nOffline CCA and generalized eigenvector. Computationally efficient optimization algorithms with finite convergence guarantees for CCA and the generalized eigenvector problem, based on Empirical Risk Minimization (ERM) on a fixed dataset, have recently been proposed in [14, 31, 2]. These approaches work by reducing the CCA and generalized eigenvector problem to that of solving a PCA problem on a modified matrix M (e.g., for CCA, M = B^(−1/2) A B^(−1/2)). 
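As a concrete illustration of this reduction, the sketch below solves the offline problem by forming M = B^(−1/2) A B^(−1/2) explicitly and mapping its top eigenvector back. The specific matrices A and B are hypothetical placeholders for illustration; this is a minimal sketch of the reduction, not the approximate solvers of [14, 31, 2]:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
# Hypothetical symmetric A and positive definite B, for illustration only.
X = rng.standard_normal((100, d)); A = X.T @ X / 100
Y = rng.standard_normal((200, d)); B = Y.T @ Y / 200 + 0.1 * np.eye(d)

# Reduction: eigenvectors u of M = B^(-1/2) A B^(-1/2) map to generalized
# eigenvectors of (A, B) via v = B^(-1/2) u, with the same eigenvalues.
s, U = np.linalg.eigh(B)
B_inv_half = U @ np.diag(s ** -0.5) @ U.T
M = B_inv_half @ A @ B_inv_half

eigvals, eigvecs = np.linalg.eigh(M)        # ascending order
v1 = B_inv_half @ eigvecs[:, -1]            # principal generalized eigenvector
lam1 = eigvals[-1]

# Sanity check: A v1 = lam1 * B v1.
assert np.allclose(A @ v1, lam1 * B @ v1, atol=1e-6)
```

This direct route costs O(d³) and needs A and B exactly, which is precisely what the streaming setting rules out.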
This reformulation is then solved by using an approximate version of the Power Method that relies on a linear system solver to obtain the approximate power method step. [14, 2] propose an algorithm for the generalized eigenvector computation problem and instantiate their results for the CCA problem. [20, 21, 31] focus on the CCA problem by optimizing a different objective:\n\nmin (1/2) Ê|φ⊤xi − ψ⊤yi|² + λx‖φ‖²₂ + λy‖ψ‖²₂ s.t. ‖φ‖_Ê[xx⊤] = ‖ψ‖_Ê[yy⊤] = 1,\n\nwhere Ê denotes the empirical expectation. The proposed methods utilize knowledge of the complete data in order to solve the ERM problem, and hence it is unclear how to extend them to the stochastic setting.\n\nStochastic CCA and generalized eigenvector. There has been a dearth of work on solving these problems in the stochastic setting, owing to the difficulties mentioned in Section 1. Recently, [12] extended the algorithm of [31] from the offline to the streaming setting by utilizing a streaming version of the SVRG algorithm for the least-squares system solver. Their algorithm, based on the shift-and-invert method, suffers from two drawbacks: a) contrary to the spirit of streaming algorithms, this method does not update its estimate at each iteration; it requires logarithmically many samples to solve an online least-squares problem, and b) their algorithm critically relies on obtaining an estimate of λ1 to a small accuracy, for which it must burn a few samples in the process. In comparison, Gen-Oja takes a single stochastic gradient step for the inner least-squares problem and updates its estimate of the eigenvector after each sample. 
Perhaps the closest to our approach is [4], who propose an online method that solves a convex relaxation of the CCA objective with an inexact stochastic mirror descent algorithm. Unfortunately, the computational complexity of their method is O(d²), which renders it infeasible for large-scale problems.\n\nAlgorithm 1: Gen-Oja for Streaming Av = λBv\nInput: Time steps T, step sizes αt (Least Squares), βt (Oja)\nInitialize: (w0, v0) ← sample uniformly from the unit sphere in Rd, v̄0 = v0\nfor t = 1, . . . , T do\n  Draw sample (At, Bt)\n  wt ← wt−1 − αt(Bt wt−1 − At vt−1)\n  v′t ← vt−1 + βt wt\n  vt ← v′t / ‖v′t‖2\nOutput: Estimate of Principal Generalized Eigenvector: vT\n\n4 Gen-Oja\n\nIn this section, we describe our proposed approach for the stochastic generalized eigenvector problem (see Section 2). Our algorithm, Gen-Oja, described in Algorithm 1, is a natural extension of the popular Oja's algorithm used for solving the streaming PCA problem. The algorithm proceeds by iteratively updating two coupled sequences (wt, vt) at the same time: wt is updated using one step of stochastic gradient descent with constant step size to minimize w⊤Bw − 2w⊤Avt, and vt is updated using a step of Oja's algorithm. Gen-Oja has its roots in the theory of two-time-scale stochastic approximation, viewing the sequence wt as a fast-mixing Markov chain and vt as a slowly evolving one. In the sequel, we describe the evolution of the Markov chains (wt)t≥0, (vt)t≥0, in the process outlining the intuition underlying Gen-Oja and the key challenges which arise in the convergence analysis.\nOja's algorithm. Gen-Oja is closely related to Oja's algorithm [25] for the streaming PCA problem. Consider a special case of the problem, where each Bt = I. 
In the offline setting, this reduces the generalized eigenvector problem to that of computing the principal eigenvector of A. With the step size set to αt = 1, Gen-Oja recovers Oja's algorithm, given by\n\nvt = (vt−1 + βt At vt−1) / ‖vt−1 + βt At vt−1‖.\n\nThis algorithm is exactly projected stochastic gradient ascent on the Rayleigh quotient v⊤Av (with step size βt). Alternatively, it can be interpreted as a randomized power method on the matrix (I + βtA) [15].\nTwo-time-scale approximation. The theory of two-time-scale approximation forms the underlying basis for Gen-Oja. It considers coupled iterative systems where one component changes much faster than the other [7, 8]. More precisely, its objective is to understand classical systems of the type\n\nxt = xt−1 + αt[h(xt−1, yt−1) + ξ¹t],  (7)\nyt = yt−1 + βt[g(xt−1, yt−1) + ξ²t],  (8)\n\nwhere g and h are the update functions and (ξ¹t, ξ²t) correspond to the noise vectors at step t, typically assumed to be martingale difference sequences.\nIn the above model, whenever the two step sizes αt and βt satisfy βt/αt → 0, the sequence yt moves on a slower timescale than xt. For any fixed value of y, the dynamical system given by\n\nxt = xt−1 + αt[h(xt−1, y) + ξ¹t]  (9)\n\nconverges to a solution x∗(y). In the coupled system, since the state variables xt move at a much faster time scale, they can be seen as being close to x∗(yt), and thus we can alternatively consider:\n\nyt = yt−1 + βt[g(x∗(yt−1), yt−1) + ξ²t].  (10)\n\nIf the process given by yt above were to converge to y∗, under certain conditions we can argue that the coupled process (xt, yt) converges to (x∗(y∗), y∗). 
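To make the coupled updates of Algorithm 1 concrete, here is a minimal numpy sketch. The step-size choices below (a constant α and βt ∝ 1/t) are illustrative placeholders, not the tuned schedules of the analysis, and the stream interface is an assumption for the example:

```python
import numpy as np

def gen_oja(sample_stream, d, T, alpha=0.2, beta0=1.0):
    """Minimal sketch of Gen-Oja (Algorithm 1): two coupled sequences.

    sample_stream yields pairs (At, Bt) with E[At] = A, E[Bt] = B.
    """
    rng = np.random.default_rng(0)
    w = rng.standard_normal(d); w /= np.linalg.norm(w)
    v = rng.standard_normal(d); v /= np.linalg.norm(v)
    for t in range(1, T + 1):
        At, Bt = next(sample_stream)
        # Fast time scale: one SGD step on w -> w^T B w - 2 w^T A v.
        w = w - alpha * (Bt @ w - At @ v)
        # Slow time scale: one (normalized) Oja step using w in place of B^{-1} A v.
        v = v + (beta0 / t) * w
        v = v / np.linalg.norm(v)
    return v
```

A usage example on a degenerate stream with fixed diagonal matrices (so the principal generalized eigenvector of (A, B) is the first coordinate axis): `gen_oja(iter([(A, B)] * 3000), d, 3000)` aligns with that axis.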
Intuitively, because xt and yt are evolving at different time-scales, xt views the process yt as quasi-constant while yt views xt as a process rapidly converging to x∗(yt).\n\n²Note that we consider here the largest signed value of λi.\n\nGen-Oja can be seen as a particular instance of the coupled iterative system given by Equations (7) and (8), where the sequence vt evolves with a step size βt ≈ 1/t, much slower than the sequence wt, which has a step size of αt ≈ 1/log(t). Proceeding as above, the sequence vt views wt as having converged to B⁻¹Avt + ξt, where ξt is a noise term, and the update step for vt in Gen-Oja can be viewed as a step of Oja's algorithm, albeit with Markovian noise.\nWhile previous works on the stochastic CCA problem required logarithmically many independent samples to solve the inner least-squares problem in order to perform an approximate power method (or Oja) step, the theory of two-time-scale stochastic approximation suggests that it is possible to obtain a similar effect by evolving the sequences wt and vt at two different time scales.\nUnderstanding the Markov Process {wt}. In order to understand the process described by the sequence wt, we consider the homogeneous Markov chain (w^v_t) defined by\n\nw^v_t = w^v_{t−1} − α(Bt w^v_{t−1} − At v)  (11)\n\nfor a constant vector v, and we denote its t-step kernel by π^t_v [22]. This Markov process is an iterative linear model and has been extensively studied by [28, 10, 5]. It is known that for any step size α ≤ 2/R², the Markov chain (w^v_t)t≥0 admits a unique stationary distribution, denoted by νv. 
In addition,\n\nW²₂(π^t_v(w0, ·), νv) ≤ (1 − 2µα(1 − αR²_B/2))^t ∫_{Rd} ‖w0 − w‖²₂ dνv(w),  (12)\n\nwhere W²₂(λ, ν) denotes the squared Wasserstein distance of order 2 between probability measures λ and ν (see, e.g., [30] for more properties of W2). Equation (12) implies that the iterative linear process described by (11) mixes exponentially fast to its stationary distribution. This forms a crucial ingredient in our convergence analysis, where we use the fast mixing to obtain a bound on the expected norm of the Markovian noise (see Lemma 6.1).\nMoreover, one can compute the mean w̄v of the process wt under the stationary distribution by taking expectations under νv on both sides of equation (11). Doing so, we obtain w̄v = B⁻¹Av. Thus, in our setting, since the vt process evolves slowly, we can expect that wt ≈ B⁻¹Avt, allowing Gen-Oja to mimic Oja's algorithm.\n\n5 Main Theorem\n\nIn this section, we present our main convergence guarantee for Gen-Oja when applied to the streaming generalized eigenvector problem. We begin by listing the key assumptions required by our analysis:\n(A1) The matrices (Ai)i≥0 satisfy E[Ai] = A for a symmetric matrix A ∈ Rd×d.\n(A2) The matrices (Bi)i≥0 are such that each Bi ⪰ 0 is symmetric and satisfies E[Bi] = B for a symmetric matrix B ∈ Rd×d with B ⪰ µI for µ > 0.\n(A3) There exists R ≥ 0 such that max{‖Ai‖, ‖Bi‖} ≤ R almost surely.\nUnder the assumptions stated above, we obtain the following convergence theorem for Gen-Oja with respect to the sin²_B distance, as described in Section 2.\nTheorem 5.1 (Main Result). Fix any δ > 0 and ε1 > 0. 
Suppose that the step sizes are set to αt = c/log(d²β + t) and βt = γ/(Δλ(d²β + t)) for γ > 1/2, c > 1, and\n\nβ = max( (20γ²λ1²/Δλ²) d² log((1 + δ/100)/(1 + ε1)), (200/(δΔλ²)) (R/µ + R³/µ² + R⁵/µ³) log(1 + R²/µ + R⁴/µ²) ).\n\nSuppose that the number of samples n satisfies\n\n(d²β + n)^(1/min(1, 2γλ1/Δλ)) / log^(1/min(1, 2γλ1/Δλ))(d²β + n) ≥ cd / (δ1 min(1, λ1)).\n\nThen the output vn of Algorithm 1 satisfies\n\nsin²_B(u1, vn) ≤ ((2 + ε1) cd ‖Σ_{i=1}^d ũiũi⊤‖² log(1/δ) / (δ²‖ũ1‖²₂)) [ (1/d²) ((d²β + log³(d²β))/(d²β + n + 1))^(2γ) + (d³β + 1) exp(−cγ² log³(d²β + n)) + (cd/Δλ) log³(d²β + n)/(Δλ²(d²β + n + 1)) ],\n\nwith probability at least 1 − δ, with c depending polynomially on the problem parameters λ1, κB, R, µ. The parameter δ1 is set as δ1 = ε1/(2(2 + ε1)).\nThe above result shows that, with probability at least 1 − δ, Gen-Oja converges in the B-norm to the right eigenvector u1 corresponding to the maximum eigenvalue of the matrix B⁻¹A. Further, Gen-Oja exhibits an Õ(1/n) rate of convergence, which is known to be optimal for stochastic approximation algorithms even with convex objectives [24].\nComparison with Streaming PCA. In the setting where B = I and A ⪰ 0 is a covariance matrix, the principal generalized eigenvector problem reduces to performing PCA on A. When compared with the results obtained for streaming PCA by [17], our corresponding results differ by a factor of the dimension d and problem-dependent parameters λ1, Δλ. We believe that such a dependence is not inherent to Gen-Oja but a consequence of our analysis. We leave the task of obtaining a dimension-free bound for Gen-Oja as future work.\nGap-independent step size: While the step size for the sequence vn in Gen-Oja depends on the eigengap, which is a priori unknown, one can leverage recent results as in [29] to get around this issue by using a streaming-average step size.\n\n6 Proof Sketch\n\nIn this section, we detail the two key ideas underlying the analysis of Gen-Oja used to obtain the convergence rate of Theorem 5.1: a) controlling the non-i.i.d. Markovian noise term which is introduced because of the coupled Markov chains in Gen-Oja, and b) proving that a noisy power method with such Markovian noise converges to the correct solution.\n\nControlling Markovian perturbations. In order to better understand the sequence vt, we rewrite the update as\n\nv′t = vt−1 + βtwt = vt−1 + βt(B⁻¹Avt−1 + ξt),  (13)\n\nwhere ξt = wt − B⁻¹Avt−1 is the prediction error, which is a Markovian noise. Note that the noise term is neither mean zero nor a martingale difference sequence. Instead, the noise term ξt is dependent on all previous iterates, which makes the analysis of the process more involved. This framework with Markovian noise has been extensively studied by [6, 3].\nFrom the update in Equation (13), we observe that Gen-Oja is performing an Oja update, but with a controlled Markovian noise. 
However, we would like to highlight that classical techniques in the study of stochastic approximation with Markovian noise (such as the Poisson equation [6, 22]) were not enough to provide adequate control on the noise to show convergence.\nIn order to overcome this difficulty, we leverage the fast mixing of the chain w^v_t for understanding the Markovian noise. While it holds that E[‖ξt‖²] = O(1) (see Appendix C), a key part of our analysis is the following lemma, the proof of which can be found in Appendix B.\nLemma 6.1. For any choice of k > 4(λ1(B)/(µα)) log(1/β_{t+k}), and assuming that ‖ws‖ ≤ Ws for t ≤ s ≤ t + k, we have that\n\n‖E[ξ_{t+k} | Ft]‖₂ = O(βt k² αt W_{t+k}).\n\nLemma 6.1 uses the fast mixing of wt to show that ‖E[ξt | F_{t−r}]‖₂ = Õ(βt), where r = O(log t), i.e., the magnitude of the expected noise is small conditioned on log(t) steps in the past.\n\nAnalysis of Oja's algorithm. The usual proofs of convergence for stochastic approximation define a Lyapunov function and show that it decreases sufficiently at each iteration. Oftentimes, control on the per-step rate of decrease can then be translated into a global convergence result. Unfortunately, in the context of PCA, due to the non-convexity of the Rayleigh quotient, the quality of the estimate vt cannot be related to the previous vt−1. Indeed, vt may become orthogonal to the leading eigenvector. Instead, [17] circumvent this issue by leveraging the randomness of the initialization and adopting an operator view of the problem. We take inspiration from this approach in our analysis of Gen-Oja. 
Let Gi = wi v⊤_{i−1} and Ht = Π_{i=1}^t (I + βiGi); Gen-Oja's update can then be equivalently written as\n\nvt = Htv0 / ‖Htv0‖₂,\n\npushing, for the analysis only, the normalization step to the end. This point of view enables us to analyze the improvement of Ht over Ht−1, since it allows one to interpret Oja's update as one step of the power method on Ht starting from a random vector v0. We present here an easy adaptation of [17, Lemma 3.1] that takes into account the special geometry of the generalized eigenvector problem and the asymmetry of B⁻¹A. The proof can be found in Appendix A.\nLemma 6.2. Let H ∈ Rd×d, let (ui)_{i=1}^d and (ũi)_{i=1}^d be the corresponding right and left eigenvectors of B⁻¹A, and let w ∈ Rd be chosen uniformly on the sphere. Then, with probability 1 − δ (over the randomness in the initial iterate),\n\nsin²_B(ui, Hw) ≤ (C log(1/δ)/δ) · Tr(HH⊤ Σ_{j≠i} ũjũj⊤) / (ũi⊤ HH⊤ ũi),  (14)\n\nfor some universal constant C > 0.\n\nThis lemma has the virtue of greatly simplifying the challenging proof of convergence of Oja's algorithm. Indeed, we only have to prove that Ht will be close to Π_{i=1}^t (I + βiB⁻¹A) for t large enough, which can be interpreted as an analogue of the law of large numbers for the multiplication of matrices. This will ensure that Tr(HtHt⊤ Σ_{j≠i} ũjũj⊤) is relatively small compared to ũi⊤ HtHt⊤ ũi, which, together with Lemma 6.2, is enough to prove Theorem 5.1. The proof follows the lines of [17] with two additional tedious difficulties: the Markovian noise is neither unbiased nor independent of the previous iterates, and the matrix B⁻¹A is no longer symmetric, which is precisely why we consider the left eigenvector ũi in the right-hand side of Eq. (14). 
We highlight two key steps:\n\n\u2022 First, we show that E Tr(HtHt⊤ Σ_{j≠i} ũjũj⊤) grows as O(exp(2λ2 Σ_{i=1}^t βi)), which implies, by Markov's inequality, the same bound on Tr(HtHt⊤ Σ_{j≠i} ũjũj⊤) with constant probability. See Lemma E.2 for more details.\n\n\u2022 Second, we show that E ũi⊤HtHt⊤ũi grows as O(exp(2λ1 Σ_{i=1}^t βi)) and Var ũi⊤HtHt⊤ũi grows as O(exp(4λ1 Σ_{i=1}^t βi)), which implies, by Chebyshev's inequality, the same bound for ũi⊤HtHt⊤ũi with constant probability. See Lemmas E.3 and E.5 for more details.\n\n7 Application to Canonical Correlation Analysis\n\nConsider two random vectors X ∈ Rd and Y ∈ Rd with joint distribution PXY. The objective of canonical correlation analysis in the population setting is to find the canonical correlation vectors φ, ψ ∈ Rd which maximize the correlation\n\nmax_{φ,ψ} E[(φ⊤X)(ψ⊤Y)] / √(E[(φ⊤X)²] E[(ψ⊤Y)²]).\n\nThis problem is equivalent to maximizing φ⊤E[XY⊤]ψ under the constraint E[(φ⊤X)²] = E[(ψ⊤Y)²] = 1 and admits a closed-form solution: if we define T = E[XX⊤]^(−1/2) E[XY⊤] E[YY⊤]^(−1/2), then the solution is (φ∗, ψ∗) = (E[XX⊤]^(−1/2) a1, E[YY⊤]^(−1/2) b1), where a1, b1 are the left and right principal singular vectors of T. 
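The closed form above can be sketched directly in numpy, with the population expectations replaced by empirical covariances. The data-generating step is a hypothetical example; the whitening-plus-SVD computation follows the definition of T:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 3
# Hypothetical correlated views X, Y for illustration.
Z = rng.standard_normal((n, d))
X = Z + 0.1 * rng.standard_normal((n, d))
Y = Z @ rng.standard_normal((d, d)) + 0.1 * rng.standard_normal((n, d))
X -= X.mean(0); Y -= Y.mean(0)

Cxx, Cyy, Cxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n

def inv_sqrt(C):
    """Inverse matrix square root of a symmetric positive definite matrix."""
    s, U = np.linalg.eigh(C)
    return U @ np.diag(s ** -0.5) @ U.T

# T = Cxx^{-1/2} Cxy Cyy^{-1/2}; its top singular pair gives the CCA solution.
T = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
a, s, bT = np.linalg.svd(T)
phi = inv_sqrt(Cxx) @ a[:, 0]   # top canonical vector for X
psi = inv_sqrt(Cyy) @ bT[0]     # top canonical vector for Y
# s[0] is the top (empirical) canonical correlation.
```

By construction phi⊤ Cxx phi = 1 and phi⊤ Cxy psi = s[0], matching the constraint and objective of the population problem.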
By the KKT conditions, there exist ν1, ν2 ∈ R such that this solution satisfies the stationarity equations\n\nE[XY⊤]ψ = ν1E[XX⊤]φ and E[YX⊤]φ = ν2E[YY⊤]ψ.\n\nUsing the constraint conditions, we conclude that ν1 = ν2. This condition can be written (for λ = ν1) in the matrix form of Eq. (3). As a consequence, finding the largest generalized eigenvector for the matrices (A, B) will recover the canonical correlation vector (φ, ψ). Solving the associated streaming generalized eigenvector problem, we obtain the following result for estimating the canonical correlation vector, whose proof easily follows from Theorem 5.1 (setting γ = 6).\nTheorem 7.1. Assume that max{‖X‖, ‖Y‖} ≤ R a.s., min{λmin(E[XX⊤]), λmin(E[YY⊤])} = µ > 0 and σ1(T) − σ2(T) = Δ > 0. Fix any δ > 0, let ε1 ≥ 0, and suppose the step sizes are set to αt = 1/(2R² log(d²β + t)) and βt = 6/(Δ(d²β + t)), and\n\nβ = max( (720σ1²/Δ²) d² log((1 + δ/100)/(1 + ε1)), (200/(δΔ²)) (R/µ + R³/µ² + R⁵/µ³) log(1 + R²/µ + R⁴/µ²) ).\n\nFigure 1: Synthetic Generalized Eigenvalue problem. Left: Comparison with two-step methods. Middle: Robustness to step size αt. 
Right: Robustness to step size βt (Streaming averaged Gen-Oja is dashed).\n\nSuppose that the number of samples n satisfies\n\n(d²β + n)^(1/min(1, 12λ1/Δλ)) / log^(1/min(1, 12λ1/Δλ))(d²β + n) ≥ cd(d³β + 1) exp(cλ1²/d²) / (δ1 min(1, λ1)).\n\nThen the output (φt, ψt) of Algorithm 1 applied to (A, B) defined above satisfies\n\nsin²_B((φ∗, ψ∗), (φt, ψt)) ≤ ((2 + ε1) cd² log(1/δ) / (δ²‖ũ1‖²₂)) · log³(d²β + n)/(Δ²(d²β + n + 1)),\n\nwith probability at least 1 − δ, with c depending on the parameters of the problem and independent of d and Δ, where δ1 = ε1/(2(2 + ε1)).\n\nWe can make the following observations:\n\n\u2022 The convergence guarantees are comparable with the sample complexity obtained by the ERM (t = Õ(d/(εΔ²)) for sub-Gaussian variables and t = Õ(1/(εΔ²µ²)) for bounded variables) [12].\n\u2022 The sample complexity in [12] is better in terms of the dependence on d. They obtain the same rates as the ERM. We are unable to explicitly compare our bounds with [4], since they work in the gap-free setting and their computational complexity is O(d²).\n\n8 Simulations\n\nHere we illustrate the practical utility of Gen-Oja on a synthetic, streaming generalized eigenvector problem. We take d = 20 and T = 10⁶. The streams (At, Bt) ∈ (Rd×d)² are normally distributed with covariance matrices A and B with random eigenvectors and eigenvalues decaying as 1/i, for i = 1, . . . , d. Here R² denotes the radius of the streams, with R² = max{Tr A, Tr B}. All results are averaged over ten repetitions.\nComparison with two-step methods. In the left plot of Figure 1, we compare the behavior of Gen-Oja to different two-step algorithms. 
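A sketch of a synthetic stream in the spirit of the setup above is given below. The exact sampling scheme used in the paper is not fully specified, so the choice of rank-one outer products of Gaussian draws (which gives E[At] = A and E[Bt] = B) is our assumption for illustration:

```python
import numpy as np

def synthetic_stream(d=20, T=10**4, seed=0):
    """Sketch of a synthetic stream: covariances A, B with random eigenvectors
    and eigenvalues decaying as 1/i; emitted samples are rank-one outer
    products of Gaussian draws, so E[At] = A and E[Bt] = B (our assumption)."""
    rng = np.random.default_rng(seed)
    eigvals = 1.0 / np.arange(1, d + 1)

    def random_cov():
        Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random eigenvectors
        return Q @ np.diag(eigvals) @ Q.T

    A, B = random_cov(), random_cov()
    for _ in range(T):
        x = rng.multivariate_normal(np.zeros(d), A)
        y = rng.multivariate_normal(np.zeros(d), B)
        yield np.outer(x, x), np.outer(y, y)
```

Averaging the emitted At over the stream recovers A (e.g., their mean trace concentrates around Tr A = Σ 1/i).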
Since the method of [4] has complexity O(d²), we compare Gen-Oja to a method which alternates between one step of Oja's algorithm and τ steps of averaged stochastic gradient descent with constant step size 1/2R². Gen-Oja converges at rate O(1/t) whereas the other methods are very slow. For τ = 10, the solution of the inner loop is too inaccurate and the Oja steps are inefficient. For τ = 10000, the output of the SGD steps is very accurate but there are too few Oja iterations to make any progress. τ = 1000 seems an optimal parameter choice, but this method is still slower than Gen-Oja by an order of magnitude.

Robustness to incorrect step-size α. In the middle plot of Figure 1 we compare the behavior of Gen-Oja for step sizes α ∈ {α*, α*/8, α*/16} where α* = 1/R². We observe that Gen-Oja converges at a rate O(1/t) independently of the choice of α.

Robustness to incorrect step-size β_t. In the right plot of Figure 1 we compare the behavior of Gen-Oja for step sizes β_t ∈ {β*/t, β*/16t, β*/√t, β*/16√t}, where β* corresponds to the minimal error after one pass over the data. We observe that Gen-Oja is not robust to the choice of the constant for step sizes β_t ∝ 1/t: if the constant is too small, the rate of convergence is arbitrarily slow.
We observe that applying the streaming average of [29] on top of Gen-Oja with a step size β_t ∝ 1/√t recovers the fast O(1/t) convergence while being robust to misspecification of the constant.

9 Conclusion
We have proposed and analyzed a simple online algorithm for the streaming generalized eigenvector problem and applied it to CCA. This algorithm, inspired by two-time-scale stochastic approximation, achieves a fast O(1/t) convergence rate. Recovering the k principal generalized eigenvectors (for k > 1) and obtaining a slow O(1/√t) convergence rate in the gap-free setting are promising future directions. Finally, it would be worth removing the dimension dependence in our convergence guarantee.

Acknowledgements

We gratefully acknowledge the support of the NSF through grant IIS-1619362. AP acknowledges Huawei's support through a BAIR-Huawei PhD Fellowship. This work was supported in part by the Mathematical Data Science program of the Office of Naval Research under grant number N00014-18-1-2764. This work was partially supported by AFOSR through grant FA9550-17-1-0308.

References
[1] Z. Allen-Zhu and Y. Li. First efficient convergence for streaming k-PCA: a global, gap-free, and near-optimal rate. In Foundations of Computer Science (FOCS), 2017 IEEE 58th Annual Symposium on, pages 487–492. IEEE, 2017.
[2] Z. Allen-Zhu and Y. Li. Doubly accelerated methods for faster CCA and generalized eigendecomposition. In International Conference on Machine Learning, 2017.
[3] C. Andrieu, E. Moulines, and P. Priouret. Stability of stochastic approximation under verifiable conditions. SIAM Journal on Control and Optimization, 44(1):283–312, 2005.
[4] R. Arora, T. V. Marinov, P. Mianjy, and N. Srebro. Stochastic approximation for canonical correlation analysis. In Advances in Neural Information Processing Systems, 2017.
[5] F. Bach and E. Moulines.
Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, 2013.
[6] A. Benveniste, M. Métivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Springer Publishing Company, Incorporated, 1990.
[7] V. S. Borkar. Stochastic approximation with two time scales. Systems & Control Letters, 29(5):291–294, 1997.
[8] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint, volume 48. Springer, 2009.
[9] K. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan. Multi-view clustering via canonical correlation analysis. In International Conference on Machine Learning, pages 129–136, 2009.
[10] P. Diaconis and D. Freedman. Iterated random functions. SIAM Review, 41(1):45–76, 1999.
[11] A. Dieuleveut, A. Durmus, and F. Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. arXiv preprint arXiv:1707.06386, 2017.
[12] C. Gao, D. Garber, N. Srebro, J. Wang, and W. Wang. Stochastic canonical correlation analysis. arXiv preprint arXiv:1702.06533, 2017.
[13] D. Garber, E. Hazan, C. Jin, S. Kakade, C. Musco, P. Netrapalli, and A. Sidford. Faster eigenvector computation via shift-and-invert preconditioning. In International Conference on Machine Learning, 2016.
[14] R. Ge, C. Jin, S. Kakade, P. Netrapalli, and A. Sidford. Efficient algorithms for large-scale generalized eigenvector computation and canonical correlation analysis. In International Conference on Machine Learning, 2016.
[15] M. Hardt and E. Price. The noisy power method: A meta algorithm with applications. In Advances in Neural Information Processing Systems, pages 2861–2869, 2014.
[16] H. Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.
[17] P. Jain, C. Jin, S. M. Kakade, P. Netrapalli, and A. Sidford.
Streaming PCA: Matching matrix Bernstein and near-optimal finite sample guarantees for Oja's algorithm. In Conference on Learning Theory, 2016.
[18] S. M. Kakade and D. P. Foster. Multi-view regression via canonical correlation analysis. In International Conference on Computational Learning Theory, pages 82–96. Springer, 2007.
[19] N. Karampatziakis and P. Mineiro. Discriminative features via generalized eigenvectors. arXiv preprint arXiv:1310.1934, 2013.
[20] Y. Lu and D. P. Foster. Large scale canonical correlation analysis with iterative least squares. In Advances in Neural Information Processing Systems, 2014.
[21] Z. Ma, Y. Lu, and D. Foster. Finding linear structure in large datasets with scalable canonical correlation analysis. In International Conference on Machine Learning, 2015.
[22] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Cambridge University Press, 2009.
[23] C. Musco and C. Musco. Randomized block Krylov methods for stronger and faster approximate singular value decomposition. In Advances in Neural Information Processing Systems, 2015.
[24] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics. John Wiley & Sons, 1983.
[25] E. Oja. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267–273, 1982.
[26] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
[27] O. Shamir. Convergence of stochastic gradient descent for PCA. In International Conference on Machine Learning, 2016.
[28] D. Steinsaltz. Locally contractive iterated function systems. Annals of Probability, pages 1952–1979, 1999.
[29] N. Tripuraneni, N. Flammarion, F. Bach, and M. I. Jordan.
Averaging stochastic gradient descent on Riemannian manifolds. In Conference on Learning Theory, 2018.
[30] C. Villani. Optimal Transport: Old and New, volume 338. Springer-Verlag Berlin Heidelberg, 2008.
[31] W. Wang, J. Wang, D. Garber, and N. Srebro. Efficient globally convergent stochastic optimization for canonical correlation analysis. In Advances in Neural Information Processing Systems, 2016.