{"title": "Randomized Block Krylov Methods for Stronger and Faster Approximate Singular Value Decomposition", "book": "Advances in Neural Information Processing Systems", "page_first": 1396, "page_last": 1404, "abstract": "Since being analyzed by Rokhlin, Szlam, and Tygert and popularized by Halko, Martinsson, and Tropp, randomized Simultaneous Power Iteration has become the method of choice for approximate singular value decomposition. It is more accurate than simpler sketching algorithms, yet still converges quickly for *any* matrix, independently of singular value gaps. After ~O(1/epsilon) iterations, it gives a low-rank approximation within (1+epsilon) of optimal for spectral norm error. We give the first provable runtime improvement on Simultaneous Iteration: a randomized block Krylov method, closely related to the classic Block Lanczos algorithm, gives the same guarantees in just ~O(1/sqrt(epsilon)) iterations and performs substantially better experimentally. Our analysis is the first of a Krylov subspace method that does not depend on singular value gaps, which are unreliable in practice. Furthermore, while it is a simple accuracy benchmark, even (1+epsilon) error for spectral norm low-rank approximation does not imply that an algorithm returns high quality principal components, a major issue for data applications. We address this problem for the first time by showing that both Block Krylov Iteration and Simultaneous Iteration give nearly optimal PCA for any matrix. 
This result further justifies their strength over non-iterative sketching methods.", "full_text": "Randomized Block Krylov Methods for Stronger and\nFaster Approximate Singular Value Decomposition\n\nCameron Musco\nMassachusetts Institute of Technology, EECS\nCambridge, MA 02139, USA\ncnmusco@mit.edu\n\nChristopher Musco\nMassachusetts Institute of Technology, EECS\nCambridge, MA 02139, USA\ncpmusco@mit.edu\n\nAbstract\n\nSince being analyzed by Rokhlin, Szlam, and Tygert [1] and popularized by\nHalko, Martinsson, and Tropp [2], randomized Simultaneous Power Iteration has\nbecome the method of choice for approximate singular value decomposition. It is\nmore accurate than simpler sketching algorithms, yet still converges quickly for\nany matrix, independently of singular value gaps. After \u00d5(1/\u03b5) iterations, it gives\na low-rank approximation within (1 + \u03b5) of optimal for spectral norm error.\nWe give the first provable runtime improvement on Simultaneous Iteration: a\nrandomized block Krylov method, closely related to the classic Block Lanczos\nalgorithm, gives the same guarantees in just \u00d5(1/\u221a\u03b5) iterations and performs\nsubstantially better experimentally. Our analysis is the first of a Krylov subspace method\nthat does not depend on singular value gaps, which are unreliable in practice.\nFurthermore, while it is a simple accuracy benchmark, even (1 + \u03b5) error for\nspectral norm low-rank approximation does not imply that an algorithm returns\nhigh quality principal components, a major issue for data applications. We address\nthis problem for the first time by showing that both Block Krylov Iteration and\nSimultaneous Iteration give nearly optimal PCA for any matrix. This result further\njustifies their strength over non-iterative sketching methods.\n\n1 Introduction\nAny matrix A \u2208 R^{n\u00d7d} with rank r can be written using a singular value decomposition (SVD) as\nA = U\u03a3V^T. 
U \u2208 R^{n\u00d7r} and V \u2208 R^{d\u00d7r} have orthonormal columns (A\u2019s left and right singular\nvectors) and \u03a3 \u2208 R^{r\u00d7r} is a positive diagonal matrix containing A\u2019s singular values: \u03c3_1 \u2265 ... \u2265 \u03c3_r.\nA rank k partial SVD algorithm returns just the top k left or right singular vectors of A. These are\nthe first k columns of U or V, denoted U_k and V_k respectively.\nAmong countless applications, the SVD is used for optimal low-rank approximation and principal\ncomponent analysis (PCA). Specifically, for k < r, a partial SVD can be used to construct a rank k\napproximation A_k such that both \u2016A \u2212 A_k\u2016_F and \u2016A \u2212 A_k\u2016_2 are as small as possible. We simply\nset A_k = U_k U_k^T A. That is, A_k is A projected onto the space spanned by its top k singular vectors.\nFor principal component analysis, A\u2019s top singular vector u_1 provides a top principal component,\nwhich describes the direction of greatest variance within A. The ith singular vector u_i provides the\nith principal component, which is the direction of greatest variance orthogonal to all higher principal\ncomponents. Formally, denoting A\u2019s ith singular value as \u03c3_i,\n\nu_i^T AA^T u_i = \u03c3_i^2 = max_{x: \u2016x\u2016_2 = 1, x \u22a5 u_j \u2200 j < i} x^T AA^T x.\n\nTraditional SVD algorithms are expensive, typically running in O(nd^2) time, so there has been\nsubstantial research on randomized techniques that seek nearly optimal low-rank approximation and\nPCA [3, 4, 1, 2, 5]. These methods are quickly becoming standard tools in practice and\nimplementations are widely available [6, 7, 8, 9], including in popular learning libraries [10].\nRecent work focuses on algorithms whose runtimes do not depend on properties of A. 
In contrast,\nclassical literature typically gives runtime bounds that depend on the gaps between A\u2019s singular\nvalues and become useless when these gaps are small (which is often the case in practice \u2013 see\nSection 6). This limitation is due to a focus on how quickly approximate singular vectors converge\nto the actual singular vectors of A. When two singular vectors have nearly identical values they are\ndifficult to distinguish, so convergence inherently depends on singular value gaps.\nOnly recently has a shift in approximation goal, along with an improved understanding of\nrandomization, allowed for algorithms that avoid gap dependence and thus run provably fast for any matrix.\nFor low-rank approximation and PCA, we only need to find a subspace that captures nearly as much\nvariance as A\u2019s top singular vectors \u2013 distinguishing between two close singular values is overkill.\n\n1.1 Prior Work\nThe fastest randomized SVD algorithms [3, 5] run in O(nnz(A)) time\u00b9, are based on non-iterative\nsketching methods, and return a rank k matrix Z with orthonormal columns z_1, ..., z_k satisfying\n\nFrobenius Norm Error: \u2016A \u2212 ZZ^T A\u2016_F \u2264 (1 + \u03b5)\u2016A \u2212 A_k\u2016_F. (1)\n\nUnfortunately, as emphasized in prior work [1, 2, 11, 12], Frobenius norm error is often hopelessly\ninsufficient, especially for data analysis and learning applications. When A has a \u201cheavy-tail\u201d of\nsingular values, which is common for noisy data, \u2016A \u2212 A_k\u2016_F^2 = \u2211_{i>k} \u03c3_i^2 can be huge, potentially\nmuch larger than A\u2019s top singular value. This renders (1) meaningless since Z does not need to\nalign with any large singular vectors to obtain good multiplicative error.\nTo address this shortcoming, a number of papers target spectral norm low-rank approximation error,\n\nSpectral Norm Error: \u2016A \u2212 ZZ^T A\u2016_2 \u2264 (1 + \u03b5)\u2016A \u2212 A_k\u2016_2, (2)\n\nwhich is intuitively stronger. 
When looking for a rank k approximation, A\u2019s top k singular vectors\nare often considered data and the remaining tail is considered noise. A spectral norm guarantee\nroughly ensures that ZZ^T A recovers A up to this noise threshold.\nA series of work [1, 2, 13, 14, 15] shows that the decades old Simultaneous Power Iteration (also\ncalled subspace iteration or orthogonal iteration) implemented with random start vectors, achieves\n(2) after \u00d5(1/\u03b5) iterations. Hence, this method, which was popularized by Halko, Martinsson, and\nTropp in [2], has become the randomized SVD algorithm of choice for practitioners [10, 16].\n\n2 Our Results\n\nAlgorithm 1 SIMULTANEOUS ITERATION\ninput: A \u2208 R^{n\u00d7d}, error \u03b5 \u2208 (0, 1), rank k \u2264 n, d\noutput: Z \u2208 R^{n\u00d7k}\n1: q := \u0398(log d / \u03b5), \u03a0 \u223c N(0, 1)^{d\u00d7k}\n2: K := (AA^T)^q A\u03a0\n3: Orthonormalize the columns of K to obtain Q \u2208 R^{n\u00d7k}.\n4: Compute M := Q^T AA^T Q \u2208 R^{k\u00d7k}.\n5: Set \u016a_k to the top k singular vectors of M.\n6: return Z = Q\u016a_k.\n\nAlgorithm 2 BLOCK KRYLOV ITERATION\ninput: A \u2208 R^{n\u00d7d}, error \u03b5 \u2208 (0, 1), rank k \u2264 n, d\noutput: Z \u2208 R^{n\u00d7k}\n1: q := \u0398(log d / \u221a\u03b5), \u03a0 \u223c N(0, 1)^{d\u00d7k}\n2: K := [A\u03a0, (AA^T)A\u03a0, ..., (AA^T)^q A\u03a0]\n3: Orthonormalize the columns of K to obtain Q \u2208 R^{n\u00d7qk}.\n4: Compute M := Q^T AA^T Q \u2208 R^{qk\u00d7qk}.\n5: Set \u016a_k to the top k singular vectors of M.\n6: return Z = Q\u016a_k.\n\n2.1 Faster Algorithm\nWe show that Algorithm 2, a randomized relative of the Block Lanczos algorithm [17, 18], which\nwe call Block Krylov Iteration, gives the same guarantees as Simultaneous Iteration (Algorithm 1)\nin just \u00d5(1/\u221a\u03b5) iterations. 
This not only gives the fastest known theoretical runtime for achieving\n(2), but also yields substantially better performance in practice (see Section 6).\n\n\u00b9 Here nnz(A) is the number of non-zero entries in A and this runtime hides lower order terms.\n\nEven though the algorithm has been discussed and tested for potential improvement over\nSimultaneous Iteration [1, 19, 20], theoretical bounds for Krylov subspace and Lanczos methods are much\nmore limited. As highlighted in [11],\n\n\u201cDespite decades of research on Lanczos methods, the theory for [randomized\npower iteration] is more complete and provides strong guarantees of excellent\naccuracy, whether or not there exist any gaps between the singular values.\u201d\n\nOur work addresses this issue, giving the first gap independent bound for a Krylov subspace method.\n\n2.2 Stronger Guarantees\nIn addition to runtime improvements, we target a much stronger notion of approximate SVD that is\nneeded for many applications, but for which no gap-independent analysis was known.\nSpecifically, as noted in [21], while intuitively stronger than Frobenius norm error, (1 + \u03b5)\nspectral norm low-rank approximation error does not guarantee any accuracy in Z for many matrices\u00b2.\nConsider A with its top k + 1 squared singular values all equal to 10 followed by a tail of smaller\nsingular values (e.g. 1000k at 1). \u2016A \u2212 A_k\u2016_2^2 = 10 but in fact \u2016A \u2212 ZZ^T A\u2016_2^2 = 10 for any rank\nk Z, leaving the spectral norm bound useless. At the same time, \u2016A \u2212 A_k\u2016_F^2 is large, so Frobenius\nerror is meaningless as well. For example, any Z obtains \u2016A \u2212 ZZ^T A\u2016_F^2 \u2264 (1.01)\u2016A \u2212 A_k\u2016_F^2.\nWith this scenario in mind, it is unsurprising that low-rank approximation guarantees fail as an\naccuracy measure in practice. 
We ran a standard sketch-and-solve approximate SVD algorithm\n(see Section 3) on SNAP/AMAZON0302, an Amazon product co-purchasing dataset [22, 23], and\nachieved very good low-rank approximation error in both norms for k = 30:\n\n\u2016A \u2212 ZZ^T A\u2016_F < 1.001\u2016A \u2212 A_k\u2016_F and \u2016A \u2212 ZZ^T A\u2016_2 < 1.038\u2016A \u2212 A_k\u2016_2.\n\nHowever, the approximate principal components given by Z are of significantly lower quality than\nA\u2019s true singular vectors (see Figure 1). We saw similar results for a number of other datasets.\n\nFigure 1: Poor per vector error (3) for SNAP/AMAZON0302 returned by a sketch-and-solve\napproximate SVD that gives very good low-rank approximation in both spectral and Frobenius norm.\n\nWe address this issue by introducing a per vector guarantee that requires each approximate singular\nvector z_1, ..., z_k to capture nearly as much variance as the corresponding true singular vector:\n\nPer Vector Error: \u2200i, |u_i^T AA^T u_i \u2212 z_i^T AA^T z_i| \u2264 \u03b5\u03c3_{k+1}^2. (3)\n\nThe error bound (3) is very strong in that it depends on \u03b5\u03c3_{k+1}^2, which is better than relative error\nfor A\u2019s large singular values. While it is reminiscent of the bounds sought in classical numerical\nanalysis [24], we stress that (3) does not require each z_i to converge to u_i in the presence of small\nsingular value gaps. In fact, we show that both randomized Block Krylov Iteration and our slightly\nmodified Simultaneous Iteration algorithm achieve (3) in gap-independent runtimes.\n\n2.3 Main Result\n\nOur contributions are summarized in Theorem 1. Its detailed proof is relegated to the full version of\nthis paper [25]. The runtimes are given in Theorems 6 and 7, and the three error bounds shown in\nTheorems 10, 11, and 12. 
In Section 4 we provide a sketch of the main ideas behind the result.\n\n\u00b2 In fact, it does not even imply (1 + \u03b5) Frobenius norm error.\n\n[Figure 1 plot: x-axis Index i, y-axis Singular Value; series \u03c3_i^2 = u_i^T(AA^T)u_i and z_i^T(AA^T)z_i]\n\nTheorem 1 (Main Theorem). With high probability, Algorithms 1 and 2 find approximate singular\nvectors Z = [z_1, ..., z_k] satisfying guarantees (1) and (2) for low-rank approximation and (3) for\nPCA. For error \u03b5, Algorithm 1 requires q = O(log d/\u03b5) iterations while Algorithm 2 requires q =\nO(log d/\u221a\u03b5) iterations. Excluding lower order terms, both algorithms run in time O(nnz(A)kq).\n\nIn the full version of this paper we also use our results to give an alternative analysis that does\ndepend on singular value gaps and can offer significantly faster convergence when A has decaying\nsingular values. It is possible to take further advantage of this result by running Algorithms 1 and 2\nwith a \u03a0 that has > k columns, a simple modification for accelerating either method.\nIn Section 6 we test both algorithms on a number of large datasets. We justify the importance of gap\nindependent bounds for predicting algorithm convergence and we show that Block Krylov Iteration\nin fact significantly outperforms the more popular Simultaneous Iteration.\n\n2.4 Comparison to Classical Bounds\nDecades of work has produced a variety of gap dependent bounds for Krylov methods [26]. Most\nrelevant to our work are bounds for block Krylov methods with block size equal to k [27]. Roughly\nspeaking, with randomized initialization, these results offer guarantees equivalent to our strong\nequation (3) for the top k singular directions after O(log(d/\u03b5)/\u221a(\u03c3_k/\u03c3_{k+1} \u2212 1)) iterations.\nThis bound is recovered in Section 7 of this paper\u2019s full version [25]. 
When the target accuracy \u03b5\nis smaller than the relative singular value gap (\u03c3_k/\u03c3_{k+1} \u2212 1), it is tighter than our gap independent\nresults. However, as discussed in Section 6, for high dimensional data problems where \u03b5 is set far\nabove machine precision, gap independent bounds more accurately predict required iteration count.\nPrior work also attempts to analyze algorithms with block size smaller than k [24]. While \u201csmall\nblock\u201d algorithms offer runtime advantages, it is well understood that with b duplicate singular\nvalues, it is impossible to recover the top k singular directions with a block of size < b [28]. More\ngenerally, large singular value clusters slow convergence, so any small block algorithm must have\nruntime dependence on the gaps between each adjacent pair of top singular values [29].\n\n3 Analyzing Simultaneous Iteration\nBefore discussing our proof of Theorem 1, we review prior work on Simultaneous Iteration to\ndemonstrate how it can achieve the spectral norm guarantee (2).\nAlgorithms for Frobenius norm error (1) typically work by sketching A into very few dimensions\nusing a Johnson-Lindenstrauss random projection matrix \u03a0 with poly(k/\u03b5) columns.\n\nA_{n\u00d7d} \u00d7 \u03a0_{d\u00d7poly(k/\u03b5)} = (A\u03a0)_{n\u00d7poly(k/\u03b5)}\n\n\u03a0 is usually a random Gaussian or (possibly sparse) random sign matrix and Z is computed using\nthe SVD of A\u03a0 or of A projected onto A\u03a0 [3, 5, 30]. This \u201csketch-and-solve\u201d approach is very\nefficient \u2013 the computation of A\u03a0 is easily parallelized and, regardless, pass-efficient in a single\nprocessor setting. Furthermore, once a small compression of A is obtained, it can be manipulated\nin fast memory for the final computation of Z.\nHowever, Frobenius norm error seems an inherent limitation of sketch-and-solve methods. 
The\nnoise from A\u2019s lower r \u2212 k singular values corrupts A\u03a0, making it impossible to extract a good\npartial SVD if the sum of these squared singular values (equal to \u2016A \u2212 A_k\u2016_F^2) is too large.\nIn order to achieve spectral norm error (2), Simultaneous Iteration must reduce this noise down to\nthe scale of \u03c3_{k+1} = \u2016A \u2212 A_k\u2016_2. It does this by working with the powered matrix A^q [31].\u00b3 By the\nspectral theorem, A^q has exactly the same singular vectors as A, but its singular values are equal to\nthose of A raised to the qth power. Powering spreads the values apart and accordingly, A^q\u2019s lower\nsingular values are relatively much smaller than its top singular values (see example in Figure 2a).\nSpecifically, q = O(log d / \u03b5) is sufficient to increase any singular value \u2265 (1 + \u03b5)\u03c3_{k+1} to be\nsignificantly (i.e. poly(d) times) larger than any value \u2264 \u03c3_{k+1}. This effectively denoises our problem \u2013\nif we use a sketching method to find a good Z for approximating A^q up to Frobenius norm error, Z\nwill have to align very well with every singular vector with value \u2265 (1 + \u03b5)\u03c3_{k+1}. It thus provides\nan accurate basis for approximating A up to small spectral norm error.\n\n\u00b3 For nonsymmetric matrices we work with (AA^T)^q A, but present the symmetric case here for simplicity.\n\nFigure 2: Replacing A with a matrix polynomial facilitates higher accuracy approximation.\n(a) A\u2019s singular values compared to those of A^q, rescaled to match on \u03c3_1. Notice the significantly reduced tail after \u03c3_8.\n(b) An O(1/\u221a\u03b5)-degree Chebyshev polynomial, T_{O(1/\u221a\u03b5)}(x), pushes low values nearly as close to zero as x^{O(1/\u03b5)}.\n\nComputing A^q directly is costly, so A^q\u03a0 is computed iteratively \u2013 start with a random \u03a0 and\nrepeatedly multiply by A on the left. 
Since even a rough Frobenius norm approximation for A^q\nsuffices, \u03a0 can be chosen to have just k columns. Each iteration thus takes O(nnz(A)k) time.\nWhen analyzing Simultaneous Iteration, [15] uses the following randomized sketch-and-solve result\nto find a Z that gives a coarse Frobenius norm approximation to B = A^q and therefore a good\nspectral norm approximation to A. The lemma is numbered for consistency with our full paper.\nLemma 4 (Frobenius Norm Low-Rank Approximation). Let B \u2208 R^{n\u00d7d} and \u03a0 \u2208 R^{d\u00d7k}, where\nthe entries of \u03a0 are independent Gaussians drawn from N(0, 1). If we let Z be an orthonormal\nbasis for span(B\u03a0), then with probability at least 99/100, for some fixed constant c,\n\n\u2016B \u2212 ZZ^T B\u2016_F^2 \u2264 c \u00b7 dk \u2016B \u2212 B_k\u2016_F^2.\n\nFor analyzing block methods, results like Lemma 4 can effectively serve as a replacement for earlier\nrandom initialization analysis that applies to single vector power and Krylov methods [32].\n\u03c3_{k+1}(A^q) \u2264 (1/poly(d)) \u03c3_m(A^q) for any m with \u03c3_m(A) \u2265 (1 + \u03b5)\u03c3_{k+1}(A). Plugging into Lemma 4:\n\n\u2016A^q \u2212 ZZ^T A^q\u2016_F^2 \u2264 cdk \u00b7 \u2211_{i=k+1}^{r} \u03c3_i^2(A^q) \u2264 cdk \u00b7 d \u00b7 \u03c3_{k+1}^2(A^q) \u2264 \u03c3_m^2(A^q)/poly(d).\n\nRearranging using Pythagorean theorem, we have \u2016ZZ^T A^q\u2016_F^2 \u2265 \u2016A^q\u2016_F^2 \u2212 \u03c3_m^2(A^q)/poly(d). That is, A^q\u2019s\nprojection onto Z captures nearly all of its Frobenius norm. This is only possible if Z aligns very\nwell with the top singular vectors of A^q and hence gives a good spectral norm approximation for A.\n\n4 Proof Sketch for Theorem 1\nThe intuition for beating Simultaneous Iteration with Block Krylov Iteration matches that of many\naccelerated iterative methods. Simply put, there are better polynomials than A^q for denoising tail\nsingular values. 
In particular, we can use a lower degree polynomial, allowing us to compute fewer\npowers of A and thus leading to an algorithm with fewer iterations. For example, an appropriately\nshifted q = O(log(d)/\u221a\u03b5) degree Chebyshev polynomial can push the tail of A nearly as close to\nzero as A^{O(log d/\u03b5)}, even if the long run growth of the polynomial is much lower (see Figure 2b).\nSpecifically, we prove the following scalar polynomial lemma in the full version of our paper [25],\nwhich can then be applied to effectively denoise A\u2019s singular value tail.\nLemma 5 (Chebyshev Minimizing Polynomial). For \u03b5 \u2208 (0, 1] and q = O(log d/\u221a\u03b5), there exists\na degree q polynomial p(x) such that p((1 + \u03b5)\u03c3_{k+1}) = (1 + \u03b5)\u03c3_{k+1} and,\n\n1) p(x) \u2265 x for x \u2265 (1 + \u03b5)\u03c3_{k+1}\n2) |p(x)| \u2264 \u03c3_{k+1}/poly(d) for x \u2264 \u03c3_{k+1}.\n\nFurthermore, we can choose the polynomial to only contain monomials with odd powers.\n\n[Figure 2 plots: (a) x-axis Index i, y-axis Singular Value \u03c3_i; series Spectrum of A and Spectrum of A^q. (b) x-axis x; series x^{O(1/\u03b5)} and T_{O(1/\u221a\u03b5)}(x)]\n\nBlock Krylov Iteration takes advantage of such polynomials by working with the Krylov subspace,\n\nK = [\u03a0 A\u03a0 A^2\u03a0 A^3\u03a0 ... A^q\u03a0],\n\nfrom which we can construct p_q(A)\u03a0 for any polynomial p_q(\u00b7) of degree q.\u2074 Since the polynomial\nfrom Lemma 5 must be scaled and shifted based on the value of \u03c3_{k+1}, we cannot easily compute it\ndirectly. Instead, we argue that the very best rank k approximation to A lying in the span of K at\nleast matches the approximation achieved by projecting onto the span of p_q(A)\u03a0. Finding this best\napproximation will therefore give a nearly optimal low-rank approximation to A.\nUnfortunately, there\u2019s a catch. 
Surprisingly, it is not clear how to efficiently compute the best spectral\nnorm error low-rank approximation to A lying in a given subspace (e.g. K\u2019s span) [14, 33]. This\nchallenge precludes an analysis of Krylov methods parallel to recent work on Simultaneous Iteration.\nNevertheless, since our analysis shows that projecting to Z captures nearly all the Frobenius norm\nof p_q(A), we can show that the best Frobenius norm low-rank approximation to A in the span of K\ngives good enough spectral norm approximation. By the following lemma, this optimal Frobenius\nnorm low-rank approximation is given by ZZ^T A, where Z is exactly the output of Algorithm 2.\nLemma 6 (Lemma 4.1 of [15]). Given A \u2208 R^{n\u00d7d} and Q \u2208 R^{n\u00d7m} with orthonormal columns,\n\n\u2016A \u2212 (QQ^T A)_k\u2016_F = \u2016A \u2212 Q(Q^T A)_k\u2016_F = min_{C | rank(C)=k} \u2016A \u2212 QC\u2016_F.\n\nQ(Q^T A)_k can be obtained using an SVD of the m \u00d7 m matrix M = Q^T(AA^T)Q. Specifically,\nletting M = \u016a\u03a3\u0304^2\u016a^T be the SVD of M, and Z = Q\u016a_k then Q(Q^T A)_k = ZZ^T A.\n\n4.1 Stronger Per Vector Error Guarantees\nAchieving the per vector guarantee of (3) requires a more nuanced understanding of how\nSimultaneous Iteration and Block Krylov Iteration denoise the spectrum of A. The analysis for spectral norm\nlow-rank approximation relies on the fact that A^q (or p_q(A) for Block Krylov Iteration) blows up\nany singular value \u2265 (1 + \u03b5)\u03c3_{k+1} to much larger than any singular value \u2264 \u03c3_{k+1}. This ensures that\nour output Z aligns very well with the singular vectors corresponding to these large singular values.\nIf \u03c3_k \u2265 (1 + \u03b5)\u03c3_{k+1}, then Z aligns well with all top k singular vectors of A and we get good\nFrobenius norm error and the per vector guarantee (3). 
Unfortunately, when there is a small gap\nbetween \u03c3_k and \u03c3_{k+1}, Z could miss intermediate singular vectors whose values lie between \u03c3_{k+1}\nand (1 + \u03b5)\u03c3_{k+1}. This is the case where gap dependent guarantees of classical analysis break down.\nHowever, A^q or, for Block Krylov Iteration, some q-degree polynomial in our Krylov subspace, also\nsignificantly separates singular values > \u03c3_{k+1} from those < (1 \u2212 \u03b5)\u03c3_{k+1}. Thus, each column of Z\nat least aligns with A nearly as well as u_{k+1}. So, even if we miss singular values between \u03c3_{k+1} and\n(1 + \u03b5)\u03c3_{k+1}, they will be replaced with approximate singular values > (1 \u2212 \u03b5)\u03c3_{k+1}, enough for (3).\nFor Frobenius norm low-rank approximation (1), we prove that the degree to which Z falls outside of\nthe span of A\u2019s top k singular vectors depends on the number of singular values between \u03c3_{k+1} and\n(1 \u2212 \u03b5)\u03c3_{k+1}. These are the values that could be \u2018swapped in\u2019 for the true top k singular values. Since\ntheir weight counts towards A\u2019s tail, our total loss compared to optimal is at worst \u03b5\u2016A \u2212 A_k\u2016_F^2.\n\n5 Implementation and Runtimes\nFor both Algorithm 1 and 2, \u03a0 can be replaced by a random sign matrix, or any matrix achieving\nthe guarantee of Lemma 4. \u03a0 may also be chosen with p > k columns. In our full paper [25], we\ndiscuss in detail how this approach can give improved accuracy.\n\n5.1 Simultaneous Iteration\nIn our implementation we set Z = Q\u016a_k, which is necessary for achieving per vector guarantees for\napproximate PCA. However, for near optimal low-rank approximation, we can simply set Z = Q.\nProjecting A to Q\u016a_k is equivalent to projecting to Q as these matrices have the same column spans.\nSince powering A spreads its singular values, K = (AA^T)^q A\u03a0 could be poorly conditioned. To\nimprove stability we orthonormalize K after every iteration (or every few iterations). 
This does not\nchange K\u2019s column span, so it gives an equivalent algorithm in exact arithmetic.\n\n\u2074 Algorithm 2 in fact only constructs odd powered terms in K, which is sufficient for our choice of p_q(x).\n\nTheorem 7 (Simultaneous Iteration Runtime). Algorithm 1 runs in time\n\nO(nnz(A)k log(d)/\u03b5 + nk^2 log(d)/\u03b5).\n\nProof. Computing K requires first multiplying A by \u03a0, which takes O(nnz(A)k) time. Computing\n(AA^T)^i A\u03a0 given (AA^T)^{i\u22121} A\u03a0 then takes O(nnz(A)k) time to first multiply our (n \u00d7 k)\nmatrix by A^T and then by A. Reorthogonalizing after each iteration takes O(nk^2) time via\nGram-Schmidt. This gives a total runtime of O(nnz(A)kq + nk^2 q) for computing K. Finding Q takes\nO(nk^2) time. Computing M by multiplying from left to right requires O(nnz(A)k + nk^2) time.\nM\u2019s SVD then requires O(k^3) time using classical techniques. Finally, multiplying \u016a_k by Q takes\ntime O(nk^2). Setting q = \u0398(log d/\u03b5) gives the claimed runtime.\n\n5.2 Block Krylov Iteration\nIn the traditional Block Lanczos algorithm, one starts by computing an orthonormal basis for A\u03a0,\nthe first block in K. Bases for subsequent blocks are computed from previous blocks using a three\nterm recurrence that ensures Q^T AA^T Q is block tridiagonal, with k \u00d7 k sized blocks [18]. This\ntechnique can be useful if qk is large, since it is faster to compute the top singular vectors of a block\ntridiagonal matrix. However, computing Q using a recurrence can introduce a number of stability\nissues, and additional steps may be required to ensure that the matrix remains orthogonal [28].\nAn alternative, used in [1], [19], and our Algorithm 2, is to compute K explicitly and then find Q\nusing a QR decomposition. This method does not guarantee that Q^T AA^T Q is block tridiagonal,\nbut avoids stability issues. 
Furthermore, if qk is small, taking the SVD of Q^T AA^T Q will still be\nfast and typically dominated by the cost of computing K.\nAs with Simultaneous Iteration, we orthonormalize each block of K after it is computed, avoiding\npoorly conditioned blocks and giving an equivalent algorithm in exact arithmetic.\nTheorem 8 (Block Krylov Iteration Runtime). Algorithm 2 runs in time\n\nO(nnz(A)k log(d)/\u221a\u03b5 + nk^2 log^2(d)/\u03b5 + k^3 log^3(d)/\u03b5^{3/2}).\n\nProof. Computing K, including reorthogonalization, requires O(nnz(A)kq + nk^2 q) time. The\nremaining steps are analogous to those in Simultaneous Iteration except somewhat more costly as we\nwork with a k \u00b7 q rather than k dimensional subspace. Finding Q takes O(n(kq)^2) time. Computing\nM takes O(nnz(A)(kq) + n(kq)^2) time and its SVD then requires O((kq)^3) time. Finally,\nmultiplying \u016a_k by Q takes time O(nk(kq)). Setting q = \u0398(log d/\u221a\u03b5) gives the claimed runtime.\n\n6 Experiments\nWe close with several experimental results. A variety of empirical papers, not to mention widespread\nadoption, already justify the use of randomized SVD algorithms. Prior work focuses in particular on\nbenchmarking Simultaneous Iteration [19, 11] and, due to its improved accuracy over\nsketch-and-solve approaches, this algorithm is popular in practice [10, 16]. As such, we focus on demonstrating\nthat for many data problems Block Krylov Iteration can offer significantly better convergence.\nWe implement both algorithms in MATLAB using Gaussian random starting matrices with exactly\nk columns. We explicitly compute K for both algorithms, as described in Section 5, and use\nreorthonormalization at each iteration to improve stability [34]. We test the algorithms with varying\niteration count q on three common datasets, SNAP/AMAZON0302 [22, 23], SNAP/EMAIL-ENRON\n[22, 35], and 20 NEWSGROUPS [36], computing column principal components in all cases. 
We plot\nerror vs. iteration count for metrics (1), (2), and (3) in Figure 3. For per vector error (3), we plot the\nmaximum deviation amongst all top k approximate principal components (relative to \u03c3_{k+1}).\nUnsurprisingly, both algorithms obtain very accurate Frobenius norm error, \u2016A \u2212 ZZ^T A\u2016_F / \u2016A \u2212 A_k\u2016_F,\nwith very few iterations. This is our intuitively weakest guarantee and, in the presence of a\nheavy singular value tail, both iterative algorithms will outperform the worst case analysis.\nOn the other hand, for spectral norm low-rank approximation and per vector error, we confirm that\nBlock Krylov Iteration converges much more rapidly than Simultaneous Iteration, as predicted by\nour theoretical analysis. It is often possible to achieve nearly optimal error with < 8 iterations,\nwhereas getting to within say 1% error with Simultaneous Iteration can take much longer.\n\nFigure 3: Low-rank approximation and per vector error convergence rates for Algorithms 1 and 2.\n(a) SNAP/AMAZON0302, k = 30\n(b) SNAP/EMAIL-ENRON, k = 10\n(c) 20 NEWSGROUPS, k = 20\n(d) 20 NEWSGROUPS, k = 20, runtime cost\n\nThe final plot in Figure 3 shows error versus runtime for the 11269 \u00d7 15088 dimensional 20 NEWSGROUPS\ndataset. We averaged over 7 trials and ran the experiments on a commodity laptop with\n16GB of memory. As predicted, because its additional memory overhead and post-processing costs\nare small compared to the cost of the large matrix multiplication required for each iteration, Block\nKrylov Iteration outperforms Simultaneous Iteration for small \u03b5.\nMore generally, these results justify the importance of convergence bounds that are independent of\nsingular value gaps. 
Our analysis in Section 6 of the full paper predicts that, once \u03b5 is small in\ncomparison to the gap \u03c3_k/\u03c3_{k+1} \u2212 1, we should see much more rapid convergence since q will depend\non log(1/\u03b5) instead of 1/\u03b5. However, for Simultaneous Iteration, we do not see this behavior with\nSNAP/AMAZON0302 and it only just begins to emerge for 20 NEWSGROUPS.\nWhile all three datasets have rapid singular value decay, a careful look confirms that their singular\nvalue gaps are actually quite small! For example, \u03c3_k/\u03c3_{k+1} \u2212 1 is .004 for SNAP/AMAZON0302\nand .011 for 20 NEWSGROUPS, in comparison to .042 for SNAP/EMAIL-ENRON. Accordingly, the\nfrequent claim that singular value gaps can be taken as constant is insufficient, even for small \u03b5.\nReferences\n[1] Vladimir Rokhlin, Arthur Szlam, and Mark Tygert. A randomized algorithm for principal component analysis. SIAM Journal on Matrix Analysis and Applications, 31(3):1100\u20131124, 2009.\n\n[2] Nathan Halko, Per-Gunnar Martinsson, and Joel Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217\u2013288, 2011.\n\n[3] Tam\u00e1s Sarl\u00f3s. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2006.\n\n[4] Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert. A randomized algorithm for the approximation of matrices. Technical Report 1361, Yale University, 2006.\n\n[5] Kenneth Clarkson and David Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC), pages 81\u201390, 2013.\n\n[6] Antoine Liutkus. Randomized SVD, 2014. MATLAB Central File Exchange.\n\n[7] Daisuke Okanohara. redsvd: RandomizED SVD. 
https://code.google.com/p/redsvd/, 2010.\n\n[8] David Hall et al. ScalaNLP: Breeze. http://www.scalanlp.org/, 2009.\n\n[9] IBM Research Division, Skylark Team. libskylark: Sketching-based Distributed Matrix Computations for Machine Learning. IBM Corporation, Armonk, NY, 2014.\n\n[10] F. Pedregosa et al. Scikit-learn: Machine learning in Python. JMLR, 12:2825\u20132830, 2011.\n\n[11] Arthur Szlam, Yuval Kluger, and Mark Tygert. An implementation of a randomized algorithm for principal component analysis. arXiv:1412.3510, 2014.\n\n[12] Zohar Karnin and Edo Liberty. Online PCA with spectral bounds. In Proceedings of the 28th Annual Conference on Computational Learning Theory (COLT), pages 505\u2013509, 2015.\n\n[13] Rafi Witten and Emmanuel J. Cand\u00e8s. Randomized algorithms for low-rank matrix factorizations: Sharp performance bounds. 
Algorithmica, 31(3):1\u201318, 2014.\n\n[14] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Near-optimal column-based matrix reconstruction. SIAM Journal on Computing, 43(2):687\u2013717, 2014.\n\n[15] David P. Woodruff. Sketching as a tool for numerical linear algebra. Found. Trends in Theoretical Computer Science, 10(1-2):1\u2013157, 2014.\n\n[16] Andrew Tulloch. Fast randomized singular value decomposition. http://research.facebook.com/blog/294071574113354/fast-randomized-svd/, 2014.\n\n[17] Jane Cullum and W.E. Donath. A block Lanczos algorithm for computing the q algebraically largest eigenvalues and a corresponding eigenspace of large, sparse, real symmetric matrices. In IEEE Conference on Decision and Control including the 13th Symposium on Adaptive Processes, pages 505\u2013509, 1974.\n\n[18] Gene Golub and Richard Underwood. The block Lanczos method for computing eigenvalues. Mathematical Software, (3):361\u2013377, 1977.\n\n[19] Nathan Halko, Per-Gunnar Martinsson, Yoel Shkolnisky, and Mark Tygert. An algorithm for the principal component analysis of large data sets. SIAM Journal on Scientific Computing, 33(5):2580\u20132594, 2011.\n\n[20] Nathan Halko. Randomized methods for computing low-rank approximations of matrices. PhD thesis, U. of Colorado, 2012.\n\n[21] Ming Gu. Subspace iteration randomization and singular value problems. arXiv:1408.2208, 2014.\n\n[22] Timothy A. Davis and Yifan Hu. The university of florida sparse matrix collection. ACM Transactions on Mathematical Software, 38(1):1:1\u20131:25, December 2011.\n\n[23] Jure Leskovec, Lada A. Adamic, and Bernardo A. Huberman. The dynamics of viral marketing. ACM Transactions on the Web, 1(1), May 2007.\n\n[24] Y. Saad. On the rates of convergence of the Lanczos and the Block-Lanczos methods. SIAM Journal on Numerical Analysis, 17(5):687\u2013706, 1980.\n\n[25] Cameron Musco and Christopher Musco. 
Randomized block Krylov methods for stronger and faster approximate singular value decomposition. arXiv:1504.05477, 2015.\n\n[26] Yousef Saad. Numerical Methods for Large Eigenvalue Problems: Revised Edition, volume 66. 2011.\n\n[27] Gene Golub, Franklin Luk, and Michael Overton. A block Lanczos method for computing the singular values and corresponding singular vectors of a matrix. ACM Trans. Math. Softw., 7(2):149\u2013169, 1981.\n\n[28] G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press, 3rd edition, 1996.\n\n[29] Ren-Cang Li and Lei-Hong Zhang. Convergence of the block Lanczos method for eigenvalue clusters. Numerische Mathematik, 131(1):83\u2013113, 2015.\n\n[30] Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the 47th Annual ACM Symposium on Theory of Computing (STOC), 2015.\n\n[31] Friedrich L. Bauer. Das verfahren der treppeniteration und verwandte verfahren zur l\u00f6sung algebraischer eigenwertprobleme. Zeitschrift f\u00fcr angewandte Mathematik und Physik ZAMP, 8(3):214\u2013235, 1957.\n\n[32] J. Kuczy\u0144ski and H. Wo\u017aniakowski. Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM Journal on Matrix Analysis and Applications, 13(4):1094\u20131122, 1992.\n\n[33] Kin Cheong Sou and Anders Rantzer. On the minimum rank of a generalized matrix approximation problem in the maximum singular value norm. In Proceedings of the 19th International Symposium on Mathematical Theory of Networks and Systems (MTNS), 2010.\n\n[34] Per-Gunnar Martinsson, Arthur Szlam, and Mark Tygert. Normalized power iterations for the computation of SVD, 2010. NIPS Workshop on Low-rank Methods for Large-scale Machine Learning.\n\n[35] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. 
Graphs over time: Densification laws, shrinking diameters and possible explanations. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 177\u2013187, 2005.\n\n[36] Jason Rennie. 20 newsgroups. http://qwone.com/~jason/20Newsgroups/, May 2015.\n", "award": [], "sourceid": 850, "authors": [{"given_name": "Cameron", "family_name": "Musco", "institution": "Massachusetts Institute of Technology"}, {"given_name": "Christopher", "family_name": "Musco", "institution": "Mass. Institute of Technology"}]}