{"title": "Learning-Based Low-Rank Approximations", "book": "Advances in Neural Information Processing Systems", "page_first": 7402, "page_last": 7412, "abstract": "We introduce a \u201clearning-based\u201d algorithm for the low-rank decomposition problem: given an $n \\times d$ matrix $A$, and a parameter $k$,  compute a rank-$k$ matrix $A'$ that  minimizes the approximation loss $\\|A-A'\\|_F$. The algorithm uses a training set of input matrices in order to optimize its performance. \nSpecifically, some of the most efficient approximate algorithms for computing low-rank approximations proceed by computing a projection $SA$, where  $S$ is a sparse random $m \\times n$ \u201csketching matrix\u201d, and then performing the singular value decomposition of $SA$. We \n show how to replace the random matrix $S$ with a \u201clearned\u201d matrix of the same sparsity to reduce the error. \nOur experiments show that,  for multiple types of data sets, \na learned sketch matrix can substantially reduce the approximation loss compared to a random matrix $S$, sometimes up to one order of magnitude. We also study mixed matrices where only some of the rows are trained and the remaining ones are random, and show that matrices still offer improved performance while retaining worst-case guarantees. \n\nFinally, to understand the theoretical aspects of our approach, we study the special case of $m=1$. In particular, we give an approximation algorithm for minimizing the empirical loss, with approximation factor depending on the stable rank of matrices in the training set. 
We also show generalization bounds for the sketch matrix learning problem.", "full_text": "Learning-Based Low-Rank Approximations\n\nPiotr Indyk\nCSAIL, MIT\n\nAli Vakilian\u2217\n\nUniversity of Wisconsin - Madison\n\nYang Yuan\u2217\n\nTsinghua University\n\nindyk@mit.edu\n\nvakilian@wisc.edu\n\nyuanyang@tsinghua.edu.cn\n\nAbstract\n\nWe introduce a \u201clearning-based\u201d algorithm for the low-rank decomposition problem:\ngiven an n \u00d7 d matrix A, and a parameter k, compute a rank-k matrix A\u2032 that\nminimizes the approximation loss \u2016A \u2212 A\u2032\u2016F . The algorithm uses a training set of\ninput matrices in order to optimize its performance. Speci\ufb01cally, some of the most\nef\ufb01cient approximate algorithms for computing low-rank approximations proceed\nby computing a projection SA, where S is a sparse random m \u00d7 n \u201csketching\nmatrix\u201d, and then performing the singular value decomposition of SA. We show\nhow to replace the random matrix S with a \u201clearned\u201d matrix of the same sparsity\nto reduce the error.\nOur experiments show that, for multiple types of data sets, a learned sketch matrix\ncan substantially reduce the approximation loss compared to a random matrix S,\nsometimes by one order of magnitude. We also study mixed matrices where only\nsome of the rows are trained and the remaining ones are random, and show that\nmatrices still offer improved performance while retaining worst-case guarantees.\nFinally, to understand the theoretical aspects of our approach, we study the special\ncase of m = 1. In particular, we give an approximation algorithm for minimizing\nthe empirical loss, with approximation factor depending on the stable rank of\nmatrices in the training set. We also show generalization bounds for the sketch\nmatrix learning problem.\n\n1 Introduction\n\nThe success of modern machine learning made it applicable to problems that lie outside of the\nscope of \u201cclassic AI\u201d. 
In particular, there has been a growing interest in using machine learning\nto improve the performance of \u201cstandard\u201d algorithms, by \ufb01ne-tuning their behavior to adapt to the\nproperties of the input distribution, see e.g., [Wang et al., 2016, Khalil et al., 2017, Kraska et al., 2018,\nBalcan et al., 2018, Lykouris and Vassilvitskii, 2018, Purohit et al., 2018, Gollapudi and Panigrahi,\n2019, Mitzenmacher, 2018, Mousavi et al., 2015, Baldassarre et al., 2016, Bora et al., 2017, Metzler\net al., 2017, Hand and Voroninski, 2018, Khani et al., 2019, Hsu et al., 2019]. This \u201clearning-based\u201d\napproach to algorithm design has attracted a considerable attention over the last few years, due to its\npotential to signi\ufb01cantly improve the ef\ufb01ciency of some of the most widely used algorithmic tasks.\nMany applications involve processing streams of data (video, data logs, customer activity etc) by\nexecuting the same algorithm on an hourly, daily or weekly basis. These data sets are typically not\n\u201crandom\u201d or \u201cworst-case\u201d; instead, they come from some distribution which does not change rapidly\nfrom execution to execution. This makes it possible to design better algorithms tailored to the speci\ufb01c\ndata distribution, trained on past instances of the problem.\nThe method has been particularly successful in the context of compressed sensing. In the latter\nframework, the goal is to recover an approximation to an n-dimensional vector x, given its \u201clinear\nmeasurement\u201d of the form Sx, where S is an m \u00d7 n matrix. 
Theoretical results [Donoho, 2006,\nCand\u00e8s et al., 2006] show that, if the matrix S is selected at random, it is possible to recover the\nk largest coef\ufb01cients of x with high probability using a matrix S with m = O(k log n) rows. This\nguarantee is general and applies to arbitrary vectors x. However, if vectors x are selected from some\nnatural distribution (e.g., they represent images), recent works [Mousavi et al., 2015, Baldassarre\net al., 2016, Metzler et al., 2017] show that one can use samples from that distribution to compute\nmatrices S that improve over a completely random matrix in terms of the recovery error.\n\n\u2217This work was mostly done when the second and third authors were at MIT.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nCompressed sensing is an example of a broader class of problems which can be solved using random\nprojections. Another well-studied problem of this type is low-rank decomposition: given an n \u00d7 d\nmatrix A, and a parameter k, compute a rank-k matrix\n\n$$[A]_k = \\operatorname{argmin}_{A' :\\, \\mathrm{rank}(A') \\le k} \\|A - A'\\|_F.$$\n\nLow-rank approximation is one of the most widely used tools in massive data analysis, machine\nlearning and statistics, and has been a subject of many algorithmic studies. In particular, multiple\nalgorithms developed over the last decade use the \u201csketching\u201d approach, see e.g., [Sarlos, 2006,\nWoolfe et al., 2008, Halko et al., 2011, Clarkson and Woodruff, 2009, 2017, Nelson and Nguy\u00ean,\n2013, Meng and Mahoney, 2013, Boutsidis and Gittens, 2013, Cohen et al., 2015]. 
The idea is to\nuse ef\ufb01ciently computable random projections (a.k.a., \u201csketches\u201d) to reduce the problem size before\nperforming low-rank decomposition, which makes the computation more space and time ef\ufb01cient.\nFor example, [Sarlos, 2006, Clarkson and Woodruff, 2009] show that if S is a random matrix of size\nm \u00d7 n chosen from an appropriate distribution2, for m depending on \u03b5, then one can recover a rank-k\nmatrix A\u2032 such that\n\n$$\\|A - A'\\|_F \\le (1 + \\varepsilon)\\,\\|A - [A]_k\\|_F$$\n\nby performing an SVD on SA \u2208 Rm\u00d7d followed by some post-processing. Typically the sketch\nlength m is small, so the matrix SA can be stored using little space (in the context of streaming\nalgorithms) or ef\ufb01ciently communicated (in the context of distributed algorithms). Furthermore, the\nSVD of SA can be computed ef\ufb01ciently, especially after another round of sketching, reducing the\noverall computation time. See the survey [Woodruff, 2014] for an overview of these developments.\nIn light of the aforementioned work on learning-based compressive sensing, it is natural to ask\nwhether similar improvements in performance could be obtained for other sketch-based algorithms,\nnotably for low-rank decompositions. In particular, reducing the sketch length m while preserving its\naccuracy would make sketch-based algorithms more ef\ufb01cient. Alternatively, one could make sketches\nmore accurate for the same values of m. This is the problem we address in this paper.\n\nOur Results. Our main \ufb01nding is that learned sketch matrices can indeed yield (much) more\naccurate low-rank decompositions than purely random matrices. We focus our study on a streaming\nalgorithm for low-rank decomposition due to [Sarlos, 2006, Clarkson and Woodruff, 2009],\ndescribed in more detail in Section 2. Speci\ufb01cally, suppose we have a training set of matrices\nTr = {A1, . . . 
, AN} sampled from some distribution D. Based on this training set, we compute a\nmatrix S\u2217 that (locally) minimizes the empirical loss\n\n$$\\sum_i \\|A_i - \\mathrm{SCW}(S^*, A_i)\\|_F \\qquad (1)$$\n\nwhere SCW(S\u2217, Ai) denotes the output of the aforementioned Sarlos-Clarkson-Woodruff streaming\nlow-rank decomposition algorithm on matrix Ai using the sketch matrix S\u2217. Once the sketch matrix\nS\u2217 is computed, it can be used instead of a random sketch matrix in all future executions of the SCW\nalgorithm.\nWe demonstrate empirically that, for multiple types of data sets, an optimized sketch matrix S\u2217 can\nsubstantially reduce the approximation loss compared to a random matrix S, sometimes by one order\nof magnitude (see Figure 2 or 3). Equivalently, the optimized sketch matrix can achieve the same\napproximation loss for lower values of m which results in sketching matrices with lower space usage.\nNote that since we augment a streaming algorithm, our main focus is on improving its space usage\n(which in the distributed setting translates into the amount of communication). The latter is O(md),\nthe size of SA.\n\n2Initial algorithms used matrices with independent sub-gaussian entries or randomized Fourier/Hadamard\nmatrices [Sarlos, 2006, Woolfe et al., 2008, Halko et al., 2011]. Starting from the seminal work of [Clarkson\nand Woodruff, 2017], researchers began to explore sparse binary matrices, see e.g., [Nelson and Nguy\u00ean, 2013,\nMeng and Mahoney, 2013]. In this paper we mostly focus on the latter distribution.\n\nA possible disadvantage of learned sketch matrices is that an algorithm that uses them no longer\noffers worst-case guarantees. As a result, if such an algorithm is applied to an input matrix that\ndoes not conform to the training distribution, the results might be worse than if random matrices\nwere used. 
To alleviate this issue, we also study mixed sketch matrices, where (say) half of the rows\nare trained and the other half are random. We observe that if such matrices are used in conjunction\nwith the SCW algorithm, its results are no worse than if only the random part of the matrix was\nused (Theorem 1 in Section 4)3. Thus, the resulting algorithm inherits the worst-case performance\nguarantees of the random part of the sketching matrix. At the same time, we show that mixed matrices\nstill substantially reduce the approximation loss compared to random ones, in some cases nearly\nmatching the performance of \u201cpure\u201d learned matrices with the same number of rows. Thus, mixed\nrandom matrices offer \u201cthe best of both worlds\u201d: improved performance for matrices from the training\ndistribution, and worst-case guarantees otherwise.\nFinally, in order to understand the theoretical aspects of our approach further, we study the special\ncase of m = 1. This corresponds to the case where the sketch matrix S is just a single vector. Our\nresults are two-fold:\n\n\u2022 We give an approximation algorithm for minimizing the empirical loss as in Equation 1,\nwith an approximation factor depending on the stable rank of matrices in the training set.\nSee Appendix B.\n\u2022 Under certain assumptions about the robustness of the loss minimizer, we show generaliza-\n\ntion bounds for the solution computed over the training set. See Appendix C.\n\nThe theoretical results on the case of m = 1 are deferred to the full version of this paper.\n\n1.1 Related work\n\nAs outlined in the introduction, over the last few years there has been multiple papers exploring the\nuse of machine learning methods to improve the performance of \u201cstandard\u201d algorithms. 
Among\nthose, the closest to the topic of our paper are the works on learning-based compressive sensing, such\nas [Mousavi et al., 2015, Baldassarre et al., 2016, Bora et al., 2017, Metzler et al., 2017], and on\nlearning-based streaming algorithms [Hsu et al., 2019]. Since neither of these two lines of research\naddresses computing matrix spectra, the technical development therein was quite different from ours.\nIn this paper we focus on learning-based optimization of low-rank approximation algorithms that use\nlinear sketches, i.e., map the input matrix A into SA and perform computation on the latter. There\nare other sketching algorithms for low-rank approximation that involve non-linear sketches [Liberty,\n2013, Ghashami and Phillips, 2014, Ghashami et al., 2016]. The bene\ufb01t of linear sketches is that they\nare easy to update under linear changes to the matrix A, and (in the context of our work) that they are\neasy to differentiate, making it possible to compute the gradient of the loss function as in Equation 1.\nWe do not know whether it is possible to use our learning-based approach for non-linear sketches, but\nwe believe this is an interesting direction for future research.\n\n2 Preliminaries\nNotation. Consider a distribution D on matrices A \u2208 Rn\u00d7d. We de\ufb01ne the training set as\n{A1,\u00b7\u00b7\u00b7 , AN} sampled from D. For matrix A, its singular value decomposition (SVD) can be\nwritten as A = U \u03a3V \u22a4 such that both U \u2208 Rn\u00d7n and V \u2208 Rd\u00d7d have orthonormal columns and\n\u03a3 = diag{\u03bb1,\u00b7\u00b7\u00b7 , \u03bbd} is a diagonal matrix with nonnegative entries. Moreover, if rank(A) = r,\nthen the \ufb01rst r columns of U are an orthonormal basis for the column space of A (we denote it as\ncolsp(A)), the \ufb01rst r columns of V are an orthonormal basis for the row space of A (we denote it\nas rowsp(A))4 and \u03bbi = 0 for i > r. 
In many applications it is quicker and more economical to\ncompute the compact SVD, which contains only the rows and columns corresponding to the non-zero\nsingular values of \u03a3: A = U c\u03a3c(V c)\u22a4 where U c \u2208 Rn\u00d7r, \u03a3c \u2208 Rr\u00d7r and V c \u2208 Rd\u00d7r.\n\n3We note that this property is non-trivial, in the sense that it does not automatically hold for all sketching\nalgorithms. See Section 4 for further discussion.\n4The remaining columns of U and V respectively are orthonormal bases for the nullspace of A\u22a4 and A.\n\nHow sketching works. We start by describing the SCW algorithm for low-rank matrix approximation,\nsee Algorithm 1. The algorithm computes the singular value decomposition of SA = U \u03a3V \u22a4,\nthen computes the best rank-k approximation of AV , and \ufb01nally outputs [AV ]kV \u22a4 as a rank-k\napproximation of A. We emphasize that Sarlos and Clarkson-Woodruff proposed Algorithm 1 with random\nsketching matrices S. In this paper, we follow the same framework but use learned (or partially\nlearned) matrices.\n\nAlgorithm 1 Rank-k approximation of a matrix A using a sketch matrix S (refer to Section 4.1.1 of\n[Clarkson and Woodruff, 2009])\n1: Input: A \u2208 Rn\u00d7d, S \u2208 Rm\u00d7n\n2: U, \u03a3, V \u22a4 \u2190 COMPACTSVD(SA) \u25b7 {r = rank(SA), U \u2208 Rm\u00d7r, V \u2208 Rd\u00d7r}\n3: Return: [AV ]kV \u22a4\n\nNote that if m is much smaller than d and n, the space bound of this algorithm is signi\ufb01cantly\nbetter than when computing a rank-k approximation for A in the na\u00efve way. Thus, minimizing m\nautomatically reduces the space usage of the algorithm.\n\nSketching matrix. We use a matrix S that is sparse5. Speci\ufb01cally, each column of S has exactly one\nnon-zero entry, which is either +1 or \u22121. This means that the fraction of non-zero entries in S is\n1/m. Therefore, one can use a vector to represent S, which is very memory ef\ufb01cient. 
It is worth\nnoting, however, that after multiplying the sketching matrix S with other matrices, the resulting matrix\n(e.g., SA) is in general not sparse.\n\n3 Training Algorithm\n\nIn this section, we describe our learning-based algorithm for computing a data dependent sketch\nS. The main idea is to use backpropagation to compute the stochastic gradient of the rank-k\napproximation loss in Equation 1 with respect to S, where the initial value of S is the same random\nsparse matrix used in SCW. Once we have the stochastic gradient, we can run stochastic gradient\ndescent (SGD) to optimize S, in order to improve the loss. Our algorithm maintains the\nsparse structure of S, and only optimizes the values of the n non-zero entries (initially +1 or \u22121).\n\nAlgorithm 2 Differentiable SVD implementation\n1: Input: A_1 \u2208 Rm\u00d7d (m < d)\n2: U, \u03a3, V \u2190 {}, {}, {}\n3: for i \u2190 1 . . . m do\n4:   v_1 \u2190 random initialization in Rd\n5:   for t \u2190 1 . . . T do\n6:     v_{t+1} \u2190 A_i\u22a4 A_i v_t / \u2016A_i\u22a4 A_i v_t\u20162   \u25b7 {power method}\n7:   end for\n8:   V [i] \u2190 v_{T+1}\n9:   \u03a3[i] \u2190 \u2016A_i V [i]\u20162\n10:  U [i] \u2190 A_i V [i] / \u03a3[i]\n11:  A_{i+1} \u2190 A_i \u2212 \u03a3[i] U [i] V [i]\u22a4\n12: end for\n13: Return: U, \u03a3, V\n\nFigure 1: i-th iteration of power method\n\nHowever, the standard SVD implementation (step 2 in Algorithm 1) is not differentiable, which means\nwe cannot get the gradient in the straightforward way. To make SVD implementation differentiable,\nwe use the fact that the SVD procedure can be represented as m individual top singular value\ndecompositions (see e.g. 
[Allen-Zhu and Li, 2016]), and that every top singular value decomposition\ncan be computed using the power method. See Figure 1 and Algorithm 2. We store the results of the\ni-th iteration into the i-th entry of the lists U, \u03a3, V , and \ufb01nally concatenate all entries together to get\nthe matrix (or matrix diagonal) format of U, \u03a3, V . This allows gradients to \ufb02ow easily.\n\n5The original papers [Sarlos, 2006, Clarkson and Woodruff, 2009] used dense matrices, but the work of\n[Clarkson and Woodruff, 2017] showed that sparse matrices work as well. We use sparse matrices since they are\nmore ef\ufb01cient to train and to operate on.\n\nDue to the extremely long computational chain, it is infeasible to write down the explicit form of\nloss function or the gradients. However, just like how modern deep neural networks compute their\ngradients, we used the autograd feature in PyTorch to numerically compute the gradient with respect\nto the sketching matrix S.\nWe emphasize again that our method is only optimizing S for the training phase. After S is fully\ntrained, we still call Algorithm 1 for low rank approximation, which has exactly the same running time\nas the SCW algorithm, but with better performance (i.e., the quality of the returned rank-k matrix).\nWe remark that the time complexity of SCW algorithm is O(nmd) assuming k \u2264 m \u2264 min(n, d).\n\n4 Worst Case Bound\n\nIn this section, we show that concatenating two sketching matrices S1 and S2 (of size respectively\nm1\u00d7n and m2\u00d7n) into a single matrix S\u2217 (of size (m1+m2)\u00d7n) will not increase the approximation\nloss of the \ufb01nal rank-k solution computed by Algorithm 1 compared to the case in which only one of\nS1 or S2 is used as the sketching matrix. 
In the rest of this section, the sketching matrix S\u2217 denotes\nthe concatenation of S1 and S2 as follows:\n\n$$S_* = \\begin{bmatrix} S_1 \\\\ S_2 \\end{bmatrix}, \\qquad S_1 \\in \\mathbb{R}^{m_1 \\times n},\\; S_2 \\in \\mathbb{R}^{m_2 \\times n},\\; S_* \\in \\mathbb{R}^{(m_1+m_2) \\times n}.$$\n\nFormally, we prove the following theorem on the worst case performance of mixed matrices.\nTheorem 1. Let $U_*\\Sigma_*V_*^\\top$ and $U_1\\Sigma_1V_1^\\top$ respectively denote the SVD of $S_*A$ and $S_1A$. Then,\n\n$$\\|[AV_*]_k V_*^\\top - A\\|_F \\le \\|[AV_1]_k V_1^\\top - A\\|_F.$$\n\nIn particular, the above theorem implies that the output of Algorithm 1 with the sketching matrix S\u2217\nis at least as good a rank-k approximation to A as the output of the algorithm with S1. In the rest of\nthis section we prove Theorem 1.\nBefore proving the main theorem, we state the following helpful lemma.\nLemma 1 (Lemma 4.3 in [Clarkson and Woodruff, 2009]). Suppose that V is a matrix with orthonormal\ncolumns. Then, a best rank-k approximation to A in colsp(V ) (i.e., with row space contained in\ncolsp(V )) is given by [AV ]kV \u22a4.\nSince the above statement is a transposed version of the lemma from [Clarkson and Woodruff, 2009],\nwe include the proof in the appendix for completeness.\nProof of Theorem 1. First, we show that colsp(V1) \u2286 colsp(V\u2217). By the properties of the (compact)\nSVD, colsp(V1) = rowsp(S1A) and colsp(V\u2217) = rowsp(S\u2217A). 
Since S\u2217 contains all rows of S1,\n\n$$\\mathrm{colsp}(V_1) \\subseteq \\mathrm{colsp}(V_*). \\qquad (2)$$\n\nBy Lemma 1,\n\n$$\\|A - [AV_*]_k V_*^\\top\\|_F = \\min_{\\mathrm{rowsp}(X) \\subseteq \\mathrm{colsp}(V_*);\\ \\mathrm{rank}(X) \\le k} \\|X - A\\|_F,$$\n\n$$\\|A - [AV_1]_k V_1^\\top\\|_F = \\min_{\\mathrm{rowsp}(X) \\subseteq \\mathrm{colsp}(V_1);\\ \\mathrm{rank}(X) \\le k} \\|X - A\\|_F.$$\n\nFinally, together with (2),\n\n$$\\|A - [AV_*]_k V_*^\\top\\|_F = \\min_{\\mathrm{rowsp}(X) \\subseteq \\mathrm{colsp}(V_*);\\ \\mathrm{rank}(X) \\le k} \\|X - A\\|_F \\le \\min_{\\mathrm{rowsp}(X) \\subseteq \\mathrm{colsp}(V_1);\\ \\mathrm{rank}(X) \\le k} \\|X - A\\|_F = \\|A - [AV_1]_k V_1^\\top\\|_F,$$\n\nwhich completes the proof.\n\nFinally, we note that the property of Theorem 1 is not universal, i.e., it does not hold for all sketching\nalgorithms for low-rank decomposition. For example, an alternative algorithm proposed in [Cohen\net al., 2015] proceeds by letting Z be the top k singular vectors of SA (i.e., Z = V where\n[SA]k = U \u03a3V \u22a4) and then reports AZZ\u22a4. It is not dif\ufb01cult to see that, by adding extra rows to the\nsketching matrix S (which may change all top k singular vectors compared to the ones of SA), one\ncan skew the output of the algorithm so that it is far from the optimal.\n\n5 Experimental Results\n\nThe main question considered in this paper is whether, for natural matrix datasets, optimizing\nthe sketch matrix S can improve the performance of the sketching algorithm for the low-rank\ndecomposition problem. To answer this question, we implemented and compared the following\nmethods for computing S \u2208 Rm\u00d7n.\n\n\u2022 Sparse Random. Sketching matrices are generated at random as in [Clarkson and Woodruff,\n2017]. Speci\ufb01cally, we select a random hash function h : [n] \u2192 [m], and for all i = 1 . . . n,\nSh[i],i is selected to be either +1 or \u22121 with equal probability. All other entries in S are set\nto 0. Therefore, S has exactly n non-zero entries.\n\u2022 Dense Random. All the nm entries in the sketching matrices are sampled from Gaussian\ndistribution (we include this method for comparison).\n\u2022 Learned. Using the sparse random matrix as the initialization, we run Algorithm 2 to\noptimize the sketching matrix using the training set, and return the optimized matrix.\n\u2022 Mixed (J). We \ufb01rst generate two sparse random matrices S1, S2 \u2208 R(m/2)\u00d7n (assuming m is\neven), and de\ufb01ne S to be their combination. We then run Algorithm 2 to optimize S using\nthe training set, but only S1 will be updated, while S2 is \ufb01xed. Therefore, S is a mixture of\nlearned matrix and random matrix, and the \ufb01rst matrix is trained jointly with the second one.\n\u2022 Mixed (S). We \ufb01rst compute a learned matrix S1 \u2208 R(m/2)\u00d7n using the training set, and then\nappend another sparse random matrix S2 to get S \u2208 Rm\u00d7n. Therefore, S is a mixture of\nlearned matrix and random matrix, but the learned matrix is trained separately.\n\nFigure 2: Test error by datasets and sketching matrices for k = 10, m = 20.\n\nFigure 3: Test error for Logo (left), Hyper (middle) and Tech (right) when k = 10.\n\nDatasets. 
We used a variety of datasets to test the performance of our methods:\n\n\u2022 Videos6: Logo, Friends, Eagle. We downloaded three high resolution videos from YouTube,\nincluding a logo video, the Friends TV show, and an eagle nest cam. From each video, we collect\n500 frames of size 1920 \u00d7 1080 \u00d7 3 pixels, and use 400 (100) matrices as the training (test)\nset. For each frame, we resize it as a 5760 \u00d7 1080 matrix.\n\u2022 Hyper. We use matrices from HS-SOD, a dataset for hyperspectral images from natural\nscenes [Imamoglu et al., 2018]. Each matrix has 1024 \u00d7 768 pixels, and we use 400 (100)\nmatrices as the training (test) set.\n\u2022 Tech. We use matrices from TechTC-300, a dataset for text categorization [Davidov et al.,\n2004]. Each matrix has 835,422 rows, but on average only 25,389 of the rows contain\nnon-zero entries. On average each matrix has 195 columns. We use 200 (95) matrices as the\ntraining (test) set.\n\nTable 1: Test error in various settings\n\nk, m, Sketch     Logo   Eagle  Friends  Hyper  Tech\n10, 10, Learned  0.39   0.31   1.03     1.25   6.70\n10, 10, Random   5.22   6.33   11.56    7.90   17.08\n10, 20, Learned  0.10   0.18   0.22     0.52   2.95\n10, 20, Random   2.09   4.31   4.11     2.92   7.99\n20, 20, Learned  0.61   0.66   1.41     1.68   7.79\n20, 20, Random   4.18   5.79   9.10     5.71   14.55\n20, 40, Learned  0.18   0.41   0.42     0.72   3.09\n20, 40, Random   1.19   3.50   2.44     2.23   6.20\n30, 30, Learned  0.72   1.06   1.78     1.90   7.14\n30, 30, Random   3.11   6.03   6.27     5.23   12.82\n30, 60, Learned  0.21   0.61   0.42     0.84   2.78\n30, 60, Random   0.82   3.28   1.79     1.88   4.84\n\nTable 2: Comparison with mixed sketches\n\nk, m, Sketch       Logo   Hyper  Tech\n10, 10, Learned    0.39   1.25   6.70\n10, 10, Random     5.22   7.90   17.08\n10, 20, Learned    0.10   0.52   2.95\n10, 20, Mixed (J)  0.20   0.78   3.73\n10, 20, Mixed (S)  0.24   0.87   3.69\n10, 20, Random     2.09   2.92   7.99\n10, 40, Learned    0.04   0.28   1.16\n10, 40, Mixed (J)  0.05   0.34   1.31\n10, 40, Mixed (S)  0.05   0.34   1.20\n10, 40, Random     0.45   1.12   3.28\n10, 80, Learned    0.02   0.16   0.31\n10, 80, Random     0.09   0.32   0.80\n\nEvaluation metric. To evaluate the quality of a sketching matrix S, it suf\ufb01ces to evaluate the output\nof Algorithm 1 using the sketching matrix S on different input matrices A. We \ufb01rst de\ufb01ne the optimal\napproximation loss for test set Te as follows:\n\n$$\\mathrm{App}^*_{\\mathrm{Te}} \\triangleq \\mathbb{E}_{A \\sim \\mathrm{Te}} \\|A - [A]_k\\|_F.$$\n\nNote that $\\mathrm{App}^*_{\\mathrm{Te}}$ does not depend on S, and in general it is not achievable by any sketch S with\nm < d, because of information loss. Based on the de\ufb01nition of the optimal approximation loss, we\nde\ufb01ne the error of the sketch S for Te as\n\n$$\\mathrm{Err}(\\mathrm{Te}, S) \\triangleq \\mathbb{E}_{A \\sim \\mathrm{Te}} \\|A - \\mathrm{SCW}(S, A)\\|_F - \\mathrm{App}^*_{\\mathrm{Te}}.$$\n\nIn our datasets, some of the matrices have much larger singular values than the others. To avoid\nimbalance in the dataset, we normalize the matrices so that their top singular values are all equal.\n\nFigure 4: Low rank approximation results for Logo video frame: the best rank-10 approximation\n(left), and rank-10 approximations reported by Algorithm 1 using a sparse learned sketching matrix\n(middle) and a sparse random sketching matrix (right).\n\n5.1 Average test error\n\nWe \ufb01rst test all methods on different datasets, with various combinations of k and m. See Figure 2 for\nthe results when k = 10, m = 20. As we can see, for video datasets, learned sketching matrices\nachieve roughly 20\u00d7 lower test error than the sparse random or dense random sketching matrices. 
For other\ndatasets, learned sketching matrices are still more than 2\u00d7 better. In this experiment, we have\nrun each con\ufb01guration 5 times, and computed the standard error of each test error7. For Logo,\nEagle, Friends, Hyper and Tech, the standard errors of learned, sparse random and dense random\nsketching matrices are respectively, (1.5, 8.4, 35.3, 124, 41) \u00d7 10^\u22126, (3.1, 5.3, 7.0, 2.9, 4.5) \u00d7 10^\u22122\nand (3.5, 18.1, 4.6, 10.7, 3.3) \u00d7 10^\u22122. It is clear that the standard error of the learned sketching\nmatrix is a few orders of magnitude smaller than that of the random sketching matrices, which shows another\nbene\ufb01t of learning sketching matrices.\n\n6They can be downloaded from http://youtu.be/L5HQoFIaT4I, http://youtu.be/xmLZsEfXEgE and\nhttp://youtu.be/ufnf_q_3Ofg\n\nSimilar improvement of the learned sketching matrices over the random sketching matrices can be\nobserved when k = 10, m = 10, 20, 30, 40,\u00b7\u00b7\u00b7 , 80, see Figure 3. We also include the test error\nresults in Table 1 for the case when k = 20, 30. Finally, in Figure 4, we visualize an example output\nof the algorithm for the case k = 10, m = 20 for the Logo dataset.\n\n5.2 Comparing Random, Learned and Mixed\n\nIn Table 2, we investigate the performance of the mixed sketching matrices by comparing them with\nrandom and learned sketching matrices. In all scenarios, the mixed sketching matrices yield much\nbetter results than the random sketching matrices, and sometimes the results are comparable to those\nof learned sketching matrices. This means, in most cases it suf\ufb01ces to train half of the sketching\nmatrix to obtain good empirical results, and at the same time, by our Theorem 1, we can use the\nremaining random half of the sketching matrix to obtain worst-case guarantees.\nMoreover, if we do not \ufb01x the number of learned rows to be half, the test error increases as the number\nof learned rows decreases. 
In Figure 5, we plot the test error for the setting with m = 20, k = 10\nusing 100 Logo matrices, running for 3000 iterations.\n\n5.3 Mixing Training Sets\n\nIn our previous experiments, we constructed a different learned sketching matrix S for each data set.\nHowever, one can use a single random sketching matrix for all three data sets simultaneously. Next,\nwe study the performance of a single learned sketching matrix for all three data sets. In Table 3, we\nconstructed a single learned sketching matrix S with m = k = 10 on a training set containing 300\nmatrices from Logo, Eagle and Friends (each has 100 matrices). Then, we tested S on Logo matrices\nand compared its performance to the performance of a learned sketching matrix SL trained on Logo\ndataset (i.e., using 100 Logo matrices only), as well as to the performance of a random sketching SR.\nThe performance of the sketching matrix S with a mixed training set from all three datasets is close\nto the performance of the sketching matrix SL with training set only from Logo dataset, and is much\nbetter than the performance of the random sketching matrix SR.\n\n5.4 Running Time\n\nThe runtimes of the algorithm with a random sketching matrix and our learned sketching matrix\nare the same, and are much less than the runtime of the \u201cstandard\u201d SVD method (implemented in\nPytorch). In Table 4, we present the runtimes of the algorithm with different types of sketching\nmatrices (i.e., learned and random) on Logo matrices with m = k = 10, as well as the training time\nof the learned case. Notice that training only needs to be done once, and can be done of\ufb02ine.\n\n6 Conclusions\n\nIn this paper we introduced a learning-based approach to sketching algorithms for computing low-rank\ndecompositions. Such algorithms proceed by computing a projection SA, where A is the input matrix\nand S is a random \u201csketching\u201d matrix. 
We showed how to train S using example matrices A in\norder to improve the performance of the overall algorithm. Our experiments show that for several\ndifferent types of datasets, a learned sketch can significantly reduce the approximation loss compared\nto a random matrix. Further, we showed that if we mix a random matrix and a learned matrix (by\nconcatenation), the result still offers improved performance while inheriting the worst-case guarantees\nof the random sketch component.\n\n7 They were very small, so we did not plot them in the figures.\n\nTable 3: Evaluation of the sketching matrix trained on different sets\n\n            Logo+Eagle+Friends  Logo only  Random\nTest Error  0.67                0.27       5.19\n\nTable 4: Runtimes of the algorithm with different sketching matrices\n\nSVD   Random  Learned-Inference  Learned-Training\n2.2s  0.03s   0.03s              9481.25s\n\nFigure 5: Test errors of mixed sketching matrices with different numbers of \u201clearned\u201d rows.\n\nAcknowledgment\n\nThis research was supported by NSF TRIPODS award #1740751 and a Simons Investigator Award.\nThe authors would like to thank the anonymous reviewers for their insightful comments and suggestions.\n\nReferences\n\nZ. Allen-Zhu and Y. Li. LazySVD: Even faster SVD decomposition yet without agonizing pain. In\nAdvances in Neural Information Processing Systems, pages 974\u2013982, 2016.\n\nM.-F. Balcan, T. Dick, T. Sandholm, and E. Vitercik. Learning to branch. In International Conference\non Machine Learning, pages 353\u2013362, 2018.\n\nL. Baldassarre, Y.-H. Li, J. Scarlett, B. G\u00f6zc\u00fc, I. Bogunovic, and V. Cevher. Learning-based\ncompressive subsampling. IEEE Journal of Selected Topics in Signal Processing, 10(4):809\u2013822,\n2016.\n\nA. Bora, A. Jalal, E. Price, and A. G. Dimakis. Compressed sensing using generative models. In\nInternational Conference on Machine Learning, pages 537\u2013546, 2017.\n\nC. Boutsidis and A. Gittens. 
Improved matrix algorithms via the subsampled randomized Hadamard\ntransform. SIAM Journal on Matrix Analysis and Applications, 34(3):1301\u20131340, 2013.\n\nE. J. Cand\u00e8s, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction\nfrom highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):\n489\u2013509, 2006.\n\nK. L. Clarkson and D. P. Woodruff. Numerical linear algebra in the streaming model. In Proceedings\nof the forty-first annual symposium on Theory of computing (STOC), pages 205\u2013214, 2009.\n\nK. L. Clarkson and D. P. Woodruff. Low-rank approximation and regression in input sparsity time.\nJournal of the ACM (JACM), 63(6):54, 2017.\n\nM. B. Cohen, S. Elder, C. Musco, C. Musco, and M. Persu. Dimensionality reduction for k-means\nclustering and low rank approximation. In Proceedings of the forty-seventh annual ACM symposium\non Theory of computing, pages 163\u2013172, 2015.\n\nD. Davidov, E. Gabrilovich, and S. Markovitch. Parameterized generation of labeled datasets for text\ncategorization based on a hierarchical directory. In Proceedings of the 27th Annual International\nACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR \u201904, pages\n250\u2013257, 2004.\n\nD. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289\u20131306,\n2006.\n\nM. Ghashami and J. M. Phillips. Relative errors for deterministic low-rank matrix approximations.\nIn Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms (SODA),\npages 707\u2013717, 2014.\n\nM. Ghashami, E. Liberty, J. M. Phillips, and D. P. Woodruff. Frequent directions: Simple and\ndeterministic matrix sketching. SIAM Journal on Computing, 45(5):1762\u20131792, 2016.\n\nS. Gollapudi and D. Panigrahi. Online algorithms for rent-or-buy with expert advice. 
In International\nConference on Machine Learning, pages 2319\u20132327, 2019.\n\nN. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic\nalgorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217\u2013288,\n2011.\n\nP. Hand and V. Voroninski. Global guarantees for enforcing deep generative priors by empirical risk.\nIn Conference on Learning Theory, 2018.\n\nS. Har-Peled, P. Indyk, and R. Motwani. Approximate nearest neighbor: Towards removing the curse\nof dimensionality. Theory of Computing, 8(1):321\u2013350, 2012.\n\nC.-Y. Hsu, P. Indyk, D. Katabi, and A. Vakilian. Learning-based frequency estimation algorithms.\nInternational Conference on Learning Representations, 2019.\n\nN. Imamoglu, Y. Oishi, X. Zhang, G. Ding, Y. Fang, T. Kouyama, and R. Nakamura. Hyperspectral\nimage dataset for benchmarking on salient object detection. In Tenth International Conference on\nQuality of Multimedia Experience, (QoMEX), pages 1\u20133, 2018.\n\nE. Khalil, H. Dai, Y. Zhang, B. Dilkina, and L. Song. Learning combinatorial optimization algorithms\nover graphs. In Advances in Neural Information Processing Systems, pages 6348\u20136358, 2017.\n\nM. Khani, M. Alizadeh, J. Hoydis, and P. Fleming. Adaptive neural signal detection for massive\nMIMO. CoRR, abs/1906.04610, 2019.\n\nT. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The case for learned index structures. In\nProceedings of the 2018 International Conference on Management of Data, pages 489\u2013504, 2018.\n\nE. Liberty. Simple and deterministic matrix sketching. In Proceedings of the 19th ACM SIGKDD\ninternational conference on Knowledge discovery and data mining, pages 581\u2013588, 2013.\n\nT. Lykouris and S. Vassilvitskii. Competitive caching with machine learned advice. In International\nConference on Machine Learning, pages 3302\u20133311, 2018.\n\nX. Meng and M. W. Mahoney. 
Low-distortion subspace embeddings in input-sparsity time and\napplications to robust linear regression. In Proceedings of the forty-fifth annual ACM symposium\non Theory of computing, pages 91\u2013100, 2013.\n\nC. Metzler, A. Mousavi, and R. Baraniuk. Learned D-AMP: Principled neural network based compressive\nimage recovery. In Advances in Neural Information Processing Systems, pages 1772\u20131783,\n2017.\n\nM. Mitzenmacher. A model for learned Bloom filters and optimizing by sandwiching. In Advances in\nNeural Information Processing Systems, pages 464\u2013473, 2018.\n\nA. Mousavi, A. B. Patel, and R. G. Baraniuk. A deep learning approach to structured signal recovery.\nIn Communication, Control, and Computing (Allerton), 2015 53rd Annual Allerton Conference on,\npages 1336\u20131343. IEEE, 2015.\n\nJ. Nelson and H. L. Nguy\u00ean. OSNAP: Faster numerical linear algebra algorithms via sparser subspace\nembeddings. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium\non, pages 117\u2013126, 2013.\n\nM. Purohit, Z. Svitkina, and R. Kumar. Improving online algorithms via ML predictions. In Advances\nin Neural Information Processing Systems, pages 9661\u20139670, 2018.\n\nT. Sarlos. Improved approximation algorithms for large matrices via random projections. In 47th\nAnnual IEEE Symposium on Foundations of Computer Science (FOCS), pages 143\u2013152, 2006.\n\nS. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms.\nCambridge University Press, 2014.\n\nJ. Wang, W. Liu, S. Kumar, and S.-F. Chang. Learning to hash for indexing big data - a survey.\nProceedings of the IEEE, 104(1):34\u201357, 2016.\n\nD. P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends\u00ae in\nTheoretical Computer Science, 10(1\u20132):1\u2013157, 2014.\n\nF. Woolfe, E. Liberty, V. Rokhlin, and M. Tygert. 
A fast randomized algorithm for the approximation\nof matrices. Applied and Computational Harmonic Analysis, 25(3):335\u2013366, 2008.\n", "award": [], "sourceid": 4025, "authors": [{"given_name": "Piotr", "family_name": "Indyk", "institution": "MIT"}, {"given_name": "Ali", "family_name": "Vakilian", "institution": "University of Wisconsin-Madison"}, {"given_name": "Yang", "family_name": "Yuan", "institution": "Cornell University"}]}