{"title": "Fast recovery from a union of subspaces", "book": "Advances in Neural Information Processing Systems", "page_first": 4394, "page_last": 4402, "abstract": "We address the problem of recovering a high-dimensional but structured vector from linear observations in a general setting where the vector can come from an arbitrary union of subspaces. This setup includes well-studied problems such as compressive sensing and low-rank matrix recovery. We show how to design more efficient algorithms for the union-of subspace recovery problem by using *approximate* projections. Instantiating our general framework for the low-rank matrix recovery problem gives the fastest provable running time for an algorithm with optimal sample complexity. Moreover, we give fast approximate projections for 2D histograms, another well-studied low-dimensional model of data. We complement our theoretical results with experiments demonstrating that our framework also leads to improved time and sample complexity empirically.", "full_text": "Fast recovery from a union of subspaces\n\nChinmay Hegde\n\nIowa State University\n\nPiotr Indyk\n\nMIT\n\nLudwig Schmidt\n\nMIT\n\nAbstract\n\nWe address the problem of recovering a high-dimensional but structured vector\nfrom linear observations in a general setting where the vector can come from an\narbitrary union of subspaces. This setup includes well-studied problems such as\ncompressive sensing and low-rank matrix recovery. We show how to design more\nef\ufb01cient algorithms for the union-of-subspace recovery problem by using approx-\nimate projections. Instantiating our general framework for the low-rank matrix\nrecovery problem gives the fastest provable running time for an algorithm with\noptimal sample complexity. Moreover, we give fast approximate projections for 2D\nhistograms, another well-studied low-dimensional model of data. 
We complement\nour theoretical results with experiments demonstrating that our framework also\nleads to improved time and sample complexity empirically.\n\n1 Introduction\n\nOver the past decade, exploiting low-dimensional structure in high-dimensional problems has become\na highly active area of research in machine learning, signal processing, and statistics. In a nutshell,\nthe general approach is to utilize a low-dimensional model of relevant data in order to achieve\nbetter prediction, compression, or estimation compared to a \u201cblack box\u201d treatment of the ambient\nhigh-dimensional space. For instance, the seminal work on compressive sensing and sparse linear\nregression has shown how to estimate a sparse, high-dimensional vector from a small number of\nlinear observations that essentially depends only on the small sparsity of the vector, as opposed to its\nlarge ambient dimension. Further examples of low-dimensional models are low-rank matrices, group-structured\nsparsity, and general union-of-subspaces models, all of which have found applications in\nproblems such as matrix completion, principal component analysis, compression, and clustering.\nThese low-dimensional models have a common reason for their success: they capture important\nstructure present in real-world data with a formal concept that is suitable for a rigorous mathematical\nanalysis. This combination has led to statistical performance improvements in several applications\nwhere the ambient high-dimensional space is too large for accurate estimation from a limited number\nof samples. However, exploiting the low-dimensional structure also comes at a cost: incorporating\nthe structural constraints into the statistical estimation procedure often results in more challenging\nalgorithmic problems. Given the growing size of modern data sets, even problems that are solvable\nin polynomial time can quickly become infeasible. 
This leads to the following important question:\nCan we design ef\ufb01cient algorithms that combine (near)-optimal statistical ef\ufb01ciency with good\ncomputational complexity?\nIn this paper, we make progress on this question in the context of recovering a low-dimensional\nvector from noisy linear observations, which is the fundamental problem underlying both low-rank\nmatrix recovery and compressive sensing / sparse linear regression. While there is a wide range of\nalgorithms for these problems, two approaches for incorporating structure tend to be most common:\n(i) convex relaxations of the low-dimensional constraint such as the `1- or the nuclear norm [19], and\n(ii) iterative methods based on projected gradient descent, e.g., the IHT (Iterative Hard Thresholding)\nor SVP (Singular Value Projection) algorithms [5, 15]. Since the convex relaxations are often also\nsolved with \ufb01rst order methods (e.g., FISTA or SVT [6]), the low-dimensional constraint enters both\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fapproaches through a structure-speci\ufb01c projection or proximal operator. However, this projection\n/ proximal operator is often computationally expensive and dominates the overall time complexity\n(e.g., it requires a singular value decomposition for the low-rank matrix recovery problem).\nIn this work, we show how to reduce the computational bottleneck of the projection step by using\napproximate projections. Instead of solving the structure-speci\ufb01c projection exactly, our framework\nallows us to employ techniques from approximation algorithms without increasing the sample\ncomplexity of the recovery algorithm. 
While approximate projections have been used in prior work,\nour framework is the \ufb01rst to yield provable algorithms for general union-of-subspaces models (such\nas low-rank matrices) that combine better running time with no loss in sample complexity compared\nto their counterparts utilizing exact projections. Overall, we make three contributions:\n\n1. We introduce an algorithmic framework for recovering vectors from linear observations\ngiven an arbitrary union-of-subspaces model. Our framework only requires approximate\nprojections, which leads to recovery algorithms with signi\ufb01cantly better time complexity.\n2. We instantiate our framework for the well-studied low-rank matrix recovery problem, which\nyields a provable algorithm combining the optimal sample complexity with the best known\ntime complexity for this problem.\n\n3. We also instantiate our framework for the problem of recovering 2D-histograms (i.e.,\npiecewise constant matrices) from linear observations, which leads to a better empirical\nsample complexity than the standard approach based on Haar wavelets.\n\nOur algorithmic framework generalizes recent results for structured sparse recovery [12, 13] and\nshows that approximate projections can be employed in a wider context. We believe that these\nnotions of approximate projections are useful in further constrained estimation settings and have\nalready obtained preliminary results for structured sparse PCA. For conciseness, we focus on the\nunion-of-subspaces recovery problem in this paper.\n\nOutline of the paper.\nIn Section 2, we formally introduce the union-of-subspaces recovery problem\nand state our main results. Section 3 then explains our algorithmic framework in more detail\nand Section 4 instantiates the framework for low-rank matrix recovery. Section 5 concludes with\nexperimental results. 
Due to space constraints, we address our results for 2D histograms mainly in\nAppendix C of the supplementary material.\n\n2 Our contributions\n\nWe begin by defining our problem of interest. Our goal is to recover an unknown, structured vector\n$\\theta^* \\in \\mathbb{R}^d$ from linear observations of the form\n\n$$y = X\\theta^* + e , \\tag{1}$$\n\nwhere the vector $y \\in \\mathbb{R}^n$ contains the linear observations / measurements, the matrix $X \\in \\mathbb{R}^{n \\times d}$ is\nthe design / measurement matrix, and the vector $e \\in \\mathbb{R}^n$ is an arbitrary noise vector. The formal goal\nis to find an estimate $\\hat{\\theta} \\in \\mathbb{R}^d$ such that $\\|\\hat{\\theta} - \\theta^*\\|_2 \\le C \\cdot \\|e\\|_2$, where $C$ is a fixed, universal constant\nand $\\|\\cdot\\|_2$ is the standard $\\ell_2$-norm (for notational simplicity, we omit the subscript on the $\\ell_2$-norm in\nthe rest of the paper). The structure we assume is that the vector $\\theta^*$ belongs to a subspace model:\nDefinition 1 (Subspace model). A subspace model $\\mathbb{U}$ is a set of linear subspaces. The set of vectors\nassociated with the subspace model $\\mathbb{U}$ is $\\mathcal{M}(\\mathbb{U}) = \\{\\theta \\mid \\theta \\in U \\text{ for some } U \\in \\mathbb{U}\\}$.\nA subspace model is a natural framework generalizing many of the low-dimensional data models\nmentioned above. For example, the set of sparse vectors with $s$ nonzeros can be represented with\n$\\binom{d}{s}$ subspaces corresponding to the $\\binom{d}{s}$ possible sparse support sets. The resulting problem of\nrecovering $\\theta^*$ from observations of the form (1) then is the standard compressive sensing / sparse\nlinear regression problem. Structured sparsity is a direct extension of this formulation in which we\nonly include a smaller set of allowed supports, e.g., supports corresponding to group structures.\nOur framework also includes the case where the union of subspaces is taken over an infinite set: we\ncan encode the low-rank matrix recovery problem by letting $\\mathbb{U}$ be the set of rank-$r$ matrix subspaces,\ni.e., each subspace is given by a set of $r$ orthogonal rank-one matrices. 
By considering the singular\nvalue decomposition, it is easy to see that every rank-$r$ matrix can be written as the linear combination\nof $r$ orthogonal rank-one matrices.\nNext, we introduce related notation. For a linear subspace $U$ of $\\mathbb{R}^d$, let $P_U \\in \\mathbb{R}^{d \\times d}$ be the orthogonal\nprojection onto $U$. We denote the orthogonal complement of the subspace $U$ with $U^\\perp$ so that $\\theta = P_U \\theta + P_{U^\\perp} \\theta$.\nWe extend the notion of adding subspaces (i.e., $U + V = \\{u + v \\mid u \\in U \\text{ and } v \\in V\\}$) to\nsubspace models: the sum of two subspace models $\\mathbb{U}$ and $\\mathbb{V}$ is $\\mathbb{U} \\oplus \\mathbb{V} = \\{U + V \\mid U \\in \\mathbb{U} \\text{ and } V \\in \\mathbb{V}\\}$.\nWe denote the $k$-wise sum of a subspace model with $\\oplus^k \\mathbb{U} = \\mathbb{U} \\oplus \\mathbb{U} \\oplus \\ldots \\oplus \\mathbb{U}$.\nFinally, we introduce a variant of the well-known restricted isometry property (RIP) for subspace\nmodels. The RIP is a common regularity assumption for the design matrix $X$ that is often used in\ncompressive sensing and low-rank matrix recovery in order to decouple the analysis of algorithms\nfrom concrete sampling bounds.1 Formally, we have:\nDefinition 2 (Subspace RIP). Let $X \\in \\mathbb{R}^{n \\times d}$, let $\\mathbb{U}$ be a subspace model, and let $\\delta \\ge 0$. Then $X$\nsatisfies the $(\\mathbb{U}, \\delta)$-subspace RIP if for all $\\theta \\in \\mathcal{M}(\\mathbb{U})$ we have $(1 - \\delta)\\|\\theta\\|^2 \\le \\|X\\theta\\|^2 \\le (1 + \\delta)\\|\\theta\\|^2$.\n\n2.1 A framework for recovery algorithms with approximate projections\n\nConsidering the problem (1) and the goal of estimating under the $\\ell_2$-norm, a natural algorithm is\nprojected gradient descent with the constraint set $\\mathcal{M}(\\mathbb{U})$. This corresponds to iterations of the form\n\n$$\\hat{\\theta}^{i+1} \\leftarrow P_{\\mathbb{U}}(\\hat{\\theta}^i - \\eta \\cdot X^T (X \\hat{\\theta}^i - y)) \\tag{2}$$\n\nwhere $\\eta \\in \\mathbb{R}$ is the step size and we have extended our notation so that $P_{\\mathbb{U}}$ denotes a projection onto\nthe set $\\mathcal{M}(\\mathbb{U})$. Hence we require an oracle that projects an arbitrary vector $b \\in \\mathbb{R}^d$ into a subspace\nmodel $\\mathbb{U}$, which corresponds to finding a subspace $U \\in \\mathbb{U}$ so that $\\|b - P_U b\\|$ is minimized. 
Recovery\nalgorithms of the form (2) have been proposed for various instances of the union-of-subspaces\nrecovery problem and are known as Iterative Hard Thresholding (IHT) [5], model-IHT [1], and\nSingular Value Projection (SVP) [15]. Under regularity conditions on the design matrix X such as the\nRIP, these algorithms \ufb01nd accurate estimates \u02c6\u2713 from an asymptotically optimal number of samples.\nHowever, for structures more complicated than plain sparsity (e.g., group sparsity or a low-rank\nconstraint), the projection oracle is often the computational bottleneck.\nTo overcome this barrier, we propose two complementary notions of approximate subspace projections.\nNote that for an exact projection, we have that kbk2 = kb  PUbk2 + kPUbk2. Hence minimizing the\n\u201ctail\u201d error kb  PUbk is equivalent to maximizing the \u201chead\u201d quantity kPUbk. Instead of minimizing /\nmaximizing these quantities exactly, the following de\ufb01nitions allow a constant factor approximation:\nDe\ufb01nition 3 (Approximate tail projection). Let U and UT be subspace models and let cT  0. Then\nT : Rd ! UT is a (cT , U, UT )-approximate tail projection if the following guarantee holds for all\nb 2 Rd: The returned subspace U = T (b) satis\ufb01es kb  PU bk \uf8ff cT kb  PUbk.\nDe\ufb01nition 4 (Approximate head projection). Let U and UH be subspace models and let cH > 0.\nThen H : Rd ! UH is a (cH, U, UH)-approximate head projection if the following guarantee holds\nfor all b 2 Rd: The returned subspace U = H(b) satis\ufb01es kPU bk  cHkPUbk.\nIt is important to note that the two de\ufb01nitions are distinct in the sense that a constant-factor head\napproximation does not imply a constant-factor tail approximation, or vice versa (to see this, consider\na vector with a very large or very small tail error, respectively). 
Another feature of these de\ufb01nitions is\nthat the approximate projections are allowed to choose subspaces from a potentially larger subspace\nmodel, i.e., we can have U(U H (or UT ). This is a useful property when designing approximate\nhead and tail projection algorithms as it allows for bicriterion approximation guarantees.\nWe now state the main result for our new recovery algorithm. In a nutshell, we show that using both\nnotions of approximate projections achieves the same statistical ef\ufb01ciency as using exact projections.\nAs we will see in later sections, the weaker approximate projection guarantees allow us to design\nalgorithms with a signi\ufb01cantly better time complexity than their exact counterparts. To simplify the\nfollowing statement, we defer the precise trade-off between the approximation ratios to Section 3.\n\n1Note that exact recovery from arbitrary linear observations is already an NP-hard problem in the noiseless\ncase, and hence regularity conditions on the design matrix X are necessary for ef\ufb01cient algorithms. While there\nare more general regularity conditions such as the restricted eigenvalue property, we state our results here under\nthe RIP assumption in order to simplify the presentation of our algorithmic framework.\n\n3\n\n\fTheorem 5 (informal). Let H and T be approximate head and tail projections with constant\napproximation ratios, and let the matrix X satisfy the (c U, )-subspace RIP for a suf\ufb01ciently large\nconstant c and a suf\ufb01ciently small constant . Then there is an algorithm AS-IHT that returns an\nestimate \u02c6\u2713 such that k\u02c6\u2713  \u2713\u21e4k \uf8ff Ckek. The algorithm requires O(logk\u2713k/kek) multiplications with\nX and X T , and O(logk\u2713k/kek) invocations of H and T .\nUp to constant factors, the requirements on the RIP of X in Theorem 5 are the same as for exact\nprojections. 
As a result, our sample complexity is only affected by a constant factor through the use\nof approximate projections, and our experiments in Section 5 show that the empirical loss in sample\ncomplexity is negligible. Similarly, the number of iterations $O(\\log \\|\\theta\\| / \\|e\\|)$ is also only affected by\na constant factor compared to the use of exact projections [5, 15]. Finally, it is worth mentioning that\nusing two notions of approximate projections is crucial: prior work in the special case of structured\nsparsity has already shown that only one type of approximate projection is not sufficient for strong\nrecovery guarantees [13].\n\n2.2 Low-rank matrix recovery\n\nWe now instantiate our new algorithmic framework for the low-rank matrix recovery problem.\nVariants of this problem are widely studied in machine learning, signal processing, and statistics, and\nare known under different names such as matrix completion, matrix sensing, and matrix regression.\nAs mentioned above, we can incorporate the low-rank matrix structure into our general union-of-subspaces\nmodel by considering the union of all low-rank matrix subspaces. For simplicity, we state\nthe following bounds for the case of square matrices, but all our results also apply to rectangular\nmatrices. Formally, we assume that $\\theta^* \\in \\mathbb{R}^d$ is the vectorized form of a rank-$r$ matrix $\\Theta^* \\in \\mathbb{R}^{d_1 \\times d_1}$\nwhere $d = d_1^2$ and typically $r \\ll d_1$. Seminal results have shown that it is possible to achieve the\nsubspace-RIP for low-rank matrices with only $n = O(r \\cdot d_1)$ linear observations, which can be much\nsmaller than the total dimensionality of the matrix $d_1^2$. However, the bottleneck in recovery algorithms\nis often the singular value decomposition (SVD), which is necessary for both exact projections and\nsoft thresholding operators and has a time complexity of $O(d_1^3)$.\nOur new algorithmic framework for approximate projections allows us to leverage recent results\non approximate SVDs. 
We show that it is possible to compute both head and tail projections for\nlow-rank matrices in $\\tilde{O}(r \\cdot d_1^2)$ time, which is significantly faster than the $O(d_1^3)$ time for an exact\nSVD in the relevant regime where $r \\ll d_1$. Overall, we get the following result.\nTheorem 6. Let $X \\in \\mathbb{R}^{n \\times d}$ be a matrix with subspace-RIP for low-rank matrices, and let $T_X$ denote\nthe time to multiply a $d$-dimensional vector with $X$ or $X^T$. Then there is an algorithm that recovers\nan estimate $\\hat{\\theta}$ such that $\\|\\hat{\\theta} - \\theta^*\\| \\le C \\|e\\|$. Moreover, the algorithm runs in time $\\tilde{O}(T_X + r \\cdot d_1^2)$.\nIn the regime where multiplication with the matrix $X$ is fast, the time complexity of the projection\ndominates the time complexity of the recovery algorithms. For instance, structured observations\nsuch as a subsampled Fourier matrix achieve $T_X = \\tilde{O}(d_1^2)$; see Appendix D for details. Here,\nour algorithm runs in time $\\tilde{O}(r \\cdot d_1^2)$, which is the first provable running time faster than the $O(d_1^3)$\nbottleneck given by a single exact SVD. While prior work has suggested the use of approximate SVDs\nin low-rank matrix recovery [9], our results are the first that give a provably better time complexity\nfor this combination of projected gradient descent and approximate SVDs. Hence Theorem 6 can be\nseen as a theoretical justification for the heuristic use of approximate SVDs.\nFinally, we remark that Theorem 6 does not directly cover the low-rank matrix completion case\nbecause the subsampling operator does not satisfy the low-rank RIP [9]. To clarify our use of\napproximate SVDs, we focus on the RIP setting in our proofs, similar to recent work on low-rank\nmatrix recovery [7, 22]. 
We believe that results similar to those for SVP [15] also hold for our algorithm,\nand our experiments in Section 5 show that our algorithm works well for low-rank matrix completion.\n\n2.3 2D-histogram recovery\n\nNext, we instantiate our new framework for 2D-histograms, another natural low-dimensional model.\nAs before, we think of the vector $\\theta^* \\in \\mathbb{R}^d$ as a matrix $\\Theta \\in \\mathbb{R}^{d_1 \\times d_1}$ and assume the square case for\nsimplicity (again, our results also apply to rectangular matrices). We say that $\\Theta$ is a $k$-histogram if the\ncoefficients of $\\Theta$ can be described as $k$ axis-aligned rectangles on which $\\Theta$ is constant. This definition\nis a generalization of 1D-histograms to the two-dimensional setting and has found applications in\nseveral areas such as databases and density estimation. Moreover, the theoretical computer science\ncommunity has studied sketching and streaming algorithms for histograms, which is essentially the\nproblem of recovering a histogram from linear observations. While the wavelet tree model with Haar\nwavelets gives the correct sample complexity of $n = O(k \\log d)$ for 1D-histograms, the wavelet tree\napproach incurs a suboptimal sample complexity of $O(k \\log^2 d)$ for 2D-histograms. It is possible\nto achieve the optimal sample complexity $O(k \\log d)$ also for 2D-histograms, but the corresponding\nexact projection requires a complicated dynamic program (DP) with time complexity $O(d_1^5 k^2)$, which\nis impractical for all but very small problem dimensions [18].\nWe design significantly faster approximate projection algorithms for 2D histograms. Our approach is\nbased on an approximate DP [18] that we combine with a Lagrangian relaxation of the $k$-rectangle\nconstraint. Both algorithms have parameters for controlling the trade-off between the size of the\noutput histogram, the approximation ratio, and the running time. 
As mentioned above, the bicriterion\nnature of our approximate head and tail guarantees becomes useful here. In the following two\ntheorems, we let Uk be the subspace model of 2D histograms consisting of k-rectangles.\nTheorem 7. Let \u21e3> 0 and \"> 0 be arbitrary. Then there is an (1 + \", Uk, Uc\u00b7k)-approximate tail\n\nprojection for 2D histograms where c = O(1/\u21e32\"). Moreover, the algorithm runs in time eO(d1+\u21e3).\nTheorem 8. Let \u21e3> 0 and \"> 0 be arbitrary. Then there is an (1  \", Uk, Uc\u00b7k)-approximate head\nprojection for 2D histograms where c = O(1/\u21e32\"). Moreover, the algorithm runs in time eO(d1+\u21e3).\n\nNote that both algorithms offer a running time that is almost linear, and the small polynomial gap to\na linear running time can be controlled as a trade-off between computational and statistical ef\ufb01ciency\n(a larger output histogram requires more samples to recover). While we provide rigorous proofs for\nthe approximation algorithms as stated above, we remark that we do not establish an overall recovery\nresult similar to Theorem 6. The reason is that the approximate head projection is competitive\nwith respect to k-histograms, but not with the space Uk  Uk, i.e., the sum of two k-histogram\nsubspaces. The details are somewhat technical and we give a more detailed discussion in Appendix\nC.3. However, under a natural structural conjecture about sums of k-histogram subspaces, we obtain\na similar result as Theorem 6. Moreover, we experimentally demonstrate that the sample complexity\nof our algorithms already improves over wavelets for k-histograms of size 32 \u21e5 32.\nFinally, we note that our DP approach also generalizes to -dimensional histograms for any constant\n  2. As the dimension of the histogram structure increases, the gap in sample complexity\nbetween our algorithm and the prior wavelet-based approach becomes increasingly wide and scales\nas O(k log d) vs O(k log d). 
For simplicity, we limit our attention to the 2D case described above.\n\n2.4 Related work\n\nRecently, there have been several results on approximate projections in the context of recovering\nlow-dimensional structured vectors (see [12, 13] for an overview). While these approaches also work\nwith approximate projections, they only apply to less general models such as dictionary sparsity [12]\nor structured sparsity [13] and do not extend to the low-rank matrix recovery problem we address.\nAmong recovery frameworks for general union-of-subspaces models, the work closest to ours is [4],\nwhich also gives a generalization of the IHT algorithm. It is important to note that [4] addresses\napproximate projections, but requires additive error approximation guarantees instead of the weaker\nrelative error approximation guarantees required by our framework. Similar to the structured sparsity\ncase in [13], we are not aware of any algorithms for low-rank or histogram projections that offer\nadditive error guarantees faster than an exact projection. Overall, our recovery framework can be\nseen as a generalization of the approaches in [13] and [4].\nLow-rank recovery has received a tremendous amount of attention over the past few years, so we\nrefer the reader to the recent survey [9] for an overview. When referring to prior work on low-rank\nrecovery, it is important to note that the fastest known running time for an exact low-rank SVD (even\nfor rank 1) of a $d_1 \\times d_2$ matrix is $O(d_1 d_2 \\min(d_1, d_2))$. Several papers provide rigorous proofs for\nlow-rank recovery using exact SVDs and then refer to Lanczos methods such as PROPACK [16]\nwhile accounting for a time complexity of $O(d_1 d_2 r)$ for a rank-$r$ SVD. While Lanczos methods can be\nfaster than exact SVDs in the presence of singular value gaps, it is important to note that all rigorous\nresults for Lanczos SVDs have a polynomial dependence on either the approximation ratio or singular\nvalue gaps [17, 20]. 
No prior work on low-rank recovery establishes such singular value gaps for\nthe inputs to the SVD subroutines (and such gaps would be necessary for all iterates in the recovery\nalgorithm). In contrast, we utilize recent work on gap-independent approximate SVDs [17], which\nenables us to give rigorous guarantees for the entire recovery algorithm. Our results can be seen as\njustification for the heuristic use of Lanczos methods in prior work.\nThe paper [2] contains an analysis of an approximate SVD in combination with an iterative recovery\nalgorithm. However, [2] only uses an approximate tail projection, and as a result the approximation\nratio $c_T$ must be very close to 1 in order to achieve a good sample complexity. Overall, this leads to a\ntime complexity that does not provide an asymptotic improvement over using exact SVDs.\nRecently, several papers have analyzed a non-convex approach to low-rank matrix recovery via\nfactorized gradient descent [3, 7, 22\u201324]. While these algorithms avoid SVDs in the iterations of\nthe gradient method, the overall recovery proofs still require an exact SVD in the initialization step.\nIn order to match the sample complexity of our algorithm or SVP, the factorized gradient methods\nrequire multiple SVDs for this initialization [7, 22]. As a result, our algorithm offers a better provable\ntime complexity. We remark that [7, 22] use SVP for their initialization, so combining our faster\nversion of SVP with factorized gradient descent might give the best overall performance.\nAs mentioned earlier, 1D and 2D histograms have been studied extensively in several areas such\nas databases [8, 14] and density estimation. They are typically used to summarize \u201ccount vectors\u201d,\nwith each coordinate of the vector $\\theta$ corresponding to the number of items with a given value in some\ndata set. 
Computing linear sketches of such vectors, as well as efficient methods for recovering\nhistogram approximations from those sketches, became key tools for designing space-efficient\ndynamic streaming algorithms [10, 11, 21]. For 1D histograms it is known how to achieve the\noptimal sketch length bound of $n = O(k \\log d)$: it can be obtained by representing $k$-histograms\nusing a tree of $O(k \\log d)$ wavelet coefficients as in [10] and then using the structured sparse recovery\nalgorithm of [1]. However, applying this approach to 2D histograms leads to a sub-optimal bound of\n$O(k \\log^2 d)$.\n\n3 An algorithm for recovery with approximate projections\n\nWe now introduce our algorithm for recovery from general subspace models using only approximate\nprojections. The pseudo code is formally stated in Algorithm 1 and can be seen as a generalization\nof IHT [5]. Similar to IHT, we give a version without step size parameter here in order to simplify\nthe presentation (it is easy to introduce a step size parameter in order to fine-tune constant factors).\nTo clarify the connection with projected gradient descent as stated in Equation (2), we use $\\mathcal{H}(b)$ (or\n$\\mathcal{T}(b)$) as a function from $\\mathbb{R}^d$ to $\\mathbb{R}^d$ here. This function is then understood to be $b \\mapsto P_{\\mathcal{H}(b)} b$, i.e., the\northogonal projection of $b$ onto the subspace identified by $\\mathcal{H}(b)$.\n\nAlgorithm 1 Approximate Subspace-IHT\n1: function AS-IHT($y$, $X$, $t$)\n2:     $\\hat{\\theta}^0 \\leftarrow 0$\n3:     for $i \\leftarrow 0, \\ldots, t$ do\n4:         $b^i \\leftarrow X^T (y - X \\hat{\\theta}^i)$\n5:         $\\hat{\\theta}^{i+1} \\leftarrow \\mathcal{T}(\\hat{\\theta}^i + \\mathcal{H}(b^i))$\n6:     return $\\hat{\\theta} \\leftarrow \\hat{\\theta}^{t+1}$\n\nThe main difference to \u201cstandard\u201d projected gradient descent is that we apply a projection to both the\ngradient step and the new iterate. Intuitively, the head projection ensures two points: (i) The result of\nthe head projection on $b^i$ still contains a constant fraction of the residual $\\theta^* - \\hat{\\theta}^i$ (see Lemma 13 in\nAppendix A). 
(ii) The input to the tail approximation is close enough to the constraint set U so that\nthe tail approximation does not prevent the overall convergence. In a nutshell, the head projection\n\u201cdenoises\u201d the gradient so that we can then safely apply an approximate tail projection (as pointed\nout in [13], only applying an approximate tail projection fails precisely because of \u201cnoisy\u201d updates).\nFormally, we obtain the following theorem for each iteration of AS-IHT (see Appendix A.1 for the\ncorresponding proof):\n\n6\n\n\fTheorem 9. Let \u02c6\u2713i be the estimate computed by AS-IHT in iteration i and let ri+1 = \u2713\u21e4  \u02c6\u2713i+1 be\nthe corresponding residual. Moreover, let U be an arbitrary subspace model. We also assume:\n\n\u2022 y = X\u2713\u21e4 + e as in Equation (1) with \u2713\u21e4 2M (U).\n\u2022 T is a (cT , U, UT )-approximate tail projection.\n\u2022 H is a (cH, U  UT , UH)-approximate head projection.\n\u2022 The matrix X satis\ufb01es the (U  UT  UH, )-subspace RIP.\n\nThen the residual error of the next iterate, i.e., ri+1 = \u2713\u21e4  \u02c6\u2713i+1 satis\ufb01es\n\nwhere\n\n\u2318 = (1 + cT )\u2713 +q1  \u23182\n\n\u23180 = cH(1  )  ,\n\nri+1 \uf8ff \u2318ri + \u21e2kek ,\n0\u25c6 ,\u21e2\n\nand\n\n= (1 + cT ) \u23180\u21e20p1  \u23182\n\n\u21e20 = (1 + cH)p1 +  .\n\n0\n\n+ p1 + ! ,\n\nThe important conclusion of Theorem 9 is that AS-IHT still achieves linear convergence when the\napproximation ratios cT , cH are suf\ufb01ciently close to 1 and the RIP-constant  is suf\ufb01ciently small.\nFor instance, our approximation algorithms for both low-rank matrices offer such approximation\nguarantees. We can also achieve a suf\ufb01ciently small value of  by using a larger number of linear\nobservations in order to strengthen the RIP guarantee (see Appendix D). Hence the use of approximate\nprojections only affects the theoretical sample complexity bounds by constant factors. 
Moreover,\nour experiments show that approximate projections achieve essentially the same empirical sample\ncomplexity as exact projections (see Section 5).\nGiven sufficiently small / large constants $c_T$, $c_H$, and $\\delta$, it is easy to see that the linear convergence\nimplied by Theorem 9 directly gives the recovery guarantee and bound on the number of iterations\nstated in Theorem 5 (see Appendix A.1). However, in some cases it might not be possible to design\napproximation algorithms with constants $c_T$ and $c_H$ sufficiently close to 1 (in contrast, increasing\nthe sample complexity by a constant factor in order to improve $\\delta$ is usually a direct consequence of\nthe RIP guarantee or similar statistical regularity assumptions). In order to address this issue, we\nshow how to \u201cboost\u201d an approximate head projection so that the new approximation ratio is arbitrarily\nclose to 1. While this also increases the size of the resulting subspace model, this increase usually\naffects the sample complexity only by constant factors as before. Note that for any fixed $c_T$, setting\n$c_H$ sufficiently close to 1 and $\\delta$ sufficiently small leads to a convergence rate $\\eta < 1$ (cf. Theorem 9).\nHence head boosting enables a linear convergence result for any initial combination of $c_T$ and $c_H$\nwhile only increasing the sample complexity by a constant factor (see Appendix A.3). Formally, we\nhave the following theorem for head boosting, the proof of which we defer to Appendix A.2.\nTheorem 10. Let $\\mathcal{H}$ be a $(c_H, \\mathbb{U}, \\mathbb{U}_H)$-approximate head projection running in time $O(T)$, and let\n$\\varepsilon > 0$. 
Then there is a constant c = c\",cH\nthat depends only on \" and cH such that we can construct\na (1  \", U,c UH)-approximate head projection running in time O(c(T + T 01 + T 02)) where T 01 is\nthe time needed to apply a projection onto a subspace in c UH, and T 02 is the time needed to \ufb01nd an\northogonal projector for the sum of two subspaces in c UH.\nWe note that the idea of head boosting has already appeared in the context of structured sparse\nrecovery [13]. However, the proof of Theorem 10 is more involved because the subspace in a general\nsubspace model can have arbitrary angles (for structured sparsity, the subspaces are either parallel or\northogonal in each coordinate).\n\n4 Low-rank matrix recovery\n\nWe now instantiate our framework for recovery from a subspace model to the low-rank matrix\nrecovery problem. Since we already have proposed the top-level recovery algorithm in the previous\nsection, we only have to provide the problem-speci\ufb01c head and tail approximation algorithms here.\nWe use the following result from prior work on approximate SVDs.\nFact 11 ([17]). There is an algorithm APPROXSVD with the following guarantee. Let A 2 Rd1\u21e5d2\nbe an arbitrary matrix, let r 2 N be the target rank, and let \"> 0 be the desired accuracy. Then with\nprobability 1  , APPROXSVD(A, r, \") returns an orthonormal set of vectors z1, . . . 
, $z_r \\in \\mathbb{R}^{d_1}$\nsuch that for all $i \\in [r]$, we have\n\n$$\\left| z_i^T A A^T z_i - \\sigma_i^2 \\right| \\le \\varepsilon \\sigma_{r+1}^2 , \\qquad (3)$$\n\nwhere $\\sigma_i$ is the $i$-th largest singular value of $A$. Furthermore, let $Z \\in \\mathbb{R}^{d_1 \\times r}$ be the matrix with\ncolumns $z_i$. Then we also have\n\n$$\\| A - Z Z^T A \\|_F \\le (1 + \\varepsilon) \\| A - A_r \\|_F , \\qquad (4)$$\n\nwhere $A_r$ is the best rank-$r$ Frobenius-norm approximation of $A$. Finally, the algorithm runs in time\n\n$$O\\left( \\frac{d_1 d_2 r \\log(d_2/\\psi)}{\\sqrt{\\varepsilon}} + \\frac{d_1 r^2 \\log^2(d_2/\\psi)}{\\varepsilon} + \\frac{r^3 \\log^3(d_2/\\psi)}{\\varepsilon^{3/2}} \\right),$$\n\nwhere $\\psi$ denotes the failure probability.\nIt is important to note that the above results hold for any input matrix and do not require singular value\ngaps. The guarantee (4) directly gives a tail approximation guarantee for the subspace corresponding\nto the matrix $Z Z^T A$. Moreover, we can convert the guarantee (3) to a head approximation guarantee\n(see Theorem 18 in Appendix B for details). 
Since the approximation parameter $\\varepsilon$ only enters the running time\nin the approximate SVD, we can directly combine these approximate projections with Theorem 9,\nwhich then yields Theorem 6 (see Appendix B.1 for details).\u00b2 Empirically, we show in the next\nsection that a very small number of iterations in APPROXSVD already suffices for accurate recovery.\n\n[Figure 1 panels. Left, \u201cMatrix recovery\u201d: probability of recovery versus oversampling ratio n/r(d1 + d2), with curves for Exact SVD, PROPACK, Krylov (1 iters), and Krylov (8 iters). Right, \u201cMatrix completion\u201d: running time (sec) versus oversampling ratio n/rd1, with curves for PROPACK, LinearTimeSVD, and Krylov (2 iters).]\n\nFigure 1: Left: Results for a low-rank matrix recovery experiment using subsampled Fourier measurements.\nSVP / IHT with one iteration of a block Krylov SVD achieves the same phase transition as\nSVP with an exact SVD. Right: Results for a low-rank matrix completion problem. SVP / IHT with a\nblock Krylov SVD achieves the best running time and is about 4 to 8 times faster than PROPACK.\n\n5 Experiments\n\nWe now investigate the empirical performance of our proposed algorithms. We refer the reader to\nAppendix E for more details about the experiments and results for 2D histograms.\nConsidering our theoretical results on approximate projections for low-rank recovery, one important\nempirical question is how the use of approximate SVDs such as [17] affects the sample complexity\nof low-rank matrix recovery. For this, we perform a standard experiment and use several algorithms\nto recover an image of the MIT logo from subsampled Fourier measurements (cf. Appendix D). The\nMIT logo has also been used in prior work [15, 19]; we use an image with dimensions 200 \u00d7 133\nand rank 6 (see Appendix E). We limit our attention here to variants of SVP because the algorithm\nhas good empirical performance and has been used as a baseline in other works on low-rank recovery.\nFigure 1 (left) shows that SVP / IHT combined with a single iteration of a block Krylov SVD [17] achieves\nthe same phase transition as SVP with exact SVDs. This indicates that the use of approximate\nprojections for low-rank recovery is not only theoretically sound but can also lead to practical\nalgorithms. In Appendix E we also show corresponding running time results demonstrating that the\nblock Krylov SVD also leads to the fastest recovery algorithm.\nWe also study the performance of approximate SVDs for the matrix completion problem. We generate\na symmetric matrix of size 2048 \u00d7 2048 with rank $r = 50$ and observe a varying number of entries\nof the matrix. 
The approximation errors of the various algorithms are again comparable and are reported\nin Appendix E. Figure 1 (right) shows the resulting running times for several sampling ratios. Again,\nSVP combined with a block Krylov SVD [17] achieves the best running time. Depending on the\noversampling ratio, the block Krylov approach (now with two iterations) is 4 to 8 times faster than\nSVP with PROPACK.\n\n\u00b2 We remark that our definitions require head and tail projections to be deterministic, while the approximate\nSVD is randomized. However, the running time of APPROXSVD depends only logarithmically on the failure\nprobability, and it is straightforward to apply a union bound over all iterations of AS-IHT. Hence we ignore\nthese details here to simplify the presentation.\n\nReferences\n\n[1] Richard G. Baraniuk, Volkan Cevher, Marco F. Duarte, and Chinmay Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 56(4):1982\u20132001, 2010.\n\n[2] Stephen Becker, Volkan Cevher, and Anastasios Kyrillidis. Randomized low-memory singular value projection. In SampTA (Conference on Sampling Theory and Applications), 2013.\n\n[3] Srinadh Bhojanapalli, Anastasios Kyrillidis, and Sujay Sanghavi. Dropping convexity for faster semi-definite optimization. arXiv preprint arXiv:1509.03917, 2015.\n\n[4] Thomas Blumensath. Sampling and reconstructing signals from a union of linear subspaces. IEEE Transactions on Information Theory, 57(7):4660\u20134671, 2011.\n\n[5] Thomas Blumensath and Mike E. Davies. Iterative hard thresholding for compressive sensing. Applied and Computational Harmonic Analysis, 27(3):265\u2013274, 2009.\n\n[6] Jian-Feng Cai, Emmanuel J. Cand\u00e8s, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956\u20131982, 2010.\n\n[7] Yudong Chen and Martin J. Wainwright. 
Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv preprint arXiv:1509.03025, 2015.\n\n[8] Graham Cormode, Minos Garofalakis, Peter J. Haas, and Chris Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1\u20133):1\u2013294, 2012.\n\n[9] Mark Davenport and Justin Romberg. An overview of low-rank matrix recovery from incomplete observations. arXiv preprint arXiv:1601.06422, 2016.\n\n[10] Anna C. Gilbert, Sudipto Guha, Piotr Indyk, Yannis Kotidis, S. Muthukrishnan, and Martin J. Strauss. Fast, small-space algorithms for approximate histogram maintenance. In STOC, 2002.\n\n[11] Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, and Martin J. Strauss. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, volume 1, pages 79\u201388, 2001.\n\n[12] Raja Giryes and Deanna Needell. Greedy signal space methods for incoherence and beyond. Applied and Computational Harmonic Analysis, 39(1):1\u201320, 2015.\n\n[13] Chinmay Hegde, Piotr Indyk, and Ludwig Schmidt. Approximation algorithms for model-based compressive sensing. IEEE Transactions on Information Theory, 61(9):5129\u20135147, 2015.\n\n[14] Yannis Ioannidis. The history of histograms (abridged). In Proceedings of the 29th International Conference on Very Large Data Bases, Volume 29, pages 19\u201330. VLDB Endowment, 2003.\n\n[15] Prateek Jain, Raghu Meka, and Inderjit S. Dhillon. Guaranteed rank minimization via singular value projection. In NIPS, 2010.\n\n[16] Rasmus M. Larsen. PROPACK. http://sun.stanford.edu/~rmunk/PROPACK/.\n\n[17] Cameron Musco and Christopher Musco. Randomized block Krylov methods for stronger and faster approximate singular value decomposition. In NIPS, 2015.\n\n[18] S. Muthukrishnan, Viswanath Poosala, and Torsten Suel. 
On rectangular partitionings in two dimensions: Algorithms, complexity and applications. In ICDT, pages 236\u2013256, 1999.\n\n[19] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471\u2013501, 2010.\n\n[20] Yousef Saad. On the rates of convergence of the Lanczos and the block-Lanczos methods. SIAM Journal on Numerical Analysis, 17(5):687\u2013706, 1980.\n\n[21] Nitin Thaper, Sudipto Guha, Piotr Indyk, and Nick Koudas. Dynamic multidimensional histograms. In SIGMOD, 2002.\n\n[22] Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Benjamin Recht. Low-rank solutions of linear matrix equations via Procrustes Flow. In ICML, 2016.\n\n[23] Tuo Zhao, Zhaoran Wang, and Han Liu. Nonconvex low rank matrix factorization via inexact first order oracle. https://www.princeton.edu/~zhaoran/papers/LRMF.pdf.\n\n[24] Qinqing Zheng and John Lafferty. A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements. In NIPS, 2015.\n", "award": [], "sourceid": 2165, "authors": [{"given_name": "Chinmay", "family_name": "Hegde", "institution": "Iowa State University"}, {"given_name": "Piotr", "family_name": "Indyk", "institution": "MIT"}, {"given_name": "Ludwig", "family_name": "Schmidt", "institution": "MIT"}]}