NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 1451 Optimal Analysis of Subset-Selection Based L_p Low-Rank Approximation

### Reviewer 1

The main result is an extension of the two known results: a recent result from [13] that provide the suboptimal approximation ratio of order $O(k+1)$ over $p \ge 1$, and the classical result from Deshpande et al. ([14]) that analyzes the case $p = 2$ (i.e., Frobenius norm), and proves the bound $O(\sqrt{k+1})$ matching that of the present work. The key technical novelty is the application of the Riesz-Torin theorem that allows to bound the approximation ratio for any~$p \ge 1$ by interpolating the bounds for $p = 1$, $p =2$, and the bound for $p = \infty$ obtained by the authors (the analysis for the latter case is, in fact, way simpler than for the case $p = 2$ addressed in [14]). This allows to obtain sharp results for $p \in [2, \infty$, whereas those for $p \in [1,2]$ remain suboptimal. As far as I can tell, the Riesz-Torin theorem has been hitherto unknown in the TCS community, and drawing attention to it would be useful; hence I recommend to accept the paper. The main reason for my (modest) evaluation score is that the case $p = 2$, clearly the hardest to analyze among $p \in \{1,2,\infty\}$, is already known from the classical work [14]. The key contribution is really the application of the Riesz-Torin theorem to interpolate between the different values of $p$; arguably, this step is less interesting mathematically than the analysis of the case $p = 2$ (cf. lines 412-412 vs. lines 421-423 in Appendix 3). The lower bound construction also crucially relies on the known results. However, the application of the Riesz-Torin theorem is non-trivial (as it invoves introducing a specific multi-linear map), so I think this is a decent contribution, and the paper is worth to be accepted, even in absence of experiments and demonstrated applications of the new results. My other remarks are minor and mostly refer to the exposition. First, the authors denote the usual and ordered subsets by the same symbols $I$ and $J$ throughout the text, which is at times quite confuzing, and hindered my verification of the proof of Theorem 1.1 (which I was still able to verify). For example, in lines 123-127, $I$ and $J$ denote the ordered subsets, whil, e.g., in the proof of Lemma 2.3 the same symbols are used for unordered subsets. Second, the reduction to the cases $p \in \{1,2,\infty\}$ in lines 223-227 is very concise (I had to perform the calculation ad-hoc), as this part of the text was unreadable for me. I urge the authors to improve this part. Regarding the quality of the text and the exposition, the paper is quite well-written. A few typos: L194: "The second [to] last" L417: The factor $2$ in the second term of the right hand side is extraneous, accordingly further down the text. This does not change the argument. L421: $l$ instead of $t$ in the second line of the display.

### Reviewer 2

I read the author feedback. Thanks for clarifying how the analysis of the case p=2 is different than prior work. ---------- The paper considers low-rank matrix approximation in l_p norm for any p>=1. One of the popular approaches in this area is to construct a rank-k approximation of an n x m matrix A from a subset of k columns of A. This restriction is natural in practical settings when columns represent data points and we want to find a small but representative core-set of data. While this problem is well understood for p=2, this is not the case for other l_p norms which offer advantages in terms of robustness and sparsity of the solutions. Prior to this work, Chierichetti et al (2017) [13] showed that for any p>=1 a subset of k columns can always be found such that the l_p approximation error is within a k+1 factor of the optimum, while for p=2 Deshpande et al (2006) [14] showed that for p=2 a factor of sqrt{k+1} can be achieved. The results of this paper close that gap by showing that we can achieve a (k+1)^{1/p} factor for 1<=p<=2 and (k+1)^{1-1/p} for p>=2. This matches the best known factors for p=1 and 2 but improves them in all other cases. New matching lower bounds are shown for p>=2. While algorithmic efficiency is of secondary concern in this paper, the main results imply improved guarantees for existing polynomial time algorithms for this task. It is worth noting that the cases of p=1 and 2 are still the ones with primary practical interest, and it would be helpful if the authors found a reference to an applied paper showing the advantages of low-rank approximation with l_p error for, say, 1

### Reviewer 3

The paper is well written, well motivated and most of the results are stated clearly with well written proofs. I have enumerated my questions and comments below. Questions and Comments 1. In Ln. 46 the authors say "Unfortunately the loss surface of the problem suffers from spurious local minima and saddle points [12]". [12] doesnt seem to be a correct reference for this fact. Is this fact known? And if yes, can the authors point to a reference for this. 2. (Ln. 70) CUR is undefined. 3. (Ln. 79) I think you mean lower bounded by \Omega(k^{1-2/p}) 4. (Ln. 83) you say there is O(k^{2/p}-1) gap but given your results there seems to be a O(k^{1/p - 1}) gap. 5. Theorem 1.2 is hard for me to understand. Specifically I am not certain I understand what "There exist infinitely many different valyes of k" means here. Do the authors mean for every m there exists a k such that the lower bound holds? 6. (Ln. 126) The dimension of matrix seems non standard, I would prefer if the authors change that to row of the matrix (which is the term that they have used before) ============Post Author Feedback============================= I would like to thank the authors for their polite rebuttal and for clarifying my confusion about their lower bound.