NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, Vancouver Convention Center
Reviewer 1
In general, the total least-squares (TLS) problem, as well as its regularized variant, is important for machine learning. For large-scale problems, using methods that can exploit the sparsity of the inputs is crucial, so the solution suggested by the authors is relevant. The ideas look original and the paper is fairly clearly written, though there are some presentational issues: for example, I do not see the point of having an "informal" version of the main theorem, as Theorem 1 is not much less complicated than Theorem 3.10. The informal discussion of the result should not be called a "theorem".

The proposed solution is interesting and the theoretical and empirical results are promising; however, it is not elegant that the approximation is only provided with a fixed 0.9 probability, instead of a user-chosen confidence level. This is probably the most restrictive property of the solution. The 0.9 should ideally be generalized to an arbitrary confidence probability; doing so would considerably increase the significance of the proposed method.

Minor comments:
- The "poly(n/epsilon) d" part is outside the \tilde{O} notation in the abstract, but it is inside in Theorem 3.10.
- If the "informal" version of the theorem is kept, it should not say "with high probability" but "with probability 0.9".
- Was there really 4 TB of RAM in the server used for testing, or is it a typo (TB vs. GB)?

Post-rebuttal comments: Thank you for your reply; I am satisfied with your answers and have raised my score accordingly. Please do include the extension to arbitrary confidence probabilities in the next version of the paper (as discussed in your rebuttal).
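The extension Reviewer 1 asks for is the standard amplification-by-repetition argument; below is a minimal, illustrative sketch, not the paper's method. The names solve_once and cost are hypothetical stand-ins for one run of the randomized algorithm and for evaluating the achieved TLS objective of a candidate, which is assumed to be cheap relative to a full run.

    import math

    def boosted_solve(solve_once, cost, delta):
        """Amplify a randomized solver that succeeds with probability >= 0.9
        into one that succeeds with probability >= 1 - delta.
        solve_once() returns one candidate solution; cost(x) evaluates its
        TLS objective. Both are assumed callables supplied by the caller."""
        k = max(1, math.ceil(math.log(1.0 / delta, 10)))  # 0.1 ** k <= delta
        candidates = [solve_once() for _ in range(k)]
        # A run "succeeds" if its candidate is a (1 + eps)-approximation.
        # Returning the cheapest candidate inherits the guarantee of any
        # successful run, and all k runs fail with probability <= 0.1 ** k.
        return min(candidates, key=cost)

This multiplies the running time by only an O(log(1/delta)) factor, and is presumably the kind of extension the rebuttal refers to.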
Reviewer 2
Update: I have read the author responses, and they have addressed my main concern by explaining how Table 1 was generated. My score remains 7 (accept), as before.

The work is a novel combination of existing approaches to sketching for low-rank approximation: the authors carefully chain together CountSketch and leverage-score sketching. The resulting algorithm demonstrates a desirable empirical tradeoff between accuracy and run-time, and is accompanied by supporting theory. The paper is clearly written: it provides an outline of the proof of correctness of the main algorithm that gives the reader an intuitive grasp of how the dimensionality-reduction arguments come together to ensure a low relative-error approximation and a fast run-time. The empirical evaluation is not sufficiently convincing to this reviewer: the description of the experiments on real data that generated Table 1 is missing; how were these datasets converted into TLS problems? The significance of this paper is mainly theoretical: it shows that sketching can be employed to obtain a near linear-time approximation algorithm for the TLS problem.

Some comments:
- Claim 3.1 should use the square of OPT.
- On line 265, the rank should be 2, not 3.
- It would be clearer to report the times in Figure 1 as time_method/time_TLS, in the same way as the costs are reported relative to those of TLS.
- Section 4.1 references a figure (Figure 2) that has been moved to the supplement.
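For readers unfamiliar with the first of the two primitives the review mentions, here is a minimal, illustrative CountSketch transform for a (possibly sparse) matrix; the function name, dimensions, and seeding are this sketch's own and not taken from the paper.

    import numpy as np
    from scipy import sparse

    def countsketch(A, m, seed=None):
        """Apply an m x n CountSketch S to A (n x d): each row of A is hashed
        to one of m buckets and multiplied by a random sign. Because S has a
        single nonzero per column, forming S @ A touches every nonzero of A
        exactly once, i.e. O(nnz(A)) time."""
        rng = np.random.default_rng(seed)
        n, _ = A.shape
        buckets = rng.integers(0, m, size=n)      # hash h: [n] -> [m]
        signs = rng.choice([-1.0, 1.0], size=n)   # sign s: [n] -> {-1, +1}
        S = sparse.csr_matrix((signs, (buckets, np.arange(n))), shape=(m, n))
        return S @ A

The key point is that forming S @ A costs O(nnz(A)) time, which is where the input-sparsity behaviour discussed by the reviewers comes from; how the CountSketch and leverage-score stages are chained is specific to the paper's algorithm.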
Reviewer 3
This paper brings the techniques of fast sketching and randomized embeddings to total least-squares regression. These techniques have been extremely successful in giving fast input-sparsity-time algorithms for approximately solving linear regression problems. Directly applying those algorithms to total regression leads to much larger running times than the "input sparsity" time established by this paper. As a result, I think it is an interesting result for the NeurIPS audience, for a pretty basic problem. To the best of my knowledge, this is the first input-sparsity-time algorithm for this problem. On the flip side, this problem is far less ubiquitous than linear regression, and the techniques involved are the same as those developed for linear regression.

The experimental evaluation is reasonable but unimpressive. The scale of problems that the algorithm is able to handle is medium (~20k rows). The UCI datasets used are also on the smaller side.

Small comments:
- The figures don't have axis labels / units.
- Figure 2 is referred to in the paper, but is only in the supplementary material.
- I am not sure I see error bars in Figure 1, though the checklist indicates that they are present.

Post-rebuttal: Thank you for your response. Having read it, I wish to revise my view of the significance/novelty of the paper, and I am upgrading my overall score to a 7.
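For context on what is being accelerated, here is a minimal sketch of the classical dense TLS solver via an SVD of the augmented matrix [A b] (the standard textbook formulation, not code from the paper); its cost is what makes input-sparsity-time methods attractive on large sparse inputs.

    import numpy as np

    def tls_solve(A, b):
        """Classical total least-squares solution via a full SVD of [A b].
        Returns x minimizing the Frobenius norm of the joint perturbation
        [dA db] subject to (A + dA) x = b + db. Costs roughly O(n d^2),
        with no way to exploit sparsity of A."""
        n, d = A.shape
        M = np.hstack([A, b.reshape(-1, 1)])
        _, _, Vt = np.linalg.svd(M, full_matrices=False)
        v = Vt[-1]             # right singular vector for the smallest singular value
        if abs(v[d]) < 1e-12:  # degenerate case: the TLS solution may not exist
            raise ValueError("TLS solution is undefined (last component of v is ~0)")
        return -v[:d] / v[d]

This dense SVD routine is presumably the "TLS" baseline relative to which times and costs are reported in Figure 1.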