NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 5042
Title: Tight Dimensionality Reduction for Sketching Low Degree Polynomial Kernels

Reviewer 1


This work achieves an improved bound on the sample complexity of random tensor projections, and it is argued that this bound is tight and nearly optimal. A key observation is to view the random sketch as a bilinear form of a random matrix, which makes the analysis amenable to general matrix concentration inequalities. The authors obtain better bounds by analyzing both the operator and Frobenius norms of this random matrix, which is the key challenge of the work. Their proof techniques differ from previous approaches and are very impressive. Applying nontrivial matrix concentration inequalities to obtain an exponentially improved bound has great impact, and this work can point the way for challenging analyses in many other settings. The paper is also well written and its organization is comprehensive. Although the proof techniques are involved and require a large amount of prior knowledge, the contributions and proofs are straightforward to understand. However, readers not familiar with prior work on random tensor sketching may struggle, since there is no preliminaries section. It would be better to provide some background with concrete descriptions of the Johnson-Lindenstrauss transform, CountSketch, and TensorSketch (a minimal CountSketch illustration is included at the end of this review for concreteness). Overall, I believe the impact of this work is significant and I vote for acceptance.

Some minor issues or typos:
- Please give the full bibliographic information for reference [28]. If the reference is not yet published, it would be better not to cite it.
- In the equation between lines 135 and 136, the norm notation is incomplete.
- In line 191, a right parenthesis is missing.
- Below equation (2) (between lines 155 and 156), S^i x^i should be a vector, while the right-hand side appears to be a scalar.

==============================================================================
I read all reviews and the author feedback. The final manuscript, incorporating the authors' response, can improve the writing quality and highlight the novelty of this work.
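To make the requested background concrete, here is a minimal CountSketch example in NumPy. This is illustrative only; the hash and sign choices below are my own and not necessarily the construction used in the paper.

import numpy as np

def countsketch(x, h, s, m):
    # (Cx)_j = sum over coordinates i with h(i) = j of s(i) * x_i
    Cx = np.zeros(m)
    np.add.at(Cx, h, s * x)
    return Cx

rng = np.random.default_rng(0)
n, m = 1000, 200
h = rng.integers(0, m, size=n)         # shared hash bucket for each coordinate
s = rng.choice([-1.0, 1.0], size=n)    # shared random sign for each coordinate
x, y = rng.standard_normal(n), rng.standard_normal(n)
# With shared h and s, <Cx, Cy> approximates <x, y> in expectation.
print(np.dot(countsketch(x, h, s, m), countsketch(y, h, s, m)), np.dot(x, y))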

Reviewer 2


Summary: The paper presents a sketching method for tensors (a variant of the method given in Kar and Karnick, 2012) and provides its theoretical analysis. Specifically, the authors improve the previously known bound on the sketch size from n^q to log^q(n), an exponential improvement in the dependence on the data dimension n. Further, the authors combine CountSketch with the proposed method, which enjoys a better theoretical guarantee (the dependence on n is removed) while remaining fast for rank-1 tensors (the computational cost is on the order of the sketch size times nnz(x^i), where x^i is an n-dimensional component vector of a q-th order tensor; see the small illustration at the end of this review). It is shown that the proposed method outperforms TensorSketch (Pham and Pagh, 2013) and Random Maclaurin (Kar and Karnick, 2012) in approximating polynomial kernels. Also, the proposed method achieves better test error when compressing a neural network than Arora et al., 2018, with a small sketch size.

The analysis given by the authors is technically sound. Tensor sketching is an important task in machine learning, and the exponential improvement in the sketch size is particularly useful in high-dimensional settings such as NLP (e.g., a vocabulary used for text representation can be large). I also value the practicality of the proposed method combined with CountSketch, since sketching itself can be computationally expensive. Together with the experimental results, I consider the contributions of the paper significant.

Some clarifications that could make the paper more accessible:
* The scaling given in Kar and Karnick, 2012 differs from that of the proposed method because of the Maclaurin coefficients. Should this not be clarified?
* The line "For each sketch coordinate it randomly picks degree t with probability 2^{-t} and computes a degree-t Tensorized Random Projection": what does degree mean here?

Minor points:
* The statements regarding epsilon-distortion in Theorems 2.1 and 2.2 and Lemma 2.3: should they not allow equality, e.g., P((1 - eps)||x|| <= ||Sx|| <= (1 + eps)||x||)?
* The scaling 1/\sqrt{m} is missing in line 151 (S^i) and line 168 (S_0).
* Should the proof of the lemma clarify that it deals with the case where x^i is a unit vector?
* Line 437: p-th entry -> i-th entry?
* Lemma 4.6: the induction index should be q instead of k (i.e., prove for q = 2 and then for q = k > 2?).
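To illustrate the cost claim above, here is a minimal version of the fixed-degree Tensorized Random Projection applied to a rank-1 tensor x^1 (x) ... (x) x^q. This is a sketch under the assumption of i.i.d. Rademacher entries and 1/sqrt(m) scaling; the paper's exact distribution may differ.

import numpy as np

def tensorized_random_projection(xs, m, rng):
    # Coordinate j of the sketch is (1/sqrt(m)) * prod_i <g_j^i, x^i>, with
    # independent Rademacher vectors g_j^i. The cost per coordinate is
    # sum_i nnz(x^i), so the total cost is about m * sum_i nnz(x^i).
    out = np.ones(m)
    for x in xs:                          # one component vector per tensor mode
        G = rng.choice([-1.0, 1.0], size=(m, len(x)))
        out *= G @ x                      # entrywise products of <g_j^i, x^i>
    return out / np.sqrt(m)

rng = np.random.default_rng(0)
xs = [rng.standard_normal(500) for _ in range(2)]    # q = 2, n = 500
sk = tensorized_random_projection(xs, 4096, rng)
# In expectation, ||sk||^2 equals prod_i ||x^i||^2, the squared Frobenius
# norm of the rank-1 tensor being sketched.
print(np.sum(sk ** 2), np.prod([np.sum(x ** 2) for x in xs]))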

Reviewer 3


This paper studies theoretical properties of the existing Tensorized Random Projection. It also studies the theoretical properties of a sketching method that combines the Tensorized Random Projection with CountSketch. The improvement in the theoretical bound is marginal. Moreover, no experimental comparison has been done between the combined method and the existing Tensorized Random Projection. The paper is technically sound and has theoretical significance, but it is not well written and needs improvement.

Major comments:
- The authors provide a proof of an exponentially better embedding dimension for a particular choice of S in Theorem 2.2. It is not clear whether that particular choice of S is a contribution of this paper. If so, it would be better to give a clear introduction of the proposed S before Theorem 2.2.
- What T is in Theorem 2.4 is not clear to me.
- It is not clear whether the experiments use the existing Tensorized Random Projection proposed in [29] or a new version that differs from [29]. If it is the Tensorized Random Projection from [29], then the motivation behind the experiments is not clear to me. If it is something different, then it would be better to include a comparison with the Tensorized Random Projection from [29].
- Figure 1 is not legible. What happens when m = n for both methods?

Minor comments:
- The definitions of a few notations are not provided. For example, in line 48 the definition of the sketching matrix is not clear.
- Why are all experiments conducted with a polynomial kernel of degree 2? What happens with higher-order kernels? (A small numerical illustration of what such a comparison might measure is given at the end of this review.)

---After Rebuttal---
The authors have replied to my comments, and the inclusion of the required definitions has been promised. I would like to increase my score to 6, provided they update the manuscript as promised in the rebuttal.
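As a concrete version of the degree question above, a small experiment along the following lines could measure the kernel approximation error for several degrees q. This is my own illustrative code using a plain Rademacher Tensorized Random Projection, not the authors' implementation.

import numpy as np

def sketch_power(x, Gs):
    # Tensorized Random Projection of the q-fold tensor power of x,
    # using shared Rademacher matrices Gs = [G_1, ..., G_q].
    out = np.ones(Gs[0].shape[0])
    for G in Gs:
        out *= G @ x
    return out / np.sqrt(Gs[0].shape[0])

rng = np.random.default_rng(0)
n, m = 200, 2048
x, y = rng.standard_normal(n), rng.standard_normal(n)
x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)
for q in (2, 3, 4):                               # polynomial kernel degree
    Gs = [rng.choice([-1.0, 1.0], size=(m, n)) for _ in range(q)]
    approx = np.dot(sketch_power(x, Gs), sketch_power(y, Gs))
    print(q, abs(approx - np.dot(x, y) ** q))     # |<Sx, Sy> - <x, y>^q|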