NeurIPS 2020

On Convergence of Nearest Neighbor Classifiers over Feature Transformations


Review 1

Summary and Contributions: Update: Thanks for addressing the concerns raised by the reviewers. Based on re-reading the paper and going over the comments, I am able to understand the experiments better, and based on the authors' comments that they will revise the draft to make things clearer, I will change my score to accept. Having said that, I would still keep my confidence low, since I am unable to accurately assess the significance of the result, and I believe that would be a key factor to consider in a novel theoretical paper. ------------------------------------------------- The paper provides a theoretical upper bound on the convergence rate of kNN classifiers over feature transformations. The result is based on two key properties of the transformed space that the authors identify. The first is 'safety', a measure of how well the posterior in the original space can be recovered from the feature space. The second is 'smoothness', a measure of how hard it is to recover the posterior in the original space from the feature space (related to the Lipschitz constant of the recovery function).

Strengths: The paper seems to address a novel issue, bridging the gap between the theoretical understanding of kNN and its practical application. The authors provide a theoretical upper bound on the convergence rate.

Weaknesses: While experiments are presented, the results could have been presented with more clarity. I couldn't immediately see the significance of the result in the experiments presented; a more elaborate explanation would be helpful.

Correctness: They appear to be correct but I have not verified the proofs step by step.

Clarity: I had some issues with certain parts of the paper; it would be helpful if the authors provided definitions of all the notation used.

Relation to Prior Work: They mention previous work on kNN, but state that the paper presents a novel analysis and is one of the first in this direction.

Reproducibility: Yes

Additional Feedback: Could you please elaborate on how the definitions of safety and smoothness relate to the representation loss and the Lipschitz constant, respectively? In Section 5, you mention that there is a significant improvement in correlation on all the text classification datasets. Could you elaborate on this further?


Review 2

Summary and Contributions: This paper analyzes k-NN classification error when the feature space has been transformed from its original raw space to, for instance, a learned embedding space. Importantly, the resulting bounds depend on a notion of "safety" (how well the raw feature space's posterior probability function can be recovered) and "smoothness" (a Lipschitz constant related to the recovery function).
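To fix notation for the rest of my review, my rough reading of these two quantities is sketched below; the symbols f, g, eta, delta, and L are my own illustrative choices, and the paper's exact definitions may differ in detail.

```latex
% Sketch (reviewer's reading, not the paper's exact definitions):
% f : X \to Z is the feature transformation and \eta(x) = P(Y = 1 \mid X = x)
% is the posterior in the raw space. The transformation is \delta-safe if some
% recovery map g reproduces the raw-space posterior up to mean squared error \delta,
\exists\, g : Z \to [0,1] \ \text{s.t.}\ \mathbb{E}_x\!\left[\big(g(f(x)) - \eta(x)\big)^2\right] \le \delta,
% and its smoothness is the Lipschitz constant L of that recovery map,
|g(z) - g(z')| \le L \,\|z - z'\| \quad \text{for all } z, z' \in Z.
```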

Strengths: This work addresses a major gap between theory and practice of k-NN methods: nowadays, k-NN classification is typically applied in a learned embedding space (e.g., a penultimate or bottleneck layer of a deep net) rather than in the raw feature space, but theory for k-NN methods is typically stated for the raw feature space or alternatively only in the transformed space (if one treats the transformation as a pre-processing step, so that the transformation itself doesn't appear in the theory except implicitly in the metric). This paper is the first one that I'm aware of that explicitly characterizes error, accounting for there being raw and transformed feature spaces as well as a transformation function. This is a significant advance in theory for nonparametric methods. The authors do a good job motivating and explaining the theory especially in illustrative examples of how transformations affect the Bayes error rate and how this plays into the definition of delta-safety.

Weaknesses: It would be very helpful to get a sense of how tight the theoretical bounds are, if not in theory then in numerical simulations (for example, on a toy dataset where the true posterior probability function is known) with different choices of transformations. I think the experimental results can be improved by also explicitly looking at different values of k (without optimizing over k, which is what seems to be done in the supplementary material), especially since your bound has the O(k^{-1/2}) and O(||w|| (k/n)^{1/d}) terms as well. Then you can use CCA to find a linear combination of k^{-1/2}, ||w|| (k/n)^{1/d}, and MSE^{1/4} to try to explain the k-NN error (if I understand how you're using CCA, basically by choosing k=1, you can just look at ||w|| and MSE^{1/4}?). It seems like picking k=1 perhaps doesn't yield a good bound (unless it turns out that, in practice, the first O(k^{-1/2}) term isn't that important). It would be helpful to understand how much the k^{-1/2} and (k/n)^{1/d} terms matter.
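To make this suggestion concrete, here is a minimal sketch of the kind of check I have in mind. It uses synthetic data, placeholder constants for the MSE and ||w|| of a transformation, and ordinary least squares in place of CCA, so it only illustrates the procedure, not the paper's actual pipeline.

```python
# Sketch: vary k, compute the bound terms k^{-1/2}, ||w||(k/n)^{1/d}, MSE^{1/4},
# and check how well a linear combination of them explains observed k-NN error.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression

# Toy data standing in for a transformed feature space.
X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
n, d = X_tr.shape

# Placeholder values; in the real experiment these would be the estimated
# safety (MSE of posterior recovery) and smoothness (||w||) of each transformation.
mse, w_norm = 0.05, 2.0

terms, errs = [], []
for k in [1, 5, 15, 45, 135]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    errs.append(1.0 - knn.score(X_te, y_te))                     # observed k-NN error
    terms.append([k ** -0.5, w_norm * (k / n) ** (1.0 / d), mse ** 0.25])

# Ordinary least squares stands in for CCA here, just to gauge how much of the
# error the three bound terms can explain across values of k.
reg = LinearRegression().fit(np.array(terms), np.array(errs))
print("R^2 of bound terms vs. observed k-NN error:",
      reg.score(np.array(terms), np.array(errs)))
```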

Correctness: The claims and experimental setup look correct.

Clarity: The paper is overall well written. The paragraph in lines 294-297 could be written a bit more clearly (also, there is a typo there where "later" should be "latter"); from my understanding, the linear combination of the MSE and the norm is determined via CCA, although it seems like you might actually want MSE^{1/4}?

Relation to Prior Work: Yes, the authors place their work in the context of existing literature.

Reproducibility: Yes

Additional Feedback: Update (after author feedback/reading other reviews): Overall, it seems like the experiments and their presentation could use some polishing. The author feedback makes it clear that the authors are aware of the issues and have fixes in mind. I strongly encourage them to revise the experiments section to make things more clear, as they indicated in their response. I am leaving my score the same and still think that this paper should be accepted.


Review 3

Summary and Contributions: This paper proposes a novel convergence analysis of kNN over transformed features. The work is novel and fills the gap between applications and theoretical analyses, most of which are derived for the raw feature space. The authors propose to evaluate the convergence bound based on the safety and smoothness of the transformed space. Empirical evaluations are then conducted with 30 feature transformations on image and text data.

Strengths:
+ The proposed analysis is novel. While most previous analyses focus on the raw feature space, most applications use feature transformations such as PCA or a pre-trained neural network. This mismatch leads to a gap between theoretical analysis and applications, which this work tries to close.
+ The theoretical analysis is sound. The authors provide detailed proofs and analysis.
+ The writing is clear and easy to follow.

Weaknesses:
- The experimental results are not significant. Two straightforward error-rate indicators, the dimension of the space and the LR error, are chosen as baselines, and the authors claim that the proposed bound is superior to these two baselines. However, the Pearson's r correlations in Fig. 4 show that the proposed improvement is marginal. The results on the image datasets in the supplementary material also show that the Pearson's r correlations for the LR Error and the MSE are very close.
- The authors only conduct experiments with k=1. It is unclear whether the empirical conclusions of Fig. 4 remain the same if k becomes larger than 1.
- The figures are not clear enough; e.g., what does each point stand for in Fig. 1 and Fig. 2? The authors could add the necessary information to make this paper more self-contained.

Correctness: The claims and methods are correct. The empirical methodology is correct but not convincing enough.

Clarity: Yes, this paper is clearly written.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: The authors could provide further analysis of the close performance of the LR Error and the MSE. ****************updated review**************** I was convinced by the authors' rebuttal and would like to raise my score to "above the acceptance threshold".