NeurIPS 2020

Efficient Clustering Based On A Unified View Of K-means And Ratio-cut


Review 1

Summary and Contributions: The paper introduces a computationally efficient clustering algorithm whose objective is the minimisation of the sum of squared (SS) distances between points in the same cluster. To allow for more flexible clustering, only the k-nearest-neighbour distances of each point are considered in the objective. In addition, a simple truncation of large distances is introduced to mitigate the effect of outliers. The method is shown to produce promising practical performance in comparison with existing techniques, both computationally and in terms of clustering performance.
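To fix ideas, the kind of objective described above can be sketched as follows, in generic notation that is not necessarily the paper's exact formulation: within-cluster squared distances are summed only over each point's k nearest neighbours \mathcal{N}_k(i), and large distances are capped at a threshold \epsilon to limit the influence of outliers.

```latex
\min_{A_1,\dots,A_c}\ \sum_{l=1}^{c} \sum_{i \in A_l} \sum_{j \in A_l \cap \mathcal{N}_k(i)}
\min\!\left( \lVert x_i - x_j \rVert^{2},\ \epsilon \right)
```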

Strengths: The main strength of the paper lies in the practical aspects of the proposed algorithm, in that it is efficient and the performance is extremely compelling. Despite the combinatorial nature of the objective, the authors have devised an intelligent and efficient approach to update the cluster assignments iteratively with minimal computational cost.

Weaknesses: The paper is, however, not clearly written. There are numerous instances of notation that is either undefined or used before being defined. The discussion relating to the connection between k-means and clustering with graph cuts is also not new; see for example [1]. Furthermore, while the authors claim that this connection is the basis for their algorithm, the justification for this motivation does not come through. The objective is clear enough without this motivation. [1] Dhillon, Inderjit, Yuqiang Guan, and Brian Kulis. "A unified view of kernel k-means, spectral clustering and graph partitioning." Dept. Comput. Sci., Univ. Texas at Austin, Austin, Tech. Rep. TR-04-25 (2005).

Correctness: As far as I can tell, nothing in the paper is incorrect.

Clarity: The paper is not clearly written, especially in the technical parts. This is a major weakness of the paper. I give a few specific examples in a later section.

Relation to Prior Work: The existing work on the relationship between k-means and graph cuts is not sufficiently discussed.

Reproducibility: Yes

Additional Feedback: EDIT: I am satisfied by the authors' response that they will address the issues of clarity, after which I believe the paper represents a valuable contribution.

I commend the authors for what appears to be an innovative algorithm with extremely good practical performance. I believe the paper could be a very influential one, but I feel the presentation of the work needs to be modified and improved.

(1) Try to be clearer on the motivation for using the connection between k-means and graph cuts. I think there are a few too many concessions made along the way. For example, you begin with ratio cut, then switch to normalised cut when you assert that the affinity matrix is made doubly stochastic. This seems to serve only the purpose of making the objective look like the one you actually use, since in your algorithm you do not normalise the affinity matrix (which in your case is more like a distance matrix). I think that the reformulation of the sum of squared distances between points and their mean in terms of the sum of inter-point squared distances (Lemma 1; note also that this is a well-known fact and does not need its own lemma, and the identity is written out below for reference), combined with the use of only nearest neighbours to avoid the limitations of k-means in terms of cluster shape, is sufficient to motivate your formulation.

(2) The notation and general presentation of the technical material should be reviewed and improved. It currently makes the paper unnecessarily difficult to follow. Some steps in the derivations are also not explained.

(3) Have the paper proof-read to remove grammatical errors.

In addition, I have the following specific comments/queries:
1. Line 30: the computational complexity of spectral clustering is O(cn^2), since only c (the number of clusters) eigenvectors are needed, not all n.
2. Line 59: typo: "is difficulty".
3. Line 63: The sentence is grammatically incorrect.
4. Line 80: You mention that you use boldface for vectors, but this is not the case in the paper.
5. Line 84: Introduce the dimensions of the set of cluster indicator vectors \Phi at this point, i.e., introduce the indices n and c when you first introduce \Phi.
6. Section 2: The discussion of graph cuts is incomplete. You introduce the concepts of vertices and nodes without defining them, nor mentioning their equivalence.
7. Eq. 3: Define the complement of A, \bar A.
8. Line 90: Define the indicator function.
9. Line 93: What was previously the degree matrix D is suddenly changed to \Delta.
10. Lemma 1: typo, p should be greater than zero! Also, define what it means for a vector to be greater than zero.
11. Line 111: "bistochastic" should be "doubly stochastic".
12. Equation 12: You introduce the constraint of balanced cluster sizes, but then subsequently ignore it. It looks as though this is introduced just to get rid of some of the terms in the trace, so that the problem is easier to solve. That is not a problem in and of itself (we often make concessions to obtain approximate solutions to a problem), but you should not state it as a constraint unless you actually enforce it as one.
13. The description of the matrix G is very poor and requires considerably more clarity.
14. Equation 14: You introduce the matrix U_{(i)} without defining it. As far as I can tell this is a permutation matrix that swaps the first and i-th rows/columns, but this should be made precise and explicit.
15. Line 138: What is meant by "Ỹ represents the probability indicator matrix"?
16. Line 141: There is some stray bold font in this line. Where is that from?
17. Figures 1(b) and (c) are not explained. I presume these are clustering solutions from a data set with outliers, but it isn't clear.
18. In the tables you have ignored the cases where SC performs best. Why is this? Also, mention that boldface in the tables indicates the best performance.
19. In the experiments you refer to both accuracy and precision. Do you mean these to be the same?
20. Line 230: typo: "tow" should be "two".

Finally, I think that the practical performance of the algorithm is almost sufficient to warrant publication. However, there are currently just too many minor errors and imprecisions in the paper. If the authors can fix these, then I would be encouraged to raise my score.
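For reference, the well-known identity behind Lemma 1, written in generic notation (the paper's own notation may differ): for a cluster A with mean \mu_A,

```latex
\sum_{i \in A} \lVert x_i - \mu_A \rVert^{2}
  = \frac{1}{2\lvert A \rvert} \sum_{i \in A} \sum_{j \in A} \lVert x_i - x_j \rVert^{2},
\qquad
\mu_A = \frac{1}{\lvert A \rvert} \sum_{i \in A} x_i .
```

Summing this identity over clusters is exactly what connects the sum-of-squares (k-means) objective to a pairwise-distance objective of the graph-cut type, which is why the nearest-neighbour restriction alone already motivates the formulation.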


Review 2

Summary and Contributions: The paper suggests a new algorithm for the k-means problem. The algorithm is a coordinate descent algorithm applied to a new optimization problem defined in the paper. This problem emerges from phrasing k-means and ratio cut in the language of linear algebra. The authors run experiments examining the new algorithm. The datasets are (a) synthetic and (b) face images. They compare the new algorithm to several baselines such as k-means++ and spectral clustering.
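For context, a minimal, generic sketch of coordinate descent over cluster assignments for a k-means-style pairwise sum-of-squares objective (illustrative only; the paper's k-sums update optimizes a different, neighbour-restricted objective and uses additional bookkeeping to make each update cheap):

```python
import numpy as np

def coordinate_descent_assignments(X, labels, n_clusters, n_iter=20):
    """Greedy coordinate descent over cluster assignments for a k-means-style
    pairwise sum-of-squares objective.  Purely illustrative: the paper's
    k-sums update uses a different (neighbour-restricted) objective and
    cleverer bookkeeping to keep each update cheap."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels).copy()
    for _ in range(n_iter):
        changed = False
        for i in range(len(X)):
            best_c, best_cost = labels[i], np.inf
            for c in range(n_clusters):
                members = X[labels == c]
                if members.size == 0:
                    continue
                # Cost of placing point i in cluster c: the sum of squared
                # distances from x_i to the current members of c.
                cost = np.sum((members - X[i]) ** 2)
                if cost < best_cost:
                    best_cost, best_c = cost, c
            if best_c != labels[i]:
                labels[i] = best_c
                changed = True
        if not changed:  # stop when a full sweep changes nothing
            break
    return labels
```

As the other reviews note, the interesting part of the paper is precisely in avoiding the naive per-update cost that this sketch incurs.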

Strengths: The k-means problem is a fundamental problem in machine learning, and fast algorithms are desired. Even though the problem is 50 years old and has been researched thoroughly, the authors bring a new viewpoint and suggest a new algorithm.

Weaknesses:
* Several assumptions are made throughout the text, for example, that the distance measure is doubly stochastic and that the clusters are of similar size. These assumptions might not always hold.
* It is unclear what the novelty of the work is. The optimization problem presented in the paper is not stated as one of the contributions, so I deduce that the algorithm is the novel part, but it is an efficient standard optimization technique (coordinate descent).
* The algorithm adds a new hyperparameter to the problem: how many nearest neighbors (k) to use.

Correctness: In the experiments section (Section 5), the accuracy of the algorithms is calculated using a reference clustering. What is this reference? Are the datasets labeled, and do the labels serve as a proxy for the optimal clustering?
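For reference, the usual protocol when labeled datasets serve as the reference clustering is to match predicted clusters to classes with the Hungarian algorithm and report the best agreement. A minimal sketch of that standard convention (not necessarily the paper's exact definition):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, pred_labels):
    """Standard clustering accuracy: match predicted clusters to ground-truth
    classes with the Hungarian algorithm, then count agreements.
    (Sketch of the usual protocol; the paper may define accuracy differently.)"""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    n = max(true_labels.max(), pred_labels.max()) + 1
    # Count matrix between predicted clusters (rows) and true classes (columns).
    counts = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(true_labels, pred_labels):
        counts[p, t] += 1
    # Maximize total agreement over one-to-one cluster-to-class matchings.
    row, col = linear_sum_assignment(-counts)
    return counts[row, col].sum() / len(true_labels)
```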

Clarity: The paper is well-written. A few suggestions to improve it:
- Give more intuition for the different equations on page 4.
- State all the assumptions in one place, at the beginning.
- The D tilde on line 122 is unnecessary.

Relation to Prior Work: It was not clear whether the optimization problem in Equation (13) was already known in the literature. It would be great if a subsection were added on the literature on k-means from the linear-algebra point of view.

Reproducibility: Yes

Additional Feedback: Thanks for your feedback. I am interested in better understanding your proposed model's novelty compared to the known literature (for example, the reference suggested by Reviewer #1). I am looking forward to reading your revised paper.


Review 3

Summary and Contributions: The work proposes a clustering method (called "k-sums") based on "a unified view of k-means and ratio-cut" algorithms. A number of benefits are demonstrated for the method, including a time complexity that is independent of the number of clusters, directly obtaining the clustering results without post-processing, and improved empirical results on the experimental datasets.
=================
Although concerns still exist, the overall score was increased to "weak acceptance" in response to the other reviewers' rough consensus.

Strengths:
1. Clustering is a fundamental and important problem in machine learning and data analysis. The proposed work fits the scope of the NeurIPS community quite well.
2. The proposed clustering algorithm runs much faster than the classical k-means algorithm, a characteristic that is highly desired in practical applications.

Weaknesses:
1. On novelty: The relationship between k-means-type algorithms and graph-cut-based algorithms has been well studied in the machine learning literature, for example, Dhillon et al. 2004. Deriving algorithms from this viewpoint is not surprising, and the novelty of the proposed work might be challenged.
2. On evaluation: The evaluation needs to be improved. The proposed clustering method uses a k-d tree to construct a k-nearest-neighbor graph as an intermediate step (see the construction sketch after this list). Correspondingly, it is suggested to include the clustering performance of the k-d tree as a baseline for comparison.
3. With the constructed k-nearest-neighbor graph available, why not include the performance of the ratio-cut or normalized-cut algorithms as another comparison baseline? This would help figure out where the improvement of the proposed approach comes from.
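For concreteness, a minimal sketch of the kind of k-d-tree-based k-nearest-neighbour graph construction referred to in point 2 (exact search via scipy; the paper's construction details, such as exact vs. approximate search or edge weighting, may differ):

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_graph(X, k):
    """Build a k-nearest-neighbour graph with a k-d tree (minimal sketch;
    the paper's intermediate construction may differ in its details)."""
    X = np.asarray(X, dtype=float)
    tree = cKDTree(X)
    # Query k + 1 neighbours because each point's nearest neighbour is itself.
    dist, idx = tree.query(X, k=k + 1)
    return dist[:, 1:], idx[:, 1:]  # drop the self-neighbour column
```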

Correctness: I did not verify all the mathematical details, but the result is not surprising to me.

Clarity: Although the writing is roughly acceptable, there is much room for improvement. A thorough proofreading and reorganization are strongly suggested. Two examples:
- Line 99, "For any p<=0": should it be "p>=0"?
- Line 156, "It takes O(n + c) time to sort v using bucket sort algorithm": why?
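For context, a bucket (counting) sort runs in O(n + c) when the keys are integers in a range of size c, which is presumably the assumption behind the claim on line 156; a minimal sketch under that assumption (whether the vector v in the paper actually satisfies it is exactly the question above):

```python
def bucket_sort_small_ints(v, c):
    """Sort a list of integers, each in {0, ..., c-1}, in O(n + c) time.
    (Generic counting-sort sketch; it relies on the keys being bounded by c,
    which is what an O(n + c) claim usually requires.)"""
    counts = [0] * c                    # O(c)
    for x in v:                         # O(n)
        counts[x] += 1
    out = []
    for key, cnt in enumerate(counts):  # O(n + c) in total
        out.extend([key] * cnt)
    return out
```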

Relation to Prior Work: Roughly acceptable, but can be improved significantly.

Reproducibility: Yes

Additional Feedback: See the "Weaknesses" section.


Review 4

Summary and Contributions: I read the rebuttal and am overall satisfied with the response, and thus keep my score. This paper presents a review of the unified framework of k-means and ratio-cut, and then proposes a novel and efficient clustering algorithm based on this framework.

Strengths: The most attractive contribution of this article is its performance in terms of both computational overhead and effectiveness. Specifically, (1) the model proposed in this article, called k-sums, has a computational complexity of O(nk) that is independent of the number of clusters to construct, which is a very surprising and important result; (2) the effectiveness of k-sums has been verified through experiments conducted on 20 benchmark datasets ranging between a few hundred and nearly 0.5 million samples and 5 synthetic datasets ranging between 100k and 1 million samples.

Pros:
1. To be honest, clustering methods with excellent performance are not rare in the literature. However, k-means is still the clustering algorithm that engineers prefer, mainly because those other algorithms often have a high time complexity or involve nuisance hyper-parameter(s). Only one parameter is involved in k-sums, i.e., the number of nearest neighbors, which is an integer and is easy to tune; this is very important in practical applications.
2. The method used to optimize the objective function of k-sums is a standard coordinate descent algorithm. However, the idea used to speed up the optimization process is very interesting: it makes the computational overhead independent of the number of clusters to construct. Such a result is very rare among partition-based clustering methods.
3. Extensive experiments on 12 middle-scale and 8 large-scale benchmark datasets validate the effectiveness and efficiency of the proposed algorithm compared to state-of-the-art clustering algorithms.

Weaknesses:
1. The equivalence symbol (\Leftrightarrow) between formulas (12) and (13) is very confusing to me. It is recommended that the authors use a clearer way to express the derivation of the model.
2. The review of k-means and spectral clustering takes up too much space. It is recommended to delete some of the content of this section.
3. This submission has a few typos, as follows. (1) On page 7, line 202, "18x and 29x" should be "15x and 13x", as shown in the supplementary material. (2) On page 5, line 141, "ng, " should not be shown in bold. (3) Some equations are not followed by punctuation marks, e.g., (4), (5), (7), and (19).
4. A study of the sensitivity of the algorithm to the parameter k should be added. If the performance of the algorithm is greatly affected by the value of k, that would be a disadvantage of the algorithm.

Overall, this paper presents an algorithm with excellent performance in terms of both computational overhead and effectiveness. Such an algorithm should be very popular in practical applications.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: See the "Weaknesses" section.