NeurIPS 2020

Fast and Accurate $k$-means++ via Rejection Sampling


Review 1

Summary and Contributions: The paper shows how to speed up the k-means++ algorithm without substantial loss in the quality of the solution. In particular, the authors consider the case of very high k (think \sqrt{n}), where even one iteration takes O(ndk) time. They reduce the runtime to about O~(nd + n^{1+epsilon}). The quality of the solution, theoretically, drops from O(log k) to O(log k)*(1/epsilon)^{O(1)}. Empirically, the paper shows that, with almost no loss in accuracy, the algorithm is faster by about an order of magnitude for values of k above 1000 (on datasets of 1/4 to 1/2 million points).

Strengths:
- This work speeds up a very popular algorithm, of wide interest and applicability.
- At first it's almost surprising that one can improve on O(ndk): after all, just directly computing which point goes to which cluster would take that time (a minimal seeding sketch below makes this cost concrete).
- The use of a hierarchical tree is interesting.
- Empirical results are convincing.
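To make the O(ndk) baseline concrete, here is a minimal sketch of standard D^2 seeding in Python/NumPy (my own illustration, not code from the paper): each of the k rounds updates every point's squared distance to the newly opened center, costing O(nd) per round and hence O(ndk) in total.

    import numpy as np

    def kmeans_pp_seeding(X, k, seed=0):
        """Standard k-means++ (D^2) seeding; O(ndk) overall for an (n, d) dataset."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        centers = [X[rng.integers(n)]]               # first center chosen uniformly
        d2 = np.sum((X - centers[0]) ** 2, axis=1)   # squared distance to nearest center
        for _ in range(k - 1):
            probs = d2 / d2.sum()                    # the D^2 distribution
            c = X[rng.choice(n, p=probs)]            # sample the next center
            centers.append(c)
            # O(nd) update: only distances to the newly opened center are computed.
            d2 = np.minimum(d2, np.sum((X - c) ** 2, axis=1))
        return np.array(centers)

The submission's contribution, as I understand it, is to avoid the full O(nd) scan per round by sampling only approximately from this distribution.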

Weaknesses: No significant weaknesses. One of course may quibble that there are a number of nuisance 1/epsilon and log factors (some hidden in the O~ notation).

Correctness: yes

Clarity: yes

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback:


Review 2

Summary and Contributions: After reading the authors' response, my evaluation remains the same. The submission presents a new and fast algorithm for k-means++ seeding: the authors analysed the approximation guarantee and the runtime of the presented algorithm; extensive experimental results are also reported.

Strengths: The runtime of the presented algorithm is significantly faster than the previously best-known algorithm for large values of k; the paper is very well written, and the high-level ideas behind the design and analysis of the algorithm are very clear to me; the problem studied in the submission is very relevant to the NeurIPS community.

Weaknesses: The major weakness of the work is its significance compared with previous work. Specifically, although the runtime of the presented algorithm is faster than that of previous algorithms, their experiments show that the cost of the k clusters returned by the new algorithm is significantly worse (by about 10%) when k <= 1000. I note that the authors did highlight in Footnote 3 applications in which k-means with a large value of k is needed. However, I am not convinced that there are "many practical applications" that require k-means with k > 1000. The lack of applications and the algorithm's weaker performance for k < 1000 are the submission's main weaknesses.

Correctness: The claims and methods are appropriately discussed, and the experimental setup is very clear to me.

Clarity: The paper is very well written, and my detailed comments are listed in the Additional Feedback section. There is one point that concerns me, i.e., the use of "nearly linear time". Since the algorithm's runtime is at least n^{1+epsilon}, this family of algorithms is usually called *almost linear time algorithms* rather than nearly linear time algorithms. That is, even after ignoring poly-log factors, the algorithm's runtime is still superlinear in n.

Relation to Prior Work: Relation to prior work is clearly discussed, and I understand how the authors are able to achieve an improvement over the previous algorithms.

Reproducibility: Yes

Additional Feedback:
- The algorithm's runtime is a function of log Delta, where Delta is the ratio between the maximum and minimum distance between two points in the dataset. Now assume that we add an additional point p to the input, where p is arbitrarily close to some input point. While this doesn't influence the k-means output for most input instances, introducing p would dramatically increase the value of Delta, and hence the algorithm's runtime (which doesn't look reasonable to me); see the small illustration below. Hence, I'm wondering whether this log(Delta) term is really needed, or whether there is some way to redefine Delta, avoid this case, and consequently improve the algorithm's runtime.
- Line 186: drop the comma before "equals the sum of weights"
- Line 214: *from* the D2-distribution
- Line 243: We start *by*
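A toy numerical illustration of the point above (my own hypothetical example, not data from the paper): adding a single near-duplicate point blows up Delta, and hence the log(Delta) factor in the runtime bound, without meaningfully changing the clustering.

    import numpy as np
    from itertools import combinations

    def aspect_ratio(X):
        """Delta = (max pairwise distance) / (min pairwise distance)."""
        dists = [np.linalg.norm(a - b) for a, b in combinations(X, 2)]
        return max(dists) / min(dists)

    X = np.array([[0.0], [1.0], [100.0]])
    print(aspect_ratio(X))                    # 100.0, so log(Delta) is small
    X2 = np.vstack([X, [[1.0 + 1e-9]]])       # one near-duplicate of an existing point
    print(aspect_ratio(X2))                   # ~1e11, so log(Delta) jumps by orders of magnitude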


Review 3

Summary and Contributions: The goal of the k-means problem is to partition the data into k clusters with minimum cost. k-means++ is an algorithm for the k-means problem that has theoretical guarantees and is widely used in practice. The first step of the k-means++ algorithm is to find a good initialization using samples from the D^2-distribution. The paper's main contribution is an approximate way to sample from the D^2-distribution that has theoretical guarantees on (1) the quality of the result, which is comparable to the k-means++ guarantee, and (2) the running time. Importantly, there is a regime where the paper achieves a better running time than other algorithms.
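For reference, the D^2-distribution mentioned here is the standard k-means++ sampling distribution (standard notation, not quoted from the paper): given the set C of centers opened so far and dataset X,

    Pr[x is sampled] = D(x)^2 / \sum_{y \in X} D(y)^2,   where D(x) = \min_{c \in C} ||x - c||.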

Strengths: The paper designs a new algorithm that is provably better than known algorithms in some regimes, and these regimes fit several real datasets.

Weaknesses:
Theory: The proposed algorithm has a better running time than other methods only if the following two conditions hold: (1) k is very large, k >> d, and (2) the aspect ratio is small.
Experimental:
- The new algorithm's running time is much better than k-means++ only for large k, but for large k the clusters are very small. For example, in KDD-Cup with k=5000, which showed the greatest improvement in running time, the average cluster size is only 62.
- The authors design two algorithms: "Fast K-means++" and a faster algorithm, "Rejection Sampling". Empirically, we see that the theoretically slower algorithm (Fast K-means++) is actually faster.

Correctness: The suggested algorithms and experiments seem sound.

Clarity: Yes, the paper is well written. It clearly explains how it compares to other methods, the new algorithms, and the experiments.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback:


Review 4

Summary and Contributions: The authors present a nearly-linear-time algorithm for k-means++ seeding. The main contribution is an O(nd + (n log(\Delta))^{1+\eps})-time algorithm that returns an O(log(k) / eps^3)-approximate solution. The proposed algorithm is based on three key ingredients: (a) Bartal/FRT tree embeddings to approximate distances from a point to a subset of points (using only 3 trees; Lemma 3.1), (b) a tree data structure that allows sampling points from the k-means++ D^2 distribution (Corollary 4.3), and (c) rejection sampling along with an LSH approximate nearest neighbor data structure (Theorem 5.1). The high-level idea is to use a constant number of tree metrics and an additional "sample tree" to maintain an approximate D^2 distribution implicitly. Rejection sampling and LSH are then applied to efficiently sample from the tree data structures.
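As a generic illustration of ingredient (c), here is a minimal sketch of textbook rejection sampling (not the paper's exact procedure; all names below are hypothetical): given cheap proposal weights that overestimate the true D^2 weights by at most a known factor M, one can sample exactly from the true distribution by accepting a proposal draw with probability true/(M * proposal).

    import numpy as np

    def rejection_sample(propose, proposal_weight, true_weight, M, rng):
        """Draw an index distributed proportionally to true_weight, assuming
        true_weight(i) <= M * proposal_weight(i) for every i."""
        while True:
            i = propose()                                       # cheap draw, e.g., via a tree metric
            accept = true_weight(i) / (M * proposal_weight(i))  # in [0, 1] by assumption
            if rng.random() < accept:
                return i

    # Tiny demo: target weights w, uniform proposal, M chosen so that w[i] <= M * 1.
    w = np.array([1.0, 2.0, 7.0])
    rng = np.random.default_rng(0)
    i = rejection_sample(propose=lambda: rng.integers(len(w)),
                         proposal_weight=lambda i: 1.0,
                         true_weight=lambda i: w[i],
                         M=w.max(), rng=rng)

In the submission, as I understand it, the cheap proposal comes from the tree-metric approximation of the D^2 distribution, and (approximate) distances to the nearest open center are used in the acceptance step.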

Strengths:
* Solid theoretical results that are well presented
* Novel use of rejection sampling and tree embeddings
* Speeding up k-means++ is of great importance to practitioners

Weaknesses:
* Time-complexity bounds depend on the input dataset through \Delta, i.e., the ratio of maximum to minimum distance, which can be unbounded.
* The experimental evaluation is based on only 2 datasets.
* The resulting algorithm is quite involved (tree embeddings, LSH and a sample tree) to be practical.

Correctness: To the best of my knowledge, all results are stated rigorously and well explained. My only comment is that the authors should explicitly mention the probabilistic nature of the main result due to LSH. I elaborate on this in my comments below.

Clarity: The paper is well-written and the authors made an effort to clearly present their contribution

Relation to Prior Work: The authors do a great job discussing prior contributions. A notable example is footnote 6.

Reproducibility: Yes

Additional Feedback:
- Overall: Why are only 3 trees sufficient for Lemma 3.1? Three looks like a magic number after reading the paper. In L90-92 you explain the known result that a single tree metric does not suffice, but why three trees?
- What are the space requirements of the proposed algorithm?
- L36-41: In your main contribution, you should *not* use \tilde{O} without explicitly defining the hidden terms. The dependency on \epsilon in the approximation should also be explicit. Is it O(log k / eps^3), as in Corollary 5.5? Please be precise.
- The resulting algorithm is randomized due to the failure probability of LSH. Please carry this statement into the main result, i.e., state a failure probability.
- Section 3: the MultiTree* methods could be renamed to ThreeTree* since you have exactly three trees.
- Section 4: You should define earlier what you mean by "open centers". It is not clear until the reader arrives at L136-137.
- Lemma 4.1: Shouldn't the time complexity depend on k? The expression "opening any set S of k points" is not clear. Maybe you want to state that Algorithm 1 runs in the mentioned time complexity.
- Lemma 4.2: "is output" -> is sampled (suggestion to rephrase, feel free to ignore)
- Why is Corollary 4.3 true? Lemma 4.2 returns sampling probabilities that are only approximate with respect to the D^2 distribution. L198-206 are correct in terms of time complexity, but I cannot follow the approximation guarantee.
- Section 5, L235-236: "We therefore assume throughout the analysis that our data structure is successful". Please remove this assumption when you present the main contribution, i.e., make your statements probabilistic.
- L277: "O(logn) our algorithm" -> I think the word "dimensions" is missing.
- Corollary 5.5: make \eps explicit in the bound. Why \Theta here? Can you show a lower bound on the time complexity?
- Corollary 5.5: "there is an algorithm.." -> you can make this statement more explicit, i.e., there is a randomized algorithm, and use a failure probability in the statement.
=======
After reading the authors' feedback, I still consider the submission to be slightly below the acceptance threshold, but I lower my confidence level. I still believe that the dependency on \Delta is the weakest point of the submission.