NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 6975
Title: k-Means Clustering of Lines for Big Data

Reviewer 1

k-means clustering of lines is a natural generalization of the vanilla k-means problem, and it has potential for dealing with noise, errors, and missing information. However, few studies have focused on this problem, and most of them either target special cases or give heuristic algorithms. This paper proposes the first PTAS for the problem, with near-optimal (linear) dependency on the data size. The algorithm can also be easily extended to online settings where the data may be distributed among several machines. Detailed proofs of the important results and code are provided to inspire more follow-up work in this area. The overall structure of the paper is fine and easy to follow, but some points remain confusing. See the improvements section for details.

Reviewer 2

The paper is extremely difficult to read for a non-expert. The algorithm design makes use of sensitivity sampling techniques. However, it is hard to grasp the intuition and main ideas behind the algorithms. The algorithms are given as pseudocode, and their properties are stated as technical theorems. Most of the proofs are in the full version (which I did not go through). I think that, due to space restrictions, the authors did not have room to give the main ideas behind some of the pseudocode. However, it might be better to give the intuition in the paper and push the pseudocode to the full version.

Reviewer 3

The authors consider the problem of clustering a set of lines in R^d. The goal is to minimize the k-means objective: given n lines L in R^d, find the best set of k points c_1, ..., c_k in R^d so as to minimize sum_{l in L} min_{c_i} dist(c_i, l)^2. This is a clean, nicely motivated problem.

The authors provide a coreset construction (namely, a small-size summary of the input such that any alpha-approximation for the summary yields an alpha(1+epsilon)-approximation for the entire input). This implies the first (1+epsilon)-approximation for the problem, with running time nd exp(poly(k)), together with a streaming algorithm with similar running time and memory size 2^{poly(k)} log n. En route to this result, the authors provide a bicriteria approximation algorithm: namely, a solution that contains O(k (log n dk log k)) centers and whose cost is at most 4 times the cost of the optimal solution with k centers.

I think the paper introduces a couple of new techniques and new ideas and makes significant progress on the problem. The ideas behind the approaches (sampling to estimate the location of the center, sensitivity sampling, bounding the VC dimension, merge and reduce, etc.) are not completely new, but proving that they indeed work for the case of line clustering is a challenge and a good result. The experiments are ok and seem to indicate that the algorithm is competitive (see comments below, though). I think the results are interesting; I recommend acceptance.

Comments:
- You say that you have a deterministic algorithm for the coreset construction, yet the theorem says the opposite.
- There are various typos here and there; please check carefully.
- Why use EM + k-means++ and not simply k-means++?
- How did you compute the optimal solution?
- It seems to me that for an offline (1+epsilon)-approximation, one may be able to combine your lemmas on sampling (say, Lemma 6.3) with [A Simple D^2-Sampling Based PTAS for k-Means and Other Clustering Problems] by Ragesh Jaiswal, Amit Kumar, and Sandeep Sen, so as to obtain an improved running time of nd 2^k.
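
To make the objective from the review above concrete, here is a minimal Python sketch (not taken from the paper under review) of evaluating the k-means-for-lines cost. It assumes each line is represented as a pair (point on the line, unit direction vector); the names line_dist_sq and kmeans_lines_cost are illustrative only:

import numpy as np

def line_dist_sq(center, point, direction):
    # Squared Euclidean distance from `center` to the line
    # {point + t * direction : t in R}; `direction` is assumed unit-norm.
    diff = center - point
    # Subtract the component of `diff` along the line; the remainder is
    # the perpendicular offset, whose squared norm is the distance.
    perp = diff - np.dot(diff, direction) * direction
    return float(np.dot(perp, perp))

def kmeans_lines_cost(centers, lines):
    # k-means objective for lines: each line is charged the squared
    # distance to its nearest center, summed over all lines.
    return sum(min(line_dist_sq(c, p, v) for c in centers)
               for (p, v) in lines)

# Toy check in R^2: the lines y = 1 and y = -1 against a single center
# at the origin; each contributes squared distance 1, so the cost is 2.
lines = [(np.array([0.0, 1.0]), np.array([1.0, 0.0])),
         (np.array([0.0, -1.0]), np.array([1.0, 0.0]))]
print(kmeans_lines_cost([np.array([0.0, 0.0])], lines))  # 2.0

This replaces the point-to-point distance of vanilla k-means with a point-to-line distance; everything downstream in the paper (coresets, bicriteria approximation, streaming) is about approximating this sum with far fewer than n lines.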