NIPS 2018
Sun Dec 2nd through Sat the 8th, 2018 at Palais des Congrès de Montréal
Paper ID: 1141 A Practical Algorithm for Distributed Clustering and Outlier Detection

### Reviewer 1

This paper investigates distributed K-means clustering with outlier detection, following a simultaneous communication setting in which data are partitioned across multiple sites. The method follows a standard two-level clustering. The first-level performs a summary of the local data set of each site. At the second level, a coordinator performs clustering using all the summaries. The main contribution of this work is in constructing simple summaries at the sites, which runs in time O(max{k; log n} n), with n the size of the data set and k the maximum number of cluster prototypes. The authors report experiments on both real and synthetic data, showing competitive performances of the proposed algorithm. The paper is clear and very well written. The experiments are comprehensive. The approach is simple, which is good. It is, therefore, fast. Several of the existing algorithms, which provide approximation guarantee in the same setting, require high-order polynomial time (quadratic or higher in n). This makes the proposed algorithm quite practical for large-scale data set. Having said that, I am not very familiar with the literature in these distributed and simultaneous communication settings. So, in summary, if there is no competing method in the literature with similar or lower complexity, then this is a nice work that tackles an important problem (definitely worth publication). Apart from this, I do not have any serious concern about the submission. Opinion after rebuttal: The author response is convincing to me and there is consensus among the reviewers that the paper is interesting. I am keeping my initial score.

### Reviewer 2

The paper addresses the problem of performing the distributed k-mean/median clustering in the presence of outliers, and at the same time identifying the outliers. Data are partitioned across multiple sites either adversarially or randomly, and the sites and a central coordinator work jointly by communications to get the data clustering and the outliers. The authors proposed a practical algorithm with bounded running time $O(max{k,\log n} n)$, and bounded communication cost $O(s(k\log n+t))$ and $O(sk\log n+t)$ for adversarial and random data partitioning respectively, for a dataset with n data points, k centers, t outliers, and partitioned across s sites. They used a traditional two-level clustering framework (Guha et al. 2017). If using a $\gamma$-approximation algorithm for (k,t)-mean/median as the second-level clustering algorithm, their distributed algorithm has a bounded $O(\gamma)$ approximation factor. Extensive experimental studies were conducted to compare the performance of their algorithm with three baseline algorithms. The problem studied is very interesting and obviously has strong practical motivations. The authors can provide theoretical guarantees on the approximation factor, and upper bounds on the size of summary, the communication cost and the computation time. This is excitingly favourable. The analysis in bounding the approximation factor (Theorem 1) is interesting. The paper is pretty well written and is easy to follow. However, the summary construction algorithm and the distributed k-mean/median algorithm seem to largely follow Mettu and Plaxton 2002 and Guha et al. 2003, 2017. The paper does not provide significant algorithmic insights and a complete understanding, as existing clustering based NIPS papers, e.g. Chen et al. 2016. Currently the authors have upper bounds on the computation time and the communication cost. Is it possible to derive the time lower bound, as in Mettu and Plaxton 2002, or the communication lower bound, as in Chen et al. 2016? Only after deriving those lower bounds, they can have a complete understanding on the problem studied. Therefore, it might not be qualified as a seminar paper to get in NIPS. Furthermore, it lacks comprehensive experimental studies. The current experimental work focuses on two real-world datasets KddSp and KddFull. However, KddSp is only a subset of KddFull! More evaluation on real-world datasets are required before drawing the conclusion. The other two datasets in the supp. material are both synthetic datasets.