Average Sensitivity of Euclidean k-Clustering

Part of Advances in Neural Information Processing Systems 35 (NeurIPS 2022) Main Conference Track

Bibtex Paper Supplemental

Authors

Yuichi Yoshida, Shinji Ito

Abstract

Given a set of $n$ points in $\mathbb{R}^d$, the goal of Euclidean $(k,\ell)$-clustering is to find $k$ centers that minimize the sum of the $\ell$-th powers of the Euclidean distance of each point to the closest center. In practical situations, the clustering result must be stable against points missing in the input data so that we can make trustworthy and consistent decisions. To address this issue, we consider the average sensitivity of Euclidean $(k,\ell)$-clustering, which measures the stability of the output in total variation distance against deleting a random point from the input data. We first show that a popular algorithm \textsc{$k$-means++} and its variant called \textsc{$D^\ell$-sampling} have low average sensitivity. Next, we show that any approximation algorithm for Euclidean $(k,\ell)$-clustering can be transformed to an algorithm with low average sensitivity while almost preserving the approximation guarantee. As byproducts of our results, we provide several algorithms for consistent $(k,\ell)$-clustering and dynamic $(k,\ell)$-clustering in the random-order model, where the input points are randomly permuted and given in an online manner. The goal of the consistent setting is to maintain a good solution while minimizing the number of changes to the solution during the process, and that of the dynamic setting is to maintain a good solution while minimizing the (amortized) update time.