Learning the k in k-means

Part of Advances in Neural Information Processing Systems 16 (NIPS 2003)

Bibtex Metadata Paper

Authors

Greg Hamerly, Charles Elkan

Abstract

When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic prob- lem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test ac- cepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the stand- ard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize strongly enough the model’s complexity.

1 Introduction and related work

Clustering algorithms are useful tools for data mining, compression, probability density es- timation, and many other important tasks. However, most clustering algorithms require the user to specify the number of clusters (called k), and it is not always clear what is the best value for k. Figure 1 shows examples where k has been improperly chosen. Choosing k is often an ad hoc decision based on prior knowledge, assumptions, and practical experience. Choosing k is made more difficult when the data has many dimensions, even when clusters are well-separated.

Center-based clustering algorithms (in particular k-means and Gaussian expectation- maximization) usually assume that each cluster adheres to a unimodal distribution, such as Gaussian. With these methods, only one center should be used to model each subset of data that follows a unimodal distribution. If multiple centers are used to describe data drawn from one mode, the centers are a needlessly complex description of the data, and in fact the multiple centers capture the truth about the subset less well than one center.

In this paper we present a simple algorithm called G-means that discovers an appropriate k using a statistical test for deciding whether to split a k-means center into two centers. We describe examples and present experimental results that show that the new algorithm

Figure 1: Two clusterings where k was improperly chosen. Dark crosses are k-means centers. On the left, there are too few centers; five should be used. On the right, too many centers are used; one center is sufficient for representing the data. In general, one center should be used to represent one Gaussian cluster.

is successful. This technique is useful and applicable for many clustering algorithms other than k-means, but here we consider only the k-means algorithm for simplicity.

Several algorithms have been proposed previously to determine k automatically. Like our method, most previous methods are wrappers around k-means or some other clustering algorithm for fixed k. Wrapper methods use splitting and/or merging rules for centers to increase or decrease k as the algorithm proceeds.

Pelleg and Moore [14] proposed a regularization framework for learning k, which they call X-means. The algorithm searches over many values of k and scores each clustering model using the so-called Bayesian Information Criterion [10]: BIC(C|X) = L(X|C)− p 2 log n where L(X|C) is the log-likelihood of the dataset X according to model C, p = k(d + 1) is the number of parameters in the model C with dimensionality d and k cluster centers, and n is the number of points in the dataset. X-means chooses the model with the best BIC score on the data. Aside from the BIC, other scoring functions are also available.

Bischof et al. [1] use a minimum description length (MDL) framework, where the descrip- tion length is a measure of how well the data are fit by the model. Their algorithm starts with a large value for k and removes centers (reduces k) whenever that choice reduces the description length. Between steps of reducing k, they use the k-means algorithm to optimize the model fit to the data.

With hierarchical clustering algorithms, other methods may be employed to determine the best number of clusters. One is to build a merging tree (“dendrogram”) of the data based on a cluster distance metric, and search for areas of the tree that are stable with respect to inter- and intra-cluster distances [9, Section 5.1]. This method of estimating k is best applied with domain-specific knowledge and human intuition.

2 The Gaussian-means (G-means) algorithm

The G-means algorithm starts with a small number of k-means centers, and grows the number of centers. Each iteration of the algorithm splits into two those centers whose data appear not to come from a Gaussian distribution. Between each round of splitting, we run k-means on the entire dataset and all the centers to refine the current solution. We can initialize with just k = 1, or we can choose some larger value of k if we have some prior knowledge about the range of k.

G-means repeatedly makes decisions based on a statistical test for the data assigned to each center. If the data currently assigned to a k-means center appear to be Gaussian, then we want to represent that data with only one center. However, if the same data do not appear

−0.100.10.20.30.40.50.60.70.80.90.10.20.30.40.50.60.70.80.9−3−2−10123−4−3−2−101234 Algorithm 1 G-means(X, α) 1: Let C be the initial set of centers (usually C ← {¯x}). 2: C ← kmeans(C, X). 3: Let {xi|class(xi) = j} be the set of datapoints assigned to center cj. 4: Use a statistical test to detect if each {xi|class(xi) = j} follow a Gaussian distribution

(at confidence level α).

5: If the data look Gaussian, keep cj. Otherwise replace cj with two centers. 6: Repeat from step 2 until no more centers are added.

to be Gaussian, then we want to use multiple centers to model the data properly. The algorithm will run k-means multiple times (up to k times when finding k centers), so the time complexity is at most O(k) times that of k-means. The k-means algorithm implicitly assumes that the datapoints in each cluster are spherically distributed around the center. Less restrictively, the Gaussian expectation-maximization algorithm assumes that the datapoints in each cluster have a multidimensional Gaussian distribution with a covariance matrix that may or may not be fixed, or shared. The Gaussian distribution test that we present below are valid for either covariance matrix assumption. The test also accounts for the number of datapoints n tested by incorporating n in the calculation of the critical value of the test (see Equation 2). This prevents the G-means algorithm from making bad decisions about clusters with few datapoints.

2.1 Testing clusters for Gaussian fit

To specify the G-means algorithm fully we need a test to detect whether the data assigned to a center are sampled from a Gaussian. The alternative hypotheses are • H0: The data around the center are sampled from a Gaussian. • H1: The data around the center are not sampled from a Gaussian.

If we accept the null hypothesis H0, then we believe that the one center is sufficient to model its data, and we should not split the cluster into two sub-clusters. If we reject H0 and accept H1, then we want to split the cluster. The test we use is based on the Anderson-Darling statistic. This one-dimensional test has been shown empirically to be the most powerful normality test that is based on the empirical cumulative distribution function (ECDF). Given a list of values xi that have been converted to mean 0 and variance 1, let x(i) be the ith ordered value. Let zi = F (x(i)), where F is the N(0, 1) cumulative distribution function. Then the statistic is