{"title": "An Impossibility Theorem for Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 463, "page_last": 470, "abstract": "", "full_text": "An Impossibility Theorem for Clustering\n\nJon Kleinberg\n\nDepartment of Computer Science\n\nCornell University\nIthaca NY 14853\n\nAbstract\n\nAlthough the study of clustering is centered around an intuitively\ncompelling goal, it has been very di(cid:14)cult to develop a uni(cid:12)ed\nframework for reasoning about it at a technical level, and pro-\nfoundly diverse approaches to clustering abound in the research\ncommunity. Here we suggest a formal perspective on the di(cid:14)culty\nin (cid:12)nding such a uni(cid:12)cation, in the form of an impossibility theo-\nrem: for a set of three simple properties, we show that there is no\nclustering function satisfying all three. Relaxations of these prop-\nerties expose some of the interesting (and unavoidable) trade-o(cid:11)s\nat work in well-studied clustering techniques such as single-linkage,\nsum-of-pairs, k-means, and k-median.\n\n1\n\nIntroduction\n\nClustering is a notion that arises naturally in many (cid:12)elds; whenever one has a het-\nerogeneous set of objects, it is natural to seek methods for grouping them together\nbased on an underlying measure of similarity. 
A standard approach is to represent the collection of objects as a set of abstract points, and define distances among the points to represent similarities: the closer the points, the more similar they are. Thus, clustering is centered around an intuitively compelling but vaguely defined goal: given an underlying set of points, partition them into a collection of clusters so that points in the same cluster are close together, while points in different clusters are far apart.\n\nThe study of clustering is unified only at this very general level of description, however; at the level of concrete methods and algorithms, one quickly encounters a bewildering array of different clustering techniques, including agglomerative, spectral, information-theoretic, and centroid-based, as well as those arising from combinatorial optimization and from probabilistic generative models. These techniques are based on diverse underlying principles, and they often lead to qualitatively different results. A number of standard textbooks [1, 4, 6, 9] provide overviews of a range of the approaches that are generally employed.\n\nGiven the scope of the issue, there has been relatively little work aimed at reasoning about clustering independently of any particular algorithm, objective function, or generative data model. But it is not clear that this needs to be the case. To take a motivating example from a technically different but methodologically similar setting, research in mathematical economics has frequently formalized broad intuitive notions (how to fairly divide resources, or how to achieve consensus from individual preferences) in what is often termed an axiomatic framework: one enumerates a collection of simple properties that a solution ought to satisfy, and then studies how these properties constrain the solutions one is able to obtain [10].
In some striking cases, as in Arrow\u2019s celebrated theorem on social choice functions [2], the result is impossibility: there is no solution that simultaneously satisfies a small collection of simple properties.\n\nIn this paper, we develop an axiomatic framework for clustering. First, as is standard, we define a clustering function to be any function f that takes a set S of n points with pairwise distances between them, and returns a partition of S. (The points in S are not assumed to belong to any ambient space; the pairwise distances are the only data one has about them.) We then consider the effect of requiring the clustering function to obey certain natural properties. Our first result is a basic impossibility theorem: for a set of three simple properties (essentially scale-invariance, a richness requirement that all partitions be achievable, and a consistency condition on the shrinking and stretching of individual distances), we show that there is no clustering function satisfying all three. None of these properties is redundant, in the sense that it is easy to construct clustering functions satisfying any two of the three. We also show, by way of contrast, that certain natural relaxations of this set of properties are satisfied by versions of well-known clustering functions, including those derived from single-linkage and sum-of-pairs. In particular, we fully characterize the set of possible outputs of a clustering function that satisfies the scale-invariance and consistency properties.\n\nHow should one interpret an impossibility result in this setting?
The fact that it arises directly from three simple constraints suggests a technical underpinning for the difficulty in unifying the initial, informal concept of \"clustering.\" It indicates a set of basic trade-offs that are inherent in the clustering problem, and offers a way to distinguish between clustering methods based not simply on operational grounds, but on the ways in which they resolve the choices implicit in these trade-offs. Exploring relaxations of the properties helps to sharpen this type of analysis further, providing a perspective, for example, on the distinction between clustering functions that fix the number of clusters a priori and those that do not, and between clustering functions that build in a fundamental length scale and those that do not.\n\nOther Axiomatic Approaches. As discussed above, the vast majority of approaches to clustering are derived from the application of specific algorithms, the optima of specific objective functions, or the consequences of particular probabilistic generative models for the data. Here we briefly review work seeking to examine properties that do not overtly impose a particular objective function or model.\n\nJardine and Sibson [7] and Puzicha, Hofmann, and Buhmann [12] have considered axiomatic approaches to clustering, although they operate in formalisms quite different from ours, and they do not seek impossibility results. Jardine and Sibson are concerned with hierarchical clustering, where one constructs a tree of nested clusters. They show that a hierarchical version of single-linkage is the unique function consistent with a collection of properties; however, this is primarily a consequence of the fact that one of their properties is an implicit optimization criterion that is uniquely optimized by single-linkage. Puzicha et al.
consider properties of cost functions on partitions; these implicitly define clustering functions through the process of choosing a minimum-cost partition. They investigate a particular class of clustering functions that arises if one requires the cost function to decompose into a certain additive form. Recently, Kalai, Papadimitriou, Vempala, and Vetta have also investigated an axiomatic framework for clustering [8]; like the approach of Jardine and Sibson [7], and in contrast to our work here, they formulate a collection of properties that are sufficient to uniquely specify a particular clustering function.\n\nAxiomatic approaches have also been applied in areas related to clustering, particularly in collaborative filtering, which harnesses similarities among users to make recommendations, and in discrete location theory, which focuses on the placement of \"central\" facilities among distributed collections of individuals. For collaborative filtering, Pennock et al. [11] show how results from social choice theory, including versions of Arrow\u2019s Impossibility Theorem [2], can be applied to characterize recommendation systems satisfying collections of simple properties. In discrete location theory, Hansen and Roberts [5] prove an impossibility result for choosing a central facility to serve a set of demands on a graph; essentially, given a certain collection of required properties, they show that any function that specifies the resulting facility must be highly sensitive to small changes in the input.\n\n2 The Impossibility Theorem\n\nA clustering function operates on a set S of n ≥ 2 points and the pairwise distances among them. Since we wish to deal with point sets that do not necessarily belong to an ambient space, we identify the points with the set S = {1, 2, ..., n}. We then define a distance function to be any function d : S × S →
R such that for distinct i, j ∈ S, we have d(i, j) ≥ 0, d(i, j) = 0 if and only if i = j, and d(i, j) = d(j, i). One can optionally restrict attention to distance functions that are metrics by imposing the triangle inequality: d(i, k) ≤ d(i, j) + d(j, k) for all i, j, k ∈ S. We will not require the triangle inequality in the discussion here, but the results to follow, both negative and positive, still hold if one does require it.\n\nA clustering function is a function f that takes a distance function d on S and returns a partition Γ of S. The sets in Γ will be called its clusters. We note that, as written, a clustering function is defined only on point sets of a particular size (n); however, all the specific clustering functions we consider here will be defined for all values of n larger than some small base value.\n\nHere is a first property one could require of a clustering function. If d is a distance function, we write α·d to denote the distance function in which the distance between i and j is α·d(i, j).\n\nScale-Invariance. For any distance function d and any α > 0, we have f(d) = f(α·d).\n\nThis is simply the requirement that the clustering function not be sensitive to changes in the units of distance measurement: it should not have a built-in \"length scale.\" A second property is that the output of the clustering function should be \"rich\": every partition of S is a possible output. To state this more compactly, let Range(f) denote the set of all partitions Γ such that f(d) = Γ for some distance function d.\n\nRichness. Range(f) is equal to the set of all partitions of S.\n\nIn other words, suppose we are given the names of the points only (i.e. the indices in S) but not the distances between them.
Richness requires that for any desired partition Γ, it should be possible to construct a distance function d on S for which f(d) = Γ.\n\nFinally, we discuss a Consistency property that is more subtle than the first two. We think of a clustering function as being \"consistent\" if it exhibits the following behavior: when we shrink distances between points inside a cluster and expand distances between points in different clusters, we get the same result. To make this precise, we introduce the following definition. Let Γ be a partition of S, and d and d' two distance functions on S. We say that d' is a Γ-transformation of d if (a) for all i, j ∈ S belonging to the same cluster of Γ, we have d'(i, j) ≤ d(i, j); and (b) for all i, j ∈ S belonging to different clusters of Γ, we have d'(i, j) ≥ d(i, j).\n\nConsistency. Let d and d' be two distance functions. If f(d) = Γ, and d' is a Γ-transformation of d, then f(d') = Γ.\n\nIn other words, suppose that the clustering Γ arises from the distance function d. If we now produce d' by reducing distances within the clusters and enlarging distances between the clusters, then the same clustering Γ should arise from d'.\n\nWe can now state the impossibility theorem very simply.\n\nTheorem 2.1 For each n ≥ 2, there is no clustering function f that satisfies Scale-Invariance, Richness, and Consistency.\n\nWe will prove Theorem 2.1 in the next section, as a consequence of a more general statement. Before doing this, we reflect on the relation of these properties to one another by showing that there exist natural clustering functions satisfying any two of the three properties.\n\nTo do this, we describe the single-linkage procedure (see e.g. [6]), which in fact defines a family of clustering functions.
Intuitively, single-linkage operates by initializing each point as its own cluster, and then repeatedly merging the pair of clusters whose distance to one another (as measured from their closest points of approach) is minimum. More concretely, single-linkage constructs a weighted complete graph Gd whose node set is S and for which the weight on edge (i, j) is d(i, j). It then orders the edges of Gd by non-decreasing weight (breaking ties lexicographically), and adds edges one at a time until a specified stopping condition is reached. Let Hd denote the subgraph consisting of all edges that are added before the stopping condition is reached; the connected components of Hd are the clusters.\n\nThus, by choosing a stopping condition for the single-linkage procedure, one obtains a clustering function, which maps the input distance function to the set of connected components that results at the end of the procedure. We now show that for any two of the three properties in Theorem 2.1, one can choose a single-linkage stopping condition so that the resulting clustering function satisfies these two properties. Here are the three types of stopping conditions we will consider.\n\n• k-cluster stopping condition. Stop adding edges when the subgraph first consists of k connected components. (We will only consider this condition to be well-defined when the number of points is at least k.)\n\n• distance-r stopping condition. Only add edges of weight at most r.\n\n• scale-α stopping condition. Let ρ* denote the maximum pairwise distance; i.e. ρ* = max_{i,j} d(i, j). Only add edges of weight at most α·ρ*.\n\nIt is clear that these various stopping conditions qualitatively trade off certain of the properties in Theorem 2.1.
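The single-linkage procedure and the three stopping conditions just listed translate directly into code. The following is an illustrative sketch of my own (not from the paper), assuming the distance function is given as a dict mapping index pairs (i, j), i < j, to positive weights; it tracks connected components with union-find.

```python
# Illustrative sketch (not from the paper) of single-linkage clustering.
# d maps pairs (i, j) with i < j to positive distances; points are 0..n-1.

def single_linkage(n, d, stop):
    # Union-find over the points; `components` counts current clusters.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    components = n
    # Order edges by non-decreasing weight, breaking ties lexicographically.
    for (i, j), w in sorted(d.items(), key=lambda e: (e[1], e[0])):
        if stop(components, w):  # stopping condition sees state + next edge weight
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), set()).add(i)
    return sorted(sorted(c) for c in clusters.values())

# The three stopping conditions from the text:
def k_cluster(k):            # stop once only k components remain
    return lambda comps, w: comps <= k

def distance_r(r):           # only add edges of weight at most r
    return lambda comps, w: w > r

def scale_alpha(alpha, d):   # only add edges of weight at most alpha * (max distance)
    rho_star = max(d.values())
    return lambda comps, w: w > alpha * rho_star
```

For instance, on four points forming two tight pairs, all three conditions (with k = 2, with r between the two weight scales, or with α = 1/2) recover the same two-cluster partition; the function names and the dict representation here are my own choices, not notation from the paper.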
Thus, for example, the k-cluster stopping condition does not attempt to produce all possible partitions, while the distance-r stopping condition builds in a fundamental length scale, and hence is not scale-invariant.\n\nHowever, by the appropriate choice of one of these stopping conditions, one can achieve any two of the three properties in Theorem 2.1.\n\nTheorem 2.2 (a) For any k ≥ 1, and any n ≥ k, single-linkage with the k-cluster stopping condition satisfies Scale-Invariance and Consistency.\n\n(b) For any positive α < 1, and any n ≥ 3, single-linkage with the scale-α stopping condition satisfies Scale-Invariance and Richness.\n\n(c) For any r > 0, and any n ≥ 2, single-linkage with the distance-r stopping condition satisfies Richness and Consistency.\n\n3 Antichains of Partitions\n\nWe now state and prove a strengthening of the impossibility result. We say that a partition Γ' is a refinement of a partition Γ if for every set C' ∈ Γ', there is a set C ∈ Γ such that C' ⊆ C. We define a partial order on the set of all partitions by writing Γ' ⪯ Γ if Γ' is a refinement of Γ. Following the terminology of partially ordered sets, we say that a collection of partitions is an antichain if it does not contain two distinct partitions such that one is a refinement of the other.\n\nFor a set of n ≥ 2 points, the collection of all partitions does not form an antichain; thus, Theorem 2.1 follows from\n\nTheorem 3.1 If a clustering function f satisfies Scale-Invariance and Consistency, then Range(f) is an antichain.\n\nProof.
For a partition Γ, we say that a distance function d (a, b)-conforms to Γ if, for all pairs of points i, j that belong to the same cluster of Γ, we have d(i, j) ≤ a, while for all pairs of points i, j that belong to different clusters, we have d(i, j) ≥ b. With respect to a given clustering function f, we say that a pair of positive real numbers (a, b) is Γ-forcing if, for all distance functions d that (a, b)-conform to Γ, we have f(d) = Γ.\n\nLet f be a clustering function that satisfies Consistency. We claim that for any partition Γ ∈ Range(f), there exist positive real numbers a < b such that the pair (a, b) is Γ-forcing. To see this, we first note that since Γ ∈ Range(f), there exists a distance function d such that f(d) = Γ. Now, let a' be the minimum distance among pairs of points in the same cluster of Γ, and let b' be the maximum distance among pairs of points that do not belong to the same cluster of Γ. Choose numbers a < b so that a ≤ a' and b ≥ b'. Clearly any distance function d' that (a, b)-conforms to Γ must be a Γ-transformation of d, and so by the Consistency property, f(d') = Γ. It follows that the pair (a, b) is Γ-forcing.\n\nNow suppose further that the clustering function f satisfies Scale-Invariance, and that there exist distinct partitions Γ0, Γ1 ∈ Range(f) such that Γ0 is a refinement of Γ1. We show how this leads to a contradiction.\n\nLet (a0, b0) be a Γ0-forcing pair, and let (a1, b1) be a Γ1-forcing pair, where a0 < b0 and a1 < b1; the existence of such pairs follows from our claim above. Let a2 be any number less than or equal to a1, and choose ε so that 0 < ε < a0·a2/b0.
It is now straightforward to construct a distance function d with the following properties: for pairs of points i, j that belong to the same cluster of Γ0, we have d(i, j) ≤ ε; for pairs i, j that belong to the same cluster of Γ1 but not to the same cluster of Γ0, we have a2 ≤ d(i, j) ≤ a1; and for pairs i, j that do not belong to the same cluster of Γ1, we have d(i, j) ≥ b1.\n\nThe distance function d (a1, b1)-conforms to Γ1, and so we have f(d) = Γ1. Now set α = b0/a2, and define d' = α·d. By Scale-Invariance, we must have f(d') = f(d) = Γ1. But for points i, j in the same cluster of Γ0 we have d'(i, j) ≤ ε·b0/a2 < a0, while for points i, j that do not belong to the same cluster of Γ0 we have d'(i, j) ≥ a2·b0/a2 ≥ b0. Thus d' (a0, b0)-conforms to Γ0, and so we must have f(d') = Γ0. As Γ0 ≠ Γ1, this is a contradiction.\n\nThe proof above uses our assumption that the clustering function f is defined on the set of all distance functions on n points. However, essentially the same proof yields a corresponding impossibility result for clustering functions f that are defined only on metrics, or only on distance functions arising from n points in a Euclidean space of some dimension. To adapt the proof, one need only be careful to choose the constant a2 and distance function d to satisfy the required properties.\n\nWe now prove a complementary positive result; together with Theorem 3.1, this completely characterizes the possible values of Range(f) for clustering functions f that satisfy Scale-Invariance and Consistency.\n\nTheorem 3.2 For every antichain of partitions A, there is a clustering function f satisfying Scale-Invariance and Consistency for which Range(f) = A.\n\nProof.
Given an arbitrary antichain A, it is not clear how to produce a stopping condition for the single-linkage procedure that gives rise to a clustering function f with Range(f) = A. (Note that the k-cluster stopping condition yields a clustering function whose range is the antichain consisting of all partitions into k sets.) Thus, to prove this result, we use a variant of the sum-of-pairs clustering function (see e.g. [3]), adapted to general antichains. We focus on the case in which |A| > 1, since the case of |A| = 1 is trivial.\n\nFor a partition Γ ∈ A, we write (i, j) ∼ Γ if both i and j belong to the same cluster in Γ. The A-sum-of-pairs function f seeks the partition Γ ∈ A that minimizes the sum of all distances between pairs of points in the same cluster; in other words, it seeks the Γ ∈ A minimizing the objective function Φ_d(Γ) = Σ_{(i,j)∼Γ} d(i, j). (Ties are broken lexicographically.) It is crucial that the minimization is only over partitions in A; clearly, if we wished to minimize this objective function over all partitions, we would choose the partition in which each point forms its own cluster.\n\nIt is clear that f satisfies Scale-Invariance, since Φ_{α·d}(Γ) = α·Φ_d(Γ) for any partition Γ. By definition we have Range(f) ⊆ A, and we argue that Range(f) ⊇ A as follows. For any partition Γ ∈ A, construct a distance function d with the following properties: d(i, j) < n^{-3} for every pair of points i, j belonging to the same cluster of Γ, and d(i, j) ≥ 1 for every pair of points i, j belonging to different clusters of Γ. We have Φ_d(Γ) < 1; and moreover Φ_d(Γ') < 1 only for partitions Γ' that are refinements of Γ.
Since A is an antichain, it follows that Γ must minimize Φ_d over all partitions in A, and hence f(d) = Γ.\n\nIt remains only to verify Consistency. Suppose that for the distance function d, we have f(d) = Γ; and let d' be a Γ-transformation of d. For any partition Γ', let Δ(Γ') = Φ_d(Γ') - Φ_{d'}(Γ'). It is enough to show that for any partition Γ' ∈ A, we have Δ(Γ) ≥ Δ(Γ'). But this follows simply because Δ(Γ) = Σ_{(i,j)∼Γ} [d(i, j) - d'(i, j)], while\n\nΔ(Γ') = Σ_{(i,j)∼Γ'} [d(i, j) - d'(i, j)] ≤ Σ_{(i,j)∼Γ' and (i,j)∼Γ} [d(i, j) - d'(i, j)] ≤ Δ(Γ),\n\nwhere both inequalities follow because d' is a Γ-transformation of d: first, only terms corresponding to pairs in the same cluster of Γ are non-negative; and second, every term corresponding to a pair in the same cluster of Γ is non-negative.\n\n4 Centroid-Based Clustering and Consistency\n\nIn a widely-used approach to clustering, one selects k of the input points as centroids, and then defines clusters by assigning each point in S to its nearest centroid. The goal, intuitively, is to choose the centroids so that each point in S is close to at least one of them. This overall approach arises both from combinatorial optimization perspectives, where it has roots in facility location problems [9], and in maximum-likelihood methods, where the centroids may represent centers of probability density functions [4, 6]. We show here that for a fairly general class of centroid-based clustering functions, including k-means and k-median, none of the functions in the class satisfies the Consistency property.
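The A-sum-of-pairs function used in the proof of Theorem 3.2 is simple to state computationally. The sketch below is my own illustration (not code from the paper): it assumes the antichain is supplied explicitly as a list of partitions, each a list of sets, and takes the first minimizer in list order as a stand-in for the paper's lexicographic tie-breaking.

```python
# Illustrative sketch (not from the paper) of the A-sum-of-pairs function.
# d maps pairs (i, j) with i < j to distances; a partition is a list of sets.

def phi(d, partition):
    # Phi_d(P): sum of d(i, j) over pairs i, j in the same cluster of P.
    total = 0.0
    for cluster in partition:
        pts = sorted(cluster)
        for a in range(len(pts)):
            for b in range(a + 1, len(pts)):
                total += d[(pts[a], pts[b])]
    return total

def a_sum_of_pairs(d, antichain):
    # Minimize Phi_d over the fixed antichain only, never over all partitions
    # (over all partitions the all-singletons partition would always win).
    return min(antichain, key=lambda p: phi(d, p))
```

On four points forming two tight pairs, with the antichain of all partitions of {0, 1, 2, 3} into two sets, the minimizer is {{0, 1}, {2, 3}}; the function names here are my own, not the paper's notation.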
This suggests an interesting tension between Consistency and the centroid-based approach to clustering, and forms a contrast with the results for single-linkage and sum-of-pairs in previous sections.\n\nSpecifically, for any natural number k ≥ 2, and any continuous, non-decreasing, and unbounded function g : R+ → R+, we define the (k, g)-centroid clustering function as follows. First, we choose the set of k \"centroid\" points T ⊆ S for which the objective function Λ^g_d(T) = Σ_{i∈S} g(d(i, T)) is minimized. (Here d(i, T) = min_{j∈T} d(i, j).) Then we define a partition of S into k clusters by assigning each point to the element of T closest to it. The k-median function [9] is obtained by setting g to be the identity function, while the objective function underlying k-means clustering [4, 6] is obtained by setting g(d) = d^2.\n\nTheorem 4.1 For every k ≥ 2 and every function g chosen as above, and for n sufficiently large relative to k, the (k, g)-centroid clustering function does not satisfy the Consistency property.\n\nProof Sketch. We describe the proof for k = 2 clusters; the case of k > 2 is similar. We consider a set of points S that is divided into two subsets: a set X consisting of m points, and a set Y consisting of γm points, for a small number γ > 0. The distance between points in X is r, the distance between points in Y is ε < r, and the distance from a point in X to a point in Y is r + δ, for a small number δ > 0. By choosing γ, r, ε, and δ appropriately, the optimal choice of k = 2 centroids will consist of one point from X and one from Y, and the resulting partition Γ will have clusters X and Y. Now, suppose we divide X into sets X0 and X1 of equal size, and reduce the distances between points in the same Xi to be r' < r (keeping all other distances the same).
This can be done, for r' small enough, so that the optimal choice of two centroids will now consist of one point from each Xi, yielding a different partition of S. As our second distance function is a Γ-transformation of the first, this violates Consistency.\n\n5 Relaxing the Properties\n\nIn addition to looking for clustering functions that satisfy subsets of the basic properties, we can also study the effect of relaxing the properties themselves. Theorem 3.2 is a step in this direction, showing that the sum-of-pairs function satisfies Scale-Invariance and Consistency, together with a relaxation of the Richness property. As another example, it is interesting to note that single-linkage with the distance-r stopping condition satisfies a natural relaxation of Scale-Invariance: if α > 1, then f(α·d) is a refinement of f(d).\n\nWe now consider some relaxations of Consistency. Let f be a clustering function, and d a distance function such that f(d) = Γ. If we reduce distances within clusters and expand distances between clusters, Consistency requires that f output the same partition Γ. But one could imagine requiring something less: perhaps changing distances this way should be allowed to create additional sub-structure, leading to a new partition in which each cluster is a subset of one of the original clusters. Thus, we can define Refinement-Consistency, a relaxation of Consistency, to require that if d' is an f(d)-transformation of d, then f(d') should be a refinement of f(d).\n\nWe can show that the natural analogue of Theorem 2.1 still holds: there is no clustering function that satisfies Scale-Invariance, Richness, and Refinement-Consistency. However, there is a crucial sense in which this result \"just barely\" holds, rendering it arguably less interesting to us here.
Specifically, let Γ*_n denote the partition of S = {1, 2, ..., n} in which each individual element forms its own cluster. Then there exist clustering functions f that satisfy Scale-Invariance and Refinement-Consistency, and for which Range(f) consists of all partitions except Γ*_n. (One example is single-linkage with the distance-(αδ) stopping condition, where δ = min_{i,j} d(i, j) is the minimum inter-point distance, and α ≥ 1.) Such functions f, in addition to Scale-Invariance and Refinement-Consistency, thus satisfy a kind of Near-Richness property: one can obtain every partition as output except for a single, trivial partition. It is in this sense that our impossibility result for Refinement-Consistency, unlike Theorem 2.1, is quite \"brittle.\"\n\nTo relax Consistency even further, we could say simply that if d' is an f(d)-transformation of d, then one of f(d) or f(d') should be a refinement of the other. In other words, f(d') may be either a refinement or a \"coarsening\" of f(d). It is possible to construct clustering functions f that satisfy this even weaker variant of Consistency, together with Scale-Invariance and Richness.\n\nAcknowledgements. I thank Shai Ben-David, John Hopcroft, and Lillian Lee for valuable discussions on this topic. This research was supported in part by a David and Lucile Packard Foundation Fellowship, an ONR Young Investigator Award, an NSF Faculty Early Career Development Award, and NSF ITR Grant IIS-0081334.\n\nReferences\n\n[1] M. Anderberg, Cluster Analysis for Applications, Academic Press, 1973.\n\n[2] K. Arrow, Social Choice and Individual Values, Wiley, New York, 1951.\n\n[3] M. Bern, D. Eppstein, \"Approximation algorithms for geometric problems,\" in Approximation Algorithms for NP-Hard Problems (D. Hochbaum, Ed.), PWS Publishing, 1996.\n\n[4] R. Duda, P. Hart, D.
Stork, Pattern Classification (2nd edition), Wiley, 2001.\n\n[5] P. Hansen, F. Roberts, \"An impossibility result in axiomatic location theory,\" Mathematics of Operations Research 21 (1996).\n\n[6] A. Jain, R. Dubes, Algorithms for Clustering Data, Prentice-Hall, 1981.\n\n[7] N. Jardine, R. Sibson, Mathematical Taxonomy, Wiley, 1971.\n\n[8] A. Kalai, C. Papadimitriou, S. Vempala, A. Vetta, personal communication, June 2002.\n\n[9] P. Mirchandani, R. Francis, Discrete Location Theory, Wiley, 1990.\n\n[10] M. Osborne, A. Rubinstein, A Course in Game Theory, MIT Press, 1994.\n\n[11] D. Pennock, E. Horvitz, C.L. Giles, \"Social choice theory and recommender systems: Analysis of the axiomatic foundations of collaborative filtering,\" Proc. 17th AAAI, 2000.\n\n[12] J. Puzicha, T. Hofmann, J. Buhmann, \"A Theory of Proximity Based Clustering: Structure Detection by Optimization,\" Pattern Recognition, 33 (2000).\n", "award": [], "sourceid": 2340, "authors": [{"given_name": "Jon", "family_name": "Kleinberg", "institution": null}]}