Robust k-means: a Theoretical Revisit

Advances in Neural Information Processing Systems, pages 2891–2899

Alexandros Georgogiannis
School of Electrical and Computer Engineering
Technical University of Crete, Greece
alexandrosgeorgogiannis at gmail.com

Abstract

Over recent years, many variations of the quadratic k-means clustering procedure have been proposed, all aiming to robustify the performance of the algorithm in the presence of outliers. In general terms, two main approaches have been developed: one based on penalized regularization methods, and one based on trimming functions. In this work, we present a theoretical analysis of the robustness and consistency properties of a variant of the classical quadratic k-means algorithm, the robust k-means, which borrows ideas from outlier detection in regression.
We show that two outliers in a dataset are enough to break down this clustering procedure. However, if we focus on “well-structured” datasets, then robust k-means can recover the underlying cluster structure in spite of the outliers. Finally, we show that, with slight modifications, the most general non-asymptotic results for consistency of quadratic k-means remain valid for this robust variant.

1 Introduction

Let φ : R → R₊ be a lower semi-continuous (lsc) and symmetric function with minimum value φ(0). Given a set of points X^n = {x_1, …, x_n} ⊂ R^p, consider the generalized k-means problem (GKM) [7]:

    min_{c_1,…,c_k} R_n(c_1, …, c_k) = Σ_{i=1}^n min_{1≤l≤k} φ(‖x_i − c_l‖_2)    (GKM)
    subject to c_l ∈ R^p, l ∈ {1, …, k}.

Our aim is to find a set of k centers {c_1, …, c_k} that minimize the clustering risk R_n. These centers define a partition of X^n into k clusters A = {A_1, …, A_k}, defined as

    A_l = {x ∈ X^n : l = argmin_{1≤j≤k} φ(‖x − c_j‖_2)},    (1)

where ties are broken randomly. Varying φ beyond the usual quadratic function (φ(t) = t²), we expect to gain some robustness against outliers [9]. When φ is upper bounded by δ, the clusters are defined as follows. For l ≤ k, let

    A_l = {x ∈ X^n : l = argmin_{1≤j≤k} φ(‖x − c_j‖_2) and φ(‖x − c_l‖_2) ≤ δ},    (2)

and define the extra cluster

    A_{k+1} = {x ∈ X^n : min_{1≤j≤k} φ(‖x − c_j‖_2) > δ}.    (3)

This extra cluster contains points whose distance from their closest center, when measured according to φ(‖x − c_l‖_2), is larger than δ and, as will become clear later, it represents the set of outliers. From now on, given a set of centers {c_1, …, c_k}, we write just A = {A_1, …
, A_k} and implicitly mean A ∪ A_{k+1} when φ is bounded.¹

¹ For a similar definition of the set of clusters induced by a bounded φ, see also Section 4 in [2].

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Now, consider the following instance of (GKM), for the same set of points X^n:

    min_{c_1,…,c_k} R′_n(c_1, …, c_k) = Σ_{i=1}^n min_{1≤l≤k} [ min_{o_i} (1/2)‖x_i − c_l − o_i‖_2^2 + f_λ(‖o_i‖_2) ]    (RKM)
    subject to c_l ∈ R^p, l = 1, …, k, and o_i ∈ R^p, i = 1, …, n,

where the inner minimum over o_i plays the role of φ(‖x_i − c_l‖_2), and f_λ : R → R₊ is a symmetric, lsc, proper² and bounded from below function, with minimum value f_λ(0), and λ a non-negative parameter. This problem is called robust k-means (RKM) and, as we show later, it takes the form of (GKM) when φ equals the Moreau envelope of f_λ. The problem (RKM) [5, 24] describes the following simple model: we allow each observation x_i to take on an “error” term o_i and we penalize the errors, using a group penalty, in order to encourage most of the observations’ errors to be equal to zero. We consider functions f_λ where the parameter λ ≥ 0 has the following effect: for λ = 0, all o_i’s may become arbitrarily large (all observations are outliers), while, for λ → ∞, all o_i’s become zero (no outliers); non-trivial cases occur for intermediate values 0 < λ < ∞. Our interest is in understanding the robustness and consistency properties of (RKM).

Robustness: Although robustness is an important notion, it has not been given a standard technical definition in the literature. Here, we focus on the finite sample breakdown point [18], which counts how many outliers a dataset may contain without causing significant damage in the estimates of the centers.
Such damage is reflected in an arbitrarily large magnitude of at least one center. In Section 3, we show that two outliers in a dataset are enough to break down some centers. On the other hand, if we restrict our focus to some “well-structured” datasets, then (RKM) has some remarkable robustness properties even if there is a considerable amount of contamination.

Consistency: Much is known about the consistency of (GKM) when the function φ is lsc and increasing [11, 15]. It turns out that this case also includes the case of (RKM) when f_λ is convex (see Section 3.1 for details). In Section 4, we show that the known non-asymptotic results about consistency of quadratic k-means may remain valid even when f_λ is non-convex.

2 Preliminaries and some technical remarks

We start our analysis with a few technical tools from variational analysis [19]. Here, we introduce the necessary notation and a lemma (the proofs are in the appendix). The Moreau envelope e^μ_f(x) with parameter μ > 0 (Definition 1.22 in [19]) of an lsc, proper, and bounded from below function f : R^p → R and the associated (possibly multivalued) proximal map P^μ_f : R^p ⇉ R^p are

    e^μ_f(x) = min_{z∈R^p} (1/2μ)‖x − z‖_2^2 + f(z)  and  P^μ_f(x) = argmin_{z∈R^p} (1/2μ)‖x − z‖_2^2 + f(z),    (4)

respectively. In order to simplify the notation, in the following we fix μ to 1 and suppress the superscript. The Moreau envelope is a continuous approximation from below of f having the same set of minimizers, while the proximal map gives the (possibly non-unique) minimizing arguments in (4). For (GKM), we define Φ : R^p → R as Φ(x) := φ(‖x‖_2). Accordingly, for (RKM), we define F_λ : R^p → R as F_λ(x) := f_λ(‖x‖_2).
Thus, we obtain the following pairs:

    e_{f_λ}(x) := min_{o∈R} (1/2)(x − o)^2 + f_λ(o),  P_{f_λ}(x) := argmin_{o∈R} (1/2)(x − o)^2 + f_λ(o),  x ∈ R,    (5a)
    e_{F_λ}(x) := min_{o∈R^p} (1/2)‖x − o‖_2^2 + F_λ(o),  P_{F_λ}(x) := argmin_{o∈R^p} (1/2)‖x − o‖_2^2 + F_λ(o),  x ∈ R^p.    (5b)

Obviously, (RKM) is equivalent to (GKM) when Φ(x) = e_{F_λ}(x). Every map P : R ⇉ R throughout the text is assumed to be i) odd, i.e., P(−x) = −P(x), ii) compact-valued, iii) non-decreasing, and iv) to have a closed graph. We know that for any such map there exists at least one function f_λ such that P = P_{f_λ} (Proposition 3 in [26]).³ Finally, for our purposes (outlier detection), it is natural to require that v) P is a shrinkage rule, i.e., P(x) ≤ x, ∀x ≥ 0. The following corollary is quite straightforward and useful in the sequel.

² We call f proper if f(x) < ∞ for at least one x ∈ R^n, and f(x) > −∞ for all x ∈ R^n; in words, if the domain of f is a nonempty set on which f is finite (see page 5 in [19]).

³ Accordingly, for a general function φ : R → [0, ∞) to be a Moreau envelope, i.e., φ(·) = e_{f_λ}(·) as defined in (5a) for some function f_λ, we require that φ(·) − (1/2)|·|^2 be a concave function (Proposition 1 in [26]).

Corollary 1. Using the notation in definitions (5a) and (5b), we have

    P_{F_λ}(x) = (x / ‖x‖_2) P_{f_λ}(‖x‖_2)  and  e_{F_λ}(x) = e_{f_λ}(‖x‖_2).    (6)

Passing from a model of minimization in terms of a single problem, like (GKM), to a model in which a problem is expressed in a particular parametric form, like (RKM) with the Moreau envelope, the description of optimality conditions is opened to the incorporation of the multivalued map P_{F_λ}.
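As a concrete sanity check of definition (4) (an illustrative Python sketch under our own choices, not code from the paper): for f_λ(z) = λ|z|, the proximal map should be the soft-thresholding rule and the Moreau envelope the Huber loss, facts used later in Section 3.1. A brute-force minimization over a fine grid recovers both.

```python
import numpy as np

def moreau_envelope(f, x, grid):
    """Brute-force e_f(x) and P_f(x) (mu = 1): minimize 0.5*(x - z)^2 + f(z) over a grid."""
    vals = 0.5 * (x - grid) ** 2 + f(grid)
    j = int(np.argmin(vals))
    return float(vals[j]), float(grid[j])

lam = 1.0
f = lambda z: lam * np.abs(z)            # f_lambda(z) = lambda * |z|
grid = np.linspace(-10.0, 10.0, 200001)  # grid spacing 1e-4

# At x = 2 (> lam): the Huber value is lam*|x| - lam^2/2 = 1.5 and the
# soft-thresholding prox is x - lam = 1.
e0, p0 = moreau_envelope(f, 2.0, grid)
```

The same grid search can be repeated for any candidate f_λ, which is a cheap way to visualize which φ in (GKM) a given penalty in (RKM) induces.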
The next lemma describes the necessary conditions for a center c_l to be (locally) optimal for (RKM). Since we deal with the general case, well known results, such as smoothness of the Moreau envelope or convexity of its subgradients, can no longer be taken for granted.

Remark 1. Let Φ(·) = e_{F_λ}(·). The usual subgradient, denoted as ∂̂Φ(x), is not sufficient to characterize the differentiability properties of R′_n in (RKM). Instead, we use the (generalized) subdifferential ∂Φ(x) (Definition 8.3 in [19]). For all x, we have ∂̂Φ(x) ⊆ ∂Φ(x). Usually, the previous two sets coincide at a point x; in this case, Φ is called regular at x. However, it is common in practice that the sets ∂̂Φ(x) and ∂Φ(x) are different (for a detailed exposition on subgradients see Chapter 8 in [19]; see also Example 1 in Appendix A.9).

Lemma 1. Let P_{F_λ} : R^p ⇉ R^p be a proximal map and set Φ(·) = e_{F_λ}(·). The necessary (generalized) first order conditions for the centers {c_1, …, c_k} ⊂ R^p to be optimal for (RKM) are

    0 ∈ ∂( Σ_{i∈A_l} Φ(x_i − c_l) ) ⊆ Σ_{i∈A_l} ∂Φ(x_i − c_l) ⊆ Σ_{i∈A_l} (c_l − x_i + P_{F_λ}(x_i − c_l)),  l ∈ {1, …, k}.    (7)

The interpretation of the set inclusions above is the following: for any center c_l ∈ R^p, every subgradient vector in ∂Φ(x_i − c_l) must be a vector associated with a vector in P_{F_λ}(x_i − c_l) (Theorem 10.13 in [19]). However, in general, the converse does not hold true.
We note that when the proximal map is single-valued and continuous, which happens, for example, not only when f_λ is convex but also for many popular non-convex penalties, both set inclusions become equalities and the converse holds, i.e., every vector in P_{F_λ}(x_i − c_l) is a vector associated with a subgradient in ∂Φ(x_i − c_l) (Theorem 10.13 in [19] and Proposition 7 in [26]).

We close this section with some useful remarks on the properties of the Moreau envelope as a map between two spaces of functions. There exist cases where two different functions, f_λ ≠ f′_λ, have equal Moreau envelopes, e_{f_λ} = e_{f′_λ} (Proposition 1 in [26]), implying that two different forms of (RKM) correspond to the same φ in (GKM). For example, the proximal hull of f_λ, defined as h^μ_{f_λ}(x) := −e^μ_{−e^μ_{f_λ}}(x), is a function different from f_λ but has the same Moreau envelope as f_λ (see also Example 1.44 in [19], Proposition 2 and Example 3 in [26]). This is the main reason we prefer the proximal map instead of the penalty function point of view for the analysis of (RKM).

3 On the breakdown properties of robust k-means

In this section, we study the finite sample breakdown point of (RKM) and, more specifically, its universal breakdown point. Loosely speaking, the breakdown point measures the minimum fraction of outliers that can cause excessive damage in the estimates of the centers. Here, it will become clear how the interplay between the two forms, (GKM) and (RKM), helps the analysis. Given a dataset X^n = {x_1, …, x_n} and a nonnegative integer m ≤ n, we say that X^n_m is an m-modification of X^n if it arises from X^n after replacing m of its elements by arbitrary elements x′_i ∈ R^p [6].
Denote as r(λ) the number of non-outlier samples, as counted after solving (RKM), for a dataset X^n and some λ ≥ 0, i.e.,⁴

    r(λ) := |{x_i ∈ X^n : ‖o_i‖_2 = 0, i = 1, …, n}|.    (8)

Then, the number of estimated outliers is q(λ) = n − r(λ). In order to simplify notation, we drop the dependence of r and q on λ. With this notation, we proceed to the following definition.

⁴ More than one λ can yield the same r, but this does not affect our analysis.

Definition 1 (universal breakdown point for the centers [6]). Let n, r, k be such that n ≥ r ≥ k + 1. Given a dataset X^n_m in R^p, let {c_1, …, c_k} denote the (global) optimal set of centers for (RKM). The universal breakdown value of (RKM) is

    β(n, r, k) := min_{X^n} min_{1≤m≤n} { m/n : sup_{X^n_m} max_{1≤l≤k} ‖c_l‖_2 = ∞ }.    (9)

Here, X^n = {x_1, …, x_n} ⊂ R^p, while X^n_m ⊂ R^p runs over all m-modifications of X^n. According to the concept of the universal breakdown point, (RKM) breaks down at the first integer m for which there exists a set X^n such that the estimates of the cluster centers become arbitrarily bad for a suitable modification X^n_m. Our analysis is based on P_{f_λ} and considers two cases: those of biased and unbiased proximal maps. The former corresponds to the class of convex functions f_λ, while the latter corresponds to a class of non-convex f_λ.

3.1 Biased proximal maps: the case of convex f_λ

If f_λ is convex, then Φ = e_{F_λ} is also convex, while P_{F_λ} is continuous, single-valued, and satisfies [19]

    ‖x − P_{F_λ}(x)‖_2 → ∞ as ‖x‖_2 → ∞.    (10)

Proximal maps with this property are called biased since, as the l2-norm of x increases, so does the norm of the difference in (10).
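A small numerical sketch (assumed Python, for illustration only; not from the paper) of the bias phenomenon: for the quadratic penalty f(z) = (λ/2)z², the difference x − P_f(x) grows without bound, in the spirit of (10), whereas for the l1-norm it saturates at λ, which is its limiting behavior within the convex class.

```python
def prox_quadratic(x, lam):
    """Prox of f(z) = (lam / 2) * z^2: multiplicative shrinkage x / (1 + lam)."""
    return x / (1.0 + lam)

def prox_l1(x, lam):
    """Prox of f(z) = lam * |z|: the soft-thresholding rule."""
    s = 1.0 if x >= 0 else -1.0
    return s * max(abs(x) - lam, 0.0)

lam = 1.0
# Bias ||x - P(x)|| along a sequence of increasingly remote points x.
bias_quad = [x - prox_quadratic(x, lam) for x in (10.0, 100.0, 1000.0)]
bias_l1 = [x - prox_l1(x, lam) for x in (10.0, 100.0, 1000.0)]
```

Here `bias_quad` keeps growing with x, while `bias_l1` stays pinned at λ; both choices of penalty are stand-ins for the convex class discussed in the text.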
In this case, for each x_i ∈ A_l, from Lemma 1 and expression (10), we have

    ‖∇Φ(x_i − c_l)‖_2 = ‖∇e_{F_λ}(x_i − c_l)‖_2 = ‖c_l − x_i + P_{F_λ}(x_i − c_l)‖_2 → ∞ as ‖x_i − c_l‖_2 → ∞.    (11)

The supremum value of ‖∇Φ(x − c_l)‖_2 is closely related to the gross error sensitivity of an estimator [9]. It is interpreted as the worst possible influence which a sample x can have on c_l [7]. In view of (11) and the definition of the clusters in (1), (RKM) is extremely sensitive. Although it can detect an outlier, i.e., a sample x_i with a nonzero estimate for ‖o_i‖_2, it does not reject it, since the influence of x_i on its closest center never vanishes.⁵ The l1-norm, f_λ(x) = λ|x|, whose Moreau envelope equals the Huber loss function [24], is the limiting case for the class of convex penalty functions: although it keeps the difference ‖x − P_{F_λ}(x)‖_2 in (10) constant and equal to λ, it introduces a bias term proportional to λ in the estimate c_l. The following proposition shows that (RKM) with a biased P_{F_λ} has breakdown point equal to 1/n, i.e., one outlier suffices to break down a center.

⁵ See the analysis in [7] about the influence function of (GKM) when φ is convex.

Proposition 1. Assume k ≥ 2 and k + 1 < r ≤ n.
Given a biased proximal map, there exist a dataset X^n and a modification X^n_1 such that (RKM) breaks down.

3.2 Unbiased proximal maps: the case of non-convex f_λ

Consider now the l0-(pseudo)norm on R, f_λ(z) := λ|z|_0 = (λ^2/2) 1_{z≠0}, and the associated hard-thresholding proximal operator P_{λ|·|_0} : R ⇉ R,

    P_{λ|·|_0}(x) = argmin_{z∈R} (1/2)(x − z)^2 + f_λ(z) = { 0 if |x| < λ;  {0, x} if |x| = λ;  x if |x| > λ }.    (12)

According to Lemma 1, for p = 1 (the scalar case), we have

    ∂Φ(x_i − c_l) ⊆ c_l − x_i + P_{λ|·|_0}(x_i − c_l) = {0} for |x_i − c_l| > λ, x_i ∈ A_l,    (13)

implying that Φ(x_i − c_l), as a function of c_l, remains constant for |x_i − c_l| > λ. As a consequence of (13), if c_l is locally optimal, then 0 ∈ ∂{ Σ_{i∈A_l} Φ(x_i − c_l) } and

    0 ∈ Σ_{i∈A_l, |x_i−c_l|<λ} (c_l − x_i) + Σ_{i∈A_l, |x_i−c_l|=λ} ( c_l − x_i + P_{λ|·|_0}(x_i − c_l) ).    (14)

Depending on the value of λ, (RKM) with the l0-norm is able to ignore samples with distance from their closest center larger than λ. This happens because P_{λ|·|_0}(x_i − c_l) = x_i − c_l whenever |x_i − c_l| > λ, and the influence of x_i vanishes. In fact, there is a whole family of non-convex f_λ’s whose proximal map P_{f_λ} satisfies

    P_{f_λ}(x) = x, for all |x| > τ,    (15)

for some τ > 0. These are called unbiased proximal maps [13, 20] and have the useful property that, as one observation is arbitrarily modified, all estimated cluster centers remain bounded by a constant that depends only on the remaining unmodified samples.
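A minimal scalar sketch (illustrative Python, not from the paper) of the hard-thresholding operator (12) and the vanishing influence in (13); the two-valued tie at |x| = λ is broken toward 0 here, one of the two admissible values.

```python
def hard_threshold(x, lam):
    """Prox of f_lambda(z) = (lam^2 / 2) * 1{z != 0}, eq. (12), scalar case.
    At |x| = lam the map is two-valued {0, x}; we break the tie toward 0."""
    return 0.0 if abs(x) <= lam else x

def influence(x, c, lam):
    """Sample x's contribution c - x + P(x - c) to the first order condition (13)."""
    return c - x + hard_threshold(x - c, lam)

lam = 2.0
near = influence(1.0, 0.0, lam)    # |x - c| < lam: behaves like quadratic k-means
far = influence(100.0, 0.0, lam)   # |x - c| > lam: the influence vanishes
```

The value `far` is exactly zero: once a sample is farther than λ from its closest center, moving it even farther has no effect on the optimality condition, which is the mechanism behind outlier rejection.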
Under certain circumstances, the proof of the following proposition reveals that, if there exists one outlier in the dataset, then robust k-means will reject it.

Proposition 2. Assume k ≥ 2 and k + 1 < r ≤ n, and consider the dataset X^n = {x_1, …, x_n} along with its modification by one replacement y, X^n_1 = {x_1, …, x_{n−1}, y}. If we solve (RKM) with X^n_1 and an unbiased proximal map satisfying (15), then all estimates for the cluster centers remain bounded by a constant that depends only on the unmodified samples of X^n.

Next, we show that, even for this class of maps, there always exists a dataset that causes one of the estimated centers to break down as two particular observations are suitably replaced.

Theorem 1 (Universal breakdown point for (RKM)). Assume k ≥ 2 and n ≥ r ≥ k + 2. Given an unbiased proximal map satisfying (15), there exist a dataset X^n and a modification X^n_2 such that (RKM) breaks down.

Hence, the universal breakdown point of (RKM) with an unbiased proximal map is 2/n. In Figure 1, we give a visual interpretation of Theorem 1. The top subfigure depicts the unmodified initial dataset X^9 = {x_1, …, x_9} (black circles) with a clear two-cluster structure; the bottom subfigure shows the modification X^9_2 (dashed line arrows). Theorem 1 states that (RKM) on X^9_2 fails to be robust since every subset of X^9_2 with r = 8 points has a cluster containing an outlier.

Figure 1: The top subfigure is the unmodified dataset X^9.
Theorem 1 states that every subset of the modification X^9_2 (bottom subfigure) with size 8 contains an outlier.

3.3 Restricted robustness of robust k-means for well-clustered data

The result of Theorem 1 is disappointing, but it is not (RKM) that is to blame for the poor performance; rather, it is the tight notion of the breakdown point itself [6, 7]: allowing any kind of contamination in a dataset is a very general assumption.

In this section, we place two restrictions: i) we consider datasets where inlier samples can be covered by unions of balls whose centers are “far apart” from each other, and ii) we ask a question different from the finite sample breakdown point. We want to exploit as much as possible the results of [2] concerning a new quantitative measure of noise robustness, which compares the output of (RKM) on a contaminated dataset to its output on the uncontaminated version of the dataset. Our aim is to show that (RKM), with a certain class of proximal maps and datasets that are well-structured, ignores the influence of outliers when grouping the inliers.

First, we introduce Corollary 2, which states the form that P_{f_λ} should have in order for the results of [2] to apply to (RKM); second, we give details about the datasets which we consider well-structured. Using this corollary, we are able to design proximal maps for which Theorems 3, 4, and 5 in [2] apply; otherwise, it is not clear how the analysis of [2] is valid for (RKM).

Let h : R → R be a continuous function with the following properties:

1. h is odd and non-decreasing (h_+(·) denotes its restriction to [0, ∞));
2. h is a shrinkage rule: 0 ≤ h_+(x) ≤ x, ∀x ∈ [0, ∞);
3.
the difference x − h_+(x) is non-decreasing, i.e., for 0 ≤ x_1 ≤ x_2 we have x_1 − h_+(x_1) ≤ x_2 − h_+(x_2).

Define the map

    P_{f_λ}(x) := { h(x) if |x| < λ;  {h(x), x} if |x| = λ;  x if |x| > λ }.    (16)

Multivaluedness of P_{f_λ} at |x| = λ signals that e_{f_λ} is non-smooth at these points. An immediate consequence for the Moreau envelope associated with the previous map is the following.

Corollary 2. Let the function g : [0, ∞) → [0, ∞) be defined as

    g(x) := ∫_0^x (u − h(u)) du, x ∈ [0, ∞).    (17)

Then, the Moreau envelope associated with P_{f_λ} in (16) is

    e_{f_λ}(x) = min{g(|x|), g(λ)} = g(min{|x|, λ}).    (18)

Next, we define what it means for a dataset to be (ρ_1, ρ_2)-balanced; this is the class of datasets which we consider to be well-structured.

Definition 2 ((ρ_1, ρ_2)-balanced dataset [2]). Assume that a set X^n ⊂ R^p has a subset I (inliers), with at least n/2 samples, and the following properties:

1. I = ∪_{l=1}^k B_l, where B_l = B(b_l, r) is a ball in R^p with bounded radius r and center b_l;
2. ρ_1|I| ≤ |B_l| ≤ ρ_2|I| for every l, where |B_l| is the number of samples in B_l and ρ_1, ρ_2 > 0;
3. ‖b_l − b_{l′}‖_2 > v for every l ≠ l′, i.e., the centers of the balls are at least v > 0 apart.

Then, X^n is a (ρ_1, ρ_2)-balanced dataset.

We now state the form that Theorem 3 in [2] takes for (RKM).

Theorem 2 (Restricted robustness of (RKM)).
If i) e_{f_λ} is as in Corollary 2, i.e., e_{f_λ}(‖x‖_2) = g(min{‖x‖_2, λ}), ii) X^n has a (ρ_1, ρ_2)-balanced subset of samples I with k balls, and iii) the centers of the balls are at least v > 4r + 2g^{−1}( ((ρ_1 + ρ_2)/ρ_1) g(r) ) apart, then, for

    λ ∈ ( v/2, g^{−1}( (|I| / |X^n∖I|) ( ρ_1 g(v/2 − 2r) − (ρ_1 + ρ_2) g(r) ) ) ),

the set of outliers X^n∖I has no effect on the grouping of the inliers I. In other words, if {x, y} ⊆ B_l and {c_1, …, c_k} are the optimal centers when solving (RKM) for a λ as described before, then

    l = argmin_{1≤j≤k} e_{f_λ}(‖x − c_j‖_2) = argmin_{1≤j≤k} e_{f_λ}(‖y − c_j‖_2).    (19)

For the sake of completeness, we give a proof of this theorem in the appendix. In a similar way, we can recast the results of Theorems 4 and 5 in [2] to be valid for (RKM).

4 On the consistency of robust k-means

Let X^n be a set of n independent and identically distributed random samples x_i from a fixed but unknown probability distribution μ. Let Ĉ be the empirical optimal set of centers, i.e.,

    Ĉ := argmin_{c_1,…,c_k∈R^p} R′_n(c_1, …, c_k).    (20)

The population optimal set of centers is the set

    C* := argmin_{c_1,…,c_k∈R^p} R′(c_1, …
, c_k),    (21)

where R′ is the population clustering risk, defined as

    R′(c_1, …, c_k) := ∫ min_{1≤l≤k} [ min_{o∈R^p} (1/2)‖x − c_l − o‖_2^2 + f_λ(‖o‖_2) ] μ(dx),    (22)

where the inner minimum equals φ(‖x − c_l‖_2) = e_{f_λ}(‖x − c_l‖_2). Loss consistency and (plain) consistency for (RKM) require, respectively, that

    R′_n(Ĉ) → R′(C*) and Ĉ → C* as n → ∞.

In words, as the size n of the dataset X^n increases, the empirical clustering risk R′_n(Ĉ) converges almost surely to the minimum population risk R′(C*) and (for n large enough) Ĉ can effectively replace the optimal set C* in quantizing the unknown probability measure μ.

For the case of convex f_λ, non-asymptotic results describing the rate of convergence of R′_n to R′ in (22) are already known ([11], Theorem 3). Noting that the Moreau envelope of a non-convex f_λ belongs to a class of functions with polynomial discrimination [16] (the shatter coefficient of this class is bounded by a polynomial), we give a sketch proof of the following result.

Theorem 3 (Consistency of (RKM)). Let the samples x_i ∈ X^n, i ∈ {1, …, n}, come from a fixed but unknown probability measure μ. For any k ≥ 1 and any unbiased proximal map, we have

    lim_{n→∞} E R′(Ĉ) = R′(C*) and Ĉ → C* (convergence in probability).    (23)

Theorem 3 reads like an asymptotic convergence result.
However, its proof (given in the appendix) uses combinatorial tools from Vapnik–Chervonenkis theory, revealing that the non-asymptotic rate of convergence of E R′(Ĉ) to R′(C*) is of order O(√(log n / n)) (see Corollary 12.1 in [4]).

5 Relating (RKM) to trimmed k-means

As the effectiveness of robust k-means on real-world and synthetic data has already been evaluated [5, 24], the purpose of this section is to relate (RKM) to trimmed k-means (TKM) [7]. Trimmed k-means is based on the methodology of “impartial trimming”, which is a combinatorial problem fundamentally different from (RKM). Despite their differences, the experiments show that (RKM) and (TKM) perform remarkably similarly in practice. The solution of (TKM) (which is also a set of k centers) is the solution of quadratic k-means on the subsample containing ⌈n(1 − a)⌉ points with the smallest mean deviation (0 < a < 1). The only common characteristic of (RKM) and (TKM) is that they both have the same universal breakdown point, i.e., 2/n, for arbitrary datasets.

Trimmed k-means takes as input a dataset X^n, the number of clusters k, and a proportion of outliers a ∈ (0, 1) to remove.⁶ A popular heuristic algorithm for (TKM) is the following. After the initialization, each iteration of (TKM) consists of the following steps: i) the distance of each observation from its closest center is computed, ii) the top ⌈an⌉ observations with the largest distance from their closest center are removed, iii) the remaining points are used to update the centers.
The previous three steps are repeated until the centers converge.⁷ As for robust k-means, we solve the (RKM) problem with a coordinate optimization procedure (see Appendix A.9 for details).

The synthetic data for the experiments come from a mixture of Gaussians with 10 components and without any overlap between them.⁸ The number of inlier samples is 500, and each inlier x_i ∈ [−1, 1]^10 for i ∈ {1, …, 500}. On top of the inliers lie 150 outliers in R^10, distributed uniformly in general positions over the entire space. We consider two scenarios: in the first, the outliers lie in [−3, 3]^10 (call it mild contamination), while, in the second, the outliers lie in [−6, 6]^10 (call it heavy contamination). The parameter a in trimmed k-means (the percentage of outliers) is set to a = 0.3, while the value of the parameter λ for which (RKM) yields 150 outliers is found through a search over a grid on the set λ ∈ (0, λ_max) (we set λ_max to the maximum distance between two points in the dataset). Both algorithms, as they are designed, require as input an initial set of k points; these points form the initial set of centers. In all experiments, both (RKM) and (TKM) take the same k vectors as initial centers, i.e., k points sampled randomly from the dataset.

The statistics we use for the comparison are: i) the Rand index for clustering accuracy [17], ii) the cluster estimation error, i.e., the root mean square error between the estimated cluster centers and the sample mean of each cluster, iii) the true positive outlier detection rate, and, finally, iv) the false positive outlier detection rate. In Figures 2–3, we plot the results for a proximal map P_f like the one in (16) with h(x) = αx and α = 0.005; with this choice of h, we mimic the hard-thresholding operator. The results for each scenario (accuracy, cluster estimation error, etc.) are averages over 150 runs of the experiment.
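The three-step (TKM) heuristic above can be sketched as follows (an illustrative NumPy implementation under our own simplifications, not the trimcluster code used in the experiments; the initial centers are passed explicitly so the example is deterministic).

```python
import numpy as np

def trimmed_kmeans(X, k, a, init, n_iter=100):
    """Heuristic (TKM): trim the ceil(a*n) points farthest from their closest
    center, then update the centers from the retained points."""
    centers = np.asarray(init, dtype=float)
    n = len(X)
    n_trim = int(np.ceil(a * n))
    keep = np.arange(n)
    for _ in range(n_iter):
        # i) squared distance of every observation to its closest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        closest = d2.argmin(1)
        # ii) remove the n_trim observations farthest from their closest center
        keep = np.argsort(d2.min(1))[: n - n_trim]
        # iii) update each center from the retained points assigned to it
        new = np.array([X[keep][closest[keep] == l].mean(0)
                        if np.any(closest[keep] == l) else centers[l]
                        for l in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, keep

# Two tight clusters plus one gross outlier; a = 0.1 trims exactly one point.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.],
              [10., 10.], [10., 11.], [11., 10.], [11., 11.], [100., 100.]])
centers, keep = trimmed_kmeans(X, k=2, a=0.1, init=[[0.5, 0.5], [10.5, 10.5]])
```

On this toy dataset the trimmed point is the gross outlier, and the returned centers coincide with the two cluster means, which is the behavior the comparison in this section relies on.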
As seen, both algorithms share almost the same statistics in all cases.

⁶ We use the implementation of trimmed k-means in the R package trimcluster [10].
⁷ The previous three steps are performed also by another robust variant of k-means, the k-means− (see [3]).
⁸ We use the R toolbox MixSim [14], which guarantees no overlap among the 10 mixture components.
\nr\ne\n\nt\ns\nu\nC\n\nl\n\n9\n\n6\n\n3\n\n0\n\n\u25cf\n\u25cf\n\n\u25cf\n\n\u25cf\n\n\u25cf\n\u25cf\n\n\u25cf\u25cf\n\u25cf\n\u25cf\n\u25cf\n\n\u25cf\n\nrobust k\u2212means trimmed k\u2212means\n\ny\nc\na\nr\nu\nc\nc\nA\n\n0.8\n\n0.6\n\n0.4\n\n0.2\n\n\u25cf\n\n\u25cf\n\u25cf\n\u25cf\n\n\u25cf\n\u25cf\n\u25cf\u25cf\u25cf\n\u25cf\n\u25cf\n\n\u25cf\n\n\u25cf\n\n\u25cf\n\n\u25cf\n\n17.5\n\nr\no\nr\nr\n\nE\n \nn\no\ni\nt\na\nm\n\ni\nt\ns\nE\n\n \nr\ne\nt\nn\ne\nC\n\n15.0\n\n12.5\n\n10.0\n\ne\nt\na\nR\n\n \nr\no\nr\nr\n\nE\n \ne\nv\ni\nt\ni\ns\no\nP\n \ne\nu\nr\nT\n\n1.00\n\n0.75\n\n0.50\n\n0.25\n\n0.00\n\n\u25cf\u25cf\n\u25cf\n\n\u25cf\n\n\u25cf\n\u25cf\n\n\u25cf\n\u25cf\n\n\u25cf\n\u25cf\n\u25cf\n\u25cf\n\n\u25cf\n\u25cf\n\n\u25cf\n\n\u25cf\u25cf\n\u25cf\u25cf\u25cf\n\u25cf\n\ne\nt\na\nR\n\n \nr\no\nr\nr\n\nE\n \ne\nv\ni\nt\ni\ns\no\nP\n \ne\ns\na\nF\n\nl\n\n0.3\n\n0.2\n\n0.1\n\n0.0\n\n\u25cf\n\u25cf\n\u25cf\n\u25cf\n\u25cf\n\u25cf\n\n\u25cf\n\u25cf\n\n\u25cf\n\u25cf\n\n\u25cf\n\n\u25cf\n\u25cf\u25cf\n\nrobust k\u2212means trimmed k\u2212means\n\nrobust k\u2212means trimmed k\u2212means\n\nrobust k\u2212means trimmed k\u2212means\n\nrobust k\u2212means trimmed k\u2212means\n\nFigure 3: The same setup as in Figure 2 except that the coordinates of each outlier lie in [\u22123, 3]10.\n\n4\n\n3\n\n2\n\n1\n\nr\no\nr\nr\n\nE\n \nn\no\ni\nt\na\nm\n\ni\nt\ns\nE\n \ns\nu\nd\na\nR\n\ni\n\n \nr\ne\nt\ns\nu\nC\n\nl\n\nrobust k\u2212means trimmed k\u2212means\n\nr\no\nr\nr\n\n \n\nE\nn\no\n\ni\nt\n\na\nm\n\ni\nt\ns\nE\n \ns\nu\nd\na\nR\n\ni\n\n \nr\ne\n\nt\ns\nu\nC\n\nl\n\n7.5\n\n5.0\n\n2.5\n\n0.0\n\nrobust k\u2212means trimmed 
k\u2212means\n\n1.0\n\ny\nc\na\nr\nu\nc\nc\nA\n\n0.8\n\n0.6\n\n\u25cf\u25cf\u25cf\u25cf\n\u25cf\u25cf\u25cf\n\u25cf\n\u25cf\u25cf\u25cf\n\u25cf\u25cf\n\u25cf\u25cf\n\u25cf\n\u25cf\u25cf\n\u25cf\n\u25cf\n\u25cf\n\u25cf\n\u25cf\n\n\u25cf\u25cf\u25cf\u25cf\n\u25cf\n\u25cf\n\u25cf\u25cf\u25cf\n\u25cf\n\u25cf\u25cf\u25cf\u25cf\n\u25cf\u25cf\n\u25cf\n\u25cf\u25cf\n\u25cf\n\u25cf\n\n\u25cf\n\n30\n\n20\n\n10\n\nr\no\nr\nr\n\n \n\nE\nn\no\n\ni\nt\n\na\nm\n\ni\nt\ns\nE\n\n \nr\ne\n\nt\n\nn\ne\nC\n\n\u25cf\n\n\u25cf\n\u25cf\n\n\u25cf\n\u25cf\n\n\u25cf\n\n\u25cf\n\u25cf\n\ne\n\nt\n\na\nR\n\n \nr\no\nr\nr\n\n \n\nE\ne\nv\ni\nt\ni\ns\no\nP\ne\nu\nr\nT\n\n \n\n1.000\n\n0.975\n\n0.950\n\n0.925\n\n0.900\n\n\u25cf\u25cf\n\n\u25cf\n\u25cf\n\u25cf\n\n\u25cf\n\n\u25cf\n\n\u25cf\n\n\u25cf\n\u25cf\n\n\u25cf\n\n\u25cf\n\u25cf\u25cf\n\u25cf\n\u25cf\n\u25cf\u25cf\u25cf\n\n\u25cf\n\n\u25cf\n\u25cf\u25cf\n\u25cf\n\u25cf\n\n\u25cf\u25cf\n\n\u25cf\n\n\u25cf\n\n\u25cf\n\nt\n\ne\na\nR\n\n \nr\no\nr\nr\n\n \n\nE\ne\nv\ni\nt\ni\ns\no\nP\ne\ns\na\nF\n\n \n\nl\n\n\u25cf\n\n\u25cf\n\u25cf\n\n\u25cf\n\n\u25cf\n\n\u25cf\n\n\u25cf\n\u25cf\n\u25cf\n\n\u25cf\n\n\u25cf\n\n0.04\n\n0.03\n\n0.02\n\n0.01\n\n0.00\n\n\u25cf\n\n\u25cf\n\n\u25cf\n\n\u25cf\u25cf\n\n\u25cf\n\n\u25cf\u25cf\n\u25cf\n\u25cf\u25cf\n\n\u25cf\n\u25cf\u25cf\n\u25cf\u25cf\n\u25cf\n\u25cf\u25cf\u25cf\n\nrobust k\u2212means trimmed k\u2212means\n\nrobust k\u2212means trimmed k\u2212means\n\nrobust k\u2212meanstrimmed k\u2212means\n\nrobust k\u2212means trimmed k\u2212means\n\nFigure 4: Results on two spherical clusters with equal radius r, each one with 150 samples, and\ncenters are at least 4r apart. On top of the samples lie 150 outliers uniformly distributed in [\u22126, 6]10.\n\nIn Figure 4, we plot the results for the case of two spherical clusters in R10 with equal radius r, each\none with 150 samples, and centers that are at least 4r apart from each other. The inlier samples are\nin [\u22123, 3]10. 
The outliers number 150 (half of the dataset is contaminated) and are uniformly distributed in [−6, 6]^10. The results (accuracy, cluster estimation error, etc.) are averages over 150 runs of the experiment. This configuration is a heavy contamination scenario but, owing to the structure of the dataset and as expected from Theorem 2, (RKM) performs remarkably well; the same holds for (TKM).

6 Conclusions

We provided a theoretical analysis of the robustness and consistency properties of a variant of classical quadratic k-means called robust k-means (RKM). As a by-product of the analysis, we derived a detailed description of the optimality conditions for the associated minimization problem. In most cases, (RKM) shares the computational simplicity of quadratic k-means, making it a "computationally cheap" candidate for robust nearest neighbor clustering. We showed that (RKM) cannot be robust against every type of contamination on every type of dataset, no matter the form of the proximal map used. If we restrict our attention to "well-structured" datasets, however, the algorithm exhibits desirable noise robustness. As for consistency, we showed that the most general results for the consistency of quadratic k-means remain valid for this robust variant.

Acknowledgments

The author would like to thank Athanasios P. Liavas for useful comments and suggestions that improved the quality of the article.

References

[1] Anestis Antoniadis and Jianqing Fan. Regularization of wavelet approximations. Journal of the American Statistical Association, 2011.
[2] Shai Ben-David and Nika Haghtalab. Clustering in the presence of background noise. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 280–288, 2014.
[3] Sanjay Chawla and Aristides Gionis. k-means−: A unified approach to clustering and outlier detection. In Proceedings of the SIAM International Conference on Data Mining, 2013.
[4] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Stochastic Modelling and Applied Probability. Springer, New York, 1997.
[5] Pedro A. Forero, Vassilis Kekatos, and Georgios B. Giannakis. Robust clustering using outlier-sparsity regularization. IEEE Transactions on Signal Processing, 60(8):4163–4177, 2012.
[6] María Teresa Gallegos and Gunter Ritter. A robust method for cluster analysis. Annals of Statistics, pages 347–380, 2005.
[7] Luis Ángel García-Escudero and Alfonso Gordaliza. Robustness properties of k-means and trimmed k-means. Journal of the American Statistical Association, 94(447):956–969, 1999.
[8] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1979.
[9] Frank R. Hampel, Elvezio M. Ronchetti, Peter J. Rousseeuw, and Werner A. Stahel. Robust Statistics: The Approach Based on Influence Functions, volume 114. John Wiley & Sons, 2011.
[10] Christian Hennig. trimcluster: Cluster analysis with trimming, 2012. R package version 0.1-2.
[11] Tamás Linder. Learning-theoretic methods in vector quantization. In Principles of Nonparametric Learning, pages 163–210. Springer, 2002.
[12] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
[13] Rahul Mazumder, Jerome H. Friedman, and Trevor Hastie. SparseNet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 2012.
[14] Volodymyr Melnykov, Wei-Chen Chen, and Ranjan Maitra. MixSim: An R package for simulating data to study performance of clustering algorithms. Journal of Statistical Software, 51(12):1–25, 2012.
[15] David Pollard. Strong consistency of k-means clustering. The Annals of Statistics, 9(1):135–140, 1981.
[16] David Pollard. Convergence of Stochastic Processes. Springer Science & Business Media, 1984.
[17] William M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.
[18] G. Ritter. Robust Cluster Analysis and Variable Selection. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. CRC Press, 2014.
[19] R. Tyrrell Rockafellar and Roger J.-B. Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.
[20] Yiyuan She et al. Thresholding-based iterative selection procedures for model selection and shrinkage. Electronic Journal of Statistics, 3:384–415, 2009.
[21] Marc Teboulle. A unified continuous optimization framework for center-based clustering methods. Journal of Machine Learning Research, 8:65–102, 2007.
[22] Paul Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.
[23] Sara van de Geer. Empirical processes in M-estimation. Handout at New Directions in General Equilibrium Analysis (Cowles Workshop, Yale University), June 13, 2003.
[24] Daniela M. Witten. Penalized unsupervised learning with outliers. Statistics and Its Interface, 6(2):211, 2013.
[25] Stephen J. Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.
[26] Yaoliang Yu, Xun Zheng, Micol Marchetti-Bowick, and Eric P. Xing. Minimizing nonconvex non-separable functions. In AISTATS, 2015.