{"title": "Query K-means Clustering and the Double Dixie Cup Problem", "book": "Advances in Neural Information Processing Systems", "page_first": 6649, "page_last": 6658, "abstract": "We consider the problem of approximate $K$-means clustering with outliers and side information provided by same-cluster queries and possibly noisy answers. Our solution shows that, under some mild assumptions on the smallest cluster size, one can obtain an $(1+\\epsilon)$-approximation for the optimal potential with probability at least $1-\\delta$, where $\\epsilon>0$ and $\\delta\\in(0,1)$, using an expected number of $O(\\frac{K^3}{\\epsilon \\delta})$ noiseless same-cluster queries and comparison-based clustering of complexity $O(ndK + \\frac{K^3}{\\epsilon \\delta})$; here, $n$ denotes the number of points and $d$ the dimension of space. Compared to a handful of other known approaches that perform importance sampling to account for small cluster sizes, the proposed query technique reduces the number of queries by a factor of roughly $O(\\frac{K^6}{\\epsilon^3})$, at the cost of possibly missing very small clusters. We extend this settings to the case where some queries to the oracle produce erroneous information, and where certain points, termed outliers, do not belong to any clusters. Our proof techniques differ from previous methods used for $K$-means clustering analysis, as they rely on estimating the sizes of the clusters and the number of points needed for accurate centroid estimation and subsequent nontrivial generalizations of the double Dixie cup problem. 
We illustrate the performance of the proposed algorithm both on synthetic and real datasets, including MNIST and CIFAR-$10$.", "full_text": "Query K-means Clustering and the Double Dixie Cup Problem\n\nI (Eli) Chien\nDepartment ECE\nUIUC\nichien3@illinois.edu\n\nChao Pan\nDepartment ECE\nUIUC\nchaopan2@illinois.edu\n\nOlgica Milenkovic\nDepartment ECE\nUIUC\nmilenkov@illinois.edu\n\nAbstract\n\nWe consider the problem of approximate K-means clustering with outliers and side information provided by same-cluster queries and possibly noisy answers. Our solution shows that, under some mild assumptions on the smallest cluster size, one can obtain a (1 + \u03b5)-approximation for the optimal potential with probability at least 1 \u2212 \u03b4, where \u03b5 > 0 and \u03b4 \u2208 (0, 1), using an expected number of O(K^3/(\u03b5\u03b4)) noiseless same-cluster queries and comparison-based clustering of complexity O(ndK + K^3/(\u03b5\u03b4)); here, n denotes the number of points and d the dimension of space. Compared to a handful of other known approaches that perform importance sampling to account for small cluster sizes, the proposed query technique reduces the number of queries by a factor of roughly O(K^6/\u03b5^3), at the cost of possibly missing very small clusters. We extend this setting to the case where some queries to the oracle produce erroneous information, and where certain points, termed outliers, do not belong to any cluster. Our proof techniques differ from previous methods used for K-means clustering analysis, as they rely on estimating the sizes of the clusters and the number of points needed for accurate centroid estimation and subsequent nontrivial generalizations of the double Dixie cup problem. 
We illustrate the performance of the proposed algorithm both on synthetic and real datasets, including MNIST and CIFAR-10.\n\n1 Introduction\n\nK-means clustering is one of the most studied unsupervised learning problems [1, 2, 3], with a rich application domain spanning areas as diverse as lossy source coding and quantization [4], image segmentation [5] and community detection [3]. The core question in K-means clustering is to find a set of K centroids that minimizes the K-means potential function, equal to the sum of the squared distances of the points from their closest centroids. An optimal set of centroids can be used to partition the points into clusters by simply assigning each point to its closest centroid.\n\nThe K-means clustering problem is NP-hard even for the case when K = 2, and when the points lie in a two-dimensional Euclidean space [6]. Moreover, finding a (1 + \u03b5)-approximation for 0 < \u03b5 < 1 remains NP-hard, unless further assumptions are made on the point and cluster structures [7, 8]. Among the state-of-the-art K-means approximation methods are the algorithms of Kanungo et al. and Ahmadian et al. [9, 10]. There also exist many heuristic algorithms for solving the problem, including Lloyd\u2019s algorithm [2] and Hartigan\u2019s method [1].\n\nAn interesting new direction in K-means clustering was recently initiated by Ashtiani et al. [11], who proposed to examine the effects of side information on the complexity of the K-means algorithm. In their semi-supervised active clustering framework, one is allowed to query an oracle whether two points from the dataset belong to the same optimal cluster or not. 
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nThe oracle answer to queries involving any pair of points is assumed to be consistent with a unique optimal solution, and it takes the form \u201csame (cluster)\u201d or \u201cdifferent (cluster)\u201d. The method of Ashtiani et al. [11] operates on special cluster structures which satisfy the so-called \u03b3-margin assumption with \u03b3 > 1, which asserts that every point is at least a \u03b3-factor closer to its corresponding centroid than to any other centroid. The oracle queries are noiseless, and O(K log n + (K^2 log K + log(1/\u03b4))/(\u03b3 \u2212 1)^4) same-cluster queries on n points are needed to ensure that with probability at least 1 \u2212 \u03b4, the obtained partition is the sought optimal solution. Ailon et al. [12] proposed to dispose of the \u03b3-margin assumption and exact clustering requirements, and addressed the issue of noisy same-cluster queries in the context of the K-means++ algorithm. In their framework, each pairwise query may return the wrong answer with some prescribed probability, but repeated queries on the same pair of points always produce the same answers. Given that no constraints on the cluster sizes and distances of points are made, one is required to perform elaborate nonuniform probabilistic sampling and subsequent selection of points that represent uniform samples in the preselected pool. This two-layer sampling procedure results in a large number of noiseless and noisy queries (in the former case, with running time of the order of O(ndK^9/\u03b5^4)) and may hence be impractical whenever the number of clusters is large, the smallest cluster size is bounded away from one, and the queries are costly and available only for a small set of pairs of points. Further extensions of the problem include the work of Gamlath et al. 
[13] that provides a framework for ensuring small clustering error probabilities via PAC (probably approximately correct) learning, and the weak-oracle analysis of Kim and Ghosh, which allows for \u201cdo not know\u201d answers [14].\n\n1.1 Our Contributions\n\nUnlike other semi-supervised approaches proposed for K-means clustering, we address the problem in the natural setting where the size of the smallest cluster is bounded from below by a small value dependent on the number of clusters K and the approximation constant \u03b5, and where the points contain outliers. Hence, we do not require that the clusters satisfy the \u03b3-margin property, nor do we insist on being able to deal with very small clusters that seldom appear in practice. Outliers are defined as points at \u201clarge\u201d distance from all clusters, for which all queries return negative answers and which hence add additional uncertainty regarding point placements. In this case, we wish to simultaneously perform approximate clustering and outlier identification. Bounding the smallest cluster size is a prevalent analytical practice in clustering, community detection and learning on graphs [15, 16, 17]. Often, K-means clustering methods are actually constrained to avoid solutions that produce empty or small clusters, as these are considered to be artifacts or consequences of poor local minima [18].\n\nLet \u03b1 = n/(K smin), with 1 \u2264 \u03b1 \u2264 n/K, denote the cluster size imbalance, where smin equals the size of the smallest cluster in the optimal clustering; when \u03b1 = 1, all clusters are of the same size n/K. Furthermore, when the upper bound is met, the size of the smallest cluster equals one.\n\nOur main results are summarized below.\n\nTheorem 1.1 (Query complexity with noiseless queries). Assume that one is given parameters \u03b5 \u2208 (0, 1), \u03b4 \u2208 (0, 1) and K, and n points in R^d. 
Furthermore, assume that the unique optimal clustering has imbalance \u03b1, where \u03b1 \u2208 [1, n/K]. Then, there exists a same-cluster query algorithm that, with an expected number of O(\u03b1K^3/(\u03b5\u03b4)) queries, with probability at least 1 \u2212 \u03b4 outputs a set of cluster centers whose corresponding clustering potential function is within a multiplicative factor (1 + \u03b5) of the optimal. The expected running time of the query-based clustering algorithm equals O(Kdn + \u03b1K^3/(\u03b5\u03b4)).\n\nTheorem 1.2 (Query complexity with noisy queries and outliers). Assume that one is given parameters \u03b5 \u2208 (0, 1), \u03b4 \u2208 (0, 1) and K, and n points in R^d. Let po be the fraction of outliers in the dataset. Furthermore, assume that the unique optimal clustering without outliers has imbalance \u03b1, where \u03b1 \u2208 [1, n/K], and that the oracle may return an erroneous answer with probability pe < 1/2. When presented with a query involving at least one outlier point, the oracle always produces the answer \u201cdifferent (cluster).\u201d Then, there exists a noisy same-cluster query algorithm that requires\n\nO( \u03b1K^4/(\u03b4\u03b5 (1 \u2212 po)(1 \u2212 2pe)^8) log^2( \u03b1K^2/(\u03b4 (2pe \u2212 1)^4 (1 \u2212 po)) ) )\n\nqueries and with probability at least 1 \u2212 \u03b4 outputs clusters whose corresponding clustering potential function is within (1 + \u03b5) of the optimal. The expected complete running time of the noisy clustering algorithm is bounded from above by O( Kdn + \u03b1K^6/(\u03b4\u03b5 (1 \u2212 po)(2pe \u2212 1)^10) log^3( \u03b1K^2/(\u03b4\u03b5 (2pe \u2212 1)^4 (1 \u2212 po)) ) ), provided that the outliers satisfy a mild separability constraint (see Section 2).\n\nCompared to the result of Ailon et al. 
[12], as long as smin \u2265 n\u03b5^3/K^7, our method is more efficient than their two-level sampling procedure; the efficiency gap increases as smin increases.\n\nNote that Theorem 1.1 gives performance guarantees in expectation, while Theorem 1.2 provides similar guarantees with high probability. Nevertheless, in the former case, a straightforward application of Markov\u2019s inequality and the union bound allows us to also bound, with high probability, the query complexity. In the noiseless setting, we conclude that using O(\u03b1K^3/(\u03b5\u03b4)) queries, with probability at least 1 \u2212 \u03b4 our clustering produces a (1 + \u03b5)-approximation. For example, by choosing \u03b4 = 0.01, we guarantee that with probability at least 0.99, the query complexity of our noiseless method equals O(\u03b1K^3/\u03b5). As an illustrative example, let n = 10^6, K = 10 and \u03b5 = 0.1. Then, the minimum cluster size constraint only requires the smallest cluster to contain at least one point (since n\u03b5^3/K^7 = 10^{-4} < 1).\n\nOur proof techniques rely on novel generalizations of the double Dixie cup problem [19, 20]. Similarly to Ailon et al. [12], we make use of Lemma 2 from [21], described in Section 2. But unlike the former approach, which first performs K-means++ sampling and then subsampling that meets the conditions of Lemma 2, we perform a one-pass sampling. Given the smallest cluster size constraint, it is possible to estimate during the query phase the number of points one needs to collect from each cluster so as to ensure a (1 + \u03b5)-approximation for all the estimated centroids. With this information at hand, queries are performed until each cluster (representing a coupon type) contains sufficiently many points (coupons). 
The double Dixie cup problem pertains to the same setting, and asks for the smallest number of coupons one has to purchase in order to collect s complete sets of coupons. The main technical difficulty arises from the fact that the number of coupons required is represented by the expected value of the maximum order statistic of random variables distributed according to the Erlang distribution [20], for which asymptotic analysis is hard when the number of types of coupons is not a constant. In our setting, the number of types depends on K, and the number of coupons purchased cannot exceed n. To address this issue, we use Poissonization methods [22] and concentration inequalities. Detailed proofs are relegated to the Supplement.\n\nFor the case of noisy queries and outliers, our solution consists of two steps. In the first step, we invoke the results of Mazumdar and Saha [23, 24] that describe how to reconstruct all clusters of sufficiently large sizes when using similarity matrices of the stochastic block model [25] along with same-cluster queries. The underlying modeling assumption is that every query can be wrong independently from all other queries with probability p, and that we cannot repeatedly ask the same query and apply majority voting to decrease the error probability, as each query response is fixed. In the second step, we simply compute the cluster centers via averaging.\n\nIn the given context, we only need to retrieve a fraction of the cluster points correctly. Note that the minimum cluster size our algorithm can handle is constrained both in terms of the sampling complexity of the double Dixie cup problem as well as in terms of the cluster sizes that [24] can handle. Additional issues arise when considering outliers, in which case we assume the oracle always returns a negative answer (\u201cdifferent clusters\u201d). 
Note that if the first point queried is an outlier, the seeding procedure may fail, as an answer of the form \u201cdifferent cluster\u201d may cause outliers to be placed into valid clusters. To mitigate this problem, we propose a simple search-and-comparison scheme which ensures that the first point assigned to any cluster is not an outlier.\n\nWe experimentally tested the proposed algorithms on synthetic and real datasets in terms of the approximation accuracy for the potential function, the query complexity and the misclassification ratio, equal to the ratio of the number of misclassified data points to the total number of points. Note that misclassification errors arise because the centroids are only estimates of the true centroids, and placements of points according to the closest centroids may be wrong. Synthetic datasets are generated via Gaussian mixture models, while the real-world datasets pertain to image classification with crowdsourced query answers, including the MNIST [26] and CIFAR-10 [27] datasets. The results show order-of-magnitude performance improvements compared to other known techniques.\n\nA few comments are in order. The models studied in [11, 24] are related to our work through the use of query models for improving clustering. Nevertheless, Ashtiani et al. [11] only consider ground truth clusters satisfying the \u03b3-margin assumption, and K-means clustering with perfect (noiseless) queries. The focus of the work by Mazumdar et al. [24] is on the stochastic block model, and although it allows for noisy queries, it does not address the K-means problem directly. The two models most closely related to ours are those of Ailon et al. [12] and Kim et al. [14]. Ailon et al. [12] focus on developing approximate K-means algorithms with noisy same-cluster queries. 
The three main differences between this line of work and ours are that we impose mild smallest cluster size constraints which significantly reduce the query complexity in both the noiseless and noisy regime, that we introduce outliers into our analysis, and that our proofs are based on a variation of the double Dixie cup problem rather than standard theoretical computer science analyses that use notions of covered and uncovered clusters. The work of Kim et al. [14] is related to ours only insofar as it allows for query responses of the form \u201cdo not know,\u201d which can also be used for dealing with outliers.\n\n2 Background and Problem Formulation\n\nWe start with a formal definition of the K-means problem. Given a set of n points X \u2282 R^d, and a number of clusters K, the K-means problem asks for finding a set of points C = {c1, ..., cK} \u2282 R^d that minimizes the following objective function\n\n\u03c6(X; C) = \u2211_{x \u2208 X} min_{c \u2208 C} ||x \u2212 c||^2,\n\nwhere || \u00b7 || denotes the L2 norm. Throughout the paper, we assume that the optimal solution is unique, and denote it by C\u2217 = {c\u2217_1, ..., c\u2217_K}. The set of centroids C\u2217 induces an optimal partition X = \u222a_{i=1}^{K} C\u2217_i, where \u2200i \u2208 [K], C\u2217_i = {x \u2208 X : ||x \u2212 c\u2217_i|| \u2264 ||x \u2212 c\u2217_j|| \u2200j \u2260 i}. We use \u03c6\u2217_K(X) to denote the optimal value of the objective function.\n\nAs already stated, the K-means clustering problem is NP-hard, and hard to approximate within a (1 + \u03b5) factor, for 0 < \u03b5 < 1. An important question in the approximate clustering setting was addressed by Inaba et al. [21], who showed how many points from a set have to be sampled uniformly at random to guarantee that, for any \u03b5 > 0 and with high probability, the centroid of the set can be estimated within a multiplicative (1 + \u03b5)-term. 
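The objective above translates directly into code. The following is a minimal pure-Python sketch (the function and variable names are ours, for illustration only): it evaluates the potential phi(X; C) and also returns the induced nearest-centroid assignment.

```python
def kmeans_potential(points, centers):
    """K-means potential: sum over points of the squared Euclidean
    distance to the closest center, plus the induced assignment.

    `points` and `centers` are sequences of equal-length coordinate tuples.
    """
    def sq_dist(x, c):
        return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

    total, assignment = 0.0, []
    for x in points:
        d, k = min((sq_dist(x, c), k) for k, c in enumerate(centers))
        total += d
        assignment.append(k)
    return total, assignment
```

For the optimal centers C*, the returned assignment is exactly the optimal partition described above (with ties broken by centroid index).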
This result was used by Ailon et al. [12] in the second (sub)sampling procedure. In our work, we make use of the same result in order to determine the smallest number of points (coupons) one needs to collect for each cluster (coupon type). For completeness, the result is stated below.\n\nLemma 2.1 (Centroid lemma, Lemma 2 of [21]). Let A be a set of points obtained by sampling with replacement m points, independently from each other and uniformly at random, from a point set S. Then, for any \u03b4 > 0, one has\n\nP( \u03c6(S; c(A)) \u2264 (1 + 1/(\u03b4m)) \u03c6\u2217(S) ) \u2265 1 \u2212 \u03b4,\n\nwhere c(A) stands for the centroid of A.\n\nIn our proof, the Centroid lemma is used in conjunction with a generalization of the double Dixie cup problem to establish the stated query complexity results in the noiseless and noisy settings. The double Dixie cup problem is an extension of the classical coupon collector problem in which the collector is required to collect m \u2265 2 sets of coupons. While the classical coupon collector problem may be analyzed using elementary probabilistic tools, the double Dixie cup problem requires generating functions and complex-analytic techniques. For the most basic incarnation of the problem, where each coupon type is equally likely and each coupon needs to be collected at least m times, with m a constant, Newman and Shepp [19] showed that one needs to purchase an average of O(K(log K + (m \u2212 1) log log K)) coupons. This setting is inadequate for our analysis, as our coupons represent points from different clusters that have different sizes, and hence give rise to different coupon (cluster point) probabilities. Furthermore, in our analysis we require m = K/(\u03b4\u03b5), which scales with K and is hence harder to analyze. The starting point of our generalization of the nonuniform-probability double Dixie cup problem is the work of Doumas [20]. 
We extend the Poissonization argument and perform a careful analysis of the expectation of the maximum order statistic of independent random variables distributed according to the Erlang distribution. All technical details are delegated to the Supplement.\n\nOften, one seeks K-means solutions in a setting where the cluster points X satisfy certain separability and cluster size constraints, such as the \u03b3-margin and the bounded minimum cluster size constraint, respectively. Both are formally defined below.\n\nDefinition 2.2 (The \u03b3-margin property [11]). Let \u03b3 > 1 be a real number. We say that X satisfies the \u03b3-margin property if \u2200 i \u2260 j \u2208 [K], x \u2208 C\u2217_i, y \u2208 C\u2217_j, one has \u03b3||x \u2212 c\u2217_i|| < ||y \u2212 c\u2217_i||.\n\nTo describe the cluster size constraint, we now formally introduce the previously mentioned notion of \u03b1-imbalance.\n\nDefinition 2.3 (The \u03b1-imbalance property). Let \u03b1 \u2208 [1, n/K] be a real number. We say that the point set X satisfies the \u03b1-imbalance property if \u03b1 = n/(K smin).\n\nTo avoid complicated and costly two-level queries, we impose an \u03b1-imbalance constraint on the optimal clustering, excluding outliers. For the set of outliers, we use a milder version of the \u03b3-margin constraint, described as follows. Assume that X = Xt \u222a Xo, where Xt and Xo are the nonintersecting sets of true cluster points and outliers, respectively. Outliers are formally defined as follows.\n\nDefinition 2.4. 
The set Xo consists of points that satisfy the \u0393(\u03be)-separation property, defined as\n\n\u2200x \u2208 Xo, \u2200i \u2208 [K], ||x \u2212 c\u2217_i|| > max_{y \u2208 C\u2217_i} ||y \u2212 c\u2217_i|| + \u221a( \u03be \u03c6\u2217(C\u2217_i)/|C\u2217_i| ), with |C\u2217_i| \u2265 \u0393(\u03be).\n\nHere, \u0393(\u03be) stands for the minimum of the lower bounds obtained for all values of i \u2208 [K]. This is a reasonable modeling assumption, as outliers are commonly defined as points that lie in \u201coutlier clusters\u201d that are well separated from all \u201cregular\u201d clusters. The definition is reminiscent of the \u03b3-margin assumption, but adapted to outliers. Note that the second term serves as a scaled proxy for the empirical standard deviation of the average distance between cluster points and their centroids. In this extended setting, the objective is to minimize the function \u03c6(Xt; C). Furthermore, with a slight abuse of notation, we use C\u2217_1, ..., C\u2217_K to denote both the optimal partition for Xt and for X. It should be clear from the context which clusters are referred to.\n\nSide information for the K-means problem is provided by a query oracle O such that, \u2200x1, x2 \u2208 X,\n\nO(x1, x2) = 0 if \u2203i \u2208 [K] s.t. x1 \u2208 C\u2217_i and x2 \u2208 C\u2217_i, and O(x1, x2) = 1 otherwise. (1)\n\nQuery complexity is measured in terms of the number of times that an algorithm requests access to the oracle. The goal is to devise query algorithms with query complexity as small as possible. The noisy oracle On may be viewed as the response of a binary symmetric channel with parameter pe to an input produced by the noiseless oracle O. Equivalently, \u2200x1, x2 \u2208 X, P(On(x1, x2) = O(x1, x2)) = 1 \u2212 pe, and P(On(x1, x2) \u2260 O(x1, x2)) = pe, independently from other queries. 
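For experimentation, the oracle model above can be mocked with a thin wrapper around a ground-truth labeling. The class below is our illustrative sketch (not code from the paper): it returns 0 for "same cluster" and 1 otherwise, flips each answer with probability p_e, and caches answers per unordered pair so that repeating a query always yields the same, possibly erroneous, response.

```python
import random

class NoisySameClusterOracle:
    """Mock of the noisy oracle O_n: 0 means "same cluster", 1 "different".

    Each answer is flipped with probability p_e (a binary-symmetric-channel
    corruption of the noiseless oracle) and cached per unordered pair, so
    identical queries always return the same, possibly wrong, value.
    `labels` maps a point id to its ground-truth cluster id.
    """

    def __init__(self, labels, p_e, seed=0):
        self.labels, self.p_e = labels, p_e
        self.rng = random.Random(seed)
        self.cache = {}

    def query(self, x1, x2):
        key = frozenset((x1, x2))
        if key not in self.cache:
            answer = 0 if self.labels[x1] == self.labels[x2] else 1
            if self.rng.random() < self.p_e:
                answer = 1 - answer  # channel flip
            self.cache[key] = answer
        return self.cache[key]
```

Setting p_e = 0 recovers the noiseless oracle O.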
Each pair (x1, x2) is queried only once, and the noisy oracle On always produces the same answer for the same query. When presented with at least one outlier point in the pair (x1, x2), the noiseless oracle always returns O(x1, x2) = 1, while the noisy oracle On may flip the answer with probability pe. The problem of identifying outliers placed in regular clusters is resolved by invoking the algorithm of [24], which places outliers into small clusters that are expurgated from the list of valid clusters.\n\n3 Algorithmic Solutions\n\nIn what follows, we present two algorithms that describe how to perform noiseless queries and noisy queries with outliers in order to seed the clusters. In the process, we sketch some of the proofs establishing the theoretical performance guarantees of our methods.\n\nThe noiseless query K-means algorithm is conceptually simple and consists of two steps. In the first step, we sample and query pairs of points until we collect at least K/(\u03b4\u03b5) points for each of the K clusters. In the second step, we compute the centroids of the clusters by using the queried and classified points. 
The number of points to be collected is dictated by the size of the smallest cluster and the double Dixie cup coupon collector\u2019s requirements derived in the Supplement, and summarized below.\n\nAlgorithm 1: Approximate Noiseless Query K-means Clustering\nInput: A set of n points X, the number of clusters K, an oracle O\nOutput: Estimates of the centers C\nInitialization: t = 1, Ci = \u2205 and Ri = \u2205, \u2200i \u2208 [K].\nUniformly at random sample a point x from X; C1 \u2190 C1 \u222a {x}, R1 \u2190 x.\nwhile min_{i \u2208 [K]} |Ci| < K/(\u03b4\u03b5) do\n    Uniformly at random sample with replacement a point x from X\n    if \u2203i \u2208 [t] s.t. O(Ri, x) = 0 then Ci \u2190 Ci \u222a {x}\n    else t \u2190 t + 1, Ct \u2190 {x}, Rt \u2190 x\nend\nfor k = 1 to K do\n    Let ck,i denote the ith element added to Ck; \u03bck \u2190 (1/|Ck|) \u2211_{i=1}^{|Ck|} ck,i, C \u2190 C \u222a {\u03bck}\nend\n\nLemma 3.1. Assume that there are K types of coupons and that the smallest probability of a coupon type, pmin, is lower bounded by 1/(\u03b1K), with \u03b1 \u2208 [1, n/K]. Then, on average, one needs to sample at most\n\n2\u03b1K(log K + m log 2)\n\ncoupons in order to guarantee the presence of at least m complete sets, where m = O(K).\n\nNote that in our analysis, we require m = K/(\u03b4\u03b5) for some \u03b5, \u03b4 > 0, while classical coupon collection and Dixie cup results are restricted to constant m [20, 19]. In the latter case, the number of samples equals O(K(log K + (m \u2212 1) log log K)), which significantly differs from our bound.\n\nTwo remarks are in order. First, one may modify Algorithm 1 to enforce a stopping criterion for the sampling procedure (see the Supplement). Furthermore, when performing pairwise oracle queries, we assumed that in the worst case, one needs to perform K queries, one for each cluster. 
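Algorithm 1, as listed above, can be rendered as a compact executable sketch (our illustrative rewrite: the oracle is passed in as a callable returning 0 for "same cluster", and the threshold m stands in for K/(delta*epsilon)):

```python
import random

def noiseless_query_kmeans(points, K, oracle, m, seed=0):
    """Sketch of Algorithm 1: grow clusters by querying each sampled point
    against one representative per discovered cluster; oracle(a, b) == 0
    means "same cluster". Sampling is with replacement and stops once all
    K clusters hold at least m points; cluster centroids are returned.
    """
    rng = random.Random(seed)
    first = rng.choice(points)
    reps, clusters = [first], [[first]]
    while len(clusters) < K or min(len(c) for c in clusters) < m:
        x = rng.choice(points)
        for i, r in enumerate(reps):
            if oracle(r, x) == 0:       # same cluster as representative i
                clusters[i].append(x)
                break
        else:                           # no match: a new cluster is seeded
            reps.append(x)
            clusters.append([x])
    centroids = [tuple(sum(v) / len(c) for v in zip(*c)) for c in clusters]
    return centroids, clusters
```

With this query order, classifying one point may cost up to K oracle calls, matching the worst case assumed in the text.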
Clearly, one may significantly reduce the query complexity by choosing, at each query time, to first probe the clusters whose estimated centroids are closest to the queried point. This algorithm is discussed in more detail in the Supplement.\n\nThe steps of the algorithm for approximate query-based clustering with noisy responses and outliers are listed in Algorithm 2. The gist of the approach is to assume that outliers create separate clusters that are filtered out using the noisy-query clustering method of [24]. Unfortunately, the aforementioned method assumes that sampling is performed without replacement, which in our setting requires that we modify the Centroid lemma to account for sampling points uniformly at random without replacement. This modification is described in the next lemma.\n\nLemma 3.2 (The Modified Centroid Lemma). Let S be a set of points obtained by sampling m points uniformly at random without replacement from a point set A. Then, for any \u03b4 > 0, with probability at least 1 \u2212 \u03b4, one has\n\n\u03c6(A; c(S)) \u2264 (1 + (1 \u2212 (m \u2212 1)/(|A| \u2212 1))/(\u03b4m)) \u03c6\u2217_1(A) \u2264 (1 + 1/(\u03b4m)) \u03c6\u2217_1(A).\n\nHere, c(S) denotes the center of mass of S, and m \u2264 |A|.\n\nFurthermore, the requirement that sampling is performed without replacement gives rise to a new version of the double Dixie cup coupon collection paradigm in which one is given only a limited supply of coupons of each type, with the total number of coupons being equal to n. As a result, the number of points sampled from each cluster without replacement can be captured by a multivariate hypergeometric random vector with parameters (n, np1, ..., npK, m). 
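The without-replacement sampling just described is easy to simulate. The helper below (illustrative, not from the paper) draws m points uniformly without replacement from clusters of the given sizes and reports the per-cluster counts, which jointly follow the multivariate hypergeometric law:

```python
import random
from collections import Counter

def sample_cluster_counts(cluster_sizes, m, seed=0):
    """Draw m points uniformly at random *without replacement* from a
    population partitioned into clusters of the given sizes; return how
    many draws landed in each cluster (a multivariate hypergeometric
    random vector)."""
    population = [i for i, size in enumerate(cluster_sizes) for _ in range(size)]
    rng = random.Random(seed)
    drawn = Counter(rng.sample(population, m))
    return [drawn.get(i, 0) for i in range(len(cluster_sizes))]
```

Repeating the draw many times and recording how often the smallest cluster receives fewer than m*p1/2 points gives an empirical handle on the concentration bound stated next.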
To establish the query complexity results in this case, we do not need to estimate the expected number of points sampled, but instead need to ensure concentration results for hypergeometric random vectors. This is straightforward to accomplish, as it is well known that a hypergeometric random variable may be written as a sum of independent but nonidentically distributed Bernoulli random variables [28]. Along with tight bounds on the Kullback\u2013Leibler divergence and Hoeffding\u2019s inequality [29], this leads to the following bound on the probability of sampling a sufficiently large number of points from the smallest cluster.\n\nTheorem 3.3. Without loss of generality, assume that p1 \u2264 p2 \u2264 . . . \u2264 pK, where pi \u2208 (0, 1) for all i, and \u2211_i pi = 1. Furthermore, assume that during the query procedure, m points from K nonuniformly sized clusters of sizes (np1, ..., npK) are sampled uniformly at random, without replacement. Then, the probability that the number S of points sampled from the smallest cluster is at least mo = m p1/2 is bounded as\n\nP{S \u2265 mo} \u2265 1 \u2212 K exp(\u2212mo/4). (2)\n\nAlgorithm 2: Approximate Noisy Query K-means Clustering with Outliers\nInput: A set of n points X, the number of clusters K, a noisy oracle On with output error probability pe, a precomputed value M, and probability po of outliers.\nOutput: Centroids set C\nPhase 1: Seed the clusters by running Algorithm 5 for noisy query-based clustering\n    Uniformly at random sample M points from X without replacement; the sampled set equals A.\nPhase 2: Estimate the centroids\n    For all i \u2208 [K], ci \u2190 c(Ai), where c(Ai) is the center of mass of the set Ai. 
C \u2190 {c1, ..., cK}\n\nIn Phase 1, Algorithm 5 (described in the Supplement) is run on A to obtain a K-partition A = \u222a_{i=1}^{K} Ai.\n\nRecall that the oracle treats outliers as points that do not belong to the optimal clusters, so that in Algorithm 5, outliers are treated as singleton clusters. In this case, the minimum cluster size requirement from [24] automatically filters out all outliers. Nevertheless, nontrivial changes compared to the noisy query algorithm derived from [24] are needed, as the presence of outliers changes the effective number of clusters. How to deal with this issue is described in the Supplement.\n\n4 Experiments\n\nSynthetic Data. For our synthetic data experiments, we start by selecting all relevant problem parameters: the number of clusters K, the cluster imbalance \u03b1, the dimension d of the point dataset, the approximation factor \u03b5 and the error tolerance level \u03b4. We uniformly at random sample K cluster centroids in the hypercube [0, 5]^d; this choice of the centroids allows one to easily control the overlap between clusters. Then, we generate ni points for each cluster i = 1, . . . , K, where the values {ni}_{i=1}^{K} are chosen so as to satisfy the \u03b1-imbalance property and so that ni \u2208 [1000, 6000]. The points in the cluster indexed by i are obtained by sampling d-dimensional vectors from a Gaussian distribution N(0, \u03c3i^2 I), with I representing the d \u00d7 d identity matrix, and adding these Gaussian samples to the corresponding cluster centroid. When generating outliers, we uniformly at random choose a subset of points of size po \u00d7 n, where n is the total number of points to be clustered. Then we adjust the positions of the points to make sure that they satisfy the \u0393(2)-separation property, described in the previous sections. 
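The synthetic benchmark above can be reproduced with a few lines of standard-library Python. This sketch is ours and simplified: it uses one shared noise level sigma instead of per-cluster sigma_i, and omits the outlier-adjustment step.

```python
import random

def make_synthetic_clusters(K, d, sizes, sigma, seed=0):
    """Generate a Gaussian-mixture benchmark: K centroids drawn uniformly
    from the hypercube [0, 5]^d, and sizes[i] points for cluster i obtained
    by adding N(0, sigma^2 I) noise to centroid i. Returns the points, the
    ground-truth labels and the centroids."""
    rng = random.Random(seed)
    centroids = [[rng.uniform(0, 5) for _ in range(d)] for _ in range(K)]
    points, labels = [], []
    for i, n_i in enumerate(sizes):
        for _ in range(n_i):
            points.append([c + rng.gauss(0, sigma) for c in centroids[i]])
            labels.append(i)
    return points, labels, centroids
```

Choosing the list `sizes` to match the alpha-imbalance property, as in the experiments, controls the cluster size imbalance directly.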
In the noisy oracle setting, we assume that the oracle produces the correct answer with probability 1 − pe, for pe ∈ (0, 1/2).

We evaluated our algorithms with respect to three performance measures. The first measure is the value of the potential function. As all our algorithms are guaranteed to produce a (1 + ε)-approximation for the optimal potential, it is of interest to compare the theoretically guaranteed and actually obtained potential values. The second performance measure is the query complexity, for which we once again have analytic upper bounds. The third performance criterion is the overall misclassification ratio, defined as the fraction of misclassified data points. We also compared our Algorithm 1 with the state-of-the-art Algorithm 2 of [12] for the case in which one cluster contains only a single point. Recall that [12] does not require the smallest cluster size to be bounded away from one, and may in principle operate more efficiently in settings where clusters of the smallest possible size (one) exist. As will be seen from our simulation studies, even in this case, our method significantly outperforms [12].

The results of our experiments for the noiseless setting are shown in Figure 1. As may be seen, our analytic approximation results for the potential closely match the results obtained via simulations. In contrast, the actual query complexity is significantly lower in practice than predicted by our analysis, due to the fact that we assumed a worst-case scenario for pairwise queries and set the number of comparisons to K. For the misclassification ratio, we observe that the general trend is as expected: the larger the number of clusters K, the larger the misclassification ratio. Still, the misclassification error in all tested examples did not exceed 2.9%.
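Since the cluster indices returned by a clustering algorithm are only defined up to relabeling, the misclassification ratio reported above must be computed after matching predicted indices to ground-truth labels. A minimal sketch of such a computation (brute force over all K! matchings, which is practical only for small K; the function name is illustrative):

```python
from itertools import permutations

def misclassification_ratio(true_labels, pred_labels, K):
    """Fraction of misclassified points, minimized over all ways of
    matching predicted cluster indices to ground-truth cluster indices."""
    n = len(true_labels)
    best_correct = 0
    for perm in permutations(range(K)):   # try every relabeling of predicted indices
        correct = sum(1 for t, p in zip(true_labels, pred_labels) if t == perm[p])
        best_correct = max(best_correct, correct)
    return 1.0 - best_correct / n

# Example: clusters 0/1 were recovered with swapped indices and one error.
ratio = misclassification_ratio([0, 0, 1, 1], [1, 1, 0, 1], K=2)  # -> 0.25
```

For larger K, the same matching can be computed in polynomial time with the Hungarian algorithm.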
From Figure 1-(d) we can see clearly that our method performs significantly better than Algorithm 2 of [12] even when α is fairly large. We did not compare our noisy query method with outliers against the noisy sampling method of [12], as the latter cannot handle outliers.

Figure 1: Figures (a) to (c) and (e) to (g) list the results for synthetic data and the noiseless oracle Algorithm 1 and noisy oracle with outliers Algorithm 2, respectively. The parameters are d = 20, K = [2 : 20], α = [1, 6], σi = [0, 2], δ = ε = 0.2, po = pe = 0.05. Figures (a) and (e) plot the potential, Figures (b) and (f) the query complexity, and Figures (c) and (g) the misclassification ratio. Figures (d) and (h) provide comparisons with the noiseless Algorithm 2 of Ailon et al. [12] for a clustering problem with one cluster of size equal to one, with all cluster sizes in the range [100, 600].

Figure 1-(d) reveals that there exists a substantial gap between the query complexity of our method and that of [12] in the noiseless setting. For example, for K = 5 and K = 10, we require 510,932 and 4.16 × 10^6 queries, respectively. In comparison, the algorithm of [12] requires 6.55 × 10^11 and 5.24 × 10^12 queries, which in the latter case is roughly five orders of magnitude more. As a matter of fact, the algorithm in [12] involves a very large constant in its complexity bound, equal to 2^23 K^3/ε^4, which for practical clustering settings dominates the complexity expression.

Real Data. Since the query complexity of our methods is independent of the size of the dataset, we can provide efficient solutions to large-scale crowdsourcing problems that can be formulated as K-means problems, such as image classification.
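Before turning to real data, note that the query counts quoted above for [12] can be reproduced with quick arithmetic; a minimal sketch evaluating the 2^23 K^3/ε^4 constant of the complexity bound of [12] at ε = 0.2, as in the experiments:

```python
def ailon_bound(K, eps):
    """Leading term 2^23 * K^3 / eps^4 of the query complexity bound of [12]."""
    return 2**23 * K**3 / eps**4

# With eps = 0.2, this matches the numbers quoted in the text:
print(f"{ailon_bound(5, 0.2):.3g}")   # 6.55e+11
print(f"{ailon_bound(10, 0.2):.3g}")  # 5.24e+12
```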
We use the following two image classification datasets, for which the ground-truth clusters are known and can hence be used to generate the outputs of both the noiseless and noisy oracle:
1) The well-known MNIST dataset [26] comprises 60,000 training and 10,000 test images of handwritten digits. Each image is normalized to fit into a 28 × 28 pixel bounding box and is anti-aliased, which results in grayscale levels.
2) The CIFAR-10 dataset [27] contains 60,000 color images with 32 × 32 pixels, grouped into 10 different clusters of equal size, representing 10 different objects. The clusters are nonintersecting, and we sampled 10,000 cluster points.
Here, we set po = 0 and pe = 0.05, hence asserting that there are no outliers, but that 5% of the data points are mislabelled. Note that all reported query complexities are those needed to achieve a (1 + ε)-approximation of the potential. The results are shown in Table 1.

Table 1: Real Datasets Results

                          Actual query complexity    Theoretical query complexity
MNIST-Algorithm 1         12,195                     38,868
MNIST-Algorithm 2         3,628,193,647              6,439,271,969
CIFAR 10-Algorithm 1      12,490                     37,479
CIFAR 10-Algorithm 2      128,458,964                898,432,836

Acknowledgments

This work was supported in part by the grants 239 SBC Purdue 4101-38050 STC Center for Science of Information and NSF CCF 15-27636.

References

[1] J. A. Hartigan and M. A. Wong, "Algorithm AS 136: A k-means clustering algorithm," Journal of the Royal Statistical Society, Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.
[2] S. Lloyd, "Least squares quantization in PCM," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.
[3] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, 2010.
[4] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y.
Wu, \u201cAn\nef\ufb01cient k-means clustering algorithm: Analysis and implementation,\u201d IEEE transactions on\npattern analysis and machine intelligence, vol. 24, no. 7, pp. 881\u2013892, 2002.\n\n[5] S. Ray and R. H. Turi, \u201cDetermination of number of clusters in k-means clustering and appli-\ncation in colour image segmentation,\u201d in Proceedings of the 4th international conference on\nadvances in pattern recognition and digital techniques. Calcutta, India, 1999, pp. 137\u2013143.\n[6] M. Mahajan, P. Nimbhorkar, and K. Varadarajan, \u201cThe planar k-means problem is NP-hard,\u201d in\n\nInternational Workshop on Algorithms and Computation. Springer, 2009, pp. 274\u2013285.\n\n[7] P. Awasthi, M. Charikar, R. Krishnaswamy, and A. K. Sinop, \u201cThe hardness of approximation\n\nof euclidean k-means,\u201d arXiv preprint arXiv:1502.03316, 2015.\n\n[8] E. Lee, M. Schmidt, and J. Wright, \u201cImproved and simpli\ufb01ed inapproximability for k-means,\u201d\n\nInformation Processing Letters, vol. 120, pp. 40\u201343, 2017.\n\n[9] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, \u201cA local\nsearch approximation algorithm for k-means clustering,\u201d Computational Geometry, vol. 28, no.\n2-3, pp. 89\u2013112, 2004.\n\n[10] S. Ahmadian, A. Norouzi-Fard, O. Svensson, and J. Ward, \u201cBetter guarantees for k-means and\neuclidean k-median by primal-dual algorithms,\u201d in Foundations of Computer Science (FOCS),\n2017 IEEE 58th Annual Symposium on.\n\nIeee, 2017, pp. 61\u201372.\n\n[11] H. Ashtiani, S. Kushagra, and S. Ben-David, \u201cClustering with same-cluster queries,\u201d in Advances\n\nin Neural Information Processing Systems, 2016, pp. 3216\u20133224.\n\n[12] N. Ailon, A. Bhattacharya, R. Jaiswal, and A. Kumar, \u201cApproximate clustering with same-cluster\n\nqueries,\u201d arXiv preprint arXiv:1704.01862, 2017.\n\n[13] B. Gamlath, S. Huang, and O. 
Svensson, \u201cSemi-supervised algorithms for approximately\n\noptimal and accurate clustering,\u201d arXiv preprint arXiv:1803.00926, 2018.\n\n[14] T. Kim and J. Ghosh, \u201cSemi-supervised active clustering with weak oracles,\u201d arXiv preprint\n\narXiv:1709.03202, 2017.\n\n[15] S.-Y. Yun and A. Proutiere, \u201cCommunity detection via random and adaptive sampling,\u201d in\n\nConference on Learning Theory, 2014, pp. 138\u2013175.\n\n[16] K.-G. Woo, J.-H. Lee, M.-H. Kim, and Y.-J. Lee, \u201cFindit: a fast and intelligent subspace\nclustering algorithm using dimension voting,\u201d Information and Software Technology, vol. 46,\nno. 4, pp. 255\u2013271, 2004.\n\n[17] G. Dasarathy, R. Nowak, and X. Zhu, \u201cS2: An ef\ufb01cient graph based active learning algorithm\nwith application to nonparametric classi\ufb01cation,\u201d in Conference on Learning Theory, 2015, pp.\n503\u2013522.\n\n9\n\n\f[18] P. Bradley, K. Bennett, and A. Demiriz, \u201cConstrained k-means clustering,\u201d Microsoft Research,\n\nRedmond, pp. 1\u20138, 2000.\n\n[19] D. J. Newman, \u201cThe double dixie cup problem,\u201d The American Mathematical Monthly, vol. 67,\n\nno. 1, pp. 58\u201361, 1960.\n\n[20] A. V. Doumas and V. G. Papanicolaou, \u201cThe coupon collector\u2019s problem revisited: generalizing\nthe double dixie cup problem of newman and shepp,\u201d ESAIM: Probability and Statistics, vol. 20,\npp. 367\u2013399, 2016.\n\n[21] M. Inaba, N. Katoh, and H. Imai, \u201cApplications of weighted voronoi diagrams and randomization\nto variance-based k-clustering,\u201d in Proceedings of the tenth annual symposium on Computational\ngeometry. ACM, 1994, pp. 332\u2013339.\n\n[22] W. Szpankowski, \u201cAnalytic poissonization and depoissonization,\u201d Average Case Analysis of\n\nAlgorithms on Sequences, pp. 442\u2013519, 2001.\n\n[23] A. Mazumdar and B. Saha, \u201cClustering with an oracle,\u201d in Communication, Control, and\nIEEE, 2016, pp. 
738\u2013739.\n[24] \u2014\u2014, \u201cClustering with noisy queries,\u201d in Advances in Neural Information Processing Systems,\n\nComputing (Allerton), 2016 54th Annual Allerton Conference on.\n\n2017, pp. 5790\u20135801.\n\n[25] E. Abbe, A. S. Bandeira, and G. Hall, \u201cExact recovery in the stochastic block model,\u201d IEEE\n\nTransactions on Information Theory, vol. 62, no. 1, pp. 471\u2013487, 2016.\n\n[26] Y. LeCun, \u201cThe mnist database of handwritten digits,\u201d http://yann. lecun. com/exdb/mnist/,\n\n1998.\n\n[27] A. Krizhevsky and G. Hinton, \u201cLearning multiple layers of features from tiny images,\u201d 2009.\n[28] S. Hui and C. Park, \u201cThe representation of hypergeometric random variables using independent\nbernoulli random variables,\u201d Communications in Statistics-Theory and Methods, vol. 43, no. 19,\npp. 4103\u20134108, 2014.\n\n[29] W. Hoeffding, \u201cProbability inequalities for sums of bounded random variables,\u201d Journal of the\n\nAmerican statistical association, vol. 58, no. 301, pp. 13\u201330, 1963.\n\n[30] N. B. Shank and H. Yang, \u201cCoupon collector problem for non-uniform coupons and random\n\nquotas,\u201d the electronic journal of combinatorics, vol. 20, no. 2, p. 33, 2013.\n\n[31] Wikipedia contributors, \u201cChernoff bound \u2014 Wikipedia, the free encyclopedia,\u201d 2018. [Online].\n\nAvailable: https://goo.gl/CFJsvT\n\n[32] M. Skala, \u201cHypergeometric tail\n\narXiv:1311.5939, 2013.\n\ninequalities:\n\nending the insanity,\u201d arXiv preprint\n\n10\n\n\f", "award": [], "sourceid": 3346, "authors": [{"given_name": "I", "family_name": "Chien", "institution": "UIUC"}, {"given_name": "Chao", "family_name": "Pan", "institution": "University of Illinois Urbana-Champaign"}, {"given_name": "Olgica", "family_name": "Milenkovic", "institution": "University of Illinois at Urbana-Champaign"}]}