{"title": "Adaptive Clustering through Semidefinite Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 1795, "page_last": 1803, "abstract": "We analyze the clustering problem through a flexible probabilistic model that aims to identify an optimal partition of the sample X1,...,Xn. We perform exact clustering with high probability using a convex semidefinite estimator that can be interpreted as a corrected, relaxed version of K-means. The estimator is analyzed in a non-asymptotic framework and shown to be optimal or near-optimal in recovering the partition. Furthermore, its performance is shown to be adaptive to the problem's effective dimension, as well as to K, the unknown number of groups in this partition. We illustrate the method's performance in comparison to other classical clustering algorithms with numerical experiments on simulated high-dimensional data.", "full_text": "Adaptive Clustering through Semidefinite Programming

Martin Royer
Laboratoire de Mathématiques d'Orsay, Univ. Paris-Sud, CNRS, Université Paris-Saclay,
91405 Orsay, France
martin.royer@math.u-psud.fr

Abstract

We analyze the clustering problem through a flexible probabilistic model that aims to identify an optimal partition of the sample X1, ..., Xn. We perform exact clustering with high probability using a convex semidefinite estimator that can be interpreted as a corrected, relaxed version of K-means. The estimator is analyzed in a non-asymptotic framework and shown to be optimal or near-optimal in recovering the partition. Furthermore, its performance is shown to be adaptive to the problem's effective dimension, as well as to K, the unknown number of groups in this partition. 
We illustrate the method's performance in comparison to other classical clustering algorithms with numerical experiments on simulated high-dimensional data.

1 Introduction

Clustering, a form of unsupervised learning, is the classical problem of assembling n observations X1, ..., Xn from a p-dimensional space into K groups. Applied fields are in need of robust clustering techniques, for instance computational biology with genome classification, data mining, or image segmentation in computer vision. But the clustering problem has proven notoriously hard when the embedding dimension is large compared to the number of observations (see for instance the recent discussions in [2, 21]).

A famous early approach to clustering is to solve for the geometric estimator K-means [19, 13, 14]. The intuition behind its objective is that groups should be determined so as to minimize the total intra-group variance. It can be interpreted as an attempt to "best" represent the observations by K points, a form of vector quantization. Although the method performs well when observations are homoscedastic, K-means is an NP-hard, ad hoc method. Probabilistic approaches to clustering are usually based on maximum likelihood paired with a variant of the EM algorithm for model estimation, see for instance the works of Fraley & Raftery [11] and Dasgupta & Schulman [9]. These methods are widespread and popular, but they tend to be very sensitive to initialization and model misspecification.

Several recent developments establish a link between clustering and semidefinite programming. Peng & Wei [17] show that the K-means objective can be relaxed into a convex, semidefinite program, leading Mixon et al. [16] to use this relaxation under a subgaussian mixture model to estimate the cluster centers. 
Yan and Sarkar [24] use a similar semidefinite program in the context of covariate clustering, when the network has nodes and covariates. Chrétien et al. [8] use a slightly different form of semidefinite program to recover the adjacency matrix of the cluster graph with high probability. Lastly, in the different context of variable clustering, Bunea et al. [6] present a semidefinite program with a correction step that produces non-asymptotic exact recovery results.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

In this work, we build upon the work and context of [6], transposing and adapting their ideas to point clustering: we introduce a semidefinite estimator for point clustering inspired by the findings of [17], with a correction component originally presented in [6]. We show that it is a very strong contender for clustering recovery in terms of speed, adaptivity and robustness to model perturbations. In order to do so we introduce a flexible probabilistic model inducing an optimal partition of the data that we aim to recover. Using the same structure of proof in a different context, we establish elements of stochastic control (see for instance Lemma A.1 on the concentration of random subgaussian Gram matrices in the supplementary material) to derive conditions for exact clustering recovery with high probability, and show optimal performance, including in high dimensions, improving on [16], as well as adaptivity to the effective dimension of the problem. We also show that our results continue to hold without knowledge of the number of groups, given a single positive tuning parameter. Lastly, we provide evidence of our method's efficiency and further insight from simulated data.

Notation. Throughout this work we use the convention 0/0 := 0 and [n] = {1, ..., n}. 
We take a_n ≲ b_n to mean that a_n is smaller than b_n up to an absolute constant factor. Let S^{d-1} denote the unit sphere in R^d. For q ∈ N* ∪ {+∞} and ν ∈ R^d, |ν|_q is the l_q-norm, and for M ∈ R^{d×d'}, |M|_q, |M|_F and |M|_op are respectively the entry-wise l_q-norm, the Frobenius norm associated with the scalar product ⟨., .⟩, and the operator norm. |D|_V is the variation semi-norm of a diagonal matrix D, the difference between its maximum and minimum elements. Let A ⪰ B mean that A − B is symmetric, positive semidefinite.

2 Probabilistic modeling of point clustering

Consider X1, ..., Xn and let ν_a = E[X_a]. The variable X_a can be decomposed into

X_a = ν_a + E_a,  a = 1, ..., n,  (1)

with E_a stochastic centered variables in R^p.

Definition 1. For K > 1, µ = (µ_1, ..., µ_K) ∈ (R^p)^K, δ ≥ 0 and G = {G_1, ..., G_K} a partition of [n], we say X1, ..., Xn are (G, µ, δ)-clustered if ∀k ∈ [K], ∀a ∈ G_k, |ν_a − µ_k|_2 ≤ δ. We then call

∆(µ) := min_{k<l} |µ_k − µ_l|_2  (2)

the cluster separation, and the discriminating capacity of the clustering is defined as

ρ(G, µ, δ) := ∆(µ)/δ.  (3)

Proposition 1. If ρ(G*_K, µ*, δ*) > 4 then G*_K is identifiable.

We assume there exist G, µ, δ and K > 1 such that X1, ..., Xn is (G, µ, δ)-clustered with |G| = K and ρ(G, µ, δ) > 4. By Proposition 1, G is then identifiable. It is the partition we aim to recover.

We also assume that X1, ..., Xn are independent observations with subgaussian behavior. Instead of the classical isotropic definition of a subgaussian random vector (see for example [20]), we use a more flexible definition that can account for anisotropy.

Definition 2. 
Let Y be a random vector in R^d. Y has a subgaussian distribution if there exists Σ ∈ R^{d×d} such that ∀x ∈ R^d,

E[exp(x^T (Y − E Y))] ≤ exp(x^T Σ x / 2).  (4)

We then call Σ a variance-bounding matrix of the random vector Y, and write shorthand Y ∼ subg(Σ). Note that Y ∼ subg(Σ) implies Cov(Y) ⪯ Σ in the semidefinite sense of the inequality. To sum up our modeling assumptions in this work:

Hypothesis 1. Let X1, ..., Xn be independent, subgaussian, (G, µ, δ)-clustered with ρ(G, µ, δ) > 4.

Remark that the model of Hypothesis 1 can be connected to another popular probabilistic model: if we further ask that X1, ..., Xn be identically distributed within a group (and hence δ = 0), the model becomes a realization of a mixture model.

3 Exact partition recovery with high probability

Let G = {G_1, ..., G_K} and let m := min_{k∈[K]} |G_k| denote the minimum cluster size. G can be represented by its characteristic matrix B* ∈ R^{n×n} defined as ∀(k, l) ∈ [K]^2, ∀(a, b) ∈ G_k × G_l,

B*_ab := 1/|G_k| if k = l, 0 otherwise.

In what follows, we will demonstrate the recovery of G through recovering its characteristic matrix B*. We introduce the sets of square matrices

C^{0,1}_K := {B ∈ R^{n×n}_+ : B^T = B, tr(B) = K, B 1_n = 1_n, B^2 = B},  (5)
C_K := {B ∈ R^{n×n}_+ : B^T = B, tr(B) = K, B 1_n = 1_n, B ⪰ 0},  (6)
C := ∪_{K∈N} C_K.  (7)

We have C^{0,1}_K ⊂ C_K ⊂ C, and C_K is convex. Notice that B* ∈ C^{0,1}_K. A result by Peng & Wei (2007) [17] shows that the K-means estimator B̄ can be expressed as

B̄ = argmax_{B ∈ C^{0,1}_K} ⟨Λ̂, B⟩  (8)

for Λ̂ := (⟨X_a, X_b⟩)_{(a,b)∈[n]^2} ∈ R^{n×n}, the observed Gram matrix. 
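As a small illustration of these objects (a sketch with a hypothetical helper name, assuming numpy), the characteristic matrix B* of a partition can be built directly and checked against the constraints defining C^{0,1}_K:

```python
import numpy as np

def characteristic_matrix(partition, n):
    """Build B* for a partition of [n]: B*[a, b] = 1/|G_k| if a and b
    belong to the same group G_k, and 0 otherwise."""
    B = np.zeros((n, n))
    for group in partition:
        for a in group:
            for b in group:
                B[a, b] = 1.0 / len(group)
    return B

# A partition of [6] into K = 2 groups.
partition = [[0, 1, 2], [3, 4, 5]]
B = characteristic_matrix(partition, 6)

# B* satisfies the constraints defining C^{0,1}_K:
assert np.allclose(B, B.T)               # B^T = B
assert np.isclose(np.trace(B), 2)        # tr(B) = K
assert np.allclose(B @ np.ones(6), 1.0)  # B 1_n = 1_n
assert np.allclose(B @ B, B)             # B^2 = B (projection matrix)

# The observed Gram matrix of a sample X (rows = observations) is simply:
X = np.arange(12.0).reshape(6, 2)
Lambda_hat = X @ X.T                     # (<X_a, X_b>)_{a,b}
```

The projection-matrix constraint B² = B is the one dropped by the convex relaxation, which only retains positive semidefiniteness.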
Therefore a natural relaxation is to consider the following estimator:

B̂ := argmax_{B ∈ C_K} ⟨Λ̂, B⟩.  (9)

Notice that E Λ̂ = Λ + Γ for Λ := (⟨ν_a, ν_b⟩)_{(a,b)∈[n]^2} ∈ R^{n×n} and Γ := E[⟨E_a, E_b⟩]_{(a,b)∈[n]^2} = diag(tr(Var(E_a)))_{1≤a≤n} ∈ R^{n×n}. The following two results demonstrate that Λ is the signal structure that leads the optimizations of (8) and (9) to recover B*, whereas Γ is a bias term that can hurt the recovery process.

Proposition 2. There exists an absolute constant c0 > 1 such that if ρ²(G, µ, δ) > c0(6 + √n/m) and m∆²(µ) > 8|Γ|_V, then we have

argmax_{B ∈ C^{0,1}_K} ⟨Λ + Γ, B⟩ = B* = argmax_{B ∈ C_K} ⟨Λ + Γ, B⟩.  (10)

This proposition shows that the B̂ estimator, as well as the K-means estimator, would recover partition G on the population Gram matrix if the variation semi-norm of Γ were sufficiently small compared to the cluster separation. Notice that to recover the partition on the population version, we require the discriminating capacity to grow as fast as 1 + (√n/m)^{1/2} instead of simply 1 from Hypothesis 1. The following proposition demonstrates that if the condition on the variation semi-norm of Γ is not met, G may not even be recovered on the population version.

Proposition 3. 
There exist G, µ, δ and Γ such that ρ²(G, µ, δ) = +∞ but m∆²(µ) < 2|Γ|_V and

B* ∉ argmax_{B ∈ C^{0,1}_K} ⟨Λ + Γ, B⟩  and  B* ∉ argmax_{B ∈ C_K} ⟨Λ + Γ, B⟩.  (11)

So Proposition 3 shows that even if the population clusters are perfectly discriminated, there is a configuration of the noise variances that makes it impossible to recover the right clustering by K-means. This shows that K-means may fail when the homoscedasticity assumption on the random variables is violated, and that it is important to correct for Γ = diag(tr(Var(E_a)))_{1≤a≤n}.

Suppose we produce such an estimator Γ̂corr. Then subtracting Γ̂corr from Λ̂ can be interpreted as a correcting term, i.e. a way to de-bias Λ̂ as an estimator of Λ. Hence the previous results demonstrate the interest of studying the following semidefinite estimator of the projection matrix B*: let

B̂corr := argmax_{B ∈ C_K} ⟨Λ̂ − Γ̂corr, B⟩.  (12)

In order to demonstrate the recovery of B* by this estimator, we introduce different quantitative measures of the "spread" of our stochastic variables, which affect the quality of the recovery. By Hypothesis 1 there exist Σ1, ..., Σn such that ∀a ∈ [n], X_a ∼ subg(Σ_a). Let

σ² := max_{a∈[n]} |Σ_a|_op,  V² := max_{a∈[n]} |Σ_a|_F,  γ² := max_{a∈[n]} tr(Σ_a).

We now produce Γ̂corr. 
Since there is no relation between the variances of the points in our model, there is very little hope of estimating Var(E_a). As for our quantity of interest tr(Var(E_a)), a form of volume, a rough estimation is challenging but possible. The estimator from [6] can be adapted to our context. For (a, b) ∈ [n]^2 let

V(a, b) := max_{(c,d)∈([n]\{a,b})^2} |⟨X_a − X_b, (X_c − X_d)/|X_c − X_d|_2⟩|,  (13)

b̂1 := argmin_{b∈[n]\{a}} V(a, b) and b̂2 := argmin_{b∈[n]\{a,b̂1}} V(a, b). Then for a ∈ [n], let

Γ̂corr := diag(⟨X_a − X_{b̂1}, X_a − X_{b̂2}⟩)_{a∈[n]}.  (14)

Proposition 4. Assume that m > 2. For absolute constants c6, c7 > 0, with probability larger than 1 − c6/n we have

|Γ̂corr − Γ|_∞ ≤ c7(σ² log n + (δ + σ√log n)γ + δ²).  (15)

So apart from the radius δ terms, which come from generous model assumptions, a proxy for Γ is produced at a σ² log n rate that we could not expect to improve on. Nevertheless, this control on Γ is key to attaining the optimal rates below. It is general and completely independent of the structure of G, as there is no relation between G and Γ.

We are now ready to introduce this paper's main result: a condition on the separation between the cluster means sufficient to ensure recovery of B* with high probability.

Theorem 1. Assume that m > 2. 
For absolute constants c1, c2 > 0, if

m∆²(µ) ≥ c2 ( σ²(n + m log n) + V²√(n + m log n) + γ(σ√log n + δ) + δ²(√n + m) ),  (16)

then with probability larger than 1 − c1/n we have B̂corr = B*, and therefore Ĝcorr = G.

We call the right-hand side of (16) the separating rate. Notice that two kinds of requirements can be read from the separating rate: requirements on the radius δ, and requirements on σ², V², γ depending on the distributions of the observations. It appears as if δ + σ√log n can be interpreted as a geometric width of our problem. If we ask that δ be of the same order as σ√log n, a maximum Gaussian deviation for n variables, then all conditions on δ in (16) can be removed. Thus for convenience of the following discussion we will now assume δ ≲ σ√log n.

How optimal is the result of Theorem 1? Notice that our result is adapted to anisotropy in the noise, but to discuss optimality it is easier to look at the isotropic scenario: V² = √p σ² and γ² = p σ². Then ∆²(µ)/σ² represents a signal-to-noise ratio. For simplicity let us also assume that all groups have equal size, that is |G1| = ... = |GK| = m, so that n = mK and the sufficient condition (16) becomes

∆²(µ)/σ² ≳ (K + log n) + √( (K + log n) pK/n ).  (17)

Optimality. To discuss optimality, we distinguish between low- and high-dimensional setups. In the low-dimensional setup n ∨ m log n ≳ p, we obtain the following condition:

∆²(µ)/σ² ≳ K + log n.  (18)

Discriminating with high probability between n observations from two Gaussians in dimension 1 would require a separating rate of at least σ² log n. 
This implies that when K ≲ log n, our result is minimax. Otherwise, to our knowledge the best clustering result on approximating mixture centers is from [16], under the condition that ∆²(µ)/σ² ≳ K². Furthermore, the K ≳ log n regime is known in the stochastic-block-model community as a hard regime where a gap is surmised to exist between the minimal information-theoretic rate and the minimal achievable computational rate (see for example [7]).

In the high-dimensional setup n ∨ m log n ≲ p, condition (17) becomes:

∆²(µ)/σ² ≳ √( (K + log n) pK/n ).  (19)

There are few information-theoretic bounds for high-dimensional clustering. Recently, Banks, Moore, Vershynin, Verzelen and Xu (2017) [3] proved a lower bound for Gaussian mixture clustering detection, namely they require a separation of order √(K(log K)p/n). When K ≲ log n, our condition only differs in that it replaces log K by log n, a price to pay for going from detecting the clusters to exactly recovering the clusters. Otherwise, when K grows faster than log n, there might exist a gap between the minimal possible rate and the achievable one, as discussed previously.

Adaptation to effective dimension. We can analyze condition (16) further by introducing an effective dimension r*, measuring the largest volume repartition of our variance-bounding matrices Σ1, ..., Σn. We will show that our estimator adapts to this effective dimension. Let

r* := γ²/σ² = max_{a∈[n]} tr(Σ_a) / max_{a∈[n]} |Σ_a|_op.  (20)

r* can also be interpreted as a form of global effective rank of the matrices Σ_a. 
Indeed, define Re(Σ) := tr(Σ)/|Σ|_op; then we have r* ≤ max_{a∈[n]} Re(Σ_a) ≤ max_{a∈[n]} rank(Σ_a) ≤ p.

Now using V² ≤ √r* σ² and γ = √r* σ, condition (16) can be written as

∆²(µ)/σ² ≳ (K + log n) + √( (K + log n) r*K/n ).  (21)

Comparing this equation to (17), notice that r* takes the place of p, indeed playing the role of an effective dimension for the problem. This shows that our estimator adapts to this effective dimension, without the use of any dimension-reduction step. In consequence, equation (21) distinguishes between an actual high-dimensional setup, n ∨ m log n ≲ r*, and a "low"-dimensional setup, r* ≲ n ∨ m log n, under which, regardless of the actual value of p, our estimator recovers under the near-minimax condition (18).

This informs on the effect of the correcting term Γ̂corr in the theorem above when n + m log n ≲ r*. The un-corrected version of the semidefinite program (9) has a leading separating rate of γ²/m = σ²r*/m, but with the Γ̂corr correction, on the other hand, (21) has a leading separating factor smaller than σ²√((K + log n)r*/m) = σ²√(n + m log n) × √r*/m. This proves that in a high-dimensional setup, our correction improves the separating rate by a factor of at least √((n + m log n)/r*).

4 Adaptation to the unknown number of groups K

It is rarely the case that K is known, but we can proceed without it. 
We produce an estimator adaptive to the number of groups K: given κ̂ ∈ R_+, we now study the following adaptive estimator:

B̃corr := argmax_{B ∈ C} ⟨Λ̂ − Γ̂corr, B⟩ − κ̂ tr(B).  (22)

Theorem 2. Suppose that m > 2 and (16) is satisfied. For absolute constants c3, c4, c5 > 0, suppose that the following condition on κ̂ is satisfied:

c4 ( V²√n + σ²n + γ(σ√log n + δ) + δ²√n ) < c5 κ̂ < m∆²(µ),  (23)

then we have B̃corr = B* with probability larger than 1 − c3/n.

Notice that condition (23) essentially requires κ̂ to sit between some components of the right-hand side of (16) and m∆²(µ). So under (23), the results from the previous section apply to the adaptive estimator B̃corr as well, and this shows that it is not necessary to know K in order to perform well in recovering G. Finding an optimized, data-driven parameter κ̂ using some form of cross-validation is outside the scope of this paper.

5 Numerical experiments

We illustrate our method on simulated Gaussian data in two challenging, high-dimensional experimental setups for comparing clustering estimators. Our sample of n = 100 points is drawn from K = 5 identically-sized, perfectly discriminated, non-isovolumic Gaussian clusters; that is, we have ∀k ∈ [K], ∀a ∈ G_k, E_a ∼ N(0, Σ_k), with |G1| = ... = |GK| = 20. The distributions are chosen to be isotropic, and the ratio between the lowest and the highest standard deviation is 1 to 10. We draw points of an R^p space in two different scenarios. 
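A minimal sketch of this simulated design, assuming numpy (the placement of the means and the separation value `sep` are hypothetical choices; in the experiments the separation is controlled through the signal-to-noise ratio ∆²(µ)/σ²):

```python
import numpy as np

rng = np.random.default_rng(0)

n, K, p = 100, 5, 500            # sample size, number of groups, dimension (p = 500 as in (S1))
m = n // K                       # identically-sized groups: |G_k| = 20
stds = np.linspace(1.0, 10.0, K)  # isotropic noise levels, std ratio 1 to 10 across clusters

# Hypothetical cluster means: scaled coordinate vectors, perfectly discriminated.
sep = 50.0
mus = np.zeros((K, p))
mus[np.arange(K), np.arange(K)] = sep

labels = np.repeat(np.arange(K), m)  # group memberships G_1, ..., G_5
# Each row is X_a = nu_a + E_a with E_a ~ N(0, sigma_k^2 I_p): non-isovolumic clusters.
X = mus[labels] + stds[labels, None] * rng.standard_normal((n, p))
```

Replaying this generator with a grid of separations (S1) or dimensions p from 10² to 10⁵ (S2) reproduces the shape of the two experiments.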
In (S1), for a given space dimension p = 500 and a fixed isotropic noise level, we report the algorithms' performance as the signal-to-noise ratio ∆²(µ)/σ² is increased from 1 to 15. In (S2) we impose a fixed signal-to-noise ratio and observe the algorithms' decay in performance as the space dimension p is increased from 10² to 10⁵ (logarithmic scale). Every reported point of the simulated space represents a hundred simulations, and indicates a median value with asymmetric standard deviations in the form of error bars.

Solving for the estimator B̂corr is a hard problem as n grows. For this task we implemented an ADMM solver following the work of Boyd et al. [4], with multiple stopping criteria including a fixed number of iterations T = 1000. The complexity of the optimization is then roughly O(Tn³). For reference, we compare the recovery capacities of Ĝcorr, labeled 'pecok' in Figure 1, with other classical clustering algorithms. We chose three different but standard clustering procedures: Lloyd's K-means algorithm [13] with a thousand K-means++ initializations of [1] (although in scenario (S2) the algorithm is too slow to converge as p grows, so we do not report it), Ward's method for hierarchical clustering [23], and the low-rank clustering algorithm applied to the Gram matrix, a spectral method appearing in McSherry [15]. Lastly we include the CORD algorithm from Bunea et al. [5]. We measure the performance of the estimators by computing the adjusted mutual information (see for instance [22]) between the truth and its estimate. In the two experiments, the results of Ĝcorr are markedly better than those of the other methods. 
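For concreteness, the correction Γ̂corr of (13)-(14) can be sketched with a naive direct loop, assuming numpy (a sketch only: its cost is roughly O(n⁴p), and a real implementation such as [18] organizes the computation far more carefully):

```python
import numpy as np

def gamma_corr(X):
    """Naive sketch of Gamma_corr: for each a, find the two neighbors b1, b2
    minimizing V(a, b) = max_{c,d} |<X_a - X_b, (X_c - X_d)/|X_c - X_d|_2>|,
    then estimate tr(Var(E_a)) by <X_a - X_b1, X_a - X_b2>."""
    n = X.shape[0]

    def V(a, b):
        best = 0.0
        for c in range(n):
            for d in range(n):
                if c in (a, b) or d in (a, b) or c == d:
                    continue
                diff = X[c] - X[d]
                norm = np.linalg.norm(diff)
                if norm == 0.0:
                    continue  # the paper's convention 0/0 := 0
                best = max(best, abs((X[a] - X[b]) @ diff) / norm)
        return best

    gamma = np.zeros(n)
    for a in range(n):
        scores = sorted((V(a, b), b) for b in range(n) if b != a)
        b1, b2 = scores[0][1], scores[1][1]
        gamma[a] = (X[a] - X[b1]) @ (X[a] - X[b2])
    return np.diag(gamma)

# Sanity check: on a noise-free clustered sample (delta = 0, no noise),
# the nearest neighbors are in-cluster duplicates and the correction vanishes.
X = np.repeat(np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]), 3, axis=0)
G = gamma_corr(X)
assert np.allclose(np.diag(G), 0.0)
```

Subtracting this diagonal matrix from the Gram matrix Λ̂ before the SDP is the de-biasing step discussed in Section 3.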
Scenario (S1) shows it can achieve exact recovery with a lower signal-to-noise ratio than its competitors, whereas scenario (S2) shows its performance starts to decay much later than that of the other methods as the space dimension is increased exponentially.

Table 1 summarizes the simulations in a different light: for the different parameter values on each line, we count the number of experiments (out of a hundred) that had an adjusted mutual information score equal to 0.9 or higher. This accounts for exact recoveries, or approximate recoveries that reasonably reflected the underlying truth. In this table it is also evident that Ĝcorr performs uniformly better, be it for exact or approximate recovery: it manages to recover the underlying truth much sooner in terms of signal-to-noise ratio, and for a given signal-to-noise ratio it represents the truth better as the embedding dimension increases.

Lastly, Table 1 provides the median computing time in seconds for each method over the entire experiment. Ĝcorr comes with substantial computation times because Γ̂corr is very costly to compute. Our method is computationally intensive, but it is of polynomial order. Solving semidefinite programs is a rapidly developing field of operations research, and even though we used the classical ADMM method of [4], which proved effective, this instance of the program could certainly have seen a more profitable implementation in the hands of a domain expert. All of the compared methods have a very hard time reaching high sample sizes n in the high-dimensional context.

The PYTHON3 implementation of the method is available in open access at martinroyer/pecok [18].

Scenario (S1)

Figure 1: Performance comparison for clustering estimators and Ĝcorr, labeled 'pecok4' in reference to [6]. 
The adjusted mutual information equals 1 when the clusterings are identical, 0 when they are independent.

Scenario (S2)

                      hierarchical  kmeans++  lowrank-spectral  pecok4           cord
(S1)  90% SNR=4.75    0             0         0                 51               0
      90% SNR=6       0             0         0                 100              0
      90% SNR=7.25    18            0         12                100              26
      90% SNR=8.5     100           0         100               100              76
      med time (s)    0.01          2.76      0.23              1.84 (+18.92)¹   0.76
(S2)  90% dim=10²     100           /         100               100              94
      90% dim=10³     0             /         0                 100              31
      90% dim=5.10³   0             /         0                 100              0
      90% dim=10⁴     0             /         0                 49               0
      med time (s)    0.14          ∞         0.68              1.94 (+68.12)¹   0.19

Table 1: Approximate recovery results for experiments (S1) and (S2): number of experiments (out of a hundred) with a score of 90% or higher, and computing times over the experiments.

¹The median time in parentheses is the time needed to compute Γ̂corr, as opposed to the main time for performing the SDP. Indeed Γ̂corr is very time-consuming; its cost is roughly O(n⁴p). It must be noted that much faster alternatives, such as the one presented in [6], perform equally well (there is no significant difference in performance) for the recovery of G, but this is outside the scope of this paper.

6 Conclusion

In this paper we analyzed a new semidefinite positive algorithm for point clustering within the context of a flexible probabilistic model, and exhibited the key quantities that guarantee non-asymptotic exact recovery. It involves an essential bias-removing correction that significantly improves the recovery rate in the high-dimensional setup. Hence we showed the estimator to be near-minimax, adapted to an effective dimension of the problem. We also demonstrated that our estimator can be optimally adapted to a data-driven choice of K, with a single tuning parameter. Lastly, we illustrated on high-dimensional experiments that our approach is empirically stronger than other classical clustering methods. 
The Γ̂corr correction step of the algorithm can be interpreted as an independent, denoising step for the Gram matrix, and we recommend using such a procedure wherever the probabilistic framework we developed seems appropriate.

In practice, it is generally more realistic to look at approximate clustering results, but in this work we chose the point of view of exact clustering to investigate the theoretical properties of our estimator. Our experimental results provide evidence that this choice is not restrictive, i.e. that our findings translate very well to approximate recovery. We expect our results to hold with similar speeds for approximate clustering, up to some logarithmic terms. One could think of adapting works on community detection by Guédon and Vershynin [12] based on Grothendieck's inequality, or work by Fei and Chen [10] from the stochastic-block-model community on similar semidefinite programs. 
In fact, referring to a detection bound by Banks, Moore, Vershynin, Verzelen and Xu (2017) [3], our only margin for improvement on the separation speed is to turn the logarithmic factor √log n into √log K when the number of clusters K is of order O(log n); otherwise the problem is rather open.

As for the robustness of this procedure, a few aspects are to be considered: the algorithm we studied solves a convexified objective, so its performance is empirically more stable than that of a non-convex objective, especially in the high-dimensional context. In this work we also benefit from a permissive probabilistic framework that allows for multiple deviations from the classical Gaussian cluster model, at no price in terms of the performance of our estimator. Points from a same cluster are allowed to have significantly different means or fluctuations, and the results for exact recovery with high probability are unchanged, near-minimax and adaptive. Likewise, on simulated data the estimator proves the most efficient in exact as well as approximate recovery.

Acknowledgements

This work is supported by a public grant overseen by the French National Research Agency (ANR) as part of the "Investissement d'Avenir" program, through the "IDI 2015" project funded by the IDEX Paris-Saclay, ANR-11-IDEX-0003-02. It is also supported by the CNRS PICS funding HighClust. We thank Christophe Giraud for a shrewd, unwavering thesis direction.

References

[1] D. Arthur and S. Vassilvitskii. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, pages 1027-1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.

[2] M. Azizyan, A. Singh, and L. Wasserman. Minimax theory for high-dimensional Gaussian mixtures with sparse mean separation. 
In Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS '13, pages 2139-2147, USA, 2013. Curran Associates Inc.

[3] J. Banks, C. Moore, N. Verzelen, R. Vershynin, and J. Xu. Information-theoretic bounds and phase transitions in clustering, sparse PCA, and submatrix localization. arXiv e-prints arXiv:1607.05222, July 2016.

[4] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1-122, January 2011.

[5] F. Bunea, C. Giraud, and X. Luo. Minimax optimal variable clustering in G-models via CORD. arXiv preprint arXiv:1508.01939, 2015.

[6] F. Bunea, C. Giraud, M. Royer, and N. Verzelen. PECOK: a convex optimization approach to variable clustering. arXiv e-prints arXiv:1606.05100, June 2016.

[7] Y. Chen and J. Xu. Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. Journal of Machine Learning Research, 17(27):1-57, 2016.

[8] S. Chrétien, C. Dombry, and A. Faivre. A semi-definite programming approach to low dimensional embedding for unsupervised clustering. CoRR, abs/1606.09190, 2016.

[9] S. Dasgupta and L. Schulman. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. J. Mach. Learn. Res., 8:203-226, May 2007.

[10] Y. Fei and Y. Chen. Exponential error rates of SDP for block models: Beyond Grothendieck's inequality. arXiv e-prints arXiv:1705.08391, May 2017.

[11] C. Fraley and A. E. Raftery. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458):611-631, 2002.

[12] O. Guédon and R. Vershynin. Community detection in sparse networks via Grothendieck's inequality. 
CoRR, abs/1411.4686, 2014.

[13] S. Lloyd. Least squares quantization in PCM. IEEE Trans. Inf. Theor., 28(2):129-137, September 1982.

[14] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pages 281-297, Berkeley, Calif., 1967. University of California Press.

[15] F. McSherry. Spectral partitioning of random graphs. In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, FOCS '01, pages 529-, Washington, DC, USA, 2001. IEEE Computer Society.

[16] D. G. Mixon, S. Villar, and R. Ward. Clustering subgaussian mixtures with k-means. In 2016 IEEE Information Theory Workshop (ITW), pages 211-215, September 2016.

[17] J. Peng and Y. Wei. Approximating k-means-type clustering via semidefinite programming. SIAM J. on Optimization, 18(1):186-205, February 2007.

[18] Martin Royer. ADMM implementation of PECOK. https://github.com/martinroyer/pecok, October 2017.

[19] H. Steinhaus. Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci, 1:801-804, 1956.

[20] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. Chapter 5 of: Compressed Sensing, Theory and Applications. Cambridge University Press, 2012.

[21] N. Verzelen and E. Arias-Castro. Detection and feature selection in sparse mixture models. arXiv e-prints arXiv:1405.1478, May 2014.

[22] Nguyen Xuan Vinh, Julien Epps, and James Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J. Mach. Learn. Res., 11:2837-2854, December 2010.

[23] J. H. Ward. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236-244, 1963.

[24] B. Yan and P. Sarkar. 
Convex relaxation for community detection with covariates. arXiv e-prints arXiv:1607.02675, July 2016.", "award": [], "sourceid": 1121, "authors": [{"given_name": "Martin", "family_name": "Royer", "institution": "Université Paris-Saclay"}]}