{"title": "Optimal Cluster Recovery in the Labeled Stochastic Block Model", "book": "Advances in Neural Information Processing Systems", "page_first": 965, "page_last": 973, "abstract": "We consider the problem of community detection or clustering in the labeled Stochastic Block Model (LSBM) with a finite number $K$ of clusters of sizes linearly growing with the global population of items $n$. Every pair of items is labeled independently at random, and label $\\ell$ appears with probability $p(i,j,\\ell)$ between two items in clusters indexed by $i$ and $j$, respectively. The objective is to reconstruct the clusters from the observation of these random labels.   Clustering under the SBM and their extensions has attracted much attention recently. Most existing work aimed at characterizing the set of parameters such that it is possible to infer clusters either positively correlated with the true clusters, or with a vanishing proportion of misclassified items, or exactly matching the true clusters. We find  the set of parameters such that there exists a clustering algorithm with at most $s$ misclassified items in average under the general LSBM and for any $s=o(n)$, which solves one open problem raised in \\cite{abbe2015community}. We further develop an algorithm, based on simple spectral methods, that achieves this fundamental performance limit within $O(n \\mbox{polylog}(n))$ computations and without the a-priori knowledge of the model parameters.", "full_text": "Optimal Cluster Recovery\n\nin the Labeled Stochastic Block Model\n\nSe-Young Yun\n\nCNLS, Los Alamos National Lab.\n\nLos Alamos, NM 87545\n\nsyun@lanl.gov\n\nAlexandre Proutiere\n\nAutomatic Control Dept., KTH\n\nStockholm 100-44, Sweden\n\nalepro@kth.se\n\nAbstract\n\nWe consider the problem of community detection or clustering in the labeled\nStochastic Block Model (LSBM) with a \ufb01nite number K of clusters of sizes\nlinearly growing with the global population of items n. 
Every pair of items is labeled independently at random, and label ℓ appears with probability p(i, j, ℓ) between two items in clusters indexed by i and j, respectively. The objective is to reconstruct the clusters from the observation of these random labels.

Clustering under the SBM and its extensions has attracted much attention recently. Most existing work aimed at characterizing the set of parameters such that it is possible to infer clusters either positively correlated with the true clusters, or with a vanishing proportion of misclassified items, or exactly matching the true clusters. We find the set of parameters such that there exists a clustering algorithm with at most s misclassified items on average under the general LSBM and for any s = o(n), which solves one open problem raised in [2]. We further develop an algorithm, based on simple spectral methods, that achieves this fundamental performance limit within O(n polylog(n)) computations and without a-priori knowledge of the model parameters.

1 Introduction

Community detection consists in extracting (a few) groups of similar items from a large global population, and has applications in a wide spectrum of disciplines including social sciences, biology, computer science, and statistical physics. The communities or clusters of items are inferred from the observed pair-wise similarities between items, which, most often, are represented by a graph whose vertices are items and whose edges are pairs of items known to share similar features.

The stochastic block model (SBM), introduced three decades ago in [12], constitutes a natural performance benchmark for community detection, and has been widely studied since then. In the SBM, the set of items V = {1, ..., n} is partitioned into K non-overlapping clusters V_1, ..., V_K, which have to be recovered from an observed realization of a random graph.
In the latter, an edge between two items belonging to clusters V_i and V_j, respectively, is present with probability p(i, j), independently of other edges. The analyses presented in this paper apply to the SBM, but also to the labeled stochastic block model (LSBM) [11], a more general model to describe the similarities of items. There, the observation of the similarity between two items comes in the form of a label taken from a finite set L = {0, 1, ..., L}, and label ℓ is observed between two items in clusters V_i and V_j, respectively, with probability p(i, j, ℓ), independently of other labels. The standard SBM can be seen as a particular instance of its labeled counterpart with two possible labels 0 and 1, where the edges present (resp. absent) in the SBM correspond to item pairs with label 1 (resp. 0). The problem of cluster recovery under the LSBM consists in inferring the hidden partition V_1, ..., V_K from the observation of the random labels on each pair of items.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Over the last few years, we have seen remarkable progress on the problem of cluster recovery under the SBM (see [7] for an exhaustive literature review), highlighting its scientific relevance and richness. Most recent work on the SBM aimed at characterizing the set of parameters (i.e., the probabilities p(i, j) that there exists an edge between nodes in clusters i and j for 1 ≤ i, j ≤ K) such that some qualitative recovery objectives can or cannot be met. For sparse scenarios where the average degree of items in the graph is O(1), parameters under which it is possible to extract clusters positively correlated with the true clusters have been identified [5, 18, 16].
When the average degree of the graph is ω(1), one may predict the set of parameters allowing cluster recovery with a vanishing (as n grows large) proportion of misclassified items [22, 17], but one may also characterize parameters for which an asymptotically exact cluster reconstruction can be achieved [1, 21, 8, 17, 2, 3, 13].

In this paper, we address the finer and more challenging question of determining, under the general LSBM, the minimal number of misclassified items given the parameters of the model. Specifically, for any given s = o(n), our goal is to identify the set of parameters such that it is possible to devise a clustering algorithm with at most s misclassified items. Of course, if we achieve this goal, we shall recover all the aforementioned results on the SBM.

Main results. We focus on the labeled SBM as described above, where each item is assigned to cluster V_k with probability α_k > 0, independently of other items. We assume w.l.o.g. that α_1 ≤ α_2 ≤ ··· ≤ α_K. We further assume that α = (α_1, ..., α_K) does not depend on the total population of items n. Conditionally on the assignment of items to clusters, the pair or edge (v, w) ∈ V² has label ℓ ∈ L = {0, 1, ..., L} with probability p(i, j, ℓ), when v ∈ V_i and w ∈ V_j. W.l.o.g., 0 is the most frequent label, i.e., 0 = arg max_ℓ Σ_{i=1}^K Σ_{j=1}^K α_i α_j p(i, j, ℓ). Throughout the paper, we typically assume that p̄ = o(1) and p̄n = ω(1), where p̄ = max_{i,j,ℓ≥1} p(i, j, ℓ) denotes the maximum probability of observing a label different from 0. We shall explicitly state whether these assumptions are made when deriving our results. In the standard SBM, the second assumption means that the average degree of the corresponding random graph is ω(1).
This also means that we can hope to recover clusters with a vanishing proportion of misclassified items. We finally make the following assumption: there exist positive constants η and ε such that for every i, j, k ∈ [K] = {1, ..., K},

(A1) ∀ℓ ∈ L, p(i, j, ℓ)/p(i, k, ℓ) ≤ η,  and  (A2) Σ_{k=1}^K Σ_{ℓ=1}^L (p(i, k, ℓ) − p(j, k, ℓ))² / p̄² ≥ ε.

(A2) imposes a certain separation between the clusters. For example, in the standard SBM with two communities, p(1, 1, 1) = p(2, 2, 1) = ξ, and p(1, 2, 1) = ζ, (A2) is equivalent to 2(ξ − ζ)²/ξ² ≥ ε. In summary, the LSBM is parametrized by α and p = (p(i, j, ℓ))_{1≤i,j≤K, 0≤ℓ≤L}; recall that α does not depend on n, whereas p does.

For the above LSBM, we derive, for any arbitrary s = o(n), a necessary condition under which there exists an algorithm inferring clusters with s misclassified items. We further establish that under this condition, a simple extension of spectral algorithms extracts communities with fewer than s misclassified items. To formalize these results, we introduce the divergence of (α, p). We denote by p(i) the K × (L + 1) matrix whose element on the j-th row and the (ℓ + 1)-th column is p(i, j, ℓ), and denote by p(i, j) ∈ [0, 1]^{L+1} the vector describing the probability distribution of the label of a pair of items in V_i and V_j, respectively. Let P^{K×(L+1)} denote the set of K × (L + 1) matrices such that each row represents a probability distribution.
The divergence D(α, p) of (α, p) is defined as follows: D(α, p) = min_{i,j: i≠j} D_{L+}(α, p(i), p(j)) with

D_{L+}(α, p(i), p(j)) = min_{y ∈ P^{K×(L+1)}} max{ Σ_{k=1}^K α_k KL(y(k), p(i, k)), Σ_{k=1}^K α_k KL(y(k), p(j, k)) },

where KL denotes the Kullback-Leibler divergence between two label distributions, i.e., KL(y(k), p(i, k)) = Σ_{ℓ=0}^L y(k, ℓ) log( y(k, ℓ)/p(i, k, ℓ) ). Finally, we denote by ε^π(n) the number of misclassified items under the clustering algorithm π, and by E[ε^π(n)] its expectation (with respect to the randomness in the LSBM and in the algorithm).

We first derive a tight lower bound on the average number of misclassified items when the latter is o(n). Note that such a bound was unknown even for the SBM [2].

Theorem 1 Assume that (A1) and (A2) hold, and that p̄n = ω(1). Let s = o(n). If there exists a clustering algorithm π misclassifying on average less than s items asymptotically, i.e., lim sup_{n→∞} E[ε^π(n)]/s ≤ 1, then the parameters (α, p) of the LSBM satisfy:

lim inf_{n→∞} nD(α, p)/log(n/s) ≥ 1.    (1)

To state the corresponding positive result (i.e., the existence of an algorithm misclassifying only s items), we make an additional assumption to avoid extremely sparse labels: (A3) there exists a constant κ > 0 such that np(j, i, ℓ) ≥ (np̄)^κ for all i, j and ℓ ≥ 1.

Theorem 2 Assume that (A1), (A2), and (A3) hold, and that p̄ = o(1), p̄n = ω(1). Let s = o(n).
If the parameters (α, p) of the LSBM satisfy (1), then the Spectral Partition (SP) algorithm presented in Section 4 misclassifies at most s items with high probability, i.e., lim_{n→∞} P[ε^{SP}(n) ≤ s] = 1.

These theorems indicate that under the LSBM with parameters satisfying (A1) and (A2), the number of misclassified items scales at least as n exp(−nD(α, p)(1 + o(1))) under any clustering algorithm, irrespective of its complexity. They further establish that the Spectral Partition algorithm reaches this fundamental performance limit under the additional condition (A3). We note that the SP algorithm runs in polynomial time, i.e., it requires O(n² p̄ log(n)) floating-point operations.

We further establish a necessary and sufficient condition on the parameters of the LSBM for the existence of a clustering algorithm recovering the clusters exactly with high probability. Deriving such a condition was also open [2].

Theorem 3 Assume that (A1) and (A2) hold. If there exists a clustering algorithm that does not misclassify any item with high probability, then the parameters (α, p) of the LSBM satisfy: lim inf_{n→∞} nD(α, p)/log(n) ≥ 1. If this condition holds, then under (A3), the SP algorithm recovers the clusters exactly with high probability.

The paper is organized as follows. Section 2 presents the related work and examples of application of our results. In Section 3, we sketch the proof of Theorem 1, which leverages change-of-measure and coupling arguments. We present in Section 4 the Spectral Partition algorithm, and analyze its performance (we outline the proof of Theorem 2). All results are proved in detail in the supplementary material.

2 Related Work and Applications

2.1 Related work

Cluster recovery in the SBM has attracted a lot of attention recently. We summarize below existing results, and compare them to ours.
Results are categorized depending on the targeted level of performance. First, we consider the notion of detectability, the lowest level of performance, requiring only that the extracted clusters are positively correlated with the true clusters. Second, we look at asymptotically accurate recovery, stating that the proportion of misclassified items vanishes as n grows large. Third, we present existing results regarding exact cluster recovery, which means that no item is misclassified. Finally, we report recent work whose objective, like ours, is to characterize the optimal cluster recovery rate.

Detectability. Necessary and sufficient conditions for detectability have been studied for the binary symmetric SBM (i.e., L = 1, K = 2, α_1 = α_2, p(1, 1, 1) = p(2, 2, 1) = ξ, and p(1, 2, 1) = p(2, 1, 1) = ζ). In the sparse regime where ξ, ζ = o(1), and for the binary symmetric SBM, the main focus has been on identifying the phase transition threshold (a condition on ξ and ζ) for detectability: it was conjectured in [5] that if n(ξ − ζ) < √(2n(ξ + ζ)) (i.e., under the threshold), no algorithm can perform better than a simple random assignment of items to clusters, and that above the threshold, clusters can be partially recovered. The conjecture was recently proved in [18] (necessary condition) and [16] (sufficient condition). The problem of detectability has also been recently studied in [24] for the asymmetric SBM with more than two clusters of possibly different sizes. Interestingly, it is shown that in most cases, the phase transition for detectability disappears.

The present paper is not concerned with conditions for detectability.
Indeed, detectability means that only a strictly positive proportion of items can be correctly classified, whereas here, we impose that the proportion of misclassified items vanishes as n grows large.

Asymptotically accurate recovery. A necessary and sufficient condition for asymptotically accurate recovery in the SBM (with any number of clusters of different but linearly increasing sizes) has been derived in [22] and [17]. Using our notion of divergence specialized to the SBM, this condition is nD(α, p) = ω(1). Our results are more precise since the minimal achievable number of misclassified items is characterized, and apply to a broader setting since they are valid for the generic LSBM.

Asymptotically exact recovery. Conditions for exact cluster recovery in the SBM have also been recently studied. [1, 17, 8] provide a necessary and sufficient condition for asymptotically exact recovery in the binary symmetric SBM. For example, it is shown that when ξ = a log(n)/n and ζ = b log(n)/n for a > b, clusters can be recovered exactly if and only if (a + b)/2 − √(ab) ≥ 1. In [2, 3], the authors consider a more general SBM corresponding to our LSBM with L = 1. They define the CH-divergence as:

D+(α, p(i), p(j)) = (n/log(n)) max_{λ∈[0,1]} Σ_{k=1}^K α_k ( (1 − λ)p(i, k, 1) + λp(j, k, 1) − p(i, k, 1)^{1−λ} p(j, k, 1)^λ ),

and show that min_{i≠j} D+(α, p(i), p(j)) > 1 is a necessary and sufficient condition for asymptotically exact reconstruction.
The following claim, proven in the supplementary material, relates D+ to D_{L+}.

Claim 4 When p̄ = o(1), we have for all i, j:

D_{L+}(α, p(i), p(j)) ~_{n→∞} max_{λ∈[0,1]} Σ_{ℓ=1}^L Σ_{k=1}^K α_k ( (1 − λ)p(i, k, ℓ) + λp(j, k, ℓ) − p(i, k, ℓ)^{1−λ} p(j, k, ℓ)^λ ).

Thus, the results in [2, 3] are obtained by applying Theorem 3 and Claim 4.

In [13], the authors consider a symmetric labeled SBM where communities are balanced (i.e., α_k = 1/K for all k) and where label probabilities are simply defined as p(i, i, ℓ) = p(ℓ) for all i and p(i, j, ℓ) = q(ℓ) for all i ≠ j. It is shown that nI/log(n) > 1 is necessary and sufficient for asymptotically exact recovery, where I = −(2/K) log( Σ_{ℓ=0}^L √(p(ℓ)q(ℓ)) ). We can relate I to D(α, p):

Claim 5 In the LSBM with K clusters, if p̄ = o(1), and for all i, j, ℓ such that i ≠ j, α_i = 1/K, p(i, i, ℓ) = p(ℓ), and p(i, j, ℓ) = q(ℓ), we have: D(α, p) ~_{n→∞} −(2/K) log( Σ_{ℓ=0}^L √(p(ℓ)q(ℓ)) ).

Again from this claim, the results derived in [13] are obtained by applying Theorem 3 and Claim 5.

Optimal recovery rate. In [6, 19], the authors consider the binary SBM in the sparse regime where the average degree of items in the graph is O(1), and identify the minimal number of misclassified items for very specific intra- and inter-cluster edge probabilities ξ and ζ. Again, the sparse regime is out of the scope of the present paper.
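The Chernoff-type quantity appearing in Claims 4 and 5 is a one-dimensional maximization over λ and is easy to evaluate numerically. Below is a minimal pure-Python sketch for two label distributions; the function name and the grid-search resolution are our own illustrative choices, not part of the paper.

```python
import math

def chernoff_div(p, q, grid=1000):
    """Evaluate max over lam in [0, 1] of
    sum_l ((1 - lam) * p[l] + lam * q[l] - p[l]**(1 - lam) * q[l]**lam),
    the Chernoff-type expression of Claims 4 and 5, for two label
    distributions p and q (lists of positive entries summing to 1)."""
    best = 0.0
    for t in range(grid + 1):
        lam = t / grid
        val = sum((1 - lam) * pi + lam * qi - pi ** (1 - lam) * qi ** lam
                  for pi, qi in zip(p, q))
        best = max(best, val)
    return best

p = [0.9, 0.1]
q = [0.6, 0.4]
d = chernoff_div(p, q)
# The maximum dominates the lam = 1/2 (Bhattacharyya-type) evaluation,
# which equals 1 - sum_l sqrt(p[l] * q[l]) when p and q each sum to 1.
bhat = 1 - sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
```

For the p and q above, the grid maximum is slightly above the λ = 1/2 value 1 − Σ_ℓ √(p(ℓ)q(ℓ)), and identical distributions are at divergence 0.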
[23, 7] are concerned with the general SBM corresponding to our LSBM with L = 1, and with regimes where asymptotically accurate recovery is possible. The authors first characterize the optimal recovery rate in a minimax framework. More precisely, they consider a (potentially large) set of possible parameters (α, p), and provide a lower bound on the expected number of misclassified items for the worst parameters in this set. Our lower bound (Theorem 1) is more precise as it is model-specific, i.e., we provide the minimal expected number of misclassified items for a given parameter (α, p) (and for a more general class of models). The authors then propose a clustering algorithm, with time complexity O(n³ log(n)), achieving their minimax recovery rate. In comparison, our algorithm achieves the optimal recovery rate for any given parameter (α, p), exhibits a lower running time of O(n² p̄ log(n)), and applies to the generic LSBM.

2.2 Applications

We provide here a few examples of application of our results, illustrating their versatility. In all examples, f(n) is a function such that f(n) = ω(1), and a, b are fixed real numbers such that a > b.

The binary SBM. Consider the binary SBM where the average item degree is Θ(f(n)), represented by an LSBM with parameters L = 1, K = 2, α = (α_1, 1 − α_1), p(1, 1, 1) = p(2, 2, 1) = af(n)/n, and p(1, 2, 1) = p(2, 1, 1) = bf(n)/n. From Theorems 1 and 2, the optimal number of misclassified vertices scales as n exp(−g(α_1, a, b)f(n)(1 + o(1))) when α_1 ≤ 1/2 (w.l.o.g.), where

g(α_1, a, b) := max_{λ∈[0,1]} (1 − α_1 − λ + 2α_1λ)a + (α_1 + λ − 2α_1λ)b − α_1 a^λ b^{1−λ} − (1 − α_1) a^{1−λ} b^λ.

It can be easily checked that g(α_1, a, b) ≥ g(1/2, a, b) = (1/2)(√a − √b)² (letting λ = 1/2). The worst case is hence obtained when the two clusters are of equal sizes. When f(n) = log(n), we also note that the condition for asymptotically exact recovery is g(α_1, a, b) ≥ 1.

Recovering a single hidden community. As in [9], consider a random graph model with a hidden community consisting of αn vertices; edges between vertices belonging to the hidden community are present with probability af(n)/n, and edges between other pairs are present with probability bf(n)/n. This is modeled by an LSBM with parameters K = 2, L = 1, α_1 = α, p(1, 1, 1) = af(n)/n, and p(1, 2, 1) = p(2, 1, 1) = p(2, 2, 1) = bf(n)/n. The minimal number of misclassified items when searching for the hidden community scales as n exp(−h(α, a, b)f(n)(1 + o(1))) where

h(α, a, b) := α ( a − (a − b)(1 + log(a − b) − log(a log(a/b)))/log(a/b) ).

When f(n) = log(n), the condition for asymptotically exact recovery of the hidden community is h(α, a, b) ≥ 1.

Optimal sampling for community detection under the SBM. Consider a dense binary symmetric SBM with intra- and inter-cluster edge probabilities a and b. In practice, to recover the clusters, one might not be able to observe the entire random graph, but instead sample its vertex (here item) pairs, as considered in [22]. Assume for instance that any pair of vertices is sampled with probability δf(n)/n for some fixed δ > 0, independently of other pairs.
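Returning to the binary SBM example, g is a one-dimensional maximization that can be evaluated by grid search; the sketch below (pure Python, with our own helper name and grid resolution) also checks the stated identity g(1/2, a, b) = (1/2)(√a − √b)².

```python
import math

def g(alpha1, a, b, grid=10**4):
    """Grid-search evaluation of
    g(alpha1, a, b) = max over lam in [0, 1] of
      (1 - alpha1 - lam + 2*alpha1*lam)*a + (alpha1 + lam - 2*alpha1*lam)*b
      - alpha1 * a**lam * b**(1 - lam) - (1 - alpha1) * a**(1 - lam) * b**lam."""
    best = -float("inf")
    for t in range(grid + 1):
        lam = t / grid
        val = ((1 - alpha1 - lam + 2 * alpha1 * lam) * a
               + (alpha1 + lam - 2 * alpha1 * lam) * b
               - alpha1 * a ** lam * b ** (1 - lam)
               - (1 - alpha1) * a ** (1 - lam) * b ** lam)
        best = max(best, val)
    return best

# Equal cluster sizes: g(1/2, a, b) = (sqrt(a) - sqrt(b))**2 / 2,
# attained at lam = 1/2.
balanced = g(0.5, 5.0, 1.0)
closed_form = (math.sqrt(5.0) - math.sqrt(1.0)) ** 2 / 2
```

With a = 5 and b = 1, the grid value agrees with the closed form (≈ 0.764) up to the grid resolution, and unbalanced clusters give a larger exponent, consistent with the worst case being the balanced one.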
We can model this sampling scenario using an LSBM with three labels, namely ×, 0 and 1, corresponding to the absence of observation (the vertex pair is not sampled), the observation of the absence of an edge, and the observation of the presence of an edge, respectively, with parameters, for all i, j ∈ {1, 2}: p(i, j, ×) = 1 − δf(n)/n, p(1, 1, 1) = p(2, 2, 1) = aδf(n)/n, and p(1, 2, 1) = p(2, 1, 1) = bδf(n)/n. The minimal number of misclassified vertices scales as n exp(−l(δ, a, b)f(n)(1 + o(1))) where l(δ, a, b) := δ(1 − √(ab) − √((1 − a)(1 − b))). When f(n) = log(n), the condition for asymptotically exact recovery is l(δ, a, b) ≥ 1.

Signed networks. Signed networks [15, 20] are used in social sciences to model positive and negative interactions between individuals. These networks can be represented by an LSBM with three possible labels, namely 0, + and −, corresponding to the absence of interaction, positive interaction, and negative interaction, respectively. Consider such an LSBM with parameters: K = 2, α_1 = α_2, p(1, 1, +) = p(2, 2, +) = a₊f(n)/n, p(1, 1, −) = p(2, 2, −) = a₋f(n)/n, p(1, 2, +) = p(2, 1, +) = b₊f(n)/n, and p(1, 2, −) = p(2, 1, −) = b₋f(n)/n, for some fixed a₊, a₋, b₊, b₋ such that a₊ > b₊ and a₋ < b₋. The minimal number of misclassified individuals here scales as n exp(−m(α, a₊, a₋, b₊, b₋)f(n)(1 + o(1))) where

m(α, a₊, a₋, b₊, b₋) := (1/2)( (√a₊ − √b₊)² + (√a₋ − √b₋)² ).

When f(n) = log(n), the condition for asymptotically exact recovery is m(α, a₊, a₋, b₊, b₋) ≥ 1.

3 Fundamental Limits: Change of Measures through Coupling

In this section, we explain the construction of the proof of Theorem 1.
The latter relies on an appropriate change-of-measure argument, frequently used to identify upper performance bounds in online stochastic optimization problems [14]. In the following, we refer to Φ, defined by the parameters (α, p), as the true stochastic model under which all the observed random labels are generated, and denote by P_Φ = P (resp. E_Φ[·] = E[·]) the corresponding probability measure (resp. expectation). In our change-of-measure argument, we construct a second stochastic model Ψ (whose corresponding probability measure and expectation are P_Ψ and E_Ψ[·], respectively). Using a change of measures from P_Φ to P_Ψ, we relate the expected number of misclassified items E_Φ[ε^π(n)] under any clustering algorithm π to the expected (w.r.t. P_Ψ) log-likelihood ratio Q of the observed labels under P_Φ and P_Ψ. Specifically, we show that, roughly, log(n/E_Φ[ε^π(n)]) must be smaller than E_Ψ[Q] for n large enough.

Construction of Ψ. Let (i★, j★) = arg min_{i,j: i<j} D_{L+}(α, p(i), p(j)), and let v★ denote the smallest item index that belongs to cluster i★ or j★. If both V_{i★} and V_{j★} are empty, we define v★ = n. Let q ∈ P^{K×(L+1)} be such that: D(α, p) = Σ_{k=1}^K α_k KL(q(k), p(i★, k)) = Σ_{k=1}^K α_k KL(q(k), p(j★, k)). The existence of such a q is proved in Lemma 7 in the supplementary material. Now, to define the stochastic model Ψ, we couple the generation of labels under Φ and Ψ as follows.

1. We first generate the random clusters V_1, ..., V_K under Φ, and extract i★, j★, and v★. The clusters generated under Ψ are the same as those generated under Φ. For any v ∈ V, we denote by σ(v) the cluster of item v.
2. For all pairs (v, w) such that v ≠ v★ and w ≠ v★, the labels generated under Ψ are the same as those generated under Φ, i.e., label ℓ is observed on the edge (v, w) with probability p(σ(v), σ(w), ℓ).

3. Under Ψ, for any v ≠ v★, the label observed on the edge (v, v★) is ℓ with probability q(σ(v), ℓ).

Let x_{v,w} denote the label observed for the pair (v, w). We introduce Q, the log-likelihood ratio of the observed labels under P_Φ and P_Ψ:

Q = Σ_{v=1}^{v★−1} log( q(σ(v), x_{v,v★}) / p(σ(v★), σ(v), x_{v,v★}) ) + Σ_{v=v★+1}^{n} log( q(σ(v), x_{v★,v}) / p(σ(v★), σ(v), x_{v★,v}) ).    (2)

Let π be a clustering algorithm with output (V̂_k)_{1≤k≤K}, and let E = ∪_{1≤k≤K} V̂_k \ V_k be the set of misclassified items under π. Note that in our analysis, we always assume without loss of generality that |∪_{1≤k≤K} V̂_k \ V_k| ≤ |∪_{1≤k≤K} V̂_{γ(k)} \ V_k| for any permutation γ, so that the set of misclassified items is indeed E. By definition, ε^π(n) = |E|. Since under Φ, items are interchangeable (remember that items are assigned to the various clusters in an i.i.d. manner), we have: nP_Φ{v ∈ E} = E_Φ[ε^π(n)] = E[ε^π(n)].

Next, we establish a relationship between E[ε^π(n)] and the distribution of Q under P_Ψ. For any function f(n), we can prove that: P_Ψ{Q ≤ f(n)} ≤ exp(f(n)) E_Φ[ε^π(n)]/((α_{i★} + α_{j★})n) + α_{j★}/(α_{i★} + α_{j★}).
Using this result with f(n) = log(n/E_Φ[ε^π(n)]) − log(2/α_{i★}), and Chebyshev's inequality, we deduce that: log(n/E_Φ[ε^π(n)]) − log(2/α_{i★}) ≤ E_Ψ[Q] + √( (4/α_{i★}) E_Ψ[(Q − E_Ψ[Q])²] ), and thus, a necessary condition for E[ε^π(n)] ≤ s is:

log(n/s) − log(2/α_{i★}) ≤ E_Ψ[Q] + √( (4/α_{i★}) E_Ψ[(Q − E_Ψ[Q])²] ).    (3)

Analysis of Q. In view of (3), we can obtain a necessary condition for E[ε^π(n)] ≤ s if we evaluate E_Ψ[Q] and E_Ψ[(Q − E_Ψ[Q])²]. To evaluate E_Ψ[Q], we can first prove that v★ ≤ log(n)² with high probability. From this, we can approximate E_Ψ[Q] by E_Ψ[ Σ_{v=v★+1}^n log( q(σ(v), x_{v★,v}) / p(σ(v★), σ(v), x_{v★,v}) ) ], which is itself well-approximated by nD(α, p). More formally, we can show that:

E_Ψ[Q] ≤ (n + 2 log(η) log(n)²) D(α, p) + log(η)/n³.    (4)

Similarly, we prove that E_Ψ[(Q − E_Ψ[Q])²] = O(np̄), which, in view of Lemma 8 (refer to the supplementary material) and assumption (A2), implies that: E_Ψ[(Q − E_Ψ[Q])²] = o(nD(α, p)).

We complete the proof of Theorem 1 by putting the above arguments together: from (3), (4) and the above analysis of Q, when the expected number of misclassified items is less than s (i.e., E[ε^π(n)] ≤ s), we must have: lim inf_{n→∞} nD(α, p)/log(n/s) ≥ 1.

4 The Spectral Partition Algorithm and its Optimality

In this section, we sketch the proof of Theorem 2.
To this aim, we present the Spectral Partition (SP) algorithm and analyze its performance. The SP algorithm consists of two parts, and its detailed pseudo-code is presented at the beginning of the supplementary document (see Algorithm 1).

The first part of the algorithm can be interpreted as an initialization of its second part, and consists in applying a spectral decomposition of an n × n random matrix A constructed from the observed labels. More precisely, A = Σ_{ℓ=1}^L w_ℓ A^ℓ, where A^ℓ is the binary matrix identifying the item pairs with observed label ℓ, i.e., for all v, w ∈ V, A^ℓ_{vw} = 1 if and only if (v, w) has label ℓ. The weight w_ℓ for label ℓ ∈ {1, ..., L} is generated uniformly at random in [0, 1], independently of other weights. From the spectral decomposition of A, we estimate the number of communities and provide asymptotically accurate estimates S_1, ..., S_K of the hidden clusters, i.e., we show that when np̄ = ω(1), with high probability, K̂ = K and there exists a permutation γ of {1, ..., K} such that (1/n)|∪_{k=1}^K V_k \ S_{γ(k)}| = O(log(np̄)²/(np̄)). This first part of the SP algorithm is adapted from algorithms proposed for the standard SBM in [4, 22] to handle the additional labels in the model without knowledge of the number K of clusters.

The second part is novel, and is critical to ensure the optimality of the SP algorithm. It consists in first constructing an estimate p̂ of the true parameters p of the model from the matrices (A^ℓ)_{1≤ℓ≤L} and the estimated clusters S_1, ..., S_K provided in the first part of SP. We expect p to be well estimated since S_1, ..., S_K are asymptotically accurate. Then our cluster estimates are iteratively improved. We run ⌊log(n)⌋ iterations. Let S^{(t)}_1, ..., S^{(t)}_K denote the clusters estimated after the t-th iteration, initialized with (S^{(0)}_1, ..., S^{(0)}_K) = (S_1, ..., S_K).
The improved clusters S^{(t+1)}_1, ..., S^{(t+1)}_K are obtained by assigning each item v ∈ V to the cluster maximizing a log-likelihood formed from p̂, S^{(t)}_1, ..., S^{(t)}_K, and the observations (A^ℓ)_{1≤ℓ≤L}: v is assigned to S^{(t+1)}_{k★} where k★ = arg max_k { Σ_{i=1}^K Σ_{w∈S^{(t)}_i} Σ_{ℓ=0}^L A^ℓ_{vw} log p̂(k, i, ℓ) }.

Part 1: Spectral Decomposition. The spectral decomposition is described in Lines 1 to 4 in Algorithm 1. As usual in spectral methods, the matrix A is first trimmed (to remove rows and columns corresponding to items with too many observed labels, as they would perturb the spectral analysis). To this aim, we estimate the average number of labels per item, and use this estimate, denoted by p̃ in Algorithm 1, as a reference for the trimming process. Γ and A_Γ denote the set of remaining items after trimming and the corresponding trimmed matrix, respectively.

If the number of clusters K is known and if we do not account for time complexity, the two-step algorithm in [4] can extract the clusters from A_Γ: first, the optimal rank-K approximation A^{(K)} of A_Γ is derived using the SVD; then, one applies the k-means algorithm to the columns of A^{(K)} to reconstruct the clusters. The number of misclassified items after this two-step algorithm is obtained as follows. Let M^ℓ = E[A^ℓ_Γ], and M = Σ_{ℓ=1}^L w_ℓ M^ℓ (using the same weights as those defining A). Then, M is of rank K.
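On a toy instance, the two-step scheme just described (weighted label matrix, rank-K approximation, grouping of its columns) can be sketched as follows. This is a simplified illustration with our own choices, not the actual Algorithm 1: a single label (L = 1), no trimming, subspace iteration in place of the iterative power method, and a midpoint distance threshold in place of both k-means and the paper's radius.

```python
import numpy as np

rng = np.random.default_rng(0)

# Planted 2-cluster instance with strong separation.
n, K = 60, 2
truth = np.array([0] * 30 + [1] * 30)
P = np.array([[0.9, 0.05], [0.05, 0.9]])  # label-1 probabilities p(i, j, 1)
A = np.zeros((n, n))
for v in range(n):
    for w in range(v + 1, n):
        A[v, w] = A[w, v] = rng.random() < P[truth[v], truth[w]]

# Rank-K approximation via subspace (power) iteration, a stand-in for
# the iterative power method used by SP.
V = rng.standard_normal((n, K))
for _ in range(50):
    V, _ = np.linalg.qr(A @ V)
A_hat = V @ (V.T @ A)  # columns of A_hat concentrate around columns of M

# Group columns by distance to a reference column (midpoint threshold).
d = np.linalg.norm(A_hat - A_hat[:, [0]], axis=0)
tau = (d.max() + d.min()) / 2
labels = (d > tau).astype(int)
```

With this separation the two groups of columns are far apart, so the recovered `labels` match the planted partition up to a global swap of the two cluster indices.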
If $v$ and $w$ are in the same cluster, $M_v = M_w$, and if $v$ and $w$ do not belong to the same cluster, from (A2), we must have with high probability: $\|M_v - M_w\|_2 = \Omega(\bar p \sqrt{n})$. Thus, the $k$-means algorithm misclassifies $v$ only if $\|A^{(K)}_v - M_v\|_2 = \Omega(\bar p \sqrt{n})$. By leveraging elements of random graph and random matrix theory, we can establish that $\sum_v \|A^{(K)}_v - M_v\|_2^2 = \|A^{(K)} - M\|_F^2 = O(n\bar p)$ with high probability. Hence the algorithm misclassifies $O(1/\bar p)$ items with high probability.

Here the number of clusters $K$ is not given a priori. In this scenario, Algorithm 2 estimates the rank of $M$ using a singular value thresholding procedure. To reduce the complexity of the algorithm, the singular values and singular vectors are obtained using the iterative power method instead of a direct SVD. It is known from [10] that with $\Theta(\log(n))$ iterations, the iterative power method finds the singular values and the rank-$K$ approximation very accurately. Hence, when $n\bar p = \omega(1)$, we can easily estimate the rank of $M$ by counting the singular values above the threshold $\sqrt{n\tilde p} \log(n\tilde p)$, since we know from random matrix theory that the $(K+1)$-th singular value of $A_\Gamma$ is much less than $\sqrt{n\tilde p} \log(n\tilde p)$ with high probability. In the pseudo-code of Algorithm 2, the estimated rank of $M$ is denoted by $\tilde K$.

The rank-$\tilde K$ approximation of $A_\Gamma$ obtained by the iterative power method is $\hat A = \hat U \hat V = \hat U \hat U^\top A_\Gamma$. From the columns of $\hat A$, we can estimate the number of clusters and classify items.
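The rank-estimation rule above can be sketched as follows, using a direct SVD in place of the iterative power method of [10] for brevity. The helper name `estimate_rank` is introduced here for illustration; `p_tilde` stands for the trimming-step estimate $\tilde p$, so that $n\tilde p$ is the estimated average number of labels per item.

```python
import numpy as np

def estimate_rank(A_gamma, p_tilde):
    """Estimate rank(M) by counting the singular values of the trimmed
    matrix above the threshold sqrt(n * p_tilde) * log(n * p_tilde)."""
    n = A_gamma.shape[0]
    threshold = np.sqrt(n * p_tilde) * np.log(n * p_tilde)
    s = np.linalg.svd(A_gamma, compute_uv=False)
    return int((s > threshold).sum())
```

On a matrix with $K$ strong blocks, the top $K$ singular values scale with the block sizes while the remaining ones stay below the noise-level threshold, so the count recovers $K$.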
Almost every column of $\hat A$ is located within distance $\frac{1}{2}\sqrt{\frac{n\tilde p^2}{\log(n\tilde p)}}$ of the corresponding column of $M$, since $\sum_v \|\hat A_v - M_v\|_2^2 = \|\hat A - M\|_F^2 = O(n\bar p \log(n\bar p)^2)$ with high probability (we rigorously analyze this distance in Section D.2 of the supplementary material). From this observation, the columns can be categorized into $K$ groups. To find these groups, we randomly pick $\log(n)$ reference columns and, for each reference column, search for all columns within distance $\sqrt{\frac{n\tilde p^2}{\log(n\tilde p)}}$. Then, with high probability, each cluster has at least one reference column, and each reference column finds most of its cluster members. Finally, the $K$ groups are identified using the reference columns. To this aim, we compute the distances of $n\log(n)$ column pairs $\hat A_v, \hat A_w$. Observe that $\|\hat A_v - \hat A_w\|_2 = \|\hat V_v - \hat V_w\|_2$ for any $v, w \in \Gamma$, since the columns of $\hat U$ are orthonormal. Now $\hat V_v$ is of dimension $\tilde K$, and hence we can identify the groups using $O(n\tilde K \log(n))$ operations.

Theorem 6 Assume that (A1) and (A2) hold, and that $n\bar p = \omega(1)$. After Step 4 (spectral decomposition) in the SP algorithm, with high probability, $\hat K = K$ and there exists a permutation $\gamma$ of $\{1, \ldots, K\}$ such that:
$$\left| \cup_{k=1}^{K} \mathcal{V}_k \setminus S_{\gamma(k)} \right| = O\left( \frac{\log(n\bar p)^2}{\bar p} \right).$$

Part 2: Successive cluster improvements. Part 2 of the SP algorithm is described in Lines 5 and 6 in Algorithm 1.
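The improvement iterations of Part 2 can be sketched as follows for the unlabeled case $L = 1$ (a single observed label, with $\ell = 0$ read as the absence of an observation), with the current clusters encoded as an integer assignment vector. This Bernoulli simplification and the helper name `improve_once` are illustrative, not the authors' pseudo-code.

```python
import numpy as np

def improve_once(A, z, K, eps=1e-9):
    """One improvement iteration: estimate p_hat(k, i) from the current
    assignment z, then reassign every item v to the cluster k maximizing
    sum_i [ e(v, S_i) * log p_hat(k, i)
            + (|S_i| - e(v, S_i)) * log(1 - p_hat(k, i)) ],
    where e(v, S_i) is the number of observed edges from v to cluster i."""
    members = [np.where(z == k)[0] for k in range(K)]
    sizes = np.array([len(m) for m in members])
    # p_hat[k, i]: empirical edge density between current clusters k and i.
    p_hat = np.empty((K, K))
    for k in range(K):
        for i in range(K):
            block = A[np.ix_(members[k], members[i])]
            p_hat[k, i] = block.mean() if block.size else eps
    p_hat = np.clip(p_hat, eps, 1 - eps)
    # counts[v, i] = number of observed edges from item v to cluster i.
    counts = np.stack([A[:, m].sum(axis=1) for m in members], axis=1)
    # Log-likelihood of assigning each item to each cluster.
    ll = counts @ np.log(p_hat).T + (sizes - counts) @ np.log(1 - p_hat).T
    return ll.argmax(axis=1)
```

Running $\lfloor \log(n) \rfloor$ such iterations, as Algorithm 1 does, drives the number of misclassified items in $H$ to zero.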
To analyze the performance of each improvement iteration, we introduce the set of items $H$ as the largest subset of $\mathcal{V}$ such that for all $v \in H$: (H1) $e(v, \mathcal{V}) \le 10 \eta n \bar p L$; (H2) when $v \in \mathcal{V}_k$, $\sum_{i=1}^{K} \sum_{\ell=0}^{L} e(v, \mathcal{V}_i, \ell) \log \frac{p(k,i,\ell)}{p(j,i,\ell)} \ge \frac{n\bar p}{\log(n\bar p)^4}$ for all $j \ne k$; (H3) $e(v, \mathcal{V} \setminus H) \le 2\log(n\bar p)^2$, where for any $S \subset \mathcal{V}$ and $\ell$, $e(v, S, \ell) = \sum_{w \in S} A^\ell_{vw}$ and $e(v, S) = \sum_{\ell=1}^{L} e(v, S, \ell)$. Condition (H1) means that there are not too many observed labels $\ell \ge 1$ on pairs including $v$, (H2) means that an item $v \in \mathcal{V}_k$ must be classified to $\mathcal{V}_k$ when considering the log-likelihood, and (H3) states that $v$ does not share too many labels with items outside $H$.

We then prove that $|\mathcal{V} \setminus H| \le s$ with high probability when $nD(\alpha, p) - \frac{n\bar p}{\log(n\bar p)^3} \ge \log(n/s) + \sqrt{\log(n/s)}$. This is mainly done using concentration arguments to relate the quantity $\sum_{i=1}^{K} \sum_{\ell=0}^{L} e(v, \mathcal{V}_i, \ell) \log \frac{p(k,i,\ell)}{p(j,i,\ell)}$ involved in (H2) to $nD(\alpha, p)$.

Finally, we establish that if the clusters provided after the first part of the SP algorithm are asymptotically accurate, then after $\log(n)$ improvement iterations, there are no misclassified items in $H$. To that aim, we denote by $\mathcal{E}^{(t)}$ the set of misclassified items after the $t$-th iteration, and show that with high probability, for all $t$, $\frac{|\mathcal{E}^{(t+1)} \cap H|}{|\mathcal{E}^{(t)} \cap H|} \le \frac{1}{\sqrt{n\bar p}}$. This completes the proof of Theorem 2, since after $\log(n)$ iterations, the only misclassified items are those in $\mathcal{V} \setminus H$.

Acknowledgments

We gratefully acknowledge the support of the U.S. Department of Energy through the LANL/LDRD Program for this work.

References

[1] E. Abbe, A. Bandeira, and G. Hall. Exact recovery in the stochastic block model. CoRR, abs/1405.3267, 2014.

[2] E. Abbe and C. Sandon.
Community detection in general stochastic block models: fundamental limits and efficient recovery algorithms. In FOCS, 2015.

[3] E. Abbe and C. Sandon. Recovering communities in the general stochastic block model without knowing the parameters. In NIPS, 2015.

[4] A. Coja-Oghlan. Graph partitioning via adaptive spectral techniques. Combinatorics, Probability & Computing, 19(2):227–284, 2010.

[5] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Inference and phase transitions in the detection of modules in sparse networks. Phys. Rev. Lett., 107, Aug 2011.

[6] Y. Deshpande, E. Abbe, and A. Montanari. Asymptotic mutual information for the two-groups stochastic block model. CoRR, abs/1507.08685, 2015.

[7] C. Gao, Z. Ma, A. Zhang, and H. Zhou. Achieving optimal misclassification proportion in stochastic block model. CoRR, abs/1505.03772, 2015.

[8] B. Hajek, Y. Wu, and J. Xu. Achieving exact cluster recovery threshold via semidefinite programming. CoRR, abs/1412.6156, 2014.

[9] B. Hajek, Y. Wu, and J. Xu. Information limits for recovering a hidden community. CoRR, abs/1509.07859, 2015.

[10] N. Halko, P. Martinsson, and J. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.

[11] S. Heimlicher, M. Lelarge, and L. Massoulié. Community detection in the labelled stochastic block model. In NIPS Workshop on Algorithmic and Statistical Approaches for Large Social Networks, 2012.

[12] P. Holland, K. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.

[13] V. Jog and P. Loh. Information-theoretic bounds for exact recovery in weighted stochastic block models using the Rényi divergence. CoRR, abs/1509.06418, 2015.

[14] T. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules.
Advances in Applied Mathematics, 6(1):4–22, 1985.

[15] J. Leskovec, D. Huttenlocher, and J. Kleinberg. Signed networks in social media. In CHI, 2010.

[16] L. Massoulié. Community detection thresholds and the weak Ramanujan property. In STOC, 2014.

[17] E. Mossel, J. Neeman, and A. Sly. Consistency thresholds for binary symmetric block models. In STOC, 2015.

[18] E. Mossel, J. Neeman, and A. Sly. Reconstruction and estimation in the planted partition model. Probability Theory and Related Fields, 162(3-4):431–461, 2015.

[19] E. Mossel and J. Xu. Density evolution in the degree-correlated stochastic block model. CoRR, abs/1509.03281, 2015.

[20] V. Traag and J. Bruggeman. Community detection in networks with positive and negative links. Physical Review E, 80(3):036115, 2009.

[21] S. Yun and A. Proutiere. Accurate community detection in the stochastic block model via spectral algorithms. CoRR, abs/1412.7335, 2014.

[22] S. Yun and A. Proutiere. Community detection via random and adaptive sampling. In COLT, 2014.

[23] A. Zhang and H. Zhou. Minimax rates of community detection in stochastic block models. CoRR, abs/1507.05313, 2015.

[24] P. Zhang, C. Moore, and M. Newman. Community detection in networks with unequal groups. CoRR, abs/1509.00107, 2015.