{"title": "Optimal Sampling and Clustering in the Stochastic Block Model", "book": "Advances in Neural Information Processing Systems", "page_first": 13422, "page_last": 13430, "abstract": "This paper investigates the design of joint adaptive sampling and clustering algorithms in networks whose structure follows the celebrated Stochastic Block Model (SBM). To extract hidden clusters, the interaction between edges (pairs of nodes) may be sampled sequentially, in an adaptive manner. After gathering samples, the learner returns cluster estimates. We derive information-theoretical upper bounds on the cluster recovery rate. These bounds actually reveal the optimal sequential edge sampling strategy, and interestingly, the latter does not depend on the sampling budget, but on the parameters of the SBM only. We devise a joint sampling and clustering algorithm matching the recovery rate upper bounds. The algorithm initially uses a fraction of the sampling budget to estimate the SBM parameters, and to learn the optimal sampling strategy. This strategy then guides the remaining sampling process, which confers the optimality of the algorithm. We show both analytically and numerically that adaptive edge sampling yields important improvements over random sampling (traditionally used in the SBM analysis). For example, we prove that adaptive sampling significantly enlarges the region of the SBM parameters where asymptotically exact cluster recovery is feasible.", "full_text": "Optimal Sampling and Clustering\n\nin the Stochastic Block Model\n\nSe-Young Yun\n\nKAIST\n\nDaejeon, South Korea\n\nyunseyoung@kaist.ac.kr\n\nAlexandre Prouti\u00e8re\n\nKTH\n\nStockholm, Sweden\nalepro@kth.se\n\nAbstract\n\nThis paper investigates the design of joint adaptive sampling and clustering al-\ngorithms in networks whose structure follows the celebrated Stochastic Block\nModel (SBM). 
To extract hidden clusters, the interaction between edges (pairs of nodes) may be sampled sequentially, in an adaptive manner. After gathering samples, the learner returns cluster estimates. We derive information-theoretical upper bounds on the cluster recovery rate. These bounds actually reveal the optimal sequential edge sampling strategy, and interestingly, the latter does not depend on the sampling budget, but on the parameters of the SBM only. We devise a joint sampling and clustering algorithm matching the recovery rate upper bounds. The algorithm initially uses a fraction of the sampling budget to estimate the SBM parameters, and to learn the optimal sampling strategy. This strategy then guides the remaining sampling process, which confers the optimality of the algorithm. We show both analytically and numerically that adaptive edge sampling yields important improvements over random sampling (traditionally used in the SBM analysis). For example, we prove that adaptive sampling significantly enlarges the region of the SBM parameters where asymptotically exact cluster recovery is feasible.

1 Introduction

Extracting clusters in networks is a central task in many fields including biology, computer science, and social science. The Stochastic Block Model (SBM) [9] and its extensions provide a natural statistical benchmark to assess the performance of network clustering algorithms. The SBM defines a random graph with n nodes consisting of K non-overlapping clusters, V1, . . . , VK, of respective sizes α1 n, . . . , αK n with αk > 0 for all k. An edge between two nodes from respective clusters Vi and Vj indicates whether these nodes interact, and appears in the graph with probability pij, independently of other edges. The SBM is hence parametrized by p = [pij]_{1≤i,j≤K} and α = (α1, . . . , αK).
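The observation model just described can be sketched in a few lines of Python. This is a toy illustration of ours, not code from the paper; the function names and parameters are our own. The oracle returns a fresh Bernoulli draw every time a pair is queried, so the same pair may be sampled repeatedly, which is exactly what the adaptive setting below allows:

```python
import random

def make_sbm(n, alpha, p, seed=0):
    """Assign nodes to K hidden clusters with proportions alpha.

    Returns the hidden cluster label of each node and an adaptive sampling
    oracle: querying a pair (v, w) returns a fresh Bernoulli(p[i][j]) draw,
    where i and j are the clusters of v and w.
    """
    rng = random.Random(seed)
    K = len(alpha)
    labels = []
    for k in range(K):
        labels += [k] * int(alpha[k] * n)  # alpha_k * n assumed integral here

    def sample_edge(v, w):
        i, j = labels[v], labels[w]
        return 1 if rng.random() < p[i][j] else 0

    return labels, sample_edge

# Binary symmetric toy example: two clusters of 5 nodes each.
labels, sample_edge = make_sbm(10, [0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]])
# The same pair may be queried repeatedly (adaptive sampling, with replacement):
draws = [sample_edge(0, 1) for _ in range(100)]
```

Since nodes 0 and 1 belong to the same cluster, the empirical frequency of positive draws concentrates near the intra-cluster probability 0.9.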
We assume that the relative cluster sizes α do not depend on the network size n, whereas on the contrary, p may vary with n. Most existing work on the SBM and its extensions investigates the problem of recovering the clusters from an observed realization of the random graph. In contrast, in this paper, we are interested in active learning scenarios where the interaction between pairs of nodes may be sampled sequentially, which allows a given node pair to be sampled several times. In these scenarios, the algorithm samples edges in an adaptive manner: in a given round, the edge selected to be sampled may depend on the information gathered previously, and should the algorithm select the edge (v, w) ∈ Vi × Vj, it observes a Bernoulli r.v. with mean pij, independent of the previous observations. The algorithm has an observation budget of T samples (typically depending on the network size), and after collecting these samples, it should return estimates of the clusters. The objective is to devise a joint sampling and clustering algorithm such that the estimated clusters are as accurate as possible. Specifically, we aim at characterizing the minimal cluster recovery error rate for a given observation budget T. Adaptive sampling can be critical in clustering tasks where collecting edge samples is expensive (e.g., in biology, one has to run tedious experiments to assess whether two proteins share similarities). For such tasks, it is important to discover clusters with a minimum number of samples (these in turn need to be selected in an adaptive manner).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Even for the simple binary symmetric SBM (i.e., two clusters of equal sizes) with non-adaptive sampling, obtaining an explicit expression for the minimal number of mis-classified nodes remains illusory, especially when the graph is sparse, i.e., when the pij's scale as 1/n (see e.g.
[6, 14]). Hence, in this paper, we restrict our attention to models where only a vanishing fraction of nodes is allowed to be mis-classified. More precisely, for any s = o(n), we aim at identifying a necessary and sufficient condition on n, T, p and α such that there exists a joint adaptive sampling and clustering algorithm mis-classifying less than s nodes with high probability. This objective is more ambitious than just deriving conditions for weak consistency (also referred to as asymptotically accurate detection) [15, 12, 1], that is to say, conditions under which the proportion of mis-classified nodes vanishes as n grows large. Indeed, we are interested in the minimal recovery error rate. Further observe that deriving conditions for asymptotically exact recovery is part of our objective (these conditions are obtained by selecting s = 1).

Main results. We establish that under mild assumptions, for any s = o(n), a necessary and sufficient condition for the existence of a joint adaptive sampling and clustering algorithm mis-classifying less than s nodes w.h.p.1 is

lim inf_{n→∞} 2T D(p, α) / (n log(n/s)) ≥ 1,   (1)

where the divergence D(p, α) is defined as D(p, α) = max_{x∈X(α)} Δ(x, p), with

Δ(x, p) = min_{i,j: i≠j} Σ_{k=1}^{K} x_ik KL(p_ik, p_jk),  and
X(α) = { x = [x_ij] : α_i x_ij = α_j x_ji,  Σ_{i=1}^{K} Σ_{j=1}^{K} α_i x_ij = 1,  and x_ij ≥ 0 for all i, j },

and where KL(a, b) denotes the KL divergence between two Bernoulli distributions with respective means a and b. A consequence of this result is that when T = ω(n), the best possible joint sampling and clustering algorithm mis-classifies n exp(−(2T/n) D(p, α)(1 + o(1))) nodes.

Gains through adaptive sampling: exact recovery conditions and numerical experiments.
To illustrate the gain obtained by adaptive sampling, we can compare the conditions for asymptotically exact recovery with or without adaptive sampling. Consider the following binary symmetric SBM: K = 2, α = (1/2, 1/2), p11 = p22 = a f(n)/n, and p12 = p21 = b f(n)/n. For the classical cluster recovery problem in SBMs without adaptive sampling, one observes a realization of the random graph and hence one has T = n(n − 1)/2 observations (one per pair of nodes) to estimate the clusters. For the above binary symmetric SBM, asymptotically exact recovery [3] is possible if and only if either f(n) = ω(log(n)), or f(n) = log(n) and

max{ √a − √b,  √b − √a } > √2.   (Non-adaptive sampling)

Now with adaptive sampling and the same observation budget T = n(n − 1)/2, our results show that asymptotically exact recovery is feasible if and only if either f(n) = ω(log(n)), or f(n) = log(n) and

max{ a log(a/b) + b − a,  b log(b/a) + a − b } > 1/2.   (Adaptive sampling)

Figure 1 (left) presents the regions (described through a and b) where asymptotically exact recovery is feasible with or without adaptive sampling. Observe that adaptive sampling significantly enlarges the region where exact recovery is possible.

1 w.h.p. means that the probability tends to 1 as n grows large.

Figure 1: (Left) Regions where asymptotically exact recovery is possible for the binary symmetric SBM. The intra- and inter-cluster edge probabilities are a log(n)/n and b log(n)/n. Dark green: region with non-adaptive sampling, dark green+light green: region with adaptive sampling.
(Right) Recovery error rate of ASP (with adaptive sampling) in blue, and of the optimal clustering algorithm with random sampling [16] in red, for the binary symmetric SBM with n = 20000 nodes, p11 = 0.5 and p12 = 0.1. Figure produced using MATLAB's boxplot function with outliers (the red crosses).

To further illustrate the gain achieved with adaptive sampling, we compare in Figure 1 (right) the cluster recovery rate of ASP (Adaptive Spectral Partition), the proposed optimal joint sampling and clustering algorithm, with that of the optimal clustering algorithm with random sampling presented in [16]. This experiment concerns the above binary symmetric SBM with 20000 nodes and parameters p11 = p22 = 0.5, and p12 = p21 = 0.1. As soon as the sampling budget exceeds 450000, ASP exactly recovers the clusters, whereas a much larger budget is needed to achieve exact recovery if edges are sampled randomly.

Deriving fundamental limits. Existing information-theoretical limits derived in the classical SBM without adaptive sampling concern the expected number of mis-classified nodes [16, 7]. Here, we are interested in establishing lower bounds on the number of mis-classified nodes that hold with high probability, which is more challenging than establishing similar bounds in expectation. To derive the necessary condition (1), we leverage and combine change-of-measure arguments similar to those used in online stochastic optimization [10], as well as tools from hypothesis testing. In particular, we need to consider and enumerate a large number of hypotheses pertaining to the way nodes are allocated to the various clusters (these hypotheses concern allocations that differ from each other by more than s nodes). Such an enumeration is not required to derive lower bounds on the expected number of mis-classified nodes [16]. There, simple symmetry arguments can be exploited instead.

An explicit optimal sampling strategy.
As in some other sequential decision-making problems (e.g. bandit problems), the fundamental limits do not only characterize the performance of the best possible algorithm, but also provide insights into the design of such an algorithm. To devise a joint sampling and clustering strategy whose performance matches our fundamental limits, we first exploit the following interpretation of the divergence D(p, α) = max_{x∈X(α)} Δ(x, p) involved in our necessary condition. There, the vector x encodes the average number of samples of edges between the various clusters. More precisely, 2 x_ij T/n is the average number of samples of edges between a given node in cluster Vi and nodes in cluster Vj. With this interpretation, an optimal sampling strategy consists in allocating the observation budget T according to x*(p, α) = arg max_{x∈X(α)} Δ(x, p). Note that, interestingly, x*(p, α) does not depend on the total observation budget. However, the optimal budget allocation depends on the initially unknown SBM parameters (p, α). To devise an optimal joint sampling and clustering algorithm, we start, using a fraction of the observation budget, by estimating x*(p, α). More precisely, the proposed algorithm consists in three main steps: (i) first, we use a small fraction of the observation budget and spectral methods to obtain initial cluster estimates; (ii) the latter are then used to derive precise estimators of the SBM parameters, which in turn yield an estimate x̂* of x*(p, α); (iii) finally, x̂* and our initial cluster estimates dictate the way to sample edges with the remaining budget, and based on these observations, the cluster estimates are improved.

2 Related Work

Clustering in the SBM and its extensions has received a lot of attention recently.
Almost all studies concern the problem of recovering the clusters from a realization of the random graph generated under the SBM (one sample for each edge is observed). Nevertheless, it is interesting to summarize the results obtained in this simple non-adaptive setting. Results may be categorized depending on the targeted level of performance.

The lowest level of performance is often referred to as detectability, and requires that the extracted clusters should be only positively correlated with the true clusters. In fact, the question of detectability is mostly relevant in the case of the sparse SBM, where the intra- and inter-cluster edge probabilities p and q scale as 1/n, say p = a/n, q = b/n. In the case of the binary symmetric SBM, it has been established that a necessary and sufficient condition for detectability is a − b > √(2(a + b)) [5, 13, 11]. Refer to [1] for more recent results.

Asymptotically accurate recovery or weak consistency refers to scenarios where the proportion of mis-classified nodes vanishes as n grows large. Necessary and sufficient conditions for such recovery have been derived in [15, 12, 4]. Recently, the authors of [17, 7, 16] managed to quantify the optimal recovery rate when asymptotically accurate recovery is possible. Unfortunately, in these papers, the authors establish a lower bound for the expected number of mis-classified nodes, but provide an algorithm with guarantees valid with high probability. In this paper, we fix this gap, and develop new techniques to derive lower bounds valid with high probability.

The highest level of performance, asymptotically exact recovery, means that there is no mis-classified node asymptotically, with high probability. Necessary and sufficient conditions for exact recovery are provided in [2, 12, 8, 16].

In this paper, we cover both asymptotically accurate detection and exact recovery.
But unlike the aforementioned papers, we investigate the design of joint adaptive sampling and clustering algorithms. As far as we are aware, the only relevant reference for adaptive sampling in the SBM is [15], which only provides a condition for asymptotically accurate detection in homogeneous SBMs where the intra- and inter-cluster edge probabilities do not depend on the clusters (i.e., pii = p and pij = q for all i ≠ j). We manage to derive matching lower and upper bounds, valid with high probability, on the recovery rate for general SBMs. Our algorithm is very different from that developed in [15], since it is based on the explicit optimal sampling strategy revealed by our lower bounds on the cluster recovery rate (lower bounds that are lacking in [15]).

3 Fundamental Limits

This section is devoted to stating and proving a necessary condition for the existence of a joint sampling and clustering algorithm mis-classifying less than s = o(n) nodes with high probability. The derivation of the necessary condition combines hypothesis testing techniques and change-of-measure arguments, where we pretend that the observations are generated by models obtained by slightly modifying the true SBM model. More precisely, modified models are built by moving nodes from one cluster to another. Since the clusters have different sizes, the number of nodes moved from cluster Vi to cluster Vj should depend on i and j. As a consequence, the different resulting models should have different distribution vectors α. To deal with this asymmetry, we hence introduce the class of (s, β)-locally stable algorithms, defined as follows.

Definition 1 ((s, β)-locally stable algorithms).
A joint sampling and clustering algorithm π is (s, β)-locally stable at (p, α) if there exists a sequence η_n ≥ 0 with lim_{n→∞} η_n = 0 such that for all partition vectors α̃ with ||α̃ − α||₂ ≤ β, π mis-classifies at most s nodes with probability greater than 1 − η_n for any n. (Note that the definition makes sense even if p depends on n.)

We derive our necessary condition for (s, β)-locally stable algorithms. Considering (s, β)-locally stable algorithms is not restrictive, as good algorithms should adapt to the SBM and, in particular, to various possible proportions of nodes in the different clusters. Furthermore, the theorem below is valid for all β ≥ (s/n) log(n/s), and hence β can be made as small as we want when n grows large.

Theorem 1. Let s = o(n). Assume that there exists an (s, β)-locally stable clustering algorithm at (p, α) for β ≥ (s/n) log(n/s). Then we have: lim inf_{n→∞} 2T D(p, α) / (n log(n/s)) ≥ 1.

To establish Theorem 1, we consider an (s, β)-locally stable algorithm, and assume that the corresponding budget allocation is defined by x (representing the expected number of samples gathered from a node to all clusters). We then exhibit a large number M of hypotheses, each defined by an allocation of nodes to clusters. These hypotheses correspond to allocations differing from each other by more than s nodes. We enumerate the hypotheses and quantify M as a function of n and the SBM parameters. Using the fact that the algorithm is (s, β)-locally stable, we can find a worst hypothesis (not corresponding to the true allocation of nodes) occurring with probability less than η_n/M. Next, we apply a change-of-measure argument.
Specifically, we pretend that the observations are generated by a (perturbed) allocation built from that corresponding to the worst hypothesis. We study the log-likelihood ratio of the observations under the true and the perturbed allocations. Combining this analysis with the fact that the worst hypothesis occurs with probability less than η_n/M, we conclude that the number of nodes from Vj actually classified in Vi must roughly exceed α_j n exp(−(2T/n) Σ_{k=1}^K x_ik KL(p_ik, p_jk)). This implies that at least n exp(−(2T/n) Δ(x, p)) nodes are mis-classified. Now optimizing this lower bound over x, we deduce that at least n exp(−(2T/n) D(p, α)) nodes are mis-classified. The complete proof is presented in the Appendix. There, we also provide the proof for the binary symmetric SBM (this proof helps the understanding of that for general SBMs).

4 The Adaptive Spectral Partition Algorithm

In this section, we present the Adaptive Spectral Partition (ASP) algorithm, whose pseudo-code is given in Algorithm 1, and prove that it mis-classifies less than s = o(n) nodes w.h.p. whenever this is at all possible, i.e., when (1) holds.

4.1 Algorithm and its optimality

The design of the ASP algorithm leverages the results derived to establish the fundamental performance limits. In particular, we know that an optimal sampling strategy corresponds to x*(p, α) = arg max_{x∈X(α)} Δ(x, p).
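As an illustration, the optimization defining x*(p, α) can be carried out numerically. The sketch below is ours (a brute-force grid search, not the linear program ASP actually solves): it computes D(p, α) = max_{x∈X(α)} Δ(x, p) for K = 2 and α = (1/2, 1/2), where the constraints of X(α) reduce to x12 = x21 and x11/2 + x22/2 + x12 = 1:

```python
import math
from itertools import product

def kl_bern(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def delta(x, p, alpha):
    """Delta(x, p): min over ordered cluster pairs i != j of
    sum_k x[i][k] * KL(p[i][k], p[j][k])."""
    K = len(alpha)
    return min(
        sum(x[i][k] * kl_bern(p[i][k], p[j][k]) for k in range(K))
        for i, j in product(range(K), repeat=2) if i != j
    )

def optimal_allocation_k2(p, grid=200):
    """Grid search for x* = argmax_x Delta(x, p) over X(alpha), with K = 2
    and alpha = (1/2, 1/2).  The constraints alpha_i x_ij = alpha_j x_ji and
    sum_ij alpha_i x_ij = 1 leave x11 and x22 as free parameters."""
    alpha = (0.5, 0.5)
    best, best_x = -1.0, None
    for s in range(grid + 1):          # x11 ranges over [0, 2]
        for t in range(grid + 1):      # x22 ranges over [0, 2]
            x11, x22 = 2 * s / grid, 2 * t / grid
            x12 = 1 - (x11 + x22) / 2  # from the normalisation constraint
            if x12 < 0:
                continue
            x = [[x11, x12], [x12, x22]]
            val = delta(x, p, alpha)
            if val > best:
                best, best_x = val, x
    return best, best_x

# Toy parameters of ours: intra 0.5, inter 0.1.
p = [[0.5, 0.1], [0.1, 0.5]]
D, x_star = optimal_allocation_k2(p)
```

Note that the resulting x* depends only on (p, α) and never on T, in line with the discussion above; for these symmetric toy parameters the search concentrates the budget on intra-cluster edges, and D coincides with max{KL(p11, p12), KL(p12, p11)}.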
ASP consists in three main steps: (i) first, we use a small fraction of the observation budget and apply spectral methods to obtain initial cluster estimates (Line 1 in Algorithm 1); (ii) the latter are then used to derive precise estimators of the SBM parameters, which in turn yield an estimate x̂* of x*(p, α) (Lines 2 and 3); (iii) finally, x̂* dictates the way to sample edges with the remaining budget, and based on these additional observations, the cluster estimates are improved (Lines 4 and 5). The complexity of ASP is polynomial in both n and T. Indeed, Step 1, including the spectral clustering algorithm, requires O(T log(n)) operations. Step 2 requires O(T) operations to estimate the parameters, and Step 3 solves a linear program where the number of variables is K², which does not scale with n and T. The remaining steps simply check the log-likelihood values of each node, which requires O(T) computations. Overall, the computational complexity of ASP is O(T log n).

We analyze the performance of ASP under the following mild assumptions, essentially stating some kind of homogeneity of the SBM parameters associated to the various clusters. There exist positive constants κ_L and κ_U such that

(A1) |log( p_ik(1 − p_jk) / (p_jk(1 − p_ik)) )| ≤ κ_U for all i, j, k,
(A2) κ_L ≤ |log( p_ik / p_jk )| for all i, j, k.

We emphasize that no other assumptions are made on p.

Theorem 2. Assume that (A1) and (A2) hold. Let s = o(n). The ASP algorithm mis-classifies less than s nodes with high probability if lim inf_{n→∞} 2T D(p, α) / (n log(n/s)) ≥ 1.

The above theorem is proved in the next subsection.
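To make step (ii) concrete, here is a sketch of ours of how (α̂, p̂) could be estimated from initial cluster estimates and uniformly sampled pairs. Algorithm 1 normalises instead by the expected number of sampled pairs given the budget δT, so this empirical-frequency version is only an approximation of the estimator actually used:

```python
def estimate_sbm(S, samples):
    """Empirical estimates of (alpha, p) from initial cluster estimates S
    (a list of node lists) and uniformly sampled pairs.

    `samples` is a list of (v, w, outcome) triples with outcome in {0, 1}.
    p_hat[i][j] is the empirical frequency of positive observations between
    S[i] and S[j]; entries with no observations default to 0.0.
    """
    n = sum(len(s) for s in S)
    K = len(S)
    cluster_of = {v: i for i, s in enumerate(S) for v in s}
    alpha_hat = [len(s) / n for s in S]
    pos = [[0] * K for _ in range(K)]
    tot = [[0] * K for _ in range(K)]
    for v, w, o in samples:
        i, j = cluster_of[v], cluster_of[w]
        for a, b in {(i, j), (j, i)}:  # count both orientations once
            pos[a][b] += o
            tot[a][b] += 1
    p_hat = [[pos[i][j] / tot[i][j] if tot[i][j] else 0.0 for j in range(K)]
             for i in range(K)]
    return alpha_hat, p_hat

# Toy data of ours: two estimated clusters and six sampled pairs.
S = [[0, 1, 2], [3, 4, 5]]
samples = [(0, 1, 1), (0, 2, 1), (1, 2, 0), (0, 3, 0), (1, 4, 0), (2, 5, 1)]
alpha_hat, p_hat = estimate_sbm(S, samples)
```

With the toy data, 2 of the 3 intra-cluster samples are positive and 1 of the 3 inter-cluster samples is, so p̂11 = 2/3 and p̂12 = p̂21 = 1/3.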
In addition, the following lemma, proved in the Appendix, directly implies that the ASP algorithm is (s, β)-locally stable at (p, α) for β = (s/n) log(n/s) when (1) holds.

Lemma 1. Assume that (A1) and (A2) hold. For all α and α̃, |D(p, α) − D(p, α̃)| / (D(p, α) ||α̃ − α||₂) = O(1).

Algorithm 1 Adaptive Spectral Partition(T, δ, K)

1. Initial random observations.
   Sample T/(4 log(T/n)) edges uniformly at random without replacement and compute δ ← (log( e(V,V)T / (d(V,V)n) ))^{−1}
   Sample δT/4 − T/(4 log(T/n)) additional edges uniformly at random without replacement
   Extract S1, . . . , SK using the spectral clustering algorithm of [16]
2. Estimating the SBM parameters. Estimate α̂ and p̂ from the observations made in 1. and the extracted clusters S1, . . . , SK:
   α̂_i = |S_i|/n and p̂_ii = (4 e(S_i, S_i)/(δT)) · n(n−1)/(|S_i|(|S_i|−1)) for all 1 ≤ i ≤ K, and p̂_ij = (4 e(S_i, S_j)/(δT)) · n(n−1)/(2|S_i||S_j|) for all i ≠ j.
3. Computing the optimal sampling strategy. Solve x̂* ∈ arg max_{x∈X(α̂)} Δ(x, p̂)
4. First round of cluster improvement.
   p̂ ← max_{i,j} p̂_ij; e(A, B) ← 0 for all A, B (i.e., reset all pairs)
   Randomly observe 2(1 − δ/2) x̂*_ij T/n edges between node v ∈ S_i and nodes in S_j, for all v ∈ S_i and all 1 ≤ i, j ≤ K
   for all 1 ≤ i ≤ K do
     V̂_i = ∅
     for all v ∈ S_i do
       Add v to V̂_i when max_{1≤k≤K} | e(v, S_k) − 2(1 − δ/2) x̂*_ik p̂_ik T/n | ≤ (δ/4) p̂ T/n
     end for
   end for
5. Second round of cluster improvement.
   e(A, B) ← 0 for all A, B (i.e., reset all pairs)
   for all v ∈ ∪_{k=1}^K (S_k \ V̂_k) do
     Randomly select δT/(4K(n − Σ_{k=1}^K |V̂_k|)) nodes from V̂_i for all i, and observe the edges between v and the selected nodes
     v is assigned to V̂_{k*} where k* = arg max_{1≤i≤K} Σ_{k=1}^K [ e(v, S_k) log(p̂_ik) + ( δT/(4K(n − Σ_{k=1}^K |V̂_k|)) − e(v, S_k) ) log(1 − p̂_ik) ]
   end for

4.2 The steps of ASP and their analysis

We describe below the various steps of the ASP algorithm, and analyze their performance. The proofs of all lemmas are postponed to the Appendix. In Algorithm 1, the pseudo-code of ASP, e(A, B) denotes the number of positive observations between nodes in A and nodes in B, and d(A, B) denotes the number of sampled edges between A and B.

4.2.1 Initial random observations

The first task of the algorithm is to collect δT/4 random edge samples so as to build initial cluster estimates and approximate the SBM parameters (p, α), which in turn will lead to an approximate optimal sampling strategy x*(p, α). To output initial cluster estimates, we plan to use the spectral decomposition algorithm from [16], and to exploit the results therein about its performance. With this goal in mind, the parameter δ must be set so that the initial cluster estimates have an appropriate level of accuracy. To set δ, we first estimate p = Σ_{i=1}^K Σ_{j=1}^K α_i α_j p_ij from the first T/(4 log(T/n)) samples by e(V,V)/d(V,V). We then let δ = (log( e(V,V)T / (d(V,V)n) ))^{−1}, which is approximately equal to (log(pT/n))^{−1}. The spectral decomposition algorithm outputs S1, . . . , SK, which in view of Theorem 2 in [16], satisfy with high probability, and for some constant C > 0,

|∪_{i=1}^K V_i \ S_i| ≤ n exp( −C (pT/n) / log(pT/n) ).   (2)

4.2.2 Estimating the SBM parameters

Next, using the initial cluster estimates, ASP approximates the SBM parameters (p, α) by (p̂, α̂) (refer to Line 2 in Algorithm 1). The previous step extracted the hidden clusters with at most n exp( −C (pT/n) / log(pT/n) ) mis-classified nodes. This observation directly implies that, for any k,

|α̂_k − α_k| ≤ exp( −C (pT/n) / log(pT/n) ).

In addition, we can show that the initial cluster estimates also lead to a very accurate estimate of p, as stated in the following lemma.

Lemma 2. Assume that (A1) and (A2) hold. When |∪_{i=1}^K V_i \ S_i| ≤ n exp( −C (pT/n) / log(pT/n) ) and pT/n = ω(1), with high probability, |p_ij − p̂_ij| / p_ij = O( log(Tp/n)/(Tp/n) + 1/√n ).

4.2.3 Computing the optimal sampling strategy

ASP now computes x̂* ∈ arg max_{x∈X(α̂)} Δ(x, p̂). The main idea behind ASP is to use x̂* to define which edges should be sampled. However, x̂* defines how many times edges from cluster V_i to cluster V_j should be sampled, and these clusters are for now only approximated by S_i and S_j. Hence using x̂* induces two sources of errors: first, x̂* is inexact, and second, when randomly sampling an edge from v ∈ V_i to S_j, the binary outcome is a Bernoulli r.v. with mean p̄_ij rather than p_ij, where

p̄_ij = (1/|S_j|) Σ_{k=1}^K p_ik |S_j ∩ V_k|.

Let p̄ = [p̄_ij]_{i,j}. The following lemma is instrumental to bound the impact of these errors:

Lemma 3. Assume that (A1) and (A2) hold.
When |∪_{i=1}^K V_i \ S_i| ≤ n exp( −C (pT/n) / log(pT/n) ) and pT/n = ω(1), with high probability,

|p_ij − p̂_ij| / p_ij = O( log(Tp/n)/(Tp/n) + 1/√n ),  and
|Δ(x*(p̂, α̂), p̄) − Δ(x*(p, α), p)| / Δ(x*(p, α), p) = O( log(Tp/n)/(Tp/n) + 1/√n ).

4.2.4 First round of cluster improvement

In this step, the ASP algorithm first re-sets the values of e(A, B) to 0 for all sets of nodes A and B. It then randomly samples edges according to the sampling strategy x̂*. More precisely, for every v ∈ S_i, for each k = 1, . . . , K, it randomly selects 2(1 − δ/2) x̂*_ik T/n edges from v to S_k. Due to the re-set, after these new samples, e(v, S_k) is a sum of independent Bernoulli r.v. with mean p̄_jk when v ∈ V_j.

In this first round of cluster improvement, ASP classifies a node v ∈ S_i only if the values e(v, S_k) clearly indicate its cluster. More precisely, v ∈ S_i is classified in V̂_i only if:

max_{1≤k≤K} | e(v, S_k) − 2(1 − δ/2) x̂*_ik p̂_ik T/n | ≤ (δ/4) p̂ T/n.   (3)

Note that E[e(v, S_k)] = 2(1 − δ/2) x̂*_ik p̄_ik T/n when v ∈ S_i ∩ V_i. When v ∈ S_i ∩ V_i, v satisfies (3) with probability at least 1 − exp( −(pT/n)/(log(pT/n))³ ) from the Chernoff-Hoeffding inequality.
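The acceptance test (3) itself is a simple deviation check around the expected counts. A sketch of ours, with toy numbers of our own choosing:

```python
def first_round_keep(e_v, i, x_star, p_hat, p_max, T, n, delta):
    """Test (3): keep node v (currently placed in S_i) in cluster i only if
    every observed count e_v[k] = e(v, S_k) is close to its expected value
    under the hypothesis that v indeed belongs to V_i."""
    dev = max(abs(e_v[k] - 2 * (1 - delta / 2) * x_star[i][k] * p_hat[i][k] * T / n)
              for k in range(len(e_v)))
    return dev <= (delta / 4) * p_max * T / n

# Toy numbers (ours): n = 100, T = 10000, delta = 0.2, two clusters.
x_star = [[1.0, 0.0], [0.0, 1.0]]
p_hat = [[0.5, 0.1], [0.1, 0.5]]
p_max = 0.5
# Expected intra-cluster count for a node of cluster 0:
# 2 * (1 - 0.1) * 1.0 * 0.5 * 100 = 90, and the tolerance is 0.05 * 0.5 * 100 = 2.5.
ok = first_round_keep([88, 0], 0, x_star, p_hat, p_max, 10000, 100, 0.2)
bad = first_round_keep([40, 0], 0, x_star, p_hat, p_max, 10000, 100, 0.2)
```

A node whose counts sit within the tolerance of the expected profile is kept (`ok`); one whose counts deviate a lot is deferred to the second round (`bad`).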
Therefore, from (2) and the Markov inequality, with high probability,

|S_i \ V̂_i| ≤ |S_i \ V_i| + |(V_i ∩ S_i) \ V̂_i| ≤ α_i n exp( −(pT/n)/(log(pT/n))⁴ ).

All the nodes in ∪_{k=1}^K (S_k \ V̂_k) will be classified in the second round of cluster improvement.

When v ∈ S_i ∩ V_j for j ≠ i, E[e(v, S_k)] = 2(1 − δ/2) x̂*_ik p̄_jk T/n, which could be far from 2(1 − δ/2) x̂*_ik p̂_ik T/n. Thus, to satisfy (3), e(v, S_k) has to deviate from its mean a lot. From the Chernoff-Hoeffding inequality, we show that (3) holds with probability at most exp( −(2T/n)(Δ(x*(p̂, α̂), p̄) + O(δp)) ). Thus, using Lemma 3 and the Markov inequality, with high probability, |V̂_i \ V_i| ≤ α_i n exp( −2(T/n)·D(p, α) + O(δpT/n) ).

In summary, we have:

Lemma 4. Assume that (A1) and (A2) hold. After the first round of cluster improvement of the ASP algorithm, we have, with high probability, for all 1 ≤ i ≤ K,

|S_i \ V̂_i| ≤ α_i n exp( −(pT/n)/(log(pT/n))⁴ )  and  |V̂_i \ V_i| ≤ α_i n exp( −2(T/n)·D(p, α) + O(δpT/n) ).

4.2.5 Second round of cluster improvement

Finally, we use the δT/4 remaining samples to classify the nodes in ∪_{k=1}^K (S_k \ V̂_k) that were not assigned to a cluster in the previous round. From Lemma 4, there are at most n exp( −(pT/n)/(log(pT/n))⁴ ) such nodes.
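The maximum-likelihood assignment performed in this round (Line 5 of Algorithm 1) can be sketched as follows; this is our own re-expression, with the per-cluster sample counts d_v[k] made explicit:

```python
import math

def ml_assign(e_v, d_v, p_hat):
    """Second-round assignment: Bernoulli log-likelihood of observing e_v[k]
    positives out of d_v[k] samples toward each cluster k, maximised over the
    candidate cluster i of node v."""
    K = len(p_hat)

    def loglik(i):
        return sum(e_v[k] * math.log(p_hat[i][k])
                   + (d_v[k] - e_v[k]) * math.log(1 - p_hat[i][k])
                   for k in range(K))

    return max(range(K), key=loglik)

# Toy numbers (ours): 20 samples toward each cluster, with many positives
# toward cluster 1, matching cluster 1's profile (0.1, 0.5).
p_hat = [[0.5, 0.1], [0.1, 0.5]]
k_star = ml_assign([2, 11], [20, 20], p_hat)
```

The node is assigned to the cluster whose edge-probability profile best explains its observed counts.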
Hence, we can use at least δT / ( 4n exp( −(pT/n)/(log(pT/n))⁴ ) ) samples to classify each remaining node.

In the second round of cluster improvement, the ASP algorithm classifies nodes using the maximum likelihood with p̂, which is very similar to the greedy improvement step of the spectral clustering algorithm of [16]. Define:

D_R(p, α) = min_{i,j: i≠j} D_+(p_i, p_j, α)  with
D_+(p_i, p_j, α) = min_{y∈[0,1]^K} max{ Σ_{k=1}^K α_k KL(y_k, p_ik),  Σ_{k=1}^K α_k KL(y_k, p_jk) }.

From the definition of D_R(p̂, α̂), we can show as in [16] that v is mis-classified only when

Σ_{k=1}^K d(v, S_k) KL( e(v, S_k)/d(v, S_k), p̂_ik ) ≥ d(v, V) D_R(p̂, α̂).

From the Chernoff-Hoeffding inequality, we analyze the probability of the above event and conclude:

Lemma 5. The second round of cluster improvement generates at most n exp( −(pT/n)/(log(pT/n))⁵ ) mis-classified nodes with high probability under (A1) and (A2).

So far we have analyzed the number of mis-classified nodes generated in the first round and the second round of cluster improvement. In the first round, the ASP algorithm generates at most n exp( −2(T/n)·D(p, α) + O(δpT/n) ) mis-classified nodes, and in the second round, at most n exp( −(pT/n)/(log(pT/n))⁵ ) mis-classified nodes. Note that the number of mis-classified nodes in the second round is negligible compared to that in the first round. We thus conclude that the ASP
We thus conclude that the ASP algorithm outputs cluster estimates with at most
$$\left| \cup_{i=1}^K \hat{\mathcal{V}}_i \setminus \mathcal{V}_i \right| \le n\, e^{-2\frac{T}{n}\cdot D(p,\alpha) + O(\frac{\delta p T}{n})}$$
mis-classified nodes, which concludes the proof of Theorem 2, in view of our choice of $\delta$.

5 Conclusion

In this paper, we derived a necessary condition for the existence of a joint sampling and clustering algorithm mis-classifying fewer than $s = o(n)$ nodes w.h.p. in the SBM. This derivation revealed the optimal sampling strategy, and allowed us to devise ASP, an algorithm that mis-classifies fewer than $s = o(n)$ nodes w.h.p. when the necessary condition holds. To our knowledge, this is the first time the optimal cluster recovery rate has been characterized in the SBM with adaptive sampling. Being able to characterize the optimal sampling strategy is promising, and opens up new research directions. In particular, we could now investigate various online optimization problems involving random graphs generated by the SBM and using adaptive edge sampling.

Acknowledgments

S. Yun was supported by Korea Electric Power Corporation (grant number R18XA05). A. Proutiere was partially supported by the Wallenberg Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.

References

[1] Emmanuel Abbe. Community detection and stochastic block models. Foundations and Trends in Communications and Information Theory, 14(1-2):1-162, 2018.

[2] Emmanuel Abbe, Afonso S. Bandeira, and Georgina Hall. Exact recovery in the stochastic block model. CoRR, abs/1405.3267, 2014.

[3] Emmanuel Abbe and Colin Sandon. Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery.
In Proceedings of the 2015 IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS), pages 670-688, 2015.

[4] Emmanuel Abbe and Colin Sandon. Recovering communities in the general stochastic block model without knowing the parameters. In Advances in Neural Information Processing Systems 28, pages 676-684. Curran Associates, Inc., 2015.

[5] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová. Inference and phase transitions in the detection of modules in sparse networks. Phys. Rev. Lett., 107, August 2011.

[6] Yash Deshpande, Emmanuel Abbe, and Andrea Montanari. Asymptotic mutual information for the binary stochastic block model. In IEEE International Symposium on Information Theory, ISIT 2016, Barcelona, Spain, July 10-15, 2016, pages 185-189, 2016.

[7] Chao Gao, Zongming Ma, Anderson Y. Zhang, and Harrison H. Zhou. Achieving optimal misclassification proportion in stochastic block models. J. Mach. Learn. Res., 18(1):1980-2024, January 2017.

[8] Bruce Hajek, Yihong Wu, and Jiaming Xu. Achieving exact cluster recovery threshold via semidefinite programming: Extensions. IEEE Trans. Inf. Theor., 62(10):5918-5937, October 2016.

[9] Paul Holland, Kathryn Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109-137, 1983.

[10] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math., 6(1):4-22, March 1985.

[11] Laurent Massoulié. Community detection thresholds and the weak Ramanujan property. In Proceedings of the Forty-sixth Annual ACM Symposium on Theory of Computing, pages 694-703, 2014.

[12] Elchanan Mossel, Joe Neeman, and Allan Sly. Consistency thresholds for the planted bisection model. In Proceedings of the Forty-seventh Annual ACM Symposium on Theory of Computing, pages 69-75, 2015.

[13] Elchanan Mossel, Joe Neeman, and Allan Sly.
Reconstruction and estimation in the planted partition model. Probability Theory and Related Fields, 162(3-4):431-461, 2015.

[14] Elchanan Mossel and Jiaming Xu. Density evolution in the degree-correlated stochastic block model. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, pages 1319-1356, 2016.

[15] Se-Young Yun and Alexandre Proutiere. Community detection via random and adaptive sampling. In Proceedings of The 27th Conference on Learning Theory, COLT 2014, Barcelona, Spain, June 13-15, 2014, pages 138-175, 2014.

[16] Se-Young Yun and Alexandre Proutiere. Optimal cluster recovery in the labeled stochastic block model. In Advances in Neural Information Processing Systems, pages 965-973, 2016.

[17] Anderson Y. Zhang and Harrison H. Zhou. Minimax rates of community detection in stochastic block models. Ann. Statist., 44(5):2252-2280, October 2016.