{"title": "Overlapping Clustering Models, and One (class) SVM to Bind Them All", "book": "Advances in Neural Information Processing Systems", "page_first": 2126, "page_last": 2136, "abstract": "People belong to multiple communities, words belong to multiple topics, and books cover multiple genres; overlapping clusters are commonplace. Many existing overlapping clustering methods model each person (or word, or book) as a non-negative weighted combination of \"exemplars\" who belong solely to one community, with some small noise. Geometrically, each person is a point on a cone whose corners are these exemplars. This basic form encompasses the widely used Mixed Membership Stochastic Blockmodel of networks and its degree-corrected variants, as well as topic models such as LDA. We show that a simple one-class SVM yields provably consistent parameter inference for all such models, and scales to large datasets. Experimental results on several simulated and real datasets show our algorithm (called SVM-cone) is both accurate and scalable.", "full_text": "Overlapping Clustering Models, and One (class) SVM\n\nto Bind Them All\n\nXueyu Mao, Purnamrita Sarkar, Deepayan Chakrabarti\n\nThe University of Texas at Austin\n\nxmao@cs.utexas.edu, purna.sarkar@austin.utexas.edu, deepay@utexas.edu\n\nAbstract\n\nPeople belong to multiple communities, words belong to multiple topics, and books\ncover multiple genres; overlapping clusters are commonplace. Many existing over-\nlapping clustering methods model each person (or word, or book) as a non-negative\nweighted combination of \u201cexemplars\u201d who belong solely to one community, with\nsome small noise. Geometrically, each person is a point on a cone whose corners\nare these exemplars. This basic form encompasses the widely used Mixed Member-\nship Stochastic Blockmodel of networks [1] and its degree-corrected variants [16],\nas well as topic models such as LDA [9]. 
We show that a simple one-class SVM yields provably consistent parameter inference for all such models, and scales to large datasets. Experimental results on several simulated and real datasets show our algorithm (called SVM-cone) is both accurate and scalable.

1 Introduction

Clustering has many real-world applications: market segmentation, product recommendation, document clustering, finding protein complexes in gene networks, among others. The simplest form of a clustering model assumes that every record or entity belongs to exactly one cluster. More general forms allow for overlapping clusters, where each entity may belong to different clusters or communities to different degrees. For example, George Orwell's 1984 belongs to both the dystopian fiction and political fiction genres, and Pink Floyd's music is both progressive and psychedelic. In this paper, we show that many existing overlapping clustering models can be written in a general form, whose parameters can then be inferred using a one-class SVM.

In many clustering problems, overlapping or otherwise, we have access to a data matrix Ẑ ∈ R^{n×m}, which is a noisy version of an ideal matrix Z, i.e. Ẑ = Z + R where the norm of the rows of R is small. Also, Z = GZ_P, where Z_P holds ideal "exemplars" of the various communities, and G ∈ R^{n×K}_{≥0} gives the community memberships of each entity. We will now give some examples.

Consider the Stochastic Blockmodel (SBM) [13] for networks. In this model, each node belongs to one of K communities, and the probability P_ij of an edge between nodes i and j is a function of their respective communities. Recent results [21] show that the eigenvectors V̂ of the adjacency matrix concentrate row-wise around the eigenvectors V of P. The matrix V is also blockwise constant, mapping all nodes in one cluster to one point [27]. Hence, V̂ = GV_P + R, where G ∈ {0,1}^{n×K} is a binary membership matrix in which each row sums to one. The Mixed Membership Stochastic Blockmodel (MMSB) [1] relaxes this by allowing the entries of G to lie in [0, 1]. Since the rows of G sum to one, the ideal matrix Z arranges points in a simplex. The corners of this simplex represent the "pure" nodes, i.e. nodes belonging to exactly one community. Most algorithms first find the corners, and then estimate model parameters via regression [20, 21, 16, 23, 28]. Other notable methods include tensor-based approaches [14, 2], Bayesian inference [12], etc. Related models and inference methods for overlapping networks have been presented in [24, 17, 26, 19], etc.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

(a) Simplex methods can fail  (b) Normalized points  (c) Supporting hyperplane

Figure 1: (a) Simplex-based corner-finding methods require points on a simplex, with uniformly small errors. Projecting points to a simplex with normal vector q works well, but a very similar q′ does not. Some points (such as A) get projected to far-off points, amplifying errors in their positions. (b) Instead, we normalize points to the unit sphere, and (c) find corners from the support vectors of a one-class SVM.

The MMSB model does not allow for degree heterogeneity, which can be achieved via the Degree-corrected Mixed Membership Stochastic Blockmodel (DCMMSB) [16]. In DCMMSB, each node has an extra degree parameter, with a high parameter value leading to more edges for that node. Now, G is non-negative, but its rows do not sum to one. Thus the points lie inside a cone, and the pure nodes lie on the corner rays of this cone. 
Other network models also give rise to such cones [32, 17].\nExisting algorithms for degree corrected overlapping models use a range of different techniques.\nOCCAM [32] uses a k-medians step on the regularized eigenvectors of the adjacency matrix to get\nthe corners. While the algorithm is computationally ef\ufb01cient, a key assumption is that the k-medians\nloss function attains its minimum at the locations of the pure nodes and there is a curvature around\nthis minimum. This condition is typically hard to check. In [16], the authors show an interesting\nresult that the second to K eigenvectors, element-wise divided by the \ufb01rst eigenvector entries form\na simplex. The authors provide an algorithm for \ufb01nding this simplex with K corners in K \u2212 1\ndimensions. The algorithm requires a combinatorial search step, which is prohibitive for large K.\nTopic models [9] are another example of overlapping clustering models. Here the documents can be\ngenerated from a mixture of topics, which are the analog of communities in networks. The normalized\nword co-occurrence matrix forms a simplex structure, with the corners representing anchor words, i.e.\nwords that belong to exactly one topic. While there are many existing inference methods, the ones\nthat provide consistency guarantees are typically based on analyzing tensors or \ufb01nding corners in\nsimplexes [4, 18, 11, 3, 15, 7, 6, 8].\nIn this paper, we provide an overarching framework which incorporates all the above problems, from\nMixed membership models (with or without degree correction) to topic models. As discussed before,\nin all the above models, the ideal data matrix lies inside a cone (a simplex is a special type of a cone).\nThe goal is to infer G, which depends on \ufb01nding the correct corner rays.\nLet us illustrate why seemingly obvious methods fail to obtain the corner rays. 
The simplest idea would be to pick a random hyperplane (the "simplex") and project each point onto it along the ray from the origin through that point, as shown in Figure 1a. Corners of this simplex correspond to the corner rays. However, extending the idea to the sample or empirical cone is difficult: if the simplex is poorly chosen, some points can get projected arbitrarily far away, which amplifies their errors. As Figure 1a shows, the set of good simplexes may be quite limited, and finding a good simplex is difficult.

We will illustrate our idea with the "ideal cone." First, we row-normalize the ideal data matrix to have unit ℓ2 norm (similar to Ng et al. [22], Qin and Rohe [25]). This projects all points inside the cone to the surface of the sphere, with the points on the corner rays being projected to the corners (Figure 1b). Then, we show a rather fascinating result, namely, that for all the above models, the corners can be obtained via the support vectors of a one-class SVM [29], where all normalized points are in the positive class and the origin is in the negative class (Figure 1c). Observe that a hyperplane through the corners separates all the points from the origin. We also show that if the row-wise error of R is small, the SVM approach can be used to infer G from empirical cones. Finally, we show that since the row-wise error of R is indeed small for various degree-corrected overlapping network models and topic models, we can use our algorithm to infer their parameters consistently. We provide error bounds for parameter estimates at the per-node and per-word level, in contrast to typical bounds for the entire parameter matrix. 
We conclude with experimental results on simulated and real datasets.

2 Proposed work

Consider a population matrix P of the form P = ρΓΘBΘᵀΓ, with Γ ∈ R^{n×n}_{>0} being a positive diagonal matrix, Θ ∈ R^{n×K}_{≥0} a community-membership matrix, and B ∈ R^{K×K}_{≥0} a cross-community connection matrix. We will make the following assumptions, which are common in the literature:

Assumption 2.1. (a) Pure nodes: Each community has at least one "pure" node, which belongs solely to that community. (b) Non-zero rows: No row of Θ is identically 0. (c) B is full rank.

The form of the population matrix P, alongside Assumption 2.1, induces a conic structure on the rows of the eigenvectors of P.

Lemma 2.1. Let there be K communities (rank(P) = K), and let I be the indices of K pure nodes, one from each community. Let P = VEVᵀ be the top-K eigen-decomposition of P, where the columns of V ∈ R^{n×K} are the K principal eigenvectors and E ∈ R^{K×K} is a diagonal matrix of the K principal eigenvalues. Then, V = ΓΘΓ_P⁻¹V_P, where V_P = V(I,:) is full rank and Γ_P = Γ(I,I).

Since (ΓΘΓ_P⁻¹)_ij ≥ 0 for all (i, j), the rows of V fall within a cone with corners V_P. This suggests the following idealized problem:

Problem 1 (Ideal cone problem). We are given a matrix Z ∈ R^{n×m} such that Z = MY_P, where M ∈ R^{n×K}_{≥0}, no row of M is 0, and Y_P ∈ R^{K×m} corresponds to K (unknown) rows of Z, each scaled to unit ℓ2 norm. Infer M from Z.

The rows of Y_P are unit vectors representing the corner rays of the cone. Each row of Z is constructed from a non-negative weighted combination of these unit vectors, with the weights given by the corresponding rows of M. Rows of Z that lie on some corner correspond to rows of M that have zero in all but one component. 
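Lemma 2.1 is easy to sanity-check numerically. The sketch below (all parameter values are made-up illustrations, not from the paper) builds a small population matrix of the stated form and verifies the conic identity:

```python
import numpy as np

# Small numerical check of Lemma 2.1 on an illustrative DCMMSB-style
# population matrix (all parameter values here are made up).
n, K = 8, 3
rho = 0.5
Gamma = np.diag(np.linspace(0.5, 1.2, n))        # positive degree parameters
Theta = np.array([[1.0, 0.0, 0.0],               # rows 0-2 are pure nodes (I = {0,1,2})
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [0.5, 0.5, 0.0],
                  [0.2, 0.3, 0.5],
                  [0.1, 0.1, 0.8],
                  [0.6, 0.2, 0.2],
                  [0.3, 0.4, 0.3]])              # non-negative, no row is 0
B = np.array([[1.0, 0.2, 0.1],
              [0.2, 0.9, 0.3],
              [0.1, 0.3, 0.8]])                  # full-rank connection matrix
P = rho * Gamma @ Theta @ B @ Theta.T @ Gamma    # rank-K population matrix

# Top-K eigen-decomposition P = V E V^T.
evals, evecs = np.linalg.eigh(P)
idx = np.argsort(-np.abs(evals))[:K]
V, E = evecs[:, idx], np.diag(evals[idx])

# Lemma 2.1: V = Gamma Theta Gamma_P^{-1} V_P, with V_P = V(I,:), Gamma_P = Gamma(I,I).
I = [0, 1, 2]
V_P = V[I, :]
Gamma_P = Gamma[np.ix_(I, I)]
M_cone = Gamma @ Theta @ np.linalg.inv(Gamma_P)  # non-negative conic weights
assert np.allclose(V, M_cone @ V_P)
assert (M_cone >= -1e-10).all()                  # rows of V lie in the cone spanned by V_P
```

The identity holds for any orthonormal basis of the top-K eigenspace, so the sign/rotation ambiguity of the eigen-solver does not affect the check.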
Observe that M is invariant to the choice of the K corner rows of Z used to construct Y_P.

Now consider solving the ideal cone problem with the eigenvector matrix, i.e., Z = V. From Lemma 2.1, the corner rows correspond to the pure nodes. Choosing one such row from each corner gives us a set of pure node indices I. Hence, M = ΓΘΓ_P⁻¹N_P⁻¹, where N is a diagonal matrix with N_ii = 1/‖e_iᵀZ‖ and N_P = N(I,I). We also have the identity ρΓ_P BΓ_P = V_P E V_Pᵀ. Coupled with model-specific identifiability conditions (details are provided in the supplementary material), these can be used to infer Θ and ρB (Γ are typically considered nuisance parameters).

In practice, we only have an observation matrix A that is stochastically generated from the population matrix P. Hence, we must actually solve:

Problem 2 (Empirical cone problem). We are given a matrix Ẑ ∈ R^{n×m} such that max_{i∈[n]} ‖e_iᵀ(Y − Ŷ)‖₂ ≤ ε, where Y = NZ is the row-normalized version of Z, and Ŷ is constructed similarly from Ẑ. Again, Z = MY_P, where M ≥ 0, no row of M is 0, and Y_P = Y(I,:) corresponds to K (unknown) rows of Y with indices I. Infer M from Ẑ.

We will first present the solution to the ideal cone problem. We will then show that the same algorithm, with some post-processing, solves the empirical cone problem up to O(ε) error. Finally, we apply our algorithm to infer parameters for a variety of models, and present error bounds for each.

Notation: We shall refer to the ith row of Z as z_iᵀ, expressed using a column vector, i.e., z_i = Zᵀe_i. The same pattern will be used for the rows m_iᵀ of M, and for other matrices as well.

2.1 The Ideal Cone Problem

Observe that given the corner indices I (i.e., given Y_P), finding M such that Z = MY_P is a simple regression problem. 
Thus, the only difficulty is in finding the corner indices.

Our key insight is that under certain conditions, the ideal cone problem can be solved easily by a one-class SVM applied to the rows of Y. Figure 1 plots the normalized rows y_i of Y for an example cone. Observe that a hyperplane through the corners separates all the points from the origin. This suggests that the normalized corners are the support vectors found by a one-class SVM:

maximize b   s.t.   wᵀy_i ≥ b (for i = 1, ..., n) and ‖w‖₂ ≤ 1.    (1)

We show next that this intuition is correct. Define the following condition.

Condition 1. The matrix Y_P satisfies (Y_P Y_Pᵀ)⁻¹1 > 0.

Theorem 2.2. Each support vector selected by the one-class SVM (Eq. 1) is a corner of Z. Also, if Condition 1 holds, there is at least one support vector for every corner.

Thus, under Condition 1, we can get all the corners from the support vectors, and then find M via regression of Z on Y_P. Condition 1 is always satisfied for our problem setting, as shown next.

Theorem 2.3. Let P be a population matrix satisfying Assumption 2.1. Let Z = V, where V is the rank-K eigenvector matrix. Let Y = NZ as defined above. Then, Condition 1 is true.

Thus, the ideal cone problem is easily solved by a one-class SVM. Next, we show that the same method suffices for the empirical cone problem too.

2.2 The Empirical Cone Problem

Now, instead of the normalized eigenvector rows Y, we are given the empirical matrix Ŷ with rows ŷ_iᵀ = ẑ_iᵀ/‖ẑ_i‖, where max_i ‖e_iᵀ(Ŷ − Y)‖ ≤ ε. Once again, we focus on finding the corner indices, using which M can be inferred by regression. We will show that running a one-class SVM on the rows of Ŷ yields "near-corners," after some post-processing. We will need a stronger form of Condition 1:

Condition 2. 
The matrix Y_P satisfies (Y_P Y_Pᵀ)⁻¹1 ≥ η1 for some constant η > 0.

It is easy to show that the solution (w, b) of the population SVM under Condition 1 is given by

w = b⁻¹ · Y_Pᵀβ,   b = (1ᵀ(Y_P Y_Pᵀ)⁻¹1)^{−1/2},   β = (Y_P Y_Pᵀ)⁻¹1 / (1ᵀ(Y_P Y_Pᵀ)⁻¹1).    (2)

Thus, Condition 1 implies that w is a convex combination of the corners, while Condition 2 additionally requires a minimum contribution from each corner.

Lemma 2.4 (SVM solution is nearly ideal). Let (ŵ, b̂) be the solution of the one-class SVM (Eq. 1) applied to the rows of Ŷ. Under Condition 2, we have |b̂ − b| ≤ ε and ‖ŵ − w‖ ≤ ζε, for ζ = 4√K / (ηb²λ_K(Y_P Y_Pᵀ)) ≤ 4K / (η(λ_K(Y_P Y_Pᵀ))^{1.5}).

Unlike the ideal cone scenario, the rows Ŷ_P corresponding to the corners need not be support vectors for the empirical cone. However, they are not far off.

Lemma 2.5 (Corners are nearly support vectors). The corners of the population cone are close to the supporting hyperplane: b̂1 ≤ Ŷ_P ŵ ≤ b̂1 + (ζ + 2)ε1.

This suggests that we should consider all points that are up to (ζ + 2)ε away from the supporting hyperplane when searching for corners. The next lemma shows that each such point is a "near-corner." Recall that each row ŷ_iᵀ is a noisy version of a population row y_iᵀ = m_iᵀY_P/‖m_iᵀY_P‖, which can be rewritten as a scaled convex combination of the normalized corners: y_iᵀ = r_i φ_iᵀ Y_P, where r_i = m_iᵀ1/‖m_iᵀY_P‖ and φ_i = m_i/(m_iᵀ1), so that φ_iᵀ1 = 1. For a corner, r_i = 1 and φ_i = e_j for some j. We now show that every point i that is close to the supporting hyperplane is nearly a corner of the ideal cone.

Lemma 2.6 (Points close to support vectors are near-corners). If ŵᵀŷ_i ≤ b̂ + (ζ + 2)ε for some point i ∈ [n], then 1 ≤ r_i ≤ 1 + (ζ + 4)ε/(b − ε) and φ_ij ≥ 1 − 2ζε/(bλ_K(Y_P Y_Pᵀ)) for some j ∈ [K].

Consider the set of points S_c = {i | ŵᵀŷ_i ≤ b̂ + (ζ + 2)ε} that are close to the supporting hyperplane. Lemmas 2.5 and 2.6 show that S_c contains all corners, and possibly other points that are all near-corners. This suggests that we can cluster the vectors {ŷ_i | i ∈ S_c} into K clusters, each corresponding to one corner and possibly extra near-corners close to that corner. Randomly selecting one point from each cluster gives us the set of inferred corners.

Lemma 2.7 (Each corner has its own cluster). There exist exactly K clusters in S_c, as long as ε ≤ c_ε √η (λ_K(Y_P Y_Pᵀ))³ / K^{1.5}, for some global constant c_ε.

Let C be the indices of the near-corners picked by this clustering step. Since Z = MY_P, this suggests M can be obtained via regression: M ≈ ẐŶ_Cᵀ(Ŷ_C Ŷ_Cᵀ)⁻¹Π, where Ŷ_C := Ŷ(C,:) and Π is a permutation matrix that matches the ordering of the ideal corners and the empirical near-corners.

Theorem 2.8. If Condition 2 and the condition on ε in Lemma 2.7 hold, then for any i ∈ [n], ‖e_iᵀ(M − ẐŶ_Cᵀ(Ŷ_C Ŷ_Cᵀ)⁻¹Π)‖ ≤ c_M κ(Y_P Y_Pᵀ) ‖e_iᵀZ‖ K ζ ε / (λ_K(Y_P Y_Pᵀ))^{2.5}, where c_M is a global constant, and κ(·) is the ratio of the largest and smallest nonzero singular values of a matrix.

Algorithm 1 shows all the steps of our method (SVM-cone). The algorithm requires an estimate of δ := (ζ + 2)ε, and returns the inferred M and near-corners C. 
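For illustration, the whole pipeline (normalize, one-class SVM, cluster near-hyperplane points, regress) can be sketched in a few lines of Python. A Frank-Wolfe loop stands in for an off-the-shelf one-class SVM solver — by duality, w is the minimum-norm point of the convex hull of the rows of Ŷ, rescaled to unit length — and a greedy cosine rule stands in for the clustering step; the toy cone, iteration count, and tolerances below are assumptions of this sketch, not choices made in the paper.

```python
import numpy as np

def svm_cone(Z, K, delta=0.1, n_iter=50000):
    """Illustrative sketch of the SVM-cone procedure (not the authors' code)."""
    # Step 1: row-normalize to the unit sphere.
    Y = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    # Step 2: one-class SVM (Eq. 1) via Frank-Wolfe with exact line search:
    # the optimal w is the min-norm point of conv(rows of Y), rescaled to unit norm.
    w = Y.mean(axis=0)
    for _ in range(n_iter):
        s = Y[np.argmin(Y @ w)]            # vertex minimizing the linearized objective
        d = w - s
        dd = d @ d
        if dd < 1e-15:
            break
        g = min(max((d @ w) / dd, 0.0), 1.0)
        w = (1 - g) * w + g * s
    w = w / np.linalg.norm(w)
    margins = Y @ w
    b = margins.min()
    # Steps 3-4: greedy stand-in for clustering the near-hyperplane points:
    # a point starts a new cluster if it is not nearly collinear with a chosen corner.
    C = []
    for i in np.argsort(margins):
        if margins[i] > b + delta or len(C) == K:
            break
        if all(Y[i] @ Y[j] < 0.95 for j in C):
            C.append(i)
    # Step 5: regression for the conic combination weights.
    YC = Y[C]
    M_hat = Z @ YC.T @ np.linalg.inv(YC @ YC.T)
    return M_hat, C

# Toy ideal cone with corners on the coordinate axes (Condition 1 holds).
M = np.array([[1, 0, 0], [0, 2, 0], [0, 0, 1.5],
              [1, 1, 0], [1, 2, 3], [2, 1, 1]], dtype=float)
Z = M @ np.eye(3)                           # rows 0-2 are the pure rows
M_hat, C = svm_cone(Z, K=3)
assert sorted(C) == [0, 1, 2]               # only the corners are near-support points
```

On noiseless input the near-corner set contains exactly the pure rows, and the regression step recovers M up to a permutation of its columns.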
When the row-wise error bound ε is unknown, we can start with δ = 0 and incrementally increase it until K distinct clusters are found.

Algorithm 1 SVM-cone
Input: Ẑ ∈ R^{n×m}, number of corners K, estimated distance δ of corners from the hyperplane
Output: Estimated conic combination matrix M̂ and near-corner set C
1: Normalize the rows of Ẑ by their ℓ2 norm to get Ŷ with rows ŷ_iᵀ
2: Run a one-class SVM on the ŷ_i to get the normal ŵ and distance b̂ of the supporting hyperplane
3: Cluster the points {ŷ_i | ŵᵀŷ_i ≤ b̂ + δ} that are close to the hyperplane into K clusters
4: Pick one point from each cluster to get the near-corner set C
5: M̂ = ẐŶ_Cᵀ(Ŷ_C Ŷ_Cᵀ)⁻¹

3 Applications

Many network models and topic models have population matrices of the form P = ρΓΘBΘᵀΓ. We have already shown that in such cases, the eigenvector matrix V forms an ideal cone (Lemma 2.1), and that Condition 1 holds. It is easy to see that the same holds for VVᵀ as well. This suggests that SVM-cone can be applied to the matrix V̂V̂ᵀ, where V̂ is the empirical top-K eigenvector matrix. We shall show that this yields per-node error bounds in estimating community memberships and per-word error bounds for word-topic distributions.

3.1 Network models

Define a "DCMMSB-type" model as a model with population matrix P = ρΓΘBΘᵀΓ and an empirical adjacency matrix A with A_ji = A_ij ∼ Bernoulli(P_ij) for all i > j. Assume that the rows of Θ have unit ℓp norm, for p = 1 (DCMMSB) or p = 2 (OCCAM [32]). Let v_i = Vᵀe_i, v̂_i = V̂ᵀe_i, y_i = Vv_i/‖Vv_i‖, and ŷ_i = V̂v̂_i/‖V̂v̂_i‖. Denote γ_max = max_i Γ_ii and γ_min = min_i Γ_ii.

Theorem 3.1 (Small row-wise error in Network Models). 
Consider a DCMMSB-type model with θ_i ∼ Dirichlet(α), and α₀ := αᵀ1. If ν := α₀/α_min ≤ min( √(n/(27 log n)), (γ²_min/γ²_max) · λ*(B)√(nρ)/(2(1 + α₀)) ), γ²_min √(nρ)/ν ≥ 8(1 + α₀)(log n)^ξ for some constant ξ > 1, κ(ΘᵀΓ²Θ) = Θ(1), and α₀ = O(1), then

ε = max_i ‖y_i − ŷ_i‖ = Õ( γ_max min{K², (κ(P))²} K^{0.5} ν (1 + α₀) / (γ³_min λ*(B) √(nρ)) )

with probability at least 1 − O(Kn⁻²). Here λ*(B) is the smallest singular value of B.

Similar results for the non-Dirichlet case follow easily as long as nρ = Ω((log n)^{2ξ}), λ_K(P) = Ω(√(nρ)(log n)^ξ), and max_i ‖V(:, i)‖ = O(√ρ) with high probability. This shows that the rows of V̂V̂ᵀ are close to those of VVᵀ, and the latter forms an ideal cone satisfying Condition 1. Hence, the conic combination for each node can be recovered by Algorithm 1 applied to V̂V̂ᵀ. In fact, we can run the algorithm on V̂ itself; the output depends only on the SVM dual variables β (Eq. 2), which are the same whether the input is V̂ or V̂V̂ᵀ. The output is the same conic combination matrix M̂ and the same set C of nearly-pure nodes.

For identifiability of Θ, we need another condition. We will assume that Σ_i Γ_ii = n and all diagonal entries of B are equal (details are provided in the supplementary material). The next theorem shows that SVM-cone can be used to consistently infer the parameters of DCMMSB as well as OCCAM [32].

Theorem 3.2 (Consistent inference of community memberships for each node). 
Consider DCMMSB-type models where the conditions of Theorem 3.1 are satisfied and κ(ΘᵀΓ²Θ) = Θ(1). Let D̂ be a diagonal matrix with entries D̂_ii = √(e_iᵀ Ŷ_C V̂ Ê V̂ᵀ Ŷ_Cᵀ e_i). Let Θ̂ = F̂⁻¹M̂D̂, where F̂ is a diagonal matrix with entries F̂_ii = ‖e_iᵀM̂D̂‖₁ (for DCMMSB) and F̂_ii = ‖e_iᵀM̂D̂‖₂ (for OCCAM [32]). Then there exists a permutation matrix Π such that

‖e_iᵀ(Θ − Θ̂Π)‖ = Õ( γ_max K^{2.5} min{K², (κ(P))²} n^{3/2} / (γ_min η λ*(B) λ²_K(ΘᵀΓ²Θ) √ρ) )

with probability at least 1 − O(Kn⁻²).

Remark 3.1. The error bound is small when the clusters are well separated (large λ*(B)), the network is dense (large ρ), there are few blocks (small K), and the membership vectors Θ are drawn from a balanced Dirichlet distribution (small ν, and hence small κ(P)), which leads to balanced block sizes.

Remark 3.2. For DCMMSB-type models, η ≥ γ²_min min_i(e_iᵀΘᵀ1) / (γ²_max λ₁(ΘᵀΓ²Θ)). Also, under the conditions of Theorem 3.1, η ≥ γ²_min/(3νγ²_max) with high probability. Proofs are in the supplementary material.

Observe that these are per-node error bounds, as against a simpler bound on ‖Θ − Θ̂‖. Clearly, the same results extend to the special case of the Mixed Membership Stochastic Blockmodel [1] and the Stochastic Blockmodel [13] as well (the assumption of equal diagonal entries of B is no longer needed, since Γ_ii = 1 is enough for parameter identifiability [21]).

3.2 Topic Models

Let T ∈ R^{V×K}_{≥0} be a matrix of word-to-topic probabilities with unit column sums, and let H ∈ R^{K×D}_{≥0} be the topic-to-document matrix. 
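Given these definitions, the co-occurrence algebra used below can be checked numerically. The sketch assumes illustrative dimensions and random column-stochastic T and H (none of these values come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
V_words, K, D = 12, 3, 50
T = rng.random((V_words, K))
T = T / T.sum(axis=0, keepdims=True)      # word-topic matrix, unit column sums
H = rng.random((K, D))
H = H / H.sum(axis=0, keepdims=True)      # topic-document matrix
A_prob = T @ H                            # word-document probability matrix

Gamma = np.diag(T.sum(axis=1))            # Gamma_ii = ||T(i,:)||_1
Theta = np.linalg.inv(Gamma) @ T          # rows of Theta sum to one
B = H @ H.T / D

# The word co-occurrence probability matrix has the DCMMSB form.
lhs = A_prob @ A_prob.T / D
rhs = Gamma @ Theta @ B @ Theta.T @ Gamma
assert np.allclose(lhs, rhs)
assert np.allclose(Theta.sum(axis=1), 1.0)
```

The identity is purely algebraic (TH Hᵀ Tᵀ/D = (ΓΘ)(HHᵀ/D)(ΓΘ)ᵀ), so it holds for any non-negative T with nonzero rows.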
Then 𝒜 := TH is the probability matrix for words appearing in documents. The actual counts of words in documents are assumed to be generated iid as A_ij ∼ Binomial(N, 𝒜_ij) for i ∈ [V], j ∈ [D].

The word co-occurrence probability matrix is given by 𝒜𝒜ᵀ/D = T(HHᵀ/D)Tᵀ. Setting Γ_ii = ‖T(i,:)‖₁, Θ = Γ⁻¹T, and B = HHᵀ/D, we find that 𝒜𝒜ᵀ/D = ΓΘBΘᵀΓ with Θ1 = 1. This clearly matches the form of P in the DCMMSB model. Hence, its eigenvector matrix has the desired conic structure with weight matrix M = TΓ_P⁻¹N_P⁻¹, with the "pure nodes" being anchor words that occur in only a single topic. We now show that the row-wise error between the empirical and population eigenvector matrices decays with an increasing number of documents D and number of words N per document.

Assumption 3.1. Let g_ik = e_iᵀ𝒜𝒜ᵀe_k. We assume that whenever g_ik is not zero, it goes to infinity; in particular, g_ik ≥ N log max(V, D), which gives D/N → ∞. We also assume that λ_i(HHᵀ) = Θ(D) for i ∈ [K], and κ(TᵀT) = Θ(1).

These assumptions are similar to ones made in other theoretical literature on topic models [18]. We will construct a matrix A₁A₂ᵀ, where A₁ and A₂ are obtained by splitting the words in each document uniformly at random into two equal parts. This ensures that E[A₁A₂ᵀ] = N₁²𝒜𝒜ᵀ, which in turn helps establish concentration of the empirical singular vectors, as shown in the following lemma. For simplicity, denote N₁ = N/2.

(a) Varying degree heterogeneity for DCMMSB  (b) Varying sparsity for DCMMSB  (c) Varying sparsity for the OCCAM model  (d) Varying sparsity for SBMO [17]

Figure 2: Relative error in estimation of community memberships: Plots (a) and (b) compare SVM-cone against the closest baseline (GeoNMF) on the degree-corrected MMSB model. 
We then compare against (c) OCCAM and (d) SAAC on networks drawn from their generative models.

Lemma 3.3 (Small row-wise error in Topic Models). Let V̂ denote the matrix of the top-K singular vectors of U = A₁A₂ᵀ/N₁², and let V be its population counterpart. Let v_i = Vᵀe_i, v̂_i = V̂ᵀe_i, y_i = Vv_i/‖Vv_i‖, and ŷ_i = V̂v̂_i/‖V̂v̂_i‖. Under Assumption 3.1, we have:

ε = max_i ‖y_i − ŷ_i‖ = (1 / (min_j ‖e_jᵀT‖₁ √(λ_K(TᵀT)))) · O_P( √(K log max(V, D) / (DN)) )

with probability at least 1 − O(1/D²).

Thus, Algorithm 1 run on V̂V̂ᵀ (or equivalently, just V̂) can be used to find the conic combination weights M̂ ≈ M. Since M is the product of T with a diagonal matrix, and T has unit column sums, we can extract T̂ = M̂D̂⁻¹, where D̂ is a diagonal matrix with D̂_ii = ‖M̂(:, i)‖₁.

Theorem 3.4 (Consistent inference of word-topic probabilities for each word). Under Assumption 3.1, there exists a permutation matrix Π such that, with probability at least 1 − O(1/D²),

‖e_iᵀ(T̂ − TΠᵀ)‖ / ‖e_iᵀT‖ = O_P( (K⁴ max_j ‖e_jᵀT‖₁ / (η (min_j ‖e_jᵀT‖₁)²)) · √(log max(V, D) / (DN)) ).

Remark 3.3. We have η ≥ min_i ‖e_iᵀT‖₁/λ₁(TᵀT) ≥ min_i ‖e_iᵀT‖₁/K (supplementary material).

4 Experiments

We ran experiments on simulated and real-world datasets to verify the accuracy and scalability of SVM-cone. We compared SVM-cone against several competing baselines. For network models, GeoNMF detects the corners of a simplex formed by the MMSB model by constructing the graph Laplacian and picking nodes that have large norms in the Laplacian [20]. It assumes balanced communities (i.e., the rows of Θ are drawn from a Dirichlet with identical community weights). SVI uses stochastic variational inference for MMSB [12]. 
BSNMF [24] presents a Bayesian approach to Symmetric Nonnegative Matrix Factorization; it can be applied to do inference for MMSB models with B = cI where c ∈ [0, 1]. OCCAM works on a variant of MMSB where each row of Θ has unit ℓ2 norm, and the model allows for degree heterogeneity [32]. SAAC [17] uses alternating optimization on a version of the stochastic blockmodel where each node can be a member of multiple communities, but the membership weights are binary. For topic models, RecoverL2 [5] uses a combinatorial algorithm to pick anchor words from the word co-occurrence matrix and then recovers the word-topic vectors by optimizing a quadratic loss function. TSVD [7] uses a thresholded-SVD-based procedure to recover the topics. GDM [31] is a geometric algorithm that involves a weighted clustering procedure augmented with geometric corrections. We could not obtain the code for [18].

4.1 Networks with overlapping communities

In this section, we present experiments on simulated and large real networks.

4.1.1 Simulations

We test the recovery of the population parameters (Θ, B) given adjacency matrices A generated from the corresponding population matrices P (Γ are nuisance parameters).

(a) DBLP coauthorship  (b) DBLP bipartite  (c) Wall-clock time

Figure 3: Accuracy of estimated community memberships for (a) the DBLP coauthorship network and (b) the bipartite author-paper DBLP network. (c) The wall-clock time of the competing methods on the DBLP coauthorship network.

We generate networks with n = 5000 nodes and K = 3 communities. The rows of Θ are drawn from Dirichlet(α) for DCMMSB and OCCAM; for DCMMSB, α = (1/3, 1/3, 1/3); for OCCAM, α = (1/6, 1/6, 1/6) and the rows are normalized to have unit ℓ2 norm. We set B_ii = 1 and B_ij = 0.1 for all i ≠ j. 
The default degree parameters for DCMMSB are as follows: for all nodes i that are predominantly in the j-th community (θ_ij > 0.5), we set Γ_ii to 0.3, 0.5, and 0.7 for the 3 respective communities; all other nodes have Γ_ii = 1. For OCCAM, we draw the degree parameters from a Beta(1, 3) distribution.

Varying degree parameters Γ: We set the degree parameters for the predominant nodes in the 3 communities to 0.5 + ε_Γ, 0.5, and 0.5 + ε_Γ respectively. Figure 2a shows SVM-cone outperforms GeoNMF consistently for all choices of ε_Γ.

Varying network sparsity ρ: Figure 2b shows the relative error in estimating Θ as a function of the network sparsity ρ. Increasing ρ increases the average degree of nodes in the network without affecting the skew induced by their degree parameters Γ. As expected, all methods tend to improve with increasing degree. Our method dominates GeoNMF over the entire range of average degrees. Figures 2c and 2d show results for networks generated under the models used by OCCAM and SAAC respectively. SVM-cone is comparable or better than these methods even on their own generative models. The smaller error bars on SVM-cone show that it is more stable than SAAC.

4.1.2 Real-world experiments

We tested SVM-cone on large network datasets and word-document datasets. For networks, we used the 5 DBLP coauthorship networks¹ (used in [20]), where each ground-truth community corresponds to a group of conferences on the same topic. We also use bipartite author-paper variants of these 5 networks. Following [20], we evaluate results by the rank correlation between the predicted vector for community i and the true vector, averaged over all communities: RC_avg(Θ̂, Θ) = (1/K) max_σ Σ_{i=1}^K RC(Θ̂(:, i), Θ(:, σ(i))), where σ is a permutation over the K communities. 
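The simulation's sampling step can be sketched as follows. The sketch uses a smaller n and made-up degree parameters for speed (all specific values here are illustrative assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(7)
n, K, rho = 200, 3, 0.8
Theta = rng.dirichlet([1/3, 1/3, 1/3], size=n)    # mixed memberships, rows sum to one
B = np.full((K, K), 0.1) + 0.9 * np.eye(K)        # B_ii = 1, B_ij = 0.1
Gamma = np.diag(rng.uniform(0.5, 1.0, size=n))    # illustrative degree parameters

P = rho * Gamma @ Theta @ B @ Theta.T @ Gamma     # edge probabilities
U = rng.random((n, n))
A = (np.triu(U, 1) < np.triu(P, 1)).astype(int)   # sample the upper triangle
A = A + A.T                                       # symmetrize; diagonal stays zero
```

Sampling only the upper triangle and symmetrizing matches the model's constraint A_ji = A_ij ∼ Bernoulli(P_ij) for i > j.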
We have −1 ≤ RC_avg(Θ̂, Θ) ≤ 1, with higher numbers implying a better match between Θ̂ and Θ. To avoid thresholding issues with real-valued membership vectors, we do not use metrics such as NMI [30] or ExNVI [32], which require binary overlapping membership vectors.

We find that SVM-cone outperforms competing baselines on 2 of the 5 DBLP coauthorship datasets, and is similar on the remaining three (Figure 3a). The closest competitor is GeoNMF [20], which assumes that all nodes have the same degree parameter and that the community sizes are balanced. Both assumptions are reasonable for this dataset, since the number of coauthors (the degree) does not vary significantly among authors, and the communities are formed from conferences where no one conference dominates the others. The differences between SVM-cone and the competition are starker on the bipartite dataset (Figure 3b). There is severe degree heterogeneity: an author can be connected to many papers, while each paper has only a few authors. Our method is able to accommodate such differences between the nodes, and hence yields much better accuracy than the others.

Finally, Figure 3c shows the wall-clock time for running the various methods on the DBLP coauthorship networks (wall-clock time on the DBLP bipartite author-paper networks is included in the supplementary material). Our method is among the fastest.
This is expected: the only computationally intensive steps are the one-class SVM and the top-K eigendecomposition (or SVD), for which efficient and scalable off-the-shelf implementations already exist [10].

1http://www.cs.utexas.edu/~xmao/coauthorship

4.2 Topic Models

We generate semi-synthetic data following [5] and [7] using NIPS1, New York Times1 (NYT), PubMed1, and 20NewsGroup2 (20NG). Dataset statistics are included in the supplementary material. We use the Matlab R2018a built-in Gibbs sampling function for learning topic models to learn the word-by-topic matrix, which should retain the characteristics of real data distributions. We then draw the topic-document matrix from a Dirichlet with symmetric hyper-parameter 0.01. We set K = 40 for the first 3 datasets and K = 20 for 20NG. The word-count matrices are sampled with document lengths N = 1000, 300, 100, and 200 respectively, matching the mean document lengths of the real datasets. We evaluate the performance of the different algorithms using the ℓ1 reconstruction error (1/K) Σ_{i,j} |T(i, j) − T̂(i, π(j))|, where π(·) is a permutation function that matches the topics. Table 1 shows the ℓ1 reconstruction error and wall-clock running time of the different algorithms on datasets generated with different numbers of documents. Each setting is repeated 5 times, and we report the mean and standard deviation of the results. SVM-cone is much faster than the other methods. Its accuracy is comparable to RecoverL2, and significantly better than TSVD and GDM.
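The ℓ1 reconstruction error above requires a topic-matching permutation π. One simple way to find the best match is the Hungarian algorithm applied to pairwise ℓ1 distances; this is our choice of matching procedure for illustration, not necessarily the one used in the experiments:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def l1_topic_error(T, T_hat):
    """T, T_hat: (vocab x K) word-topic matrices.
    Returns (1/K) * sum_ij |T(i,j) - T_hat(i, pi(j))| for the best permutation pi."""
    K = T.shape[1]
    # cost[j, k] = l1 distance between true topic j and estimated topic k
    cost = np.abs(T[:, :, None] - T_hat[:, None, :]).sum(axis=0)
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one topic matching
    return cost[rows, cols].sum() / K
```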
The supplementary material also shows the top-10 words of 5 topics learned by SVM-cone for each dataset.

Table 1: ℓ1 reconstruction error and wall-clock time on semi-synthetic datasets

Corpus  Documents             RecoverL2          TSVD               GDM                SVM-cone
NIPS    20000     ℓ1 Error    0.059 (± 0.000)    0.237 (± 0.017)    0.081 (± 0.057)    0.071 (± 0.004)
                  Time/s      100.11 (± 8.81)    18.54 (± 2.04)     119.66 (± 4.41)    5.33 (± 0.39)
        40000     ℓ1 Error    0.043 (± 0.000)    0.250 (± 0.045)    0.061 (± 0.038)    0.051 (± 0.002)
                  Time/s      143.34 (± 0.53)    21.97 (± 1.49)     220.92 (± 3.10)    9.07 (± 0.00)
        60000     ℓ1 Error    0.036 (± 0.000)    0.269 (± 0.064)    0.059 (± 0.038)    0.041 (± 0.002)
                  Time/s      247.34 (± 20.84)   35.77 (± 3.28)     406.87 (± 36.57)   17.63 (± 5.29)
NYT     20000     ℓ1 Error    0.125 (± 0.000)    0.207 (± 0.025)    0.223 (± 0.008)    0.131 (± 0.003)
                  Time/s      78.15 (± 7.14)     25.11 (± 6.39)     193.43 (± 12.02)   4.51 (± 0.70)
        40000     ℓ1 Error    0.103 (± 0.000)    0.197 (± 0.045)    0.216 (± 0.010)    0.106 (± 0.001)
                  Time/s      140.84 (± 15.50)   50.18 (± 13.14)    394.16 (± 30.42)   8.04 (± 1.15)
        60000     ℓ1 Error    0.095 (± 0.000)    0.166 (± 0.028)    0.210 (± 0.010)    0.096 (± 0.002)
                  Time/s      184.69 (± 20.65)   42.96 (± 7.95)     595.54 (± 91.57)   11.82 (± 1.91)
PubMed  20000     ℓ1 Error    0.163 (± 0.000)    0.239 (± 0.032)    0.277 (± 0.051)    0.181 (± 0.002)
                  Time/s      54.32 (± 5.94)     15.75 (± 2.34)     205.95 (± 11.27)   2.06 (± 0.46)
        40000     ℓ1 Error    0.122 (± 0.000)    0.255 (± 0.018)    0.251 (± 0.041)    0.138 (± 0.001)
                  Time/s      78.99 (± 9.99)     26.44 (± 4.49)     459.17 (± 30.71)   3.73 (± 0.37)
        60000     ℓ1 Error    0.098 (± 0.000)    0.275 (± 0.041)    0.269 (± 0.052)    0.114 (± 0.001)
                  Time/s      98.19 (± 15.06)    24.57 (± 4.59)     649.97 (± 26.48)   5.44 (± 0.38)
20NG    20000     ℓ1 Error    0.100 (± 0.000)    0.111 (± 0.051)    0.137 (± 0.071)    0.090 (± 0.003)
                  Time/s      40.74 (± 0.64)     7.51 (± 0.42)      102.86 (± 4.05)    1.85 (± 0.26)
        40000     ℓ1 Error    0.074 (± 0.000)    0.081 (± 0.043)    0.131 (± 0.072)    0.064 (± 0.001)
                  Time/s      94.42 (± 9.92)     16.04 (± 2.28)     273.51 (± 16.45)   4.33 (± 0.71)
        60000     ℓ1 Error    0.058 (± 0.000)    0.133 (± 0.045)    0.096 (± 0.063)    0.052 (± 0.002)
                  Time/s      142.34 (± 20.31)   23.36 (± 5.85)     388.47 (± 43.22)   5.89 (± 0.67)

5 Conclusions

We showed that many distinct models for overlapping clustering can be placed under one general framework, where the data matrix is a noisy version of an ideal matrix and each row is a non-negative weighted sum of "exemplars." In other words, the connection probabilities of one node to others in a network are a non-negative combination of the connection probabilities of K "pure" nodes to others in the network. Each pure node is an exemplar of a single community, and we require one pure node from each of the K communities. Geometrically, this corresponds to a cone, with the pure nodes as its corners. This subsumes Mixed-Membership Stochastic Blockmodels and their degree-corrected variants, as well as commonly used topic models. We showed that a one-class SVM applied to the normalized rows of the data matrix can find both the corners and the weight matrix. We proved the consistency of our SVM-cone algorithm, and used it to develop consistent parameter inference methods for several widely used network and topic models.
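To make the corner-finding step concrete, here is a toy sketch of the idea on noiseless synthetic data (a simplified illustration under our own assumptions, not the authors' implementation): each row of a cone-structured matrix is row-normalized, and the support vectors of a linear one-class SVM fit to the normalized rows are candidates for the cone's corners.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
n, K, d = 300, 3, 5
corners = rng.random((K, d)) + 0.5           # hypothetical "pure" exemplar rows
G = rng.dirichlet(np.ones(K), size=n)        # non-negative mixing weights
G[:K] = np.eye(K)                            # plant one pure node per community
Z = G @ corners                              # every row lies on the cone

Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # row-normalize
svm = OneClassSVM(kernel="linear", nu=0.1).fit(Zn)
candidates = svm.support_                    # support vectors: near-corner rows

# support vectors should tend to be dominated by a single community
purity = G[candidates].max(axis=1)
```

In the full algorithm, the candidate rows are clustered into K corners, and the weight matrix is then recovered from the corners.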
Experiments on simulated and large real-world datasets show both the accuracy and the scalability of SVM-cone.

1https://archive.ics.uci.edu/ml/datasets/Bag+of+Words
2http://qwone.com/~jason/20Newsgroups/

Acknowledgments

X.M. and P.S. were partially supported by NSF grant DMS 1713082. D.C. was partially supported by a Facebook Faculty Research Award.

References

[1] E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. JMLR, 9:1981–2014, 2008.

[2] A. Anandkumar, R. Ge, D. Hsu, and S. M. Kakade. A tensor approach to learning mixed membership community models. JMLR, 15(1):2239–2312, 2014.

[3] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773–2832, 2014.

[4] S. Arora, R. Ge, R. Kannan, and A. Moitra. Computing a nonnegative matrix factorization – provably. In STOC, pages 145–162. ACM, 2012.

[5] S. Arora, R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, and M. Zhu. A practical algorithm for topic modeling with provable guarantees. In International Conference on Machine Learning, pages 280–288, 2013.

[6] P. Awasthi, B. Kalantari, and Y. Zhang. Robust vertex enumeration for convex hulls in high dimensions. In A. Storkey and F. Perez-Cruz, editors, International Conference on Artificial Intelligence and Statistics, volume 84, pages 1387–1396. PMLR, 09–11 Apr 2018.

[7] T. Bansal, C. Bhattacharyya, and R. Kannan. A provable SVD-based algorithm for learning topics in dominant admixture corpus. In Advances in Neural Information Processing Systems, pages 1997–2005, 2014.

[8] X. Bing, F. Bunea, and M. Wegkamp. A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics. arXiv preprint arXiv:1805.06837, 2018.

[9] D. M. Blei, A. Y. Ng, and M. I. Jordan.
Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

[10] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

[11] W. Ding, M. H. Rohban, P. Ishwar, and V. Saligrama. Topic discovery through data dependent and random projections. In International Conference on Machine Learning, pages 1202–1210, 2013.

[12] P. K. Gopalan and D. M. Blei. Efficient discovery of overlapping communities in massive networks. PNAS, 110(36):14534–14539, 2013.

[13] P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, June 1983. ISSN 0378-8733.

[14] S. B. Hopkins and D. Steurer. Bayesian estimation from few samples: community detection and related problems. In FOCS, pages 379–390. IEEE, 2017.

[15] K. Huang, X. Fu, and N. D. Sidiropoulos. Anchor-free correlated topic modeling: Identifiability and algorithm. In Advances in Neural Information Processing Systems, pages 1786–1794, 2016.

[16] J. Jin, Z. T. Ke, and S. Luo. Estimating network memberships by simplex vertex hunting. arXiv preprint arXiv:1708.07852, 2017.

[17] E. Kaufmann, T. Bonald, and M. Lelarge. A spectral algorithm with additive clustering for the recovery of overlapping communities in networks. In International Conference on Algorithmic Learning Theory, pages 355–370. Springer, 2016.

[18] Z. T. Ke and M. Wang. A new SVD approach to optimal topic estimation. arXiv preprint arXiv:1704.07016, 2017.

[19] P. Latouche, E. Birmelé, C. Ambroise, et al. Overlapping stochastic block models with application to the French political blogosphere. The Annals of Applied Statistics, 5(1):309–336, 2011.

[20] X. Mao, P. Sarkar, and D. Chakrabarti. On mixed memberships and symmetric nonnegative matrix factorizations.
In ICML, pages 2324–2333, 2017.

[21] X. Mao, P. Sarkar, and D. Chakrabarti. Estimating mixed memberships with sharp eigenvector deviations. arXiv preprint arXiv:1709.00407, 2017.

[22] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856, 2002.

[23] M. Panov, K. Slavnov, and R. Ushakov. Consistent estimation of mixed memberships with successive projections. In International Workshop on Complex Networks and their Applications, pages 53–64. Springer, 2017.

[24] I. Psorakis, S. Roberts, M. Ebden, and B. Sheldon. Overlapping community detection using Bayesian non-negative matrix factorization. Physical Review E, 83(6):066114, 2011.

[25] T. Qin and K. Rohe. Regularized spectral clustering under the degree-corrected stochastic blockmodel. In Advances in Neural Information Processing Systems, pages 3120–3128, 2013.

[26] A. Ray, J. Ghaderi, S. Sanghavi, and S. Shakkottai. Overlap graph clustering via successive removal. In 52nd Annual Allerton Conference, pages 278–285. IEEE, 2014.

[27] K. Rohe, S. Chatterjee, and B. Yu. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, pages 1878–1915, 2011.

[28] P. Rubin-Delanchy, C. E. Priebe, M. Tang, and J. Cape. A statistical interpretation of spectral embedding: the generalised random dot product graph. arXiv preprint arXiv:1709.05506, 2017.

[29] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.

[30] A. Strehl and J. Ghosh. Cluster ensembles – a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3(Dec):583–617, 2002.

[31] M. Yurochkin and X. Nguyen.
Geometric Dirichlet means algorithm for topic inference. In Advances in Neural Information Processing Systems, pages 2505–2513, 2016.

[32] Y. Zhang, E. Levina, and J. Zhu. Detecting overlapping communities in networks using spectral methods. arXiv preprint arXiv:1412.3432, 2014.