{"title": "Modeling Overlapping Communities with Node Popularities", "book": "Advances in Neural Information Processing Systems", "page_first": 2850, "page_last": 2858, "abstract": "We develop a probabilistic approach for accurate network modeling using node popularities within the framework of the mixed-membership stochastic blockmodel (MMSB). Our model integrates some of the basic properties of nodes in social networks: homophily and preferential connection to popular nodes. We develop a scalable algorithm for posterior inference, based on a novel nonconjugate variant of stochastic variational inference. We evaluate the link prediction accuracy of our algorithm on eight real-world networks with up to 60,000 nodes, and 24 benchmark networks. We demonstrate that our algorithm predicts better than the MMSB. Further, using benchmark networks we show that node popularities are essential to achieving high accuracy in the presence of skewed degree distribution and noisy links---both characteristics of real networks.", "full_text": "Modeling Overlapping Communities with\n\nNode Popularities\n\nPrem Gopalan1, Chong Wang2, and David M. Blei1\n\n1Department of Computer Science, Princeton University, {pgopalan,blei}@cs.princeton.edu\n\n2Machine Learning Department, Carnegie Mellon University, {chongw}@cs.cmu.edu\n\nAbstract\n\nWe develop a probabilistic approach for accurate network modeling using node\npopularities within the framework of the mixed-membership stochastic block-\nmodel (MMSB). Our model integrates two basic properties of nodes in social\nnetworks: homophily and preferential connection to popular nodes. We develop a\nscalable algorithm for posterior inference, based on a novel nonconjugate variant\nof stochastic variational inference. We evaluate the link prediction accuracy of our\nalgorithm on nine real-world networks with up to 60,000 nodes, and on simulated\nnetworks with degree distributions that follow a power law. 
We demonstrate that the AMP predicts significantly better than the MMSB.\n\n1 Introduction\n\nSocial network analysis is vital to understanding and predicting interactions between network entities [6, 19, 21]. Examples of such networks include online social networks, collaboration networks and hyperlinked blogs. A central problem in social network analysis is to identify hidden community structures and node properties that can best explain the network data and predict connections [19].\n\nTwo node properties underlie the most successful models that explain how network connections are generated. The first property is popularity. This is the basis for preferential attachment [12], according to which nodes preferentially connect to popular nodes. The resulting degree distributions from this process are known to satisfy empirically observed properties such as power laws [24]. The second property that underlies many network models is homophily or similarity, according to which nodes with similar observed or unobserved attributes are more likely to connect to each other. To best explain social network data, a probabilistic model must capture these competing node properties.\n\nRecent theoretical work [24] has argued that optimizing the trade-offs between popularity and similarity best explains the evolution of many real networks. It is intuitive that combining both notions of attractiveness, i.e., popularity and similarity, is essential to explain how networks are generated. For example, on the Internet a user's web page may link to another user due to a common interest in skydiving. The same user's page may also link to popular web pages such as Google.com.\n\nIn this paper, we develop a probabilistic model of networks that captures both popularity and homophily. 
To capture homophily, our model is built on the mixed-membership stochastic blockmodel (MMSB) [2], a community detection model that allows nodes to belong to multiple communities. (For example, a member of a large social network might belong to overlapping communities of neighbors, co-workers, and school friends.) The MMSB provides better fits to real network data than single community models [23, 27], but cannot account for node popularities.\n\nSpecifically, we extend the assortative MMSB [9] to incorporate per-community node popularity. We develop a scalable algorithm for posterior inference, based on a novel nonconjugate variant of stochastic variational inference [11]. We demonstrate that our model predicts significantly better\n\nFigure 1: We visualize the discovered community structure and node popularities in a giant component of the netscience collaboration network [22] (Left). Each link denotes a collaboration between two authors, colored by the posterior estimate of its community assignment. Each author node is sized by its estimated posterior popularity and colored by its dominant research community. The network is visualized using the Fruchterman-Reingold algorithm [7]. Following [14], we show an example where incorporating node popularities helps in accurately identifying communities (Right). The division of the political blog network [1] discovered by the AMP corresponds closely to the liberal and conservative blogs identified in [1]; the MMSB has difficulty in delineating these groups.\n\nthan the stochastic variational inference algorithm for the MMSB [9] on nine large real-world networks. Further, using simulated networks, we show that node popularities are essential for predictive accuracy in the presence of power-law distributed node degrees.\n\nRelated work. There have been several research efforts to incorporate popularity into network models. Karrer et al. 
[14] proposed the degree-corrected blockmodel that extends the classic stochastic blockmodels [23] to incorporate node popularities. Krivitsky et al. [16] proposed the latent cluster random effects model that extends the latent space model [10] to include node popularities. Both models capture node similarity and popularity, but assume that unobserved similarity arises from each node participating in a single community. Finally, the Poisson community model [4] is a probabilistic model of overlapping communities that implicitly captures degree-corrected mixed-memberships. However, the standard EM inference under this model drives many of the per-node community parameters to zero, which makes it ineffective for prediction or model metrics based on prediction (e.g., to select the number of communities).\n\n2 Modeling node popularity and similarity\n\nThe assortative mixed-membership stochastic blockmodel (MMSB) [9] treats the links or non-links y_{ab} of a network as arising from interactions between nodes a and b. Each node a is associated with community memberships \pi_a, a distribution over communities. The probability that two nodes are linked is governed by the similarity of their community memberships and the strength of their shared communities.\n\nGiven the communities of a pair of nodes, the link indicators y_{ab} are independent. We draw y_{ab} repeatedly by choosing a community assignment (z_{a\to b}, z_{a\leftarrow b}) for a pair of nodes (a, b), and drawing a binary value from a community distribution. Specifically, the conditional probability of a link in the MMSB is\n\np(y_{ab} = 1 | z_{a\to b}, z_{a\leftarrow b}, \beta) = \sum_{i=1}^{K} \sum_{j=1}^{K} z_{a\to b,i} z_{a\leftarrow b,j} \beta_{ij},\n\nwhere \beta is the blockmodel matrix of community strength parameters to be estimated. In the assortative MMSB [9], the non-diagonal entries of the blockmodel matrix are set close to 0. 
This captures node similarity in community memberships---if two nodes are linked, it is likely that the latent community indicators were the same.\n\nIn the proposed model, assortative MMSB with node popularities, or AMP, we introduce latent variables \theta_a to capture the popularity of each node a, i.e., its propensity to attract links independent of its community memberships. We capture the effect of node popularity and community similarity on link generation using a logit model\n\nlogit(p(y_{ab} = 1 | z_{a\to b}, z_{a\leftarrow b}, \theta, \beta)) \equiv \theta_a + \theta_b + \sum_{k=1}^{K} \delta^k_{ab} \beta_k,   (1)\n\nwhere we define indicators \delta^k_{ab} = z_{a\to b,k} z_{a\leftarrow b,k}. The indicator \delta^k_{ab} is one if both nodes assume the same community k.\n\nEq. 1 is a log-linear model [20]. In log-linear models, the random component, i.e., the expected probability of a link, has a multiplicative dependency on the systematic components, i.e., the covariates. This model is also similar in spirit to the random effects model [10]---the node-specific effect \theta_a captures the popularity of individual nodes while the \sum_{k=1}^{K} \delta^k_{ab} \beta_k term captures the interactions through latent communities. Notice that we can easily extend the predictor in Eq. 1 to include observed node covariates, if any.\n\nWe now define a hierarchical generative process for the observed link or non-link under the AMP:\n\n1. Draw K community strengths \beta_k ~ N(\mu_0, \sigma^2_0).\n2. For each node a,\n(a) Draw community memberships \pi_a ~ Dirichlet(\alpha).\n(b) Draw popularity \theta_a ~ N(0, \sigma^2_1).\n3. 
For each pair of nodes a and b,\n(a) Draw interaction indicator z_{a\to b} ~ \pi_a.\n(b) Draw interaction indicator z_{a\leftarrow b} ~ \pi_b.\n(c) Draw the link y_{ab} | z_{a\to b}, z_{a\leftarrow b}, \theta, \beta ~ logit^{-1}(z_{a\to b}, z_{a\leftarrow b}, \theta, \beta).\n\nUnder the AMP, the similarities between the nodes' community memberships and their respective popularities compete to explain the observations.\n\nWe can make the AMP simpler by replacing the vector of K latent community strengths \beta with a single community strength \beta. In §4, we demonstrate that this simpler model gives good predictive performance on small networks.\n\nWe analyze data with the AMP via the posterior distribution over the latent variables p(\pi_{1:N}, \theta_{1:N}, z, \beta_{1:K} | y, \alpha, \mu_0, \sigma^2_0, \sigma^2_1), where \theta_{1:N} represents the node popularities, and the posterior over \pi_{1:N} represents the community memberships of the nodes. With an estimate of this latent structure, we can characterize the network in many useful ways. Figure 1 gives an example. This is a subgraph of the netscience collaboration network [22] with N = 1460 nodes. We analyzed this network with K = 100 communities, using the algorithm from §3. This results in posterior estimates of the community memberships and popularities for each node and posterior estimates of the community assignments for each link. 
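The generative process above can be sketched in a few lines. This is an illustrative simulation only, not the authors' C++ implementation; the network size and hyperparameter values are arbitrary choices for the sketch.\n\n```python\nimport math\nimport random\n\nrandom.seed(0)\n\nN, K = 30, 3\nmu0, sigma2_0 = 0.0, 1.0   # prior over community strengths (assumed values)\nsigma2_1 = 1.0             # prior variance of popularities (assumed value)\nalpha = [1.0 / K] * K      # Dirichlet hyperparameter\n\ndef dirichlet(a):\n    g = [random.gammavariate(x, 1.0) for x in a]\n    s = sum(g)\n    return [x / s for x in g]\n\ndef categorical(p):\n    u, c = random.random(), 0.0\n    for k, pk in enumerate(p):\n        c += pk\n        if u <= c:\n            return k\n    return len(p) - 1\n\n# 1. Community strengths beta_k ~ N(mu0, sigma2_0)\nbeta = [random.gauss(mu0, math.sqrt(sigma2_0)) for _ in range(K)]\n# 2. Memberships pi_a ~ Dirichlet(alpha); popularities theta_a ~ N(0, sigma2_1)\npi = [dirichlet(alpha) for _ in range(N)]\ntheta = [random.gauss(0.0, math.sqrt(sigma2_1)) for _ in range(N)]\n\n# 3. For each pair, draw the two indicators, then the link via Eq. 1:\n#    the community strength contributes only when both indicators agree.\ny = {}\nfor a in range(N):\n    for b in range(a + 1, N):\n        za, zb = categorical(pi[a]), categorical(pi[b])\n        x = theta[a] + theta[b] + (beta[za] if za == zb else 0.0)\n        p_link = 1.0 / (1.0 + math.exp(-x))\n        y[(a, b)] = 1 if random.random() < p_link else 0\n```\n\nNote how a large popularity \theta_a raises the link probability for every pair involving a, regardless of community agreement.\n\n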
With these estimates, we visualized the discovered community structure and the popular authors.\n\nIn general, with an estimate of this latent structure, we can study individual links, characterizing the extent to which they occur due to similarity between nodes and the extent to which they are an artifact of the popularity of the nodes.\n\n3 A stochastic gradient algorithm for nonconjugate variational inference\n\nOur goal is to compute the posterior distribution p(\pi_{1:N}, \theta_{1:N}, z, \beta_{1:K} | y, \alpha, \mu_0, \sigma^2_0, \sigma^2_1). Exact inference is intractable; we use variational inference [13].\n\nTraditionally, variational inference is a coordinate ascent algorithm. However, the AMP presents two challenges. First, in variational inference the coordinate updates are available in closed form only when all the nodes in the graphical model satisfy conditional conjugacy. The AMP is not conditionally conjugate. To see this, note that the Gaussian priors on the popularity \theta and the community strengths \beta are not conjugate to the conditional likelihood of the data. Second, coordinate ascent algorithms iterate over all the O(N^2) node pairs, making inference intractable for large networks.\n\nWe address these challenges by deriving a stochastic gradient algorithm that optimizes a tractable lower bound of the variational objective [11]. Our algorithm avoids the O(N^2) computational cost per iteration by subsampling a \u201cmini-batch\u201d of random nodes and a subset of their interactions in each iteration [9].\n\n3.1 The variational objective\n\nIn variational inference, we define a family of distributions over the hidden variables q(\beta, \theta, \pi, z) and find the member of that family that is closest to the true posterior. 
We use the mean-field family, with the following variational distributions:\n\nq(z_{a\to b} = i, z_{a\leftarrow b} = j) = \phi^{ij}_{ab};  q(\pi_n) = Dirichlet(\pi_n; \gamma_n);  q(\beta_k) = N(\beta_k; \mu_k, \sigma^2_\beta);  q(\theta_n) = N(\theta_n; \lambda_n, \sigma^2_\theta).   (2)\n\nThe posterior over the joint distribution of link community assignments per node pair (a, b) is parameterized by the per-interaction memberships \phi_{ab},^1 the community memberships by \gamma, the community strength distributions by \mu and the popularity distributions by \lambda.\n\nMinimizing the KL divergence between q and the true posterior is equivalent to optimizing an evidence lower bound (ELBO) L, a bound on the log likelihood of the observations. We obtain this bound by applying Jensen's inequality [13] to the data likelihood. The ELBO is\n\nL = \sum_n E_q[\log p(\pi_n|\alpha)] - \sum_n E_q[\log q(\pi_n|\gamma_n)]\n+ \sum_n E_q[\log p(\theta_n|\sigma^2_1)] - \sum_n E_q[\log q(\theta_n|\lambda_n, \sigma^2_\theta)]\n+ \sum_k E_q[\log p(\beta_k|\mu_0, \sigma^2_0)] - \sum_k E_q[\log q(\beta_k|\mu_k, \sigma^2_\beta)]\n+ \sum_{a,b} E_q[\log p(z_{a\to b}|\pi_a)] + E_q[\log p(z_{a\leftarrow b}|\pi_b)] - E_q[\log q(z_{a\to b}, z_{a\leftarrow b}|\phi_{ab})]\n+ \sum_{a,b} E_q[\log p(y_{ab}|z_{a\to b}, z_{a\leftarrow b}, \theta, \beta)].   (3)\n\nNotice that the first three lines in Eq. 3 contain summations over communities and nodes; we call these global terms. They relate to the global parameters, which are (\gamma, \lambda, \mu). The remaining lines contain summations over all node pairs; we call these local terms. They relate to the local parameters, which are the \phi_{ab}. 
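Under the mean-field family in Eq. 2, expectations such as E_q[\log \pi_{n,k}], which the updates below require, have the standard closed form \psi(\gamma_{n,k}) - \psi(\sum_j \gamma_{n,j}), where \psi is the digamma function. A quick numeric check of this identity (the \gamma values are hypothetical, and \psi is approximated by differentiating lgamma):\n\n```python\nimport math\nimport random\n\nrandom.seed(1)\n\ndef digamma(x, h=1e-5):\n    # crude numerical derivative of log Gamma; adequate for x not too small\n    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)\n\ngamma_n = [2.0, 3.5, 1.5]   # hypothetical variational Dirichlet parameters\ntotal = sum(gamma_n)\nanalytic = [digamma(g) - digamma(total) for g in gamma_n]\n\n# Monte Carlo estimate of E_q[log pi_k] under Dirichlet(gamma_n),\n# sampling via normalized Gamma draws\nS = 100_000\nacc = [0.0] * len(gamma_n)\nfor _ in range(S):\n    draws = [random.gammavariate(g, 1.0) for g in gamma_n]\n    z = sum(draws)\n    for k, d in enumerate(draws):\n        acc[k] += math.log(d / z)\nmc = [a / S for a in acc]\n```\n\nThe two estimates agree to within Monte Carlo error, which is how the E_q[\log \pi] terms in the local step (Eq. 11 and Eq. 12) are evaluated cheaply.\n\n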
The distinction between the global and local parameters is important---the updates to global parameters depend on all (or many) local parameters, while the updates to local parameters for a pair of nodes only depend on the relevant global and local parameters in that context.\n\nEstimating the global variational parameters is a challenging computational problem. Coordinate ascent inference must consider each pair of nodes at each iteration, but even a single pass through the O(N^2) node pairs can be prohibitive. Previous work [9] has taken advantage of conditional conjugacy of the MMSB to develop fast stochastic variational inference algorithms. Unlike the MMSB, the AMP is not conditionally conjugate. Nevertheless, by carefully manipulating the variational objective, we can develop a scalable stochastic variational inference algorithm for the AMP.\n\n3.2 Lower bounding the variational objective\n\nTo optimize the ELBO with respect to the local and global parameters we need its derivatives. The data likelihood terms in the ELBO can be written as\n\nE_q[\log p(y_{ab}|z_{a\to b}, z_{a\leftarrow b}, \theta, \beta)] = y_{ab} E_q[x_{ab}] - E_q[\log(1 + \exp(x_{ab}))],   (4)\n\nwhere we define x_{ab} \equiv \theta_a + \theta_b + \sum_{k=1}^{K} \beta_k \delta^k_{ab}. The terms in Eq. 4 cannot be expanded analytically. To address this issue, we further lower bound -E_q[\log(1 + \exp(x_{ab}))] using Jensen's inequality [13],\n\n-E_q[\log(1 + \exp(x_{ab}))] \geq -\log[E_q(1 + \exp(x_{ab}))]\n= -\log[1 + E_q[\exp(\theta_a + \theta_b + \sum_{k=1}^{K} \beta_k \delta^k_{ab})]]\n= -\log[1 + \exp(\lambda_a + \sigma^2_\theta/2) \exp(\lambda_b + \sigma^2_\theta/2) s_{ab}],   (5)\n\nwhere we define s_{ab} \equiv \sum_{k=1}^{K} \phi^{kk}_{ab} \exp\{\mu_k + \sigma^2_\beta/2\} + (1 - \sum_{k=1}^{K} \phi^{kk}_{ab}). In simplifying Eq. 5, we have used that q(\theta_n) is a Gaussian. Using the mean of a log-normal distribution, we have E_q[\exp(\theta_n)] = \exp(\lambda_n + \sigma^2_\theta/2). A similar substitution applies for the terms involving \beta_k in Eq. 5.\n\nWe substitute Eq. 5 in Eq. 3 to obtain a tractable lower bound L' of the ELBO L in Eq. 3. This allows us to develop a coordinate ascent algorithm that iteratively updates the local and global parameters to optimize this lower bound on the ELBO.\n\nAlgorithm 1 The stochastic AMP algorithm\n1: Initialize variational parameters. See §3.5.\n2: while convergence criteria is not met do\n3:   Sample a mini-batch S of nodes. Let P be the set of node pairs in S.\n4:   local step\n5:   Optimize \phi_{ab} \forall (a, b) \in P using Eq. 11 and Eq. 12.\n6:   global step\n7:   Update memberships \gamma_a, for each node a \in S, using stochastic natural gradients in Eq. 6.\n8:   Update popularities \lambda_a, for each node a \in S, using stochastic gradients in Eq. 7.\n9:   Update community strengths \mu using stochastic gradients in Eq. 9.\n10:  Set \rho_a(t) = (\tau_0 + t_a)^{-\kappa}; t_a \leftarrow t_a + 1, for each node a \in S.\n11:  Set \rho'(t) = (\tau_0 + t)^{-\kappa}; t \leftarrow t + 1.\n12: end while\n\n1 Following [15], we use a structured mean-field assumption.\n\n3.3 The global step\n\nWe optimize the ELBO with respect to the global variational parameters using stochastic gradient ascent. Stochastic gradient algorithms follow noisy estimates of the gradient with a decreasing step-size. If the expectation of the noisy gradient equals the gradient and if the step-size decreases according to a certain schedule, then the algorithm converges to a local optimum [26]. Subsampling the data to form noisy gradients scales inference as we avoid the expensive all-pairs sums in Eq. 3.\n\nThe global step updates the global community memberships \gamma, the global popularity parameters \lambda and the global community strength parameters \mu with a stochastic gradient of the lower bound on the ELBO L'. 
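The Jensen bound of Eq. 5 in the previous subsection rests on the concavity of the logarithm (so -E[\log u] \geq -\log E[u]) together with the log-normal mean E[\exp(x)] = \exp(m + s^2/2) for Gaussian x. A small Monte Carlo check of the inequality, with illustrative values of the mean and variance standing in for the variational moments of x_{ab}:\n\n```python\nimport math\nimport random\n\nrandom.seed(2)\n\n# x ~ N(m, s^2) stands in for x_ab = theta_a + theta_b + sum_k beta_k delta_ab^k\nm, s = 0.5, 1.2   # assumed values, for illustration only\n\n# Monte Carlo estimate of -E[log(1 + exp(x))]\nS = 100_000\nmc = -sum(math.log(1.0 + math.exp(random.gauss(m, s))) for _ in range(S)) / S\n\n# Jensen lower bound: -log(1 + E[exp(x)]), using the log-normal mean\nbound = -math.log(1.0 + math.exp(m + s * s / 2.0))\n```\n\nThe bound sits below the Monte Carlo estimate, as it must; the gap is what the tractable objective L' gives up in exchange for closed-form gradients.\n\n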
In [9], the authors update community memberships of all nodes after each iteration by obtaining the natural gradients of the ELBO^2 with respect to the vector \gamma of dimension N \times K. We use natural gradients for the memberships too, but use distinct stochastic optimizations for the memberships and popularity parameters of each node and maintain a separate learning rate for each node. This restricts the per-iteration updates to nodes in the current mini-batch.\n\nSince the variational objective is a sum of terms, we can cheaply compute a stochastic gradient by first subsampling a subset of terms and then forming an appropriately scaled gradient. We use a variant of the random node sampling method proposed in [9]. At each iteration we sample a node uniformly at random from the N nodes in the network. (In practice we sample a \u201cmini-batch\u201d S of nodes per update to reduce noise [11, 9].) While a naive method will include all interactions of a sampled node as the observed pairs, we can leverage network sparsity for efficiency; in many real networks, only a small fraction of the node pairs are linked. Therefore, for each sampled node, we include as observations all of its links and a small uniform sample of m_0 non-links.\n\nLet \partial\gamma^t_a be the natural gradient of L' with respect to \gamma_a, and \partial\lambda^t_a and \partial\mu^t_k be the gradients of L' with respect to \lambda_a and \mu_k, respectively. Following [2, 9], we have\n\n\partial\gamma^t_{a,k} = -\gamma^{t-1}_{a,k} + \alpha_k + \sum_{(a,b) \in links(a)} \phi^{kk}_{ab}(t) + \sum_{(a,b) \in nonlinks(a)} \phi^{kk}_{ab}(t),   (6)\n\nwhere links(a) and nonlinks(a) correspond to the set of links and non-links of a in the training set. Notice that an unbiased estimate of the summation term over non-links in Eq. 6 can be obtained from a subsample of the node's non-links. Therefore, the gradient of L' with respect to the membership parameter \gamma_a, computed using all of the node's links and a subsample of its non-links, is a noisy but unbiased estimate of the natural gradient in Eq. 6.\n\n2 The natural gradient [3] points in the direction of steepest ascent in the Riemannian space. The local distance in the Riemannian space is defined by KL divergence, a better measure of dissimilarity between probability distributions than Euclidean distance [11].\n\nThe gradient of the approximate ELBO with respect to the popularity parameter \lambda_a is\n\n\partial\lambda^t_a = -\lambda^{t-1}_a / \sigma^2_1 + \sum_{(a,b) \in links(a) \cup nonlinks(a)} (y_{ab} - r_{ab} s_{ab}),   (7)\n\nwhere we define r_{ab} as\n\nr_{ab} \equiv \frac{\exp\{\lambda_a + \sigma^2_\theta/2\} \exp\{\lambda_b + \sigma^2_\theta/2\}}{1 + \exp\{\lambda_a + \sigma^2_\theta/2\} \exp\{\lambda_b + \sigma^2_\theta/2\} s_{ab}}.   (8)\n\nFinally, the stochastic gradient of L' with respect to the global community strength parameter \mu_k is\n\n\partial\mu^t_k = \frac{\mu_0 - \mu^{t-1}_k}{\sigma^2_0} + \frac{N}{2|S|} \sum_{(a,b) \in links(S) \cup nonlinks(S)} \phi^{kk}_{ab} (y_{ab} - r_{ab} \exp\{\mu_k + \sigma^2_\beta/2\}).   (9)\n\nAs with the community membership gradients, notice that an unbiased estimate of the summation term over non-links in Eq. 7 and Eq. 9 can be obtained from a subsample of the node's non-links. To obtain an unbiased estimate of the true gradient with respect to \mu_k, the summation over a node's links and non-links must be scaled by the inverse probability of subsampling that node in Eq. 9. Since each pair is shared between two nodes, and we use a mini-batch with |S| nodes, the summations over the node pairs are scaled by N/(2|S|) in Eq. 9.\n\nWe can interpret the gradients in Eq. 
7 and Eq. 9 by studying the terms involving r_{ab}. In Eq. 7, (y_{ab} - r_{ab} s_{ab}) is the residual for the pair (a, b), while in Eq. 9, (y_{ab} - r_{ab} \exp\{\mu_k + \sigma^2_\beta/2\}) is the residual for the pair (a, b) conditional on the latent community assignment of both nodes a and b being set to k. Further, notice that the updates for the global parameters of nodes a and b, and the updates for \mu, depend only on the diagonal entries of the indicator variational matrix \phi_{ab}. We can similarly obtain stochastic gradients for the variational variances \sigma_\beta and \sigma_\theta; however, in our experiments we found that fixing them already gives good results. (See §4.)\n\nThe global step for the global parameters follows the noisy gradient with an appropriate step-size:\n\n\gamma_a \leftarrow \gamma_a + \rho_a(t) \partial\gamma^t_a;  \lambda_a \leftarrow \lambda_a + \rho_a(t) \partial\lambda^t_a;  \mu \leftarrow \mu + \rho'(t) \partial\mu^t.   (10)\n\nWe maintain separate learning rates \rho_a for each node a, and only update the \gamma and \lambda for the nodes in the mini-batch in each iteration. There is a global learning rate \rho' for the community strength parameters \mu, which are updated in every iteration. For each of these learning rates \rho, we require that \sum_t \rho(t)^2 < \infty and \sum_t \rho(t) = \infty for convergence to a local optimum [26]. We set \rho(t) \triangleq (\tau_0 + t)^{-\kappa}, where \kappa \in (0.5, 1] is the learning rate and \tau_0 \geq 0 downweights early iterations.\n\n3.4 The local step\n\nWe now derive the updates for the local parameters. The local step optimizes the per-interaction memberships \phi with respect to a subsample of the network. 
There is a per-interaction variational parameter of dimension K \times K for each node pair---\phi_{ab}---representing the posterior approximation of which pair of communities is active in determining the link or non-link. The coordinate ascent update for \phi_{ab} is\n\n\phi^{kk}_{ab} \propto \exp\{E_q[\log \pi_{a,k}] + E_q[\log \pi_{b,k}] + y_{ab} \mu_k - r_{ab}(\exp\{\mu_k + \sigma^2_\beta/2\} - 1)\},   (11)\n\phi^{ij}_{ab} \propto \exp\{E_q[\log \pi_{a,i}] + E_q[\log \pi_{b,j}]\}, i \neq j,   (12)\n\nwhere r_{ab} is defined in Eq. 8. We present the full stochastic variational inference procedure in Algorithm 1.\n\n3.5 Initialization and convergence\n\nWe initialize the community memberships \gamma using approximate posterior memberships from the variational inference algorithm for the MMSB [9]. We initialize the popularities \lambda to the logarithm of the normalized node degrees added to a small random offset, and initialize the strengths \mu to zero. We measure convergence by computing the link prediction accuracy on a validation set with 1% of the network's links, and an equal number of non-links. The algorithm stops either when the change in log-likelihood on this validation set is less than 0.0001%, or if the log-likelihood decreases for consecutive iterations.\n\nFigure 2: Network data sets. 
N is the number of nodes, d is the percent of node pairs that are links and P is the mean perplexity over the links and non-links in the held-out test set.\n\nDATA SET         TYPE       SOURCE  N      d(%)    P_AMP          P_MMSB\nUS AIR           TRANSPORT  [25]    712    1.7%    2.75 ± 0.04    3.41 ± 0.15\nPOLITICAL BLOGS  HYPERLINK  [1]     1224   1.9%    2.97 ± 0.03    3.12 ± 0.01\nNETSCIENCE       COLLAB.    [22]    1450   0.2%    2.73 ± 0.11    3.02 ± 0.19\nRELATIVITY       COLLAB.    [18]    4158   0.1%    3.69 ± 0.18    6.53 ± 0.37\nHEP-TH           COLLAB.    [18]    8638   0.05%   12.35 ± 0.17   23.06 ± 0.87\nHEP-PH           COLLAB.    [18]    11204  0.16%   2.75 ± 0.06    3.31 ± 0.15\nASTRO-PH         COLLAB.    [18]    17903  0.11%   5.04 ± 0.02    5.28 ± 0.07\nCOND-MAT         COLLAB.    [22]    36458  0.02%   10.82 ± 0.09   13.52 ± 0.21\nBRIGHTKITE       SOCIAL     [18]    56739  0.01%   10.98 ± 0.39   41.11 ± 0.89\n\nFigure 3: The AMP model outperforms the MMSB model of [9] in predictive accuracy on real networks. Both models were fit using stochastic variational inference [11]. For the data sets shown, the number of communities K was set to 100 and hyperparameters were set to the same values across data sets. The perplexity results are based on five replications. A single replication is shown for the mean precision and mean recall.\n\n4 Empirical study\n\nWe use the predictive approach to evaluating model fitness [8], comparing the predictive accuracy of AMP (Algorithm 1) to the stochastic variational inference algorithm for the MMSB with link sampling [9]. In all data sets, we found that AMP gave better fits to real-world networks. Our networks range in size from 712 nodes to 56,739 nodes. Some networks are sparse, having as little as 0.01% of all pairs as links, while others have up to 2% of all pairs as links. Our data sets contain four types of networks: hyperlink, transportation, collaboration and social networks. We implemented Algorithm 1 in 4,800 lines of C++ code. 
Metrics. We used perplexity, mean precision and mean recall in our experiments to evaluate the predictive accuracy of the algorithms. We computed the link prediction accuracy using a test set of node pairs that are not observed during training. The test set consists of 10% of randomly selected links and non-links from each data set. During training, these test set observations are treated as zeros. We approximate the predictive distribution of a held-out node pair y_{ab} under the AMP using posterior estimates \hat\theta, \hat\beta and \hat\pi as\n\np(y_{ab}|y) \approx \sum_{z_{a\to b}} \sum_{z_{a\leftarrow b}} p(y_{ab}|z_{a\to b}, z_{a\leftarrow b}, \hat\theta, \hat\beta) p(z_{a\to b}|\hat\pi_a) p(z_{a\leftarrow b}|\hat\pi_b).   (13)\n\n3 Our software is available at https://github.com/premgopalan/sviamp.\n\n[Figure 3 plots mean precision and mean recall against the number of recommendations (10, 50, 100) for the AMP and the MMSB on the relativity, astro, hepph, hepth, cond-mat and brightkite networks.]\n\nFigure 4: The AMP predicts significantly better than the MMSB [9] on 12 LFR benchmark networks [17]. Each plot shows 4 networks with increasing right-skewness in degree distribution. \mu is the fraction of noisy links between dissimilar nodes---nodes that share no communities. The precision is computed at 50 recommendations for each node, and is averaged over all nodes in the network.\n\nPerplexity is the exponential of the average predictive log likelihood of the held-out node pairs. For mean precision and recall, we generate the top m pairs for each node ranked by the probability of a link between them. 
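The predictive distribution in Eq. 13 and the perplexity metric can be sketched as follows. The point estimates here are made-up numbers for a single hypothetical pair; the variable names are illustrative, not taken from the authors' code.\n\n```python\nimport math\n\nK = 3\n# hypothetical point estimates for a held-out pair (a, b)\npi_a = [0.7, 0.2, 0.1]\npi_b = [0.6, 0.3, 0.1]\ntheta_a, theta_b = -0.4, 0.2\nbeta = [1.5, 0.8, 2.0]\n\ndef p_link(i, j):\n    # logit^{-1} of Eq. 1: beta contributes only when both indicators agree\n    x = theta_a + theta_b + (beta[i] if i == j else 0.0)\n    return 1.0 / (1.0 + math.exp(-x))\n\n# Eq. 13: marginalize the pair's community indicators under pi_a and pi_b\np1 = sum(pi_a[i] * pi_b[j] * p_link(i, j) for i in range(K) for j in range(K))\n\n# Perplexity over a tiny held-out set (hypothetical: this pair observed once\n# as a link and once as a non-link); exp of the negative average log\n# likelihood, so lower is better\nheld_out = [(p1, 1), (p1, 0)]\navg_ll = sum(math.log(p if y == 1 else 1.0 - p) for p, y in held_out) / len(held_out)\nperplexity = math.exp(-avg_ll)\n```\n\n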
The ranked list of pairs for each node includes nodes in the test set, as well as nodes in the training set that were non-links. We compute precision-at-m, which measures the fraction of the top m recommendations present in the test set; and we compute recall-at-m, which captures the fraction of nodes in the test set present in the top m recommendations. We vary m from 10 to 100. We then obtain the mean precision and recall across all nodes.^4\n\nHyperparameters and constants. For the stochastic AMP algorithm, we set the \u201cmini-batch\u201d size |S| = N/100, where N is the number of nodes in the network, and we set the non-link sample size m_0 = 100. We set the number of communities K = 2 for the political blog network and K = 20 for the US air network; for all other networks, K was set to 100. We set the hyperparameters \sigma^2_0 = 1.0, \sigma^2_1 = 10.0 and \mu_0 = 0, fixed the variational variances at \sigma_\theta = 0.1 and \sigma_\beta = 0.5, and set the learning parameters \tau_0 = 65536 and \kappa = 0.5. We set the Dirichlet hyperparameter \alpha = 1/K for the AMP and the MMSB.\n\nResults on real networks. Figure 2 compares the AMP and the MMSB stochastic algorithms on a number of real data sets. The AMP definitively outperforms the MMSB in predictive performance. All hyperparameter settings were held fixed across data sets. The first four networks are small in size, and were fit using the AMP model with a single community strength parameter. All other networks were fit with the AMP model with K community strength parameters. As N increases, the gap between the mean precision and mean recall performance of these algorithms appears to increase. Without node popularities, the MMSB depends entirely on node memberships and community strengths to predict links. Since K is held fixed, communities are likely to have more nodes as N increases, making it increasingly difficult for the MMSB to predict links. 
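The precision-at-m and recall-at-m computation described above can be sketched for a single node; the scores and held-out links here are toy, hypothetical data.\n\n```python\n# Toy sketch of precision-at-m and recall-at-m for one node.\n# 'scores' ranks candidate partners by predicted link probability;\n# 'test_links' are this node's held-out links.\nscores = {"b": 0.9, "c": 0.8, "d": 0.6, "e": 0.4, "f": 0.2}\ntest_links = {"b", "d", "f"}\n\ndef precision_recall_at_m(scores, test_links, m):\n    top_m = sorted(scores, key=scores.get, reverse=True)[:m]\n    hits = sum(1 for node in top_m if node in test_links)\n    return hits / m, hits / len(test_links)\n\np2, r2 = precision_recall_at_m(scores, test_links, 2)   # top 2: b, c -> 1 hit\np4, r4 = precision_recall_at_m(scores, test_links, 4)   # top 4: b, c, d, e -> 2 hits\n```\n\nAveraging these quantities over all nodes gives the mean precision and mean recall curves reported in the figures.\n\n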
For the small US air,\npolitical blogs and netscience data sets, we obtained similar performance for the replication shown\nin Figure 2. For the AMP the mean precision at 10 for US Air, political blogs and netscience were\n0.087, 0.07, 0.092, respectively; for the MMSB the corresponding values were 0.007, 0.0, 0.063,\nrespectively.\n\nResults on synthetic networks. We generated 12 LFR benchmark networks [17], each with 1000\nnodes. Roughly 50% of the nodes were assigned to 4 overlapping communities, and the other 50%\nwere assigned to single communities. We set a community size range of [200, 500] and a mean node\ndegree of 10 with power-law exponent set to 2.0. Figure 4 shows that the MMSB performs poorly as\nthe skewness is increased, while the AMP performs signi\ufb01cantly better in the presence of both noisy\nlinks and right-skewness, both characteristics of real networks. The skewness in degree distributions\ncauses the community strength parameters of MMSB to overestimate or underestimate the linking\npatterns within communities. The per-node popularities in the AMP can capture the heterogeneity\nin node degrees, while learning the corrected community strengths.\n\nAcknowledgments\n\nDavid M. Blei\nis supported by ONR N00014-11-1-0651, NSF CAREER IIS-0745520, and\nthe Alfred P. Sloan foundation. Chong Wang is supported by NSF DBI-0546594 and NIH\n1R01GM093156.\n\n4Precision and recall are better metrics than ROC AUC on highly skewed data sets [5].\n\n8\n\nRatio of max. degree to avg. degreeMean precision0.0200.0250.0300.035mu 024680.0180.0200.0220.0240.0260.028mu 0.224680.0160.0180.0200.0220.0240.0260.028mu 0.42468ampmmsb\fReferences\n[1] L. A. Adamic and N. Glance. The political blogosphere and the 2004 U.S. election: divided they blog. In\n\nLinkKDD, LinkKDD \u201905, page 3643, New York, NY, USA, 2005. ACM.\n\n[2] E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. J.\n\nMach. Learn. 
Res., 9:1981–2014, June 2008.

[3] S. Amari. Differential geometry of curved exponential families: curvatures and information loss. The Annals of Statistics, 10(2):357–385, June 1982.

[4] B. Ball, B. Karrer, and M. E. J. Newman. Efficient and principled method for detecting communities in networks. Physical Review E, 84(3):036103, Sept. 2011.

[5] J. Davis and M. Goadrich. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 233–240, New York, NY, USA, 2006. ACM.

[6] S. Fortunato. Community detection in graphs. Physics Reports, 486(3–5):75–174, Feb. 2010.

[7] T. M. J. Fruchterman and E. M. Reingold. Graph drawing by force-directed placement. Softw. Pract. Exper., 21(11):1129–1164, Nov. 1991.

[8] S. Geisser and W. Eddy. A predictive approach to model selection. Journal of the American Statistical Association, 74:153–160, 1979.

[9] P. K. Gopalan and D. M. Blei. Efficient discovery of overlapping communities in massive networks. Proceedings of the National Academy of Sciences, 110(36):14534–14539, 2013.

[10] P. Hoff, A. Raftery, and M. Handcock. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460):1090–1098, 2002.

[11] M. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347, 2013.

[12] H. Jeong, Z. Néda, and A.-L. Barabási. Measuring preferential attachment in evolving networks. EPL (Europhysics Letters), 61(4):567, 2003.

[13] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Mach. Learn., 37(2):183–233, Nov. 1999.

[14] B. Karrer and M. E. J. Newman. Stochastic blockmodels and community structure in networks. Phys. Rev. E, 83:016107, Jan. 2011.

[15] D. I. Kim, P. Gopalan, D. M. Blei, and E. B. Sudderth. Efficient online inference for Bayesian nonparametric relational models. In Neural Information Processing Systems, 2013.

[16] P. N. Krivitsky, M. S. Handcock, A. E. Raftery, and P. D. Hoff. Representing degree distributions, clustering, and homophily in social networks with latent cluster random effects models. Social Networks, 31(3):204–213, July 2009.

[17] A. Lancichinetti and S. Fortunato. Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities. Physical Review E, 80(1):016118, July 2009.

[18] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters. In Internet Mathematics, 2008.

[19] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In Proceedings of the Twelfth International Conference on Information and Knowledge Management, CIKM '03, pages 556–559, New York, NY, USA, 2003. ACM.

[20] P. McCullagh and J. A. Nelder. Generalized Linear Models, Second Edition. Chapman and Hall/CRC, Aug. 1989.

[21] M. E. J. Newman. Assortative mixing in networks. Physical Review Letters, 89(20):208701, Oct. 2002.

[22] M. E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.

[23] K. Nowicki and T. A. B. Snijders. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96(455):1077–1087, Sept. 2001.

[24] F. Papadopoulos, M. Kitsak, M. Á. Serrano, M. Boguñá, and D. Krioukov. Popularity versus similarity in growing networks. Nature, 489(7417):537–540, Sept. 2012.

[25] RITA. U.S. Air Carrier Traffic Statistics. Bureau of Transportation Statistics, 2010.

[26] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, Sept. 1951.

[27] Y. J. Wang and G. Y. Wong. Stochastic blockmodels for directed graphs. Journal of the American Statistical Association, 82(397):8–19, 1987.